unix perl: Merging strings, txt files

January 11, 2011 at 20:50:28
Specs: Win XP, Intel xeon
Hi... I'm working on a Perl script for "text data formatting", but I'm stuck on merging files.

Let's say I have 3 txt files that I want to merge (not append). Each file is exactly 5000 lines long, and the elements in each line are separated by a tab (\t).
The resulting merged file should also contain 5000 lines, something like this:

--------------------------------
file1.txt contains:
A B C D E F G
aa bb cc dd ee ff gg
--------------------------------
file2.txt contains:
1 2 3 4 5 6 7
me you he she we you they
--------------------------------
file3.txt contains:
apple pear orange lemon grape lemon
dog cat mouse bird
======================

output.txt should look like:
A B C 3 orange
aa bb cc he mouse

======================
so each line should be composed of:
1) the "first 3" elements of file1 and
2) the "3rd" element of the corresponding line in each of files 2 and 3

I thought one way to do this would be to build an array for each file (first filtering the elements I need) and then join the arrays together with a foreach loop.

The easy part, of course, is the chomp, splitting on tabs, and picking out the first 3 elements of file1 and the 3rd element of each line in the other files.
It goes like this:


#!/usr/bin/perl

open (IN1, "<file1.txt") or die $!;
open (OUT, ">output.txt") or die $!;

my @array;

while (<IN1>) {
  chomp;
  (my $first, my $second, my $third) = split("\t");
  @array = ($first, $second, $third);

  # to print them out for debugging or as the first file should look:
  # print OUT join("\t", @array), "\n";
}
close IN1;
close OUT;

To get the third element of each line in the other files, we can use the same code and keep only "$third" from above.

The problem comes when I try to merge them all. I have tried all sorts of arrays, but I am running out of options. I tried something like this, but it doesn't work either:

#!/usr/bin/perl

use strict;

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';
my $i;

open F1, $file1 or die $!;
chomp(my @file1 = <F1>);
close F1;

open F2, $file2 or die $!;
my @file2 = <F2>;
close F2;

open (F3, ">output.txt") or die $!;
for $i (0..$#file1){
print F3 $file1[$i];
print $file2[$i];
}

print F3 $file2[$_] for ($file2[$i]..$#file2);

close F1;
close F2;
close F3;




#1
January 11, 2011 at 23:49:15
I don't perl but if you can use a WinNT box:

=========================================
@echo off > newfile & setLocal enableDELAYedeXpansion

set N=
for /f "tokens=1-3" %%a in (file1.txt) do (
set /a N+=1
set X!N!=%%a %%b %%c
)

set N=
for /f "tokens=3" %%a in (file2.txt) do (
set /a N+=1
set Y!N!=%%a
)

set N=
for /f "tokens=3" %%a in (file3.txt) do (
set /a N+=1
set Z!N!=%%a
)

for /L %%i in (1 1 !N!) do (
>> newfile echo.!X%%i! !Y%%i! !Z%%i!
)


=====================================
Life is too important to be taken seriously.

M2



#2
January 12, 2011 at 02:29:13
Thanks for the reply. I am afraid it has to be in Perl since we are using unix. (In fact I run a bash shell in Windows, but it has to be done unix-style.)

From your example I see that you build an array for each file and then merge them one after another.
So it seems I still have to keep cracking on this thing, but I am very stuck.


#3
January 12, 2011 at 07:52:17
Don't load the entire data into 3 arrays. Loop over all 3 line-by-line at the same time.

Use the 3-arg form of open with a lexical variable for the filehandle, and include the filename in the die message.

The first arg to split is a pattern, i.e., a regex, not a string (with 1 exception).

Use an array slice when splitting the line to extract only the desired fields.

I don't have time right now to work up any actual code, but start with that and post back.
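
As a rough illustration of just the split and slice points (not a solution to the whole merge), here is a minimal sketch. The sample strings are made up for the example, and the variable names are placeholders rather than anything from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line made up for illustration; fields are tab-separated.
my $line = "me\tyou\the\tshe\twe\tyou\tthey";

# split takes a pattern, so /\t/ splits on each tab.
# The list slice [0 .. 2] keeps only the first three fields.
my @first_three = (split /\t/, $line)[0 .. 2];

# A single index picks out just the third field.
my $third = (split /\t/, $line)[2];

# The one exception: the string ' ' (a single space) is special-cased
# to split on any run of whitespace and to skip leading whitespace.
my @words = split ' ', "  dog cat\tmouse  bird";

print join("\t", @first_three), "\n";    # me, you, he (tab-separated)
print "$third\n";                        # he
print scalar(@words), "\n";              # 4
```

The slice avoids naming temporary variables for fields that are thrown away immediately.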

BTW, I'm nearly the only Perl programmer on this site, and due to the rarity of Perl questions here, I don't poke my head in as much as I used to.

There are a number of Perl-specific forums that I post on.



#4
January 13, 2011 at 09:08:12
Hi FM,

And I thought you weren't here much because you had a life.

;)


=====================================
Life is too important to be taken seriously.

M2



#5
January 13, 2011 at 10:41:46
Had a life? What's that?

Actually, I've been tied up on a major VoIP project at work and traveling to our stores to migrate them to the new VoIP system.



#6
July 24, 2011 at 19:58:39
I know I started this thread half a year ago and never came back to post the answer, so here it is. Better late than never.

#!/usr/bin/perl

use strict;
use warnings;
use FindBin qw/$Bin/;
use IO::File;

$|++;

my @files = @ARGV; 

@files = sort { compare_file($a, $b) } @files;
sub compare_file {
    my ($fileA, $fileB) = @_;
    $fileA =~ s/\D+//g; 
    $fileB =~ s/\D+//g;
    $fileA <=> $fileB;
}

foreach my $file (@files) {
    $fh{$file} = new IO::File "$Bin/$file", 'r';
}

#shift: shift removes the first element from the array
my @rest_files = @files;
my $first_file = shift @rest_files;

while (my $line = $fh{$first_file}->getline) {
    chomp($line); 
    my @ele = split(/\t/, $line);
    my @parts = splice(@ele, 0, 3);
    foreach my $file (@rest_files) {
        my $temp_line = $fh{$file}->getline;
        chomp($temp_line); my @temp_ele = split(/\t/, $temp_line);
        push @parts, $temp_ele[2];
    }
    print join("\t", @parts) . "\n";
}

foreach my $file (@files) {
    $fh{$file}->close();
}

1;



#7
July 27, 2011 at 07:58:30
Hello captaintacos,

That's a nice script and a world apart from your original.

There are a few very minor style adjustments that I'd recommend to have it come more in line with PBP (Perl Best Practices).

1) Your %fh hash wasn't declared. I'll assume that was in your production script, otherwise the script would not compile/run.

2) You should add a check to verify that @ARGV actually holds the expected filenames.

3) Move the subroutine definition to the end of the script.

4) It's better to not use the indirect object call.
Meaning, change this:

    $fh{$file} = new IO::File "$Bin/$file", 'r';

To this:
    $fh{$file} = IO::File->new("$Bin/$file", 'r');

5) Drop the splice statement and use an array slice on both of those split statements.

while (my $line = $fh{$first_file}->getline) {
    chomp($line); 
    my @parts = (split(/\t/, $line))[0..2];
    foreach my $file (@rest_files) {
        my $temp_line = $fh{$file}->getline;
        chomp($temp_line);
        push @parts, (split(/\t/, $temp_line))[2];
    }
    print join("\t", @parts) . "\n";
}
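
Putting the five points together, the whole script might look roughly like the sketch below. This is one possible arrangement, not the poster's production code. It makes some assumptions of its own: merge_files is a made-up name for a wrapper sub, files are opened by the names given on the command line rather than relative to $Bin, every line is assumed to have at least three tab-separated fields, and a file that runs out early contributes an empty field.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::File;

# Point 2: check @ARGV before doing any work. This sketch warns rather
# than dies so the subs below can still be loaded without arguments.
if (@ARGV < 2) {
    warn "Usage: $0 file1.txt file2.txt [file3.txt ...]\n";
}
else {
    merge_files(\*STDOUT, sort { compare_file($a, $b) } @ARGV);
}

# Point 3: subroutine definitions live at the end of the script.

# merge_files (hypothetical helper name): read all files in lockstep and
# print one merged, tab-separated line per input line to $out.
sub merge_files {
    my ($out, @files) = @_;

    # Point 1: %fh is declared. Point 4: direct method call on the class.
    my %fh;
    foreach my $file (@files) {
        $fh{$file} = IO::File->new($file, 'r')
            or die "Cannot open '$file': $!\n";
    }

    my ($first_file, @rest_files) = @files;

    while (defined(my $line = $fh{$first_file}->getline)) {
        chomp $line;
        # Point 5: list slices instead of splice.
        my @parts = (split /\t/, $line)[0 .. 2];
        foreach my $file (@rest_files) {
            my $temp_line = $fh{$file}->getline;
            $temp_line = '' unless defined $temp_line;    # file ran out early
            chomp $temp_line;
            my ($third) = (split /\t/, $temp_line)[2];
            push @parts, defined $third ? $third : '';
        }
        print {$out} join("\t", @parts), "\n";
    }

    $_->close for values %fh;
    return;
}

# Order file names by the digits they contain (file2 before file10).
sub compare_file {
    my ($fileA, $fileB) = @_;
    s/\D+//g for $fileA, $fileB;    # keep only the digits
    return ($fileA || 0) <=> ($fileB || 0);
}
```

Run it as, for example, `perl merge.pl file1.txt file2.txt file3.txt > output.txt` (the script name is arbitrary). Passing the output handle into the sub keeps the line-by-line approach while making it easy to send the result somewhere other than STDOUT.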

