Tom's Guide | Tom's Hardware | Tom's Games
![]() |
![]() |
![]() |
Here's an interesting scenario. I need to extract rows from a file, File1 that matched certain columns in File2.
File1 has data separated by comma. File2 has data separated by comma. Let's say column 5 and column 7 from File1 has to be compared to column 2 and column 4 in File1. If the combination exist then, I would like the entire row into File3.
For matches with only one column I currently use cut command into a temp file and fgrep the pattern.Here it goes:
cut -f 3 -d, File1.dat > tempFile.dat
fgrep -f tempFile.dat File2.datThe problem is when I have multiple fields to be matched at different positions.
Any help is greatly appreciated.Shyam

In your example, I assumed that columns 5 and 7 are in File1, and columns 2 and 4 are in File2. The following code stores key info from File2 into an array for referencing when File1 is processed as the main file (and this should be the file that you want the entire row from).
On the concatenation, I stick a colon between the two columns. If I were to use the simple concatenation, then 25 and 68 from one file would match 2 and 568 from the other (whereas 25:68 will not match 2:568).
awk -F, 'BEGIN {
rc=getline < "File2"
while (rc==1)
{f2key=$2 ":" $4
f2keys[f2key]=1
rc=getline < "File2"}
}{col5and7=$5 ":" $7
if (col5and7 in f2keys)
}' File1 > File3

Shyam:
There's always more than one way of doing something. Here's my take on cross referencing two files:
#!/bin/ksh
rm -f file3
while read line; do
f1=$(echo $line | cut -d',' -f1);
f2=$(echo $line | cut -d',' -f2);# two step process required to execute dynamic commands
command=$(echo nawk -F \',\' \'\$1~/$f1/ \&\& \$2~/$f2/ {print 1} \' file2);
check=$(eval $command);if [ ${check:-xxx} != 1 ]; then
: # Failed. - No cross-reference"
else
# "Success"
echo "$line" >> file3
fi
done file1
# need a less than sign between done and file1. This forum
# wipes them out
# end scriptWhat I do is read in file 1 a line at a time, cut out the columns 1 want, and then I build an [n]awk command checking if the the columns are matched in file2.
In my example, I use columns 1 and 2. If I interpret what you want correctly,
file 1, columns 5 & 7 would be:
f1=$(echo $line | cut -d',' -f5);
f2=$(echo $line | cut -d',' -f7);and file 2, columns 2 and 4 would be:
command=$(echo nawk -F \',\' \'\$2~/$f1/ \&\& \$4~/$f2/ {print 1} \' file2);
The above line checks if field 2 and field 4 of file 2 matches the input from file 1, fields 5 & 7.
It would be an interesting technical exercise to determine whether my solution of James' is faster. I suspect, for small files, that James' is, and for very large files, mine is.
If closing, if file 1 has fixed number of fields, I'd consider doing this to eliminate the cut commands:
while IFS="," read f1 f2 f3 f4 f5 f6 f7
do
.
.Regards,
Nails

I thought I'd throw in a 3rd method; a Perl script. This will match the columns you want as long as they also fall on the same row number. If you wanted to match the columns but didn't care if they were on different rows, then we'd need to make a minor modification.
And as Nails said, "There's always more than one way of doing something."
#!/usr/bin/perluse strict;
open F1, 'file1.dat' or die "could not open file1 $!";
open F2, 'file2.dat' or die "could not open file2 $!";
open F3, ">>", 'file3.dat' or die "could not open file3 $!";my @file2 = ();
close F2;my $row = 0;
while () {
my @f1col = split ",", $_;
my @f2col = split ",", $file2[$row];
print F3 $_ if ($f1col[4] == $f2col[1] && $f1col[6] == $f2col[3]);
$row++;
}
close F1;
close F3;

Oops, I forgot that this forum doesn't like certain code fragments.
my @file2 = ()
while ()should be:
my @file2 = ( [F2] )
while ( [F1] )but replace the [ ] with the angle brackets.

It's easy to do a timed comparison - just run the scripts preceded by the word "time". I timed the two scripts, then posted the timings side by side.
*****************************
file1 = 7 lines
file2 = 11 lines
Boothe nails slower-by
real 0m0.05s 0m0.52s 10x
user 0m0.01s 0m0.16s 16x
sys 0m0.02s 0m0.34s 17x
*****************************
file1 = 7
file2 = 500
Boothe nails slower-by
real 0m0.16s 1m7.99s 425x
user 0m0.04s 0m20.02s 500x
sys 0m0.03s 0m42.37s 1,412x
*****************************
file1 = 1000 lines
file2 = 10 lines
Boothe nails slower-by
real 0m0.09s 1m8.24s 758x
user 0m0.05s 0m19.89s 398x
sys 0m0.02s 0m42.46s 2,123x
*****************************The first test with very small files shows the nails script to be far slower, by factors of 10 to 17 times. Main reason is because, for each file1 line processed, that script spawns several processes (nawk, cut etc).
The second test, with a small file1 and a larger file2, shows slower exponentially. The reason for this huge jump is because, for each line of file1, the nails script processes the entire file2 (which in this test was increased to 500 lines).
But when file1 is large instead of file2, it is also very bad, but this time because it has to spawn 2000 cut processes, 1000 nawk processes etc.
Compare this with my script ...
It spawns a single awk process, which reads the contents of file1 only once and the contents of file2 only once, and opens file3 (out) only once. The main file to be processed can be unlimited in size, but the array size is limited, so if file size might be an issue, I store the smaller set of data in the awk array.
For the nails script, each time that multiple lines in file2 match, the nails script has an error due to nawk printing multiple lines of output. When the script gets a hit on file2, it needs to print 1 and immediately exit. Even if no multiple matches are expected, this prevents nawk from needlessly scanning the rest of file2 for this one while loop.

James,
It would be interesting to compare the time between your awk script and my perl script. I can't test it on the system I'm currently on but, when I get home, I might run the test on my solaris system.

Your perl script will be extremely fast. But, as you state, it is a same row matchup, so it has not been burdened with having to do a matchup where the files are different sizes.
I do not use perl yet, but I'm sure that the common perl solution would be the same as my awk solution, which is to store key info from one file into an array for referencing while the other file is processed.

I truly appreciate all the responses. I need a different row match, in the sense, the record count in file1 and file2 is different. Infact file2 would be very small compared to file1.
Boothe, for the awk script, should I have any considerations for the array if file2 is huge?
Thx
Shyam

Yes, at some point, you would overflow the available memory that awk can allocate for the array. But we are storing only 2 columns from each file2 line. You will be able to store many thousands of entries.
If file2 is very large and file1 is smallish, you can store file1 in an array, then process a file2 of unlimited size. But in this case, we would have to store the entire file1 lines, not just the match columns, because on a match, you need to output the entire file1 line.
If both files are so huge that this array match approach cannot be used, then you need to sort both files in key-column order, then read both files doing a merge match-up (I use awk for this).

![]() |
![]() |
![]() |

This post is quite old and has been locked from receiving new replies. Please create a new posting instead.
| Ads by Google |