|
| Computing.Net: Over 1,000,000 posts about all things technology related! Over 90% answered within 24 hours! Click here to sign up now, it's free! |
UNIX cut & grep specific columns
|
Original Message
|
Name: Shyam
Date: June 23, 2003 at 10:24:59 Pacific
Subject: UNIX cut & grep specific columns OS: Sun Solaris 9.5 CPU/Ram: NA
|
Comment: Here's an interesting scenario. I need to extract rows from a file, File1 that matched certain columns in File2. File1 has data separated by comma. File2 has data separated by comma. Let's say column 5 and column 7 from File1 has to be compared to column 2 and column 4 in File1. If the combination exist then, I would like the entire row into File3. For matches with only one column I currently use cut command into a temp file and fgrep the pattern. Here it goes: cut -f 3 -d, File1.dat > tempFile.dat fgrep -f tempFile.dat File2.dat The problem is when I have multiple fields to be matched at different positions. Any help is greatly appreciated. Shyam
Report Offensive Message For Removal
|
|
Response Number 1
|
|
Reply: (edit)In your example, I assumed that columns 5 and 7 are in File1, and columns 2 and 4 are in File2. The following code stores key info from File2 into an array for referencing when File1 is processed as the main file (and this should be the file that you want the entire row from). On the concatenation, I stick a colon between the two columns. If I were to use the simple concatenation, then 25 and 68 from one file would match 2 and 568 from the other (whereas 25:68 will not match 2:568). awk -F, 'BEGIN { rc=getline < "File2" while (rc==1) {f2key=$2 ":" $4 f2keys[f2key]=1 rc=getline < "File2"} }{col5and7=$5 ":" $7 if (col5and7 in f2keys) print }' File1 > File3
Report Offensive Follow Up For Removal
|
|
Response Number 2
|
Name: nails
Date: June 23, 2003 at 17:00:26 Pacific
|
Reply: (edit)Shyam: There's always more than one way of doing something. Here's my take on cross referencing two files: #!/bin/ksh rm -f file3 while read line; do f1=$(echo $line | cut -d',' -f1); f2=$(echo $line | cut -d',' -f2); # two step process required to execute dynamic commands command=$(echo nawk -F \',\' \'\$1~/$f1/ \&\& \$2~/$f2/ {print 1} \' file2); check=$(eval $command); if [ ${check:-xxx} != 1 ]; then : # Failed. - No cross-reference" else # "Success" echo "$line" >> file3 fi done file1 # need a less than sign between done and file1. This forum # wipes them out # end script What I do is read in file 1 a line at a time, cut out the columns 1 want, and then I build an [n]awk command checking if the the columns are matched in file2. In my example, I use columns 1 and 2. If I interpret what you want correctly, file 1, columns 5 & 7 would be: f1=$(echo $line | cut -d',' -f5); f2=$(echo $line | cut -d',' -f7); and file 2, columns 2 and 4 would be: command=$(echo nawk -F \',\' \'\$2~/$f1/ \&\& \$4~/$f2/ {print 1} \' file2); The above line checks if field 2 and field 4 of file 2 matches the input from file 1, fields 5 & 7. It would be an interesting technical exercise to determine whether my solution of James' is faster. I suspect, for small files, that James' is, and for very large files, mine is. If closing, if file 1 has fixed number of fields, I'd consider doing this to eliminate the cut commands: while IFS="," read f1 f2 f3 f4 f5 f6 f7 do . . Regards, Nails
Report Offensive Follow Up For Removal
|
|
Response Number 3
|
Name: FishMonger
Date: June 23, 2003 at 22:58:14 Pacific
|
Reply: (edit)I thought I'd throw in a 3rd method; a Perl script. This will match the columns you want as long as they also fall on the same row number. If you wanted to match the columns but didn't care if they were on different rows, then we'd need to make a minor modification. And as Nails said, "There's always more than one way of doing something." #!/usr/bin/perl
use strict; open F1, 'file1.dat' or die "could not open file1 $!"; open F2, 'file2.dat' or die "could not open file2 $!"; open F3, ">>", 'file3.dat' or die "could not open file3 $!"; my @file2 = (); close F2; my $row = 0; while () { my @f1col = split ",", $_; my @f2col = split ",", $file2[$row]; print F3 $_ if ($f1col[4] == $f2col[1] && $f1col[6] == $f2col[3]); $row++; } close F1; close F3;
Report Offensive Follow Up For Removal
|
|
Response Number 4
|
Name: FishMonger
Date: June 23, 2003 at 23:06:50 Pacific
|
Reply: (edit)Oops, I forgot that this forum doesn't like certain code fragments. my @file2 = () while () should be: my @file2 = ( [F2] ) while ( [F1] ) but replace the [ ] with the angle brackets.
Report Offensive Follow Up For Removal
|
|
Response Number 5
|
|
Reply: (edit)It's easy to do a timed comparison - just run the scripts preceded by the word "time". I timed the two scripts, then posted the timings side by side. ***************************** file1 = 7 lines file2 = 11 lines Boothe nails slower-by real 0m0.05s 0m0.52s 10x user 0m0.01s 0m0.16s 16x sys 0m0.02s 0m0.34s 17x ***************************** file1 = 7 file2 = 500 Boothe nails slower-by real 0m0.16s 1m7.99s 425x user 0m0.04s 0m20.02s 500x sys 0m0.03s 0m42.37s 1,412x ***************************** file1 = 1000 lines file2 = 10 lines Boothe nails slower-by real 0m0.09s 1m8.24s 758x user 0m0.05s 0m19.89s 398x sys 0m0.02s 0m42.46s 2,123x *****************************The first test with very small files shows the nails script to be far slower, by factors of 10 to 17 times. Main reason is because, for each file1 line processed, that script spawns several processes (nawk, cut etc).The second test, with a small file1 and a larger file2, shows slower exponentially. The reason for this huge jump is because, for each line of file1, the nails script processes the entire file2 (which in this test was increased to 500 lines). But when file1 is large instead of file2, it is also very bad, but this time because it has to spawn 2000 cut processes, 1000 nawk processes etc. Compare this with my script ... It spawns a single awk process, which reads the contents of file1 only once and the contents of file2 only once, and opens file3 (out) only once. The main file to be processed can be unlimited in size, but the array size is limited, so if file size might be an issue, I store the smaller set of data in the awk array. For the nails script, each time that multiple lines in file2 match, the nails script has an error due to nawk printing multiple lines of output. When the script gets a hit on file2, it needs to print 1 and immediately exit. Even if no multiple matches are expected, this prevents nawk from needlessly scanning the rest of file2 for this one while loop.
Report Offensive Follow Up For Removal
|
|
Response Number 6
|
Name: FishMonger
Date: June 24, 2003 at 13:04:18 Pacific
|
Reply: (edit)James, It would be interesting to compare the time between your awk script and my perl script. I can't test it on the system I'm currently on but, when I get home, I might run the test on my solaris system.
Report Offensive Follow Up For Removal
|
|
Response Number 7
|
|
Reply: (edit)Your perl script will be extremely fast. But, as you state, it is a same row matchup, so it has not been burdened with having to do a matchup where the files are different sizes. I do not use perl yet, but I'm sure that the common perl solution would be the same as my awk solution, which is to store key info from one file into an array for referencing while the other file is processed.
Report Offensive Follow Up For Removal
|
|
Response Number 8
|
Name: Shyam
Date: June 25, 2003 at 09:09:54 Pacific
|
Reply: (edit)I truly appreciate all the responses. I need a different row match, in the sense, the record count in file1 and file2 is different. Infact file2 would be very small compared to file1. Boothe, for the awk script, should I have any considerations for the array if file2 is huge? Thx Shyam
Report Offensive Follow Up For Removal
|
|
Response Number 9
|
|
Reply: (edit)Yes, at some point, you would overflow the available memory that awk can allocate for the array. But we are storing only 2 columns from each file2 line. You will be able to store many thousands of entries. If file2 is very large and file1 is smallish, you can store file1 in an array, then process a file2 of unlimited size. But in this case, we would have to store the entire file1 lines, not just the match columns, because on a match, you need to output the entire file1 line. If both files are so huge that this array match approach cannot be used, then you need to sort both files in key-column order, then read both files doing a merge match-up (I use awk for this).
Report Offensive Follow Up For Removal
|

Post Locked
This post is quite old and has been locked from receiving new replies. Please create a new posting instead.
Go to Unix Forum Home
|
|
|