Hello,
I have two files: file A has 6000 records and file B has 1.1 million records. File A contains a search code in field 1 that matches a code in field 1 of file B. I need to output the matching lines from file B. Using an awk command, it takes 1 minute to search file B. Can anyone recommend a more efficient method for parsing through this large file? The script I am using is:
awk 'BEGIN {
    while ((getline a < "fileA") > 0) {
        while ((getline b < "fileB") > 0) {   # ACTION #
            split(a,A," ")
            split(b,B," ")
            if (A[1] == B[1]) {
                print b
            }
        }
        close("fileB")
    }
    close("fileA")
}'

You seem to be opening and closing fileB 6000 times. You are also doing this 6.6 billion times: split(a,A," "). You can put that statement outside of the inner loop.
However, that won't make much of a difference. Instead, read the whole of fileA first, and stick the contents of field 1 of each record into an associative array:
while ((getline a < "fileA") > 0) {
    split(a,A," ")
    field1ofA[A[1]] = 1
}
Then you only need to read through fileB once instead of 6000 times:
while ((getline b < "fileB") > 0) {
    split(b,B," ")
    if (B[1] in field1ofA) {
        print b
    }
}
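
As a side note, the same lookup-table idea is often written by letting awk read both files from the command line rather than using getline. What follows is only a minimal sketch, assuming a POSIX-style shell and whitespace-separated fields. The sample data, the temporary directory, and the array name `seen` are made up for illustration and are not from the original script:

```shell
# Hypothetical sample data standing in for fileA and fileB.
tmpdir=$(mktemp -d)
printf 'K1 x\nK3 y\n' > "$tmpdir/fileA"
printf 'K1 alpha\nK2 beta\nK3 gamma\nK1 delta\n' > "$tmpdir/fileB"

# While reading the first file (NR == FNR), remember field 1 as an array key.
# For the second file, print every line whose field 1 is a remembered key.
matches=$(awk 'NR == FNR { seen[$1]; next } $1 in seen' "$tmpdir/fileA" "$tmpdir/fileB")
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

NR counts records across all input files while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. The Windows command prompt does not treat single quotes the way a Unix shell does, so there it is probably easier to keep the awk program in a file and run it with awk -f.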

As you probably guessed, I'm fairly new to scripting with awk, but I suspected it had something to do with looping through file B! Thanks very much for the advice; that works perfectly.

Hi Klint,
I'm just wondering if there is a way of running this on Windows XP? I have a Windows-based awk program, and with that I run the command awk -f awk_prog.txt. The txt file contains the awk script. However, when I run this command I get an "out of memory" error. Is there another programming method I could use on Windows to achieve the same result?
Thanks,
Eddie.

That's strange. I made up a sample file containing 6000 lines, with the first word of each line being about 88-89 characters long, and each one unique. I tested the first loop:
while ((getline a < "fileA") > 0) {
    split(a,A," ")
    field1ofA[A[1]] = 1
}
It ran ok. No memory errors. I didn't bother testing the second loop as that one doesn't use much memory.
So either you are reading a fileA that has many more than 6000 lines, or the first field of each line is very long, or you are using a port of Awk that can only use a very limited amount of memory.
Can you confirm the size of file A, and post the contents of a typical line from it? Also, can you make sure you are using a Windows version of Awk, and not a DOS version?
My version of Awk was from either the GNUWin32 or the MinGW port, I can't remember which.
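
If it helps, both the line count and the length of the longest first field can be checked with a one-liner. This is only a sketch, assuming a POSIX-style shell; the sample file below is made up, and on the real data you would point awk at your own fileA:

```shell
# Hypothetical two-line sample standing in for fileA.
tmpdir=$(mktemp -d)
printf 'K1 x\nLONGKEY3 y\n' > "$tmpdir/fileA"

# Track the longest first field; in END, NR holds the total number of lines read.
stats=$(awk '{ if (length($1) > max) max = length($1) } END { print NR, max + 0 }' "$tmpdir/fileA")
echo "$stats"

rm -rf "$tmpdir"
```

The first number printed is the line count and the second is the length of the longest first field, which together give a rough idea of how much the associative array has to hold.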

Thanks for the reply. It could be the version of awk I'm using. A colleague passed it on to me, so I'm not sure what version it is. I'll download the GNU version; hopefully that will sort it out.

Klint,
Just following up: I downloaded GNU awk and the script now runs fine on Windows. I'm not sure what the other version was.
Thanks again for your help,
Eddie.
