comparing large files

Name: eficbf
Date: November 20, 2008 at 04:34:14 Pacific
OS: HPUX 11.23
CPU/Ram: Itanium-64 16GB RAM
Product: RX3600
Comment:

Hello,

I have two files, file A has 6000 records and file B has 1.1 million records. File A contains a search code in field 1 that matches a code in field 1 in file B. I need to output the lines found in file B. Using an awk command it takes 1 minute to search file B. I'm just wondering if anyone can recommend a more efficient method for parsing through this large file? The script I am using is:

awk 'BEGIN {
    while ((getline a < "fileA") > 0) {
        while ((getline b < "fileB") > 0) {
            # ACTION #
            split(a, A, " ")
            split(b, B, " ")
            if (A[1] == B[1]) {
                print b
            }
        }
        close("fileB")
    }
    close("fileA")
}'



Response Number 1
Name: klint
Date: November 20, 2008 at 07:37:17 Pacific
Reply:

You seem to be opening and closing fileB 6000 times. You are also doing this 6.6 billion times: split(a,A," "). You can put that statement outside of the inner loop.

However, that won't make much of a difference. Instead, read the whole of fileA first, and stick the contents of field 1 of each record into an associative array:


while ((getline a < "fileA") > 0) {
    split(a, A, " ")
    field1ofA[A[1]] = 1
}

and then you only need to read through fileB just once, speeding up your script by a factor of 6000!

while ((getline b < "fileB") > 0) {
    split(b, B, " ")
    if (B[1] in field1ofA) {
        print b
    }
}
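
For reference, the two loops above can also be written as a single awk command that takes both files as arguments. This is only a minimal sketch: the sample data, the temporary directory and the array name wanted are made up for illustration, and it assumes fileA is not empty (with an empty first file, the NR == FNR test would wrongly treat fileB as the lookup file).

```shell
#!/bin/sh
# Hypothetical sample data; fileA and fileB follow the names used in the thread.
workdir=$(mktemp -d)
printf '%s\n' 'K2 first lookup' 'K4 second lookup' > "$workdir/fileA"
printf '%s\n' 'K1 alpha' 'K2 beta' 'K3 gamma' 'K4 delta' 'K2 epsilon' > "$workdir/fileB"

# While reading the first file (NR == FNR), remember field 1 as an array key.
# For the second file, print every line whose field 1 is one of those keys.
matches=$(awk 'NR == FNR { wanted[$1]; next } $1 in wanted' \
    "$workdir/fileA" "$workdir/fileB")

printf '%s\n' "$matches"
rm -r "$workdir"
```

With the sample data this prints the three fileB lines whose first field is K2 or K4, in the order they appear in fileB. As in the loops above, fileB is read only once.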


Response Number 2
Name: eficbf
Date: November 24, 2008 at 06:05:30 Pacific
Reply:

As you probably guessed, I'm fairly new to scripting with awk, but I felt it was something to do with looping through file B! Thanks very much for the advice; it works perfectly.


Response Number 4
Name: eficbf
Date: November 27, 2008 at 02:45:19 Pacific
Reply:

Hi Klint,

I'm just wondering if there is a way of running this on Windows XP? I have a Windows-based awk program, which I run with the command awk -f awk_prog.txt, where the txt file contains the awk script. However, when I run this command I get an "out of memory" error. Is there another programming method I could use on Windows to achieve the same result?

Thanks,

Eddie.


Response Number 5
Name: klint
Date: November 27, 2008 at 07:03:51 Pacific
Reply:

That's strange. I made up a sample file containing 6000 lines, with the first word of each line being about 88-89 characters long, and each one unique. I tested the first loop:

while ((getline a < "fileA") > 0) {
    split(a, A, " ")
    field1ofA[A[1]] = 1
}

It ran ok. No memory errors. I didn't bother testing the second loop as that one doesn't use much memory.

So either you are reading a fileA that has many more than 6000 lines, or it has very long first fields in each line, or you are using a port of Awk with very tight memory limits.

Can you confirm the size of file A, and post the contents of a typical line from it? Also, can you make sure you are using a Windows version of Awk, and not a DOS version?

My version of Awk was from either the GNUWin32 or the MinGW port, I can't remember which.
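
A test file like the one described above could be generated roughly as follows. This is a sketch under assumptions, not the exact data used for the test: the keys here are zero-padded counters exactly 88 characters wide, and the temporary file name and filler text are invented for illustration.

```shell
#!/bin/sh
# Build a hypothetical 6000-line test file: each line starts with a unique
# 88-character key (a zero-padded counter), followed by filler text.
testfile=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 6000; i++)
        printf "%088d filler text for record %d\n", i, i
}' > "$testfile"

# Length of the first field on the first line, as a sanity check.
keylen=$(head -n 1 "$testfile" | awk '{ print length($1) }')

# Run the first loop from the reply against the file and count how many
# distinct keys ended up in the array.
loaded=$(awk -v src="$testfile" 'BEGIN {
    while ((getline a < src) > 0) {
        split(a, A, " ")
        field1ofA[A[1]] = 1
    }
    close(src)
    n = 0
    for (k in field1ofA) n++
    print n
}')

echo "key length: $keylen, keys loaded: $loaded"
rm -f "$testfile"
```

If the awk port in use cannot get through this without a memory error, the port itself is the likely culprit rather than the data.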


Response Number 6
Name: eficbf
Date: November 27, 2008 at 07:25:30 Pacific
Reply:

Thanks for the reply. It could be the version of awk I'm using; a colleague passed it on to me, so I'm not sure what version it is. I'll download the GNU version, and hopefully that will sort it out.


Response Number 7
Name: eficbf
Date: December 1, 2008 at 09:13:09 Pacific
Reply:

Klint,

Just following up: I downloaded GNU awk and the script now runs fine on Windows. I'm not sure what the other version was.

Thanks again for your help,

Eddie.

