Specialty Forums
Security and Virus
General Hardware
CPUs/Overclocking
Networking
Digital Photo/Video
Office Software
PC Gaming
Console Gaming
Programming
Database
Web Development
Digital Home

General Forums
Windows XP
Windows Vista
Windows 95/98
Windows Me
Windows NT
Windows 2000
Win Server 2008
Win Server 2003
Windows 3.1
Linux
PDAs
BeOS
Novell Netware
OpenVMS
Solaris
Disk Op. System
Unix
Mac
OS/2

Drivers
Driver Scan
Driver Forum

Software
Automatic Updates

BIOS Updates

My Computing.Net

Solution Center

Free IT eBook

Howtos

Site Search

Message Find

RSS Feeds

Install Guides

Data Recovery

About

Home
Reply to Message Icon Go to Main Page Icon

comparing two files using awk.

Original Message
Name: cskumar
Date: November 20, 2006 at 21:13:03 Pacific
Subject: comparing two files using awk.
OS: unix
CPU/Ram: 8 cpu
Model/Manufacturer: HP
Comment:
How to compare substring of col1,file 1 with
col2file2 and get file1contents+col3file2 as output.

file1
-----
kumarfghh,23,12000,5000
rajakumar,24,14000,2500
rajeshchauhan,25,16000,2600
manoj,26,17000,2300

file 2
--------
123,kumar,US,
123,sukumar,UK
123,raj,Germany
40,rajesh,Australia
40,jerome,swiss
40,rakesh,india

output
-----------------------
kumarfghh,23,12000,5000,US
rajakumar,24,14000,2500,Germany
rajeshchauhan,25,16000,2600,Australia
manoj,26,17000,2300,

so if nothing matched i should get a default value as null.

Please help me,as i have been trying very hard to acheive


cskumar


Report Offensive Message For Removal


Response Number 1
Name: James Boothe
Date: November 21, 2006 at 09:48:39 Pacific
Subject: comparing two files using awk.
Reply: (edit)
A file match-up with awk is easy if the key columns are an exact match.  But in this case we have a leading edge match, and it needs to be a best-case leading edge match.  When trying to find a match for rajeshchauhan in file2, raj matches, but rajesh matches even more characters.

So, for each line in file1, all file2 entries will be scanned, and the longest matching file2 entry will be matched.

awk -F, '\
BEGIN {
while ((getline < "file2") > 0)
   file2[$2]=$3
}

{longest=0
 for (name in file2)
    if (name == substr($1,1,length(name)))
       if (length(name)>longest)
          {holdname=name
           longest=length(name)}
 if (longest>0)
     loc=file2[holdname]
 else
     loc=""
 print $0 "," loc
}' file1

The output:
kumarfghh,23,12000,5000,US
rajakumar,24,14000,2500,Germany
rajeshchauhan,25,16000,2600,Australia
manoj,26,17000,2300,


Report Offensive Follow Up For Removal

Response Number 2
Name: cskumar
Date: November 22, 2006 at 02:19:42 Pacific
Subject: comparing two files using awk.
Reply: (edit)
Thanks james,
the script works great,I have a performance issue ,if my file1 contains more than 5000 records and file2 contains more than 6000 records,result is taking more than 1 min.
how can we acheieve with less time.
can u provide some tips to increase the performance.
And if my file 2 exceeds in future,what will be limit as we are taking it in array.

Please post your valuable inputs.


cskumar


Report Offensive Follow Up For Removal

Response Number 3
Name: James Boothe
Date: November 22, 2006 at 08:11:34 Pacific
Subject: comparing two files using awk.
Reply: (edit)
A simple 2-file match up between exact keys would be an indexed match up, and very fast. But your requirement is not exact key, and requires the array to be scanned. Further, I cannot stop the scan on the first match because I must scan the entire array each time for the best match.

It would be difficult to improve the performance. But I have seen other approaches that could take far longer (even 100 times longer or more).

You ask a good question regarding the maximum size of file2 that we can pull into the array. I think the answer is platform and memory dependent. I have created very large arrays. Rather than be surprised later on, I would suggest that you create a huge file2 and test it. But be aware that the names in file2 must all be unique. Since the array is keyed by the name, multiple name entries will simply keep using the same name slot, overlaying the latest country in that slot.

Also, when you do a 2-file match-up by storing one in an array, you should always store the smallest file in the array. I made the assumption that file2 would be the smallest. But if file1 will be the smallest, the solution can be recoded that way.

One possible way to improve the performance is to create the file2 array in sorted order. Then the scan could stop after it got beyond the scan point alphabetically.


Report Offensive Follow Up For Removal

Response Number 4
Name: cskumar
Date: November 22, 2006 at 22:54:43 Pacific
Subject: comparing two files using awk.
Reply: (edit)
thanx james,
for ur inputs.i have tested the file2 with 70,000 records and file1 with 10 records,it works fine except the performace issue again.if we can quit the loop after the first occurence,will it be faster.
If we take the above files,rajeshchauhan in file1 matches with raj(file2) and rajesh(file2),so if we quit the loop once if finds raj,will it be faster?

Your valuable inputs again?

cskumar


Report Offensive Follow Up For Removal

Response Number 5
Name: James Boothe
Date: November 26, 2006 at 17:33:03 Pacific
Subject: comparing two files using awk.
Reply: (edit)
Quitting the array scan on the first hit will improve the performance by about half, but if this is not going to give you proper matches, then why do that?  But I have posted that solution below.

awk -F, 'BEGIN {
while ((getline < "file2") > 0)
   file2[$2]=$3
}
{holdname="NotOnFile"
 for (name in file2)
    if (name == substr($1,1,length(name)))
       {holdname=name
        break}
 print $0 "," file2[holdname]
}' file1

But as I stated in my previous reply, your smallest file should be the one to store in the array for scanning.  You mentioned that file1 is 10 lines and file2 is 70,000 lines.  If your file2 is going to be the largest file, then we should be storing file1 in the array.  Let me know if that is the case, and we can recode the solution.


Report Offensive Follow Up For Removal


Response Number 6
Name: cskumar
Date: November 27, 2006 at 21:48:32 Pacific
Subject: comparing two files using awk.
Reply: (edit)
Hi James,

Can u give me the solution for file2 as a largest file than file1.And i want the output of file1(full)+file2 return value.


cskumar


Report Offensive Follow Up For Removal

Response Number 7
Name: James Boothe
Date: November 29, 2006 at 11:20:58 Pacific
Subject: comparing two files using awk.
Reply: (edit)
OK, this solution is quite different. Entire file1 is stored in an array. file2 is then processed as the main file. But since file2 is just our lookup file, we do not want to print a line for each line in file2. So the processing of file2 is just a match up phase. Matched lines will store the location from file2 into a second array. Once a file1 line is matched with a location, no other matchups will be allowed with that file1 line.

At the end, END processing will print the file1 array along with matched locations if any.

It would be easy to first sort file2 by longest keys first. Then, when a long key matches to a particular file1 line, subsequent shorter keys will not even attempt to match that line.

When file2 is matched with a file1 line, I do not break out of the matchup scan. I continue scanning because the particular file2 entry being processed may need to match up to multiple file1 lines, correct?

Now for the bad news ...

I am out of the office, and the filter that I normally run my posts through (to do things like preserve the indentation when posting) is back at the office and not on my laptop. So all of the nice indentation is lost. When I get back to the office next week, I will repost it with proper indentation.

Let me know how the performance goes.

awk -F, 'BEGIN {
while ((getline < "file1") > 0)
{i++
file1[i]=$0}
imax=i
}
{
for (i in file1)
if (file1loc[i]==0)
if ($2 == substr(file1[i],1,length($2)))
file1loc[i]=$3
}
END {
for (i=1;i<=imax;i++)
print file1[i] "," file1loc[i]
}' file2


Report Offensive Follow Up For Removal



Use following form to reply to current message:

   Name: From My Computing.Net Settings
 E-Mail: From My Computing.Net Settings

Subject: comparing two files using awk.

Comments:

 
  Homepage URL (*): 
Homepage Title (*): 
         Image URL: 
 


Data Recovery Software




XP Installed to G?

exessive internet traffic

ZoneAlarm Question. Blocked Connect

Windows Live Messenger Problem

Delete $Uninstall after SP3 updates


The information on Computing.Net is the opinions of its users. Such opinions may not be accurate and they are to be used at your own risk. Computing.Net cannot verify the validity of the statements made on this site. Computing.Net and Computing.Net, LLC hereby disclaim all responsibility and liability for the content of Computing.Net and its accuracy.
PLEASE READ THE FULL DISCLAIMER AND LEGAL TERMS BY CLICKING HERE

All content ©1996-2007 Computing.Net, LLC