A file match-up with awk is easy if the key columns are an exact match. But in this case we have a leading edge match, and it needs to be a best-case leading edge match. When trying to find a match for rajeshchauhan in file2, raj matches, but rajesh matches even more characters.
So, for each line in file1, all file2 entries will be scanned, and the longest matching file2 entry will be matched.
awk -F, '\ BEGIN { while ((getline < "file2") > 0) file2[$2]=$3 }
{longest=0 for (name in file2) if (name == substr($1,1,length(name))) if (length(name)>longest) {holdname=name longest=length(name)} if (longest>0) loc=file2[holdname] else loc="" print $0 "," loc }' file1
The output: kumarfghh,23,12000,5000,US rajakumar,24,14000,2500,Germany rajeshchauhan,25,16000,2600,Australia manoj,26,17000,2300,
Thanks james, the script works great,I have a performance issue ,if my file1 contains more than 5000 records and file2 contains more than 6000 records,result is taking more than 1 min. how can we acheieve with less time. can u provide some tips to increase the performance. And if my file 2 exceeds in future,what will be limit as we are taking it in array.
A simple 2-file match up between exact keys would be an indexed match up, and very fast. But your requirement is not exact key, and requires the array to be scanned. Further, I cannot stop the scan on the first match because I must scan the entire array each time for the best match.
It would be difficult to improve the performance. But I have seen other approaches that could take far longer (even 100 times longer or more).
You ask a good question regarding the maximum size of file2 that we can pull into the array. I think the answer is platform and memory dependent. I have created very large arrays. Rather than be surprised later on, I would suggest that you create a huge file2 and test it. But be aware that the names in file2 must all be unique. Since the array is keyed by the name, multiple name entries will simply keep using the same name slot, overlaying the latest country in that slot.
Also, when you do a 2-file match-up by storing one in an array, you should always store the smallest file in the array. I made the assumption that file2 would be the smallest. But if file1 will be the smallest, the solution can be recoded that way.
One possible way to improve the performance is to create the file2 array in sorted order. Then the scan could stop after it got beyond the scan point alphabetically.
thanx james, for ur inputs.i have tested the file2 with 70,000 records and file1 with 10 records,it works fine except the performace issue again.if we can quit the loop after the first occurence,will it be faster. If we take the above files,rajeshchauhan in file1 matches with raj(file2) and rajesh(file2),so if we quit the loop once if finds raj,will it be faster?
Quitting the array scan on the first hit will improve the performance by about half, but if this is not going to give you proper matches, then why do that? But I have posted that solution below.
awk -F, 'BEGIN { while ((getline < "file2") > 0) file2[$2]=$3 } {holdname="NotOnFile" for (name in file2) if (name == substr($1,1,length(name))) {holdname=name break} print $0 "," file2[holdname] }' file1
But as I stated in my previous reply, your smallest file should be the one to store in the array for scanning. You mentioned that file1 is 10 lines and file2 is 70,000 lines. If your file2 is going to be the largest file, then we should be storing file1 in the array. Let me know if that is the case, and we can recode the solution.
OK, this solution is quite different. Entire file1 is stored in an array. file2 is then processed as the main file. But since file2 is just our lookup file, we do not want to print a line for each line in file2. So the processing of file2 is just a match up phase. Matched lines will store the location from file2 into a second array. Once a file1 line is matched with a location, no other matchups will be allowed with that file1 line.
At the end, END processing will print the file1 array along with matched locations if any.
It would be easy to first sort file2 by longest keys first. Then, when a long key matches to a particular file1 line, subsequent shorter keys will not even attempt to match that line.
When file2 is matched with a file1 line, I do not break out of the matchup scan. I continue scanning because the particular file2 entry being processed may need to match up to multiple file1 lines, correct?
Now for the bad news ...
I am out of the office, and the filter that I normally run my posts through (to do things like preserve the indentation when posting) is back at the office and not on my laptop. So all of the nice indentation is lost. When I get back to the office next week, I will repost it with proper indentation.
Let me know how the performance goes.
awk -F, 'BEGIN { while ((getline < "file1") > 0) {i++ file1[i]=$0} imax=i } { for (i in file1) if (file1loc[i]==0) if ($2 == substr(file1[i],1,length($2))) file1loc[i]=$3 } END { for (i=1;i<=imax;i++) print file1[i] "," file1loc[i] }' file2
The information on Computing.Net is the opinions of its users. Such
opinions may not be accurate and they are to be used at your own risk.
Computing.Net cannot verify the validity of the statements made on this site. Computing.Net and Computing.Net, LLC hereby disclaim all responsibility and liability for the content of Computing.Net and its accuracy.
PLEASE READ THE FULL DISCLAIMER AND LEGAL TERMS BY CLICKING HERE