I have a file (sme.txt) that contains 530,000 records.
Sample lines from the file:
LBAB1 0000006982 11119026 01224624005 19880421 99991231 (end line)
LBAB1 0000006982 11119026 01224626366 19940915 99991231 (end line)
LBAB1 0000006982 11119026 01224627408 19990617 99991231 (end line)
The file is tab-separated.
I have to find the duplicates in this file based on column 4 and write them out to a different file.
Then I have to find the unique records based on $4 and write them out to another file.
I have done something like this.
(start)
awk '{print $4}' sme.txt | sort > sorted_sme.txt
uniq -d sorted_sme.txt > dup_sme.txt
uniq -u sorted_sme.txt > unq_sme.txt
cat dup_sme.txt | while read line;do
grep $line sme.txt >> final_dup_sme.txt
done
cat unq_sme.txt | while read line;do
grep $line sme.txt >> final_unq_sme.txt
done
(end)
But I am realizing this is going to take a long time, since dup_sme.txt has 6500 lines and unq_sme.txt has around 513000 lines.
Is there a better and quicker way to do this?
Will be glad of any help
-Anukta
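
One way to keep the approach above but drop the loops, as a rough sketch rather than a tested fix for the real data: grep can read all of its patterns from a file with -f, and -F makes it treat them as fixed strings, so each loop collapses into a single pass over the big file. The unique records are then the lines that contain none of the duplicated keys (-v). The function name split_with_grep, the demo directory and the sample rows are made up for illustration. As in the original loop, a key is matched anywhere on the line, not only in column 4.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): replace the two grep loops
# with one grep call each.
#   $1 = data file, $2 = duplicated keys (one per line)
#   $3 = output for duplicate records, $4 = output for unique records
split_with_grep() {
    # -F: keys are fixed strings, -f: read them from a file.
    grep -F -f "$2" "$1" > "$3"
    # -v: keep the lines that contain none of the duplicated keys.
    grep -v -F -f "$2" "$1" > "$4"
}

# Demo on three made-up tab-separated rows; K1 occurs twice in column 4.
mkdir -p sme_demo_grep
printf 'A\t1\t2\tK1\t19880421\nB\t1\t2\tK2\t19940915\nC\t1\t2\tK1\t19990617\n' > sme_demo_grep/sample.txt

# Same key extraction as in the original post, with an explicit tab separator.
awk -F'\t' '{print $4}' sme_demo_grep/sample.txt | sort | uniq -d > sme_demo_grep/dup_keys.txt

split_with_grep sme_demo_grep/sample.txt sme_demo_grep/dup_keys.txt \
    sme_demo_grep/final_dup.txt sme_demo_grep/final_unq.txt
```

On the real file the call would presumably be split_with_grep sme.txt dup_sme.txt final_dup_sme.txt final_unq_sme.txt, reusing the dup_sme.txt produced by uniq -d.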

Actually the sorts work pretty fast.
It becomes abominably slow when it gets to the part where it pattern matches with the grep.
Basically, it runs a grep over the big file inside a loop.
If the number of unique records found is 500000, then it loops that many times to find a pattern match.
There must be a smarter way of doing this.
Now, what is it, is the question.... ;-))
-Anukta
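
A possible way around the loop, sketched under assumptions rather than tested on the real file: let awk read the file twice. On the first pass it counts how often each column-4 value occurs; on the second pass it sends each record to one of two output files depending on that count. No sort is needed and the original record order is kept, at the cost of holding one counter per distinct key in memory. The function name split_by_col4, the demo directory and the sample rows are invented here.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): split a tab-separated file
# into records whose column-4 value is repeated and records whose
# column-4 value occurs only once.
#   $1 = data file, $2 = output for duplicates, $3 = output for uniques
split_by_col4() {
    # The data file is named twice so that awk reads it twice.
    # Pass 1 (NR == FNR): count the occurrences of each key.
    # Pass 2: route every line by the count stored for its key.
    awk -F'\t' -v dup="$2" -v unq="$3" '
        NR == FNR     { count[$4]++; next }
        count[$4] > 1 { print > dup; next }
                      { print > unq }
    ' "$1" "$1"
}

# Demo on three made-up tab-separated rows; K1 occurs twice in column 4.
mkdir -p sme_demo_awk
printf 'A\t1\t2\tK1\t19880421\nB\t1\t2\tK2\t19940915\nC\t1\t2\tK1\t19990617\n' > sme_demo_awk/sample.txt

split_by_col4 sme_demo_awk/sample.txt sme_demo_awk/dup.txt sme_demo_awk/unq.txt
```

On the real data the call would look like split_by_col4 sme.txt final_dup_sme.txt final_unq_sme.txt.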

Here is my solution. The script sorts the big file by word4, which is then piped into awk which writes two output files: one file containing groups of lines with the same word4's, the other file with lines having unique word4's. If your awk does not support functions, it is a simple recode.
To prevent this web page from removing my indentation, I used leading underscores in place of leading spaces. After you copy and paste this into a script, you can change those underscores to spaces with:
sed "s/_/ /g" pasted_script > good_script
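#!/bin/sh
# Sort on column 4 so that records sharing a key end up adjacent.
sort -k4,4 sme.txt |
awk -F'\t' 'BEGIN {
# Prime the hold variables with the first record.
getline
holdline=$0
hold4=$4
k=1}
# Write the held line to the duplicates file if its key repeats,
# otherwise to the uniques file.
function printhold() {
if (k>1)
___print holdline > "dupsme.txt"
else
___print holdline > "unqsme.txt"
}
{
if ($4==hold4)
__{k++
___printhold()}
else
__{printhold()
___hold4=$4
___k=1}
holdline=$0
}
END {printhold()}'
exit 0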

THAT was a terrific piece of code. And it works pretty fast as well. It sifted through 3.5 million rows in just 3 min !!!
Thanks a lot. :-))
-Anukta
