Finding unique records from file

Name: anukta_c
Date: February 12, 2003 at 07:21:49 Pacific
OS: HP-UX
CPU/Ram: -
Comment:

I have a file (sme.txt) that contains 530,000 records.

Sample lines from the file:
LBAB1 0000006982 11119026 01224624005 19880421 99991231 (end line)
LBAB1 0000006982 11119026 01224626366 19940915 99991231 (end line)
LBAB1 0000006982 11119026 01224627408 19990617 99991231 (end line)

The file is tab-separated.

I have to find the duplicates in this file based on column 4 and write them out to a different file.
Then I have to find the unique records based on $4 and write those out to another file.

I have done something like this.
(start)
awk '{print $4}' sme.txt | sort > sorted_sme.txt

uniq -d sorted_sme.txt > dup_sme.txt
uniq -u sorted_sme.txt > unq_sme.txt

cat dup_sme.txt | while read line; do
    grep "$line" sme.txt >> final_dup_sme.txt
done

cat unq_sme.txt | while read line; do
    grep "$line" sme.txt >> final_unq_sme.txt
done
(end)

But I am realizing this is going to take a long time, since dup_sme.txt has 6500 lines and unq_sme.txt has around 513000 lines.

Is there a better and quicker way to do this?
I will be glad of any help.
-Anukta




Response Number 1
Name: David Perry
Date: February 13, 2003 at 04:16:41 Pacific
Reply:

Does the '-u' flag to sort help you in the first command?
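
For reference, a small sketch of what `sort -u` does when keyed on column 4. The file name demo.txt and its three rows are made up for illustration. Since `-u` keeps only one line per key, it can list the distinct keys, but on its own it does not separate the duplicated records from the unique ones.

```shell
#!/bin/sh
# Hypothetical sample data: three tab-separated rows, two sharing column 4.
TAB=$(printf '\t')
workdir=$(mktemp -d)
printf 'A\t1\tx\t111\nB\t2\ty\t111\nC\t3\tz\t222\n' > "$workdir/demo.txt"

# -u keeps a single line from each set of lines whose sort keys compare equal.
sort -t "$TAB" -u -k4,4 "$workdir/demo.txt" > "$workdir/one_per_key.txt"

# Count the surviving lines: one for key 111 and one for key 222.
kept=$(wc -l < "$workdir/one_per_key.txt" | tr -d ' ')
echo "lines kept: $kept"
```

Which of the two rows with key 111 survives can depend on the sort implementation, so the sketch only counts lines.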



Response Number 2
Name: anukta_c
Date: February 13, 2003 at 05:47:51 Pacific
Reply:


Actually the sorts work pretty fast.
It becomes abominably slow when it gets to the part where it pattern-matches with grep.
Basically it greps the big file inside a loop.
If 500000 unique records are found, it loops that many times, scanning the whole file on each pass.
There must be a smarter way of doing this.
Now, what is it, is the question.... ;-))


-Anukta
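
One possible way to avoid the loop, sketched under the assumption that dup_sme.txt holds one column-4 key per line as in the script above: load the keys into an awk array once, then read the big file once. The sample rows below are hypothetical stand-ins for the real data.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for sme.txt and for the key list made by `uniq -d`.
printf 'A\t1\tx\t111\nB\t2\ty\t222\nC\t3\tz\t111\n' > sme.txt
printf '111\n' > dup_sme.txt

# First file (NR == FNR): remember each key.
# Second file: print the rows whose column 4 is a remembered key.
# Each file is read exactly once.
awk -F'\t' 'NR == FNR { keys[$1]; next } $4 in keys' dup_sme.txt sme.txt > final_dup_sme.txt

cat final_dup_sme.txt
```

Two things to keep in mind with this sketch: it compares column 4 exactly, whereas the grep loop matches the key anywhere on the line, and the `NR == FNR` test misfires if the key file is empty.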



Response Number 3
Name: James Boothe
Date: February 13, 2003 at 13:58:32 Pacific
Reply:

Here is my solution. The script sorts the big file by word4 and pipes the result into awk, which writes two output files: one containing the groups of lines that share the same word4, the other containing the lines whose word4 is unique. If your awk does not support functions, it is a simple recode.

#!/bin/sh

# The file is tab-separated, so pass an explicit tab to both sort and awk.
TAB=$(printf '\t')

sort -t "$TAB" -k4,4 sme.txt |
awk -F'\t' 'BEGIN {
    # Prime the hold variables with the first sorted line.
    getline
    holdline = $0
    hold4 = $4
    k = 1
}
function printhold() {
    # k > 1 means the held line belongs to a group sharing the same word4.
    if (k > 1)
        print holdline > "dupsme.txt"
    else
        print holdline > "unqsme.txt"
}
{
    if ($4 == hold4) {
        k++
        printhold()
    } else {
        printhold()
        hold4 = $4
        k = 1
    }
    holdline = $0
}
# Flush the last held line, unless the input was empty.
END { if (NR > 0) printhold() }'

exit 0
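
A variation, offered as a sketch rather than a replacement: if the original line order matters and the distinct column-4 values fit in memory, awk can read the file twice and skip the sort. The output names dup_out.txt and unq_out.txt and the sample rows are hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: key 111 appears twice, key 222 once (tab-separated).
printf 'A\t1\tx\t111\nB\t2\ty\t222\nC\t3\tz\t111\n' > sme.txt

# Pass 1 (NR == FNR): count how often each column-4 value occurs.
# Pass 2: route each line by its count; input order is preserved.
awk -F'\t' '
    NR == FNR { count[$4]++; next }
    count[$4] > 1 { print > "dup_out.txt"; next }
    { print > "unq_out.txt" }
' sme.txt sme.txt

echo "duplicates:"; cat dup_out.txt
echo "unique:";     cat unq_out.txt
```

The trade-off is memory for the count array against the cost of sorting. Which one wins on a file of this size would need to be measured.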



Response Number 4
Name: anukta_c
Date: February 14, 2003 at 03:14:14 Pacific
Reply:


THAT was a terrific piece of code. And it works pretty fast as well. It sifted through 3.5 million rows in just 3 min !!!

Thanks a lot. :-))

-Anukta

