
Identify duplicate records in UNIX

Name: Braveking
Date: January 6, 2004 at 10:20:45 Pacific
OS: Windows 2000
CPU/Ram: 512MB
Comment:

Hi, I am looking for a script or command that identifies duplicate records based on certain columns of a given file and writes them to another file. Could anybody suggest a solution?




Response Number 1
Name: nails
Date: January 6, 2004 at 11:37:18 Pacific
Reply:

Hi:

I would use the Unix sort command with the -u option to eliminate duplicates.

Regards,

Nails


Response Number 2
Name: Braveking
Date: January 6, 2004 at 11:41:35 Pacific
Reply:

I am having trouble with it. Can you specify columns with sort -u? Could you please let me know the syntax for the following example:

123456|55555|678
234567|55555|654
123456|66666|745
345678|77777|654

If I want to find duplicates in column one, the output file should list the duplicate lines like this:
123456|55555|678
123456|66666|745


Response Number 3
Name: James Boothe
Date: January 6, 2004 at 11:43:39 Pacific
Reply:

That would eliminate duplicates based on the entire line. I think Braveking wants to output all duplicate lines, with the duplication based not on the entire line but only on certain columns.

Is that correct, Braveking? How many columns, which ones, and how are they delimited?
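
To answer the earlier question about columns and sort -u: sort does accept a delimiter (-t) and a key (-k) together with -u, but it then keeps one line per key instead of listing the duplicates. A small sketch, using the sample data from this thread and made-up file names (sample.txt, one_per_key.txt):

```shell
#!/bin/sh
# Sample data from the thread, written to a hypothetical file name.
cat > sample.txt <<'EOF'
123456|55555|678
234567|55555|654
123456|66666|745
345678|77777|654
EOF

# -t sets the field delimiter, -k1,1 limits the sort key to field 1,
# and -u keeps a single line for each distinct key value.
sort -t'|' -k1,1 -u sample.txt > one_per_key.txt

# Show what survived: one line per distinct value of field 1.
cat one_per_key.txt
```

Only one of the two 123456 lines survives, so this removes duplicates by key but cannot report them, which is why a different approach is needed here.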


Response Number 4
Name: Braveking
Date: January 6, 2004 at 11:59:56 Pacific
Reply:

They are delimited by pipe characters. The example above suits my requirements.

To find duplicates, let's go with field 1. But here is what I plan to do separately:

1. Check duplicates on field 1
2. Check duplicates on field 3
3. Check duplicates on field 1 and field 3 combined


Response Number 5
Name: James Boothe
Date: January 8, 2004 at 12:52:53 Pacific
Reply:

A couple of days ago we posted at the same time. I had not seen your second post when I asked about the columns.

There are a lot of ways to do this. My solution sorts by the desired key(s), then feeds the sorted file into awk. awk holds each line for one cycle so that it knows whether it is dealing with a single line or multiple lines per key, and prints the group or not. The first solution below finds duplicates based on field 3. Change all five 3's in that script to 1's to find duplicates on field 1:

#!/bin/sh

sort -t\| -k3,3 in.txt |
awk -F\| 'BEGIN {
    # prime the hold area with the first sorted line
    getline
    holdline=$0
    holdkey=$3
    k=1
}

# print the held line only when its key has occurred more than once
function flushhold() {
    if (k>1)
        print holdline
#   else
#       print "UNIQUE: " holdline
}

{
    if ($3==holdkey) {
        k++
        flushhold()
    } else {
        # closeout current key, start new key
        flushhold()
        holdkey=$3
        k=1
    }
    holdline=$0
}
END {flushhold()}' > duplicates.txt

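Use the same logic to check for duplicates based upon
multiple fields.  Change 4 lines in the script above:

sort -t\| -k3,3 -k1,1 in.txt |

holdkey=$3 FS $1

if (($3 FS $1)==holdkey)

   holdkey=$3 FS $1

For awk, expressions separated by a space are concatenated. Putting FS (the pipe) between the two fields keeps the combined key unambiguous: without it, fields "12" and "3" would build the same key as "1" and "23".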
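
Put together, the four changes might look like the sketch below. The sample rows and the file names (sample2.txt, dup_by_3_and_1.txt) are made up for illustration. Joining the two fields with FS is an assumption added here so that two different field pairs cannot collapse into the same combined key.

```shell
#!/bin/sh
# Hypothetical sample: the first and third lines share both field 3 and field 1.
cat > sample2.txt <<'EOF'
123456|55555|678
234567|55555|654
123456|66666|678
345678|77777|654
EOF

# Sort on field 3, then field 1, so lines with equal combined keys are adjacent.
sort -t'|' -k3,3 -k1,1 sample2.txt |
awk -F'|' '
# Print the held line only when its key has occurred more than once.
function flushhold() {
    if (k > 1)
        print holdline
}
BEGIN {
    # Prime the hold area with the first sorted line.
    getline
    holdline = $0
    holdkey = $3 FS $1
    k = 1
}
{
    if (($3 FS $1) == holdkey) {
        k++
        flushhold()
    } else {
        # Close out the current key and start a new one.
        flushhold()
        holdkey = $3 FS $1
        k = 1
    }
    holdline = $0
}
END { flushhold() }' > dup_by_3_and_1.txt

cat dup_by_3_and_1.txt
```

With this sample, only the two lines whose field 3 is 678 and field 1 is 123456 are written. The two lines ending in 654 share field 3 but not field 1, so they are left out.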

You will notice in the flushhold function that this script could just as easily output only the unique lines instead of the duplicates, or both.  And if outputting both, you could output unique lines to one file and duplicate lines to another.
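
A rough sketch of that two-file variant, keyed on field 1 and run against the sample data from this thread. The file names sample3.txt and unique.txt, the emptying of both output files up front, and the NR check in END are assumptions added for illustration.

```shell
#!/bin/sh
# Sample data from the thread, written to a hypothetical file name.
cat > sample3.txt <<'EOF'
123456|55555|678
234567|55555|654
123456|66666|745
345678|77777|654
EOF

# Start from empty outputs so a rerun never shows stale results.
: > duplicates.txt
: > unique.txt

sort -t'|' -k1,1 sample3.txt |
awk -F'|' '
# Route the held line by how many times its key has been seen.
function flushhold() {
    if (k > 1)
        print holdline > "duplicates.txt"
    else
        print holdline > "unique.txt"
}
BEGIN { getline; holdline = $0; holdkey = $1; k = 1 }
{
    if ($1 == holdkey) {
        k++
        flushhold()
    } else {
        flushhold()
        holdkey = $1
        k = 1
    }
    holdline = $0
}
# Skip the final flush when the input was empty.
END { if (NR > 0) flushhold() }'
```

For this sample, duplicates.txt receives the two 123456 lines and unique.txt receives the 234567 and 345678 lines.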

