Tom's Guide | Tom's Hardware | Tom's Games
![]() |
![]() |
![]() |
Hi, I am looking for a script/command to identify duplicate records by certain columns in a given file and write them to an other file. Could anybody suggest me a solution?

I am having trouble with it. Can you specify columns in sort -u ? Could u please let me know the syntax for the following example
123456|55555|678
234567|55555|654
123456|66666|745
345678|77777|654So when if i want to find out duplicates in column one and want the following results in the output file listing the duplicates.
123456|55555|678
123456|66666|745

That would eliminate duplicates based on entire line. I think Braveking wants to output all duplicate lines, with the duplication based not on the entire line, but only on certain columns.
Is that correct Braveking? How many and which columns, and how are the columns delimited, etc.

They are delimited by pipe charcters. The above example suits my requirements.
To find out duplicates, let's go with field 1. But here is what i plan to do seperately
1. Check duplicates on field 1
2. Check duplicates on field 3
3. Check duplicates for combined field for feild1 and field3

Couple days ago, we posted at same time. I had not seen your second post at that time when I was asking about columns.
Lot of ways to do this. My solution involves sorting by desired key(s), then feed the sorted file into awk. awk will hold each line by one cycle so that it will know if it is dealing with single or multiple lines (per key), and print the group or not. First solution below finds duplicates based on field 3. Change all five 3's in that script to 1's to find duplicates on field 1:
#!/bin/sh
sort -t\| -k3,3 in.txt |
awk -F\| 'BEGIN {\
getline
holdline=$0
holdkey=$3
k=1}function flushhold() {\
if (k>1)
print holdline
#else
# print "UNIQUE: " holdline
}{
if ($3==holdkey)
{k++
flushhold()}
else
# closeout current key, start new key
{flushhold()
holdkey=$3
k=1}
holdline=$0
}
END {flushhold()}' > duplicates.txtUse same logic to check for duplicates based upon
multiple fields. Change 4 lines in the script above:sort -t\| -k3,3 -k1,1 in.txt |
holdkey=$3 $1
if ($3 $1==holdkey)
holdkey=$3 $1
For awk, expressions separated by space means concatenation.
You will notice in the flushhold function that this script could just as easily output only the unique lines instead of the duplicates, or both. And if outputting both, you could output unique lines to one file and duplicate lines to another.

![]() |
![]() |
![]() |

This post is quite old and has been locked from receiving new replies. Please create a new posting instead.
| Ads by Google |