
Identifying duplicate records


Original Message
Name: anilcgowda
Date: September 20, 2005 at 18:01:39 Pacific
Subject: Identifying duplicate records
OS: UNIX
CPU/Ram: 1GB
Comment:

Hi,

I have a file with duplicate records in it. I want to identify the duplicate records and put them into a separate file. Please let me know how to do it.

Please note that each record is 2 lines long and that the records contain spaces.

Thanks,
Anil




Response Number 1
Name: nails
Date: September 21, 2005 at 08:57:12 Pacific
Subject: Identifying duplicate records
Reply:

I see no clean way of doing this. My idea is to place each 2-line record into its own file, compare the files, and delete the duplicates. Here is a 3-step kludge:

1) Create a mytmp directory in the directory where the file resides. Then use the Unix split command to break the file into 2-line files:

split -l 2 database.txt mytmp/m1

All the file names will start with "m1".
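
As a rough illustration of step 1, the sketch below runs split on a small made-up input. The file name split_demo.txt and its contents are assumptions for the demo only; mytmp and the m1 prefix come from the step above.

```shell
# Demo input: three 2-line records, the first and third identical.
# split_demo.txt is a hypothetical name used only for this sketch.
printf '%s\n' 'rec A line 1' 'rec A line 2' \
              'rec B line 1' 'rec B line 2' \
              'rec A line 1' 'rec A line 2' > split_demo.txt

mkdir -p mytmp

# One output file per 2 input lines: mytmp/m1aa, mytmp/m1ab, mytmp/m1ac
split -l 2 split_demo.txt mytmp/m1

# Identical records produce identical checksum and size columns
cksum mytmp/m1*
```

With the default two-letter suffix, split can name at most 676 pieces, so an input longer than 1352 lines would need a longer suffix, for example `split -a 4 -l 2 ...`.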

2) In ksh, compute the checksum of each file with cksum, sort the output so that identical files end up next to each other, and delete any file whose checksum and size repeat those of the file before it:

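#!/bin/ksh
# cksum prints "checksum size filename" for each file. Sorting numerically
# on checksum, then size, puts files with identical contents on adjacent lines.
prevsum=""
prevsize=""
cksum $(find mytmp -type f -name "m1*" -print) | sort -k 1,1n -k 2,2n |
while read sum size name
do
    # same checksum and size as the previous file: treat it as a duplicate
    if [[ "$sum" = "$prevsum" && "$size" = "$prevsize" ]]; then
        rm "$name"
    else
        prevsum=$sum
        prevsize=$size
    fi
done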

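3) Finally, put everything back together from mytmp. Sort the file names first, since find does not promise to list them in the order split created them:

find mytmp -type f -name "m1*" -print | sort | xargs cat > newdatabase.txt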

Sorry, it can't be cleaner.
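
For comparison, a single awk pass might do the same job without temporary files, and it also writes the duplicates to their own file, as the original question asked. This is only a sketch under assumptions: every record is exactly 2 lines, duplicates match byte for byte, and all distinct records fit in memory. The names sample.txt, sample.unique.txt, and sample.dups.txt are hypothetical.

```shell
# Demo input: three 2-line records, the first and third identical.
printf '%s\n' 'john smith' '12 main st' \
              'jane doe' '9 elm ave' \
              'john smith' '12 main st' > sample.txt

# Start from empty output files so an earlier run leaves nothing behind.
: > sample.unique.txt
: > sample.dups.txt

# Odd-numbered lines are held; each even-numbered line completes a record.
# The first time a record is seen it goes to sample.unique.txt,
# every later copy goes to sample.dups.txt.
awk '
NR % 2 == 1 { first = $0; next }
{
    key = first "\n" $0
    if (seen[key]++) {
        print key > "sample.dups.txt"
    } else {
        print key > "sample.unique.txt"
    }
}' sample.txt
```

Unlike the split and cksum approach, this keeps the surviving records in their original order and compares the full text of each record rather than a checksum.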

