Removing Duplicate Records
|
Original Message
|
Name: AlanTu
Date: July 10, 2004 at 02:15:11 Pacific
Subject: Removing Duplicate Records
OS: UNIX
CPU/Ram: Unknown
|
Comment: My file is unsorted and has fixed-length fields with no delimiters or white space. I want to drop all records where a portion of the record (the unique identifier) is duplicated. Example records:

    0001ABCD+1000
    0002WXYZ-2000
    0003ABCD+1500
    0002WXYZ+2000

The unique identifier is character positions 1 through 8 inclusive, and in the above file the 2nd and 4th records match. In these cases I want to drop BOTH records from the output file.
Any help much appreciated.
|
|
Response Number 1
|
Name: nails
Date: July 10, 2004 at 09:15:22 Pacific
|
Reply: Hi: If your sort command supports the field start and stop syntax, this works:

    sort -k 1.1,1.8 -u file

I know that the Solaris and Red Hat Linux versions of sort work. If yours doesn't, I would sort the entire file, then read it a line at a time, checking whether the first 8 characters equal the previous line's first 8 characters. If they do, ignore the line; if not, send that line to a new file. Regards, Nails
|
|
Response Number 2
|
Name: Jim Boothe
Date: July 10, 2004 at 09:33:41 Pacific
|
Reply: Nails, AlanTu wants to drop the entire duplicate set. sort -u would leave one record per duplicate set. And if you sort and read the file line by line, you will need to delay output of each line until you get the next record to see if it matches.
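
A minimal sketch of the "sort, then delay output" idea described above. It is not code posted in the thread: the function name drop_dup_sets and the temporary sample file are hypothetical, and LC_ALL=C is an assumption added so that records sharing the same first 8 characters are guaranteed to end up adjacent. The original file order is not preserved.

```shell
#!/bin/sh
# Sample data from the original question (key = characters 1-8).
sample=$(mktemp)
printf '%s\n' 0001ABCD+1000 0002WXYZ-2000 0003ABCD+1500 0002WXYZ+2000 > "$sample"

# drop_dup_sets FILE (hypothetical helper):
# sort so that equal keys are adjacent, then hold each record back
# until the next key is seen; print it only if its key occurred once.
drop_dup_sets() {
    LC_ALL=C sort "$1" | awk '
        {
            key = substr($0, 1, 8)
            # A new key has started: emit the held record if it was alone.
            if (NR > 1 && key != prev_key && count == 1)
                print held
            if (key == prev_key) {
                count++                 # another member of the same set
            } else {
                count = 1               # first record of a new set
                held = $0
            }
            prev_key = key
        }
        END { if (NR > 0 && count == 1) print held }
    '
}

drop_dup_sets "$sample"
# prints:
# 0001ABCD+1000
# 0003ABCD+1500
```

Both 0002WXYZ records are dropped because the held record is only printed once the following key shows that it had no partner.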
|
|
Response Number 5
|
Name: AlanTu
Date: July 12, 2004 at 02:03:28 Pacific
|
Reply: Thanks for the input everyone. The last solution looks good except the -w flag for the uniq command does not appear to be available to me. I am using the Korn shell under AIX 4.3. The sticking point appears to be the inability to specify a byte position range for the uniq command.
|
|
Response Number 6
|
Name: Wolfbone
Date: July 12, 2004 at 03:30:49 Pacific
|
Reply:

    awk '{k=substr($0,1,8) ; line[k]=$0 ; freq[k]++ } END { for (x in freq) if (freq[x]==1) print line[x]}' file

will get rid of the dupes but won't necessarily preserve the order. If you need to preserve the order, just say so in your next reply and we'll try again!
|
|
Response Number 7
|
Name: Wolfbone
Date: July 12, 2004 at 04:39:32 Pacific
|
Reply: Well, if there's no white space in your records it's easy to preserve the order:

    cat -n file | sed 's/\(.*\t\)\(.\{8\}\)\(.*\)/\1\3\t\2/' | sort -k3 | uniq -u -f2 | sort -n -k1 | sed 's/.*\t\(.*\)\t\(.*\)/\2\1/'
|
|
Response Number 8
|
Name: AlanTu
Date: July 12, 2004 at 08:13:56 Pacific
|
Reply: Thanks for the last 2 replies; they are really helpful. The awk script works fine but, you are right, I do need to preserve the order. I have not had success using the sed alternative. Can you summarise what it is doing?
|
|
Response Number 9
|
Name: Wolfbone
Date: July 12, 2004 at 08:40:43 Pacific
|
Reply: The 'cat -n' puts the line number (right justified) and a tab at the beginning of each line.

Next, the first sed statement chops each line up into 3 fields: the first is everything up to and including the first tab, the second is the first 8 characters of the original line, and the third is the remainder of the line. Then the fields are spat out in the order 1, 3, tab, 2, so that the 8 character field is last and each field is separated by white space.

Next, the lines are sorted on the last field (-k3) and piped into 'uniq -u', skipping the first two fields (-f2), so that only unique occurrences of the 8 character field remain.

Finally, the lines are re-sorted on the first field to put them back in order and then piped into sed to cut out the line numbers (the first field) and switch the other two fields back into the correct order.

The main idea (to get round the fact you can't use '-w8') is to put the 8 character key field at the end of the line so that the uniq command can skip the two preceding fields. If you try cutting out pipe sections from the end backwards, maybe you can see which one is going wrong and adjust it.
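
The thread does not say why the sed version failed on AIX. One possibility (an assumption, not confirmed here) is that sed there does not treat \t as a tab, since POSIX does not require that escape. The sketch below follows the same steps as the explanation above but swaps both sed stages for awk, which does understand "\t" in strings. The function name keep_unique_in_order is hypothetical, and the sample is the question's data reordered so that the effect of the final re-sort is visible.

```shell
#!/bin/sh
# Question's sample records, reordered (assumption) to show order is kept.
sample2=$(mktemp)
printf '%s\n' 0003ABCD+1500 0002WXYZ-2000 0001ABCD+1000 0002WXYZ+2000 > "$sample2"

# keep_unique_in_order FILE (hypothetical helper).
# Assumes records contain no white space and are longer than 8 characters.
keep_unique_in_order() {
    # 1. Number each line and move the 8-character key to the end:
    #    "<line no>\t<rest of record>\t<key>"
    awk '{ printf "%d\t%s\t%s\n", NR, substr($0, 9), substr($0, 1, 8) }' "$1" |
        # 2. Sort on the key (third field) so equal keys are adjacent.
        LC_ALL=C sort -k3 |
        # 3. Skip the first two fields; keep only keys that occur once.
        uniq -u -f2 |
        # 4. Restore the original order using the line number.
        sort -n -k1 |
        # 5. Drop the line number and put key and remainder back together.
        awk -F '\t' '{ print $3 $2 }'
}

keep_unique_in_order "$sample2"
# prints:
# 0003ABCD+1500
# 0001ABCD+1000
```

Running the function with stages removed from the end backwards, as suggested above, shows the intermediate form of each record.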
|
|
Response Number 10
|
Name: Wolfbone
Date: July 12, 2004 at 09:19:23 Pacific
|
Reply: Failing that, you could transplant the awk method into a similar pipeline:

    cat -n file | awk '{k=substr($0,8,8) ; line[k]=$0 ; freq[k]++ } END { for (x in freq) if (freq[x]==1) print line[x]}' | sort -n -k1 | sed 's/.*\t//'

Note that these methods depend on 'cat -n' putting a tab at the end of the line number, not a space - maybe it's different with your version of cat?
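
A different route, not posted in the thread, avoids both 'cat -n' and the tab question by letting awk read the file twice: the first pass counts each 8-character key, the second pass prints only records whose key was seen once. This is a sketch under the assumption that the input is a regular, non-empty file that can be read twice (not a pipe); the function name drop_dups_two_pass and the reordered sample are hypothetical.

```shell
#!/bin/sh
# Question's sample records, reordered (assumption) to show order is kept.
sample3=$(mktemp)
printf '%s\n' 0003ABCD+1500 0002WXYZ-2000 0001ABCD+1000 0002WXYZ+2000 > "$sample3"

# drop_dups_two_pass FILE (hypothetical helper).
drop_dups_two_pass() {
    # The file name is given twice, so awk reads it twice.
    # NR == FNR is only true while the first copy is being read.
    awk '
        NR == FNR { freq[substr($0, 1, 8)]++; next }   # pass 1: count keys
        freq[substr($0, 1, 8)] == 1                    # pass 2: print singletons
    ' "$1" "$1"
}

drop_dups_two_pass "$sample3"
# prints:
# 0003ABCD+1500
# 0001ABCD+1000
```

Because the second pass walks the file in its original order, no line numbers or re-sorting are needed; the cost is holding one counter per distinct key in memory.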
|
|
Response Number 11
|
Name: AlanTu
Date: July 13, 2004 at 14:06:29 Pacific
|
Reply: Wolfbone, thanks for all your assistance on this. Using a combination of your answers, I finally have the tools to crack this. Brilliant.