Solved removing duplicate rows based on a column !!

October 27, 2011 at 11:30:57
Specs: Windows Vista
Hey Guys, I have a file like this

ID|NAME|EMAIL|COUNTRY
1|Keith|k@K.com|us
2|Shelly|Shelly@shell.com|us
3|Kelly|k@K.com|us

As you can see there are two records with k@K.com. My script should find out if the number of duplicate email addresses are more than 10 (10 records with the same email address) then remove it, otherwise preserve it.

PleaseHELP!


See More: removing duplicate rows based on a column !!

Report •

#1
October 27, 2011 at 13:14:15
✔ Best Answer
A method using awk: read the data file and save the count of the third field into an array. Then, read the same file in the main loop and print only lines where the 3rd field count is less than 10:

#!/bin/ksh

rm -f mynewdata.txt
awk ' BEGIN { FS="|";
   while ( getline < "mydata.txt" > 0 )
      myarr[$3]=myarr[$3]+1
}
{
if($3 in myarr)
  if(myarr[$3] < 10)
     print $0 >> "mynewdata.txt"

} ' mydata.txt


Report •

#2
October 28, 2011 at 08:06:56
Thanks and it worked like a charm. I need one more thing if you can help me out..

ID|FNAME|LNAME|EMAIL|COUNTRY

1|SMITH|JOHN|SMITH@UA.COM|US
2|SMITH|JOHN|SMITH@US.COM|US

Similar like above, I need to check for First name + Last name duplicates in a file and print if they are < 10. Please let me know.


Report •
Related Solutions


Ask Question