UNIX sort Command

July 21, 2011 at 03:33:48
Specs: Windows XP
HI,

I have a pipe delimeted file with about a million records, here's a sample of the file:

acct #|service date|payment|
12345|10/01/10 |25.00|
12345|11/01/10 |50.00|
12345|01/11/11 |75.00|

12378|11/01/09 |44.00|
12378|02/11/11 |46.00|

I need to sort out the accounts and keep only the record with the most recent service date. i.e. for account 12345 I need to only keep record 3 which is 12345|01/11/11|75.00
and for acct 12378, I need to only keep 12378|02/11/11 |46.00|

Is there a way to sort out the records that I dont need in ksh.

Many Thanks


See More: UNIX sort Command

Report •

#1
July 21, 2011 at 06:46:56
sort -t \| -k1,2 file1.txt|sort -t \| -u -k1,1

As large as your data file is, you might want to consider splitting the file into pieces. Check out the unix split command.


Report •

#2
July 21, 2011 at 08:05:13
Hi Nails,

Thanks for the response, that worked on this file. However, when I tried it on a different file with many fields and the sort was on field 8 instead of 2, I changed the script from sort -t \| -k1,2 file1.txt|sort -t \| -u -k1,1
to sort -t \| -k1,8 file1.txt|sort -t \| -u -k1,1

for some reason I only got one record in the output file, any idea why? please note that field 8 had the same format mm/dd/yy.

thanks


Report •

#3
July 21, 2011 at 09:08:44
In your first example, -k1,2 works because you are sorting by two fields next to each other.

The -k1,8 sorts from field 1 to field 8; if you want to sort by field 1 and then only field 8, try this:

# untested
sort -t \| -k1,1 -k8,8 file1.txt|sort -t \| -u -k1,1


Report •

Related Solutions

#4
July 21, 2011 at 10:40:10
That did it Nails but it didn't keep the most recent date

Report •

#5
July 21, 2011 at 13:09:06
I am going back to your first data example where field 2 is the date.

First, date fields are typically sorted by YYYYMMDD format.

Since your date field is of the format:

MM/DD/YY,

field 2 has to be sorted by first the year - columns 7 and 8, then the month - columns 4 and 5 and finally the day - columns 1 and 2.

This is the revised script:

sort -t \| -r -nk1,1 -nk2.7,2.8 -nk2.4,2.5 -nk2.1,2.2 file1.txt|
awk ' BEGIN { FS="|" }
{
if (fr != $1)
   printf("%s\n", $0)

fr=$1
}'

Second, I forgot that the unix sort command does not guarantee the sort order of the data that is not ordered.

For example, sort -t \| -u -k1,1 will print out the line that contains a given value, but it does not guarantee what the rest of the line will be.

Since the file is sorted, I replaced the 2nd sort with an awk script that prints only the first line until field 1 changes.

As I said before, with a million rows, this will be slow.


Report •

#6
July 22, 2011 at 04:36:25
Thanks Nails, I'll try it.

I still have a lot to learn

Thanks again


Report •

Ask Question