Combine duplicate records

UNIX
November 17, 2008 at 05:15:33
Specs: UNIX, UNIX box
I have an extract of a csv file that looks like this:

5619,Marcus Lindsay Asher,2005/06,3500,525
5806,John Clifford,1999/00,3000,450
5806,John Clifford,2000/01,2000,300
5887,Geoffery William,2005/06,5000,750

I would like to combine duplicate records into a single record with the new single record containing additional fields (for example the last 3 fields). In the example file above, the first field is the unique field. So I would like my output to be like below:

5619,Marcus Lindsay Asher,2005/06,3500,525
5806,John Clifford,1999/00,3000,450,2000/01,2000,300
5887,Geoffery William,2005/06,5000,750

Can someone please please help me with a solution using Unix scripting, awk or C, whichever is simpler.

Thank you.


See More: Combine duplicate records

Report •


#1
November 17, 2008 at 09:58:00
One way is this Korn shell script. In a normal break point report such as this, the script assumes the datafile is sorted by the first field key:



#!/bin/ksh

fflag=0
while IFS="," read f1 f2 f3 f4 f5
do
if [[ $fflag = 0 ]]
then
fflag=1
sf1="$f1"
printf "%s,%s,%s,%s,%s" "$f1" "$f2" "$f3" "$f4" "$f5"
continue
fi

if [[ "$f1" != "$sf1" ]]
then # changes
printf "\n%s,%s,%s,%s,%s" "$f1" "$f2" "$f3" "$f4" "$f5"
else # does NOT change
printf ",%s,%s,%s" "$f3" "$f4" "$f5"
fi

sf1="$f1"
done < datafile
echo ""

Of course, this particular script only works for a comma-delimited record of 5 fixed fields.


Report •

#2
November 17, 2008 at 16:16:31
Thank you nails. Almost there. Only problem left is the third printf (when it does not change) writes the f3, f4 and f5 at the beginning of the line. So the output becomes as below:

5619,Marcus Lindsay Asher,2005/06,3500,525
,2000/01,2000,300d,1999/00,3000,450
5887,Geoffery William,2005/06,5000,750


Report •

#3
November 17, 2008 at 16:52:43
Actually nails, the output is correct. Only thing is the third printf (when it does not change) writes f3, f4 and f5 on a new line instead of appending at the end. So my output becomes as below:

5619,Marcus Lindsay Asher,2005/06,3500,525
5806,John Clifford,1999/00,3000,450
,2000/01,2000,300
5887,Geoffery William,2005/06,5000,750


Report •

Related Solutions

#4
November 17, 2008 at 19:43:45
I checked my code again, and I'm not having any trouble. The only way I can duplicate your error is placing a newline at the end of the 2nd printf:

printf "\n%s,%s,%s,%s,%s\n" "$f1" "$f2" "$f3" "$f4" "$f5"

This is incorrect as it should be this:

printf "\n%s,%s,%s,%s,%s" "$f1" "$f2" "$f3" "$f4" "$f5"

I don't know what else to tell you. BTW, what unix are you using? I'm using Solaris 9.

If your code is different from mine, you can post it and I'll take a look at it.


Report •

#5
November 17, 2008 at 23:00:07
I think you're right. I got that issue when I ran the script against a small test file. When I modified the script to process a bigger csv file with more fields, I didn't have the issue. So I'm not quite sure why the output behaves like that.
Anyway, for now thank you so much for your help! :)

Report •

#6
February 13, 2009 at 09:45:11

OK. I could not resist the temptation to write a script for Windows - I am sure that there are users who want to do similar things with CSV files in Windows also.

Assuming that the input CSV (Comma-Separated-Values) file is X.csv, and the output is to go to Y.csv, the following script will combine duplicate records in the input CSV file. (For biterscripting, go to http://www.biterscriptng.com . I think is free.)

Sen

# SATRT OF SCRIPT
# Read in the entire file
var str data, records, new_records, record, new_record, key ; cat X.csv > $records

# Process records one by one
while ($records <> "")
do
lex -e "1" $records > $record

# Get unique key and data
stex -p "]^,^2" $record > $key # The first field is key.
stex -p "[^,^2" $record > $data # The second comma onward is data.

# Check if this key exists in new_data.
var str key_search ; set $key_search = "^"+$key+"^"
if ( {sen $key_search $new_records} == 0)
# This key does not exist in new_data. Just add this entire row.
set $new_records = $new_records + $record + "\n"
else
do
# This key already exists in new_data. Just append data to that row.
var str key_arg ; set $key_arg = "["+$key_search
stex $key_arg $new_records > $new_record
lap "1" $data $new_record >null
set $new_records = $new_records + $new_record +"\n"

done
endif
done

# Write new csv to Y.csv
echo $new_records > Y.csv


Report •


Ask Question