Compare first column of 2 files

February 17, 2010 at 05:33:00
Specs: Cygwin-X
I am struggling with writing a script that compares the first column of 2 files.

file 1 file 2
0 a 20 j
10 b 30 h
20 c 40 i
30 d

Desired result is:
20 c j
30 d h

I have tried several things including join (but it seems that the files must start on the same value of the 1st column)
Tried a while
temp1 = awk '{print $1}' | head -1 file1
temp2 = awk '{print $1}' | head -1 file2

While [ $temp1 == $temp2 ]
awk {print $0}

Can't seem to get anywhere with it...


See More: Compare first column of 2 files

Report •

#1
February 17, 2010 at 08:08:19
Assuming file1:

a
10 b
20 c
30 d

file2:

0 j
30 h
40 i

This brute force korn shell script running on Solaris 9 works using an embedded while loop. Sorry, as I am not a Cygwin user, I can not vouch for that platform:

#!/bin/ksh

# NO ERROR CHECKING
while read o1 o2
do
   while read i1 i2
   do
      if [ "$o1" == "$i1" ]
      then
         echo "${o1} ${o2} ${i2}"
      fi
   done < file2
done < file1



Report •

#2
February 17, 2010 at 09:59:24
It is not clear to me what the o1, o2, i1, i2 represent...
Let me ask this way...

I want to append columns from the 2nd file when the first colums match. The first column in my case is time, other columns are just event ID's.

file1
1 a aa aaa
2 b bb bbb
3 c cc ccc
4 d dd ddd
5 e ee eee
6 f ff fff
7 g gg ggg
8 h hh hhh
9 i ii iii
9 j jj jjj
10 k kk kkk

file 2
6 f ff fff
7 g gg ggg
8 h hh hhh
9 i ii iii
Result contains appended columns that had match on row one
file 3
6 f ff fff f ff fff
7 g gg ggg g gg ggg
8 h hh hhh h hh hhh
9 i ii iii i ii iii




Report •

#3
February 17, 2010 at 10:43:56
The outer while loop reads each line, and, by default, parses the columns by space, tab, or newline:

o1 = column1
o2 = the rest of the line.

with the inner while loop it is:

i1 = column 1
i2 = the rest of the line.

If the first two columns match, o1 equals i1, echo the pieces that fit your criteria.

Of course, your spec doesn't state what to do with two lines with the first column is used more than once such as the 9. Using your data, my script output is:

6 f ff fff f ff fff
7 g gg ggg g gg ggg
8 h hh hhh h hh hhh
9 i ii iii i ii iii
9 j jj jjj i ii iii


Report •

Related Solutions

#4
February 17, 2010 at 10:54:39
The more I think about it, you probably want to compare the second parts also using an and, &&, clause:

#!/bin/ksh

while read o1 o2
do
   while read i1 i2
   do
      if [[ "$o1" == "$i1" && "$o2" == "$i2" ]]
      then

         echo "${o1} ${o2} ${i2}"
      fi
   done < file2
done < file1


Report •

#5
February 17, 2010 at 11:27:16
I am sorry, the second 9 was a typo. The files are just time histories with each value (time) in column 1 being unique for each file. Thanks for your inputs, and I will try to see if I can repeat what you did ASAP then reply back...

Report •

#6
February 17, 2010 at 16:25:07
This was an interesting problem. Another way is an awk script which saves one of the files into an associative array, and then compares:

awk ' BEGIN {
   # build the arrray
   while ( getline < "file2" > 0 )
      {
      saveit=$1 # save the first column
      $1=""     # wipe it out
      myarr[saveit]=$0 # build the array minus the first col.
      }
}
{
if($1 in myarr)
    printf("%s %s\n", $0, myarr[$1])
} ' file1


Report •

Ask Question