Tricky Awk/Gawk Programming

June 25, 2010 at 12:57:30
Specs: Windows XP
Hello, I am at a new job that requires some programming in awk/gawk, and I have never done any computer programming before, so I am having a bit of trouble.
I need to use awk to sort and select certain lines from massive text files so I can organize the data much more easily.
I have simplified them here, but the lines I need from the file typically look like this...

//
ID : 1BC1
Compound : annexin v
Compound : anticoagulant protein
Chain : A
Sequence : AQKEQLG
Cryst-Cont: _+____+_+|
//
ID : 1BC2
Compound : metallo-beta-lactamase ii
Chain: A
Sequence : TRQEVL
Cryst-Cont : ___++__|
Chain: B
Sequence : RGQKTV
Cryst-Cont : __+++__|
//
ID : 1BC3 . . . and the same type of pattern repeats

So I have managed to write a program that finds all the lines that start with ID, Compound, Chain, Sequence, or Cryst-Cont, and prints the third field, $3, for each of them in order.

So the output for the above bit of info would be:

1BC1
annexin v
anticoagulant protein
A
AQKEQLG
_+____+_+|
1BC2
metallo-beta-lactamase ii
A
TRQEVL
___++__|
B
RGQKTV
__+++__|
1BC3

Now this is close to what I need, but not exact.
For the Compound, Chain, Sequence, and Cryst-Cont lines I only care about the first entry, so I want my output to look like this...

1BC1
annexin v
A
AQKEQLG
_+____+_+|
1BC2
metallo-beta-lactamase ii
A
TRQEVL
___++__|
1BC3

I don't care about the second Compound line or the other repeated lines, just the first of each category. And these repeats are often in random order, not neatly sequential like in this example.
So basically I need my output to consist of the first line of each category, ignoring the rest until I get to another ID line.

I know this is a very long question, and I most likely haven't explained it very well, but any help would be great.
Thanks.




#1
June 27, 2010 at 19:34:56
Although awk can be told (via the FS and RS system variables) to process multiple lines at a time, I think it would be awkward (ha ha) to handle an indefinite number of lines, but you might be able to set flag variables to record that you've already "seen" a certain kind of line. Skimming through my O'Reilly "sed & awk" book, it appears that awk's language is general enough to do this. Maybe you would read whole groups (RS = "//", FS = "\n") and treat each line as a field $n. Assuming you can then split each line up further (I don't know if $n gets reset!), you can test the type (keyword) of each line, and if that keyword hasn't been "seen" yet, output the value (the third field) in the line.

Frankly, doing this in awk looks pretty hairy (but it does look possible). Still, you might be better off using something like Perl to process the file line by line (using awk to break up each line into keyword and value). Test the keyword's flag, and if it hasn't been seen before, output the value and set the flag for that keyword. At the group separator (//), reset all the flags.
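
Just to illustrate the record-per-block idea, something like this rough, untested sketch might be a starting point (I'm not sure how your awk handles a multi-character RS, so treat it as a guess):

awk 'BEGIN { RS = "//"; FS = "\n" }
     { print "block " NR " has " NF " lines; the second line is: " $2 }' data.file

Each // block then arrives as one record, and each line of the block is a separate field ($1, $2, and so on) that you can test individually.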



#2
June 27, 2010 at 23:06:18
This is a case of setting up a series of flags in awk to make sure each field only prints once per block. When the block changes, set all the flags back to zero.

The tricky part was the field values. Classic awk only allows a single character for the field separator, FS. Note the use of gsub to remove the offending spaces.

Also note that I am using nawk on Solaris. You'll probably have to use awk.

#!/bin/ksh

nawk ' BEGIN { FS=":" ; comp_flg=0; chain_flg=0; seq_flg=0; cryst_flg=0; }
{
# get rid of all spaces in field 1
gsub(" ","",$1)
# get rid of leading spaces in field 2
gsub("^[ ]*","",$2)
if($1 == "ID" )
   {
   print $2
   comp_flg=0
   chain_flg=0
   seq_flg=0
   cryst_flg=0
   continue
   }
if($1 == "Compound" && comp_flg == 0)
   {
   print $2
   comp_flg=1
   continue
   }
if($1 == "Chain" && chain_flg == 0)
   {
   print $2
   chain_flg=1
   continue
   }
if($1 == "Cryst-Cont" && cryst_flg == 0)
   {
   print $2
   cryst_flg=1
   continue
   }

if($1 == "Sequence" && seq_flg == 0)
   {
   print $2
   seq_flg=1
   continue
   }

} ' datafile.txt



#3
June 28, 2010 at 07:26:36
Thanks for the help so far.
I am using awk on Windows, so I'm sure there is a bit of difference in syntax and the like.
But the program I have put together at the moment is the following:

awk " BEGIN { FS = \":\" ; cmpd_flg = 0; chain_flg = 0; seq_flg = 0; crystal_flag = 0} { gsub(\" \",\"\",$1) ;
gsub(\"^[ ]*\",\"\",$2) ; if( $1 == \"ID\") ;
{print $2; cmpd_flg=0; chain_flg=0; seq_flg=0; crystal_flg=0; continue} ; if( $1 == \"Compound\" && cmpd_flg == 0) ;
{print $2; cmpd_flg=1; continue} ;
if( $1 == \"Chain\" && chain_flg == 0) ;
{print $2; chain_flg=1; continue} ;
if($1 == \"Sequence\" && seq_flg == 0) ;
{print $2; seq_flg=1; continue} ;
if( $1 == \"Cryst-Cont\" && crystal_flg == 0) ;
{print $2; crystal_flg=1; continue} }" data.file

Now the only error message I'm getting is:

"fatal: `continue' outside a loop is not allowed"

I can't seem to figure out what is wrong with the continue statement . . .



#4
June 28, 2010 at 08:04:41
If I substitute next for each continue, however, I do get output, but it just prints the second field of every line, not just the select lines that I need.


#5
June 28, 2010 at 08:58:40
I think Perl would be better/easier.

#!/usr/bin/perl

use strict;
use warnings;
use v5.10;

$/ = "//";

RECORD:
while ( <DATA> ) {
    chomp;
    s/^\n//;
    my @rows = split /\n/;
    foreach my $row ( @rows ) {
        my $field = (split(/\s?:\s?/, $row))[1];
        say $field;
        next RECORD if $row =~ /Cryst-Cont/;
    }
}

__DATA__
//
ID : 1BC1
Compound : annexin v
Compound : anticoagulant protein
Chain : A
Sequence : AQKEQLG
Cryst-Cont: _+____+_+|
//
ID : 1BC2
Compound : metallo-beta-lactamase ii
Chain: A
Sequence : TRQEVL
Cryst-Cont : ___++__|
Chain: B
Sequence : RGQKTV
Cryst-Cont : __+++__|
//

*********
Test Output:

D:\perl>test.pl
1BC1
annexin v
anticoagulant protein
A
AQKEQLG
_+____+_+|
1BC2
metallo-beta-lactamase ii
A
TRQEVL
___++__|



#6
June 28, 2010 at 11:08:06
I'm sure Perl is easier, but I would rather use awk if possible, since it is the only language I have sat down and tried to learn.


#7
June 28, 2010 at 12:28:05
J.Castiglia:

I cannot help you much with Windows awk. Actually, the continue/next really isn't required; I just didn't want to keep going through the series of if statements once a match was found.

I don't understand why you are escaping each double quote (\"). Is this a requirement of the awk version you are using? It certainly isn't required in any Unix/Linux version of awk that I have seen.

How does your awk version handle strings? Maybe it uses something else like single quotes.
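
One way to sidestep the quoting problem entirely is to put the awk program in its own file and point awk at it with the -f option, so the command line never contains a quote that needs escaping. Roughly (the file names here are just placeholders):

rem myprog.awk holds the program with ordinary, unescaped double quotes
awk -f myprog.awk data.file > output.txt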

Sorry I cannot help more.



#8
June 28, 2010 at 12:39:40
Yeah, the version of awk I am using is a bit odd compared to others, from what I can tell. It doesn't like single quotes for some reason, and I need to escape any double quote that isn't one of the two enclosing the entire program.
But thank you for your help anyway.
I will keep trying, and maybe try perl if need be.


#9
June 28, 2010 at 17:43:39
If you decide to use Perl, here's the updated version that requires you to pass the data filename as a parameter.

#!/usr/bin/perl

use strict;
use warnings;
use v5.10;

@ARGV or die "usage: $0 <filename>\n";
open my $fh, '<', $ARGV[0] or die "failed to open '$ARGV[0]' $!";

$/ = "//";

RECORD:
while ( <$fh> ) {
    chomp;
    s/^\n//;
    my @rows = split /\n/;
    foreach my $row ( @rows ) {
        my $field = (split(/\s?:\s?/, $row))[1];
        say $field;
        next RECORD if $row =~ /Cryst-Cont/;
    }
}
close $fh;



#10
June 30, 2010 at 06:41:15
So I have decided to continue with awk since I can't get Perl to work properly on my computer.
I have gone back and taken the suggestion of response #1 and have defined RS as '//' and FS as '\n'.

Now this allows me to print the ID and Compound columns very nicely, but it makes other things very tricky.

By changing the record and field separators I am not sure how to implement search patterns anymore. If I want to search for a line with the string 'Sequence' and print the whole line, that doesn't really work anymore, since that line in its entirety is now one field. And if I print $0 for anything, it will of course print the entire record (which isn't just one line), which is not what I want.

Can I somehow say "if a field contains 'this', print that field," or something similar? With the default RS and FS that would seem redundant, but not in this case.



#11
June 30, 2010 at 07:08:46
Classic awk contains an internal match function:

match(s,r)

tests whether s contains a substring matched by r. It returns the index or 0 if the substring is not found.

if (match($1, "USERNAME") > 0)
    printf("%s ", $1)

In the above example, if field 1 contains the string USERNAME, field 1 is printed.

Maybe your version of awk supports it??




#12
June 30, 2010 at 08:22:27
But what if I don't know which field the string 'USERNAME', for example, should be in?
My file now consists of blocks of text as records, with, on average, 50 lines/fields within each record.
So out of those 50 lines, the "Sequence" line could be line 18, or 32, or whatever. So the match function as shown wouldn't really work, because I'm not just testing whether $1 contains it.
So basically I need to search each of my records for the line (which is itself a field) that contains the string 'Sequence', and print that line/field.

I realize this is getting more complicated than necessary because I'm using awk, but for someone who has never programmed before, awk has the simplest language and syntax to understand, for me at least.
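
What I am picturing is something along these lines, though I am only guessing at the syntax and have no idea if awk will actually accept it:

for (i = 1; i <= NF; i++)
    if (match($i, "Sequence") > 0)
        print $i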



#13
July 1, 2010 at 10:47:26
If you go back to #3, it's complaining because "continue" is only valid inside a loop, and all you have is a series of "if" statements. You can use "else if" to avoid testing repeatedly; awk doesn't seem to have a multiway "switch" or "case" statement. Finally, you've got too many semicolons -- you don't need them after } and you don't want them right after the if (...) expression. If you use ' around the outer awk command, you won't have to escape all the inner " characters.

The BEGIN initialization is done first.

Then awk loops through the input file, reading in one record at a time per the RS setting (your record is multiline). Each record is broken up into fields per the FS setting (your fields are each one line).

Within this implied loop, $1 is field (line) 1, $2 is field (line) 2, and so on. At this point you can break up each field (line) using some sort of pattern matching (gsub?) and test the keyword ("ID", etc.). You might use an explicit loop to go through the fields and process them, branching (if) on the keyword; see the sketch below.

The END procedure is processed last.
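
Putting that together, the skeleton might look something like the following. This is untested and only a sketch of the structure (the keyword list and data.file name are taken from your examples; extract.awk is just a placeholder). Keep it in its own file and run it as awk -f extract.awk data.file so you don't have to escape any quotes on the command line:

# extract.awk -- untested sketch: one record per // block, one line per field
BEGIN { RS = "//"; FS = "\n" }
{
    # forget which keywords have been seen at the start of every block
    split("", seen)
    for (i = 1; i <= NF; i++) {
        line = $i
        if (index(line, ":") == 0)
            continue                              # skip blank or odd lines
        key = substr(line, 1, index(line, ":") - 1)
        val = substr(line, index(line, ":") + 1)
        gsub(/[ \t]/, "", key)                    # "Cryst-Cont " -> "Cryst-Cont"
        sub(/^[ \t]+/, "", val)                   # drop leading blanks from the value
        if (key !~ /^(ID|Compound|Chain|Sequence|Cryst-Cont)$/)
            continue                              # ignore keywords you don't care about
        if (!(key in seen)) {                     # only the first line of each kind
            print val
            seen[key] = 1
        }
    }
}

Note that the two continue statements here are legal because they sit inside the for loop.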

I would really strongly suggest that if you want to use awk, you get the O'Reilly book "sed & awk". It's full of examples and is quite easy to read. Any well-stocked Barnes and Noble in the area should have it in the Computer section.



#14
July 1, 2010 at 11:16:56
Holy crap I got it to work.
Thank you everyone for your help.
I would have probably spent years on this before getting it by myself.


#15
July 2, 2010 at 17:08:38
Just for the benefit of future generations, could you share with us what you came up with? Thanks!
