Hello, I am at a new job that requires some programming in awk/gawk. I have never done any computer programming before, so I am having a bit of trouble.
I need to use awk to sort and select certain lines of massive text files so I can organize the data much more easily.
I have simplified them, but the lines I need from the file are typically listed like this...
ID : 1BC1
Compound : annexin v
Compound : anticoagulant protein
Chain : A
Sequence : AQKEQLG
ID : 1BC2
Compound : metallo-beta-lactamase ii
Sequence : TRQEVL
Cryst-Cont : ___++__|
Sequence : RGQKTV
Cryst-Cont : __+++__|
ID : 1BC3 . . . and the same type of pattern repeats
So I have managed to write a program that finds all the lines that start with ID, Compound, Chain, Sequence, or Cryst-Cont and prints the third field, $3, of each in order.
So the output for the above bit of info would be:
Now this is close to what I need but not exact.
For the Compound, Chain, Sequence, and Cryst-Cont lines I only care about the first entry. So I want my output to look like this. . .
I don't care about the second Compound line or the other Sequence lines, just the first of each. These repeats are often in random order, not just sequential as in this example.
So basically I need my output to consist of the first line of each category, and ignore the rest until I get to another ID line.
I know this is a very long question, and I most likely haven't explained it very well, but any help would be great.
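For what it's worth, here is a minimal sketch of one way the "first line of each category per ID" rule might be written in awk. It assumes the lines look exactly like the sample above (label, then a colon, then the value, all separated by spaces), and the wrapper name `first_per_id` is only a made-up helper for illustration. The idea is to keep a small table of labels already printed and to empty that table every time a new ID line comes along.

```shell
# first_per_id is a hypothetical helper name, not part of awk itself.
# It reads the files given as arguments, or standard input if there are none.
first_per_id() {
  awk '
    # An ID line starts a new record: empty the table of labels
    # already printed, print the ID, and move on to the next line.
    $1 == "ID" { split("", seen); print $3; next }

    # For the other categories, print $3 only the first time the label
    # shows up since the last ID line. seen[$1]++ is 0 (false) the first
    # time, so !seen[$1]++ is true once and false afterwards.
    ($1 == "Compound" || $1 == "Chain" || $1 == "Sequence" || $1 == "Cryst-Cont") && !seen[$1]++ { print $3 }
  ' "$@"
}

# Example call (data.txt is a placeholder file name):
#   first_per_id data.txt
```

Because the table is keyed on the label itself, it should not matter what order the repeated lines come in within an ID block. Note that $3 is only the first word after the colon, so "annexin v" would print as "annexin"; printing the whole value would need a different field separator or some string trimming.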