Hi, in the test file below there are 2 columns: the 1st column is followed by a comma ',' and then the 2nd column, which starts with a ".
The 2nd column can hold a list of words.
I need a script that checks the following 2 things in the file:
1) The word in the first column must appear in the list of words in the right column. If it's not there, print that line and the word on the left.
2) If any word repeats itself on any other line, it should be reported.
So ideally a word can appear on only 1 line; it should be unique per line. If the same word is on more than 1 line, we need to report it.
Note: "new york", "newyork" and "new" on different lines are fine and shouldn't be reported; they are unique.
The test file is:
new york, "new York, capital, nice"
Newyork, "New york, good"
new city, "new, new city, old"
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
corporate,"corporate, new year"

The expected results are:
new york, "new York, capital, nice" --> Good
Newyork, "New york, good" --> Good
new city, "new, new city, old" --> Good
new year, "new year" --> BAD ("new year" also appears on another line further down)
dealers, "deal, DL" --> BAD ("dealers" in the left column is missing on the right side)
SEATTLE , "seattle, Wash" --> Good (case doesn't matter)
BANK, "BANK, BANC, BANCO, BCO, BANKS" --> Good
corporate,"corporate, new year" --> BAD ("new year" also appears on another line further up)

Thanks
Amrit

The 2nd line in the test file should read
Newyork, "Newyork, good"
and it's fine, as Newyork appears on the right-hand side and it's unique.
--amrit

Regarding your second requirement for words duplicated on other lines: does that apply only for words in column 2?

Yes, that word should not be present on any other line, in either column.
So a word should only ever appear on a single line.

The gsub commands are to get rid of extra spaces that would mess up the comparisons. I made an assumption that you do not need two or more spaces in a row.
Any word appearing in the same tag list more than once will be reported, whether it matches the base word or not. In the sample below, "cowboys" appears twice in the same tag list and is reported. "giants" also appears twice in the same tag list, matches the base word, and is still reported. I could suppress this if needed.
I use an array to store all the words, so there is a limit to how much data can be processed with this approach. If you have too much data for this approach, I would use awk to extract all the tag words, sort them, then check the sorted list for duplicates.
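To see what the gsub cleanup does on its own, here is a small sketch that applies the same three gsub calls to a single line and prints the result. The function name normalize_line is made up for illustration and is not part of the script; the sample line has extra spaces added on purpose.

```shell
# Hypothetical helper (the name is illustrative only): run the same three
# gsub calls the script uses and print the cleaned-up line.
normalize_line() {
    awk '{
        gsub("[ ][ ][ ]*", " ")   # collapse runs of two or more spaces
        gsub(" ,", ",")           # drop a space before a comma
        gsub(", ", ",")           # drop a space after a comma
        print
    }'
}

printf '%s\n' 'SEATTLE ,  "seattle,   Wash"' | normalize_line
# prints: SEATTLE,"seattle,Wash"
```

After this cleanup a plain split on "," yields words with no stray blanks around them, which is what the comparisons rely on.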
new york, "new York, capital, nice"
Newyork, "Newyork, good"
new city, "new, new city, old"
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
giants, "Cowboys, dealer, giants, cowboys, giants"
corporate,"corporate, new year"

awk -F, '
BEGIN {q="\042"}    # q holds a double-quote character
{
    origline=$0
    # Normalize spacing so comparisons are not thrown off by stray blanks
    gsub("[ ][ ][ ]*"," ")
    gsub(" ,",",")
    gsub(", ",",")
    baseword=toupper($1)
    # Everything after the first double quote is the tag list
    word2=toupper(substr($0,index($0,q)+1))
    gsub(q,"",word2)
    split(word2,taglist)
    matchflag=0
    errorflag=0
    # Note: for (i in ...) visits the tags in no guaranteed order
    for (i in taglist)
    {
        tagx=taglist[i]
        if (baseword == tagx)
            matchflag=1
        if (tagx in all_words)
        {
            if (errorflag == 0)
            {
                print origline
                errorflag=1
            }
            lno=all_words[tagx]
            if (lno == NR)
                whichline="this line"
            else
                whichline="line " lno
            print " ERROR in above line: tag " i " (" tagx \
                ") already encountered in " whichline
        }
        else
            all_words[tagx]=NR
    }
    if (matchflag == 0)
    {
        # Print the line only if it was not already printed above
        if (errorflag == 0)
            print origline
        print " ERROR in above line: Base word not matched in tag list"
    }
    all_words[baseword]=NR
}' file.in
dealers, "deal, DL"
ERROR in above line: Base word not matched in tag list
giants, "Cowboys, Patriots, dealer, dealers, giants, cowboys, miami dolphins, giants"
ERROR in above line: tag 4 (DEALERS) already encountered in line 5
ERROR in above line: tag 8 (GIANTS) already encountered in this line
ERROR in above line: tag 1 (COWBOYS) already encountered in this line
corporate,"corporate, new year"
ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4

Oops, I grabbed the wrong output for the above posting.
Following is the corrected output:
dealers, "deal, DL"
ERROR in above line: Base word not matched in tag list
giants, "Cowboys, dealer, giants, cowboys, giants"
ERROR in above line: tag 1 (COWBOYS) already encountered in this line
ERROR in above line: tag 3 (GIANTS) already encountered in this line
corporate,"corporate, new year"
ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4
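
For data too large to keep in an awk array, the sort-based fallback mentioned earlier could look roughly like the sketch below. It is only an outline under some assumptions: the function name find_dup_tags and the wording of the report are invented here, it looks at tag words only (column 2), and it does not do the base word check from the main script.

```shell
# Hypothetical helper: list tag words that occur more than once in a file.
# Step 1 (awk):  emit one "WORD<TAB>line number" record per tag word.
# Step 2 (sort): bring identical words next to each other.
# Step 3 (awk):  report any word that equals the word on the previous record.
find_dup_tags() {
    awk -F, '
    BEGIN { q = "\042" }
    {
        # Same cleanup as the main script
        gsub("[ ][ ][ ]*", " "); gsub(" ,", ","); gsub(", ", ",")
        tags = toupper(substr($0, index($0, q) + 1))
        gsub(q, "", tags)
        n = split(tags, taglist)
        for (i = 1; i <= n; i++)
            print taglist[i] "\t" NR
    }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F'\t' '
    NR > 1 && $1 == prev { print $1 " appears on lines " prevline " and " $2 }
    { prev = $1; prevline = $2 }'
}
```

Run as find_dup_tags file.in against the sample data above, this should report COWBOYS and GIANTS (each repeated on line 8, so the same line number shows twice) and NEW YEAR (lines 4 and 9). Only one word has to be held in memory at a time, since sort does the heavy lifting on disk.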
