Name: amritpalsingh
Date: January 11, 2008 at 09:10:45 Pacific
Subject: Find duplicate words in a file
OS: SunOS
CPU/Ram: n/a
Model/Manufacturer: SUNW,Sun-Fire-480R
Comment:
Hi, the test file below has two columns. The first column is followed by a comma (','), and the second column starts with a double quote (") and can hold a list of words. I need a script that checks the following two things in the file:
1) The word in the first column must appear in the list of words in the second column. If it is not there, print that line and the word on the left.
2) If any word in a list repeats on any other line, it should be reported. Ideally a word appears on only one line; if the same word shows up on more than one line, we need to report it.
Note: "new york", "newyork", and "new" on different lines are fine and shouldn't be reported; they are unique.
The test file is:
new york, "new York, capital, nice"
Newyork, "New york, good"
new city, "new, new city, old'
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
corporate,"corporate, new year"

The expected result for each line:

new york, "new York, capital, nice" --> Good
Newyork, "New york, good" --> Good
new city, "new, new city, old' --> Good
new year, "new year" --> BAD ("new year" also appears on another line further down)
dealers, "deal, DL" --> BAD ("dealers" in the left column is missing from the right side)
SEATTLE , "seattle, Wash" --> Good (case doesn't matter)
BANK, "BANK, BANC, BANCO, BCO, BANKS" --> Good
corporate,"corporate, new year" --> BAD ("new year" also appears on another line further up)
The gsub commands are to get rid of extra spaces that would mess up the comparisons. I made an assumption that you do not need two or more spaces in a row.
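To see what those three gsub calls do in isolation, here is a minimal sketch. The function name normalize and the sample line are made up for illustration; only the gsub calls themselves come from the script.

```shell
# normalize: hypothetical helper name; it wraps the three gsub calls from the script.
normalize() {
    awk '{
        gsub("[ ][ ][ ]*", " ")   # collapse a run of two or more spaces into one
        gsub(" ,", ",")           # remove a space in front of a comma
        gsub(", ", ",")           # remove a space following a comma
        print
    }'
}

# Made-up input with irregular spacing
printf '%s\n' 'SEATTLE   ,  "seattle,   Wash"' | normalize
# prints: SEATTLE,"seattle,Wash"
```

A single space inside a word is left alone, so "new york" and "newyork" still compare as different words.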
Any word that appears more than once in the same tag list is reported, whether or not it matches the base word. In the test file below, cowboys appears twice in the same tag list and is reported; giants appears twice in the same tag list, also matches the base word, and is reported too. I could suppress this, though.
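If the same-line reports are not wanted, one possible way to suppress them is to report a tag only when the line it was first seen on differs from the current line. This is a rough sketch under that assumption, not the posted script: the names report_cross_line_dups and first are hypothetical, the message wording is made up, and it leaves out the base-word check and does not record base words at all.

```shell
# report_cross_line_dups FILE: hypothetical helper, duplicate check only.
report_cross_line_dups() {
    awk '{
        # Take everything after the first double quote as the tag list
        tags = toupper(substr($0, index($0, "\"") + 1))
        gsub(/"/, "", tags)
        n = split(tags, t, ",")
        for (i = 1; i <= n; i++) {
            tag = t[i]
            gsub(/^ +| +$/, "", tag)            # trim spaces around the tag
            if ((tag in first) && first[tag] != NR)
                print "line " NR ": " tag " already seen on line " first[tag]
            else if (!(tag in first))
                first[tag] = NR                  # remember the first line only
        }
    }' "$1"
}

# Usage: report_cross_line_dups file.in
```

With this variant a tag repeated inside one list (cowboys, giants) stays silent, while a tag that shows up again on a later line is still reported.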
I use an array to store all the words, so there is a limit to how much data can be processed with this approach. If you have too much data for this approach, I would use awk to extract all the tag words, sort them, then check the sorted list for duplicates.
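The sort-based alternative could look roughly like the sketch below. It is an illustration only: the name list_dup_tags is hypothetical, and it covers just the duplicate check, so it lists each repeated tag once without line numbers and does not look at the base words.

```shell
# list_dup_tags FILE: hypothetical helper that prints every tag occurring more than once.
list_dup_tags() {
    awk '{
        q = index($0, "\"")
        if (q == 0) next                        # skip lines without a tag list
        tags = toupper(substr($0, q + 1))
        gsub(/"/, "", tags)
        n = split(tags, t, ",")
        for (i = 1; i <= n; i++) {
            gsub(/^ +| +$/, "", t[i])           # trim spaces around the tag
            gsub(/  +/, " ", t[i])              # collapse repeated spaces inside it
            if (t[i] != "") print t[i]          # one tag per output line
        }
    }' "$1" | LC_ALL=C sort | uniq -d           # uniq -d keeps only repeated lines
}

# Usage: list_dup_tags file.in
```

Here awk holds only one input line at a time and sort does the heavy lifting, so the size of the awk array no longer matters. The tradeoff is that a tag repeated within one line is listed the same way as a tag repeated across lines.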
new york, "new York, capital, nice"
Newyork, "Newyork, good"
new city, "new, new city, old"
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
giants, "Cowboys, dealer, giants, cowboys, giants"
corporate,"corporate, new year"
awk -F, '
BEGIN {q="\042"}                      # q holds a double quote character
{ origline=$0
  # Normalize spacing so that stray blanks do not break the comparisons
  gsub("[ ][ ][ ]*"," ")
  gsub(" ,",",")
  gsub(", ",",")
  baseword=toupper($1)
  # Everything after the first double quote is the tag list
  word2=toupper(substr($0,index($0,q)+1))
  gsub(q,"",word2)
  split(word2,taglist)
  matchflag=0
  errorflag=0
  for (i in taglist)                  # for-in visits the tags in no guaranteed order
    {tagx=taglist[i]
     if (baseword == tagx)
        matchflag=1
     if (tagx in all_words)
        {if (errorflag == 0)
            {print origline
             errorflag=1}
         lno=all_words[tagx]
         if (lno == NR)
            whichline="this line"
         else
            whichline="line " lno
         print " ERROR in above line: tag " i " (" tagx \
               ") already encountered in " whichline
        }
     else
        all_words[tagx]=NR            # remember the line where the tag was first seen
    }
  if (matchflag == 0)
     {print origline
      print " ERROR in above line: Base word not matched in tag list"
     }
  all_words[baseword]=NR              # the base word also counts as seen
}' file.in
dealers, "deal, DL"
 ERROR in above line: Base word not matched in tag list
giants, "Cowboys, Patriots, dealer, dealers, giants, cowboys, miami dolphins, giants"
 ERROR in above line: tag 4 (DEALERS) already encountered in line 5
 ERROR in above line: tag 8 (GIANTS) already encountered in this line
 ERROR in above line: tag 1 (COWBOYS) already encountered in this line
corporate,"corporate, new year"
 ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4
Oops, I grabbed the wrong output for the above posting. Following is the corrected output:
dealers, "deal, DL"
 ERROR in above line: Base word not matched in tag list
giants, "Cowboys, dealer, giants, cowboys, giants"
 ERROR in above line: tag 1 (COWBOYS) already encountered in this line
 ERROR in above line: tag 3 (GIANTS) already encountered in this line
corporate,"corporate, new year"
 ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4