Find duplicate words in a file

Name: amritpalsingh
Date: January 11, 2008 at 09:10:45 Pacific
OS: SunOS
CPU/Ram: n/a
Product: SUNW,Sun-Fire-480R
Comment:


Hi, the test file below has 2 columns: the 1st column is followed by a comma (,), and the 2nd column starts with a double quote (").
The 2nd column can hold a list of words.
I need a script that will check the following 2 things in the file:

1) The word in the first column must appear in the list of words in the right column. If it's not there, print that line and the word on the left.
2) If any word repeats on any other line, it should be reported.
So ideally a word appears on only 1 line. If the same word shows up on more than 1 line, we need to report it.

Note: "new york", "newyork" and "new" on different lines are fine and shouldn't be reported; they are unique.

The test file is:

new york, "new York, capital, nice"
Newyork, "New york, good"
new city, "new, new city, old'
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
corporate,"corporate, new year"

new york, "new York, capital, nice" --> Good
Newyork, "New york, good" --> Good
new city, "new, new city, old' --> Good
new year, "new year" --> BAD ("new year" also appears on a later line)
dealers, "deal, DL" --> BAD ("dealers" in the left column is missing from the right side)
SEATTLE , "seattle, Wash" --> Good (case doesn't matter)
BANK, "BANK, BANC, BANCO, BCO, BANKS" --> Good
corporate,"corporate, new year" --> BAD ("new year" also appears on an earlier line)

Thanks
Amrit




Response Number 1
Name: amritpalsingh
Date: January 11, 2008 at 09:15:08 Pacific
Reply:

Correction: the 2nd line in the test file should be
Newyork, "Newyork, good"
and it's fine, since Newyork appears on the right-hand side and is unique.

--amrit


Response Number 2
Name: James Boothe
Date: January 11, 2008 at 15:16:59 Pacific
Reply:

Regarding your second requirement for words duplicated on other lines: does that apply only for words in column 2?


Response Number 3
Name: amritpalsingh
Date: January 11, 2008 at 16:10:44 Pacific
Reply:

Yes, the word should not be present on any other line, in either column.

So the word should only appear on the same line.


Response Number 4
Name: James Boothe
Date: January 15, 2008 at 08:53:06 Pacific
Reply:

The gsub commands are to get rid of extra spaces that would mess up the comparisons.  I made an assumption that you do not need two or more spaces in a row.
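
As a rough illustration of that cleanup step, the sketch below applies the same three gsub calls to one line on its own. The function name normalize is made up for this example and is not part of the script further down.

```shell
# Hypothetical helper: run the same three gsub calls over each input line.
normalize() {
  awk '{
    gsub("[ ][ ][ ]*", " ")   # squeeze runs of two or more spaces into one
    gsub(" ,", ",")           # drop a space that precedes a comma
    gsub(", ", ",")           # drop a space that follows a comma
    print
  }'
}

printf '%s\n' 'SEATTLE ,   "seattle,  Wash"' | normalize
# prints: SEATTLE,"seattle,Wash"
```

After this step every word is delimited by bare commas, so splitting on "," yields tags with no leading or trailing blanks.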

Any word appearing in the same tag list more than once will be reported, whether it matches the base word or not.  In the sample data below, cowboys appears twice in the same tag list and is reported.  giants also appears twice in the same tag list, matches the base word, and is reported as well.  I could suppress this if you prefer.

I use an array to store all the words, so there is a limit to how much data can be processed with this approach.  If you have too much data for this approach, I would use awk to extract all the tag words, sort them, then check the sorted list for duplicates.
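
For reference, the sort-based alternative might look roughly like the sketch below. It is untested against large data and rests on a few assumptions: the function name find_cross_line_dups is made up, a tab is used as the separator between word and line number (so words must not contain tabs), and only words repeated across different lines are reported, not repeats within one line.

```shell
# Hypothetical sketch: emit "WORD<TAB>line number" pairs, sort them,
# then report any word that shows up on more than one line.
find_cross_line_dups() {
  awk -F, '
  BEGIN { q = "\042" }                 # q holds a double-quote character
  {
    gsub("[ ][ ][ ]*", " "); gsub(" ,", ","); gsub(", ", ",")
    line = toupper($0)                 # compare case-insensitively
    gsub(q, "", line)                  # strip the quotes around column 2
    n = split(line, w)                 # base word plus all tags
    for (i = 1; i <= n; i++)
      if (w[i] != "") print w[i] "\t" NR
  }' "$1" |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -u |
  awk -F'\t' '
  $1 == prev { print $1 " appears on lines " prevline " and " $2 }
  { prev = $1; prevline = $2 }'
}
```

It would be called as find_cross_line_dups file.in. Because sort -u drops identical word/line pairs, a word repeated within one line is not flagged here, and memory use is bounded by sort rather than by an awk array.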


new york, "new York, capital, nice"
Newyork, "Newyork, good"
new city, "new, new city, old"
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
giants, "Cowboys, dealer, giants, cowboys, giants"
corporate,"corporate, new year"

awk -F, '\
BEGIN {q="\042"}   # q holds a double-quote character
{
origline=$0
# Normalize spacing so stray blanks do not break the comparisons
gsub("[ ][ ][ ]*"," ")
gsub(" ,",",")
gsub(", ",",")
# Upper-case both columns so the comparison ignores case
baseword=toupper($1)
word2=toupper(substr($0,index($0,q)+1))
gsub(q,"",word2)
split(word2,taglist)
matchflag=0
errorflag=0
# Note: for (i in ...) visits the tags in no guaranteed order
for (i in taglist)
   {tagx=taglist[i]
    if (baseword == tagx)
       matchflag=1
    if (tagx in all_words)
       {if (errorflag == 0)
          {print origline
           errorflag=1}
        lno=all_words[tagx]
        if (lno == NR)
           whichline="this line"
        else
           whichline="line " lno
        print "   ERROR in above line: tag " i " (" tagx \
          ") already encountered in " whichline
       }
    else
       all_words[tagx]=NR
   }
if (matchflag == 0)
   {print origline
    print "   ERROR in above line: Base word not matched in tag list"
   }
# Record the base word too, so later lines that reuse it are flagged
all_words[baseword]=NR
}' file.in


dealers, "deal, DL"
   ERROR in above line: Base word not matched in tag list
giants, "Cowboys, Patriots, dealer, dealers, giants, cowboys, miami dolphins, giants"
   ERROR in above line: tag 4 (DEALERS) already encountered in line 5
   ERROR in above line: tag 8 (GIANTS) already encountered in this line
   ERROR in above line: tag 1 (COWBOYS) already encountered in this line
corporate,"corporate, new year"
   ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4


Response Number 5
Name: James Boothe
Date: January 15, 2008 at 10:26:15 Pacific
Reply:

Oops, I grabbed the wrong output for the above posting.
Following is the corrected output:

dealers, "deal, DL"
   ERROR in above line: Base word not matched in tag list
giants, "Cowboys, dealer, giants, cowboys, giants"
   ERROR in above line: tag 1 (COWBOYS) already encountered in this line
   ERROR in above line: tag 3 (GIANTS) already encountered in this line
corporate,"corporate, new year"
   ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4
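
A note on the tag numbers in this output: the script walks the tag list with for (i in taglist), and awk does not guarantee the order of that loop, so which occurrence of a repeated word gets reported can vary between awk versions. The sketch below shows one way to make that deterministic, assuming the count returned by split() is used to walk the tags left to right. The function name report_same_line_dups is made up, and it covers only the same-line check, not the cross-line or base-word checks.

```shell
# Hypothetical sketch: report repeated tags within one line, always
# flagging the later occurrence.
report_same_line_dups() {
  awk -F, '
  BEGIN { q = "\042" }                 # q holds a double-quote character
  {
    gsub("[ ][ ][ ]*", " "); gsub(" ,", ","); gsub(", ", ",")
    tags = toupper(substr($0, index($0, q) + 1))
    gsub(q, "", tags)
    n = split(tags, taglist)           # n is the number of tags
    split("", seen)                    # portable way to empty the array
    for (i = 1; i <= n; i++) {         # left to right, in a fixed order
      t = taglist[i]
      if (t in seen)
        print "line " NR ": tag " i " (" t ") repeats tag " seen[t]
      else
        seen[t] = i
    }
  }' "$1"
}
```

Run against the sample data above as report_same_line_dups file.in, this flags tag 4 (COWBOYS) and tag 5 (GIANTS) on line 8 rather than tags 1 and 3.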

