Find duplicate words in a file

Original Message
Name: amritpalsingh
Date: January 11, 2008 at 09:10:45 Pacific
Subject: Find duplicate words in a file
OS: SunOS
CPU/Ram: n/a
Model/Manufacturer: SUNW,Sun-Fire-480R
Comment:

Hi, in the test file below there are 2 columns: the 1st column is followed by a comma ',' and then the 2nd column, which starts with a ".
The 2nd column can contain a list of words.
I need a script that will check the following 2 things in the file:

1) The word in the first column must appear in the list of words in the right column. If it's not there, print that line and the word on the left.
2) If any word repeats on any other line, it should be reported.
So ideally a word can appear on only 1 line. If the same word appears on more than 1 line, we need to report it.

Note: "new york", "newyork", "new" on different lines are fine and shouldn't be reported; they are unique.

The test file is:

new york, "new York, capital, nice"
Newyork, "New york, good"
new city, "new, new city, old"
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
corporate,"corporate, new year"

new york, "new York, capital, nice" --> Good
Newyork, "New york, good" --> Good
new city, "new, new city, old" --> Good
new year, "new year" --> BAD ("new year" also appears on another line further down)
dealers, "deal, DL" --> BAD ("dealers" in the left column is missing from the right side)
SEATTLE , "seattle, Wash" --> Good (case doesn't matter)
BANK, "BANK, BANC, BANCO, BCO, BANKS" --> Good
corporate,"corporate, new year" --> BAD ("new year" also appears on another line further up)

Thanks
Amrit




Response Number 1
Name: amritpalsingh
Date: January 11, 2008 at 09:15:08 Pacific
Subject: Find duplicate words in a file
Reply:
The 2nd line in the test file should be
Newyork, "Newyork, good"
which is fine, as Newyork appears on the right-hand side and is unique.

--amrit



Response Number 2
Name: James Boothe
Date: January 11, 2008 at 15:16:59 Pacific
Subject: Find duplicate words in a file
Reply:
Regarding your second requirement for words duplicated on other lines: does that apply only for words in column 2?


Response Number 3
Name: amritpalsingh
Date: January 11, 2008 at 16:10:44 Pacific
Subject: Find duplicate words in a file
Reply:
Yes, that word should not be present on any other line, in either column.

So a word should only ever appear on a single line.



Response Number 4
Name: James Boothe
Date: January 15, 2008 at 08:53:06 Pacific
Subject: Find duplicate words in a file
Reply:
In the awk script below, the gsub commands get rid of extra spaces that would otherwise break the comparisons. I assumed that you never need two or more spaces in a row.
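
To see what those three gsub calls do to a single line, here is a minimal sketch. The function name normalize is made up for illustration, and it assumes the only whitespace involved is ordinary spaces (no tabs).

```shell
# Hypothetical helper: applies the same three gsub calls as the script
# to each input line and prints the normalized result.
normalize() {
  awk '{
    gsub("[ ][ ][ ]*", " ")   # collapse runs of two or more spaces
    gsub(" ,", ",")           # remove a space before a comma
    gsub(", ", ",")           # remove a space after a comma
    print
  }'
}

printf '%s\n' 'SEATTLE , "seattle,   Wash"' | normalize
# prints: SEATTLE,"seattle,Wash"
```

After this step the base word and every tag are free of padding, so a plain string comparison is enough.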

Any word appearing more than once in the same tag list will be reported, whether it matches the base word or not. In the sample data, cowboys appears twice in the same tag list and is reported. giants also appears twice in the same tag list, matches the base word, and is reported. I could suppress this if needed.
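
One way the same-line reports might be suppressed is to compare the stored line number with NR before printing anything. The following is a simplified sketch, not a patch to the script in this post: the function name cross_line_dups and the message format are made up for illustration, it walks the tags in order, and it leaves out the base-word check.

```shell
# Hypothetical variant: report a tag only when it was first seen on a
# DIFFERENT line; repeats within the same line are silently ignored.
cross_line_dups() {
  awk -F, '
  BEGIN { q = "\042" }                     # q is a double-quote character
  {
    gsub("[ ][ ][ ]*", " "); gsub(" ,", ","); gsub(", ", ",")
    base = toupper($1)
    tags = toupper(substr($0, index($0, q) + 1))
    gsub(q, "", tags)
    n = split(tags, taglist, ",")
    for (i = 1; i <= n; i++) {
      t = taglist[i]
      if (t in seen) {
        if (seen[t] != NR)                 # skip repeats on the current line
          print "line " NR ": " t " already on line " seen[t]
      } else
        seen[t] = NR
    }
    seen[base] = NR
  }' "$@"
}

printf '%s\n' \
  'new year, "new year"' \
  'giants, "Cowboys, giants, cowboys, giants"' \
  'corporate,"corporate, new year"' | cross_line_dups
# prints: line 3: NEW YEAR already on line 1
```

The repeated cowboys and giants tags on the second input line produce no output here, because both copies sit on the same line.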

I use an array to store all the words, so there is a limit to how much data this approach can process. If you have too much data for it, I would use awk to extract all the tag words, sort them, and then check the sorted list for duplicates.


new york, "new York, capital, nice"
Newyork, "Newyork, good"
new city, "new, new city, old"
new year, "new year"
dealers, "deal, DL"
SEATTLE , "seattle, Wash"
BANK, "BANK, BANC, BANCO, BCO, BANKS"
giants, "Cowboys, dealer, giants, cowboys, giants"
corporate,"corporate, new year"

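awk -F, '
# Needs an awk that provides gsub() and toupper(); on Solaris that probably
# means nawk or /usr/xpg4/bin/awk instead of the default /usr/bin/awk.
BEGIN {q="\042"}                  # q is a double-quote character
{
origline=$0
gsub("[ ][ ][ ]*"," ")            # collapse runs of spaces into one space
gsub(" ,",",")                    # remove a space before a comma
gsub(", ",",")                    # remove a space after a comma
baseword=toupper($1)              # column 1, upper-cased to ignore case
word2=toupper(substr($0,index($0,q)+1))   # text after the first quote
gsub(q,"",word2)                  # drop the remaining quote(s)
split(word2,taglist)              # splits on FS, which is the comma
matchflag=0
errorflag=0
# Note: "for (i in ...)" visits the tags in no guaranteed order
for (i in taglist)
   {tagx=taglist[i]
    if (baseword == tagx)
       matchflag=1
    if (tagx in all_words)
       {if (errorflag == 0)       # print the offending line only once
          {print origline
           errorflag=1}
        lno=all_words[tagx]       # line where this word was first seen
        if (lno == NR)
           whichline="this line"
        else
           whichline="line " lno
        print "   ERROR in above line: tag " i " (" tagx \
          ") already encountered in " whichline
       }
    else
       all_words[tagx]=NR         # remember where this tag first appeared
   }
if (matchflag == 0)
   {print origline
    print "   ERROR in above line: Base word not matched in tag list"
   }
all_words[baseword]=NR            # record the base word as seen on this line
}' file.in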


dealers, "deal, DL"
   ERROR in above line: Base word not matched in tag list
giants, "Cowboys, Patriots, dealer, dealers, giants, cowboys, miami dolphins, giants"
   ERROR in above line: tag 4 (DEALERS) already encountered in line 5
   ERROR in above line: tag 8 (GIANTS) already encountered in this line
   ERROR in above line: tag 1 (COWBOYS) already encountered in this line
corporate,"corporate, new year"
   ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4



Response Number 5
Name: James Boothe
Date: January 15, 2008 at 10:26:15 Pacific
Subject: Find duplicate words in a file
Reply:
Oops, I grabbed the wrong output for the above posting.
The following is the corrected output:

dealers, "deal, DL"
   ERROR in above line: Base word not matched in tag list
giants, "Cowboys, dealer, giants, cowboys, giants"
   ERROR in above line: tag 1 (COWBOYS) already encountered in this line
   ERROR in above line: tag 3 (GIANTS) already encountered in this line
corporate,"corporate, new year"
   ERROR in above line: tag 2 (NEW YEAR) already encountered in line 4
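
For inputs too large for the in-memory array, the sort-based alternative mentioned in Response 4 could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name sorted_dups and the message format are invented here, tags are assumed to contain no tab characters, only the tag column is extracted (base words would have to be emitted the same way to cover column 1), and repeats within one line are ignored.

```shell
# Hypothetical sketch: emit one "TAG<TAB>line" record per tag, sort by tag,
# then let a second awk pass compare adjacent records.
sorted_dups() {
  awk -F, '
  BEGIN { q = "\042" }                     # q is a double-quote character
  {
    gsub("[ ][ ][ ]*", " "); gsub(" ,", ","); gsub(", ", ",")
    tags = toupper(substr($0, index($0, q) + 1))
    gsub(q, "", tags)
    n = split(tags, taglist, ",")
    for (i = 1; i <= n; i++)
      print taglist[i] "\t" NR             # tag, then the line it came from
  }' "$@" |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
  awk -F '\t' '
  $1 == prev && $2 != prevline {
    print $1 " appears on lines " prevline " and " $2
  }
  { prev = $1; prevline = $2 }'
}

printf '%s\n' \
  'new year, "new year"' \
  'giants, "Cowboys, giants, cowboys, giants"' \
  'corporate,"corporate, new year"' | sorted_dups
# prints: NEW YEAR appears on lines 1 and 3
```

Because sort can spill to temporary files, memory use no longer grows with the number of distinct words, at the cost of an extra pass over the data.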







All content ©1996-2007 Computing.Net, LLC