Remove Duplicate lines without sort

Sparc
September 1, 2008 at 06:13:45
Specs: Solaris 5.8, 2GB

Hi all,
I get files which sometimes contain repeated lines in no particular order (note: I cannot sort or reorder them). How do I remove the duplicate lines without sorting or reordering the rest? A sample file follows (note that A20080108509 and A20080610338 are repeated):
bash-2.03# cat list
Application No Work Order No Work Order Date Status Application Date
A20071208001 W20071207183 2007-12-20 WaitCompliance 2007-12-19
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080208788 W20080207935 2008-02-20 WaitCompliance 2008-02-20
A20080610339 W20080609492 2008-06-25 WaitCompliance 2008-06-20
A20080309161 W20080308289 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080108481 W20080508890 2008-05-26 WaitCompliance 2008-01-23
A20080108507 W20080508930 2008-05-26 WaitCompliance 2008-01-25
A20080309162 W20080308290 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080710613 W20080709660 2008-07-08 WaitCompliance 2008-07-08
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20
I know I can do:
# sort -u list
or
# sort list | uniq
but the catch is that I cannot change the order. Any solutions?

#1
September 3, 2008 at 11:00:22

One way is to assign each line to a variable, skipping any line previously assigned. Every line is compared with all the lines kept so far, so I think it will be very inefficient for a large file:


#!/bin/ksh

cnt=0
while read line
do
    # set the variable contents
    ((cnt+=1))
    eval a${cnt}="\$line"

    if [[ $cnt -eq 1 ]]
    then
        continue
    fi

    i=0
    ((snum=cnt-1))
    v2=$(eval echo \"\$a$cnt\")
    while (($i < $snum))
    do
        ((i+=1))
        v1=$(eval echo \"\$a${i}\")

        # skip the line if it's equal to an earlier one:
        # stepping cnt back lets the next line reuse this slot
        if [[ "$v1" = "$v2" ]]
        then
            ((cnt-=1))
            break
        fi
    done
done < thefile.txt

# display the variables a1, a2, ... up to a$cnt
x=1
while (($x <= $cnt))
do
    eval echo \"\$a${x}\"
    ((x+=1))
done
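
For comparison, a much shorter way to keep only the first occurrence of each line, in the original order, is an awk associative array. This is a minimal sketch and not part of the script above; dedupe_keep_order is a hypothetical helper name, and on Solaris 8 nawk or /usr/xpg4/bin/awk is probably needed in place of plain awk.

```shell
# dedupe_keep_order [FILE...]: print each distinct line once, keeping the
# order of first appearance. (dedupe_keep_order is a hypothetical name.)
# seen[$0]++ evaluates to 0 the first time a line is met, so !seen[$0]++
# is true exactly once per distinct line and awk's default action prints it.
dedupe_keep_order() {
    awk '!seen[$0]++' "$@"
}
```

Usage would be something like "dedupe_keep_order list > list.new && mv list.new list". It reads the file once, but it holds every distinct line in memory.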



#2
September 4, 2008 at 00:03:57

Hi Nails,
As always you have been very quick to respond.
Thanx a ton for the solution.

Great Going Mate

Sujan Banerjee


#3
September 4, 2008 at 02:09:48

Hi Nails,
I am a novice in shell scripting, so I tried to write this one-liner.
The original file is:
bash-2.03# cat list
Application No Work Order No Work Order Date Status Application Date
A20071208001 W20071207183 2007-12-20 WaitCompliance 2007-12-19
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080208788 W20080207935 2008-02-20 WaitCompliance 2008-02-20
A20080610339 W20080609492 2008-06-25 WaitCompliance 2008-06-20
A20080309161 W20080308289 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080108481 W20080508890 2008-05-26 WaitCompliance 2008-01-23
A20080108507 W20080508930 2008-05-26 WaitCompliance 2008-01-25
A20080309162 W20080308290 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080710613 W20080709660 2008-07-08 WaitCompliance 2008-07-08
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20

bash-2.03# sed '=' list|sed 'N;s/\n/ /g'|tr -s ' '|sort -t" " -k2,2|uniq -f 2|sort -n|cut -d" " -f2- |sed 'w list'
Application No Work Order No Work Order Date Status Application Date
A20071208001 W20071207183 2007-12-20 WaitCompliance 2007-12-19
A20080208788 W20080207935 2008-02-20 WaitCompliance 2008-02-20
A20080610339 W20080609492 2008-06-25 WaitCompliance 2008-06-20
A20080309161 W20080308289 2008-03-27 WaitCompliance 2008-03-27
A20080108481 W20080508890 2008-05-26 WaitCompliance 2008-01-23
A20080108507 W20080508930 2008-05-26 WaitCompliance 2008-01-25
A20080309162 W20080308290 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080710613 W20080709660 2008-07-08 WaitCompliance 2008-07-08
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20

It first assigns a line number to each line (to ensure the order can be kept the same), sorts by the second field (so that uniq can remove the duplicates in the next step), then sorts numerically by line number again, cuts from the second field onwards, and writes the result into the same file with sed.
The only catch is that I had to use "tr -s ' '", which squeezes successive spaces.
Any idea how to use "uniq -f 2" when the field separator is not a space but some other character, such as @ or ~?

Any idea how to use the enhanced functionality of "uniq -f 2" to identify fields?
Any suggestions are welcome.
And Thanx in advance

Sujan Banerjee


#4
September 4, 2008 at 02:18:54

Hi Nails,
I am a novice in shell scripting, so I tried to write this one-liner.
The original sample file is (say):
bash-2.03# cat list
Application No Work Order No Work Order Date Status Application Date
A20071208001 W20071207183 2007-12-20 WaitCompliance 2007-12-19
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080208788 W20080207935 2008-02-20 WaitCompliance 2008-02-20
A20080610339 W20080609492 2008-06-25 WaitCompliance 2008-06-20
A20080309161 W20080308289 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080108481 W20080508890 2008-05-26 WaitCompliance 2008-01-23
A20080108507 W20080508930 2008-05-26 WaitCompliance 2008-01-25
A20080309162 W20080308290 2008-03-27 WaitCompliance 2008-03-27
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080710613 W20080709660 2008-07-08 WaitCompliance 2008-07-08
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20

bash-2.03# sed '=' list|sed 'N;s/\n/ /g'|tr -s ' '|sort -t" " -k2,2 -k1,1n|uniq -f 2|sort -n|cut -d" " -f2- |sed 'w list'
Application No Work Order No Work Order Date Status Application Date
A20071208001 W20071207183 2007-12-20 WaitCompliance 2007-12-19
A20080610338 W20080609491 2008-06-25 WaitCompliance 2008-06-20
A20080108509 W20080107667 2008-01-29 WaitCompliance 2008-01-25
A20080208788 W20080207935 2008-02-20 WaitCompliance 2008-02-20
A20080610339 W20080609492 2008-06-25 WaitCompliance 2008-06-20
A20080309161 W20080308289 2008-03-27 WaitCompliance 2008-03-27
A20080108481 W20080508890 2008-05-26 WaitCompliance 2008-01-23
A20080108507 W20080508930 2008-05-26 WaitCompliance 2008-01-25
A20080309162 W20080308290 2008-03-27 WaitCompliance 2008-03-27
A20080710613 W20080709660 2008-07-08 WaitCompliance 2008-07-08

It first assigns a line number to each line (to ensure the order can be kept the same), sorts by the second field and then by line number (so that uniq keeps the first occurrence when it removes the duplicates in the next step), then sorts numerically by line number again, cuts from the second field onwards, and writes the result into the same file with sed.
The only catch is that I had to use "tr -s ' '", which squeezes successive spaces.
Any idea how to use "uniq -f 2" when the field separator is not a space but some other character, such as @ or ~?

Any idea how to use the enhanced functionality of "uniq -f 2" to identify fields?
Any suggestions are welcome.
And Thanx in advance

Sujan Banerjee


#5
September 4, 2008 at 20:51:47

I don't understand your question about the uniq command. The sort command's -u switch does the same.

Anyway, here is my take on your problem: (Since I'm using Solaris, I'm using nawk instead of awk)

cat -n list|sort -k2,2 -u |sort -k1,1 -n|nawk ' { $1=""; gsub("^ ",""); print $0 } '


#6
September 4, 2008 at 23:26:38

Hi Nails,

Thanx for the quick response.
U r real genius,man!!
My question is this: in "man uniq" I saw the following two options, which ignore a certain number of leading fields or characters when comparing lines (e.g. in a log whose first few fields are timestamp, dir, hostname, etc., one might need to compare only from, say, the 4th field). The catch is that it recognises only blanks as the field delimiter. So is there any way to make uniq identify some other delimiter, say @, | or any other character? Thanx in advance
--Sujan Banerjee
------------------
Uniq Options:
-f fields
    Ignore the first fields fields on each input line when
    doing comparisons, where fields is a positive decimal
    integer. A field is the maximal string matched by the
    basic regular expression:

    [[:blank:]]*[^[:blank:]]*

    If fields specifies more fields than appear on an
    input line, a null string will be used for comparison.

-s chars
    Ignore the first chars characters when doing comparisons,
    where chars is a positive decimal integer. If
    specified in conjunction with the -f option, the first
    chars characters after the first fields fields will be
    ignored. If chars specifies more characters than
    remain on an input line, a null string will be used
    for comparison.
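
As the excerpt shows, uniq defines a field only in terms of blanks and has no option for another separator. One possible workaround, sketched here under assumptions, is to do the comparison in awk, which accepts a separator through -F. The function name dedupe_after_fields and the '@'-separated sample lines are hypothetical; on Solaris, nawk would likely be needed instead of awk.

```shell
# dedupe_after_fields DELIM NFIELDS
# Read stdin; ignore the first NFIELDS DELIM-separated fields when comparing
# lines (similar in spirit to "uniq -f"), and print only the first line seen
# for each distinct remainder. Input does not have to be sorted.
# (dedupe_after_fields is a hypothetical helper name.)
dedupe_after_fields() {
    awk -F"$1" -v skip="$2" '{
        key = ""
        for (i = skip + 1; i <= NF; i++)
            key = key FS $i            # rebuild the part of the line to compare
        if (!(key in seen)) {
            seen[key] = 1
            print                      # print the whole original line
        }
    }'
}

# Hypothetical log lines: the leading timestamp field is ignored.
printf '%s\n' \
    '10:01@hostA@disk full' \
    '10:02@hostB@link down' \
    '10:03@hostA@disk full' |
dedupe_after_fields '@' 1
```

Unlike uniq, this catches repeats that are not adjacent, so no sort is needed beforehand and the original order is kept.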


#7
September 5, 2008 at 00:03:50

Hi Nails,

I have two more questions.
I have written a small script that gives the output of "ls -l" in KB and MB. I do not want to show the client the script, so I wrapped the scripts in aliases, as below:
bash-2.03# alias "lsKB=`find /elite -name "*lsKB*"`"
bash-2.03# alias "lsMB=`find /elite -name "*lsMB*"`"
bash-2.03# alias
alias lsKB='/elite/prof/all/lsKB.sh'
alias lsMB='/elite/prof/all/lsMB.sh'

and fire the commands as shown:-
bash-2.03# cd /elite/prof
bash-2.03# lsKB
-rw-r--r-- 1 root other Aug 13 12:43 old_stat.tar.gz.13-Aug-08 1.474 KB
-rw-r--r-- 1 root other Sep 5 11:00 statusrep_1 2.090 KB
-rw-r--r-- 1 root other Sep 5 11:00 statusrep_2 13.887 KB
-rwxr-xr-x 1 root other Sep 2 12:25 newexec.sh 0.293 KB
-rwxr-xr-x 1 root other Jul 9 13:29 sujanstatus.sh 0.512 KB
drwxr-xr-x 2 root other Aug 27 10:54 manpages 0.500 KB
drwxr-xr-x 2 root other Sep 2 12:46 all 0.500 KB
drwxr-xr-x 2 root other Sep 4 18:09 track 1.000 KB
drwxr-xr-x 2 root other Sep 4 18:10 scripts 1.000 KB
Total estimated size: 21.2549 KB or 42.5098 blocks.
bash-2.03# pwd
/opt/jboss-3.2.6/server/default/log
bash-2.03# lsMB
-rw-r--r-- 1 root other Aug 5 12:28 boot.log 0.064 MB
-rw-r--r-- 1 root other Aug 4 15:42 cd 0.023 MB
-rw-r--r-- 1 root other Sep 5 12:13 jboss.log 3.165 MB
-rw-r--r-- 1 root other Sep 5 12:13 server.log 0.117 MB
-rw-r--r-- 1 root other Sep 1 23:55 server.log.2008-09-01 0.273 MB
-rw-r--r-- 1 root other Sep 2 23:53 server.log.2008-09-02 0.300 MB
-rw-r--r-- 1 root other Sep 3 23:58 server.log.2008-09-03 0.229 MB
-rw-r--r-- 1 root other Sep 4 23:56 server.log.2008-09-04 0.228 MB
-rw-r--r-- 1 root other Sep 5 12:13 server.log_Aug.tar.gz 0.009 MB
-rw-r--r-- 1 root other Aug 5 12:35 startup.log 0.000 MB
Total estimated size: 4514.33 KB or 4.40853 MB or 9028.66 blocks.
So evidently both are working.
But whenever I log out or open another terminal, alias gives blank output (see below):

bash-2.03# exit
exit
# bash
bash-2.03# alias
bash-2.03#

How can I set those aliases on a permanent basis?
Another question: I have similar scripts which I want to set up as aliases through some script (see below):
bash-2.03# cd /elite/prof/all
bash-2.03# ls -l
total 8
-rwxr-xr-x 1 root other 289 Aug 7 16:37 dfGB.sh
-rwxr-xr-x 1 root other 274 Sep 1 16:00 dfMB.sh
-rwxr-xr-x 1 root other 232 Aug 13 17:47 lsKB.sh
-rwxr-xr-x 1 root other 235 Aug 7 16:38 lsMB.sh.

How do I achieve this???
Thanx in Advance

Sujan Banerjee


#8
September 5, 2008 at 22:49:38

Probably the problem is that you haven't defined your aliases in one of the bash startup files. An alias typed at the prompt lasts only for that shell; for every new bash shell to have it, the alias must be defined in one of the startup files.

Typically (although it might vary between systems), aliases and other per-user commands go in the .bashrc file in the user's home directory.

What I like to do is to place my personal aliases and commands in a separate file, and then source that file from .bashrc (that's a period, a space and the file name):

. ~/.bash_aliases

If you have a number of users, you could place all your shared aliases and commands in one file; call it:

/etc/aliases_profile

Then source that file from the .bashrc in each user's home directory:

. /etc/aliases_profile
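
To create aliases for a whole directory of scripts through a script, something along these lines might do. It is only a sketch: print_aliases is a hypothetical function name, and it assumes each alias should be named after the script file without its .sh suffix.

```shell
# print_aliases DIR: emit one alias definition per *.sh file in DIR,
# named after the file without its .sh suffix.
# (print_aliases is a hypothetical helper name.)
print_aliases() {
    for f in "$1"/*.sh; do
        [ -f "$f" ] || continue        # skip if the glob matched nothing
        name=$(basename "$f" .sh)
        echo "alias $name='$f'"
    done
}
```

The output could be appended once to the aliases file, for example "print_aliases /elite/prof/all >> /etc/aliases_profile", so that new shells pick the aliases up when they source that file.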

There are plenty of resources online that describe bash setup files. Here are two of them:

http://www.linuxfromscratch.org/blf...

http://sunsite.ualberta.ca/Document...


#9
September 5, 2008 at 23:26:30

Hi Nails,
First of all, thanx a lot for all the guidance and help.
I was just thinking to myself that probably I am bothering this chap more than my fair share.
So if I am bugging you more than I should,just leave a hint;).

Those 2 links you gave were of great help.

Here I am having one more peculiar situation.
(Please see the script above that you sent in response number 1, dated September 3, 2008 at 11:00:22 Pacific.)
There the first line of the script is "#!/bin/ksh", so no matter which shell I am in, it should start a ksh and run the script there.
But when I fire the script as below, it fails:
bash-2.03# sh nailscript.sh
nailscript.sh: syntax error at line 16: `v2=$' unexpected

But when I do "bash-2.03# ksh nailscript.sh", it works fine!
I also changed the first line to "#!/usr/bin/ksh" and fired it again; again it failed.
Why is it so? I am puzzled. Below is some data I thought might be useful (feel free to ask for any other input if required).

bash-2.03# which ksh sh
/usr/bin/ksh
/usr/bin/sh
bash-2.03# echo $PATH
/usr/sbin:/usr/bin:/opt/java:/elite:/usr/local/lib:/usr/ccs/bin:/opt/sfw/sbin/:/usr/local/lib:/usr/local/bin:/opt/apache-ant-1.6.1/bin:/opt/OV/bin/OpC:/opt/OV/bin:/opt/OV/bin/OpC/install:/opt/sfw/bin


Thanx for everything
Regards
Sujan Banerjee


#10
September 6, 2008 at 12:59:36

First, feel free to ask your questions; I don't mind answering them.

Second, executing this command:

sh nailscript.sh

says to execute nailscript.sh using the Bourne shell, sh. You are right that #!/bin/ksh should override the Bourne shell, but for some reason it isn't.

This line fails because the Bourne shell uses a different command-substitution syntax from bash and ksh:

v2=$(eval echo \"\$a$cnt\")

bash and ksh accept $( ), while the Bourne shell only understands ` `

That is the backtick and not the single quote:

v2=`eval echo \"\$a$cnt\"`

So why is your script ignoring your ksh invocation? It's probably because you don't have the string #!/bin/ksh on the first line and in column 1.

Read about it in this "Korn Shell Nuances" article:

http://www.samag.com/documents/s=10...



#11
September 10, 2008 at 00:01:18

Hi Nails,
Thanx for the response and the link.

I had used #!/bin/ksh as the first line of the script and also tried to remove any space or unprintable character before #!/bin/ksh using vi, but the shell is not being overridden.

I also tried #!/usr/bin/ksh as the first line; still the same result.

BTW, here is the octal dump of the first three lines of the script:
bash-2.03# od -bc nailscript.sh |head -3
0000000 043 041 057 142 151 156 057 153 163 150 012 143 156 164 075 060
# ! / b i n / k s h \n c n t = 0
0000020 012 167 150 151 154 145 040 162 145 141 144 040 154 151 156 145

What's this "0000000" before the hash (#) character, and is it the root cause of the problem?

Thanx

Sujan
============================================
PS:Not sending the script to save space,I have just done copy-paste from Response Number 1.


#12
September 11, 2008 at 14:47:58

First, let's talk about the od -bc command:

The leftmost column of od output is the byte offset of that row, printed in octal, with "0000000" being the start of the file. It is not part of the file itself. Consider the contents of file ff:

1234567

Now, execute od on the file:


od -bc ff

0000000 061 062 063 064 065 066 067 012
1 2 3 4 5 6 7 \n
0000010

Eight characters are displayed: 7 digits and a newline. The final "0000010" is octal for 8, the total byte count. Add characters to the file and see how the last number increases.
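
A quick way to watch that last number change without editing a file is to pipe text straight into od. This is a small sketch; last_offset is a hypothetical helper name.

```shell
# last_offset: print only the final offset line that od emits for stdin.
# That offset is the total number of bytes read, written in octal.
# (last_offset is a hypothetical helper name.)
last_offset() {
    od -b | sed -n '$p' | tr -d ' '
}

printf '1234567\n' | last_offset            # 8 bytes  -> 0000010
printf '123456789012345\n' | last_offset    # 16 bytes -> 0000020
```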

Second, concerning executing with the sh command, I misled you. If you execute a script with a leading sh such as this:

sh ab.ss

it runs the script under the Bourne shell no matter what is in line 1. Suppose script ab.ss contains this:

#!/bin/ksh

a=z
cnt=5
v2=$(eval echo $a$cnt)

echo $v2
# end script

executing:

sh ab.ss

explicitly overrides #!/bin/ksh and the script fails with a syntax error.

What #!/bin/ksh does is let the script override the current shell when the script is executed directly. Here is how to emulate it:

From your command line, execute:

sh

Your shell is now the Bourne shell (although the parent that executed it might be the Korn or the bash shell).

Now execute ab.ss directly (for example ./ab.ss, assuming it has execute permission) and it should run with no failures.

Let me know if you have any questions or comments.


#13
September 12, 2008 at 04:10:59

Hi Nails,
The real problem is that it is a multiuser system and I cannot govern which shell other people use. So my only saving grace should have been "#!/bin/ksh" in the first line of my script, but if people fire it from, say, the Bourne or bash shell, then the script is bound to fail.

Any suggestions??
Regards,
Sujan


#14
September 12, 2008 at 07:09:34

I must respectfully disagree with you. It doesn't matter what your parent/current shell is. Placing #!/bin/ksh in line 1, column 1 says to run the script with the Korn shell. It will do so no matter what the parent/calling shell is.

So, if your home shell is bash, and you execute myscript.ksh and #!/bin/ksh is in line 1, column 1, the script executes as a Korn shell script.

My suggestion is: do NOT execute scripts like this:

sh myscript.ksh

As I said, if you do this it says to run the script as a Bourne shell script no matter what. It ignores #!/bin/ksh.

Think of:

sh myscript.ksh

as executing a Bourne shell with myscript.ksh being an argument to it. Don't do it.

Thinking out loud, there is nothing special about the Bourne shell. You could do something like

ksh myscript.ksh

But, as I said, there is no reason to do this.


#15
September 24, 2008 at 04:25:49

Hi,
I have a very large XML file (let's say file1.xml) and one more XML file (file2.xml) which contains some of the same entries as file1.xml.
My purpose is to remove those shared entries from file1.xml and store the result in file3.xml.

I can't sort the file.
One solution is to append the contents of file2.xml to file1.xml and then remove the duplicate lines. But this doesn't work with a sed command found on the net (sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'), and I do not know why.

Please help
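
One possible approach that needs neither sorting nor appending is to load the lines of file2.xml into an awk array and then print only the lines of file1.xml that are not in it. This is a sketch under a strong assumption: every entry sits on a single line and matches character for character. remove_common is a hypothetical function name, and on Solaris nawk would likely be needed instead of awk.

```shell
# remove_common FILE2 FILE1
# Print the lines of FILE1 that do not occur anywhere in FILE2,
# keeping FILE1's original order. Comparison is whole-line and exact.
# (remove_common is a hypothetical helper name.)
remove_common() {
    awk '
        FILENAME == ARGV[1] { drop[$0] = 1; next }   # first file: remember lines
        !($0 in drop)                                # second file: print the rest
    ' "$1" "$2"
}
```

It would be called as "remove_common file2.xml file1.xml > file3.xml". Only file2.xml is held in memory, which helps when file1.xml is the very large one.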

