
find duplicate files


Original Message
Name: uribo
Date: May 15, 2003 at 22:22:44 Pacific
Subject: find duplicate files
OS: solaris
CPU/Ram: -
Comment:

Hi, I want to find all possible duplicate files in a directory. They need not have the same filename. Can anyone help?

Thanks.




Response Number 1
Name: Swapan Satpathi
Date: May 16, 2003 at 14:38:29 Pacific
Reply:

You can try this:


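#!/bin/sh
# Compare every pair of same-sized files in the current directory.
# Assumes file names contain no whitespace or '|' characters.
x=0
# Record "size|name" per entry; the "total" line has no 5th field and is skipped.
ls -lrt | awk '!($5 == "") {print $5 "|" $9}' > file_list
for i in `cat file_list`
do
    isize=`echo "$i" | awk -F'|' '{print $1}'`
    ifile=`echo "$i" | awk -F'|' '{print $2}'`
    for j in `cat file_list`
    do
        jsize=`echo "$j" | awk -F'|' '{print $1}'`
        jfile=`echo "$j" | awk -F'|' '{print $2}'`
        # Only files of equal size can be identical.
        if [ "$isize" = "$jsize" ]
        then
            if [ "$ifile" != "$jfile" ]
            then
                # cmp -s exits 0 only when the files are byte-for-byte identical.
                if cmp -s "$ifile" "$jfile"
                then
                    echo "file $jfile is duplicate of $ifile"
                    x=1
                fi
            fi
        fi
    done
done
if [ $x -eq 0 ]
then
    echo "There is no duplicate file"
fi
rm -f file_list
exit 0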



Response Number 2
Name: nails
Date: May 16, 2003 at 16:33:37 Pacific
Reply:

Hi:

There's always more than one way of doing something. Solaris has a cksum command which returns a 32-bit CRC of a file. You can almost guarantee that files with the same CRC in this case will be identical. Here's what I do:
1) Save the checksum of everything in the directory to a file, cksum.file.
2) Determine from that file which checksums occur more than once, saving them to dupcksum.txt.
3) Sort dupcksum.txt, eliminating the duplicates, and save the result to udup.txt.
4) For each entry in udup.txt, grep cksum.file and print field 3, the file name.

This works well if you have more than two copies of the same file.

Regards,

Nails

#!/bin/ksh
fr=0
cksum * | sort -k 1,2 > cksum.file
: > dupcksum.txt    # start with an empty list of duplicate checksums
while read c1 c2 n3
do
    if [ fr -eq 0 ]; then
        ck1=$c1
        fr=1
        continue
    fi
    # save checksums that appear on two consecutive lines
    if [[ $ck1 -eq $c1 ]]; then
        echo $c1 >> dupcksum.txt
    fi
    ck1=$c1
done < cksum.file

# what if there's more than one copy
sort -u -o udup.txt dupcksum.txt
rm -f dupcksum.txt

if [ -f udup.txt ]; then
    while read nck
    do
        echo "duplicate files:"
        # match the checksum only at the start of the line
        grep "^$nck[^0-9]" cksum.file | awk '{ print $3 }'
    done < udup.txt
    rm udup.txt
fi
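
Since a matching CRC makes identical content very likely but not certain, a byte-for-byte check can confirm each candidate pair. The following is a minimal sketch, assuming a POSIX `cmp`; the function name `confirm_dup` is hypothetical and not part of the script above.

```shell
#!/bin/sh
# confirm_dup FILE1 FILE2 (hypothetical helper, not from the original post)
# Prints a message only when the two files are byte-for-byte identical.
confirm_dup() {
    # cmp -s prints nothing and exits 0 only if the contents are the same
    if cmp -s "$1" "$2"; then
        echo "$2 is a confirmed duplicate of $1"
    fi
}
```

It could be called on each pair the checksum script reports, for example `confirm_dup a.txt b.txt`.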



Response Number 3
Name: hoichoi
Date: May 19, 2003 at 18:35:42 Pacific
Reply:

Hi Nails,
Your code looks fine, but is it possible for you to do it in Bourne shell? I'm having difficulties running this script.



Response Number 4
Name: nails
Date: May 21, 2003 at 17:42:23 Pacific
Reply:

Hi:

Only two little changes:
1) Added the $ that I forgot on fr in the "fr -eq 0" if statement. ksh doesn't care.
2) Replaced the double brackets [[ ]] with single brackets [ ].

Regards,

Nails

#!/bin/sh

fr=0
cksum * | sort -k 1,2 > cksum.file
: > dupcksum.txt    # start with an empty list of duplicate checksums
while read c1 c2 n3
do
    if [ $fr -eq 0 ]; then
        ck1=$c1
        fr=1
        continue
    fi
    # save checksums that appear on two consecutive lines
    if [ "$ck1" -eq "$c1" ]; then
        echo $c1 >> dupcksum.txt
    fi
    ck1=$c1
done < cksum.file

# what if there's more than one copy
sort -u -o udup.txt dupcksum.txt
rm -f dupcksum.txt

if [ -f udup.txt ]; then
    while read nck
    do
        echo "duplicate files:"
        # match the checksum only at the start of the line
        grep "^$nck[^0-9]" cksum.file | awk '{ print $3 }'
    done < udup.txt
    rm udup.txt
fi



Response Number 5
Name: hoichoi
Date: May 23, 2003 at 21:46:17 Pacific
Reply:

Hi Nails,
Thanks for your help. I'd like another option: instead of showing the duplicate files, count how many duplicates each file has. Is that possible? For example, if a.txt has 3 duplicate files, it would show "a.txt has 3 duplicate files" instead of showing the file names.

Regards
HoiChoi




Response Number 6
Name: nails
Date: May 24, 2003 at 12:41:42 Pacific
Reply:

HoiChoi:

I'm not certain I understand your question. Remove the -u switch from the sort command above, and you'll have all the duplicates in a row.

Which file is the one where you say, this "file" has 3 duplicates?

Regards,


Nails
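
One way to read HoiChoi's request is: for each checksum shared by more than one file, report one file name and how many other files match it. The following is a minimal sketch under that assumption. The function name `count_dups` is hypothetical, a POSIX awk is assumed (nawk on Solaris), and file names are assumed to contain no whitespace.

```shell
#!/bin/sh
# count_dups DIR (hypothetical helper)
# For each group of files in DIR with the same cksum CRC and size,
# print the first file name and the number of additional copies.
count_dups() {
    (
        cd "$1" || exit 1
        # cksum prints: CRC SIZE NAME (errors for directories are discarded)
        cksum * 2>/dev/null | sort -k 1,1n -k 2,2n -k 3 |
        awk '{
            key = $1 " " $2
            # remember the first name seen for each checksum/size pair
            if (!(key in first)) { first[key] = $3; order[++n] = key }
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++) {
                k = order[i]
                if (count[k] > 1)
                    printf "%s has %d duplicate file(s)\n", first[k], count[k] - 1
            }
        }'
    )
}
```

Calling `count_dups .` reports on the current directory. Which file counts as "the original" is arbitrary here: the sketch simply picks the first name in sort order.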



Response Number 7
Name: Aki
Date: May 27, 2003 at 21:44:19 Pacific
Reply:

Hi Nails,
How can we modify the above program to go through each subdirectory in the tree, check whether any duplicates exist further down, and then show all the possible duplicate copies? Also, what is the usage of this program? If you have time, can you please go through the code in more detail (I am just a beginner)? Also, what does the line "done >cksum.file" do? I understand it directs output to cksum.file, but how? You have removed the temporary files. Finally, if more than one instance of the same program is running at the same time, how can we make sure that they do not overwrite each other's temporary files?
I hope these are not too many questions. If so, please excuse me.
Thanks.
Aki.



Response Number 8
Name: nails
Date: May 28, 2003 at 13:31:22 Pacific
Reply:

Hi:

I might be able to look at this tomorrow, but I'll answer some questions now:

>Also what does the line, "done >cksum.file" do. I understand it directs output to cksum.file but how. You have removed the temporary files.

In the example above, don't use the greater-than sign; use the less-than sign, which redirects input rather than output. The forum software dropped the less-than sign from the posted script. The line should read:

done < cksum.file
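
To illustrate what the redirection does: placing a less-than sign and a file name after `done` makes the whole `while` loop read its standard input from that file, so each `read` consumes one line. The following is a minimal sketch; the function name `print_names` is hypothetical, and the input is assumed to hold cksum-style lines (checksum, size, name).

```shell
#!/bin/sh
# print_names FILE (hypothetical helper)
# Reads "checksum size name" lines from FILE and prints only the names.
print_names() {
    while read c1 c2 n3
    do
        echo "$n3"
    done < "$1"    # the loop's input comes from FILE, not the terminal
}
```

For example, `print_names cksum.file` would list the third field of every line in cksum.file.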

>Also if more than one instance of the same program is running at the same time, how can we make sure that they do not write each other's temporary files

One of the things you can do is use the process ID variable, $$, to create temp files:

fn=cksum.file.$$

Now fn will have the process ID of the executing script as its extension, so each running instance gets its own file.
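
A short sketch of that idea follows, with a `trap` added so the temporary file is removed when the script exits. The `trap` cleanup and the `${TMPDIR:-/tmp}` location are additions for illustration, not part of the reply above.

```shell
#!/bin/sh
# Build a temp file name that includes this shell's process ID ($$),
# so two concurrent runs do not overwrite each other's file.
fn="${TMPDIR:-/tmp}/cksum.file.$$"

# Assumption: the temp file should be removed when the script exits.
trap 'rm -f "$fn"' 0

# Save the sorted checksums of the current directory in the temp file.
cksum * 2>/dev/null | sort -k 1,1n > "$fn"
echo "checksums saved in $fn"
```

PID-based names are predictable, so where `mktemp` is available it is generally the safer choice for files in a shared directory such as /tmp.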

Regards,


Nails



Response Number 9
Name: nails
Date: May 29, 2003 at 09:56:44 Pacific
Reply:

#!/bin/ksh
# execute cksum on all files, sort numeric on the first field.
# since all duplicates are in order check each one in turn.
fr=0
cksum * | sort -k 1,1n |
while read c1 c2 n3
do
    if [ $fr -eq 0 ]; then
        prevobj=$c1
        prevfile=$n3
        fr=1
        continue
    fi
    # report files whose checksum matches the first file of the current group
    if [[ $prevobj -eq $c1 ]]; then
        echo "file $prevfile has duplicate $n3"
    else
        prevobj=$c1
        prevfile=$n3
    fi
done




Response Number 10
Name: Aki
Date: May 29, 2003 at 19:18:19 Pacific
Reply:

Hi Everyone,
Nails, I have tried your script with the set -x option. It compares each file with the next file and after that does not make any comparisons with the rest of the files. I mean it will compare file1 with file2, file2 with file3, and so on, but it will not compare file1 with any file other than file2. So it will output sets of two identical files. How can I modify it to output all the identical files together? And how can the script be made to search the whole directory tree of the given directory?
Thanks.
Aki.




Response Number 11
Name: nails
Date: May 30, 2003 at 09:46:26 Pacific
Reply:

Aki:

To answer your first question:
It's not accurate to say it "compares each file with the next file". It compares a file, call it filea, with each following file until the checksum changes. filea doesn't change until the checksum changes; that way you pick up multiple duplicates.

This works because the checksums are numerically sorted, so files with equal checksums sit next to each other and there's no reason to check each file against every other one.

To answer your second question:
To compare all the files in the directory structure, perform a find and checksum all the files:

#!/bin/ksh
# execute cksum on all files, sort numeric on the first field.
# since all duplicates are in order check each one in turn. Find
# all the duplicates in the tree
fr=0
# run cksum once per file so long file lists do not overflow the command line
find . -type f -exec cksum {} \; | sort -k 1,1n |
while read c1 c2 n3
do
    if [ $fr -eq 0 ]; then
        prevobj=$c1
        prevfile=$n3
        fr=1
        continue
    fi
    # report files whose checksum matches the first file of the current group
    if [[ $prevobj -eq $c1 ]]; then
        echo "file $prevfile has duplicate $n3"
    else
        prevobj=$c1
        prevfile=$n3
    fi
done
exit 0

Regards,

Nails
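
Aki's request to list all identical files together can be sketched by printing one line per group instead of one line per pair. This is a minimal sketch rather than Nails's script: the function name `list_dup_groups` is hypothetical, a POSIX awk is assumed (nawk on Solaris), file names are assumed to contain no whitespace, and files are grouped on both the CRC and the size that cksum prints.

```shell
#!/bin/sh
# list_dup_groups DIR (hypothetical helper)
# Print each set of files under DIR that share a cksum CRC and size,
# one set per line, names separated by spaces.
list_dup_groups() {
    find "$1" -type f -exec cksum {} \; |
    sort -k 1,1n -k 2,2n -k 3 |
    awk '{
        key = $1 " " $2
        if (key == prev) {
            line = line " " $3      # same CRC and size: extend the group
            members++
        } else {
            if (members > 1) print line   # flush the finished group
            prev = key; line = $3; members = 1
        }
    }
    END { if (members > 1) print line }'
}
```

Running `list_dup_groups .` covers the current directory and everything below it. Because the sort puts equal checksums on adjacent lines, a single pass is enough, which is the same property the script above relies on.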

