Hi, I want to find all possible duplicate files in a directory. They need not have the same filename. Can anyone help?
Thanks.

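You can try this:
#!/bin/sh
# Compare every pair of equal-sized files in the current directory.
# Caveat: parsing ls output breaks on file names that contain whitespace.
x=0
# Record "size|name" per entry; the "total" line has no fifth field.
ls -lrt | awk '!($5 == ""){print $5"|"$9}' > file_list
for i in `cat file_list`
do
    isize=`echo "$i" | awk -F'|' '{print $1}'`
    ifile=`echo "$i" | awk -F'|' '{print $2}'`
    for j in `cat file_list`
    do
        jsize=`echo "$j" | awk -F'|' '{print $1}'`
        jfile=`echo "$j" | awk -F'|' '{print $2}'`
        if [ "$isize" = "$jsize" ]
        then
            # only compare two different regular files
            if [ "$ifile" != "$jfile" ] && [ -f "$ifile" ] && [ -f "$jfile" ]
            then
                # cmp -s exits 0 only when the files are byte-identical
                if cmp -s "$ifile" "$jfile"
                then
                    echo "file $jfile is duplicate of $ifile"
                    x=1
                fi
            fi
        fi
    done
done
if [ $x -eq 0 ]
then
    echo "There is no duplicate file"
fi
exit 0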

Hi:
There's always more than one way of doing something. Solaris has a cksum command, which prints a 32-bit CRC and the byte count of each file. Files with the same CRC are almost certainly identical, although a 32-bit CRC cannot rule out a collision. Here's what I do:
1) save the checksum of everything in the directory into cksum.file
2) determine from that file which checksums occur more than once, saving them to dupcksum.txt
3) sort dupcksum.txt, eliminating the duplicates, saving to udup.txt
4) for each entry in udup.txt, grep cksum.file and print field 3, the file name
This works well if you have more than two copies of the same file.
Regards,
Nails
#!/bin/ksh
fr=0
cksum * | sort -k 1,2 > cksum.file
while read c1 c2 n3
do
    if [ fr -eq 0 ]; then
        ck1=$c1
        fr=1
        continue
    fi
    # save checksums that are duplicated
    if [[ $ck1 -eq $c1 ]]; then
        echo $c1 >> dupcksum.txt
    fi
    ck1=$c1
done < cksum.file
# what if there's more than one copy
sort -u -o udup.txt dupcksum.txt
rm -r dupcksum.txt
if [ -f udup.txt ]; then
    while read nck
    do
        echo "duplicate files:"
        # match the checksum field only, then print the file name
        grep "^$nck " cksum.file | awk ' { print $3 }'
    done < udup.txt
    rm udup.txt
fi
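
Since a matching CRC only makes identical content very likely, a byte-for-byte comparison can confirm a reported pair. Here is a minimal sketch, assuming a POSIX shell with cmp available. The function name same_content is made up for illustration and is not part of the script above.

```shell
#!/bin/sh
# same_content is a hypothetical helper, not part of the scripts in this thread.
# cmp -s prints nothing and exits 0 only when both files are byte-identical.
same_content() {
    cmp -s "$1" "$2"
}

# Example: confirm a pair reported by the checksum pass.
# if same_content a.txt b.txt; then echo "a.txt and b.txt are identical"; fi
```

Running this check only on files that already share a checksum keeps the number of full comparisons small.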

Hi Nails,
Your code looks fine, but would it be possible for you to do it in Bourne shell? I'm having difficulties running this script.

Hi:
Only two little changes:
1) I forgot a $ on the fr -eq 0 if statement. ksh doesn't care.
2) I removed the double [[ brackets.
Regards,
Nails
#!/bin/sh
fr=0
cksum * | sort -k 1,2 > cksum.file
while read c1 c2 n3
do
    if [ $fr -eq 0 ]; then
        ck1=$c1
        fr=1
        continue
    fi
    # save checksums that are duplicated
    if [ $ck1 -eq $c1 ]; then
        echo $c1 >> dupcksum.txt
    fi
    ck1=$c1
done < cksum.file
# what if there's more than one copy
sort -u -o udup.txt dupcksum.txt
rm -r dupcksum.txt
if [ -f udup.txt ]; then
    while read nck
    do
        echo "duplicate files:"
        # match the checksum field only, then print the file name
        grep "^$nck " cksum.file | awk ' { print $3 }'
    done < udup.txt
    rm udup.txt
fi

Hi Nails,
Thanks for your help. I'd like another option: instead of listing the duplicate files, count how many duplicates each file has. Is that possible? For example, if a.txt has 3 duplicate files, it should report that a.txt has 3 duplicates rather than show the file names.
Regards,
HoiChoi

HoiChoi:
I'm not certain of your question. Pull out the -u switch on the sort command above, and you'll have all the duplicates all in a row.
Which file is the one where you say, this "file" has 3 duplicates?
Regards,
Nails
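
One possible reading of the question is "report the first file of each group and how many other files share its checksum". A minimal sketch of that, assuming cksum prints CRC, size, and name separated by spaces. The function name count_dups is hypothetical, and treating the first file seen as "the" file is an arbitrary choice.

```shell
#!/bin/sh
# count_dups is a hypothetical name. It reads "CRC SIZE NAME" lines (cksum
# output) on stdin and, for each group with the same CRC and size, reports
# the first file seen and how many other files share its checksum.
count_dups() {
    awk '
    {
        key = $1 " " $2
        name = $0
        sub(/^[^ ]+ +[^ ]+ +/, "", name)   # keep names that contain spaces
        if (!(key in first)) { first[key] = name; order[++n] = key }
        count[key]++
    }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            if (count[k] > 1)
                printf "%s has %d duplicate(s)\n", first[k], count[k] - 1
        }
    }'
}

# Usage: cksum * | count_dups
```

Because the counts are kept per checksum, the input does not need to be sorted first.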

Hi Nails,
How can we modify the above program to go through each subdirectory in the tree, check whether any duplicates exist further down, and then show all the possible duplicate copies? Also, what is the usage of this program? If you have time, could you please go through the code in more detail (I am just a beginner)? Also, what does the line "done >cksum.file" do? I understand it directs output to cksum.file, but how? You have removed the temporary files. Finally, if more than one instance of the same program is running at the same time, how can we make sure that they do not overwrite each other's temporary files?
I hope these are not too many questions. If so, please excuse me.
Thanks.
Aki.

Hi:
I might be able to look at this tomorrow, but I'll answer some questions now:
>Also what does the line, "done >cksum.file" do. I understand it directs output to cksum.file but how. You have removed the temporary files.
In the example above, don't use the greater-than sign; use the less-than sign, which redirects input, not output. The forum dropped the less-than sign when the script was posted. The line is:
done < cksum.file
>Also if more than one instance of the same program is running at the same time, how can we make sure that they do not write each other's temporary files
One of the things you can do is use the process ID, $$, to create temp files:
fn=chksum.file.$$
Now fn will have an extension equal to the process ID of the shell executing the script.
Regards,
Nails
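
To make the $$ idea concrete, here is a minimal sketch, assuming a POSIX shell. The variable names tmp_sums and tmp_dups, the cleanup function, and the choice of ${TMPDIR:-/tmp} as the location are all illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: per-process temp file names built from $$, the
# process ID of the running shell. Two instances get different names.
tmp_sums="${TMPDIR:-/tmp}/cksum.file.$$"
tmp_dups="${TMPDIR:-/tmp}/dupcksum.txt.$$"

# Remove the temp files when the script exits.
cleanup() {
    rm -f "$tmp_sums" "$tmp_dups"
}
trap cleanup EXIT
trap 'exit 1' HUP INT TERM   # turn common signals into a normal exit

: > "$tmp_sums"              # create (or truncate) the file before use
```

Names built from $$ are predictable. Where the mktemp utility is available, it may be a safer choice because it generates names that are hard to guess.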

#!/bin/ksh
# execute cksum on all files, sort numerically on the first field.
# since all duplicates are in order, check each one in turn.
fr=0
cksum * | sort -k 1,1n |
while read c1 c2 n3
do
    if [ $fr -eq 0 ]; then
        prevobj=$c1
        prevfile=$n3
        fr=1
        continue
    fi
    # write out files whose checksums are duplicates
    if [[ $prevobj -eq $c1 ]]; then
        echo "file $prevfile has duplicate $n3"
    else
        prevobj=$c1
        prevfile=$n3
    fi
done

Hi Everyone,
Nails, I have tried your script with the set -x option. It compares each file with the next file and after that makes no comparisons with the rest of the files. I mean it will compare file1 with file2, file2 with file3, and so on, but it will not compare file1 with any file other than file2. So it will output sets of two identical files. How can I modify it to output all the identical files together? And how can the script be made to search the whole tree under the given directory?
Thanks.
Aki.

Aki:
To answer your first question:
It's not accurate to say it "compares each file with the next file". It compares a file, call it filea, with each following file until the checksum changes. filea doesn't change until the checksum changes; that way you pick up multiple duplicates. This works because the checksums are numerically sorted, so there's no reason to check each file against every other one.
To answer your second question:
To compare all the files in the directory structure, perform a find and checksum all the files:
#!/bin/ksh
# execute cksum on all files, sort numerically on the first field.
# since all duplicates are in order, check each one in turn. Find
# all the duplicates in the tree
fr=0
# -exec ... + hands the file names to cksum in batches
find . -type f -exec cksum {} + | sort -k 1,1n |
while read c1 c2 n3
do
    if [ $fr -eq 0 ]; then
        prevobj=$c1
        prevfile=$n3
        fr=1
        continue
    fi
    # write out files whose checksums are duplicates
    if [[ $prevobj -eq $c1 ]]; then
        echo "file $prevfile has duplicate $n3"
    else
        prevobj=$c1
        prevfile=$n3
    fi
done
exit 0
Regards,
Nails
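
If the goal is to print every set of matching files together rather than pair by pair, the same sorted-input idea can be sketched with awk. This is only an illustration, assuming cksum output of the form CRC, size, name. The function name group_dups is hypothetical.

```shell
#!/bin/sh
# group_dups is a hypothetical name. It reads cksum output that has been
# sorted so lines with equal CRC and size are adjacent, and prints each set
# of matching files as one block, with a blank line between blocks.
group_dups() {
    awk '
    {
        key = $1 " " $2
        name = $0
        sub(/^[^ ]+ +[^ ]+ +/, "", name)   # keep names that contain spaces
        if (key == prevkey) {
            if (!open) {                   # second member: start a new block
                if (groups++) print ""
                print prevname
                open = 1
            }
            print name
        } else {
            open = 0
        }
        prevkey = key
        prevname = name
    }'
}

# Usage, covering the whole tree under the current directory:
# find . -type f -exec cksum {} + | sort -k 1,1n -k 2,2n | group_dups
```

Files in one block share a CRC and a size, so they are very likely identical, but only a byte-for-byte comparison such as cmp can confirm it.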

