Solved Eliminate Duplicates and Extract Partial Matches From List

March 9, 2013 at 11:45:31
Specs: Windows Vista, 1.6 gb 4gb
Once again I am in desperate need of some help and direction! I was given a list of song titles with their performers and asked if I could come up with something that would remove any duplicates from the list and hive off another list showing all cases where the same song was sung by a number of different performers. Mistakenly, I thought this would be relatively straightforward, but I have ended up more than a little confused.

I thought I would eliminate the duplicates by reading the data into an array, exporting it to a Dictionary object and then writing it back. That seems to have worked OK. I then read the data (title and performer) into an array and ran a title match. This should have resulted in a "match" against each entry in the New_Array.txt file, and it did, except that it kept omitting one of the titles from the output file, i.e. the Differences.txt output file should have had 12 entries but only ever produced 11, and nothing I did made any difference (it omitted Rivers Of Babylon when the list was processed in the order I used). What I was trying to do was modify this "match" so that the code identified only those cases where the same song was sung by different performers and wrote these to the differences file.

I would be very grateful if someone could give me some direction

Thank you

The list is way too long to include, so here is a "sample" of the type of entries and the code I have so far:

Audio_List.txt
Rivers Of Babylon~Boney M*
Bohemian Rhapsody~Queen*
Mull Of Kintyre~Wings*
Loving Her Was Easier Than Anything I'll Ever Do Again~Jose Feliciano*
Country Roads~Olivia Newton John*
Loving Her Was Easier Than Anything I'll Ever Do Again~Lloyd Charmers*
Bohemian Rhapsody~Queen*
She Loves You~Beatles*
Loving Her Was Easier Than Anything I'll Ever Do Again~Tompall And The Glaser Brothers*
Do They Know It's Christmas~Band Aid*
She Loves You~Beatles*
Elizabethan Serenade~Boris Gardiner*
I Want To Wake Up With You~Boris Gardiner*
She Loves You~Beatles*
Country Roads~Toots And The Maytalls*
Loving Her Was Easier Than Anything I'll Ever Do Again~Lloyd Charmers*

As you will see, there are some duplicates, which the Dictionary object resolved, but the differences file should only have shown:

Differences.txt
Loving Her Was Easier Than Anything I'll Ever Do Again~Jose Feliciano
Loving Her Was Easier Than Anything I'll Ever Do Again~Lloyd Charmers
Loving Her Was Easier Than Anything I'll Ever Do Again~Tompall And The Glaser Brothers
Country Roads~Olivia Newton John
Country Roads~Toots And The Maytalls


[Code]
Const ForReading = 1
Const ForWriting = 2
Const ForAppending = 8

Set objFSO = CreateObject("Scripting.FileSystemObject")

'Read in file
Set objFile1 = objFSO.OpenTextFile("C:\Test\Audio_List.txt", ForReading)
strCurrentFile = objFile1.ReadAll

'Read string into array
arrItems = Split(strCurrentFile,"*",-1,0)

objFile1.Close

'Create Dictionary
Set objDictionary = CreateObject("Scripting.Dictionary")

'Add array to dictionary
For Each strItem in arrItems

If Not objDictionary.Exists(strItem) Then
objDictionary.Add strItem, strItem
End If
Next

'Determine number of entries
intItems = objDictionary.Count - 1

'Redimension array
ReDim arrItems(intItems)

'Read dictionary contents back into redim array
i = 0

For Each strKey in objDictionary.Keys
arrItems(i) = strKey
i = i + 1
Next

'Originally used as a check Point to see if duplicates were eliminated
Set objFile2 = objFSO.OpenTextFile("C:\Test\New_Array.txt", ForWriting,True,0)
For Each strItem in arrItems
strItem = Replace(strItem,vbCRLF,"",1,-1,0)

objFile2.writeLine strItem

Next
'Close the file so its contents are flushed before it is re-read below
objFile2.Close

'Read Data into variable
Set objFile3 = objFSO.OpenTextFile("C:\Test\New_Array.txt",ForReading)

strCurrent = objFile3.ReadAll

objFile3.Close

'Read data into variable for title extraction
Set objFile4 = objFSO.OpenTextFile("C:\Test\New_Array.txt", ForReading)

Do Until objFile4.AtEndOfStream

strAddress = objFile4.ReadLine
strPoint = Instr(1,strAddress,"~",0)

strTitle = Left(strAddress, strPoint)

'Match title against variable
If InStr(strCurrent, strTitle) > 1 Then

strNowCurrent = strNowCurrent & strAddress & vbCRLF

End If

Loop

'Write out differences to file
Set objFile5 = objFSO.OpenTextFile("C:\Test\Differences.txt",ForWriting,True,0)

objFile5.Write strNowCurrent
objFile5.Close

[/Code]
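
A note on the missing entry: VBScript's InStr returns a 1-based position, so a title found at the very start of strCurrent (most likely the first line, Rivers Of Babylon) returns 1 and fails the `> 1` test. Separately, every title occurs somewhere in the list it was read from, so that test alone cannot single out songs with several performers. Below is a rough sketch of the intended two-step logic, written in Python rather than VBScript. The helper names `dedupe` and `multi_performer` are made up for illustration, and the input is assumed to use the `Title~Performer*` layout of the sample.

```python
from collections import Counter

# A shortened version of the sample list from the post
SAMPLE = """Rivers Of Babylon~Boney M*
Bohemian Rhapsody~Queen*
Country Roads~Olivia Newton John*
Bohemian Rhapsody~Queen*
Country Roads~Toots And The Maytalls*
"""

def dedupe(text):
    """Return unique 'Title~Performer' entries in first-seen order."""
    entries = (e.strip() for e in text.split("*"))
    # dict.fromkeys keeps insertion order and drops repeated keys;
    # pieces without a "~" (such as the trailing newline) are skipped
    return list(dict.fromkeys(e for e in entries if "~" in e))

def multi_performer(entries):
    """Keep only entries whose title appears with more than one performer."""
    # entries are already unique pairs, so a title count above 1
    # means that title has more than one distinct performer
    counts = Counter(e.split("~", 1)[0] for e in entries)
    return [e for e in entries if counts[e.split("~", 1)[0]] > 1]

unique = dedupe(SAMPLE)
print("\n".join(multi_performer(unique)))
```

Counting titles only after the exact duplicates are gone is what keeps a song repeated by the same performer out of the differences list.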


#1
March 9, 2013 at 21:01:29
This seems to be one of those very rare instances where batch might be better suited (although it was still a b[a/i]tch to straighten out!):
::===== begin batchscript
@echo off>diffs & setlocal enabledelayedexpansion
for /f "tokens=1,2 delims=~" %%a in ('sort audio_list.txt') do (
if %%a~%%b neq !prevline! (
if %%a neq !prevtitle! (
set prevartist=%%b
set prevtitle=%%a
) else (
if %%b neq !prevartist! (
if defined prevline (
if !last! neq %%a~!prevartist! >> diffs echo %%a~!prevartist!
>>diffs echo %%a~%%b
set last=%%a~%%b
)
set prevartist=%%b
)
)
)
set prevartist=%%b
set prevline=%%a~%%b
)
::==== end batch

After tweaking your vbscript quite a bit (and mine even more!)
Set fso = CreateObject("Scripting.FileSystemObject")
arritems = split(fso.OpenTextFile("junip.txt", 1).readall,vbcrlf)
Set Dict1 = CreateObject("Scripting.Dictionary")
for i=0 to ubound(arritems)
bad="*** incomplete data: line "+cstr(i)+" ***"
a=split(arritems(i),"~")
if ubound(a)=1 then
if a(1)="" then a(1)=bad
If Not Dict1.Exists(a(0)) Then
Dict1.Add a(0),a(1)
else
test=dict1.item(a(0))
p=instr(test,a(1))
if p=0 then dict1.item(a(0))=test+"~"+a(1)
End If
else
if ubound(a)=0 then dict1.add a(0),bad
end if
Next

set z=fso.opentextfile("diffs",2,true)
set y=fso.opentextfile("audio",2,true)
for each key in dict1.keys
a=split(dict1.item(key),"~")
n=ubound(a)
if n>0 then
for i=0 to n
z.writeline key+"~"+a(i)
'enable the foll line IF you also want the "diffs" in the main file as well
y.writeline key+"~"+a(i)
next
else
y.writeline key+"~"+a(0)
end if
next
z.close
y.close

edited to save face. This one really made me look dumb, and i hate to leave turds in the path of future walkers. :-(



#2
March 11, 2013 at 01:54:04
Your batch file works perfectly. As I am only acquainted with the much simpler sort of batch file, I am always amazed at what you can get them to do! Unfortunately the VBS script produces error 800A0009 at line 21:1 - key=a(0) - subscript out of range 'number: 0'. I can't see anything obvious, but then I don't think I could have worked out coding like yours with my limited knowledge. Any ideas what it could be?

Thank you once again for your support



#3
March 11, 2013 at 08:23:56
For contrast, here's a Perl solution.

#!/usr/bin/perl

use 5.10.1;
use strict;
use warnings;
use Carp;
use List::Util qw(first);

my $input = 'Audio_list.txt';
my %titles;

open my $fh, '<', $input or croak "failed to open '$input' $!";
while ( my $line = <$fh>) {
    chomp $line;
    my ($title, $artist) = (split(/[~*]/x, $line))[0,1];
    unless ( first { $_ eq $artist } @{$titles{$title}} ) {
        push @{$titles{$title}}, $artist;
    }
}
close $fh;

for my $title (sort keys %titles) {
    for my $artist ( sort @{$titles{$title}}) {
	say join('~', $title, $artist) if @{$titles{$title}} > 1;
    }
}



#4
March 11, 2013 at 09:58:34
JuniperGreen: Any ideas what it could be?
Your input file probably has a blank line, or a line composed of whitespace, or something.

For reference, this is my version. It handles blank lines, but requires you to have .NET installed.

Set fso = CreateObject("Scripting.FileSystemObject")
Set songs = CreateObject("Scripting.Dictionary")

Set inFile = fso.OpenTextFile("audio_list.txt")
Do Until inFile.AtEndOfStream
  line = Split(Trim(inFile.ReadLine), "~", 2)
  If UBound(line) = 1 Then
    If Not songs.Exists(line(0)) Then _
      Set songs(line(0)) = CreateObject("System.Collections.ArrayList")
    If Not songs(line(0)).Contains(line(1)) Then _
      songs(line(0)).Add(line(1))
  End If
Loop

Set outFile = fso.OpenTextFile("diffs.txt", 2, True)
For Each key In songs.keys
  If songs(key).Count > 1 Then _
    outFile.WriteLine key & "~" & _
     Join(songs(key).ToArray(), vbNewLine & key & "~")
Next
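
The guard that lets the script above cope with blank lines is the `UBound(line) = 1` check: a line that does not split into exactly two parts is ignored. A minimal Python sketch of the same idea follows. `parse_entry` is a hypothetical helper, and stripping the trailing `*` is an assumption of this sketch that the VBScript does not make.

```python
def parse_entry(line):
    """Split 'Title~Performer' into a (title, performer) tuple, or None if malformed."""
    # Assumption: a trailing "*" is only a terminator and can be dropped
    parts = line.strip().rstrip("*").split("~", 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return None  # blank line, whitespace only, or a missing field
    return parts[0], parts[1]

lines = [
    "Country Roads~Olivia Newton John*",
    "",
    "   ",
    "No separator here",
    "She Loves You~Beatles*",
]
# Malformed lines are filtered out instead of raising a subscript error
pairs = [p for p in map(parse_entry, lines) if p is not None]
print(pairs)
```

Checking the shape of each line before indexing into it is what avoids a "subscript out of range" failure on stray blank lines.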

EDIT: If we're going to compare languages, PowerShell comes in under 10 lines:

$songs = @()
Get-Content audio_list.txt | ForEach-Object { 
  $t, $a = $_ -split "~", 2
  $songs += , (New-Object psobject -Property @{ Title = $t; Artist = $a })
}
$songs | Select-Object -Unique Title, Artist | Group-Object Title `
       | Where-Object { $_.Count -gt 1 } | ForEach-Object `
       { $_.Group | ForEach-Object { $_.Title + "~" + $_.Artist } } `
       > diffs

How To Ask Questions The Smart Way



#5
March 12, 2013 at 02:55:47
Hi Razor, thank you for your support again. I ran your script and while it produces the difference file with the correct entries, it does not eliminate duplicates in the original list. Perhaps I didn't make that clear in my post.

It never fails to impress me how many different ways there are to arrive at the same outcome using VBS and this all helps to consolidate my learning. I have no experience of PowerShell as yet given I am still striving to attain some degree of proficiency in VBS and don't want to confuse myself!

Thank you once again



#6
March 12, 2013 at 05:30:55
That's strange; I ran your test input against my script, and I got:
Loving Her Was Easier Than Anything I'll Ever Do Again~Jose Feliciano*
Loving Her Was Easier Than Anything I'll Ever Do Again~Lloyd Charmers*
Loving Her Was Easier Than Anything I'll Ever Do Again~Tompall And The Glaser Brothers*
Country Roads~Olivia Newton John*
Country Roads~Toots And The Maytalls*

Are you sure it doesn't work? If not, I'll have to chalk it up as a .NET incompatibility and rewrite it using a dictionary of dictionaries, and that's more work.
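
For what it's worth, the dictionary-of-dictionaries idea could look roughly like the following. This is a sketch in Python rather than VBScript, and `group_by_title` is a hypothetical name. The inner dictionary stands in for the ArrayList, with its keys acting as an ordered set of performers.

```python
def group_by_title(lines):
    """Map each title to an inner dict whose keys are its distinct performers."""
    songs = {}
    for line in lines:
        parts = line.strip().split("~", 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        title, performer = parts
        # setdefault creates the inner dict the first time a title is seen;
        # assigning an existing performer key again changes nothing
        songs.setdefault(title, {})[performer] = True
    return songs

songs = group_by_title([
    "Country Roads~Olivia Newton John*",
    "She Loves You~Beatles*",
    "Country Roads~Toots And The Maytalls*",
    "She Loves You~Beatles*",
])

# Only titles with more than one distinct performer go to the diffs output
for title, performers in songs.items():
    if len(performers) > 1:
        for performer in performers:
            print(f"{title}~{performer}")
```

Like the script it mirrors, this keeps the trailing `*` on each performer, so the printed lines match the input format.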


#7
March 13, 2013 at 02:10:58
Nbrane,
I ran your revised script again (unchanged apart from an adjustment for the "object required" error for FSO in "Set Z ="), but it produces an empty diff file and the duplicates still remain in the Audio_List file.

Razor,
The diff file contains the correct items, but in my original post I said that I needed to delete any duplicates in the original Audio_List. To eliminate these I wrote the contents of the dictionary back to the Audio_List. I see your script isn't coded to do that, so I don't think it is a .NET 3.5 incompatibility.

The duplicates which should have been eliminated from the Audio_List are:

Bohemian Rhapsody~Queen*
She Loves You~Beatles*
Loving Her Was Easier Than Anything I'll Ever Do Again~Lloyd Charmers*

The diff file should contain:

Loving Her Was Easier Than Anything I'll Ever Do Again~Jose Feliciano*
Loving Her Was Easier Than Anything I'll Ever Do Again~Lloyd Charmers*
Loving Her Was Easier Than Anything I'll Ever Do Again~Tompall And The Glaser Brothers*
Country Roads~Olivia Newton John*
Country Roads~Toots And The Maytalls*

Thank you both for your support. I really do appreciate it.



#8
March 13, 2013 at 11:59:36
Ah, I forgot about that requirement. I edited my vbscript to fix this omission. I was not sure if you wanted the 'same-name-different-artist' songs in the audio list as well, or kept separate in "diffs", so there's a line you can enable or disable to have it either way.


#9
March 13, 2013 at 17:11:44
✔ Best Answer
Oh, sorry. Missed that part:
Set fso = CreateObject("Scripting.FileSystemObject")
Set songs = CreateObject("Scripting.Dictionary")

Set inFile = fso.OpenTextFile("audio_list.txt")
Do Until inFile.AtEndOfStream
  line = Split(Trim(inFile.ReadLine), "~", 2)
  If UBound(line) = 1 Then
    If Not songs.Exists(line(0)) Then _
      Set songs(line(0)) = CreateObject("System.Collections.ArrayList")
    If Not songs(line(0)).Contains(line(1)) Then _
      songs(line(0)).Add(line(1))
  End If
Loop

Set outFile = fso.OpenTextFile("diffs.txt", 2, True)
Set newFile = fso.OpenTextFile("audio_list.new.txt", 2, True)
For Each key In songs.keys
  line = key & "~" & Join(songs(key).ToArray(), vbNewLine & key & "~")
  newFile.WriteLine line
  If songs(key).Count > 1 Then _
    outFile.WriteLine line
Next

And since I already did PS:

$songs = @()
Get-Content audio_list.txt | ForEach-Object { 
  $t, $a = $_ -split "~", 2
  $songs += , (New-Object psobject -Property @{ Title = $t; Artist = $a })
}
$songs | Select-Object -Unique Title, Artist | Group-Object Title `
       | Where-Object { $_.Count -gt 1 } | ForEach-Object `
       { $_.Group | ForEach-Object { $_.Title + "~" + $_.Artist } } `
       > diffs.txt
$songs | Select-Object -Unique Title, Artist | ForEach-Object `
       { $_.Title + "~" + $_.Artist } > audio_list.new.txt


#10
March 14, 2013 at 04:55:01
Hi Nbrane, I hate to tell you, but I still can't get your script to work properly. The diff file always comes up blank, and although the revised Audio List file has no duplicates, where a song is sung by more than one performer only one version of the song is held.

Thank you for your help so far



#11
March 14, 2013 at 12:16:49
Hi Razor,

I ran your script and it produces the correct outcomes for the Audio_List and the diffs files. I tried it with the intended Audio file and it works perfectly.

Many thanks for your time, effort and support


