Batch to merge saved web pages

June 22, 2013 at 04:10:45
Specs: Windows 7

I have a lot of web pages (over 4000) saved from one website on my hard drive. Each saved page consists of one html file and one folder where the pictures, css, js, etc. are stored.

Since they were all saved from the same website, which uses the same pictures, css, js, etc. on every page, there are over 4000 duplicates of the same pictures, css, js, etc.!

My idea was to merge all those folders, which contain the same things, into one folder "data", and then edit all the picture links in the htmls to point to that one folder "data". This way there won't be any duplicates!

For instance, let's say the name of the html is "How to animate a 3d model.html". Its folder would be "How to animate a 3d model_files", and the code that refers to this folder in the html would be src="How%20to%20animate%20a%203d%20model_files/picture.jpg".

So I was thinking of doing this job with a loop that would replace all the paths in the htmls and merge all the data folders into one, but this seems to be beyond my batch knowledge :(

Can anyone help me? Or at least give me a link that would help?



June 22, 2013 at 10:23:50
Sounds like a hassle. A better way would be to scan for duplicates and replace the redundant copies with hard links.
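
A rough sketch of the hard-link idea, written in Python rather than batch since that is easier to show briefly. It assumes all the saved pages live under one root folder on a single NTFS volume (hard links cannot cross volumes), and the function names are made up for illustration. Files are matched by SHA-256 digest, which is one reasonable choice, not the only one. The htmls do not need editing, because every path keeps working.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root: Path) -> int:
    """Replace duplicate files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    # Group every file under root by the digest of its content.
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)

    replaced = 0
    for paths in by_digest.values():
        keeper = paths[0]  # first copy found stays as the real file
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue  # already linked to the keeper
            tmp = dup.with_name(dup.name + ".tmp_link")
            os.link(keeper, tmp)   # create the new link first...
            os.replace(tmp, dup)   # ...then swap it over the duplicate
            replaced += 1
    return replaced
```

Calling hardlink_duplicates(Path(r"C:\saved_pages")) (a hypothetical folder) would leave every file name in place while the identical copies share one set of disk blocks. Try it on a copy of the folder first.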

How To Ask Questions The Smart Way


June 23, 2013 at 19:35:24
There are unfilled blanks in solving this, and I feel it will not be "cost-effective" considering how cheap disk space is. Mainly: are the filenames unique to each file across all the subdirs? For example, if html #1 refers to "01.jpg", then 01.jpg must mean the same file in all 4000 subfolders (doubtful, but possible). Most websites and pages depend on subdirectories to isolate items with common names: xx/01.jpg vs. yy/01.jpg.

So, if you can't go by names, you would have to sort all the files (/s, i.e. recursively) by size, then for each size value you would have to loop with "FC" (or "FSUM") to compare. Each file that is equal to the first one is deleted until you reach the end of that size-group, then the first one is moved to "data". Then repeat for the next remaining file in that size-group until the end of the size-group is reached, then start again on the next size-group. It's doable, but the question is whether it's worth it. Regardless, I would never depend on file names to indicate whether files have the same content.
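
The size-group procedure above could look roughly like this in Python (not batch; filecmp.cmp stands in for FC). This is only a sketch under some assumptions: every resource folder is named "*_files" and sits directly under one root, subfolders inside them are flattened into "data", and the name merge_into_data is invented for the example. When two different files share a name, the second one gets a numeric suffix, so names are never trusted to indicate content.

```python
import filecmp
import shutil
from collections import defaultdict
from pathlib import Path


def merge_into_data(root: Path, data: Path) -> dict:
    """Move one copy of each distinct file from the *_files folders into data.

    Returns a mapping from every original path to the path in data that
    now holds its content.
    """
    data.mkdir(exist_ok=True)

    # Step 1: group all resource files by size.
    by_size = defaultdict(list)
    for folder in sorted(root.glob("*_files")):
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                by_size[path.stat().st_size].append(path)

    moved = {}
    for group in by_size.values():
        kept = []  # distinct files of this size already moved to data
        for path in group:
            # Step 2: byte-for-byte comparison against the kept files.
            match = next(
                (k for k in kept if filecmp.cmp(k, path, shallow=False)), None
            )
            if match is None:
                # New content: move it, renaming on a name clash.
                target = data / path.name
                n = 1
                while target.exists():
                    target = data / f"{path.stem}_{n}{path.suffix}"
                    n += 1
                shutil.move(str(path), str(target))
                kept.append(target)
                match = target
            else:
                path.unlink()  # duplicate of a file already in data
            moved[path] = match
    return moved
```

The returned mapping is what a later pass over the htmls would need: for each old path such as "How to animate a 3d model_files/picture.jpg" it gives the new location under "data", including any renamed files.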
