Solved Collect data from HTML files using batch?

January 22, 2016 at 18:50:03
Specs: Windows 7
I have a folder of websites and am looking for a batch file that can automatically read every ".html" file in EVERY directory of that folder, collecting the file's path, HTML title, meta description, and meta keywords (if any) of each page, to be fed into an offline local search engine. I figured this could be possible by getting the code to read between certain markers, e.g. <TITLE>Page Title</TITLE>, and logging the result line by line.

I've been struggling to do this. I'd essentially need the information collected to be divided by tabs and new lines so it can be imported into a spreadsheet easily. Any help much appreciated!

message edited by jameswsparker




#1
January 22, 2016 at 20:14:40
"One step at a time" as they say. Nothing will work straight off the line, so first, use FIND or FINDSTR to get the strings you seek:
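@echo off>test & setlocal
rem Collect the raw matching lines from every .htm or .html file below this folder.
for /f "tokens=*" %%a in ('dir /b /s *.htm?') do (
>>test echo ----- %%a
find /I "<TITLE" <"%%a" >>test
rem Meta tags name the field in an attribute, so match name=keywords with any quote character.
findstr /I /R /C:"<meta.*name=.keywords" "%%a" >>test
findstr /I /R /C:"<meta.*name=.description" "%%a" >>test
)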

Then just see what "test" looks like and, if usable, refine it to your specs.
NOTE: some HTML does not use line feeds, which will totally confuse FIND since it works line by line: the whole file comes back as one match.
For those files, you might want to use VBScript to insert a line feed before each <.

message edited by nbrane



#2
January 22, 2016 at 23:18:59
This is really useful, thank you. I've modified this to only collect the page title at this time (see below). Is it possible however to get the "echo Title: " command I've added to appear on the same line as the outputted data? for example, "Title: Website Name Goes Here", and for when a <title> tag isn't found, for it to still show "Title: " and just leave it blank? :)

Also, do you know how to only list the FIRST result found and to ignore the ones afterwards? I've found some pages have multiple <title> tags.

Lastly, one really tricky thing I'm trying to do is to ONLY list the text inside the <TITLE> tags and make it appear after "echo Title: " without outputting the actual tags themselves. Sometimes, other code sitting right next to the <TITLE> gets outputted also and things get messy. Ha-ha. Here's a copy of the code so far:


@echo off>"List.txt" & setlocal
for /f "tokens=*" %%a in ('dir /b /s *.htm?') do (
>>"List.txt" echo URL: %%a
echo Title: >>"List.txt"
find /I "<TITLE" <"%%a" >>"List.txt"
)
exit

I really appreciate your assistance with this by the way. This will make indexing sites much easier for me. Once the code is complete, I'll allow it to also collect the description and keyword tags again, but for now I just hope you can assist! :D :)

message edited by jameswsparker



#3
January 23, 2016 at 19:34:41
✔ Best Answer
"Sometimes, other code sitting right next to the <TITLE> gets outputted also and things get messy."
Much depends on whether the junk is inside or outside the < > tag delimiters.
This is my initial attempt. I prob'ly got the token line-up wrong, so work with it; I don't know what
your raw data looks like exactly. The subroutine (xx) is designed to take whichever tag you pass it.
"title" is the one to get working first, then do the same for the keywords and description meta tags.

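@echo off>test & setlocal
for /f "tokens=*" %%a in ('dir /b /s *.htm?') do (
>>test echo ----- %%a
call :xx "<title" "%%a" 2
call :xx "<meta.*name=.keywords" "%%a" 1
call :xx "<meta.*name=.description" "%%a" 1
)
goto :eof

:xx
rem Arguments: quoted search pattern, quoted file path, number of the token to keep.
rem The token numbers assume the tag is the first thing on its line, with no indentation.
for /f "tokens=%3 delims=<>" %%b in ('findstr /I /R /C:%1 %2') do (
>>test echo %1: %%b
rem Jump out so we only get the first match.
goto :eof
)
rem No usable match, so still log the label with nothing after it.
>>test echo %1:
goto :eof
::======= end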
I know this is not going to the finish-line. It's just a series of suggestions along the way...
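
For the title on its own, here is a rough sketch of another way to get "Title: text" on one line, keep only the first match, drop the tags, and still print "Title: " when nothing is found. It is untested against your pages and leans on some assumptions: the tag is written exactly as <title> with no attributes, the title text sits on the same line as the opening tag, and "List.txt" is just the output name borrowed from post #2. The variable names (out, file, line, rest, text) are made up for the example.

```batch
@echo off
setlocal EnableExtensions DisableDelayedExpansion
rem Assumed output name. A .txt file is not picked up by the *.htm? scan below.
set "out=List.txt"
type nul >"%out%"

for /f "delims=" %%a in ('dir /b /s *.htm?') do (
    set "file=%%a"
    set "line="
    rem Remember only the first line that contains the opening title tag, in any letter case.
    for /f "delims=" %%l in ('findstr /I /C:"<title>" "%%a"') do if not defined line set "line=%%l"

    rem Delayed expansion is switched on only after the raw line has been stored.
    setlocal EnableDelayedExpansion
    set "text="
    if defined line (
        rem Remove everything up to and including the opening title tag.
        set "rest=!line:*<title>=!"
        rem If a tag comes straight after, the title is empty. Otherwise keep the text before the next tag.
        if not "!rest:~0,1!"=="<" for /f "delims=<" %%t in ("!rest!") do set "text=%%t"
    )
    >>"%out%" echo URL: !file!
    >>"%out%" echo Title: !text!
    endlocal
)
```

Run it from the top folder of the sites. Each page should produce a "URL:" line followed by a "Title:" line, with the title left blank when no <title> line was found. One known rough edge: because delayed expansion is on while the title is copied, a title containing ! may come out altered.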

message edited by nbrane

