Extracting hyperlinks from HTML files without Word/Excel

January 24, 2013 at 02:04:39
Specs: Windows Vista, 1.6 gb 4gb

To pull out hyperlinks I usually rely on VBScript driving Word or Excel, or even OpenOffice using its own Basic. However, I need to extract and list the full hyperlink address and the accompanying "Display Text" of all the hyperlinks in quite a number of HTML files, in a situation where recourse to the aforementioned programs will not be possible. As the full link address is not visible on the page (only the Display Text is), it would be a long, laborious job for someone to go through each file by hand, and it would call for some knowledge of HTML since they would have to examine the source code. I doubt the "Edit Hyperlink" route via an editor would even be available to them.

To cut a long story short, does anyone know of a VBScript which will do the job? I have scoured the net and the forum and can't see anything which addresses this. I would appreciate anyone's help.

Thank you


#1
January 24, 2013 at 02:55:58

Post a few lines and show what output is needed.

=====================
M2 Golden-Triangle


#2
January 24, 2013 at 03:53:26

Apologies for this seemingly cluttered post, but there didn't seem to be any way to attach a file. Buried in the text below, from an HTML file, are two links and their "display text":

CAPTURE THE SPIRIT OF BRAVE http://www.skye.co.uk/top-tips.php?...

BECAUSE WE KNOW MONEY DOESN'T GROW ON TREES http://www.skye.co.uk/special-offer...


The output only needs to be single line as per the above.

Any help gratefully accepted!

Thank you


Sample Text:

<td width="10" rowspan="2"></td>
<td rowspan="2" style="background-color:#ffffff;width:75px;height:60px;border-left:4px solid #bb4e34;border-right:4px solid #bb4e34;" align="center"><img src="/common/script/resize-image.php?image=/images/adverts/id353-brave-button.jpg&max_width=75&max_height=60" border="0" alt="" align="absmiddle" /></td>

<td width="115" height="60" rowspan="2" valign="top" style="padding:0 2px;">
<a class="adverttextblock" href="http://www.skye.co.uk/top-tips.php?id=37" target="_blank" onclick="link_handler(353, 0, 0);"> <div class="adverttitle">CAPTURE THE SPIRIT OF BRAVE</div>
<div class="advertstrapline">visit the mediaevil castles of Skye & Lochalsh</div>
</td>
<td height="10" width="10" style="background-color:#bb4e34;"></td>
</tr>

<tr>
<td width="10" height="50"> </td>
</tr>
</table>
</div>
<div style="padding-top:20px;"> <table cellpadding="0" cellspacing="0" width="210" height="60" class="advert">
<tr>
<td width="10" rowspan="2"></td>
<td rowspan="2" style="background-color:#ffffff;width:75px;height:60px;border-left:4px solid #990066;border-right:4px solid #990066;" align="center"><img src="/common/script/resize-image.php?image=/images/adverts/id361-money-on-trees-button.jpg&max_width=75&max_height=60" border="0" alt="" align="absmiddle" /></td>

<td width="115" height="60" rowspan="2" valign="top" style="padding:0 2px;">
<a class="adverttextblock" href="http://www.skye.co.uk/special-offers.php" target="_blank" onclick="link_handler(361, 0, 0);"> <div class="adverttitle">BECAUSE WE KNOW MONEY DOESN'T GROW ON TREES</div>
<div class="advertstrapline">we have some new Spring special offers</div>
</td>
<td height="10" width="10" style="background-color:#990066;"></td>
</tr>


#3
January 24, 2013 at 04:45:34

I can't find "display text" in the sample.

=====================
M2 Golden-Triangle


#4
January 24, 2013 at 05:21:11

Sorry, I've misled you by using the generic name for the text which usually accompanies hyperlinks. The text appears after the tag <div class="adverttitle">. Unfortunately it has just occurred to me that each HTML file may well tag this differently.

I'm not very good with batch files but I think it might be impossible to get round this as the actual text would be unknown and would be sandwiched between undetermined tags.

However, I have just looked at the actual file in Word using its "Edit Hyperlink" dialogue. It identifies the URL correctly, but the Display Text field merely offers "<<shown in document>>". LibreOffice (the latest incarnation of OpenOffice), on the other hand, gives the correct URL and also the correct text, "Capture the spirit of brave". This shows that it can be done, although how is the big question! I didn't appreciate just how complicated this might be, and I'm kind of reconciled now to accepting it's a no-go without using Word/Excel/OpenOffice.

Thank you anyway for the interest you've shown!

#5
January 24, 2013 at 19:27:17

Well, you might try this vbscript. It is self-explanatory:
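'====== begin vbscript href.vbs
set fso=createobject("scripting.filesystemobject")
lin=fso.opentextfile("h.htm",1).readall 'whole file at once, so a tag split across lines stays intact
set htmout=fso.createtextfile("h.out")
href=""
x=split(lin,"<") 'one fragment per tag
for i=0 to ubound(x)
  y=split(x(i),">") 'y(0)=inside of the tag, y(1)=text that follows it
  'a fragment with no ">" has no y(1), so require at least two pieces
  if ubound(y) > 0 then
    p=instr(lcase(y(0)),"href")
    if p>0 then href=mid(y(0),p+4)
    if href<>"" and trim(y(1))<>"" then
      q=split(href,chr(34)) 'address sits between the first pair of double quotes
      if ubound(q) > 0 then
        htmout.writeline q(1) & "::" & trim(y(1))
        'this is just for debugging purposes (next 3); needs cscript, not wscript
        wscript.echo "href: " & q(1)
        wscript.echo "disp: " & trim(y(1))
        o=wscript.stdin.readline
      end if
      href=""
    end if
  end if
next
htmout.close
'===== end vbscript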

#6
January 25, 2013 at 01:18:03

nbrane,

Thank you for coming to my rescue! Unfortunately, when I ran the script it produced runtime error 800A0009 at line 14:1, namely ( if href<>"" and ltrim(rtrim((y(1))))<>"" then). The error was: Subscript out of range:[number:1].

I haven't changed anything in the script, and as I just copied it over it is unlikely to be a typo I've made. Any suggestions?

Thank you again


#7
January 25, 2013 at 11:31:00

It's probably a "split line" where the line-break is messing up the tags synch. Try the modified script in response #5.
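
For what it's worth, "Subscript out of range" most likely means Split came back with fewer pieces than the script indexes into. A minimal sketch of the guard, where "frag" is a made-up fragment (not taken from your file) standing in for a tag whose closing ">" landed on the next line:

```vbscript
' Sketch only: frag is a hypothetical tag fragment with no ">" in it,
' which is what a tag broken across two lines looks like when read line by line.
dim frag, y
frag = "a class=""adverttextblock"""
y = split(frag, ">")
' split returned a single element, so ubound(y) is 0 and y(1) does not exist
if ubound(y) > 0 then
  wscript.echo "text after tag: " & y(1)
else
  wscript.echo "no text after tag, skipping"
end if
```

Testing ubound(y) before touching y(1) avoids the error whatever the cause of the short fragment.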

#8
January 26, 2013 at 02:21:49

nbrane,

That was the version which produced the runtime error.

Thanks anyway


#9
January 26, 2013 at 10:05:23

If you look closely, I believe you will note that line two is different, and there's no do-loop for reading the file. The whole file is read in one piece instead of line by line. I guess I should have re-posted; that would have been less confusing. If it still throws an error, I will need a sample of the HTML that's throwing it. I've just been using the original snippet you posted up top.
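
Since you mentioned having quite a number of files, here is a rough sketch of reading every HTML file in a folder in one piece, the same way. The folder path and the .htm/.html filter are my assumptions, not anything from your setup, and the link extraction itself is left out:

```vbscript
' Sketch only: C:\htmlfiles is a hypothetical folder; change it to suit.
option explicit
dim fso, folder, f, ext, html
set fso = createobject("scripting.filesystemobject")
set folder = fso.getfolder("C:\htmlfiles")
for each f in folder.files
  ext = lcase(fso.getextensionname(f.name))
  if ext = "htm" or ext = "html" then
    ' readall raises an error on an empty file, so skip those
    if f.size > 0 then
      html = f.openastextstream(1).readall   ' 1 = ForReading
      wscript.echo f.name & ": " & len(html) & " characters read"
    end if
  end if
next
```

Run it with cscript so the echo lines go to the console instead of popping up one message box per file.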
