
To Extract information from an html


Name: nishana
Date: May 14, 2007 at 22:57:03 Pacific
OS: Windows
CPU/Ram: 256
Product: xxx
Comment:

Hi,
I have a file, let's say InputList, which contains a list of files.

InputList:

/home/kgeorg/nishana/survey_pages/common/AdvancedSearchPage.jsp
/home/kgeorg/nishana/survey_pages/common/Footer.jsp
/home/kgeorg/nishana/survey_pages/common/Header.jsp


etc.

I have to open each file in InputList, parse it, and get the information stored in the title and meta tags.

For example:
<Title>ABC Personal Finance, Credit Cards, Business Services, and Travel Services</title>

<META NAME ="KEYWORDS" CONTENT="credit card, express, Credit Cards, express financial advisors, travel and entertainment, express card, hotel reservations, aexpress travel, express credit card, express cards, express financial, financial advisor">
<META NAME ="DESCRIPTION" CONTENT="ABC offers individuals online access to its world-class Card, Financial, and Travel services, including financial advice, retirement planning, air and hotel reservations and more.">


I have to extract the title and the contents of the KEYWORDS and DESCRIPTION meta tags.

Could anyone help me out with a shell script using AWK/SED to do this?

Thanks,
Nishana





Response Number 1
Name: ghostdog
Date: May 14, 2007 at 23:39:48 Pacific
Reply:

experimental:
[code]

# gawk only: IGNORECASE makes every regex match case-insensitively
awk 'BEGIN { IGNORECASE = 1 }
# Assumes each tag fits on one line, as in the sample
/<Title>/,/<\/Title>/ {
    gsub(/<\/?Title>/, "")
    title = $0
}
/<META NAME ="KEYWORDS" CONTENT=/,/">/ {
    # strip the tag text first, then the leftover brackets and quotes
    gsub(/<META NAME ="KEYWORDS" CONTENT=/, "")
    gsub(/<|>|"/, "")
    keywords = $0
}
/<META NAME ="DESCRIPTION" CONTENT=/,/">/ {
    gsub(/<META NAME ="DESCRIPTION" CONTENT=|<|>|"/, "")
    description = $0
}
END {
    print "Title      : " title
    print "Keywords   : " keywords
    print "Description: " description
}' "file1"
[/code]



Response Number 2
Name: ghostdog
Date: May 14, 2007 at 23:40:41 Pacific
Reply:

edited:
[code]

# gawk only: IGNORECASE makes every regex match case-insensitively
awk 'BEGIN { IGNORECASE = 1 }
# Assumes each tag fits on one line, as in the sample
/<Title>/,/<\/Title>/ {
    gsub(/<\/?Title>/, "")
    title = $0
}
/<META NAME ="KEYWORDS" CONTENT=/,/">/ {
    # one pass: strip the tag text, the angle brackets and the quotes
    gsub(/<META NAME ="KEYWORDS" CONTENT=|<|>|"/, "")
    keywords = $0
}
/<META NAME ="DESCRIPTION" CONTENT=/,/">/ {
    gsub(/<META NAME ="DESCRIPTION" CONTENT=|<|>|"/, "")
    description = $0
}
END {
    print "Title      : " title
    print "Keywords   : " keywords
    print "Description: " description
}' "file1"
[/code]
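
The awk command above reads a single file named file1, while the question asks for every file listed in InputList. A minimal sketch of that loop follows, assuming InputList holds one path per line and each tag sits on a single line as in the sample. The function names extract_meta and extract_all are made up for illustration, and tolower() stands in for gawk's IGNORECASE so the script should also run under other POSIX awks.

```shell
#!/bin/sh
# extract_meta FILE: print the title, keywords and description of one page.
# Assumption: each tag sits on a single line, as in the sample from the question.
extract_meta() {
    awk '
    # Return the text between the double quotes that follow CONTENT=.
    function content(line,    s) {
        s = line
        sub(/^.*[Cc][Oo][Nn][Tt][Ee][Nn][Tt][ \t]*=[ \t]*"/, "", s)
        sub(/".*$/, "", s)
        return s
    }
    # Lower-case copy used only for matching; the original case is kept for output.
    { lower = tolower($0) }
    lower ~ /<title>/ {
        t = $0
        sub(/^.*<[Tt][Ii][Tt][Ll][Ee]>/, "", t)
        sub(/<\/[Tt][Ii][Tt][Ll][Ee]>.*$/, "", t)
        title = t
    }
    lower ~ /<meta[ \t]+name[ \t]*=[ \t]*"keywords"/    { keywords = content($0) }
    lower ~ /<meta[ \t]+name[ \t]*=[ \t]*"description"/ { description = content($0) }
    END {
        print "Title      : " title
        print "Keywords   : " keywords
        print "Description: " description
    }' "$1"
}

# extract_all LIST: run extract_meta on every path in LIST (one path per line).
extract_all() {
    while IFS= read -r page; do
        [ -f "$page" ] || continue    # skip blank lines and missing files
        echo "File       : $page"
        extract_meta "$page"
        echo
    done < "$1"
}
```

With the functions defined (pasted into a script or sourced), `extract_all InputList` prints one block per listed page. Tags that span several lines would need the range-pattern approach from the awk command above, or a real HTML parser.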


