How to extract text from html?

Name: los111
Date: September 20, 2005 at 09:37:11 Pacific
OS: Linux/Windows XP
CPU/Ram: 2.8GHz
Comment:

I have an HTML file called myfile.html. If I simply run "cat myfile.html" in UNIX, it shows all the HTML tags, but I want to extract only the text.

The same thing happens with the "type" command in MS-DOS.

I know I can do it by opening the file in Internet Explorer, choosing "Save as text", and then opening the result. But I need to do this from UNIX, as I have thousands of HTML files and no time to convert them to text files one by one. I went through many books but can't find a way. I would really appreciate your help.



Response Number 1
Name: 3Dave
Date: September 21, 2005 at 02:09:44 Pacific
Reply:

You can use lynx with the -dump switch, e.g.:
$ lynx -dump http://www.mysite.com/page.html > page.txt

You can do this with local files too:
$ lynx -dump /path/to/page.html > page.txt

If you have lots of files then you may want to write a little script:

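#!/bin/bash
# Convert every .html file in the current directory to plain text.
for page in *.html
do
    [ -e "$page" ] || continue   # skip the literal pattern if nothing matches
    # Quoting "$page" keeps file names with spaces intact.
    lynx -dump "$page" > "$page.txt"
done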
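
Since the question mentions thousands of files, they may well be spread over subdirectories, which the loop above does not descend into. Below is a minimal sketch of a recursive variant, assuming lynx is installed and on the PATH. The name convert_tree is a hypothetical helper chosen for illustration, and writing page.txt next to page.html (rather than page.html.txt) is a choice made here, not something taken from the reply above.

```shell
#!/bin/sh
# Sketch only: convert_tree is a hypothetical helper name, not part of lynx.
# Converts every .html file under a directory tree to plain text.
convert_tree() {
    # $1 is the directory to search; default to the current directory.
    find "${1:-.}" -type f -name '*.html' -exec sh -c '
        # Each matching file arrives as a positional argument.
        for page do
            # Write foo.html -> foo.txt next to the original file.
            lynx -dump "$page" > "${page%.html}.txt"
        done
    ' sh {} +
}
```

Call it as `convert_tree /path/to/html/files`, or with no argument to process the current directory. Handing the file names to sh as arguments, instead of parsing a listing, keeps names that contain spaces intact.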

