How to Parse Comments from HTML

June 27, 2009 at 11:33:30
Specs: Windows XP
Hello All...!

I've been toying with ways to extract the comments from an HTML file and am looking for some help doing it in Java.

I'd like to pass the HTML file to the Java program on the command line and have it write the HTML comments (i.e., <!-- ... -->) to a separate text file with the same base name.

example:
1. read HomePage.html
2. parse comments
3. write to HomePage.txt

I'm sure this is trivial for most people on this forum, but I'm looking for some advice/guidance.

Thanks much in advance.
FM




#1
June 27, 2009 at 17:17:06
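If you have gawk for Windows (see my sig), this script prints only the comment lines:
# Line closes a multi-line comment: drop everything after "-->", clear the flag
/-->/ && !/<!--/{
	gsub(/-->.*/,"");
	f=0;
	print $0"-->"
}
# Comment opens and closes on the same line: print the whole line
/<!--/ && /-->/{print}
# Comment opens but does not close: set the flag
/<!--/ && !/-->/{f=1}
# While the flag is set, print every line
f{print}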

Save it as test.awk and run it from the command line:

C:\test>more file.txt
*caxxxxxx/
<!-- *ghhhgahh -->
some lines
some lines
<!-- comment lines start
comment lines
comment lines
comment lines blah blah -->122.5676
some lines
some lines

C:\test>gawk -f test.awk file.txt
<!-- *ghhhgahh -->
<!-- comment lines start
comment lines
comment lines
comment lines blah blah -->


In Java, do the same for each line:
1) If the line contains --> but not <!--, reset the flag to 0, remove everything after --> (keep what comes before it), and print the line.
2) If <!-- and --> appear together on a line, print the line.
3) If only <!-- appears on a line, set the flag to 1.
4) If the flag is 1, print the line.
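
Here is a rough Java sketch of those four steps, in case it helps. It is only one way to do it, not a tested solution. The class name CommentExtractor and the helpers extractCommentLines and txtSibling are names I made up for illustration, and it assumes Java 8 or later and a UTF-8 input file. Like the awk script, it works line by line, so a line holding several comments, or text before <!--, is printed whole.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CommentExtractor {

    // Hypothetical helper: returns the lines that belong to HTML comments,
    // mirroring the line-based flag logic of the awk script.
    public static List<String> extractCommentLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean inComment = false; // the "flag" from steps 3 and 4
        for (String line : lines) {
            int open = line.indexOf("<!--");
            int close = line.indexOf("-->");
            if (close >= 0 && open < 0) {
                // Step 1: comment ends here; keep text up to and including "-->"
                out.add(line.substring(0, close + 3));
                inComment = false;
            } else if (open >= 0 && close >= 0) {
                // Step 2: comment opens and closes on this line; keep the whole line
                out.add(line);
            } else if (open >= 0) {
                // Step 3: comment opens but does not close; set the flag
                inComment = true;
                out.add(line);
            } else if (inComment) {
                // Step 4: inside a multi-line comment
                out.add(line);
            }
        }
        return out;
    }

    // Hypothetical helper: HomePage.html -> HomePage.txt in the same directory.
    public static Path txtSibling(Path html) {
        String name = html.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return html.resolveSibling(base + ".txt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CommentExtractor <file.html>");
            return;
        }
        Path in = Paths.get(args[0]);
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        Files.write(txtSibling(in), extractCommentLines(lines), StandardCharsets.UTF_8);
    }
}
```

Compile it with javac CommentExtractor.java and run it as java CommentExtractor HomePage.html; the comments should end up in HomePage.txt next to the input file. For messy real-world HTML a proper HTML parser would be more robust than matching lines like this.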

GNU win32 packages | Gawk

