help: parsing html file using perl

Name: esu (by Raj)
Date: June 12, 2007 at 14:03:05 Pacific
OS: linux
CPU/Ram: 512
Product: 64
Comment:

Hi All,

Can someone suggest a way to parse an HTML file and look for specific text using Perl, or any other programming language?

I have a file, abc.html, which is the output of an automated test suite. It contains the status of every test in the suite. All I want is to find each test name and its status (pass/fail) in the HTML file and write them to another text file called xyz.txt. Each line of xyz.txt should contain the test name, a space, the status, and then, separated by another space, the description of the test failure. The expected format is pasted below as xyz.txt.

==================================
abc.html file looks like this:
==================================
<html>
<body>
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th width="80%">Name</th><th>Tests</th><th>Errors</th><th>Failures</th><th nowrap="nowrap">Time(s)</th>
</tr>
<tr valign="top" class="Error">
<td>TestSuite</td><td>34</td><td>1</td><td>1</td><td>321.625</td>

</tr>
</table>
<h2>Tests</h2>
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Name</th><th>Status</th><th width="80%">Type</th><th nowrap="nowrap">Time(s)</th>
</tr>
<tr valign="top" class="TableRowColor">
<td>testLoginInitialLoad</td><td>Success</td><td></td><td>7.938</td>

</tr>
<tr valign="top" class="TableRowColor">
<td>testFailedSignOn</td><td>Success</td><td></td><td>7.156</td>
</tr>
<tr valign="top" class="TableRowColor">
<td>testLoginSignOn</td><td>Success</td><td></td><td>8.078</td>
</tr>
<td>testComponents</td><td>Failure</td><td>null



<code>junit.framework.AssertionFailedError: null at tests.Components.Components(ProductComponents.java:44)</code></td><td>7.625</td>
</tr>
<tr valign="top" class="TableRowColor">
<td>testProductColumnSort</td><td>Success</td><td></td><td>20.484</td>
</tr>

<tr valign="top" class="Error">
<td>testCompare</td><td>Error</td><td>Product missmatch: To compare , please select exactly two comp of the same product.



<code>tests.ProductMissmatchException: Product missmatch: To compare , please select exactly two comp of the same product. at tests.Compare.testCompare(Compare.java:45)</code></td><td>7.141</td>
</tr>
</body>
</html>
===========================================

xyz.txt
======
testLoginInitialLoad Success
testFailedSignOn Success
testLoginSignOn Success
testComponents Failure junit.framework.AssertionFailedError: null at tests.Components.Components(ProductComponents.java:44)
testCompare Error Product missmatch: To compare , please select exactly two comp of the same product. tests.ProductMissmatchException: Product missmatch: To compare , please select exactly two comp of the same product. at tests.Compare.testCompare(Compare.java:45)



Response Number 1
Name: dmj2
Date: June 12, 2007 at 23:26:30 Pacific
Reply:

Try this:

cat abc.html | sed -e 's/<td>/\%/g' -e 's/<[^>]*>//g' | egrep -v '^$' | tr "%" " " | egrep -i '(Suc|Fail|Err)'
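
For anyone reading along, here is the same pipeline split into its stages, as a rough sketch. The function name extract_rows is made up for illustration, and grep -E stands in for egrep; the rest follows the one-liner above.

```shell
#!/bin/sh
# extract_rows FILE (hypothetical name): same stages as the one-liner above.
#   sed  s/<td>/%/g      mark the start of each table cell with a placeholder
#   sed  s/<[^>]*>//g    delete every remaining tag
#   grep -v '^$'         drop lines that are now empty
#   tr '%' ' '           turn the placeholders into spaces
#   grep -E -i ...       keep only lines that mention a status word
extract_rows() {
    sed -e 's/<td>/%/g' -e 's/<[^>]*>//g' "$1" |
        grep -v '^$' |
        tr '%' ' ' |
        grep -E -i '(Suc|Fail|Err)'
}
```

Called as `extract_rows abc.html`. Two things to keep in mind: every cell start becomes a space, so each line begins with a space and an empty cell leaves two spaces in a row; and a literal % inside a cell would also be turned into a space.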


Response Number 2
Name: esu (by Raj)
Date: June 13, 2007 at 11:49:11 Pacific
Reply:

Great, this is what I'm looking for. Thank you very much.

I want to turn this into an executable so I can call it from another program or script. I'm not sure why the following snippet gives me an error.

#!/usr/bin/perl -w

$File="0_abc.html";
cat $File|sed -e 's/<td>/\%/g'-e 's/<[^>]*>//g'|egrep -v '^$'|tr "%" " " |egrep -i '(Suc|Fail|Err)' >> test.txt
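
Judging only from the snippet as posted, two things look likely to cause the error: the shebang hands the file to perl although the body is shell commands, and there is no space between the first sed expression and the following -e. A minimal sketch of a shell version follows; the function name summarize and its argument handling are assumptions for illustration, not something from this thread.

```shell
#!/bin/sh
# summarize [FILE] [OUT] (hypothetical name): append the status lines
# found in FILE to OUT. The defaults mirror the names used in the snippet.
summarize() {
    file=${1:-0_abc.html}
    out=${2:-test.txt}
    # Fail early with a message instead of letting sed complain.
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    sed -e 's/<td>/%/g' -e 's/<[^>]*>//g' "$file" |
        grep -v '^$' |
        tr '%' ' ' |
        grep -E -i '(Suc|Fail|Err)' >> "$out"
}
```

Saved in an executable file with a final line such as `summarize "$@"`, this can be called from another script. The exit status is that of the last grep, so it is non-zero when no status line was found.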



Response Number 3
Name: esu (by Raj)
Date: June 13, 2007 at 17:23:46 Pacific
Reply:

Hi there,

There was a typo in my script; I corrected it and it worked well. However, the script produces the output shown below. The thing I don't want to see in the output is the first line, NameTestsErrorsFailuresTime(s).
There is also white space at the beginning of each line, which is not required, and extra white space after the status, before the time is printed on each line.

In short:
1) Get rid of the first line.
2) Remove the white space from the beginning of each line.
3) Remove the extra white space after the status and before the time(s).
================================
This is the current output from the above script:
====================================
NameTestsErrorsFailuresTime(s)
testLoginInitialLoad Success 7.938
testFailedSignOn Success 7.156
testLoginSignOn Success 8.078
testHomeTabNone Success 16.469

==========================
We want the following output:
===========================
testLoginInitialLoad Success 7.938
testFailedSignOn Success 7.156
testLoginSignOn Success 8.078
testHomeTabNone Success 16.469
testDefectProductComponents Failure null
testDefectRunsCompare Error Product missmatch: To compare runs, please select exactly two runs of the same product.
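
The three cleanup items above could be handled by a small filter at the end of the pipeline. This is only a sketch: the function name tidy_rows is invented here, and it assumes the unwanted header line is exactly the one quoted above.

```shell
#!/bin/sh
# tidy_rows (hypothetical name): reads pipeline output on stdin.
#   1) drop the table-header line
#   2) remove white space at the start of each line
#   3) squeeze every run of white space down to one space
tidy_rows() {
    grep -v '^[[:space:]]*NameTestsErrorsFailuresTime(s)$' |
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]][[:space:]]*/ /g'
}
```

It would be appended to the existing command, for example `... | egrep -i '(Suc|Fail|Err)' | tidy_rows >> test.txt`.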


Response Number 4
Name: ghostdog
Date: June 15, 2007 at 23:17:14 Pacific
Reply:

[code]
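# Print every line from an opening <td> through a closing </td>,
# skipping the TestSuite summary row.
awk '/<td>/,/<\/td>/ {
    if ($0 ~ /TestSuite/) { next }
    gsub("<td>|</td>", " ", $0)      # replace cell tags with a space
    gsub("<code>|</code>", " ", $0)  # same for the stack-trace wrapper
    gsub("^ ", "", $0)               # drop the leading space this leaves
    print
}' "file"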
[/code]
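
A variant that aims at the xyz.txt layout from the first post (one line per test: name, status, then the failure text) might look like the sketch below. The function name report_rows is hypothetical, and the sketch assumes every table row ends with </tr> and that the status is one of Success, Failure or Error, as in the sample. It keeps the whole third cell, so the testComponents line would start with "null" before the exception text.

```shell
#!/bin/sh
# report_rows FILE (hypothetical name): one output line per test row.
report_rows() {
    awk '
    { row = row " " $0 }                      # collect the lines of one row
    /<\/tr>/ {
        n = split(row, cell, /<\/td>/)        # each cell ends at </td>
        row = ""
        if (n < 4) next                       # header rows have no <td> cells
        for (i = 1; i <= 3; i++) {
            gsub(/<[^>]*>/, "", cell[i])      # strip the remaining tags
            gsub(/[ \t\r]+/, " ", cell[i])    # collapse runs of white space
            gsub(/^ | $/, "", cell[i])        # trim both ends
        }
        # Skip rows such as the TestSuite summary, whose second cell is a count.
        if (cell[2] !~ /^(Success|Failure|Error)$/) next
        line = cell[1] " " cell[2]
        if (cell[3] != "") line = line " " cell[3]
        print line
    }' "$1"
}
```

Usage would be `report_rows abc.html > xyz.txt`. The time column is dropped because xyz.txt in the first post does not show it.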

