html - Ignore malformed XML with Perl-XML
I'm using the Perl command-line utility xpath to extract data from HTML code, as follows:
    #!/bin/bash
    echo "$html" | xpath -q -e "//h2[1]"
The HTML is malformed, which causes xpath to throw the error below:
    not well-formed (invalid token) at line x, column y, byte z:
I can't fix the HTML since it's provided by an external source, which means that every time the HTML changes I would have to fix it manually again.
I looked at the xpath man page, but it's pretty empty: http://www.linuxcertif.com/man/1/xpath.1p/
I'm wondering whether there is a way to tell xpath to ignore malformed HTML. To give an idea of how malformed it is, here are a few lines of the source code:
    <div id="header-background" style="top: 42px; > </div>   <---- missing closing "
    <div id-"page-inner">   <---- - instead of =
Thanks.
Try out HTML::TreeBuilder::XPath. It uses an HTML parser to build a document that can then be queried using XPath expressions. The HTML parser should be OK with the malformed markup that a strict XML parser rejects.
Also see the article on HTML scraping with XPath.