BeautifulSoup (Python) and parsing HTML table
##### UPDATE ###### : renderContents() instead of contents[0] did the trick. I will still leave this open in case someone can provide a better, more elegant solution!
I am trying to parse a number of web pages for the desired data. The table doesn't have a class/id attribute, so I have to search for 'website' in the tr contents.
Problem at hand: displaying td.contents works fine with text but not with hyperlinks, for some reason. What am I doing wrong? Is there a better way of doing this using BS in Python?
To those suggesting lxml: I have an ongoing thread here about CentOS, and lxml installation without admin privileges is proving to be a handful at this time. Hence I am exploring the BeautifulSoup option.
HTML sample:

<table border="2" width="100%">
  <tbody>
    <tr>
      <td width="33%" class="boldtd">website</td>
      <td width="33%" class="boldtd">last visited</td>
      <td width="34%" class="boldtd">last loaded</td>
    </tr>
    <tr>
      <td width="33%"> <a href="http://google.com"></a> </td>
      <td width="33%">01/14/2011 </td>
      <td width="34%"> </td>
    </tr>
    <tr>
      <td width="33%"> stackoverflow.com </td>
      <td width="33%">01/10/2011 </td>
      <td width="34%"> </td>
    </tr>
    <tr>
      <td width="33%"> <a href="http://stackoverflow.com"></a> </td>
      <td width="33%">01/10/2011 </td>
      <td width="34%"> </td>
    </tr>
  </tbody>
</table>
Python code so far:

# Python 2 with BeautifulSoup 3; path and file are defined elsewhere
f1 = open(path + "/" + file)
pagesource = f1.read()
f1.close()

soup = BeautifulSoup(pagesource)
alltables = soup.findAll("table", {"border": "2", "width": "100%"})
print "number of tables found : ", len(alltables)

for table in alltables:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            # only the first child of each cell is printed
            print td.contents[0]
from BeautifulSoup import BeautifulSoup

pagesource = '''...omitted for brevity...'''
soup = BeautifulSoup(pagesource)
alltables = soup.findAll("table", {"border": "2", "width": "100%"})

results = []
for table in alltables:
    rows = table.findAll('tr')
    lines = []
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            # render the whole cell, tags included, not just its first child
            text = td.renderContents().strip('\n')
            lines.append(text)
    text_table = '\n'.join(lines)
    # keep only the tables that mention 'website'
    if 'website' in text_table:
        results.append(text_table)

print "number of tables found : ", len(results)
for result in results:
    print(result)
yields

number of tables found : 1
website
last visited
last loaded
<a href="http://google.com"></a>
01/14/2011
stackoverflow.com
01/10/2011
<a href="http://stackoverflow.com"></a>
01/10/2011
Is this close to what you are looking for? The problem is that td.contents returns a list of NavigableStrings and soup tags. For instance, running print(td.contents) might yield

['', '<a href="http://stackoverflow.com"></a>', '']

so picking off the first element of the list makes you miss the <a> tag.
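If installing extra parsers is a problem (as with lxml), the same point can be illustrated with nothing but the standard library. The following is only a rough sketch, not BeautifulSoup: it swaps in Python 3's html.parser, and the names TableCellParser, SAMPLE and rows are made up for this example. SAMPLE is a trimmed-down version of the table from the question. The idea is to look at every child of a cell (text and tags alike) rather than only the first one.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect, for every <td>, its text and any link targets inside it.

    Hypothetical helper for illustration; it does not handle nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell dicts per <tr>
        self._cell = None  # the cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = {"text": "", "links": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["links"].append(href)

    def handle_data(self, data):
        # Text may arrive before and after a tag, so accumulate all of it.
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


# Trimmed sample based on the HTML in the question (assumed input).
SAMPLE = """
<table border="2" width="100%"><tbody>
<tr><td class="boldtd">website</td><td class="boldtd">last visited</td></tr>
<tr><td> <a href="http://google.com"></a> </td><td>01/14/2011 </td></tr>
<tr><td> stackoverflow.com </td><td>01/10/2011 </td></tr>
</tbody></table>
"""

parser = TableCellParser()
parser.feed(SAMPLE)
parser.close()

for row in parser.rows:
    # Prefer the link target when the cell holds one, else fall back to its text.
    print([cell["links"][0] if cell["links"] else cell["text"] for cell in row])
```

With this approach the google.com row reports the href instead of an empty string, which is exactly what taking only the first child of the cell loses.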