BeautifulSoup (Python) and parsing an HTML table


##### Update #####: using renderContents() instead of contents[0] did the trick. I'll still leave this open in case someone can provide a better, more elegant solution!

I am trying to parse a number of web pages for the desired data. The table doesn't have a class/id tag, so I have to search for 'website' in the tr contents.

The problem at hand: displaying td.contents works fine for text but not for hyperlinks, for some reason. What am I doing wrong? Is there a better way of doing this with BeautifulSoup in Python?

For those suggesting lxml: I have an ongoing thread here about CentOS and lxml installation without admin privileges, which is proving a handful at the moment. Hence I am exploring the BeautifulSoup option.
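As an aside, if installing lxml without admin rights stays blocked, the standard library's HTML parser needs no installation at all. A minimal sketch of pulling cell text and link hrefs out of a table row with it (Python 3 naming; on Python 2 the module is called HTMLParser):

    from html.parser import HTMLParser

    class TableCells(HTMLParser):
        """Collect the text, or the link href, found inside each <td> cell."""
        def __init__(self):
            super().__init__()
            self.in_td = False
            self.cells = []

        def handle_starttag(self, tag, attrs):
            if tag == "td":
                self.in_td = True
                self.cells.append("")          # start a new, empty cell
            elif tag == "a" and self.in_td:
                # record the hyperlink target instead of its (empty) text
                self.cells[-1] += dict(attrs).get("href", "")

        def handle_endtag(self, tag):
            if tag == "td":
                self.in_td = False

        def handle_data(self, data):
            if self.in_td:
                self.cells[-1] += data.strip()

    parser = TableCells()
    parser.feed('<tr><td><a href="http://google.com"></a></td>'
                '<td>01/14/2011</td></tr>')
    print(parser.cells)  # → ['http://google.com', '01/14/2011']

This sidesteps the install problem entirely, at the cost of writing the state handling yourself.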

HTML sample:

    <table border="2" width="100%">
      <tbody>
        <tr>
          <td width="33%" class="boldtd">website</td>
          <td width="33%" class="boldtd">last visited</td>
          <td width="34%" class="boldtd">last loaded</td>
        </tr>
        <tr>
          <td width="33%">
            <a href="http://google.com"></a>
          </td>
          <td width="33%">01/14/2011</td>
          <td width="34%"></td>
        </tr>
        <tr>
          <td width="33%">
            stackoverflow.com
          </td>
          <td width="33%">01/10/2011</td>
          <td width="34%"></td>
        </tr>
        <tr>
          <td width="33%">
            <a href="http://stackoverflow.com"></a>
          </td>
          <td width="33%">01/10/2011</td>
          <td width="34%"></td>
        </tr>
      </tbody>
    </table>

Python code so far:

    f1 = open(path + "/" + file)
    pagesource = f1.read()
    f1.close()

    soup = BeautifulSoup(pagesource)
    alltables = soup.findAll("table", {"border": "2", "width": "100%"})
    print "number of tables found : ", len(alltables)

    for table in alltables:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                print td.contents[0]

    from BeautifulSoup import BeautifulSoup

    pagesource = '''...omitted for brevity...'''

    soup = BeautifulSoup(pagesource)
    alltables = soup.findAll("table", {"border": "2", "width": "100%"})

    results = []
    for table in alltables:
        rows = table.findAll('tr')
        lines = []
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                text = td.renderContents().strip('\n')
                lines.append(text)
        text_table = '\n'.join(lines)
        if 'website' in text_table:
            results.append(text_table)

    print "number of tables found : ", len(results)
    for result in results:
        print(result)

This yields:

    number of tables found :  1
    website
    last visited
    last loaded
    <a href="http://google.com"></a>
    01/14/2011

    stackoverflow.com
    01/10/2011

    <a href="http://stackoverflow.com"></a>
    01/10/2011

Is this close to what you are looking for? The problem is that td.contents returns a list of NavigableStrings and soup Tags. For instance, running print(td.contents) might yield

['', '<a href="http://stackoverflow.com"></a>', ''] 

so picking off only the first element of the list makes you miss the <a> tag.
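One robust way around that is to look inside each cell for an `<a>` tag first and fall back to the cell's plain text otherwise. A minimal sketch, written against the modern bs4 package (under BeautifulSoup 3 the equivalent calls are `findAll` and `find`, and `get_text` would be roughly `td.text`):

    from bs4 import BeautifulSoup

    # A trimmed version of the table above: one cell holds a link, one holds text.
    html = '''<table border="2" width="100%">
      <tr><td><a href="http://google.com"></a></td><td>01/14/2011</td></tr>
      <tr><td>stackoverflow.com</td><td>01/10/2011</td></tr>
    </table>'''

    soup = BeautifulSoup(html, "html.parser")
    extracted = []
    for tr in soup.find_all("tr"):
        row = []
        for td in tr.find_all("td"):
            link = td.find("a")  # a Tag if the cell contains a hyperlink, else None
            if link is not None:
                row.append(link["href"])          # take the href, not the empty link text
            else:
                row.append(td.get_text(strip=True))
        extracted.append(row)

    print(extracted)
    # [['http://google.com', '01/14/2011'], ['stackoverflow.com', '01/10/2011']]

This way each cell yields exactly one value whether it contains a bare string or an anchor, instead of depending on where the value sits in td.contents.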

