HTML table scraping: getting href attributes from a column

Tom

Here is a simple, valid HTML table (live demo):

<!DOCTYPE html>
<html id="world_presidents" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <meta charset='utf-8' />
        <style>
            table, td, th {
                border: 1px solid grey;
            }
        </style>
    </head>
    <body>
        <table>
            <tr>
                <th>Name</th>
                <th>Took office</th>
                <th>Resources</th>
            </tr>
            <tr>
                <td><a href="http://en.wikipedia.org/wiki/George_Washington">George Washington</a></td>

                <td>1789</td>
                <td><a href="http://books.google.co.uk/books?id=t1pQ4YG-TDIC&pg=PA148&dq=#v=onepage&q=&f=false"  title="encyclopedia">Encyclopedia</a>(<a href="http://constitutioncenter.org/media/audio/ron_chernow_10-18-10_(64).mp3"  title="Subresource">Sub Resource</a>)<br>
                <a href="http://en.wikipedia.org/wiki/George_Washington#CITEREFParry1991"  title="parry">Parry 1991</a></td>
            </tr>
            <tr>
                <td>John Adams</td>
                <td>1797</td>
                <td><a href="http://www.adherents.com/people/pa/John_Adams.html"  title="Adherents.com">Adherents.com</a></td>
            </tr>
            <tr>
                <td>Thomas Jefferson</td>
                <td>1801</td>
                <td><a href="http://books.google.co.uk/books?id=qkTPAAAAMAAJ&redir_esc=y"  title="Government Printing Office">Government Printing Office</a></td>
            </tr>
            <tr>
                <td>James Madison</td>
                <td>1809</td>
                <td><a href="http://www.loa.org/volume.jsp?RequestID=16&section=toc"  title="Library of America">Library of America</a><br>
                <a href="http://quod.lib.umich.edu/cgi/t/text/text-idx?c=acls;cc=acls;view=toc;idno=HEB00509.0001.001"  title="Federal Republic">Federal Republic</a></td>
            </tr>
            <tr>
                <td>James Monroe</td>
                <td>1817</td>
                <td><a href="https://rads.stackoverflow.com/amzn/click/com/0813912660" rel="nofollow noreferrer"  title="scholarly biography">scholarly biography</a></td>
            </tr>
            <tr>
                <td>John Quincy Adams</td>
                <td>1825</td>
                <td><a href="http://www.common-place.org/vol-09/no-01/adams/"  title="Common-Place">Common-Place</a>&nbsp;(<a href="http://dx.doi.org/10.1111%2F1467-7709.00049"  title="Diplomatic History">Aditional - Diplomatic History</a>)</td>
            </tr>
            <tr>
                <td>Andrew Jackson</td>
                <td>1829</td>
                <td><a href="http://statelibrary.dcr.state.nc.us/nc/bio/public/jackson.htm"  title="Information Services Branch">Information Services Branch</a>&nbsp;(<a href="http://www.discovernorthernireland.com/product.aspx?ProductID=2801"  title="Tourist Board">Tourist Board</a>)</td>
            </tr>
            <tr>
                <td>Martin Van Buren</td>
                <td>1837</td>
                <td><a href="http://en.wikipedia.org/wiki/Holmes_Alexander"  title="The American Talleyrand">The American Talleyrand</a></td>
            </tr>
        </table>
    </body>
</html>

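I am trying to get all the data from the columns so that I can insert it into a database. I am stuck on the third column, where I need every link from the 'href' attributes. The code below works for the first and second columns, but I cannot find a way to change it so that it prints all the links in the third column.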

<?php 

require_once 'simple_html_dom.php'; 
$html = new simple_html_dom(); 
$html = file_get_html('table.html'); 

//engine 
//go through table and find href attributes
echo"<p>Presidents</p>"; 
foreach($html->find('//*[@id="world_presidents"]/body/table/tbody/tr/') as $row) { 
    $presidentsLink = $row->find('a', 2); 
    if(!empty($presidentsLink)){ 
        echo $presidentsLink->href . "<br>"; 
    } 
}

?> 

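Right now it prints only one link instead of 13 (live demo).

In short:

  • I am using the Simple HTML DOM parser to get content from an HTML table
  • I cannot change the HTML table
  • My problem is getting all the href attributes from the third column and displaying them

Any help would be appreciated.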

Kevin

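Your code prints a single link because $row->find('a', 2) returns only the anchor at index 2 (the third one) in each row, and only the first row contains that many anchors. Since you appear to be using an XPath-style query anyway, point it directly at the third cell of each table row and loop over that cell's anchor children: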

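require_once 'simple_html_dom.php';

// Load the page, then select the third <td> of every row (XPath-style index).
$html = file_get_html('http://five-kings.co.uk/question/table.html');
$table_rows = $html->find('tr td[3]');
foreach($table_rows as $cell) {
    // Walk the direct children of the cell and print the href of each anchor.
    foreach($cell->children as $child) {
        if($child->tag == 'a') {
            echo $child->href . '<br/>';
        }
    }
    echo '<hr/>'; // separator between table rows
}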
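
If Simple HTML DOM is not a hard requirement, the same idea can probably be expressed with PHP's built-in DOMDocument and DOMXPath classes. The following is only a sketch under a few assumptions: the inline markup is a shortened stand-in for table.html, and thirdColumnLinks is a hypothetical helper name introduced here for illustration, not part of any library.

```php
<?php
// Shortened stand-in for table.html (assumption: same row/column layout).
$source = <<<'HTML'
<table>
  <tr><th>Name</th><th>Took office</th><th>Resources</th></tr>
  <tr>
    <td><a href="http://en.wikipedia.org/wiki/George_Washington">George Washington</a></td>
    <td>1789</td>
    <td><a href="http://constitutioncenter.org/media/audio/ron_chernow_10-18-10_(64).mp3">Sub Resource</a><br>
        <a href="http://en.wikipedia.org/wiki/George_Washington#CITEREFParry1991">Parry 1991</a></td>
  </tr>
  <tr>
    <td>John Adams</td>
    <td>1797</td>
    <td><a href="http://www.adherents.com/people/pa/John_Adams.html">Adherents.com</a></td>
  </tr>
</table>
HTML;

// Hypothetical helper: returns every href found in the third <td> of each row.
function thirdColumnLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // td[3] is the third <td> of a row; //a also matches anchors nested deeper in the cell.
    foreach ($xpath->query('//tr/td[3]//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

foreach (thirdColumnLinks($source) as $href) {
    echo $href, PHP_EOL;
}
```

Because the helper returns a plain array rather than echoing inside the loop, the links can be handed to whatever database insert code is used afterwards. The header row is skipped automatically, since it contains th cells and therefore has no td[3].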
