Creating a DataFrame from HTML tables in Python

Brian O'Halloran

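I am trying to extract information from multiple tables like the one below. I want to pull out the address, lot number, guide price, and description. Should I simply use regular-expression matching? There are 232 of these tables, so presumably I should loop over them to extract the data (and put it into pandas)?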

                            <table cellspacing="0" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1" style="width:100%;border-collapse:collapse;">
<tr>
    <td colspan="2">
    <table class="table-search-result">
        <tr>
            <th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th>
            <th style="text-align: right; white-space: nowrap;">

                <a href="http://www.englishhouseprices.com/results.aspx?postcode=SW1V 4PQ" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A2" class="icon" target="_blank">
                    <img src="/content/images/icons/32/houseprices.png" alt="Compare with Property Prices" title="Compare with Property Prices in this Postcode" /></a>
                <a id="" title="View Auction Details" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank"><img title="View Auction Details" src="/content/images/icons/32/auctiondetails.png" alt="" /></a>


                <a id="" title="Trend Analysis" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/lots/trend-analysis.aspx?lotid=756425" target="_blank"><img title="Trend Analysis" src="/content/images/icons/32/piechart.png" alt="" /></a>
                <a href='http://maps.google.co.uk?q=SW1V 4PQ' target="_blank">
                    <img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageLocationMap" title="Location Map" class="icon" src="/content/images/icons/32/compass.png" /></a>
                <a href='http://www.multimap.com/map/photo.cgi?scale=5000&mapsize=big&pc=SW1V 4PQ' target="_blank">
                    <img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageAerialPhoto" title="Aerial Photo" class="icon" src="/content/images/icons/32/camera.png" /></a>
                <a href='/clients/search/search-results.aspx?searchtype=comparable&lotid=756425' title="Find similar properties like this one">
                    <img src="/content/images/icons/32/find.png" alt="Find other properties matching this tenant" title="Find similar properties like this one" class="icon" /></a>

                <a href='/clients/search/search-results.aspx?searchtype=history&lotid=756425'>
                    <img src="/content/images/icons/32/history.png" alt="Find history of property in this street" title="Find history of property in this street" class="icon" /></a>
                <a id="" title="Add to one of my portfolios" class="icon" Title="Add to portfolio" onclick="return o(this,650,500,1,1)" href="/clients/portfolios/lot.aspx?lotid=756425" target="_blank"><img title="Add to one of my portfolios" src="/content/images/icons/32/briefcase.png" alt="" /></a>
                <a href="https://www.eigroup.co.uk/files/55/17999/6ec339ec-d59e-4b8a-9136-dc6e9a583328.pdf" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A4" target="_blank">
                    <img src="/content/images/icons/32/catalogue.png" alt="Catalogue Entry" class="icon" title="Full Catalogue Entry" /></a>
                <a id="" title="Add to my shortlist" class="icon" Title="Add to shortlist" onclick="return o(this,900,650,1,1)" href="/clients/lots/shortlist.aspx?lotid=756425" target="shortlist"><img title="Add to my shortlist" src="/content/images/icons/32/shortlist.png" alt="" /></a>

            </th>
        </tr>
        <tr>
            <td colspan="2" style="background-color: #f5f5f5;">
                <table style="width: 100%">
                    <tr>
                        <td style="background-color: #f1f1f1; width: 170px; text-align: center;">
                            <a href='/clients/lots/details.aspx?lotid=756425&hb=1' target='756425' onclick="window.open(this.href,this.target,'width=900,height=650,resizable=yes,scrollbars=yes');return false" title="Auction property in Pimlico, London, SW1">
                                <img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_Image1" src="https://www.eigroup.co.uk/files/55/17999/de591a4f-7da1-4bcd-a42c-76731bd72a23.jpg" alt="Pimlico, London, SW1" style="border-color:Black;border-width:2px;border-style:Solid;width:150px;" />
                            </a>
                        </td>
                        <td style="padding-left: 10px; width: 50%;">
                            <p>
                                <b>Description</b><br />
                                Leasehold 2nd Floor Studio Flat Unmodernised Vacant
                            </p>
                            <p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P1">
                                <b>Guide Price</b><br />
                                £450,000 Plus
                            </p>

                            <p>
                                <b>Lot Number</b><br />
                                2
                            </p>
                             <p>
                               <b> </b>
                            </p>
                        </td>
                        <td style="white-space: nowrap;">
                            <p>
                                <b>Auctioneer</b><br />
                                <a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctioneers/details.aspx?auctioneerid=55" target="_blank">Savills (London - National)</a>

                            </p>
                            <p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P3">
                                <b>Vendor</b><br />
                                Housing Association
                            </p>

                        </td>
                        <td style="white-space: nowrap;">
                            <p>
                                <b>Auction Date</b><br />
                                <a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank">28 October 2014</a>
                            </p>


                            <p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P7">
                                <b>Lease Details</b><br />
                                125 Yr, commencing 01/01/2013 (GR.£250.PA)
                            </p>
                        </td>
                    </tr>
                </table>
            </td>
        </tr>

    </table>
</td>
</tr>

Parchment

If the tables are reasonably well formed, you can use the pandas read_html method. It returns a list of DataFrames, one for each table found.

pandas.read_html(html_string_or_url)
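As a minimal sketch of that call, assuming a simple, flat table: the HTML string below is a made-up simplification, not the nested markup from the question, and wrapping it in `io.StringIO` is an assumption made because recent pandas releases deprecate passing literal HTML strings directly. `read_html` also needs a parser backend installed (lxml, or Beautiful Soup with html5lib).

```python
import io

import pandas as pd

# Hypothetical, simplified table; the real page nests layout tables,
# which read_html may split into several awkward frames.
SIMPLE_HTML = """
<table>
  <tr><th>Address</th><th>Lot Number</th></tr>
  <tr><td>66D Charlwood Street, Pimlico, London, SW1V 4PQ</td><td>2</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(SIMPLE_HTML))
df = tables[0]
print(df)
```

With layout-style markup like the question's, the returned frames would probably need inspecting one by one before deciding whether this route is usable.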

If pandas cannot read it, you will need to parse it manually. You should use an HTML parser library such as Beautiful Soup.
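A manual parse could look roughly like the sketch below. It assumes every lot sits in a `table` with class `table-search-result`, that the first `th` holds the address, and that each field is a `<p>` containing a `<b>` label followed by its value, as in the question's sample. `SAMPLE_HTML` is a trimmed copy of that sample and `parse_lot` is a hypothetical helper name, not part of any library.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed copy of the question's markup (assumption: all 232 lots share it).
SAMPLE_HTML = """
<table class="table-search-result">
  <tr><th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th><th></th></tr>
  <tr><td>
    <p><b>Description</b><br />
       Leasehold 2nd Floor Studio Flat Unmodernised Vacant</p>
    <p><b>Guide Price</b><br />
       £450,000 Plus</p>
    <p><b>Lot Number</b><br />
       2</p>
    <p><b> </b></p>
  </td></tr>
</table>
"""


def parse_lot(table):
    """Turn one lot table into a dict of {label: value} (hypothetical helper)."""
    record = {"Address": table.find("th").get_text(strip=True)}
    for p in table.find_all("p"):
        label_tag = p.find("b")
        if label_tag is None:
            continue
        label = label_tag.get_text(strip=True)
        if not label:
            continue  # skip the empty <b> </b> placeholder paragraphs
        label_tag.extract()  # remove the label so only the value text remains
        record[label] = p.get_text(" ", strip=True)
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_lot(t) for t in soup.find_all("table", class_="table-search-result")]
df = pd.DataFrame(records)
print(df)
```

Looping over `find_all` handles all the tables in one pass, and because the labels become column names, fields such as Auctioneer or Vendor would be picked up the same way if they are present.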
