Scraping a complex table with BeautifulSoup and Python

恩瓦罕
<table cellspacing="0" rules="all" border="1" id="MainContent_grdUsers2" style="border-style:None;width:100%;border-collapse:collapse;">
                    <tbody><tr class="listHeader">
                        <th scope="col" style="width:11%;">Name</th><th scope="col" style="width:12%;">Password</th><th scope="col" style="width:16%;">Rights</th><th scope="col" style="width:10%;">Bureaus</th><th scope="col" style="width:15%;">FullName</th><th scope="col" style="width:16%;">Email</th><th scope="col" style="width:12%;">Status</th><th scope="col" style="width:12%;">Logon Tries</th>
                    </tr><tr>
                        <td>user1</td><td align="center">
                                                    <input name="ctl00$MainContent$grdUsers2$ctl02$txtManageUsersPassword" type="text" maxlength="50" id="MainContent_grdUsers2_txtManageUsersPassword_0" style="width: 95%; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAUBJREFUOBGVVE2ORUAQLvIS4gwzEysHkHgnkMiEc4zEJXCMNwtWTmDh3UGcYoaFhZUFCzFVnu4wIaiE+vvq6+6qTgthGH6O4/jA7x1OiCAIPwj7CoLgSXDxSjEVzAt9k01CBKdWfsFf/2WNuEwc2YqigKZpK9glAlVVwTTNbQJZlnlCkiTAZnF/mePB2biRdhwHdF2HJEmgaRrwPA+qqoI4jle5/8XkXzrCFoHg+/5ICdpm13UTho7Q9/0WnsfwiL/ouHwHrJgQR8WEwVG+oXpMPaDAkdzvd7AsC8qyhCiKJjiRnCKwbRsMw9hcQ5zv9maSBeu6hjRNYRgGFuKaCNwjkjzPoSiK1d1gDDecQobOBwswzabD/D3Np7AHOIrvNpHmPI+Kc2RZBm3bcp8wuwSIot7QQ0PznoR6wYSK0Xb/AGVLcWwc7Ng3AAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" autocomplete="off">
                                                </td><td align="center">
                                                    <select name="ctl00$MainContent$grdUsers2$ctl02$ddlManageUsersRights" id="MainContent_grdUsers2_ddlManageUsersRights_0" style="width:95%;">
                            <option value="User">User</option>
                            <option selected="selected" value="Supervisor">Supervisor</option>
                            <option value="Administrator">Administrator</option>
                            <option value="Child Supervisor">Child Supervisor</option>

                        </select>

                                                </td><td align="center">
                                                    <select name="ctl00$MainContent$grdUsers2$ctl02$ddlManageUsersBureaus" id="MainContent_grdUsers2_ddlManageUsersBureaus_0" style="width:95%;">
                            <option value="255">High</option>
                            <option selected="selected" value="128">Medium</option>
                            <option value="0">Low</option>

                        </select>

                                                </td><td align="center">
                                                    <input name="ctl00$MainContent$grdUsers2$ctl02$txtManageUsersFullName" type="text" value="First1 Last1" maxlength="50" id="MainContent_grdUsers2_txtManageUsersFullName_0" style="width: 95%; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAUBJREFUOBGVVE2ORUAQLvIS4gwzEysHkHgnkMiEc4zEJXCMNwtWTmDh3UGcYoaFhZUFCzFVnu4wIaiE+vvq6+6qTgthGH6O4/jA7x1OiCAIPwj7CoLgSXDxSjEVzAt9k01CBKdWfsFf/2WNuEwc2YqigKZpK9glAlVVwTTNbQJZlnlCkiTAZnF/mePB2biRdhwHdF2HJEmgaRrwPA+qqoI4jle5/8XkXzrCFoHg+/5ICdpm13UTho7Q9/0WnsfwiL/ouHwHrJgQR8WEwVG+oXpMPaDAkdzvd7AsC8qyhCiKJjiRnCKwbRsMw9hcQ5zv9maSBeu6hjRNYRgGFuKaCNwjkjzPoSiK1d1gDDecQobOBwswzabD/D3Np7AHOIrvNpHmPI+Kc2RZBm3bcp8wuwSIot7QQ0PznoR6wYSK0Xb/AGVLcWwc7Ng3AAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" autocomplete="off">
                                                </td><td align="center">
                                                    <input name="ctl00$MainContent$grdUsers2$ctl02$txtManageUsersEmail" type="text" value="[email protected]" maxlength="50" id="MainContent_grdUsers2_txtManageUsersEmail_0" style="width: 95%; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAUBJREFUOBGVVE2ORUAQLvIS4gwzEysHkHgnkMiEc4zEJXCMNwtWTmDh3UGcYoaFhZUFCzFVnu4wIaiE+vvq6+6qTgthGH6O4/jA7x1OiCAIPwj7CoLgSXDxSjEVzAt9k01CBKdWfsFf/2WNuEwc2YqigKZpK9glAlVVwTTNbQJZlnlCkiTAZnF/mePB2biRdhwHdF2HJEmgaRrwPA+qqoI4jle5/8XkXzrCFoHg+/5ICdpm13UTho7Q9/0WnsfwiL/ouHwHrJgQR8WEwVG+oXpMPaDAkdzvd7AsC8qyhCiKJjiRnCKwbRsMw9hcQ5zv9maSBeu6hjRNYRgGFuKaCNwjkjzPoSiK1d1gDDecQobOBwswzabD/D3Np7AHOIrvNpHmPI+Kc2RZBm3bcp8wuwSIot7QQ0PznoR6wYSK0Xb/AGVLcWwc7Ng3AAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" autocomplete="off">
                                                </td><td align="center">
                                                    <select name="ctl00$MainContent$grdUsers2$ctl02$ddlManageUsersStatus" id="MainContent_grdUsers2_ddlManageUsersStatus_0" style="width:95%;">
                            <option value="Active">Active</option>
                            <option selected="selected" value="Inactive">Inactive</option>
                            <option value="Terminated">Terminated</option>

                        </select>

                                                </td><td align="center">                                                    
                                                    <input name="ctl00$MainContent$grdUsers2$ctl02$txtManageUsersLogonTries" type="text" value="0" maxlength="1" id="MainContent_grdUsers2_txtManageUsersLogonTries_0" style="width:95%;">
                                                </td>
                    </tr><tr style="background-color:#CED6E7;">
                        <td>user2</td><td align="center">
                                                    <input name="ctl00$MainContent$grdUsers2$ctl03$txtManageUsersPassword" type="text" maxlength="50" id="MainContent_grdUsers2_txtManageUsersPassword_1" style="background-color: rgb(206, 214, 231); width: 95%; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAUBJREFUOBGVVE2ORUAQLvIS4gwzEysHkHgnkMiEc4zEJXCMNwtWTmDh3UGcYoaFhZUFCzFVnu4wIaiE+vvq6+6qTgthGH6O4/jA7x1OiCAIPwj7CoLgSXDxSjEVzAt9k01CBKdWfsFf/2WNuEwc2YqigKZpK9glAlVVwTTNbQJZlnlCkiTAZnF/mePB2biRdhwHdF2HJEmgaRrwPA+qqoI4jle5/8XkXzrCFoHg+/5ICdpm13UTho7Q9/0WnsfwiL/ouHwHrJgQR8WEwVG+oXpMPaDAkdzvd7AsC8qyhCiKJjiRnCKwbRsMw9hcQ5zv9maSBeu6hjRNYRgGFuKaCNwjkjzPoSiK1d1gDDecQobOBwswzabD/D3Np7AHOIrvNpHmPI+Kc2RZBm3bcp8wuwSIot7QQ0PznoR6wYSK0Xb/AGVLcWwc7Ng3AAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%;" autocomplete="off">
                                                </td><td align="center">
                                                    <select name="ctl00$MainContent$grdUsers2$ctl03$ddlManageUsersRights" id="MainContent_grdUsers2_ddlManageUsersRights_1" style="background-color:#CED6E7;width:95%;">
                            <option value="User">User</option>
                            <option selected="selected" value="Supervisor">Supervisor</option>
                            <option value="Administrator">Administrator</option>
                            <option value="Child Supervisor">Child Supervisor</option>

                        </select>

                                                </td><td align="center">
                                                    <select name="ctl00$MainContent$grdUsers2$ctl03$ddlManageUsersBureaus" id="MainContent_grdUsers2_ddlManageUsersBureaus_1" style="background-color:#CED6E7;width:95%;">
                            <option value="255">High</option>
                            <option selected="selected" value="128">Medium</option>
                            <option value="0">Low</option>

                        </select>

                                                </td><td align="center">
                                                    <input name="ctl00$MainContent$grdUsers2$ctl03$txtManageUsersFullName" type="text" value="First2 Last2" maxlength="50" id="MainContent_grdUsers2_txtManageUsersFullName_1" style="background-color: rgb(206, 214, 231); width: 95%; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAUBJREFUOBGVVE2ORUAQLvIS4gwzEysHkHgnkMiEc4zEJXCMNwtWTmDh3UGcYoaFhZUFCzFVnu4wIaiE+vvq6+6qTgthGH6O4/jA7x1OiCAIPwj7CoLgSXDxSjEVzAt9k01CBKdWfsFf/2WNuEwc2YqigKZpK9glAlVVwTTNbQJZlnlCkiTAZnF/mePB2biRdhwHdF2HJEmgaRrwPA+qqoI4jle5/8XkXzrCFoHg+/5ICdpm13UTho7Q9/0WnsfwiL/ouHwHrJgQR8WEwVG+oXpMPaDAkdzvd7AsC8qyhCiKJjiRnCKwbRsMw9hcQ5zv9maSBeu6hjRNYRgGFuKaCNwjkjzPoSiK1d1gDDecQobOBwswzabD/D3Np7AHOIrvNpHmPI+Kc2RZBm3bcp8wuwSIot7QQ0PznoR6wYSK0Xb/AGVLcWwc7Ng3AAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" autocomplete="off">
                                                </td><td align="center">
                                                    <input name="ctl00$MainContent$grdUsers2$ctl03$txtManageUsersEmail" type="text" value="[email protected]" maxlength="50" id="MainContent_grdUsers2_txtManageUsersEmail_1" style="background-color: rgb(206, 214, 231); width: 95%; background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAUBJREFUOBGVVE2ORUAQLvIS4gwzEysHkHgnkMiEc4zEJXCMNwtWTmDh3UGcYoaFhZUFCzFVnu4wIaiE+vvq6+6qTgthGH6O4/jA7x1OiCAIPwj7CoLgSXDxSjEVzAt9k01CBKdWfsFf/2WNuEwc2YqigKZpK9glAlVVwTTNbQJZlnlCkiTAZnF/mePB2biRdhwHdF2HJEmgaRrwPA+qqoI4jle5/8XkXzrCFoHg+/5ICdpm13UTho7Q9/0WnsfwiL/ouHwHrJgQR8WEwVG+oXpMPaDAkdzvd7AsC8qyhCiKJjiRnCKwbRsMw9hcQ5zv9maSBeu6hjRNYRgGFuKaCNwjkjzPoSiK1d1gDDecQobOBwswzabD/D3Np7AHOIrvNpHmPI+Kc2RZBm3bcp8wuwSIot7QQ0PznoR6wYSK0Xb/AGVLcWwc7Ng3AAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;" autocomplete="off">
                                                </td><td align="center">
                                                    <select name="ctl00$MainContent$grdUsers2$ctl03$ddlManageUsersStatus" id="MainContent_grdUsers2_ddlManageUsersStatus_1" style="background-color:#CED6E7;width:95%;">
                            <option selected="selected" value="Active">Active</option>
                            <option value="Inactive">Inactive</option>
                            <option value="Terminated">Terminated</option>

                        </select>

                                                </td><td align="center">                                                    
                                                    <input name="ctl00$MainContent$grdUsers2$ctl03$txtManageUsersLogonTries" type="text" value="0" maxlength="1" id="MainContent_grdUsers2_txtManageUsersLogonTries_1" style="background-color:#CED6E7;width:95%;">
</td>
</tr>
</tbody>
</table>

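I am trying to scrape a table that contains plain text, dropdown options, and input values. The result should look like this:

user1 | Supervisor | Medium | First1 Last1 | [email protected] | Inactive

user2 | Supervisor | Medium | First2 Last2 | [email protected] | Active

I plan to write the output to CSV. So far I have: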

headers = [c.get_text(strip=True) for c in soup.find('tr', attrs={'class':'listHeader'}).findAll('th')]

#find_all doesn't work here it just grabs one
for table in soup.find('table', attrs={'id':'MainContent_grdUsers2'}):
        try:
            column3=(table.find("option", attrs={"selected": "selected"}).get('value')) 
        except:
            continue

#this only grabs a specific cell
for table in soup.find('table', attrs={'id':'MainContent_grdUsers2'}):
        try:
            column6=(table.find("input", attrs={"id": "MainContent_grdUsers2_txtManageUsersEmail_0"}).get('value')) 
        except:
            continue

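I can go in and grab the cells I want one at a time, but the table has roughly 100 rows, and I am struggling to work out how to grab them all in one pass, because the cells hold not only plain text but also selected dropdown options and input values. Is there a way to do this with BeautifulSoup? I briefly tried pandas and lxml, but I have never used them before.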

Updated code:

headers = [c.get_text(strip=True) for c in soup.find('tr', attrs={'class':'listHeader'}).find_all('th')]
table = soup.find('table', attrs={'id':'MainContent_grdUsers2'})
data = []

for tr in table.find_all('tr')[1:]:
    td = tr.find_all('td')
    try:
        # Read the FullName input first so the row can be skipped
        # when it has no value attribute.
        full_name = td[4].find('input').get('value')
        if full_name is None:
            continue
        data.append([
            td[0].get_text(),
            td[2].find('option', {'selected':'selected'}).get_text(),
            td[3].find('option', {'selected':'selected'}).get_text(),
            full_name,
            td[5].find('input').get('value'),
            td[6].find('option', {'selected':'selected'}).get_text(),
        ])
    except Exception as ex:
        # print(ex)  # uncomment this line for debugging
        continue

for row in data:
    print(' '.join(row))
妈妈

Given the HTML you provided, this should work:

# Assumes `soup` is a BeautifulSoup object already built from the page HTML.
header_row = soup.find('tr', attrs={'class': 'listHeader'})
if header_row:
    headers = [th.get_text(strip=True) for th in header_row.find_all('th')]
else:
    headers = None

table = soup.find('table', attrs={'id': 'MainContent_grdUsers2'})
data = []

# Skip the header row, then read each cell by its column position.
for tr in table.find_all('tr')[1:]:
    td = tr.find_all('td')
    try:
        data.append([
            td[0].get_text(),                                           # Name (plain text)
            td[2].find('option', {'selected': 'selected'}).get_text(),  # Rights
            td[3].find('option', {'selected': 'selected'}).get_text(),  # Bureaus
            td[4].find('input').get('value'),                           # FullName
            td[5].find('input').get('value'),                           # Email
            td[6].find('option', {'selected': 'selected'}).get_text(),  # Status
        ])
    except Exception as ex:
        # print(ex)  # uncomment this line for debugging
        continue

for row in data:
    # str() keeps the join from failing when an input has no value attribute (None).
    print(' '.join(str(r) for r in row))

Output:

user1 Supervisor Medium First1 Last1 [email protected] Inactive
user2 Supervisor Medium First2 Last2 [email protected] Active
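
Since the question mentions CSV output, here is a minimal sketch of one way to get there without hard-coding column positions. It is not part of the answer above. The helper names `cell_value` and `table_to_rows`, and the trimmed-down `SAMPLE_HTML`, are hypothetical and only illustrate the idea: decide per cell whether to read the selected `<option>`, the `<input>` value, or the plain text.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (hypothetical sample data).
SAMPLE_HTML = """
<table id="MainContent_grdUsers2">
  <tr class="listHeader"><th>Name</th><th>Password</th><th>Rights</th><th>FullName</th></tr>
  <tr>
    <td>user1</td>
    <td><input type="text"></td>
    <td><select>
      <option value="User">User</option>
      <option selected="selected" value="Supervisor">Supervisor</option>
    </select></td>
    <td><input type="text" value="First1 Last1"></td>
  </tr>
</table>
"""


def cell_value(td):
    """Return what the cell shows: selected option text, input value, or plain text."""
    select = td.find("select")
    if select is not None:
        option = select.find("option", selected=True)  # any <option> with a selected attribute
        return option.get_text(strip=True) if option else ""
    field = td.find("input")
    if field is not None:
        return field.get("value", "")  # empty string when the input has no value
    return td.get_text(strip=True)


def table_to_rows(html, table_id):
    """Parse the table with the given id into (headers, rows)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    header_row = table.find("tr", class_="listHeader")
    headers = [th.get_text(strip=True) for th in header_row.find_all("th")]
    rows = [
        [cell_value(td) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find("td")  # skip the header row, which has only <th> cells
    ]
    return headers, rows


headers, rows = table_to_rows(SAMPLE_HTML, "MainContent_grdUsers2")

# Write to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead, the buffer could be swapped for something like `open("users.csv", "w", newline="")` (the filename is only an example). Unlike the positional version, this sketch keeps every column, including the empty Password cell, so the rows stay aligned with the headers.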
