Python BeautifulSoup: scraping multiple web pages through the iframe of each given URL

Posted by FrankC in Dev

FrankC

We have the following code (thanks to Cody and Alex Tereshenkov):

import pandas as pd
import requests
from bs4 import BeautifulSoup

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 50)

url = "https://www.aliexpress.com/store/feedback-score/1665279.html"
s = requests.Session()
r = s.get(url)

soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("#detail-displayer").attrs["src"]

r = s.get(f"https:{iframe_src}")

soup = BeautifulSoup(r.content, "html.parser")
rows = []
for row in soup.select(".history-tb tr"):
    # collect the text of every header/data cell in this table row
    rows.append([e.text for e in row.select("th, td")])

df = pd.DataFrame.from_records(
    rows,
    columns=['Feedback', '1 Month', '3 Months', '6 Months'],
)

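# remove first row with column names; copy so the new column is added to an independent frame
df = df.iloc[1:].copy()
# the shop ID is the file name part of the URL, e.g. "1665279"
df['Shop'] = url.split('/')[-1].split('.')[0]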

pivot = df.pivot(index='Shop', columns='Feedback')
pivot.columns = [' '.join(col).strip() for col in pivot.columns.values]

column_mapping = dict(
    zip(pivot.columns.tolist(), [col[:12] for col in pivot.columns.tolist()]))
# column_mapping
# {'1 Month Negative (1-2 Stars)': '1 Month Nega',
#  '1 Month Neutral (3 Stars)': '1 Month Neut',
#  '1 Month Positive (4-5 Stars)': '1 Month Posi',
#  '1 Month Positive feedback rate': '1 Month Posi',
#  '3 Months Negative (1-2 Stars)': '3 Months Neg',
#  '3 Months Neutral (3 Stars)': '3 Months Neu',
#  '3 Months Positive (4-5 Stars)': '3 Months Pos',
#  '3 Months Positive feedback rate': '3 Months Pos',
#  '6 Months Negative (1-2 Stars)': '6 Months Neg',
#  '6 Months Neutral (3 Stars)': '6 Months Neu',
#  '6 Months Positive (4-5 Stars)': '6 Months Pos',
#  '6 Months Positive feedback rate': '6 Months Pos'}
pivot.columns = [column_mapping[col] for col in pivot.columns]

pivot.to_excel('Report.xlsx')

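This code extracts the "Feedback History" table for the given URL (the table sits inside an iframe) and converts all of the table data into a single row.

Separately, we have a file ("urls.txt") in the same project folder with 50 URLs, such as: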

https://www.aliexpress.com/store/feedback-score/4385007.html
https://www.aliexpress.com/store/feedback-score/1473089.html
https://www.aliexpress.com/store/feedback-score/3085095.html
https://www.aliexpress.com/store/feedback-score/2793002.html
https://www.aliexpress.com/store/feedback-score/4656043.html
https://www.aliexpress.com/store/feedback-score/4564021.html

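We just need to extract the same data for every URL in the file.

How can we do that?

Bitto Bennichan

Since there are only about 50 URLs, you can read them into a list and then send a request to each one. I just tested this with the 6 URLs above, and the solution works for them. You may still want to add some try-except handling in case an exception occurs.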

import pandas as pd
import requests
from bs4 import BeautifulSoup

# read the URLs, dropping trailing newlines and blank lines
with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

master_list = []
columns = None
s = requests.Session()  # one session reused for all requests
for url in urls:
    r = s.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # the feedback table is served from the iframe's own URL
    iframe_src = soup.select_one("#detail-displayer").attrs["src"]
    r = s.get(f"https:{iframe_src}")
    soup = BeautifulSoup(r.content, "html.parser")
    rows = []
    for row in soup.select(".history-tb tr"):
        rows.append([e.text for e in row.select("th, td")])
    df = pd.DataFrame.from_records(
        rows,
        columns=['Feedback', '1 Month', '3 Months', '6 Months'],
    )

    # remove first row with column names
    df = df.iloc[1:].copy()
    shop = url.split('/')[-1].split('.')[0]
    df['Shop'] = shop
    pivot = df.pivot(index='Shop', columns='Feedback')
    # one flat row per shop: shop ID followed by the pivoted values
    master_list.append([shop] + pivot.values.tolist()[0])
    # flattened column names, truncated to 12 characters as in the question's code
    columns = [' '.join(col).strip()[:12] for col in pivot.columns.values]

# build the combined frame once, after all URLs have been processed
final_output = pd.DataFrame(master_list, columns=['Shop'] + columns)
final_output.set_index('Shop', inplace=True)
final_output.to_excel('Report.xlsx')
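
As a rough illustration of the try-except handling suggested above, the sketch below wraps the per-URL work so that one failing shop page does not abort the whole run. The names `scrape_all` and `fetch_history_rows`, the 30-second timeout, and the `raise_for_status()` calls are assumptions added for illustration; they are not part of the original code.

```python
def fetch_history_rows(url, session):
    """Hypothetical helper: the same scraping steps as the loop body above."""
    # third-party import kept local so scrape_all below can be tried without it
    from bs4 import BeautifulSoup

    r = session.get(url, timeout=30)  # timeout value is an assumption
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "html.parser")
    # raises AttributeError if the iframe is missing, KeyError if it has no src
    iframe_src = soup.select_one("#detail-displayer").attrs["src"]
    r = session.get(f"https:{iframe_src}", timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "html.parser")
    return [[e.text for e in row.select("th, td")]
            for row in soup.select(".history-tb tr")]


def scrape_all(urls, fetch_rows):
    """Call fetch_rows(url) for every URL, keeping results and failures apart."""
    results = {}   # url -> list of table rows
    failures = {}  # url -> short error description
    for url in urls:
        try:
            results[url] = fetch_rows(url)
        except (OSError, AttributeError, KeyError) as exc:
            # OSError is the base class of requests.RequestException (network
            # and HTTP errors); AttributeError and KeyError cover a page
            # without the expected iframe or src attribute.
            failures[url] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

With a `requests.Session` named `s`, this could be called as `results, failures = scrape_all(urls, lambda u: fetch_history_rows(u, s))`, after which `failures` lists the URLs that may need a retry.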

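A better solution you might consider is to avoid pandas altogether. Once you have the data, you can reshape it into plain lists and write them out with XlsxWriter.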
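
A minimal sketch of that pandas-free idea follows, assuming `rows` is the list built from `.history-tb tr` (a header row followed by one row per feedback type). The helper names `flatten_history` and `write_report` are hypothetical. Unlike the pivot, this keeps the feedback types in table order and does not truncate the column names.

```python
def flatten_history(shop, rows):
    """Turn the scraped table rows for one shop into (header, values) lists."""
    # first row holds the period names, e.g. '1 Month', '3 Months', '6 Months'
    periods = [p.strip() for p in rows[0][1:]]
    header, values = ["Shop"], [shop]
    for i, period in enumerate(periods, start=1):
        for row in rows[1:]:
            header.append(f"{period} {row[0].strip()}")  # e.g. '1 Month <feedback type>'
            values.append(row[i].strip())
    return header, values


def write_report(path, header, data_rows):
    """Write one header row and one row per shop using XlsxWriter."""
    # third-party package, imported here so flatten_history has no dependencies
    import xlsxwriter

    with xlsxwriter.Workbook(path) as workbook:
        worksheet = workbook.add_worksheet()
        worksheet.write_row(0, 0, header)
        for r, values in enumerate(data_rows, start=1):
            worksheet.write_row(r, 0, values)
```

In the loop, each shop would contribute the `values` list from `flatten_history(shop, rows)`, and a single `write_report('Report.xlsx', header, all_values)` call at the end would produce the workbook.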

Edited on 2020-12-27
