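I am using BeautifulSoup to scrape some XML sites and then store the scraped data in a dataframe. The XML usually follows a uniform format, so the scraping works well. But roughly 15% of the time the data is not saved to the dataframe, because one of the tag prefixes is slightly different.
For example, when scraping these three URLs, the second and third are stored in the dataframe without any problem, but the first is not.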
from bs4 import BeautifulSoup
import requests
import pandas as pd

session = requests.Session()

# URLs to loop through
form_urls = ['https://www.sec.gov/Archives/edgar/data/1418814/000141881220000017/vac13f021420.xml',
             'https://www.sec.gov/Archives/edgar/data/820124/000095012320003895/408.xml',
             'https://www.sec.gov/Archives/edgar/data/1067983/000095012320002466/form13fInfoTable.xml']

# Create the dataframe with columns matching the XML document
cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt',
        'sshPrnamtType', 'putCall', 'investmentDiscretion',
        'otherManager', 'Sole', 'Shared', 'None']
res_df = pd.DataFrame(columns=cols)

# Iterate over the URLs
for form_url in form_urls:
    data = []
    # The 'lxml' HTML parser lowercases tag names and keeps any 'prefix:' as part of the name
    soup = BeautifulSoup(session.get(form_url).content, 'lxml')
    print(soup)
    # Only the 'ns1:' prefix and the unprefixed form are handled here
    for info_table in soup.find_all(['ns1:infotable', 'infotable']):
        row = []
        for col in cols:
            d = info_table.find([col.lower(), 'ns1:' + col.lower()])
            row.append(d.text.strip() if d else 'NaN')
        data.append(row)
    url_df = pd.DataFrame(data, columns=cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    res_df = pd.concat([res_df, url_df], ignore_index=True)

print(res_df)
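So how can I make the scraper more flexible when the prefix takes an unexpected form (for example, it might be an empty string, or some other combination of uppercase letters, lowercase letters, and digits)?
The second line of the first link you provided uses n1:infoTable rather than ns1:infoTable, so the code has to account for that prefix before it will work.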
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

session = requests.Session()

# URLs to loop through
form_urls = ['https://www.sec.gov/Archives/edgar/data/1418814/000141881220000017/vac13f021420.xml',
             'https://www.sec.gov/Archives/edgar/data/820124/000095012320003895/408.xml',
             'https://www.sec.gov/Archives/edgar/data/1067983/000095012320002466/form13fInfoTable.xml']

# Create the dataframe with columns matching the XML document
cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt',
        'sshPrnamtType', 'putCall', 'investmentDiscretion',
        'otherManager', 'Sole', 'Shared', 'None']
res_df = pd.DataFrame(columns=cols)

# Optional prefix: letters/digits followed by ':', or nothing at all.
# BeautifulSoup applies compiled patterns with search(), so the pattern is
# anchored with ^ and $ to avoid matching longer tag names by accident.
PREFIX = r'^(?:[A-Za-z0-9]+:)?'

# Iterate over the URLs
for form_url in form_urls:
    data = []
    soup = BeautifulSoup(session.get(form_url).content, 'lxml')
    for info_table in soup.find_all(re.compile(PREFIX + 'infotable$')):
        row = []
        for col in cols:
            pattern = re.compile(PREFIX + col.lower() + '$')
            d = info_table.find(pattern)
            row.append(d.text.strip() if d else 'NaN')
        data.append(row)
    url_df = pd.DataFrame(data, columns=cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    res_df = pd.concat([res_df, url_df], ignore_index=True)
Edit: the prefix can now be absent (the empty string "") or any combination of lowercase letters, uppercase letters, and digits.
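To see which tag names such a prefix pattern accepts without hitting the network, here is a minimal sketch that uses only the standard re module. The helper tag_pattern and the sample tag names are hypothetical and exist only for illustration. The ^ and $ anchors are an added assumption, included because BeautifulSoup tests compiled patterns with search(), which would otherwise also accept longer names that merely contain the target.

```python
import re


def tag_pattern(name):
    """Build a pattern for `name` with an optional 'prefix:' of letters/digits.

    Hypothetical helper: the name is lowercased because the 'lxml' HTML
    parser lowercases tag names, and escaped so it is matched literally.
    """
    return re.compile(r'^(?:[A-Za-z0-9]+:)?' + re.escape(name.lower()) + r'$')


pattern = tag_pattern('infoTable')

# Illustrative tag names as they would appear after lowercasing
samples = ['infotable', 'ns1:infotable', 'n1:infotable', 'x9:infotable',
           'informationtable', 'ns1:infotablefoo']

for tag in samples:
    # search() mirrors how BeautifulSoup applies a compiled pattern to a tag name
    print(tag, bool(pattern.search(tag)))
```

The first four names match and the last two do not. The same anchoring keeps a column such as sshPrnamt from also matching sshPrnamtType.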