When I run this code, I can see that the headers list is populated with the results I want, but they are surrounded by HTML that I don't want to keep.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
# barchart.com uses javascript, so for now I need selenium to get full html
url = 'https://www.barchart.com/stocks/quotes/qqq/constituents'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
page = browser.page_source
# BeautifulSoup find table
soup = BeautifulSoup(page, 'lxml')
table = soup.find("table")
browser.quit()
# create list headers, then populate with th tagged cells
headers = []
for i in table.find_all('th'):
    title = i()
    headers.append(title)
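So I tried:
for i in table.find_all('th'):
    title = i.text()
    headers.append(title)
which returned "TypeError: 'str' object is not callable".
In some example documentation this seemed to work, but the Wikipedia table used there appears to be simpler than the one on Barchart. Any ideas?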
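As @MendelG pointed out, the error is in i.text(), because text is an attribute, not a function. Alternatively, you can use the get_text() function.
I would also suggest adding .strip() to remove extra whitespace around the text. Or, if you use get_text(), this is built in: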
title = i.get_text(strip=True)
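To make the difference concrete, here is a minimal sketch on a small hard-coded snippet. The HTML string is a made-up stand-in, not Barchart's actual markup, and html.parser is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a header row; not the real Barchart markup.
html = "<table><tr><th> Symbol </th><th>\n  Name\n</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

th = soup.find("th")
raw = th.text                       # attribute access, so no parentheses
stripped = th.text.strip()          # str.strip() removes surrounding whitespace
built_in = th.get_text(strip=True)  # get_text() is a method and can strip by itself

# The same idea applied to every header cell.
headers = [cell.get_text(strip=True) for cell in soup.find_all("th")]

print(repr(raw), repr(stripped), repr(built_in))
print(headers)
```

Both th.text.strip() and th.get_text(strip=True) give the same clean string here; which one to use is mostly a matter of taste.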
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
# barchart.com uses javascript, so for now I need selenium to get full html
url = 'https://www.barchart.com/stocks/quotes/qqq/constituents'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
page = browser.page_source
# BeautifulSoup find table
soup = BeautifulSoup(page, 'lxml')
table = soup.find("table")
browser.quit()
# create list headers, then populate with th tagged cells
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    # Or alternatively:
    # title = i.get_text(strip=True)
    headers.append(title)
print(headers)
Prints:
['Symbol', 'Name', '% Holding', 'Shares', 'Links']
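As a side note, the two attempts in the question can be reproduced on a tiny example. This is only a sketch: the markup below is hypothetical (a header cell with a nested span), chosen to show the behaviour rather than to mirror the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header cell whose text sits inside a nested <span>.
html = "<table><tr><th><span>Symbol</span></th></tr></table>"
th = BeautifulSoup(html, "html.parser").find("th")

# Calling a Tag is shorthand for find_all(), so this returns the nested tags
# (HTML and all) rather than the text.
children = th()
print(children)

# th.text is already a str, so the extra parentheses try to call that string.
try:
    th.text()
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

So i() collects the tags nested inside each header cell, which would explain the surrounding HTML, and i.text() fails because the string returned by the text attribute is not callable.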