I'm trying to work out how to extract several pieces of information from the https://www.fda.gov/Safety/Recalls/ page.
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand, reason)
How can I get the brand_link from the HTML?
I think this is the output you are expecting:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.fda.gov/Safety/Recalls/")
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        reason = item.text
        print(brand, reason)
Partial output:
N/A Undeclared Milk
Colorado Nut Company and various other private labels Undeclared milk
All Natural, Weis, generic Undeclared milk
Dilettante Chocolates Undeclared almonds
Hot Pockets Undeclared egg, milk, soy, and wheat
Figi’s Undeclared Milk
Germack Undeclared Milk
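To make the `item.find_parents()[0].select("td")[1]` chain easier to follow, here is a minimal offline sketch. The HTML snippet, its brand names, its `ucm00000x.htm` links and its column order (date, brand, product, reason) are made up for illustration and are only assumed to resemble the real page; `html.parser` is used instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the recalls table (NOT the real FDA markup).
# Assumed column order: date | brand | product | reason
html = """
<table>
  <tr><td>01/02/2018</td><td><a href="/Safety/Recalls/ucm000001.htm">Example Brand</a></td><td>Cookies</td><td>Undeclared milk</td></tr>
  <tr><td>01/03/2018</td><td><a href="/Safety/Recalls/ucm000002.htm">Other Brand</a></td><td>Salad</td><td>Listeria</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select("table td"):
    if "Undeclared" in item.text:
        # find_parents() lists ancestors nearest-first, so [0] is the enclosing <tr>
        row = item.find_parents()[0]
        assert row is item.parent          # same element as the direct parent
        brand = row.select("td")[1].text   # second cell of that row
        rows.append((brand, item.text))

print(rows)
```

In other words, the loop finds the "reason" cell first, climbs to its row, and then reads the brand from the second cell of the same row.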
If you also want the link behind the brand name, you can do the following:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.fda.gov/Safety/Recalls/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
for item in soup.select("table td"):
    if "Undeclared" in item.text:
        brand = item.find_parents()[0].select("td")[1].text
        brand_link = urljoin(url, item.find_parents()[0].select("td")[1].select("a")[0]['href'])
        reason = item.text
        print("Brand: {}\nBrand_link: {}\nReason: {}\n".format(brand, brand_link, reason))
Output:
Brand: N/A
Brand_link: https://www.fda.gov/Safety/Recalls/ucm587012.htm
Reason: Undeclared Milk
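The `urljoin` call is what turns the `href` value into a full URL. The sketch below shows how it behaves for a few possible forms of the `href`; which form the page actually uses is an assumption here, so all three are shown.

```python
from urllib.parse import urljoin

base = "https://www.fda.gov/Safety/Recalls/"

# Possible href forms (assumed for illustration); each resolves against base.
root_relative = urljoin(base, "/Safety/Recalls/ucm587012.htm")
relative = urljoin(base, "ucm587012.htm")
absolute = urljoin(base, "https://www.fda.gov/Safety/Recalls/ucm587012.htm")

print(root_relative)
print(relative)
print(absolute)
```

All three print https://www.fda.gov/Safety/Recalls/ucm587012.htm, so the code above does not need to care how the link is written in the HTML. One caveat: `select("a")[0]` raises an IndexError if a brand cell contains no link, so you may want to check that the list returned by `select("a")` is non-empty first.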