I've written a script in Python using Thread to process multiple requests concurrently and finish the scraping faster. The script is doing its job accordingly.

In short, what the scraper does: it parses all the links from the landing page leading to their target pages (where the information is stored) and scrapes the happy hours and featured special from there. The scraper keeps going until all 29 pages have been crawled.

Since there may be numerous links to deal with, I would like to limit the number of requests. However, as I don't have much idea about this, I can't find an ideal way to modify my existing script to serve the purpose.

Any help will be greatly appreciated.

This is my attempt so far:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import threading

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def get_info(link):
    # walk through all 29 listing pages
    for mlink in [link.format(page) for page in range(1, 30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text, "lxml")
        itemlinks = [urljoin(link, container.select_one("h2.name a").get("href"))
                     for container in soup.select(".profile")]
        threads = []
        # one new thread per item link -- nothing here caps how many run at once
        for ilink in itemlinks:
            thread = threading.Thread(target=fetch_info, args=(ilink,))
            thread.start()
            threads += [thread]
        for thread in threads:
            thread.join()

def fetch_info(nlink):
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text, "lxml")
    for container in soup.select(".specials"):
        try:
            hours = container.select_one("h3").text
        except Exception:
            hours = ""
        try:
            fspecial = ' '.join([item.text for item in container.select(".special")])
        except Exception:
            fspecial = ""
        print(f'{hours}---{fspecial}')

if __name__ == '__main__':
    get_info(url)
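For what it's worth, one standard-library way to cap the number of in-flight requests while keeping a threaded design is concurrent.futures.ThreadPoolExecutor. Below is a minimal sketch of the same scraper rewritten that way; max_workers=10 is an arbitrary cap I'm assuming, not a recommended value, and the parsing logic is the same as in my script above.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def fetch_info(nlink):
    # same parsing logic as the script above
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text, "lxml")
    for container in soup.select(".specials"):
        hours = container.select_one("h3").text if container.select_one("h3") else ""
        fspecial = ' '.join(item.text for item in container.select(".special"))
        print(f'{hours}---{fspecial}')

def get_info(link):
    for mlink in [link.format(page) for page in range(1, 30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text, "lxml")
        itemlinks = [urljoin(link, container.select_one("h2.name a").get("href"))
                     for container in soup.select(".profile")]
        # max_workers caps how many fetch_info calls run at the same time;
        # leaving the with-block waits for them all, replacing the manual join loop
        with ThreadPoolExecutor(max_workers=10) as executor:
            executor.map(fetch_info, itemlinks)

if __name__ == '__main__':
    get_info(url)

The executor reuses a fixed set of worker threads instead of spawning one thread per link, so the cap holds no matter how many links a page yields.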
As I'm very new to creating scrapers with multiprocessing, I was hoping for a real-world script so I could understand the logic very clearly. The site used in my script has some sort of bot protection mechanism. However, I've found a very similar webpage to apply multiprocessing to:
import requests
from multiprocessing import Pool
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://srar.com/roster/index.php?agent_search={}"

def get_links(link):
    completelinks = []
    # search pages 'a' through 'd'
    for ilink in [chr(i) for i in range(ord('a'), ord('d') + 1)]:
        res = requests.get(link.format(ilink))
        soup = BeautifulSoup(res.text, 'lxml')
        for items in soup.select("table.border tr"):
            if not items.select("td a[href^='index.php?agent']"):
                continue
            data = [urljoin(link, item.get("href"))
                    for item in items.select("td a[href^='index.php?agent']")]
            completelinks.extend(data)
    return completelinks

def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text, "lxml")
    for tr in sauce.select("table[style$='1px;'] tr"):
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    allurls = get_links(url)
    with Pool(10) as p:  # the pool size is what limits the number of simultaneous requests
        p.map(get_info, allurls)
        # note: no explicit p.join() here; map blocks until every URL is processed,
        # and calling join on a pool that hasn't been closed raises ValueError
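Since scraping is I/O-bound rather than CPU-bound, the same Pool API is also available backed by threads via multiprocessing.dummy. A minimal sketch under that assumption, reusing get_links, get_info, and url exactly as defined in the script above; the pool size remains the knob that limits concurrent requests:

from multiprocessing.dummy import Pool  # same Pool API, but backed by threads

if __name__ == '__main__':
    allurls = get_links(url)  # get_links/get_info/url as defined above
    with Pool(10) as p:       # at most 10 requests in flight at once
        p.map(get_info, allurls)

Thread workers avoid the per-process startup cost, which usually matters more than CPU parallelism when the work is waiting on the network.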