I've written a script in Python using Thread to process multiple requests concurrently and finish the scraping faster. The script is doing its job accordingly.

In short, what the scraper does: it parses all the links from the landing page leading to their target pages (where the information is stored) and scrapes the happy hours and featured special from there. The scraper keeps going until all 29 pages have been crawled.

Since there may be numerous links to deal with, I would like to limit the number of requests. However, as I don't have much idea about this, I can't find an ideal way to modify my existing script to serve the purpose.

Any help will be greatly appreciated.

This is my attempt so far:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import threading

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def get_info(link):
    # walk through all 29 listing pages
    for mlink in [link.format(page) for page in range(1, 30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text, "lxml")
        itemlinks = [urljoin(link, container.select_one("h2.name a").get("href"))
                     for container in soup.select(".profile")]
        threads = []
        # one new thread per item link -- nothing here caps how many run at once
        for ilink in itemlinks:
            thread = threading.Thread(target=fetch_info, args=(ilink,))
            thread.start()
            threads += [thread]
        for thread in threads:
            thread.join()

def fetch_info(nlink):
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text, "lxml")
    for container in soup.select(".specials"):
        try:
            hours = container.select_one("h3").text
        except Exception:
            hours = ""
        try:
            fspecial = ' '.join([item.text for item in container.select(".special")])
        except Exception:
            fspecial = ""
        print(f'{hours}---{fspecial}')

if __name__ == '__main__':
    get_info(url)
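For what it's worth, one standard-library way to cap the number of in-flight requests while keeping a threaded design is concurrent.futures.ThreadPoolExecutor. Below is a minimal sketch of the same scraper rewritten that way; max_workers=10 is an arbitrary cap I'm assuming, not a recommended value, and the parsing logic is the same as in my script above.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

url = "https://www.totalhappyhour.com/washington-dc-happy-hour/?page={}"

def fetch_info(nlink):
    # same parsing logic as the script above
    response = requests.get(nlink)
    soup = BeautifulSoup(response.text, "lxml")
    for container in soup.select(".specials"):
        hours = container.select_one("h3").text if container.select_one("h3") else ""
        fspecial = ' '.join(item.text for item in container.select(".special"))
        print(f'{hours}---{fspecial}')

def get_info(link):
    for mlink in [link.format(page) for page in range(1, 30)]:
        response = requests.get(mlink)
        soup = BeautifulSoup(response.text, "lxml")
        itemlinks = [urljoin(link, container.select_one("h2.name a").get("href"))
                     for container in soup.select(".profile")]
        # max_workers caps how many fetch_info calls run at the same time;
        # leaving the with-block waits for them all, replacing the manual join loop
        with ThreadPoolExecutor(max_workers=10) as executor:
            executor.map(fetch_info, itemlinks)

if __name__ == '__main__':
    get_info(url)

The executor reuses a fixed set of worker threads instead of spawning one thread per link, so the cap holds no matter how many links a page yields.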
As I'm very new to creating scrapers with multiprocessing, I was hoping for a real-world script so I could understand the logic very clearly. The site used in my script has some sort of bot protection mechanism. However, I've found a very similar webpage to apply multiprocessing to:
import requests
from multiprocessing import Pool
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://srar.com/roster/index.php?agent_search={}"

def get_links(link):
    completelinks = []
    # search pages 'a' through 'd'
    for ilink in [chr(i) for i in range(ord('a'), ord('d') + 1)]:
        res = requests.get(link.format(ilink))
        soup = BeautifulSoup(res.text, 'lxml')
        for items in soup.select("table.border tr"):
            if not items.select("td a[href^='index.php?agent']"):
                continue
            data = [urljoin(link, item.get("href"))
                    for item in items.select("td a[href^='index.php?agent']")]
            completelinks.extend(data)
    return completelinks

def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text, "lxml")
    for tr in sauce.select("table[style$='1px;'] tr"):
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    allurls = get_links(url)
    with Pool(10) as p:  # the pool size is what limits the number of simultaneous requests
        p.map(get_info, allurls)
        # note: no explicit p.join() here; map blocks until every URL is processed,
        # and calling join on a pool that hasn't been closed raises ValueError
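Since scraping is I/O-bound rather than CPU-bound, the same Pool API is also available backed by threads via multiprocessing.dummy. A minimal sketch under that assumption, reusing get_links, get_info, and url exactly as defined in the script above; the pool size remains the knob that limits concurrent requests:

from multiprocessing.dummy import Pool  # same Pool API, but backed by threads

if __name__ == '__main__':
    allurls = get_links(url)  # get_links/get_info/url as defined above
    with Pool(10) as p:       # at most 10 requests in flight at once
        p.map(get_info, allurls)

Thread workers avoid the per-process startup cost, which usually matters more than CPU parallelism when the work is waiting on the network.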