Below I have code that pulls records from craigslist. Everything works great, but I need to be able to go to the next set of records and repeat the same process; being new to programming, though, I'm stuck. From looking at the page code, it appears that I should be clicking the arrow button contained in the span here until it contains no href:
<a href="/search/syp?s=120" class="button next" title="next page">next > </a>
I was thinking that maybe this was a loop within a loop, but I suppose it could also be a try/except situation. Does that sound right? How would you implement it?
import requests
from urllib.request import urlopen
import pandas as pd

response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")

soup = BeautifulSoup(response.text, "lxml")

listings = soup.find_all('li', class_="result-row")

base_url = 'https://nh.craigslist.org/d/computer-parts/search/'
next_url = soup.find_all('a', class_="button next")

dates = []
titles = []
prices = []
hoods = []

while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles': titles, 'Price': prices, 'Location': hoods})

#write to a file
listings_df.to_csv("craigslist_listings.csv")
For each page you scrape, you can find the next url to crawl and add it to a list.

This is how I would do it, without changing your code too much. I added some comments so you can see what's going on, but leave me a comment if you need any extra explanation:
import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'

urls = []
urls.append(base_url)

dates = []
titles = []
prices = []
hoods = []

while len(urls) > 0:  # while we have urls to crawl
    print(urls)
    url = urls.pop(0)  # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    next_url = soup.find('a', class_="button next")  # finds the next url to crawl
    if next_url:  # if a next button was found
        urls.append(base_search_url + next_url['href'])  # adds the next url to the list of urls to crawl

    listings = soup.find_all('li', class_="result-row")  # get all current url listings

    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles': titles, 'Price': prices, 'Location': hoods})

#write to a file
listings_df.to_csv("craigslist_listings.csv")
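One detail worth noting, tying back to your own observation that the arrow button stops carrying an href on the last page: the if next_url check above only tests that the tag exists. If Craigslist were to render a next button with an empty or missing href on the final page, the loop would append a bad url. A slightly stricter guard (a minimal sketch, using a hypothetical next_button variable and assuming the last page behaves the way you described) would be:

next_button = soup.find('a', class_="button next")
# follow the link only if the tag exists AND actually carries a non-empty href;
# Tag.get('href') returns None instead of raising when the attribute is absent
if next_button and next_button.get('href'):
    urls.append(base_search_url + next_button['href'])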
Edit: you also forgot to import BeautifulSoup in your code; I added it in my answer.

Edit2: you only need to find the first instance of the next button, since the page can (and in this case does) contain more than one next button.

Edit3: in order to crawl computer parts, base_url should be changed to the one present in this code.
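To make Edit2 concrete: find_all returns a list of every matching tag, while find returns just the first match (or None), which is why the loop above uses find for the next button. A minimal sketch with hypothetical variable names:

all_next = soup.find_all('a', class_="button next")  # every "next" button on the page (a list)
first_next = soup.find('a', class_="button next")    # only the first one, or None if there is no match
# find(...) behaves like find_all(..., limit=1)[0] whenever at least one match exists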