I am trying to scrape a Japanese website by following some simple tutorials online, but I cannot get any information from the site. Here is my code:
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = requests.get(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'lxml')
for i in soup.findAll('data payments'):
    print(i.text)
What I want to get is the following part:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
I would like to print the payment label, which is 賃料, and the price, which is 7.3万円.
Expected output (string):
"payment: 賃料 7.3万円"
Edit:
import requests
wiki = "https://www.athome.co.jp/"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki,headers=headers)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'lxml')
print(soup.decode('utf-8', 'replace'))
In your latest version of the code, you decode the soup, which means you will no longer be able to use BeautifulSoup functions such as find and find_all on the result. But we will come back to that later.
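As an aside, if the decoding was only meant to let you read the HTML, a small sketch like the one below (reusing the URL and User-Agent from your edit) keeps the soup object intact: prettify() gives you a printable string, while find and find_all still work on soup.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
page = requests.get("https://www.athome.co.jp/", headers=headers)
soup = BeautifulSoup(page.content, 'lxml')

# prettify() returns a nicely indented string that is only used for inspection...
print(soup.prettify())

# ...while soup itself is still a BeautifulSoup object, so this keeps working:
print(soup.find('meta'))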
After getting the soup, you can print it, and you will see (only the key part is shown):
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="10; url=/distil_r_captcha.html?requestId=2ac19293-8282-4602-8bf5-126d194a4827&httpReferrer=%2Fchintai%2F1001303243%2F%3FDOWN%3D2%26BKLISTID%3D002LPC%26sref%3Dlist_simple%26bi%3Dtatemono" http-equiv="refresh"/>
This means you did not get the elements you need, because you were detected as a crawler.
So this is what is missing from @KunduK's answer; it has nothing to do with the find function.
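Before fixing anything, it can help to verify this diagnosis in code. The check below is only a sketch based on the block page shown above (the distil_r_captcha marker and the ROBOTS meta tag are specific to this response, not a general rule):
import requests
from bs4 import BeautifulSoup

wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = requests.get(wiki)
soup = BeautifulSoup(page.content, 'lxml')

# The blocked response redirects to a captcha page and carries a NOINDEX, NOFOLLOW robots tag;
# the real listing page does not.
blocked = 'distil_r_captcha' in page.text or soup.find('meta', attrs={'name': 'ROBOTS'}) is not None
print('Detected as a crawler:', blocked)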
First, you need to make your Python script look less like a crawler.
Headers are the most common way crawlers get detected. With plain requests, once you create a session, you can check its default headers like this:
>>> s = requests.session()
>>> print(s.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
As you can see, the default headers here tell the server that you are a crawler, namely python-requests/2.22.0. Therefore, you need to modify the User-Agent by updating the headers.
s = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
However, when testing the crawler, it was still detected as a crawler. Therefore, we need to dig further into the headers. (It could also be caused by something else, such as IP blocking or cookies; I will mention those later.)
In Chrome, open the developer tools and then open the website. (To pretend this is your first visit, it is best to clear the cookies first.) After clearing the cookies, refresh the page. In the Network tab of the developer tools you can see that Chrome sends a lot of requests.
By clicking on the first entry, i.e. https://www.athome.co.jp/, you can see a detailed panel on the right, where Request Headers are the headers Chrome generated to request the target site's server.
To make sure everything works fine, you could simply add everything from these Chrome headers to your crawler, and the server can no longer tell whether it is talking to the real Chrome or to a crawler. (This works for most sites, but I have also found some sites with strange settings that require a special header in every request.)
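As a sketch of what "copy everything Chrome sends" might look like in code, assuming the values below were read from the Request Headers panel of your own developer tools (they are illustrative, not canonical):
import requests

s = requests.session()
s.headers.update({
    # Values copied from the Request Headers panel in Chrome's developer tools (illustrative).
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
})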
I have already worked out that after adding accept-language, the website's anti-crawler check will let you pass.
Therefore, putting it all together, you need to update your headers like this.
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
For an explanation of cookies, you can refer to the wiki. There is an easy way to obtain the cookie. First, initialize a session and update the headers, as I mentioned above. Second, request the page https://www.athome.co.jp; once you get the page, you will obtain a cookie issued by the server.
s.get(url='https://www.athome.co.jp')
The advantage of requests.session is that the session maintains the cookies for you, so your next request will use this cookie automatically.
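To make that advantage concrete, here is a minimal sketch (reusing the headers, wiki, and s variables from this answer) contrasting plain requests calls, where the cookies have to be forwarded by hand, with a session, which does it automatically:
# Without a session you would have to forward the cookies yourself:
first = requests.get('https://www.athome.co.jp', headers=headers)
page = requests.get(wiki, headers=headers, cookies=first.cookies)

# With a session, the cookies returned by the first response are stored in s.cookies
# and sent automatically with every later request:
s.get('https://www.athome.co.jp')
page = s.get(wiki)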
You can just check the cookie you obtained by using this:
print(s.cookies)
And my result is:
<RequestsCookieJar[Cookie(version=0, name='athome_lab', value='ffba98ff.592d4d027d28b', port=None, port_specified=False, domain='www.athome.co.jp', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1884177606, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
You do not need to parse this page, because you just want the cookie rather than the content.
You can just use the session you obtained to request the wiki page you mentioned.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
And now everything you want will be returned by the server, and you can parse it with BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
After getting the content you want, you can use BeautifulSoup to get the target element.
soup.find('dl', attrs={'class': 'data payments'})
And what you will get is:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
And you can extract the information you want from it.
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
To format it as a single line:
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
That is everything. I will paste the complete code below.
# Import packages you want.
import requests
from bs4 import BeautifulSoup
# Initiate a session and update the headers.
s = requests.session()
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
# Get the homepage of the website and get cookies.
s.get(url='https://www.athome.co.jp')
"""
# You might need to use the following part to check if you have successfully obtained the cookies.
# If not, you might be blocked by the anti-cralwer.
print(s.cookies)
"""
# Get the content from the page.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
# Parse the webpage for getting the elements.
soup = BeautifulSoup(page.content, 'html.parser')
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
# Print the result.
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
There is still a long way to go in the crawler field.
It is best to keep learning online and make full use of the developer tools in your browser.
You may need to determine whether the content is loaded by JavaScript, or whether it sits inside an iframe (see the sketch at the end of these notes).
More importantly, you may be detected as a crawler and blocked by the server. Anti-anti-crawler techniques can only be picked up by coding more often.
I suggest you start with a simpler website that has no anti-crawler features.
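For the iframe and JavaScript checks mentioned above, here is a minimal sketch, assuming you already have a BeautifulSoup object named soup for the page you are inspecting:
# If the data sits inside an iframe, you have to request the iframe's src URL separately.
for frame in soup.find_all('iframe'):
    print('iframe source:', frame.get('src'))

# If the page has almost no visible text but many <script> tags, the content is most likely
# rendered by JavaScript, and plain requests will never receive it.
print('visible text length:', len(soup.get_text(strip=True)))
print('number of script tags:', len(soup.find_all('script')))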