from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.mocky.io/v2/5e34780e3000008c00d964dd').text
soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
print(soup)
Output:
<textarea cols="100" name="olidata" readonly="" rows="40"><?xml version="1.0" encoding="UTF-8"?>
<EVENT spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">
<?xml version="1.0" encoding="UTF-8"?>
<event spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">
<tasklistoli>
<bpid>
<oid>B32028040:M11</oid>
<type>MIGOPT1</type>
</bpid>
<oli>
<tolicontrol>
<oliid>1</oliid>
<externalid1></externalid1>
<externalid2></externalid2>
<highlevelstatus>1</highlevelstatus>
<status>550</status>
<catalogue>14</catalogue>
<errorcode>500220</errorcode>
<errorstring>Unable to select the given SI COMP for deletion.</errorstring>
<subscriptionid></subscriptionid>
<activityid></activityid>
<activityaccesscode></activityaccesscode>
<dateofnetworkexecution></dateofnetworkexecution>
</tolicontrol>
<toli_1>
<discriminator>29</discriminator>
<tmigopt>
This gives me nicely spaced, structured XML (type = bs4.BeautifulSoup).
Now, if I use
print(soup.text)
Output:
<?xml version="1.0" encoding="UTF-8"?>\r\n<EVENT spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">\r\n \r\n\r\n \r\n \r\n B32028040:M11\r\n MIGOPT1\r\n \r\n \r\n \r\n 1\r\n \r\n \r\n 1\r\n 550\r\n 14\r\n 500220\r\n Unable to select the given SI COMP for deletion.\r\n \r\n \r\n \r\n \r\n \r\n \r\n 29\r\n \r\n \r\n \r\n 524742\r\n 40193375\r\n \r\n \r\n 40003859\r\n MOB\r\n o2UniteBasicService\r\n O2P0058\r\n 2018-05-08 00:00:00\r\n \r\n \r\n \r\n N\r\n \r\n \r\n \r\n 2014-07-09 00:00:00\r\n 0\r\n \r\n \r\n O2O0014\r\n \r\n 524742\r\n SIM
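This gives me badly unstructured data (type = str): the tags are gone and only the text and whitespace remain.
I plan to run regular expressions on the text, but I need the data in a proper form first. Please help.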
I still don't understand exactly what you need. Let me give an example and see whether it helps you.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.mocky.io/v2/5e34780e3000008c00d964dd')
doc = SimplifiedDoc(html)
print (doc.event.text) # Output the text in the event tag
print ('-'*50)
# Traverse all nodes
def test(ele):
    if isinstance(ele,list):
        for e in ele:
            test(e)
        return
    children = ele.children
    if children:
        for e in children:
            test(e)
    else:
        print (ele.tag,ele.text)
test(doc.event.children)
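If you would rather avoid a third-party library, the same leaf-node traversal can be sketched with the standard library's xml.etree.ElementTree. This is only a minimal sketch: it assumes you have already pulled a well-formed XML string out of the textarea, and SAMPLE_XML and leaf_pairs are hypothetical names, with the sample shortened from the tags shown in the question's output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modeled on the tag names in the question's output.
# The real response would first need to be reduced to a well-formed XML string.
SAMPLE_XML = """<event>
  <tasklistoli>
    <bpid>
      <oid>B32028040:M11</oid>
      <type>MIGOPT1</type>
    </bpid>
    <oli>
      <tolicontrol>
        <oliid>1</oliid>
        <externalid1></externalid1>
        <errorcode>500220</errorcode>
        <errorstring>Unable to select the given SI COMP for deletion.</errorstring>
      </tolicontrol>
    </oli>
  </tasklistoli>
</event>"""


def leaf_pairs(xml_text):
    """Return (tag, text) for every element that has no child elements."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():  # document order, starting at the root
        if len(elem) == 0:  # no child elements: this is a leaf node
            pairs.append((elem.tag, (elem.text or "").strip()))
    return pairs


pairs = leaf_pairs(SAMPLE_XML)
for tag, text in pairs:
    print(tag, text)
```

Keeping each value paired with its tag name means you may not need regular expressions at all; if you still do, you can apply them to individual values rather than to the flattened text.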