Beautiful Soup: XML tags (<>) turn into very unstructured text when converted to text with .text

omkar patil

from bs4 import BeautifulSoup
import requests

source = requests.get('http://www.mocky.io/v2/5e34780e3000008c00d964dd').text  
soup = BeautifulSoup(source)
print(soup)

Output:

 <textarea cols="100" name="olidata" readonly="" rows="40">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;EVENT spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent"&gt;
  <?xml version="1.0" encoding="UTF-8"?>
<event spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">
  <tasklistoli>
    <bpid>
      <oid>B32028040:M11</oid>
      <type>MIGOPT1</type>
    </bpid>
    <oli>
      <tolicontrol>
        <oliid>1</oliid>
        <externalid1></externalid1>
        <externalid2></externalid2>
        <highlevelstatus>1</highlevelstatus>
        <status>550</status>
        <catalogue>14</catalogue>
        <errorcode>500220</errorcode>
        <errorstring>Unable to select the given SI COMP for deletion.</errorstring>
        <subscriptionid></subscriptionid>
        <activityid></activityid>
        <activityaccesscode></activityaccesscode>
        <dateofnetworkexecution></dateofnetworkexecution>
      </tolicontrol>
      <toli_1>
        <discriminator>29</discriminator>
        <tmigopt>

This gives me well-spaced, structured XML (type = bs4.BeautifulSoup).

Now, if I use

    print(soup.text)

Output:

<?xml version="1.0" encoding="UTF-8"?>\r\n<EVENT spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">\r\n  \r\n\r\n  \r\n    \r\n      B32028040:M11\r\n      MIGOPT1\r\n    \r\n    \r\n      \r\n        1\r\n        \r\n        \r\n        1\r\n        550\r\n        14\r\n        500220\r\n        Unable to select the given SI COMP for deletion.\r\n        \r\n        \r\n        \r\n        \r\n      \r\n      \r\n        29\r\n        \r\n          \r\n            \r\n              524742\r\n              40193375\r\n              \r\n              \r\n              40003859\r\n              MOB\r\n              o2UniteBasicService\r\n              O2P0058\r\n              2018-05-08 00:00:00\r\n              \r\n              \r\n              \r\n              N\r\n              \r\n              \r\n              \r\n              2014-07-09 00:00:00\r\n              0\r\n              \r\n              \r\n                O2O0014\r\n                \r\n                524742\r\n                SIM

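This gives me very badly structured data (type = str): every tag is gone and only the text and whitespace between the tags remain.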
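
The difference between the two outputs can be reproduced on a tiny inline sample. The sketch below assumes the `html.parser` backend and a shortened fragment of the document above; `.text` (an alias for `get_text()`) keeps only the text nodes, while `str(soup)` and `soup.prettify()` both return a `str` that still contains the tags.

```python
from bs4 import BeautifulSoup

# Shortened fragment of the XML shown above (assumption: html.parser backend).
markup = "<bpid><oid>B32028040:M11</oid><type>MIGOPT1</type></bpid>"
soup = BeautifulSoup(markup, "html.parser")

text_only = soup.get_text()  # same as soup.text: text nodes only, tags dropped
with_tags = str(soup)        # markup serialized back to a str, tags kept
pretty = soup.prettify()     # also a str, one tag per line with indentation

print(type(text_only).__name__, repr(text_only))
print(type(with_tags).__name__, repr(with_tags))
print(pretty)
```

So if a plain string with the structure intact is what is needed, `str(soup)` or `soup.prettify()` may be a better starting point than `.text`.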

I plan to run regular expressions on the text, but I need properly structured data for that. Please help.
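
If the goal is to pull individual values out of the document, a regular expression may not be needed at all, because the parsed tree can be queried by tag name. The following is a minimal sketch under assumptions: the sample is a shortened copy of the output above, `html.parser` is used as the backend, and `field` is a hypothetical helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the output shown above, not the full document.
SAMPLE = """
<event spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">
  <tasklistoli>
    <bpid>
      <oid>B32028040:M11</oid>
      <type>MIGOPT1</type>
    </bpid>
    <oli>
      <tolicontrol>
        <oliid>1</oliid>
        <status>550</status>
        <errorcode>500220</errorcode>
        <errorstring>Unable to select the given SI COMP for deletion.</errorstring>
      </tolicontrol>
    </oli>
  </tasklistoli>
</event>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")


def field(tree, name):
    """Return the stripped text of the first tag called `name`, or None if absent."""
    tag = tree.find(name)
    return tag.get_text(strip=True) if tag is not None else None


error_code = field(soup, "errorcode")
error_string = field(soup, "errorstring")
missing = field(soup, "nosuchtag")  # hypothetical tag name that is not in the sample

print(error_code, "-", error_string)
print(missing)
```

Looking fields up by name keeps each value tied to its tag, which is the information that `.text` throws away.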

dabingsou

I still don't understand exactly what you need. Here is an example; see whether it helps.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = req.get('http://www.mocky.io/v2/5e34780e3000008c00d964dd')
doc = SimplifiedDoc(html)
print(doc.event.text)  # Output the text in the event tag
print('-' * 50)


# Traverse all nodes and print the tag and text of each leaf
def test(ele):
    if isinstance(ele, list):
        for e in ele:
            test(e)
        return
    children = ele.children
    if children:
        for e in children:
            test(e)
    else:
        print(ele.tag, ele.text)


test(doc.event.children)
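
The same leaf-by-leaf traversal can also be sketched with the standard library alone. This is not the method used above but a swapped-in alternative based on `xml.etree.ElementTree`; it assumes the XML is already available as a well-formed string, uses the lowercase tag names from the printed output, and the function name `walk` is chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the question's output, not the full document.
SAMPLE = """<event spec="IDL:o2bcs/automator/common/tasklistEvents:1.0#tasklistupdateevent">
  <tasklistoli>
    <bpid>
      <oid>B32028040:M11</oid>
      <type>MIGOPT1</type>
    </bpid>
    <oli>
      <tolicontrol>
        <oliid>1</oliid>
        <status>550</status>
        <errorstring>Unable to select the given SI COMP for deletion.</errorstring>
      </tolicontrol>
    </oli>
  </tasklistoli>
</event>"""


def walk(element, path=""):
    """Yield (path, text) for every leaf element at or below `element`."""
    current = f"{path}/{element.tag}" if path else element.tag
    children = list(element)
    if children:
        # Not a leaf: descend into each child, extending the path.
        for child in children:
            yield from walk(child, current)
    else:
        # Leaf: report its full path and its text (empty string if none).
        yield current, (element.text or "").strip()


rows = list(walk(ET.fromstring(SAMPLE)))
for path, text in rows:
    print(path, "=", text)
```

Carrying the path along makes each value unambiguous even when the same tag name occurs in several branches.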

Edited on 2021-01-22