Python, Pandas, XML: Split an XML element based on its length

大众汽车

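I am parsing an XML file in which one of the child elements occasionally contains more than 4000 characters. When that happens, I would like to create a second element to hold the overflow characters before saving everything to a pandas DataFrame. Once the DataFrame is built, it is exported to Excel (I know how to do that part).

Alternatively, while parsing, when the text has more than 4000 characters, I could dynamically create a new DataFrame column to hold the extra data (I think this is the better solution, since the data is exported to Excel for reporting).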

import re
import pandas as pd
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
from bs4 import BeautifulSoup

def getvalueofnode(node):
    # Missing elements and empty elements both yield None
    if node is None or node.text is None:
        return None
    soup = BeautifulSoup(node.text, "html.parser")  # strip embedded markup / js keywords
    text = soup.get_text()
    text = text.replace("\n", " ")  # remove newline
    text = text.replace("\r", " ")  # remove carriage return
    # str.replace(' +', ' ') only matches a literal " +", so use a regex
    text = re.sub(r" +", " ", text)  # collapse duplicate spaces
    return text

parsedXML = et.parse(filename)  # filename: path to the XML file
dfcols = ['datarec', 'casekey', 'description', 'narative']
rows = []

for node in parsedXML.getroot():
    # With the sample XML each node is already a DATA_RECORD, so this is None
    datarec = node.find('DATA_RECORD')
    casekey = node.find('CASE_KEY')
    description = node.find('DESCRIPTION')
    narative = node.find('CASE_NARRATIVE')

    rows.append([datarec, getvalueofnode(casekey), getvalueofnode(description), getvalueofnode(narative)])

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
df_xml = pd.DataFrame(rows, columns=dfcols)
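  1. The headers are not that important, so I don't think I need to define the df column names. If the count exceeds 4000, I would create a new column dynamically (but what happens at 8000 or 12000?)
  2. I am thinking the way to go is to fix the XML before building the DataFrame. If so, how do I split the text at 4000 characters and create a new element?
     2.1 If I do create a new element, I am not sure whether my getvalueofnode function will return all rows.

Which route should I take?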

EDIT ------ Sample XML

<?xml version="1.0" ?>
<!DOCTYPE main [
  <!ELEMENT main (DATA_RECORD*)>
  <!ELEMENT DATA_RECORD (CASE_KEY,DESCRIPTION?,CASE_NARRATIVE?)+>
  <!ELEMENT CASE_KEY (#PCDATA)>
  <!ELEMENT DESCRIPTION (#PCDATA)>
  <!ELEMENT CASE_NARRATIVE (#PCDATA)>
]>
<main>
  <DATA_RECORD>
    <CASE_KEY>6479351</CASE_KEY>
    <DESCRIPTION>Four bill payments</DESCRIPTION>
    <CASE_NARRATIVE>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque accumsan congue risus, tristique imperdiet sapien consectetur nec. Donec ut urna lectus. Duis eget magna et quam aliquet porta non vitae enim. Proin diam ex, ullamcorper in lectus ac, cursus sollicitudin ipsum. Sed lorem urna, congue et condimentum in, rhoncus id nunc. Duis vel mauris pharetra, accumsan neque non, pellentesque leo. Nullam vel nibh vulputate, eleifend turpis condimentum, faucibus mi. Sed mattis dolor non libero scelerisque, in congue ligula ullamcorper. In finibus laoreet erat et venenatis. Aenean tincidunt magna a nisl euismod posuere tristique eget orci. Vestibulum ac turpis vel justo laoreet fermentum rutrum eget est. In hac habitasse platea dictumst. Aenean blandit at leo vel pharetra. Duis vel commodo orci. 

        Praesent tincidunt mattis suscipit. Nam aliquet purus eu nibh ultrices, ac tristique risus euismod. Sed bibendum tincidunt elit, a finibus arcu bibendum at. Praesent turpis neque, auctor at dui ut, cursus rhoncus tortor. Cras rutrum, lacus et molestie posuere, odio purus porta nisi, vel egestas nulla nibh accumsan erat. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Integer imperdiet, ligula ac iaculis iaculis, augue massa dapibus neque, sit amet iaculis orci nibh quis libero. Phasellus tortor ligula, luctus non mi quis, consequat dapibus risus. Vestibulum nec finibus ex. Duis ipsum nisl, tincidunt in erat rhoncus, pulvinar consequat tortor. Curabitur faucibus interdum metus. Morbi egestas ipsum ac rutrum faucibus. Maecenas non leo sem. 

        In ultrices, libero ut sagittis blandit, ex dolor pretium nibh, ac bibendum ligula nunc sed quam. In ultricies, arcu aliquam porta pharetra, orci mauris imperdiet lectus, a facilisis purus purus at sem. Nullam ac feugiat nulla. Duis congue lorem sit amet tellus varius ultrices. Curabitur risus mauris, rutrum ut sodales tempor, varius eget lectus. In eget hendrerit ligula, ac mollis mi. Nulla volutpat felis ornare elit facilisis dapibus. Fusce facilisis nisi est, eget gravida lorem aliquam nec. Ut sed purus sit amet mi sodales vestibulum id sit amet purus. Ut in vestibulum purus. Donec eget enim ipsum. Mauris eget neque neque. Pellentesque feugiat faucibus felis, quis tincidunt nisl. 

        In viverra posuere nulla sed cursus. Praesent nec rutrum enim, et gravida lorem. Fusce gravida lorem quam. Interdum et malesuada fames ac ante ipsum primis in faucibus. Morbi at aliquam lacus. Nulla suscipit nibh eu congue finibus. Phasellus et sem non dolor tempus aliquam. Ut tincidunt elit erat, varius molestie lacus mattis feugiat. Ut lectus ex, suscipit non condimentum sit amet, condimentum vitae sem. Donec et scelerisque leo. 

        Suspendisse velit nisl, suscipit quis metus ac, suscipit sollicitudin libero. Nulla euismod lectus sit amet congue efficitur. Fusce a sagittis magna, ut fringilla mi. Ut suscipit lectus quis luctus euismod. Sed at dui fermentum, tincidunt risus sit amet, pretium diam. Etiam eleifend varius urna nec volutpat. Nam efficitur tellus non volutpat consequat. Mauris ut elit enim. Pellentesque sit amet tincidunt metus. Nam ornare massa quis libero fermentum sagittis. Sed facilisis turpis dolor, eget mattis lectus laoreet eu. 

        Aliquam egestas leo mauris, non placerat dolor euismod eu. Proin eget convallis augue. Suspendisse elit ante, ornare at augue sit amet, molestie elementum leo. Duis id leo in odio consequat auctor. Duis commodo elementum velit, porttitor blandit libero luctus commodo. Nulla in libero vel libero varius faucibus a non tellus. Pellentesque dapibus eget lectus id fringilla. Sed vitae nisi nisi. Sed ultricies orci vitae sapien ultrices, nec ornare tortor placerat. Vestibulum et ligula tristique, rhoncus dolor in, semper lorem. Integer non urna nec risus convallis pharetra. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vitae ullamcorper leo. Suspendisse potenti. 

        Sed congue, mi rutrum placerat bibendum, erat tortor finibus lorem, eget varius velit lacus ut mauris. Nullam congue placerat mollis. Duis et fringilla nunc, id dictum enim. Morbi non gravida nisi. In nec nunc ante. In vitae odio accumsan, imperdiet lectus a, egestas sapien. In sit amet elit pharetra, scelerisque turpis a, tincidunt nisl. Curabitur tempus eu risus et vulputate. Fusce iaculis diam quis nibh viverra, pulvinar fringilla massa fermentum. Proin elementum in felis sed rutrum. Etiam eget elit vitae turpis ultrices auctor lobortis a erat. Duis fermentum tristique consectetur. Fusce quis est tincidunt, ultricies erat a, pharetra est. 

        Nullam ac velit et ipsum cursus sodales. Pellentesque consequat quis dui ac aliquam. Suspendisse libero turpis, porttitor quis malesuada ut, interdum ac dui. Phasellus varius suscipit tristique. Praesent vel ante vel augue pellentesque tempus. Pellentesque volutpat finibus lorem, non malesuada nisi imperdiet eget. Proin dignissim mi non lorem imperdiet, sit amet mattis neque sodales. Aliquam erat volutpat. Phasellus non nisl metus. 
    </CASE_NARRATIVE>
  </DATA_RECORD>
  <DATA_RECORD>
    <CASE_KEY>6479356</CASE_KEY>
    <DESCRIPTION>Financial Crime Concern</DESCRIPTION>
  </DATA_RECORD>
  <DATA_RECORD>
    <CASE_KEY>6480409</CASE_KEY>
    <DESCRIPTION>Financial Crime Concern :M&#38;S customer was cold called by someone about an investment opportunity, the caller gave customer different options and she chose 3 to invest in. She was unaware of the scam until she was contacted by the police. There is a seperate scion case re the police notification</DESCRIPTION>
    <CASE_NARRATIVE>&#60;p Lorum Ipsum</CASE_NARRATIVE>
  </DATA_RECORD>
  <DATA_RECORD>
    <CASE_KEY>6480519</CASE_KEY>
    <DESCRIPTION>Financial Crime </DESCRIPTION>
    <CASE_NARRATIVE>fraudster had set up two new payments and created </CASE_NARRATIVE>
  </DATA_RECORD>
  <DATA_RECORD>
    <CASE_KEY>6480521</CASE_KEY>
    <DESCRIPTION>Triage Europe</DESCRIPTION>
    <CASE_NARRATIVE>Mr. Ockwell is a HB</CASE_NARRATIVE>
  </DATA_RECORD>
</main>
完善

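Instead of BeautifulSoup, consider using lxml to run the sibling technologies XSLT and XPath:

  • XSLT can transform the original XML to add an OVERFLOW element, using the functions substring() and string-length().

  • XPath can parse the new, transformed tree so that its values can be mapped to a pandas DataFrame with a loop or a list/dictionary comprehension.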

XSLT (save as an .xsl file, which is itself a well-formed XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="DATA_RECORD">
    <xsl:copy>
        <xsl:apply-templates select="CASE_KEY|DESCRIPTION"/>
        <CASE_NARRATIVE>
            <xsl:value-of select="substring(normalize-space(CASE_NARRATIVE), 1, 4000)"/>
        </CASE_NARRATIVE>
        <OVERFLOW>
            <xsl:value-of select="substring(normalize-space(CASE_NARRATIVE), 4001, 
                                            string-length(normalize-space(CASE_NARRATIVE)))"/>
        </OVERFLOW>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
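
For readers less familiar with XPath string functions, the stylesheet's logic can be mirrored roughly in plain Python. This is only an illustrative sketch using the standard library's xml.etree.ElementTree, not a replacement for the XSLT: the helper name add_overflow and the miniature input are hypothetical, and limit=10 stands in for 4000 to keep the demo readable. Unlike the stylesheet, it leaves the child elements in their original order.

```python
import xml.etree.ElementTree as ET

def add_overflow(root, limit=4000):
    """Truncate CASE_NARRATIVE to `limit` characters and store the rest in OVERFLOW."""
    for record in root.findall("DATA_RECORD"):
        narrative = record.find("CASE_NARRATIVE")
        if narrative is None:
            # Mirror the stylesheet, which always emits a CASE_NARRATIVE element
            narrative = ET.SubElement(record, "CASE_NARRATIVE")
        # Python counterpart of XPath normalize-space(): trim and collapse whitespace
        text = " ".join((narrative.text or "").split())
        # XPath substring() is 1-based, so substring(s, 1, 4000) is s[:4000]
        # and substring(s, 4001, ...) is s[4000:]
        narrative.text = text[:limit]
        ET.SubElement(record, "OVERFLOW").text = text[limit:]
    return root

# Hypothetical miniature input, not the real file
sample = """<main>
  <DATA_RECORD><CASE_KEY>1</CASE_KEY><CASE_NARRATIVE>
      Lorem   ipsum dolor sit amet
  </CASE_NARRATIVE></DATA_RECORD>
  <DATA_RECORD><CASE_KEY>2</CASE_KEY></DATA_RECORD>
</main>"""

root = add_overflow(ET.fromstring(sample), limit=10)
rows = [{el.tag: el.text for el in rec} for rec in root.findall("DATA_RECORD")]
print(rows)
```

The main point is the index shift: XPath counts characters from 1, while Python slices count from 0.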

Python (includes a short list-comprehension version and a long loop version)

import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL FILES (et.XSLT expects a parsed tree, not a file name)
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(xml)

# SHORT VERSION
data = [{el.tag: el.text for el in dr.xpath("*")} for dr in result.xpath("//DATA_RECORD")]

# LONG VERSION    
data = []
for dr in result.xpath("//DATA_RECORD"):
    inner = {}
    for el in dr.xpath("*"):
        inner[el.tag] = el.text 
    data.append(inner)

df = pd.DataFrame(data)

Output

print(df)
#   CASE_KEY                                     CASE_NARRATIVE                                        DESCRIPTION                                           OVERFLOW
# 0  6479351  Lorem ipsum dolor sit amet, consectetur adipis...                                 Four bill payments  s velit lacus ut mauris. Nullam congue placera...
# 1  6479356                                               None                            Financial Crime Concern                                               None
# 2  6480409                                     <p Lorum Ipsum  Financial Crime Concern :M&S customer was cold...                                               None
# 3  6480519  fraudster had set up two new payments and created                                   Financial Crime                                                None
# 4  6480521                                Mr. Ockwell is a HB                                      Triage Europe                                               None
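
The stylesheet above produces a single OVERFLOW element, so a narrative longer than 8000 characters would still leave more than 4000 characters in that one column. One possible way to cover the 8000 and 12000 cases raised in the question is to cut the text into fixed-size chunks and give each chunk its own numbered column. The sketch below uses only plain Python; the helper name split_into_columns, the numbered column names, and the sample records are assumptions made for illustration, and limit=5 stands in for 4000.

```python
def split_into_columns(text, prefix="CASE_NARRATIVE", limit=4000):
    """Map fixed-size chunks of `text` to numbered column names."""
    text = text or ""
    # Slice every `limit` characters; fall back to one empty chunk for empty text
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)] or [""]
    return {f"{prefix}_{n}": chunk for n, chunk in enumerate(chunks, start=1)}

# Hypothetical records standing in for the parsed XML
records = [
    {"CASE_KEY": "6479351", "CASE_NARRATIVE": "abcdefghijkl"},
    {"CASE_KEY": "6479356", "CASE_NARRATIVE": None},
]

rows = []
for rec in records:
    row = {"CASE_KEY": rec["CASE_KEY"]}
    row.update(split_into_columns(rec["CASE_NARRATIVE"], limit=5))
    rows.append(row)

print(rows)
```

Passing such a list of dictionaries to pd.DataFrame(rows) creates one column per distinct key and fills the cells of shorter records with NaN, so the number of narrative columns grows with the longest record.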
