Extracting values from multiple XML nodes

Posted by Yves in Dev

Yves

I have the following data structure (the original is 2.5 GB, so it has to be parsed):

<households xmlns="http://www.matsim.org/files/dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.matsim.org/files/dtd http://www.matsim.org/files/dtd/households_v1.0.xsd">
    <household id="1473">
        <members>
            <personId refId="2714"/>
            <personId refId="2715"/>
            <personId refId="2716"/>
            <personId refId="2717"/>
            <personId refId="2718"/>
            <personId refId="2719"/>
        </members>
        <income currency="CHF" period="month">
                3094.87101
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >some</attribute>
            <attribute name="carAvailability" class="java.lang.String" >some</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >3.3</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >3094.8710104279835</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >1</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >10213.074334412346</attribute>
        </attributes>

    </household>
    <household id="2474">
        <members>
            <personId refId="4647"/>
            <personId refId="4648"/>
            <personId refId="4649"/>
            <personId refId="4650"/>
            <personId refId="4651"/>
            <personId refId="4652"/>
            <personId refId="4653"/>
            <personId refId="4654"/>
            <personId refId="4655"/>
        </members>
        <income currency="CHF" period="month">
                1602.562822
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >none</attribute>
            <attribute name="carAvailability" class="java.lang.String" >all</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >3.6999999999999997</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >1602.5628215679633</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >1</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >5929.482439801463</attribute>
        </attributes>

    </household>
    <household id="4024">
        <members>
            <personId refId="7685"/>
        </members>
        <income currency="CHF" period="month">
                61610.096619
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >none</attribute>
            <attribute name="carAvailability" class="java.lang.String" >none</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >1.0</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >61610.096618936936</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >0</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >61610.096618936936</attribute>
        </attributes>

    </household>
</households>

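I want to extract every personId refId value together with its corresponding income value. Ultimately, I plan to have a DataFrame with one column for personId and one for income (the income values will repeat). So the tricky part is not only the namespace, but also how to access the XML at different node levels.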

My approach so far has not managed to do this.

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd
import numpy as np

tree = ET.parse(gzip.open('V0_1pm/output_households.xml.gz', 'r'))
root = tree.getroot()
rows = []
for it in root.iter('household'):
    hh = it.attrib['id']
    inc = it.find('income').text
    rows.append([hh,inc])

hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
hh_inc

Any help is highly appreciated.

Valdi_Bo

Your code fails because the elements in your input have a non-empty namespace.

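One way to process namespaced XML is to:

Define a dictionary of "shortcut: namespace" pairs containing every namespace used in your XPath expressions.
Call findall or find, passing this dictionary as the second argument, and prefix the relevant names in the XPath expression with the namespace shortcut (followed by a colon as the separator).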
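
The two steps above can be sketched on a cut-down fragment of the sample data. The XML string and the variable names plain_matches and ns_matches are illustrative only; the namespace URI is the one from the question.

```python
import xml.etree.ElementTree as ET

# Cut-down fragment of the sample input (illustrative only)
xml_text = """<households xmlns="http://www.matsim.org/files/dtd">
    <household id="1473">
        <income currency="CHF" period="month">
                3094.87101
        </income>
    </household>
</households>"""

root = ET.fromstring(xml_text)

# Without namespace information, the plain tag name matches nothing
plain_matches = root.findall('household')

# Step 1: dictionary of "shortcut: namespace" pairs
ns = {'dtd': 'http://www.matsim.org/files/dtd'}

# Step 2: pass the dictionary and prefix the tag name with the shortcut
ns_matches = root.findall('dtd:household', ns)

print(len(plain_matches), len(ns_matches))
```

The shortcut name (dtd here) is arbitrary; only the URI has to match the xmlns value declared in the document.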

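Also note that find(...).text returns the full text, including newline characters and spaces. To deal with this, you should probably:

Strip the surrounding whitespace characters from the content read.
Convert it to float.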
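
As a small illustration of those two steps, the snippet below parses a single income element copied from the sample input and shows the raw text before and after cleaning. The variable names elem, raw and income are illustrative.

```python
import xml.etree.ElementTree as ET

# One <income> element, laid out as in the sample input
elem = ET.fromstring(
    '<income currency="CHF" period="month">\n'
    '                3094.87101\n'
    '        </income>'
)

raw = elem.text
print(repr(raw))            # shows the leading/trailing newlines and spaces

# Strip the surrounding whitespace, then convert to float
income = float(raw.strip())
print(income)
```

Python's float() also tolerates surrounding whitespace on its own, so the explicit strip() mainly documents the intent and keeps the cleaned string available if it is needed elsewhere.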

So change your code to:

# Namespace dictionary
ns = {'dtd': 'http://www.matsim.org/files/dtd'}
rows = []
for it in root.findall('dtd:household', ns):
    hh = it.attrib['id']
    inc = it.find('dtd:income', ns).text
    inc = float(inc.strip())
    rows.append([hh, inc])
hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
hh_inc

For your sample input, I got:

     id  PTSubscription
0  1473     3094.871010
1  2474     1602.562822
2  4024    61610.096619

Edit following the question about refId

I assume that the DataFrame should contain a separate row for each refId, with the associated id and PTSubscription.

To include refId, change the loop to:

for it in root.findall('dtd:household', ns):
    hh = it.attrib['id']
    inc = it.find('dtd:income', ns).text
    inc = float(inc.strip())
    pids = it.findall('.//dtd:personId', ns)
    for pId in pids:
        refId = pId.attrib['refId']
        rows.append([hh, inc, int(refId)])
    if not pids:
        rows.append([hh, inc, -1])

I added the last 2 instructions so as not to "lose" any household that contains no refId.

When creating the DataFrame, pass the additional column name:

hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription', 'refId'])
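
Since the question mentions a 2.5 GB source file, a streaming variant may be worth considering. The sketch below is not part of the answer above: it swaps ET.parse for ET.iterparse so that each household can be cleared once it has been processed. The function name iter_person_incomes and the in-memory sample are hypothetical, and the -1 fallback mirrors the loop above.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI from the question, in ElementTree's {uri}tag notation
NS = '{http://www.matsim.org/files/dtd}'

def iter_person_incomes(source):
    """Yield (household id, income, refId) tuples, one household at a time.

    Hypothetical helper: 'source' is a file name or a binary file object.
    """
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != NS + 'household':
            continue                      # wait until a household is complete
        hh = elem.attrib['id']
        inc = float(elem.find(NS + 'income').text.strip())
        pids = list(elem.iter(NS + 'personId'))
        for pid in pids:
            yield hh, inc, int(pid.attrib['refId'])
        if not pids:
            yield hh, inc, -1             # keep households without members
        elem.clear()                      # drop the processed children

# Small in-memory sample standing in for the real file
sample = b"""<households xmlns="http://www.matsim.org/files/dtd">
    <household id="4024">
        <members>
            <personId refId="7685"/>
        </members>
        <income currency="CHF" period="month">
                61610.096619
        </income>
    </household>
</households>"""

rows = list(iter_person_incomes(io.BytesIO(sample)))
print(rows)
```

For the real data, the source would presumably be the file object returned by gzip.open('V0_1pm/output_households.xml.gz', 'rb'), and the resulting rows can be passed to pd.DataFrame with the same three column names as above. Note that the root element still keeps the emptied household elements, so memory use is reduced rather than constant.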


Edited on 2021-08-2
