How to extract values from complex XML in R without dropping nodes where a value is missing? My loop is slow

Posted by therob in Dev

Therob

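I have a large, complex XML file and need to extract the values and attributes of certain child (sub...) nodes. Because not all child nodes contain all the wanted values (some are missing), I cannot simply use the very fast xml_find_all (package xml2): it does not include the child nodes where the value is missing.

My solution is a for loop over all XML nodes (objects) that checks inside each node whether the wanted value exists and, if so, extracts it. Thanks to the loop index I know which object the value belongs to, and I write it to the corresponding data.frame$Feature[i].

This approach works, but for my large XML it takes a very long time (20 minutes) and uses a lot of memory (about 1.5 GB, due to the if/for loop). My XML: 100 MB, about 30,000 "entries/objects", each with about 50 features (~2 million lines).

The main bottleneck I found is xpathSApply(...xml_path(Obj[i]...): it becomes very slow when the loop index [i] is high (> 5000).

My questions:

Do you have a better/simpler idea for solving my problem with a very complex, hierarchically structured XML in which not all features exist in all objects (nodes)?
I read about this interesting approach, but could not figure out how to transfer it to a very complex XML where the wanted values sit at different nodeset levels...
Is there perhaps some nested xpathSApply expression that could bypass the for loop and avoid using an index?
Do you have any "vectorized" approach for my problem (which would be faster in R)?

Please see my MWE code below, with additional comments.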

XML

<?xml version="1.0" encoding="UTF-8"?>
<featureMember>
        <Object>
                <XML_Name>Object 1</XML_Name>
               <XML_Feature1>
                   <XML_Feature1a href="URL1"></XML_Feature1a>
                </XML_Feature1>
                <XML_Feature2>
                   <XML_Feature2a>1</XML_Feature2a>
                   <XML_Feature2a>1x</XML_Feature2a>
                   <XML_Feature2a>1y</XML_Feature2a>
                </XML_Feature2>
                <XML_Feature3>
                   <XML_Feature3a>F3a_1</XML_Feature3a>
                   <XML_Feature3b>F3b_1</XML_Feature3b>
                </XML_Feature3>
                <XML_Feature3>
                   <XML_Feature3a>F3a_2</XML_Feature3a>
                   <XML_Feature3b>F3b_2</XML_Feature3b>
                </XML_Feature3>
                <XML_Feature4>F4_1</XML_Feature4>
                <XML_Feature4>F4_2</XML_Feature4>   
        </Object>       
        <Object >
            <XML_Name>Object 2</XML_Name>
               <XML_Feature1>
                   <XML_Feature1a href="URL2"></XML_Feature1a>
                </XML_Feature1>         
        </Object>       
        <Object >
        <XML_Name>Object 3</XML_Name>
            <XML_Feature1>
               <XML_Feature1>               
               </XML_Feature1>
            </XML_Feature1>
            <XML_Feature2>
                <XML_Feature2a>Value 3</XML_Feature2a>
            </XML_Feature2>
        </Object>
</featureMember>

require(xml2)
require(XML)
test_xml2 <- read_xml("above_file.xml") # using package xml2 (for xml_find_all)
test_XML <- xmlParse("above_file.xml") # using package XML (for xpathSApply)

  # XML nodeset of all Objects I want to process:
Obj <- xml_find_all(test_xml2, "//Object") # --> has 3 nodes, contains all Objects!

  # initialize a destination dataframe and fill with NAs
df <- data.frame('Name'=character(), 'f2a'=character() , 'f1a'=character(), stringsAsFactors = FALSE)
df[1:length(Obj),] <- NA

# My Initial approach to extract all features by xml_find_all (which is very fast) is not working because not all xml-nodes have all wanted xml-features:
Name <- xml_text(xml_find_all(test_xml2, "//XML_Name")) 
  # --> length(Name)=3, because all 3 Objects have a name!
f1a  <- xml_attr(xml_find_all(test_xml2, "//XML_Feature1/XML_Feature1a"),"href") 
  # --> length(f1a)=2, because XML_Feature1a is missing in Object3! 
f2a  <- xml_text(xml_find_all(test_xml2, "//XML_Feature2/XML_Feature2a")) 
  # --> length(f2a) is not 3, because XML_Feature2a is missing in Object2 and occurs three times in Object1!
# Joining these to a final df is not possible, because "Name", "f2a" and "f1a" have of course different lengths, plus correct data matching is not possible!


# Therefore I decided to make instead the following approach.
  # 1.) crawl all features which are present in all nodes, because it's fast (here: "Name"):
df$Name <- xml_text(xml_find_all(test_xml2, "//XML_Name"))

  # 2.) loop over all Objects/XML nodes of interest and check whether each wanted feature exists.
    # if yes: write to df$FeatureXY[i] (repeated elements are collapsed with " # ")
    # if not: do nothing (thus df$FeatureXY[i] stays NA from initialization)
for (i in seq_along(Obj))
{  # 1. Feature:
 tmp  <- xpathSApply(test_XML, paste0(xml_path(Obj[i]),"/XML_Feature1/XML_Feature1a"),  xmlGetAttr, "href")
 if(length(tmp) > 0) { df$f1a[i] <- paste(tmp, collapse = " # ") } # assigning a zero-length value would produce an error
    # 2. Feature:
 tmp  <- xpathSApply(test_XML, paste0(xml_path(Obj[i]),"/XML_Feature2/XML_Feature2a"),  xmlValue)
 if(length(tmp) > 0) { df$f2a[i] <- paste(tmp, collapse = " # ") }
}

# Result of df as it should be:
# Name      f2a             f1a   f3a            f3b             f4
# Object 1  1 # 1x # 1y     URL1  F3a_1 # F3a_2  F3b_1 # F3b_2   F4_1 # F4_2
# Object 2  NA              URL2  NA             NA              NA 
# Object 3  Value 3         NA    NA             NA              NA

Edit 1: extended the XML example (multiple elements for feature2a, feature3a/b and Feature4)

Dave2e

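This type of problem can be very tricky, because the solution has to cope with any variation between the sample data and the actual data. If we assume that each "Object" has at most one "Feature1a" node and at most one "Feature2a" node, the problem becomes straightforward.

First find all of the parent "Object" nodes, then parse this vector of nodes for the name, the Feature1a attribute and the Feature2a text. xml_find_first returns a value if the node exists and NA if it does not. Since xml_find_first is vectorized, it can be run on the whole vector of parent nodes without a loop, which should improve performance significantly.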
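To see the difference between the two functions in isolation, here is a minimal sketch on a made-up three-object document (the inline XML is an assumption for illustration, not the file from the question):

```r
library(xml2)

# Hypothetical document: only the first of three objects has XML_Feature1a
doc <- read_xml('<featureMember>
  <Object><XML_Feature1a href="URL1"/></Object>
  <Object/>
  <Object/>
</featureMember>')
parents <- xml_find_all(doc, ".//Object")

# xml_find_all flattens the matches: objects without the node simply vanish
all_hits <- xml_attr(xml_find_all(parents, ".//XML_Feature1a"), "href")

# xml_find_first keeps one slot per parent and fills the gaps with NA
first_hits <- xml_attr(xml_find_first(parents, ".//XML_Feature1a"), "href")

length(all_hits)
length(first_hits)
```

Because first_hits always has one entry per parent, it can go straight into a data frame column next to the names.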

library(xml2)
library(dplyr)

#Read file to process
doc<- read_xml("above_file.xml")

#find parent nodes
parents <- xml_find_all(doc, ".//Object")

#Now extract the requested data from each parent
# Notice the use of the . in the xpath. 
# //  finds anywhere in the document (ignoring the current node)
# .// finds anywhere beneath the current node
Names<- xml_find_first(parents, ".//XML_Name") %>% xml_text()
feature1 <- xml_find_first(parents, ".//XML_Feature1a") %>% xml_attr("href")

#fill features with first elements as default
feature2 <- xml_find_first(parents, ".//XML_Feature2a") %>% xml_text()
#find parents with more than one XML_Feature2a
#(counted per parent so the index lines up with `parents`, even if some parents have no XML_Feature2)
moretwos <- which(sapply(parents, function(node) length(xml_find_all(node, ".//XML_Feature2a"))) > 1)
#reparse the parent nodes with more than one child
feature2[moretwos] <- sapply(parents[moretwos], function(node){
        xml_find_all(node, ".//XML_Feature2a") %>% xml_text() %>% paste(collapse = " # ")
})


#Make combined dataframe
answer <- data.frame(Names, feature1, feature2)
answer
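When an object can hold several values of the same feature (as with XML_Feature3a and XML_Feature4 in the revised example), the same idea can be wrapped in a small helper. The following is only a sketch and not part of the original answer: collapse_matches() is a hypothetical name, the inline XML is a cut-down stand-in for the sample file, and " # " is assumed as separator because the desired output in the question uses it.

```r
library(xml2)

# Cut-down inline stand-in for the sample file (assumption: same structure)
doc <- read_xml('<featureMember>
  <Object>
    <XML_Name>Object 1</XML_Name>
    <XML_Feature3><XML_Feature3a>F3a_1</XML_Feature3a></XML_Feature3>
    <XML_Feature3><XML_Feature3a>F3a_2</XML_Feature3a></XML_Feature3>
    <XML_Feature4>F4_1</XML_Feature4>
    <XML_Feature4>F4_2</XML_Feature4>
  </Object>
  <Object>
    <XML_Name>Object 2</XML_Name>
  </Object>
</featureMember>')

parents <- xml_find_all(doc, ".//Object")

# Hypothetical helper: one string per parent, NA when nothing matches
collapse_matches <- function(nodes, xpath, sep = " # ") {
  vapply(nodes, function(node) {
    hits <- xml_text(xml_find_all(node, xpath))
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = sep)
  }, character(1))
}

result <- data.frame(
  Name = xml_text(xml_find_first(parents, ".//XML_Name")),
  f3a  = collapse_matches(parents, ".//XML_Feature3a"),
  f4   = collapse_matches(parents, ".//XML_Feature4"),
  stringsAsFactors = FALSE
)
result
```

This still iterates over the parents in R, but each step evaluates a relative XPath on the node itself instead of rebuilding an absolute path with xml_path(). Whether that is fast enough for 30,000 objects would have to be measured on the real file.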

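Here is a similar question, but with an unknown number of child nodes: Create a dataframe from xml with different number of elements

Update: for the revised question, where children can themselves have multiple children but there are no grandchildren, here is an option.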

#find parent nodes
parents<-xml_find_all(doc, ".//Object")

dfs<-lapply(parents, function(parent) {
  #Get object name
  object<-xml_find_first(parent, ".//XML_Name") %>% xml_text()

  #find the number of children under each child
  numchild<-xml_children(parent) %>% xml_length()

  #if number of children is zero get name and value
  name  <- xml_children(parent)[numchild==0] %>% xml_name()
  value <- xml_children(parent)[numchild==0] %>% xml_text()

   #if the number of children is 1 or more then get the name and value of each child
   namec2  <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_name()
   valuec2 <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_text()

  #make data frame of the values and column headings
  data.frame(object, name=c(name, namec2), value=c(value, valuec2), stringsAsFactors = FALSE)
})

#Make combined dataframe (long format: one row per object/feature value)
answer<-bind_rows(dfs)
answer
library(tidyr)
#one row per object; repeated features are joined with toString (", ")
pivot_wider(answer, id_cols = object, names_from = name, values_from = value, values_fn = list(value = toString))

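The final result still needs some cleanup: the columns have to be tidied, e.g. with gsub(", ", " # ", ...), and the URL attribute has to be retrieved as shown above.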
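That cleanup might look roughly like the following. It is a sketch under assumptions: wide stands in for the pivot_wider() result and is built by hand so the snippet runs on its own, the column names are illustrative, and the rows of wide are assumed to be in the same order as the parent nodes.

```r
library(xml2)

# Stand-in for the pivot_wider() result (assumption: values joined by toString, i.e. ", ")
wide <- data.frame(
  object        = c("Object 1", "Object 2"),
  XML_Feature2a = c("1, 1x, 1y", NA),
  XML_Feature4  = c("F4_1, F4_2", NA),
  stringsAsFactors = FALSE
)

# Swap the toString separator for the " # " used in the desired output
value_cols <- setdiff(names(wide), "object")
wide[value_cols] <- lapply(wide[value_cols], function(col) gsub(", ", " # ", col, fixed = TRUE))

# Retrieve the href attribute per object, NA where XML_Feature1a is absent
doc <- read_xml('<featureMember>
  <Object><XML_Name>Object 1</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL1"/></XML_Feature1></Object>
  <Object><XML_Name>Object 2</XML_Name></Object>
</featureMember>')
parents <- xml_find_all(doc, ".//Object")
wide$f1a <- xml_attr(xml_find_first(parents, ".//XML_Feature1a"), "href")
wide
```

Note that gsub() would also rewrite any ", " that occurs inside an original value. If that can happen in the real data, collapsing with " # " directly in values_fn is the safer route.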

Edited on 2021-01-23
