How do I extract values from complex XML in R without dropping nodes that have missing values? My loop is slow

Therob

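I have a large, complex XML file and need to extract the values and attributes of certain child (sub...) nodes. Because not all child nodes contain all of the wanted values (some are missing), I cannot simply use the very fast xml_find_all (package xml2): it does not return anything for the child nodes where a value is missing.

My solution is to run a for loop over all XML nodes (objects) and check inside each node whether the wanted value exists; if it does, I extract it. Thanks to the loop index I know which object the value belongs to, and I write it to the corresponding data.frame$Feature[i].

This approach works, but for my large XML file it takes a very long time (20 minutes) and uses a lot of memory (about 1.5 GB, because of the if checks inside the loop). My XML: 100 MB, about 30,000 "entries/objects", each with about 50 features (~2 million lines).

The main problem I found is that xpathSApply(...xml_path(Obj[i]...) becomes very slow when the loop index [i] is high (> 5000).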

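My questions are:

  • Do you have a better/simpler idea for solving my problem with a very complex, hierarchically structured XML in which not all features are present in all objects (nodes)?
  • I read this interesting approach, but could not figure out how to transfer it to a very complex XML in which the wanted values sit at different node set levels...
  • Is there perhaps some nested xpathSApply expression that gets around the for loop and avoids the index?
  • Do you know of any "vectorized" way of handling my problem (which would be faster in R)?

Please see my MWE code below, with further comments.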

XML

<?xml version="1.0" encoding="UTF-8"?>
<featureMember>
        <Object>
                <XML_Name>Object 1</XML_Name>
                <XML_Feature1>
                   <XML_Feature1a href="URL1"></XML_Feature1a>
                </XML_Feature1>
                <XML_Feature2>
                   <XML_Feature2a>1</XML_Feature2a>
                   <XML_Feature2a>1x</XML_Feature2a>
                   <XML_Feature2a>1y</XML_Feature2a>
                </XML_Feature2>
                <XML_Feature3>
                   <XML_Feature3a>F3a_1</XML_Feature3a>
                   <XML_Feature3b>F3b_1</XML_Feature3b>
                </XML_Feature3>
                <XML_Feature3>
                   <XML_Feature3a>F3a_2</XML_Feature3a>
                   <XML_Feature3b>F3b_2</XML_Feature3b>
                </XML_Feature3>
                <XML_Feature4>F4_1</XML_Feature4>
                <XML_Feature4>F4_2</XML_Feature4>
        </Object>
        <Object>
                <XML_Name>Object 2</XML_Name>
                <XML_Feature1>
                   <XML_Feature1a href="URL2"></XML_Feature1a>
                </XML_Feature1>
        </Object>
        <Object>
                <XML_Name>Object 3</XML_Name>
                <XML_Feature1>
                   <XML_Feature1>
                   </XML_Feature1>
                </XML_Feature1>
                <XML_Feature2>
                   <XML_Feature2a>Value 3</XML_Feature2a>
                </XML_Feature2>
        </Object>
</featureMember>

R

require(xml2)
require(XML)
test_xml2 <- read_xml("above_file.xml") # using package xml2 (for xml_find_all)
test_XML <- xmlParse("above_file.xml") # using package XML (for xpathSApply)

  # XML node set of all Objects I want to process:
Obj <- xml_find_all(test_xml2, "//Object") # --> has 3 nodes, contains all Objects!

  # initialize a destination data frame and fill it with NAs
df <- data.frame('Name'=character(), 'f2a'=character() , 'f1a'=character(), stringsAsFactors = FALSE)
df[seq_along(Obj),] <- NA

# My initial approach, extracting every feature with xml_find_all (which is very fast), does not work because not all XML nodes have all wanted XML features:
Name <- xml_text(xml_find_all(test_xml2, "//XML_Name")) 
  # --> length(Name)=3, because all 3 Objects have a name!
f1a  <- xml_attr(xml_find_all(test_xml2, "//XML_Feature1/XML_Feature1a"),"href") 
  # --> length(f1a)=2, because XML_Feature1a is missing in Object 3! 
f2a  <- xml_text(xml_find_all(test_xml2, "//XML_Feature2/XML_Feature2a")) 
  # --> length(f2a)=4, because Object 1 has three XML_Feature2a values and Object 2 has none!
# Joining these into a final df is not possible: "Name", "f2a" and "f1a" have different lengths, and there is no way to tell which value belongs to which Object!


# Therefore I decided on the following approach instead.
  # 1.) Extract all features that are present in every node with xml_find_all, because it is fast (here: "Name"):
df$Name <- xml_text(xml_find_all(test_xml2, "//XML_Name"))

  # 2.) Loop over all Objects/XML nodes of interest and check whether each wanted feature exists.
    # if yes: write it to df$FeatureXY[i] (multiple values are joined with " # ")
    # if not: do nothing (df$FeatureXY[i] stays NA from the initialization)
for (i in seq_along(Obj))
{  # 1. Feature:
 tmp <- xpathSApply(test_XML, paste0(xml_path(Obj[i]), "/XML_Feature1/XML_Feature1a"), xmlGetAttr, "href")
 if (length(tmp) > 0) { df$f1a[i] <- paste(tmp, collapse = " # ") } # without the check, the assignment would raise an error
    # 2. Feature:
 tmp <- xpathSApply(test_XML, paste0(xml_path(Obj[i]), "/XML_Feature2/XML_Feature2a"), xmlValue)
 if (length(tmp) > 0) { df$f2a[i] <- paste(tmp, collapse = " # ") }
}

# Result of df as it should be:
# Name      f2a             f1a   f3a            f3b             f4
# Object 1  1 # 1x # 1y     URL1  F3a_1 # F3a_2  F3b_1 # F3b_2   F4_1 # F4_2
# Object 2  NA              URL2  NA             NA              NA 
# Object 3  Value 3         NA    NA             NA              NA

Edit 1: extended the XML example (multiple elements for Feature2a, Feature3a/b and Feature4)

Dave2e

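Problems like this can be quite tricky, because the solution has to cope with any potential differences between the sample data and the real data. If we assume that each "Object" has at most one "Feature1a" node and at most one "Feature2a" node, then this becomes a straightforward problem.

First find all of the parent "Object" nodes, then use this vector of nodes to parse out the name, the Feature1a attribute and the Feature2a text. xml_find_first returns a value if the node exists and NA if it does not. Because xml_find_first is vectorized, it can be run on the vector of parent nodes without a loop, which should improve performance significantly.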
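To make the difference between the two functions concrete, here is a minimal sketch on a tiny inline document. The XML string is made-up data for illustration (it is not the question's file), and the variable names `flat` and `aligned` are only placeholders.

```r
library(xml2)

# Made-up two-object document: the second <Object> has no <XML_Feature2a>.
doc <- read_xml("<root>
  <Object><XML_Name>A</XML_Name><XML_Feature2><XML_Feature2a>1</XML_Feature2a></XML_Feature2></Object>
  <Object><XML_Name>B</XML_Name></Object>
</root>")

parents <- xml_find_all(doc, ".//Object")

# xml_find_all() returns only the nodes that exist, so the result is shorter
# than `parents` and can no longer be matched back to the objects.
flat <- xml_text(xml_find_all(parents, ".//XML_Feature2a"))

# xml_find_first() returns exactly one result per parent; a parent without
# a match yields a missing node, which xml_text() turns into NA.
aligned <- xml_text(xml_find_first(parents, ".//XML_Feature2a"))

length(flat)   # 1
aligned        # "1" NA
```

Because `aligned` always has the same length as `parents`, it can be placed directly into a data frame column next to the names.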

library(xml2)
library(dplyr)

#Read file to process
doc<- read_xml("above_file.xml")

#find parent nodes
parents <- xml_find_all(doc, ".//Object")

#Now extract the requested data from each parent
# Notice the use of the . in the xpath. 
# //  finds anywhere in the document (ignoring the current node)
# .// finds anywhere beneath the current node
Names<- xml_find_first(parents, ".//XML_Name") %>% xml_text()
feature1 <- xml_find_first(parents, ".//XML_Feature1a") %>% xml_attr("href")

# Fill feature2 with the first matching element as the default
feature2 <- xml_find_first(parents, ".//XML_Feature2a") %>% xml_text()
# Find the parents that have more than one XML_Feature2a (counted per parent, so the indices line up with `parents`)
moretwos <- which(vapply(parents, function(node) length(xml_find_all(node, ".//XML_Feature2a")), integer(1)) > 1)
# Re-parse those parents and collapse all of their values into one string
feature2[moretwos] <- vapply(parents[moretwos], function(node) {
        xml_find_all(node, ".//XML_Feature2a") %>% xml_text() %>% paste(collapse = " # ")
}, character(1))


# Make the combined data frame
answer <- data.frame(Names, feature1, feature2)
answer
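The same idea can be extended to the other multi-valued features in the question (Feature3a/b, Feature4). The following is only a sketch: `collapse_per_parent()` is a hypothetical helper written for this example, not an xml2 function, and the inline XML is a shortened, made-up version of the sample file. It still iterates over the parents in R, so it will not be as fast as a pure xml_find_first call, but it avoids building absolute paths with xml_path for every object.

```r
library(xml2)

# Hypothetical helper (not part of xml2): for each parent node, collect every
# match of `xpath`, take its text (or the attribute `attr` if given) and join
# the values with `sep`. Parents without any match yield NA.
collapse_per_parent <- function(parents, xpath, attr = NULL, sep = " # ") {
  vapply(parents, function(p) {
    hits <- xml_find_all(p, xpath)
    if (length(hits) == 0) return(NA_character_)
    vals <- if (is.null(attr)) xml_text(hits) else xml_attr(hits, attr)
    paste(vals, collapse = sep)
  }, character(1))
}

# Shortened, made-up version of the sample document
doc <- read_xml('<featureMember>
  <Object>
    <XML_Name>Object 1</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL1"/></XML_Feature1>
    <XML_Feature2>
      <XML_Feature2a>1</XML_Feature2a>
      <XML_Feature2a>1x</XML_Feature2a>
      <XML_Feature2a>1y</XML_Feature2a>
    </XML_Feature2>
    <XML_Feature4>F4_1</XML_Feature4>
    <XML_Feature4>F4_2</XML_Feature4>
  </Object>
  <Object>
    <XML_Name>Object 2</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL2"/></XML_Feature1>
  </Object>
</featureMember>')

parents <- xml_find_all(doc, ".//Object")

# One column per feature; every column has one entry per parent
result <- data.frame(
  Name = xml_text(xml_find_first(parents, "./XML_Name")),
  f1a  = collapse_per_parent(parents, "./XML_Feature1/XML_Feature1a", attr = "href"),
  f2a  = collapse_per_parent(parents, "./XML_Feature2/XML_Feature2a"),
  f4   = collapse_per_parent(parents, "./XML_Feature4"),
  stringsAsFactors = FALSE
)
result
```

With this input, `result$f2a` is "1 # 1x # 1y" for Object 1 and NA for Object 2, which matches the layout of the desired result in the question.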

Here is a similar question, but with an unknown number of child nodes: Create data frame from xml with different number of elements

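Update for the revised question, in which an object can have several child nodes that in turn have several children. Nodes nested deeper than that are not handled separately here.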

#find parent nodes
parents<-xml_find_all(doc, ".//Object")

dfs <- lapply(parents, function(parent) {
  # Get the object name
  object <- xml_find_first(parent, ".//XML_Name") %>% xml_text()

  # Find the number of children under each child
  numchild <- xml_children(parent) %>% xml_length()

  # If the number of children is zero, get the child's own name and value
  name  <- xml_children(parent)[numchild==0] %>% xml_name()
  value <- xml_children(parent)[numchild==0] %>% xml_text()

  # If the number of children is 1 or more, get the name and value of each of those children
  namec2  <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_name()
  valuec2 <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_text()

  # Make a data frame of the values and column headings
  data.frame(object, name=c(name, namec2), value=c(value, valuec2), stringsAsFactors = FALSE)
})

# Make the combined data frame
answer <- bind_rows(dfs)
answer

# Reshape to one row per object; repeated values are joined with ", " by toString()
library(tidyr)
pivot_wider(answer, id_cols = object, names_from = name, values_from = value, values_fn = list(value = toString))

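The final answer still needs some cleanup of the columns, for example with gsub(", ", " # ", ...), and the URL attribute has to be retrieved as shown above.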
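The cleanup step could look roughly like the sketch below. It is an illustration under assumptions: `wide` is a hand-built stand-in for the pivot_wider() result, the inline XML is made-up data, and the object names are assumed to be unique so they can serve as the merge key.

```r
library(xml2)

# Made-up two-object document, used only to look up the href attributes
doc <- read_xml('<featureMember>
  <Object>
    <XML_Name>Object 1</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL1"/></XML_Feature1>
  </Object>
  <Object>
    <XML_Name>Object 2</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL2"/></XML_Feature1>
  </Object>
</featureMember>')
parents <- xml_find_all(doc, ".//Object")

# Stand-in for the reshaped table: toString() joined repeated values with ", "
wide <- data.frame(
  object        = c("Object 1", "Object 2"),
  XML_Feature2a = c("1, 1x, 1y", NA),
  stringsAsFactors = FALSE
)

# Replace the ", " separator with " # " in every column except the key.
# gsub() leaves NA entries as NA.
value_cols <- setdiff(names(wide), "object")
wide[value_cols] <- lapply(wide[value_cols], function(col) {
  gsub(", ", " # ", col, fixed = TRUE)
})

# Retrieve the href attribute once per parent and attach it by object name
urls <- data.frame(
  object = xml_text(xml_find_first(parents, "./XML_Name")),
  f1a    = xml_attr(xml_find_first(parents, ".//XML_Feature1a"), "href"),
  stringsAsFactors = FALSE
)
final <- merge(wide, urls, by = "object", all.x = TRUE)
final
```

One caveat with this approach: gsub() also rewrites any ", " that is part of an original value, so if the data can contain commas it may be safer to pass a function such as `function(x) paste(x, collapse = " # ")` as values_fn in pivot_wider instead.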
