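I have a large, complex XML file and need to extract the values and attributes of certain child (sub...) nodes. Because not all child nodes have all the wanted values (some are missing), I cannot simply use the very fast xml_find_all (package xml2), since it of course returns nothing for child nodes where the value is missing.
My solution is to run a for loop over all XML nodes (objects) and check inside each node whether the value I need exists; if it does, I extract it. Thanks to the loop index I know which object it belongs to, and I write it into the corresponding data.frame$Feature[i].
This approach works, but for my large XML file it takes a very long time (20 minutes) and is very memory-hungry (about 1.5 GB, due to the if checks inside the loop). My XML: 100 MB, about 30,000 "entries/objects", each with about 50 features (~2 million lines).
The main problem I found is xpathSApply(... xml_path(Obj[i]) ...): it becomes very slow when the loop index [i] is high (> 5000).
My questions: please see the additional comments in my MWE code below.
XML format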
<?xml version="1.0" encoding="UTF-8"?>
<featureMember>
<Object>
<XML_Name>Object 1</XML_Name>
<XML_Feature1>
<XML_Feature1a href="URL1"></XML_Feature1a>
</XML_Feature1>
<XML_Feature2>
<XML_Feature2a>1</XML_Feature2a>
<XML_Feature2a>1x</XML_Feature2a>
<XML_Feature2a>1y</XML_Feature2a>
</XML_Feature2>
<XML_Feature3>
<XML_Feature3a>F3a_1</XML_Feature3a>
<XML_Feature3b>F3b_1</XML_Feature3b>
</XML_Feature3>
<XML_Feature3>
<XML_Feature3a>F3a_2</XML_Feature3a>
<XML_Feature3b>F3b_2</XML_Feature3b>
</XML_Feature3>
<XML_Feature4>F4_1</XML_Feature4>
<XML_Feature4>F4_2</XML_Feature4>
</Object>
<Object >
<XML_Name>Object 2</XML_Name>
<XML_Feature1>
<XML_Feature1a href="URL2"></XML_Feature1a>
</XML_Feature1>
</Object>
<Object >
<XML_Name>Object 3</XML_Name>
<XML_Feature1>
<XML_Feature1>
</XML_Feature1>
</XML_Feature1>
<XML_Feature2>
<XML_Feature2a>Value 3</XML_Feature2a>
</XML_Feature2>
</Object>
</featureMember>
R code
require(xml2)
require(XML)
test_xml2 <- read_xml("above_file.xml") # using package xml2 (for xml_find_all)
test_XML <- xmlParse("above_file.xml") # using package XML (for xpathSApply)
# XML nodeset of all Objects I want to process:
Obj <- xml_find_all(test_xml2, "//Object") # --> has 3 nodes, contains all Objects!
# initialize a destination data frame and fill it with NAs
df <- data.frame('Name'=character(), 'f2a'=character(), 'f1a'=character(), stringsAsFactors = FALSE)
df[1:length(Obj),] <- NA
# My Initial approach to extract all features by xml_find_all (which is very fast) is not working because not all xml-nodes have all wanted xml-features:
Name <- xml_text(xml_find_all(test_xml2, "//XML_Name"))
# --> length(Name)=3, because all 3 Objects have a name!
f1a <- xml_attr(xml_find_all(test_xml2, "//XML_Feature1/XML_Feature1a"),"href")
# --> length(f1a)=2, because XML_Feature1a is missing in Object3!
f2a <- xml_text(xml_find_all(test_xml2, "//XML_Feature2/XML_Feature2a"))
# --> length(f2a) differs from the number of Objects, because XML_Feature2a is missing in Object2 (and repeated in Object1 in the extended sample)!
# Joining these to a final df is not possible, because "Name", "f2a" and "f1a" have of course different lengths, plus correct data matching is not possible!
# Therefore I decided to make instead the following approach.
# 1.) Grab all features that are present in every node, because it's fast (here: "Name"):
df$Name <- xml_text(xml_find_all(test_xml2, "//XML_Name"))
# 2.) Loop over all Objects/XML nodes of interest and check whether each wanted feature exists.
# if yes: write to df$FeatureXY[i]
# if not: do nothing (thus df$FeatureXY[i] stays NA from initialization)
for (i in seq_along(Obj))
{ # 1. Feature:
  tmp <- xpathSApply(test_XML, paste0(xml_path(Obj[i]),"/XML_Feature1/XML_Feature1a"), xmlGetAttr, "href")
  # assigning an empty result would produce an error message, hence the length check
  if (length(tmp) > 0) { df$f1a[i] <- paste(tmp, collapse = " # ") }
  # 2. Feature:
  tmp <- xpathSApply(test_XML, paste0(xml_path(Obj[i]),"/XML_Feature2/XML_Feature2a"), xmlValue)
  # repeated values are joined with " # ", as in the expected result below
  if (length(tmp) > 0) { df$f2a[i] <- paste(tmp, collapse = " # ") }
}
# Result of df as it should be:
# Name      f2a          f1a   f3a            f3b            f4
# Object 1  1 # 1x # 1y  URL1  F3a_1 # F3a_2  F3b_1 # F3b_2  F4_1 # F4_2
# Object 2  NA           URL2  NA             NA             NA
# Object 3  Value 3      NA    NA             NA             NA
Edit 1: extended the XML example (multiple elements for Feature2a, Feature3a/b and Feature4).
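Problems like this can be quite tricky because of potential differences between the sample data and the actual data. If we assume that each "Object" has at most one "Feature1a" node and at most one "Feature2a" node, this becomes a straightforward problem.
First find all the parent "Object" nodes, then use this vector of nodes to parse out the name, the Feature1a attribute and the Feature2a text with xml_find_first. For each parent, xml_find_first returns the first matching node, or a missing node that xml_text() and xml_attr() turn into NA. Because xml_find_first is vectorized, it can run over the vector of parent nodes without a loop, which should improve performance considerably.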
library(xml2)
library(dplyr)
#Read file to process
doc<- read_xml("above_file.xml")
#find parent nodes
parents <- xml_find_all(doc, ".//Object")
#Now extract the requested data from each parent
# Notice the use of the . in the xpath.
# // finds anywhere in the document (ignoring the current node)
# .// finds anywhere beneath the current node
Names<- xml_find_first(parents, ".//XML_Name") %>% xml_text()
feature1 <- xml_find_first(parents, ".//XML_Feature1a") %>% xml_attr("href")
#fill features with first elements as default
feature2 <- xml_find_first(parents, ".//XML_Feature2a") %>% xml_text()
#find parents with more than 1 feature2a
#(counted per parent, so the indices stay aligned with `parents` even when a parent has no XML_Feature2)
moretwos <- which(xml_find_num(parents, "count(.//XML_Feature2a)") > 1)
#reparse the parent nodes with more than one child
feature2[moretwos] <- sapply(parents[moretwos], function(node){
  xml_find_all(node, ".//XML_Feature2a") %>% xml_text() %>% paste(collapse = "#")
})
#Make combined dataframe
answer <-data.frame(Names, feature1, feature2)
answer
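To make the difference between the two lookups easier to see without the original file, here is a minimal, self-contained sketch. The inline XML string and the names `xml_txt`, `doc_demo`, `parents_demo`, `flat`, `aligned` and `hrefs` are made up for illustration and are not part of the question's data.

```r
library(xml2)

# Hypothetical inline sample: Object 2 lacks XML_Feature2a,
# Object 3 lacks XML_Feature1a.
xml_txt <- '<featureMember>
  <Object><XML_Name>Object 1</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL1"/></XML_Feature1>
    <XML_Feature2><XML_Feature2a>1</XML_Feature2a></XML_Feature2>
  </Object>
  <Object><XML_Name>Object 2</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL2"/></XML_Feature1>
  </Object>
  <Object><XML_Name>Object 3</XML_Name>
    <XML_Feature2><XML_Feature2a>Value 3</XML_Feature2a></XML_Feature2>
  </Object>
</featureMember>'

doc_demo <- read_xml(xml_txt)
parents_demo <- xml_find_all(doc_demo, ".//Object")

# xml_find_all() returns only the nodes that exist, so the gaps disappear
flat <- xml_text(xml_find_all(parents_demo, ".//XML_Feature2a"))

# xml_find_first() keeps one slot per parent; missing nodes become NA
aligned <- xml_text(xml_find_first(parents_demo, ".//XML_Feature2a"))
hrefs <- xml_attr(xml_find_first(parents_demo, ".//XML_Feature1a"), "href")

data.frame(
  Name = xml_text(xml_find_first(parents_demo, ".//XML_Name")),
  f2a = aligned,
  f1a = hrefs,
  stringsAsFactors = FALSE
)
```

With this sample, `flat` has only two elements for three parents, while `aligned` and `hrefs` each have three, with NA in the position of the object that lacks the node. That alignment is what allows the columns to be combined into one data frame without a loop.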
Here is a similar question, but with an unknown number of child nodes: "Create a dataframe from xml with different number of elements".
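Update: for the revised question, where there are multiple child nodes that themselves have multiple children, but with no option for further grandchild levels.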
#find parent nodes
parents<-xml_find_all(doc, ".//Object")
dfs<-lapply(parents, function(parent) {
  #Get object name
  object<-xml_find_first(parent, ".//XML_Name") %>% xml_text()
  #find the number of children under each child
  numchild<-xml_children(parent) %>% xml_length()
  #if the number of children is zero, get the name and value
  name <- xml_children(parent)[numchild==0] %>% xml_name()
  value <- xml_children(parent)[numchild==0] %>% xml_text()
  #if the number of children is 1 or more, get the name and value of each child
  namec2 <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_name()
  valuec2 <- xml_children(parent)[numchild>=1] %>% xml_children() %>% xml_text()
  #make data frame of the values and column headings
  df<-data.frame(object, name=c(name, namec2), value=c(value, valuec2), stringsAsFactors = FALSE)
  print(df)
  df
})
#Make combined dataframe
answer<-bind_rows(dfs)
answer
library(tidyr)
pivot_wider(answer, id_cols = object, names_from = name, values_from = value, values_fn = list(value = toString))
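The final result will still need some column cleanup, e.g. gsub(", ", " # ", ...), and the URL attribute has to be retrieved as in the first approach above.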
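One possible way to do that cleanup is sketched below. It is only an illustration under stated assumptions: `wide` is a hand-written stand-in for the pivot_wider() result, and `doc_demo`, `parents_demo`, `urls` and `cleaned` are hypothetical names that do not come from the original answer.

```r
library(xml2)

# Hypothetical stand-in for the wide table produced by pivot_wider()
wide <- data.frame(
  object = c("Object 1", "Object 2"),
  XML_Feature2a = c("1, 1x, 1y", NA),
  XML_Feature4 = c("F4_1, F4_2", NA),
  stringsAsFactors = FALSE
)

# Replace the ", " separator written by toString() with " # "
# in every column except the key column
feature_cols <- setdiff(names(wide), "object")
wide[feature_cols] <- lapply(
  wide[feature_cols],
  function(col) gsub(", ", " # ", col, fixed = TRUE)
)

# Hypothetical inline document, used only to show the href lookup
doc_demo <- read_xml('<featureMember>
  <Object><XML_Name>Object 1</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL1"/></XML_Feature1></Object>
  <Object><XML_Name>Object 2</XML_Name>
    <XML_Feature1><XML_Feature1a href="URL2"/></XML_Feature1></Object>
</featureMember>')
parents_demo <- xml_find_all(doc_demo, ".//Object")

# One href per parent (NA where the node is missing), keyed by object name
urls <- data.frame(
  object = xml_text(xml_find_first(parents_demo, ".//XML_Name")),
  href = xml_attr(xml_find_first(parents_demo, ".//XML_Feature1a"), "href"),
  stringsAsFactors = FALSE
)

# Left join so every row of the wide table is kept
cleaned <- merge(wide, urls, by = "object", all.x = TRUE)
cleaned
```

Note that the gsub() step would also rewrite any ", " that is part of a genuine value, so if the real data can contain commas it may be safer to pass a custom collapsing function as values_fn instead of toString.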