How do I convert an XML file into a dataframe/tibble in R?

Mike Lee

How do I convert an XML file that looks like this:

<bible>
  <b n="Psalm">
    <c n="1">
      <v n="1"> text text text text </v>
      <v n="2"> text text text text </v>
      <v n="3"> text text text text </v>
    </c>
    <c n="2">
      <v n="1"> text text text text </v>
      <v n="2"> text text text text </v>
      <v n="3"> text text text text </v>
    </c>
  </b>
  <b n="Revelation">
    <c n="1">
      <v n="1"> text text text text </v>
      <v n="2"> text text text text </v>
      <v n="3"> text text text text </v>
    </c>
    <c n="2">
      <v n="1"> text text text text </v>
      <v n="2"> text text text text </v>
      <v n="3"> text text text text </v>
    </c>
    <c n="3">
      <v n="1"> text text text text </v>
      <v n="2"> text text text text </v>
      <v n="3"> text text text text </v>
    </c>
  </b>
</bible>

Into a dataframe/tibble format that looks like this:

# A tibble: 15 x 4
   book       chapter verse text
   <chr>        <dbl> <int> <chr>
 1 Psalm            1     1 text text text text
 2 Psalm            1     2 text text text text
 3 Psalm            1     3 text text text text
 4 Psalm            2     1 text text text text
 5 Psalm            2     2 text text text text
 6 Psalm            2     3 text text text text
 7 Revelation       1     1 text text text text
 8 Revelation       1     2 text text text text
 9 Revelation       1     3 text text text text
10 Revelation       2     1 text text text text
11 Revelation       2     2 text text text text
12 Revelation       2     3 text text text text
13 Revelation       3     1 text text text text
14 Revelation       3     2 text text text text
15 Revelation       3     3 text text text text

I've tried using xmlToDataFrame(nodes = getNodeSet(doc, "/bible")) from the XML package, but I just get one observation with multiple columns. When I change the node level passed to getNodeSet, I get a "duplicate subscripts for columns" error. Thanks.

Parfait

Consider XSLT, the special-purpose language designed to transform XML files and a sibling of XPath. Specifically, you need to flatten all the data down to a single level, here the verse, by migrating the ancestor attributes (book and chapter) into sibling nodes next to the verse number and text. The ancestor values are repeated for every verse, so each verse becomes one row of the data frame.

Once the file is transformed, you can use the convenience function XML::xmlToDataFrame, which is suited to flatter XML. R can run XSLT 1.0 with the xslt package (an extension to xml2).

XSLT (save as .xsl, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/bible">
        <xsl:copy>
            <xsl:apply-templates select="descendant::v"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="v">
        <data>
            <book><xsl:value-of select="ancestor::b/@n"/></book>
            <chapter><xsl:value-of select="ancestor::c/@n"/></chapter>
            <verse><xsl:value-of select="@n"/></verse>
            <text><xsl:value-of select="text()"/></text>
        </data>
    </xsl:template>

</xsl:stylesheet>

R (no loops or mapping needed)

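library(xml2)   # read_xml()
library(xslt)   # xml_xslt()
library(XML)    # xmlParse(), getNodeSet(), xmlToDataFrame()

# Read the source document and the stylesheet
doc <- read_xml("Import.xml")
style <- read_xml("Script.xsl")

# Apply the stylesheet: one <data> node per verse
new_xml <- xml_xslt(doc, style)

# Pass the result to the XML package as text, then build the data frame
new_doc <- XML::xmlParse(as.character(new_xml), asText = TRUE)
bible_df <- XML::xmlToDataFrame(nodes = getNodeSet(new_doc, "//data"),
                                stringsAsFactors = FALSE)

# Every column comes back as character; convert and trim as needed
bible_df$chapter <- as.integer(bible_df$chapter)
bible_df$verse <- as.integer(bible_df$verse)
bible_df$text <- trimws(bible_df$text)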
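
For comparison, here is a minimal sketch of the same flattening idea done directly in R with xml2 and XPath ancestor axes, skipping the intermediate XSLT step. It is an alternative to the approach above, not part of it. The inline sample document and the variable names are made up for illustration; a real file would be read with read_xml("Import.xml") instead.

```r
library(xml2)

# Small inline sample with the same shape as the question's file (hypothetical content).
xml_txt <- '<bible>
  <b n="Psalm">
    <c n="1"><v n="1"> first </v><v n="2"> second </v></c>
    <c n="2"><v n="1"> third </v></c>
  </b>
  <b n="Revelation">
    <c n="1"><v n="1"> fourth </v></c>
  </b>
</bible>'
doc <- read_xml(xml_txt)

# One node per verse; each verse becomes one row.
verses <- xml_find_all(doc, "//v")

# For every verse, look up its enclosing chapter and book via the ancestor axis
# and read their "n" attributes, so the values repeat per verse.
bible_df <- data.frame(
  book    = xml_attr(xml_find_first(verses, "ancestor::b"), "n"),
  chapter = as.integer(xml_attr(xml_find_first(verses, "ancestor::c"), "n")),
  verse   = as.integer(xml_attr(verses, "n")),
  text    = xml_text(verses, trim = TRUE),
  stringsAsFactors = FALSE
)

print(bible_df)
```

The XPath expressions ancestor::b and ancestor::c are the same ones the stylesheet uses, so the two routes should produce the same columns. The XSLT route keeps the reshaping logic outside of R, which may be preferable if the same transformation is reused elsewhere.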
