R XML: How to Work With XML Files in R


The R programming language can read all sorts of data, and XML is no exception. There are many ways to read, parse, and manipulate these markup language files in R, and today we’ll explore two. By the end of the article, you’ll know how to use two R packages to work with XML.

We’ll kick things off with an R XML introduction – you’ll get a sense of what XML is, and we’ll also write an XML dataset from scratch. Then, you’ll learn how to access individual elements, convert XML files to an R tibble and a data.frame, and much more.

Introduction to R XML

First, let’s answer one important question: What is XML? The acronym stands for Extensible Markup Language. It’s similar to HTML since they’re both markup languages, but XML is designed for storing and transmitting data rather than displaying it. XML files typically use the .xml file extension, although other XML-based formats (SVG, for example) use their own.

When you first start working with XML files, you’ll immediately appreciate the structure. It’s human-readable, and there aren’t a gazillion brackets as with JSON. Unlike HTML, XML has no predefined tags. You can name your tags however you want, but it’s best to name them around the business logic.

Most XML documents start with the XML prolog, which is optional but recommended:

<?xml version="1.0" encoding="UTF-8"?>

Each XML file must also have exactly one root element, which can have one or many child nodes. Child nodes may in turn have children of their own.

Let’s see this in action! The following code snippet declares an XML dataset containing employees. There’s one root element, <records>, and each <employee> child has children of its own, such as <last_name>:

<?xml version="1.0" encoding="UTF-8"?>
<records>
    <employee>
        <id>1</id>
        <first_name>John</first_name>
        <last_name>Smith</last_name>
        <position>CEO</position>
        <salary>10000</salary>
        <hire_date>2022-1-1</hire_date>
        <department>Management</department>
    </employee>
    <employee>
        <id>2</id>
        <first_name>Jane</first_name>
        <last_name>Sense</last_name>
        <position>Marketing Associate</position>
        <salary>3500</salary>
        <hire_date>2022-1-15</hire_date>
        <department>Marketing</department>
    </employee>
    <employee>
        <id>3</id>
        <first_name>Frank</first_name>
        <last_name>Brown</last_name>
        <position>R Developer</position>
        <salary>6000</salary>
        <hire_date>2022-1-15</hire_date>
        <department>IT</department>
    </employee>
    <employee>
        <id>4</id>
        <first_name>Judith</first_name>
        <last_name>Rollers</last_name>
        <position>Data Scientist</position>
        <salary>6500</salary>
        <hire_date>2022-3-1</hire_date>
        <department>IT</department>
    </employee>
    <employee>
        <id>5</id>
        <first_name>Karen</first_name>
        <last_name>Switch</last_name>
        <position>Accountant</position>
        <salary>4000</salary>
        <hire_date>2022-1-10</hire_date>
        <department>Accounting</department>
    </employee>
</records>

Copy this file and save it locally. We’ve named it data.xml. You’ll need it in the following section when we work with XML in R.
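
If you would rather create the file from R than copy it by hand, the xml2 package can build and write a document as well. The snippet below is a minimal sketch under a few assumptions: it includes only the first two employees from the dataset above, the variable names (employees, out_file) are ours, and it writes to a temporary directory so that nothing in your project gets overwritten.

```r
library(xml2)

# Build the document in memory: one <records> root with <employee> children
doc <- xml_new_root("records")

# Two sample rows taken from the dataset above; add the rest the same way
employees <- list(
  list(id = "1", first_name = "John", last_name = "Smith", position = "CEO",
       salary = "10000", hire_date = "2022-1-1", department = "Management"),
  list(id = "2", first_name = "Jane", last_name = "Sense",
       position = "Marketing Associate", salary = "3500",
       hire_date = "2022-1-15", department = "Marketing")
)

for (emp in employees) {
  node <- xml_add_child(doc, "employee")
  for (field in names(emp)) {
    # An unnamed argument after the tag name becomes the node's text
    xml_add_child(node, field, emp[[field]])
  }
}

# Assumption: a temporary location; point this at "data.xml" if you prefer
out_file <- file.path(tempdir(), "data.xml")
write_xml(doc, out_file)
```

Reading out_file back with read_xml() should give you the same structure as the hand-written file, just with fewer rows.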

But before we can do that, you’ll have to install two R packages:

install.packages("xml2")
install.packages("XML")

Both are used to work with XML, and you can pretty much get by using only the first. The second one has a couple of convenient functions for converting XML files, which we’ll cover later.

First things first, let’s see how you can read and parse XML files in R.

R XML Basics – How to Read and Parse XML Files

By now you should have the dataset downloaded and R packages installed. Create a new R script and use the following code to load in the packages and read the XML file:

library(xml2)
library(XML)

employee_data <- read_xml("data.xml")
employee_data

Here’s what it looks like:

Image 1 – Contents of an XML document loaded into R
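
Before parsing anything, it can help to check what was actually loaded. The sketch below uses a few xml2 helpers to look at the root element and its children. To keep it runnable on its own, it reads a shortened inline XML string (our stand-in for data.xml) rather than the file, and the variable names are only illustrative.

```r
library(xml2)

# Inline stand-in for data.xml so the example runs on its own
doc <- read_xml("<records>
  <employee><id>1</id><position>CEO</position></employee>
  <employee><id>2</id><position>Marketing Associate</position></employee>
</records>")

root_name <- xml_name(doc)                    # name of the root element
n_employees <- xml_length(doc)                # number of direct child elements
child_names <- xml_name(xml_children(doc))    # tag name of each child

# Tag names inside the first <employee>
first_fields <- xml_name(xml_children(xml_child(doc, 1)))

print(root_name)
print(n_employees)
print(child_names)
print(first_fields)
```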

The data is all there, but it isn’t in a convenient format yet. You can make it usable by parsing the entire document or reading individual elements.

Let’s explore the parsing option first. Call the xmlParse() function and pass in employee_data:

employee_xml <- xmlParse(employee_data)
employee_xml

The contents now look like our source file:

Image 2 – Parsed XML document
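
If you only need the XML package, xmlParse() can also read from a file path or from a string directly, without going through xml2 first. Here is a small sketch that assumes a shortened inline XML string instead of data.xml; the variable names are ours.

```r
library(XML)

# Shortened inline stand-in for data.xml
xml_string <- "<records>
  <employee><id>1</id><position>CEO</position></employee>
  <employee><id>2</id><position>Marketing Associate</position></employee>
</records>"

# asText = TRUE tells xmlParse() the input is XML content, not a file path
parsed <- xmlParse(xml_string, asText = TRUE)
root <- xmlRoot(parsed)

root_tag <- xmlName(root)        # tag name of the root element
child_count <- xmlSize(root)     # number of child nodes under the root

# Apply xmlValue() to every node matched by the XPath expression
positions <- xpathSApply(parsed, "//position", xmlValue)

print(root_tag)
print(child_count)
print(positions)
```

With the file from earlier, the equivalent call would be xmlParse("data.xml").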

Pro tip: if you don’t care about the data, you can print the structure only. That’s done with the xml_structure() function:

xml_structure(employee_data)

Image 3 – Structure of an XML document

If you want to access all elements with the same tag, you can use the xml_find_all() function. It takes an XPath expression and returns a set of matching nodes, each printed with its opening tag, closing tag, and the content between them:

xml_find_all(employee_data, ".//position")

Image 4 – Accessing individual nodes
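
Because the second argument is an XPath expression, you can also filter nodes instead of fetching all of them. The following sketch shows two variations on a shortened inline dataset (an assumption made so the example is self-contained): xml_find_first() for a single match, and an XPath predicate that keeps only employees from one department.

```r
library(xml2)

# Shortened inline stand-in for data.xml
doc <- read_xml("<records>
  <employee><last_name>Smith</last_name><department>Management</department></employee>
  <employee><last_name>Brown</last_name><department>IT</department></employee>
  <employee><last_name>Rollers</last_name><department>IT</department></employee>
</records>")

# xml_find_first() returns only the first matching node
first_dept <- xml_text(xml_find_first(doc, ".//department"))

# A predicate in square brackets filters nodes by the value of a child element
it_names <- xml_text(xml_find_all(doc, ".//employee[department='IT']/last_name"))

print(first_dept)
print(it_names)
```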

If you only want the content, use xml_text(), xml_integer(), or xml_double(), depending on the underlying data type. The first one makes the most sense here:

xml_text(xml_find_all(employee_data, ".//position"))

Image 5 – Getting values from individual nodes
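
To see how the three extractors differ, here is a brief sketch that reads the same nodes with each of them. It assumes a shortened inline dataset with only id and salary fields, and the variable names are ours.

```r
library(xml2)

# Shortened inline stand-in for data.xml
doc <- read_xml("<records>
  <employee><id>1</id><salary>10000</salary></employee>
  <employee><id>2</id><salary>3500</salary></employee>
</records>")

salary_nodes <- xml_find_all(doc, ".//salary")

as_text <- xml_text(salary_nodes)      # character vector
as_int <- xml_integer(salary_nodes)    # integer vector
as_dbl <- xml_double(salary_nodes)     # double vector

print(class(as_text))
print(class(as_int))
print(class(as_dbl))
```

The numeric variants save you a separate as.integer() or as.numeric() call when you already know what the element holds.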

You now know how to do some basic R XML operations, but most of the time you want to convert these files to either a tibble or a data frame for easier access and manipulation. Let’s see how to do that next.

How to Convert XML Data to tibble and data.frame

Most of the time with R and XML you’ll want to extract either all or a couple of features and turn them into a more readable format. We’ve already shown you how to use xml_text() to extract text from a specific element, and now we’ll do a similar thing with integers. Then, we’ll format these two fields as a tibble.

Here’s the entire code snippet:

library(tibble)

# Extract department and salary info
dept <- xml_text(xml_find_all(employee_data, ".//department"))
salary <- xml_integer(xml_find_all(employee_data, ".//salary"))

# Format as a tibble
df_dept_salary <- tibble(department = dept, salary = salary)
df_dept_salary

Image 6 – Converting an XML document to an R tibble
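
The same idea extends to more columns. One possible approach, sketched below, is to select the <employee> nodes once and then pull each field relative to them, which keeps the rows aligned even if a field were missing for some employee. The inline dataset and the helper name field() are our assumptions, not part of xml2.

```r
library(xml2)
library(tibble)

# Shortened inline stand-in for data.xml
doc <- read_xml("<records>
  <employee><id>1</id><last_name>Smith</last_name><department>Management</department><salary>10000</salary></employee>
  <employee><id>2</id><last_name>Sense</last_name><department>Marketing</department><salary>3500</salary></employee>
</records>")

employees <- xml_find_all(doc, ".//employee")

# Hypothetical helper: text of one child element for every employee.
# xml_find_first() returns one result per employee, so rows stay aligned.
field <- function(tag) {
  xml_text(xml_find_first(employees, paste0("./", tag)))
}

df_all <- tibble(
  id = as.integer(field("id")),
  last_name = field("last_name"),
  department = field("department"),
  salary = as.integer(field("salary"))
)

print(df_all)
```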

Now we have the department names and salaries for all employees. From here, it’s easy to calculate the average salary per department (note that only the IT department has more than one employee, so every other average is just a single salary):

library(dplyr)

# Group by department name to get average salary by department
df_dept_salary %>% 
  group_by(department) %>%
  summarise(salary = mean(salary))

Image 7 – Aggregations on an R tibble

If you want to convert the entire XML document to an R data.frame, look no further than the XML package. It has a convenient xmlToDataFrame() function that does the job in one call:

df_employees <- xmlToDataFrame(nodes = getNodeSet(employee_xml, "//employee"))
df_employees

Image 8 – Converting an XML document to an R data.frame
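
One thing to check after the conversion is the column types. xmlToDataFrame() builds its columns from node text, so they typically arrive as character vectors, and numeric fields need converting before you do any arithmetic on them. The sketch below assumes a shortened inline dataset and converts the columns by hand; the variable names are ours.

```r
library(XML)

# Shortened inline stand-in for data.xml
xml_string <- "<records>
  <employee><id>1</id><department>Management</department><salary>10000</salary></employee>
  <employee><id>2</id><department>Marketing</department><salary>3500</salary></employee>
</records>"

parsed <- xmlParse(xml_string, asText = TRUE)
df <- xmlToDataFrame(
  nodes = getNodeSet(parsed, "//employee"),
  stringsAsFactors = FALSE
)

# Inspect the column types before converting anything
str(df)

# Convert the numeric fields explicitly
df$id <- as.integer(df$id)
df$salary <- as.numeric(df$salary)

print(mean(df$salary))
```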

That’s all the loading and preprocessing needed before you can start analyzing and visualizing datasets. It’s also the most common pipeline you’ll have for loading XML files, so we’ll end today’s article here.


Summary of R XML

XML files are still common in 2022, and as a data professional, you should know how to work with them. Almost all R XML-related work you’ll do boils down to loading and parsing XML documents and converting them to an analysis-friendly format. Today you’ve learned how to do that with two excellent R packages.

For a homework assignment, try to read only the <hire_date> element, and make sure to parse it as a date. Is there a built-in function, or do you need to take an extra step? Make sure to let us know in the comment section below.
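
If you get stuck, here is one possible approach, offered as a hint rather than the definitive answer. As far as we know, xml2 offers text, integer, and double extractors but no date-specific one, so this sketch reads the text first and converts it with base R's as.Date(). The inline dataset is a shortened stand-in for data.xml.

```r
library(xml2)

# Shortened inline stand-in for data.xml
doc <- read_xml("<records>
  <employee><id>1</id><hire_date>2022-1-1</hire_date></employee>
  <employee><id>2</id><hire_date>2022-1-15</hire_date></employee>
</records>")

# Step 1: extract the raw text; step 2: convert it to Date
hire_text <- xml_text(xml_find_all(doc, ".//hire_date"))
hire_dates <- as.Date(hire_text, format = "%Y-%m-%d")

print(class(hire_dates))
print(hire_dates)
```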