stringr: 10 Examples on How to Do Efficient String Processing in R
Working with strings in R can be surprisingly complex and challenging. Dealing with diverse data types, including textual, numeric, and language-specific characters, adds further complexity. It’s even worse if you’re collecting string data through some website form. Good luck processing that.
Truth be told, R’s built-in functions for working with strings leave a lot to be desired. That’s where the R package stringr comes in, and it ships with the Tidyverse ecosystem, so it’s likely you already have it installed. Don’t worry if you don’t, as we’ll walk you through the stringr installation steps.
This article will demonstrate 10 useful stringr functions that you should know in order to work efficiently with string data and avoid wasting time reinventing the wheel. Before diving into the examples, we will cover some basics about the R stringr package.
Need to manage environment-specific configuration files in R? Look no further than the R config package.
Table of contents:
- What is the stringr package and How to Install it
- stringr in Action – 10 Functions to Preprocess Textual Data
- Summing Up Strings in R with stringr
What is the stringr package and How to Install it
The stringr package provides a collection of functions for working with strings. It was developed by Hadley Wickham, Chief Scientist at Posit and a well-known figure in the R community.
This package is designed to be user-friendly, easy to learn, and easy to use, which makes it an essential tool for those who want to work with string data effectively.
The stringr package has a lot going for it. Its function naming is consistent, which isn’t a given in other packages. For example, almost all stringr functions start with the prefix str_, followed by the function name.
You can expect to find pretty much any function you can imagine, from simple string operations to pattern matching, substitution, trimming, splitting, and much more. It’s an easy-to-understand tidyverse wrapper around common stringi functions; unless your use case is unusually complex, you can stay in stringr and never call stringi directly.
But before you can use the package, you’ll have to install it. The recommended method is to install the entire tidyverse, since stringr is a part of it.
You can do so by running the following command from the R console:
install.packages("tidyverse")
Alternatively, you can install only stringr by running the following command:
install.packages("stringr")
Either way, you now have stringr installed, which means we can go over the top 10 functions next.
stringr in Action – 10 Functions to Preprocess Textual Data
This section will give you 10 function examples of the stringr package, which will come in handy when preprocessing textual data.
As for the data, we’ll declare a vector x that contains five strings:
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")
print(x)
Printing x echoes the five strings back in the order they were declared.
We can now apply a whole collection of stringr functions to this vector. Let’s start with a simple one.
str_length()
This function returns the number of characters in a given string. When applied to a vector, it returns a vector where each item is the number of characters in the corresponding string.
The str_length() function takes a string or a vector of strings as a parameter and returns either a single integer or a vector of integers, depending on what was passed in.
Take a look at the following example, where we’re using the function on the entire vector at once:
str_length(x)
The call returns 5, 3, 5, 9, and 9. The returned vector of integers lines up with the input vector of strings and tells you how long each string is. Note that the space in "arm chair" counts as a character.
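As a quick sanity check, here is a minimal sketch of the same call on a single string, on the whole vector, and on a missing value. The variable names and the NA case are illustrative additions rather than part of the original example.

```r
library(stringr)

# Same example vector as in the article
x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names (not from the article)
single_len <- str_length("house")        # one string in, one integer out
lens       <- str_length(x)              # one integer per element
na_len     <- str_length(NA_character_)  # a missing string has a missing length

print(single_len)  # 5
print(lens)        # 5 3 5 9 9
print(na_len)      # NA
```

The NA case is worth knowing about when the strings come from a data frame column with gaps, because the missing value is propagated rather than counted as text.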
str_sub()
The str_sub() function returns a substring of a given string. It takes three parameters:
- The string (or a vector of strings)
- The starting index of the substring
- The ending index of the substring
For example, if you pass in 2 and 5 for the last two parameters, only the characters at positions 2 through 5 are returned. Positions start at 1 and both ends are inclusive.
This function is much easier to understand in practice, so let’s apply it to our vector of strings:
str_sub(x, start = 2, end = 5)
The result is "ouse", "ar", "lant", "elep", and "rm c". A string shorter than the end index, such as "car", is simply returned up to its last character.
It’s useful when you want to limit the number of characters or trim the start/end of a string.
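The sketch below repeats the call above and adds two variations that tend to come up when trimming the start or end of a string: negative positions, which count back from the end, and an omitted end argument, which defaults to the last character. The variable names are illustrative.

```r
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names (not from the article)
# Positions are 1-based and inclusive on both ends
middle <- str_sub(x, start = 2, end = 5)

# Negative positions count backwards from the end of each string
last_three <- str_sub(x, start = -3, end = -1)

# With no end given, the substring runs to the last character
no_first <- str_sub(x, start = 2)

print(middle)      # "ouse" "ar" "lant" "elep" "rm c"
print(last_three)  # "use" "car" "ant" "one" "air"
print(no_first)    # "ouse" "ar" "lant" "elephone" "rm chair"
```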
str_detect()
This function returns a boolean or a vector of booleans, depending on whether the given pattern exists in each string.
The str_detect() function takes two parameters: your string (or vector of strings) and a pattern to search for. If the pattern is found, the function returns TRUE; otherwise, it returns FALSE.
Let’s take a look at it in code. We’ll search for the ar letter pattern in our vector of strings:
str_detect(x, "ar")
The resulting vector of booleans is FALSE, TRUE, FALSE, FALSE, TRUE, because only "car" and "arm chair" contain the pattern.
It’s a boolean vector, which means you can use it to select only those input strings that satisfy the condition:
x[str_detect(x, "ar")]
We now get a vector of strings back: "car" and "arm chair".
There’s a more convenient function for doing so, and we’ll explore it later, but it doesn’t hurt to be a bit creative.
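Because the pattern is treated as a regular expression, the same call can also test where a match occurs. The sketch below is a minimal illustration with made-up variable names. It assumes a stringr version that supports the negate argument (1.4.0 or later, as far as we know).

```r
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names (not from the article)
# Plain substring search, as in the example above
has_ar <- str_detect(x, "ar")

# "^" anchors the pattern to the start of the string
starts_with_ar <- str_detect(x, "^ar")

# negate = TRUE flips the result, so this keeps strings WITHOUT "ar"
# (assumes stringr >= 1.4.0)
no_ar <- x[str_detect(x, "ar", negate = TRUE)]

print(has_ar)          # FALSE TRUE FALSE FALSE TRUE
print(starts_with_ar)  # FALSE FALSE FALSE FALSE TRUE
print(no_ar)           # "house" "plant" "telephone"
```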
str_replace()
The str_replace() function is useful when you want to replace the first occurrence of a pattern in a string with a specified replacement string. It takes three parameters:
- The string (or a vector of strings) to search
- The pattern to search
- The replacement string
The function returns a modified string in which the first match of the pattern is replaced with the replacement string; any later matches are left alone.
Let’s give it a shot and replace the first letter e in each string with the string ***:
str_replace(x, "e", "***")
It returns "hous***", "car", "plant", "t***lephone", and "arm chair".
The function does what was advertised, which is replacing only the first occurrence of the search pattern. Just take a look at the telephone string and you’ll see that only the first e was replaced.
If you want to replace all occurrences, do so with the upcoming function.
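One detail that is easy to miss is that the pattern is a regular expression by default, so characters such as "." have a special meaning. The following sketch uses made-up variable names and a hypothetical "a.b.c" input to show a character class and the fixed() helper for literal matching.

```r
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names (not from the article)
# A character class replaces the first vowel in each string
first_vowel <- str_replace(x, "[aeiou]", "_")

# As a regex, "." matches any character, so the first character is replaced
regex_dot <- str_replace("a.b.c", ".", "-")

# fixed() treats the pattern literally, so the first real dot is replaced
literal_dot <- str_replace("a.b.c", fixed("."), "-")

print(first_vowel)  # "h_use" "c_r" "pl_nt" "t_lephone" "_rm chair"
print(regex_dot)    # "-.b.c"
print(literal_dot)  # "a-b.c"
```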
str_replace_all()
This function is almost identical to the previous one, but it replaces all occurrences of the search pattern with the provided replacement string. It takes in identical parameters, so there’s no need to go over them once again.
We’ll once again replace the letter e with the string ***, this time everywhere it occurs. Here’s the code:
str_replace_all(x, "e", "***")
The results are "hous***", "car", "plant", "t***l***phon***", and "arm chair".
Take a look at the telephone string and you’ll immediately see that all e’s were successfully replaced.
In practice, you’ll use str_replace_all() much more frequently than str_replace().
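str_replace_all() also accepts a named character vector, where each name is a pattern and each value is its replacement, which saves chaining several calls. The sketch below is illustrative only; the letter-to-digit mapping and variable names are made up for the example.

```r
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names (not from the article)
# Single pattern: every "e" is replaced
all_e <- str_replace_all(x, "e", "***")

# Named vector: names are patterns, values are replacements
# (the mapping itself is a made-up example)
swapped <- str_replace_all(x, c("a" = "4", "e" = "3"))

print(all_e)    # "hous***" "car" "plant" "t***l***phon***" "arm chair"
print(swapped)  # "hous3" "c4r" "pl4nt" "t3l3phon3" "4rm ch4ir"
```

The replacements in a named vector are applied one after another, so it is safest to pick pairs where one replacement cannot create a match for a later pattern.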
str_count()
The str_count() function counts the number of times a search pattern appears in a string. It takes two parameters: the string (or vector of strings) to search, and the search pattern, which can also be a regular expression.
This function returns an integer (or a vector of integers) representing the number of times the search pattern was found.
Let’s declare the letter a as a search pattern and perform the search on our vector of words:
str_count(x, "a")
The function returns 0, 1, 1, 0, and 2.
That’s the number of times the letter a appears in each of the input strings. Easy!
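Since the search pattern can be a regular expression, the same function can count whole classes of characters. Here is a small sketch with illustrative variable names that counts a single letter and then all vowels.

```r
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names (not from the article)
# Count a single letter per string
a_counts <- str_count(x, "a")

# Count any vowel per string using a character class
vowel_counts <- str_count(x, "[aeiou]")

print(a_counts)      # 0 1 1 0 2
print(vowel_counts)  # 3 1 1 4 3
```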
str_subset()
Remember earlier when we said there’s an easier way to get a vector of strings that satisfies the condition than comparing it to a boolean vector? Well, this is the function for the job.
The str_subset() function returns the subset of a vector of strings that match a search pattern. It takes in two parameters: the vector of strings to search and the search pattern itself.
Let’s take a look at this function in code and return all words that contain a given letter:
We get a vector of three strings back:
Neat! No need to reinvent the wheel.
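For a concrete run, the sketch below assumes the letter a as the search pattern (an illustrative choice) and checks that str_subset() gives the same result as indexing with str_detect(). The negate argument is assumed to be available, which should hold for stringr 1.4.0 and later.

```r
library(stringr)

x <- c("house", "car", "plant", "telephone", "arm chair")

# Illustrative variable names; the pattern "a" is an assumption for this sketch
with_a <- str_subset(x, "a")

# Equivalent to indexing with the boolean vector from str_detect()
with_a_manual <- x[str_detect(x, "a")]

# negate = TRUE keeps the strings that do NOT match (assumes stringr >= 1.4.0)
without_a <- str_subset(x, "a", negate = TRUE)

print(with_a)                            # "car" "plant" "arm chair"
print(identical(with_a, with_a_manual))  # TRUE
print(without_a)                         # "house" "telephone"
```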
str_trim()
The str_trim() function is useful when you have messy strings full of leading and trailing whitespace. It removes that whitespace, either from a single string or from a vector of strings.
Since our vector x doesn’t contain any elements with leading or trailing whitespace, we’ll declare a new one that does:
y <- c(" hello ", "from ", " R ")
print(y)
Printing y shows the extra spaces inside the quotes.
From here, just pass this vector into the str_trim() function and you’ll be good to go:
str_trim(y)
The result is "hello", "from", and "R", with the surrounding whitespace gone.
This function is particularly useful when processing form data, since it strips any leading or trailing whitespace that was entered by mistake.
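str_trim() only touches the ends of a string. The sketch below shows the side argument for trimming one end only, plus str_squish(), a related stringr function that also collapses repeated spaces inside the string. The messy input and variable names are made up for illustration.

```r
library(stringr)

y <- c(" hello ", "from ", " R ")

# Illustrative variable names (not from the article)
# Default: trim both ends
both <- str_trim(y)

# side = "left" removes leading whitespace only; trailing spaces stay
left_only <- str_trim(y, side = "left")

# str_squish() trims the ends and collapses inner runs of whitespace
# (the input string is a made-up example)
squished <- str_squish("  too   many    spaces ")

print(both)       # "hello" "from" "R"
print(left_only)  # "hello " "from " "R "
print(squished)   # "too many spaces"
```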
str_split()
This function splits a string or a vector of strings into substrings. It returns a list with one character vector of substrings per input string, even when you pass in a single string. It splits on a delimiter pattern that you pass in, so the basic call takes two parameters.
Now, there’s only one string with two words in our x vector, so we’ll declare a new one where strings are a bit wordier:
z <- c("office chair", "front desk", "brown laptop case")
print(z)
Printing z echoes the three strings back.
We can now call str_split() on z and pass in space as a delimiter:
str_split(z, " ")
The function returns a list in which each child element is a vector of strings: "office" and "chair" for the first, "front" and "desk" for the second, and "brown", "laptop", and "case" for the third.
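Working with the returned list usually means indexing into it or flattening it. The sketch below, with illustrative variable names, shows how to count the pieces, pull out the first word of each string, and ask for a character matrix via simplify = TRUE.

```r
library(stringr)

z <- c("office chair", "front desk", "brown laptop case")

# Illustrative variable names (not from the article)
# One character vector per input string
parts <- str_split(z, " ")

# Number of pieces in each element of the list
n_words <- lengths(parts)

# First piece of each element, returned as a plain character vector
first_words <- vapply(parts, function(p) p[1], character(1))

# simplify = TRUE returns a character matrix instead of a list;
# shorter rows are padded with empty strings
word_matrix <- str_split(z, " ", simplify = TRUE)

print(n_words)           # 2 2 3
print(first_words)       # "office" "front" "brown"
print(dim(word_matrix))  # 3 3
```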
Let’s take a look at another function before wrapping up.
str_to_xyz()
There’s actually no function named str_to_xyz(), but there’s a set of functions for changing the case of a string or a vector of strings. You can use one of the following functions:
- str_to_title(): capitalizes the first letter of each word in a string
- str_to_sentence(): capitalizes the first letter of a string
- str_to_upper(): uppercases the entire string
- str_to_lower(): lowercases the entire string
We’ll show you two of these in action. First, let’s use str_to_title() on the entire vector:
Here are the results:
Each word now has the first letter capitalized. Up next, let’s take a look at str_to_upper(). Here’s the code:
And these are the results:
All letters of each vector item are now uppercased.
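To make the four functions concrete, here is a minimal sketch. It assumes the vector z from the str_split() example as input, and the two mixed-case strings are made up for illustration.

```r
library(stringr)

# Assumption: reusing the multi-word vector z from the str_split() example
z <- c("office chair", "front desk", "brown laptop case")

# Illustrative variable names (not from the article)
titled <- str_to_title(z)   # first letter of every word capitalized
upper  <- str_to_upper(z)   # every letter uppercased

# Made-up inputs for the remaining two functions
sentence <- str_to_sentence("the QUICK brown fox")  # only the first letter stays capitalized
lower    <- str_to_lower("R Is FUN")                # every letter lowercased

print(titled)    # "Office Chair" "Front Desk" "Brown Laptop Case"
print(upper)     # "OFFICE CHAIR" "FRONT DESK" "BROWN LAPTOP CASE"
print(sentence)  # "The quick brown fox"
print(lower)     # "r is fun"
```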
And these are the top 10 stringr functions you must know. Let’s make a brief recap next.
Summing Up Strings in R with stringr
To conclude, the R stringr package packs a powerful set of functions for working with text data. We’ve explored 10 of them in this article, and we hope they’ll help you in your job.
The main benefit of using the stringr package is its simplicity. The functions are intuitive and easy to use, even for newcomers to R. In addition, the package offers consistent syntax across functions, making it easy to learn and apply these tools to different text analysis projects.
What’s your favorite stringr/stringi function? Or a set of functions? Make sure to share in the comment section below, or reach out on Twitter – @appsilon. We’d love to hear your thoughts.
Having trouble managing dependencies in R projects? Try R renv and you’ll never look back.