How many times have you stumbled upon a fully-working solution to your coding problem, but in a different programming language? It happens to R developers all the time - and more often than not, the <i>different programming language</i> is Python. Luckily, there's a way to use R and Python together, and today you'll learn all about it. The <a href="https://rstudio.github.io/reticulate/" target="_blank" rel="noopener">R reticulate</a> package allows you to import Python modules, use different Python virtual environments, and even source entire Python files - straight from R. Today you'll see a practical use case: a simple web scraper written in Python and used from R, with an implementation so clean you'd never suspect two programming languages are communicating behind the scenes. <blockquote>Want to explore alternative methods of using R and Python together? <a href="https://appsilon.com/use-r-and-python-together/" target="_blank" rel="noopener">This article covers two of them</a>.</blockquote> Table of contents: <ul><li><a href="#introduction">R reticulate Introduction - Problem Description</a></li><li><a href="#scraper">Writing a Web Scraper in Python</a></li><li><a href="#usage">How to Run Python Code from R with reticulate</a></li><li><a href="#summary">Summing up R reticulate Use Case</a></li></ul> <hr /> <h2 id="introduction">R reticulate Introduction - Problem Description</h2> The <code>reticulate</code> package provides a set of tools for interoperability between Python and R. This means you can easily call Python from R - through R Markdown, sourced Python scripts, imported Python modules, and so on. It also translates between common R and Python objects: R data frames map to Pandas DataFrames, and R matrices map to NumPy arrays. Long story short - if you have functions written in Python, reticulate allows you to use them in R. That's exactly what we'll do today. The idea is to build a simple web scraper that fetches books from the <a href="http://books.toscrape.com/" target="_blank" rel="noopener">Books to Scrape</a> website. The site is built for developers to practice web scraping, so don't worry about <i>legality</i> or <a href="https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01" target="_blank" rel="noopener">ethics</a> - you have permission to scrape it day in and day out. You'll see a bunch of books split into categories when you first open the page: <img class="size-full wp-image-15807" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d77e09973e5f1b984586_df0dc968_1-3.webp" alt="Image 1 - Books to Scrape website" width="2766" height="2046" /> Image 1 - Books to Scrape website Here, we care about the URL. Notice how it contains <code>travel_2</code> as the category - we'll need this to fetch all books in a given category, but more on that in a minute. To scrape a single book, you'll need to know the HTML tags and classes of its individual elements. Below is the HTML hierarchy for a single book: <img class="size-full wp-image-15809" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d780c24a1b8139c5ee3d_c2952843_2-3.webp" alt="Image 2 - HTML hierarchy for a single book" width="4336" height="2654" /> Image 2 - HTML hierarchy for a single book We'll fetch 7 elements for every book. Each book is stored in an <code>article.product_pod</code> tag (a minimal extraction sketch follows the list): <ol><li><b>Title: </b><code>h3</code> -> <code>a</code> (<i>title</i> attribute).</li><li><b>Book URL: </b><code>div.image_container</code> -> <code>a</code> (<i>href</i> attribute).</li><li><b>Book thumbnail URL: </b><code>div.image_container</code> -> <code>img</code> (<i>src</i> attribute).</li><li><b>Rating: </b><code>p.star-rating</code> (<i>class</i> attribute, last item).</li><li><b>Price: </b><code>div.product_price</code> -> <code>p.price_color</code>.</li><li><b>Availability: </b><code>div.product_price</code> -> <code>p.instock.availability</code>.</li><li><b>Topic: </b>extracted from the URL.</li></ol>
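To make those selectors concrete, here's a minimal BeautifulSoup sketch that pulls the fields out of a single book. The HTML below is a trimmed-down, hypothetical stand-in for the site's real markup - just enough to exercise the tags and classes listed above: <pre><code class="language-python">from bs4 import BeautifulSoup

# Abbreviated, hypothetical markup for one book - the real pages carry more attributes
sample_html = '''
<article class="product_pod">
    <div class="image_container">
        <a href="../../../its-only-the-himalayas_981/index.html">
            <img src="../../../../media/cache/27/a5/thumbnail.jpg" />
        </a>
    </div>
    <p class="star-rating Two"></p>
    <h3><a title="It's Only the Himalayas">It's Only the ...</a></h3>
    <div class="product_price">
        <p class="price_color">£45.17</p>
        <p class="instock availability">In stock</p>
    </div>
</article>
'''

book = BeautifulSoup(sample_html, 'html.parser').find('article', attrs={'class': 'product_pod'})

print(book.find('h3').find('a').get('title'))                                       # Title
print(book.find('div', attrs={'class': 'image_container'}).find('a').get('href'))   # Book URL (relative)
print(book.find('div', attrs={'class': 'image_container'}).find('img').get('src'))  # Thumbnail URL (relative)
print(book.find('p', attrs={'class': 'star-rating'}).get('class')[-1])              # Rating, e.g. 'Two'
print(book.find('div', attrs={'class': 'product_price'}).find('p', attrs={'class': 'price_color'}).get_text())  # Price</code></pre> In the real scraper we'll wrap these same calls in existence checks, since not every book is guaranteed to have every element.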
Now you know the tags, but there's still one thing we need to take care of, and that's the Python environment. The following shell commands create a new Anaconda virtual environment, activate it, and install the required Python libraries (<code>lxml</code> is included because it's the parser we'll hand to BeautifulSoup): <pre><code class="language-shell">conda create --name py_scrape python=3.9 -y
conda activate py_scrape
pip install numpy pandas requests bs4 lxml</code></pre> That's all we need to write the scraper. Let's do that next! <h2 id="scraper">Writing a Web Scraper in Python</h2> The best practice when working with R reticulate is to wrap the Python programming logic in a function. This way, you can call the function from R without any hassle. The <code>scrape_books</code> Python function does the following: <ul><li>Accepts a list of topics as an argument (provided by you, gathered from the website).</li><li>Generates the full list of category URLs.</li><li>Opens each URL and extracts our 7 points of interest for every listed book.</li><li>Stores the results as a list of dictionaries and returns them as a Pandas DataFrame.</li></ul> It's a lot of code, so feel free to copy the snippet below into a <code>book_scraper.py</code> file if you're not too comfortable with Python: <pre><code class="language-python">import requests
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup


def scrape_books(topic_list):
    # Generate full URLs from the argument provided by the user
    all_urls = []
    for topic in topic_list:
        all_urls.append(f'http://books.toscrape.com/catalogue/category/books/{topic}/index.html')

    # Instantiate an empty list for holding the dictionary objects
    all_books = []

    # Inform the user that the scraping has started
    start_time = datetime.now()
    print('Book Scraping in Progress...')

    # Iterate over every URL
    for url in all_urls:
        # Fetch the HTML and parse it
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'lxml')

        # The topic can be extracted from the URL itself
        # Remove everything that isn't necessary - '_2' from 'travel_2', for example
        curr_topic = url.split('/')[-2].split('_')[0]

        # The article tag is the starting point for a single book
        books = soup.find_all('article', attrs={'class': 'product_pod'})

        # For every article tag on the webpage
        for book in books:
            # Initialize the variables so no error is thrown if data isn't found
            book_title = ''
            book_link = ''
            thumbnail_link = ''
            rating = ''
            price = ''
            availability = ''

            # Check if the title exists - if it does, update book_title
            if book.find('h3').find('a') is not None:
                book_title = book.find('h3').find('a').get('title')

            # Check if the link exists - if it does, update book_link and thumbnail_link
            if book.find('div', attrs={'class': 'image_container'}).find('a') is not None:
                base_book_url = 'http://books.toscrape.com/catalogue/'
                book_url = book.find('div', attrs={'class': 'image_container'}).find('a').get('href')
                book_link = base_book_url + book_url.split('../')[-1]

                base_thumbnail_url = 'http://books.toscrape.com/'
                thumbnail_url = book.find('div', attrs={'class': 'image_container'}).find('img').get('src')
                thumbnail_link = base_thumbnail_url + thumbnail_url.split('../')[-1]

            # Check if the rating exists - if it does, update rating
            if book.find('p', attrs={'class': 'star-rating'}) is not None:
                rating = book.find('p', attrs={'class': 'star-rating'}).get('class')[-1]

            # Check if price and availability exist - if they do, update them
            if book.find('div', attrs={'class': 'product_price'}) is not None:
                price = book.find('div', attrs={'class': 'product_price'}).find('p', attrs={'class': 'price_color'}).get_text()
                availability = book.find('div', attrs={'class': 'product_price'}).find('p', attrs={'class': 'instock availability'}).get_text().strip()

            all_books.append({
                'Topic': curr_topic,
                'Title': book_title,
                'Rating': rating,
                'Price': price,
                'Availability': availability,
                'Link': book_link,
                'Thumbnail': thumbnail_link
            })

    # Inform the user that scraping has finished and report how long it took
    end_time = datetime.now()
    duration = int((end_time - start_time).total_seconds())
    print('Scraping Finished!')
    print(f'\tIt took {duration} seconds to scrape {len(all_books)} books')

    # Return a Pandas DataFrame representation of the list
    return pd.DataFrame(all_books)</code></pre>
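One line of the function worth unpacking is the topic extraction: <code>url.split('/')[-2].split('_')[0]</code>. Here's how it behaves in isolation, using one of the category URLs as an example: <pre><code class="language-python">url = 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'

# Splitting on '/' leaves the category slug as the second-to-last element
print(url.split('/')[-2])                # 'travel_2'

# Splitting the slug on '_' and keeping the first part drops the numeric suffix
print(url.split('/')[-2].split('_')[0])  # 'travel'</code></pre>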
Next, we'll test the function from Python. Declare a list of categories and call the function: <pre><code class="language-python">book_categories = ['travel_2', 'mystery_3', 'historical-fiction_4', 'sequential-art_5',
                   'classics_6', 'philosophy_7', 'romance_8', 'womens-fiction_9',
                   'fiction_10', 'childrens_11', 'religion_12', 'nonfiction_13',
                   'music_14', 'science-fiction_16', 'fantasy_19']

books_df = scrape_books(book_categories)</code></pre> You should see this printed to the console: <img class="size-full wp-image-15811" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d7805951232f1a8057a9_defcd202_3-3.webp" alt="Image 3 - Python output when running the code" width="922" height="146" /> Image 3 - Python output when running the code The books dataset is now in memory, and you can use the <code>head()</code> method to view the first five rows: <img class="size-full wp-image-15813" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d781d7575130662e7fa5_9bb230d7_4-3.webp" alt="Image 4 - Head of the books dataset in Python" width="3122" height="554" /> Image 4 - Head of the books dataset in Python
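While we're still on the Python side, a few quick sanity checks can confirm the scrape worked as expected - for example, the dataset's shape and the per-category and per-rating counts (we'll compute the same rating counts from R with <code>dplyr</code> later on): <pre><code class="language-python"># Quick sanity checks on the scraped dataset
print(books_df.shape)                     # (number of books, 7 columns)
print(books_df['Topic'].value_counts())   # Books scraped per category
print(books_df['Rating'].value_counts())  # Books per star rating</code></pre>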
Everything seems to work, so let's see how to call this function from R with reticulate next. <h2 id="usage">How to Run Python Code from R with reticulate</h2> The <code>reticulate</code> package doesn't come with base R, so you'll need to install it: <pre><code class="language-r">install.packages("reticulate")</code></pre> We created an Anaconda virtual environment earlier in the article, so let's point reticulate to it instead of the default system Python interpreter: <pre><code class="language-r">library(reticulate)

use_condaenv("py_scrape")</code></pre> Now comes the fun part. To load a Python file, use the <code>source_python()</code> function from reticulate. Simply pass in the file name and you're good to go: <pre><code class="language-r">source_python(file = "book_scraper.py")</code></pre> The <code>scrape_books()</code> Python function is now available for you to use in R. Let's declare a character vector of book categories and call it - reticulate converts the vector to a Python list automatically: <pre><code class="language-r">book_categories <- c('travel_2', 'mystery_3', 'historical-fiction_4', 'sequential-art_5',
                     'classics_6', 'philosophy_7', 'romance_8', 'womens-fiction_9',
                     'fiction_10', 'childrens_11', 'religion_12', 'nonfiction_13',
                     'music_14', 'science-fiction_16', 'fantasy_19')

df_books <- scrape_books(book_categories)</code></pre> You should see the following output in the R console: <img class="size-full wp-image-15815" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d7822cd28b714e23d95b_0a203344_5-3.webp" alt="Image 5 - R output when running the code" width="850" height="172" /> Image 5 - R output when running the code The books dataset is now available in R's memory, and you can see it in the <i>Environment</i> tab: <img class="size-full wp-image-15817" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d78309973e5f1b984a9f_7c4873a2_6-1.webp" alt="Image 6 - The books dataset in R" width="2866" height="830" /> Image 6 - The books dataset in R That's how easy it is to use the reticulate package! reticulate has already converted the result to an R data frame, so nothing stops you from continuing to work on the dataset in R - you have access to all the functions available to any other data frame. Let's use <code>dplyr</code> to print the counts of ratings: <pre><code class="language-r">library(dplyr)

df_books %>%
  group_by(Rating) %>%
  summarise(n = n()) %>%
  arrange(desc(n))</code></pre> <img class="size-full wp-image-15819" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d78419357fedd7b4f153_0d43a0ed_7-1.webp" alt="Image 7 - Counts of ratings" width="404" height="418" /> Image 7 - Counts of ratings <blockquote>Are you new to R dplyr? <a href="https://appsilon.com/r-dplyr-tutorial/" target="_blank" rel="noopener">Read our complete guide for beginners</a>.</blockquote> And that does it for today's R reticulate use case. Let's wrap things up next. <hr /> <h2 id="summary">Summing up R reticulate Use Case</h2> Being able to communicate between R and Python makes life easier. There's no denying that Python has the larger code base, and R reticulate allows you to leverage it without rewriting everything from scratch. Just make sure to point reticulate to the correct Python environment with all the required libraries installed. It's that easy. <i>What's your favorite way of cross-communication between R and Python?</i> Please let us know in the comment section below. Also, feel free to move the discussion to Twitter - <a href="https://twitter.com/appsilon" target="_blank" rel="noopener">@appsilon</a>. We'd love to hear from you. <blockquote>Want to learn more about web scraping? <a href="https://appsilon.com/webscraping-dynamic-websites-with-r/" target="_blank" rel="noopener">Read our latest guide on web scraping dynamic websites with R</a>.</blockquote>