Sentiment Analysis of 5 Popular Romantic Comedies: Text Analysis in R

By:
Olga Mierzwa-Sulima
February 14, 2018

<em><strong>Updated</strong>: December 30, 2022.</em>

With Valentine’s Day coming up, I was thinking about a fun analysis that I could turn into a blog post: an R sentiment analysis of the top 5 romantic comedies.

<!--html_preserve--> <center> <img src="https://media.giphy.com/media/3ohuAc7s47ZTXyWLS0/giphy.gif" /></center> <!--/html_preserve-->

Inspired by the beautiful visualization <a href="https://informationisbeautiful.net/visualizations/based-on-a-true-true-story/" target="_blank" rel="noopener noreferrer">“Based on a True True Story?”</a>, I decided to do something similar: a sentiment analysis of the most popular romantic comedies. When you search for <em>romantic comedies</em>, Google suggests a list of movies. The top 5 are: <em><a href="http://www.imdb.com/title/tt0098635/?ref_=nv_sr_3" target="_blank" rel="noopener noreferrer">When Harry Met Sally</a></em>, <em><a href="http://www.imdb.com/title/tt0314331/" target="_blank" rel="noopener noreferrer">Love Actually</a></em>, <em><a href="http://www.imdb.com/title/tt0100405/?ref_=nv_sr_1" target="_blank" rel="noopener noreferrer">Pretty Woman</a></em>, <em><a href="http://www.imdb.com/title/tt0125439/?ref_=nv_sr_1" target="_blank" rel="noopener noreferrer">Notting Hill</a></em>, and <em><a href="http://www.imdb.com/title/tt0108160/?ref_=nv_sr_2" target="_blank" rel="noopener noreferrer">Sleepless in Seattle</a></em>.

<hr />
<h2 id="how-to-do-text-analysis-in-r">How to do Text Analysis in R</h2>

To analyze the movies’ sentiment in R, we can load the movie subtitles with the <code class="highlighter-rouge">subtools</code> package and then use <code class="highlighter-rouge">tidytext</code> to work with the text data.
<pre><code class="language-r"># plyr is attached before dplyr on purpose: attaching it afterwards would mask
# dplyr verbs such as summarise() and break the grouped summaries below.
library(plyr)
library(subtools)
library(tidytext)
library(dplyr)
library(plotly)
library(purrr)
library(lubridate)
library(methods)</code></pre>

<h3 id="working-with-movie-data">Working with movie data</h3>

I downloaded the .srt subtitles for the 5 comedies from <a href="https://www.opensubtitles.org/en/?" target="_blank" rel="noopener noreferrer">Open Subtitles</a> before the analysis. Now let’s load them into R and take a sneak peek at what the data looks like.

<pre><code class="language-r">romantic_comedies_titles &lt;- c(
  "Love Actually", "Notting Hill", "Pretty Woman",
  "Sleepless in Seattle", "When Harry Met Sally"
)
subtitles_path &lt;- "../assets/data/valentines/"

# Read each movie's .srt file (e.g. "love_actually.srt") and tag its rows with the title
romantic_comedies &lt;- romantic_comedies_titles %&gt;%
  map(function(title) {
    title_no_space &lt;- gsub(" ", "_", tolower(title))
    title_file_name &lt;- paste0(subtitles_path, title_no_space, ".srt")
    subtools::read_subtitles(title_file_name) %&gt;%
      mutate(movie_title = title)
  })

head(romantic_comedies[[1]])</code></pre>

<img class="size-full wp-image-17235" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0235a4a4af475c4356630_1-7.webp" alt="Image 1 - Subtitle data" width="1264" height="270" /> Image 1 - Subtitle data

<h3 id="subtitles-preprocessing">Subtitles preprocessing</h3>

The next step is <strong>tokenization</strong>: chopping up the subtitles into single words. At this stage I also do some minor cleaning by removing <strong>stop words</strong>, and I add information about each line and its duration.
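To make these two steps concrete, here is a minimal sketch on a made-up input. The <code class="highlighter-rouge">toy_subtitles</code> tibble and its two lines are invented for illustration; only the column names mirror the subtitle data above.

<pre><code class="language-r">library(dplyr)
library(tidytext)

# Hypothetical two-line input; not taken from any of the movies
toy_subtitles &lt;- tibble(
  ID = c("1", "2"),
  Text_content = c("I love this wonderful city", "That was a terrible idea")
)

data("stop_words")

toy_tokens &lt;- toy_subtitles %&gt;%
  unnest_tokens(word, Text_content) %&gt;%  # one lowercased word per row
  anti_join(stop_words, by = "word")       # drop common words such as "this" and "a"

toy_tokens</code></pre>

Each remaining row keeps its <code class="highlighter-rouge">ID</code>, which is what makes it possible to join the original line text back in later.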
<pre><code class="language-r">tokenize_clean_subtitles &lt;- function(subtitles, stop_words) {
  subtitles %&gt;%
    unnest_tokens(word, Text_content) %&gt;%   # one word per row
    anti_join(stop_words, by = "word") %&gt;%  # drop stop words
    # unnest_tokens() drops the original text, so join it back by subtitle ID
    left_join(subtitles %&gt;% select(ID, Text_content), by = "ID") %&gt;%
    mutate(
      line = paste(Timecode_in, Timecode_out),
      # line duration in seconds
      duration = as.numeric(hms(Timecode_out) - hms(Timecode_in))
    )
}

data("stop_words")

head(stop_words)</code></pre>

<img class="size-full wp-image-17237" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0207bce6734eff8380a6b_2-5.webp" alt="Image 2 - Stopwords" width="263" height="266" /> Image 2 - Stopwords

<pre><code class="language-r">tokenize_romantic_comedies &lt;- romantic_comedies %&gt;%
  map(~ tokenize_clean_subtitles(., stop_words))</code></pre>

After tokenizing the data, I need to classify the <strong>word sentiment</strong>. In this analysis I simply want to know whether a word has a positive or a negative sentiment. The <code class="highlighter-rouge">tidytext</code> package comes with <a href="https://www.tidytextmining.com/sentiment.html#comparing-the-three-sentiment-dictionaries" target="_blank" rel="noopener noreferrer">3 lexicons</a>. The <a href="https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html" target="_blank" rel="noopener noreferrer">bing</a> lexicon categorizes words as positive or negative, so I use it to assign the extracted words to the desired classes.
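As a quick illustration of what the lexicon lookup returns, here is a small sketch with three hand-picked words (the <code class="highlighter-rouge">toy_words</code> tibble is made up for this example). Words that are missing from the lexicon come back with <code class="highlighter-rouge">NA</code>, which is why the missing values need to be handled explicitly.

<pre><code class="language-r">library(dplyr)
library(tidytext)

# Hypothetical words chosen for illustration
toy_words &lt;- tibble(word = c("love", "city", "terrible"))

toy_sentiment &lt;- toy_words %&gt;%
  left_join(get_sentiments("bing"), by = "word")  # adds a `sentiment` column

toy_sentiment</code></pre>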
<pre><code class="language-r">bing &lt;- tidytext::get_sentiments("bing")

assign_sentiment &lt;- function(tokenize_subtitles, bing) {
  tokenize_subtitles %&gt;%
    left_join(bing, by = "word") %&gt;%
    # words that are not in the lexicon are treated as neutral
    mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %&gt;%
    mutate(score = ifelse(sentiment == "positive", 1, ifelse(sentiment == "negative", -1, 0)))
}

tokenize_romantic_comedies_with_sentiment &lt;- tokenize_romantic_comedies %&gt;%
  map(~ assign_sentiment(., bing))</code></pre>

Since I am interested in the sentiment of each movie line, I need to aggregate the scores at the line level. I use a simple rule: if the overall sentiment score of a line is <code class="highlighter-rouge">&gt;= 1</code>, the line is classified as positive; if it is <code class="highlighter-rouge">&lt;= -1</code>, as negative; in all other cases, as neutral.

<pre><code class="language-r">summarized_movie_sentiment &lt;- function(tokenize_subtitles_with_sentiment) {
  tokenize_subtitles_with_sentiment %&gt;%
    group_by(line) %&gt;%
    # dplyr:: is explicit because plyr::summarise() would ignore the grouping
    dplyr::summarise(
      # despite its name, this column holds the sentiment of a single subtitle line
      sentiment_per_minute = sum(score),
      sentiment_per_minute = ifelse(sentiment_per_minute &gt;= 1, 1, ifelse(sentiment_per_minute &lt;= -1, -1, 0)),
      line_duration = max(duration),
      line_text = dplyr::first(Text_content),
      movie_title = dplyr::first(movie_title)
    ) %&gt;%
    ungroup() %&gt;%
    # each line's share of the movie's total subtitle duration
    mutate(perc_line_duration = line_duration / sum(line_duration))
}

summarized_sentiment_romantic_comedies &lt;- tokenize_romantic_comedies_with_sentiment %&gt;%
  map(~ summarized_movie_sentiment(.))

summarized_sentiment_romantic_comedies</code></pre>

<img class="size-full wp-image-17239" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b27075e7bf5d6c2d302459_3-3.webp" alt="Image 3 - Summarised sentiment" width="1542" height="466" /> Image 3 - Summarised sentiment

<h3 id="crème-de-la-crème---data-viz">Crème de la crème: data visualization</h3>

After I am done with data preparation and munging, the fun begins
and I get to visualize the data. To achieve a look similar to “Based on a True True Story?”, I use stacked horizontal bar charts in <code class="highlighter-rouge">plotly</code>. Each bar represents one movie’s running time, and the width of each segment is that line’s share of the movie’s total subtitle duration. <em>Hint:</em> In the interactive chart, hover over a segment to see the actual line and its time.

<img class="size-full wp-image-17233" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b02117fd01060a37e07506_1-6.webp" alt="Image 4 - Sentiment analysis visualization" width="780" height="186" /> Image 4 - Sentiment analysis visualization

<pre><code class="language-r"># Build the chart for a single movie, i.e. one element of
# summarized_sentiment_romantic_comedies
plot_movie_sentiment &lt;- function(movie_sentiment) {
  # Share of subtitle duration per sentiment class, in percent.
  # Groups are ordered -1, 0, 1; this assumes all three classes occur in the movie.
  sentiment_freq &lt;- round(
    movie_sentiment %&gt;%
      dplyr::group_by(factor(sentiment_per_minute)) %&gt;%
      dplyr::summarise(duration = sum(perc_line_duration)) %&gt;%
      .$duration * 100, 0
  )

  plot_title &lt;- paste(
    "&lt;b&gt;", movie_sentiment$movie_title[1], "&lt;/b&gt;",
    '&lt;span style="color: #FA0771"&gt;Positive', paste0(sentiment_freq[3], "%&lt;/span&gt;"),
    '&lt;span style="color: #01A8F1"&gt;Negative', paste0(sentiment_freq[1], "%&lt;/span&gt;")
  )

  plot_ly(movie_sentiment,
    y = ~movie_title, x = ~perc_line_duration,
    type = "bar", orientation = "h", color = ~sentiment_per_minute,
    text = ~ paste("Time:", line, "&lt;br&gt;", "Line:", line_text),
    hoverinfo = "text", colors = c("#01A8F1", "#f7f7f7", "#FA0771"),
    width = 800, height = 200
  ) %&gt;%
    layout(
      xaxis = list(
        title = "", showgrid = FALSE, showline = FALSE,
        showticklabels = FALSE, zeroline = FALSE,
        domain = c(0, 1)
      ),
      yaxis = list(title = "", showticklabels = FALSE),
      barmode = "stack",
      title = plot_title
    ) %&gt;%
    hide_colorbar()
}

# One chart per movie
movie_plots &lt;- summarized_sentiment_romantic_comedies %&gt;%
  map(plot_movie_sentiment)

movie_plots[[1]]</code></pre>

<hr />
<h2 id="next-steps">Next steps</h2>

Recently, I learned about the <code class="highlighter-rouge">sentimentr</code> package, which lets you analyze sentiment at the sentence level. It would be interesting to run the analysis that way and compare the resulting sentiment scores.

If you enjoyed this post, spread the ♥ and share it with someone who loves R as much as you do!
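For the curious, a sentence-level pass might start out roughly like the sketch below. It assumes the <code class="highlighter-rouge">sentimentr</code> package is installed, and the two <code class="highlighter-rouge">toy_lines</code> are invented for illustration rather than taken from the subtitles; I have not run it on the movie data.

<pre><code class="language-r">library(sentimentr)

# Hypothetical subtitle lines, invented for illustration
toy_lines &lt;- c(
  "I love this wonderful city. It always makes me happy.",
  "That was a terrible idea."
)

# get_sentences() splits each element into sentences
toy_sentences &lt;- get_sentences(toy_lines)

# One polarity score per sentence
sentence_scores &lt;- sentiment(toy_sentences)

# Average score per original element (here: per subtitle line)
line_scores &lt;- sentiment_by(toy_sentences)

sentence_scores
line_scores</code></pre>

Unlike the word-counting rule used in this post, these scores are continuous, so the positive/negative thresholds would need to be chosen again.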
