

Sentiment Analysis of 5 Popular Romantic Comedies: Text Analysis in R
Updated: December 30, 2022.
With Valentine’s Day coming up, I was thinking about a fun analysis that I could convert into a blog post – an R sentiment analysis based on the top 5 romantic comedies.

Inspired by the beautiful visualization “Based on a True True Story?”, I decided to do something similar: a sentiment analysis of the most popular romantic comedies.
A Google search for romantic comedies suggests a list of movies. The top 5 are: When Harry Met Sally, Love Actually, Pretty Woman, Notting Hill, and Sleepless in Seattle.
How to do Text Analysis in R
We can analyze the movies’ sentiment in R by loading the movie subtitles with the subtools package and then using tidytext to work with the text data.
library(subtools)
library(tidytext)
# Load plyr before dplyr so that dplyr's summarise() and mutate() are not masked
library(plyr)
library(dplyr)
library(plotly)
library(purrr)
library(lubridate)
library(methods)
Working with movie data
Before the analysis, I downloaded the .srt subtitles for the 5 comedies from Open Subtitles.
Now let’s load them into R and have a sneak peek at what the data looks like.
romantic_comedies_titles <- c(
  "Love Actually", "Notting Hill", "Pretty Woman",
  "Sleepless in Seattle", "When Harry Met Sally"
)
subtitles_path <- "../assets/data/valentines/"
# Read each subtitle file and tag its rows with the movie title
romantic_comedies <- romantic_comedies_titles %>% map(function(title) {
  title_no_space <- gsub(" ", "_", tolower(title))
  title_file_name <- paste0(subtitles_path, title_no_space, ".srt")
  subtools::read_subtitles(title_file_name) %>%
    mutate(movie_title = title)
})
head(romantic_comedies[[1]])
Image 1 – Subtitle data
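The loading code expects each subtitle file to be named after its movie, in lowercase with underscores instead of spaces. As a small sketch of that naming rule (the `file_names`, `expected_files`, and `missing_files` variables are just illustrative names, not part of the original analysis), you can build the expected file names and check which ones are missing before reading anything:

```r
titles <- c("Love Actually", "When Harry Met Sally")

# Same transformation as in the loading code: lowercase, spaces to underscores
file_names <- paste0(gsub(" ", "_", tolower(titles)), ".srt")

# Optional sanity check: which expected files are not on disk yet?
subtitles_path <- "../assets/data/valentines/"
expected_files <- paste0(subtitles_path, file_names)
missing_files <- expected_files[!file.exists(expected_files)]
```

If `missing_files` is not empty, the files need to be renamed or downloaded before the loading step will work.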
Subtitles preprocessing
The next step is tokenization, chopping up the subtitles into single words. At this stage I also perform a minor cleaning task, which is removing stop words and adding information about the line and its duration.
tokenize_clean_subtitles <- function(subtitles, stop_words) {
  subtitles %>%
    unnest_tokens(word, Text_content) %>%
    anti_join(stop_words, by = "word") %>%
    # unnest_tokens() drops Text_content, so join the original text back in by ID
    left_join(subtitles %>% select(ID, Text_content), by = "ID") %>%
    mutate(
      line = paste(Timecode_in, Timecode_out),
      duration = as.numeric(hms(Timecode_out) - hms(Timecode_in))
    )
}
data("stop_words")
head(stop_words)
Image 2 – Stopwords
tokenize_romantic_comedies <- romantic_comedies %>%
map(~tokenize_clean_subtitles(., stop_words))
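To make the tokenization step more concrete, here is a minimal base R sketch of what happens to a single subtitle line. The example line, the tiny stop word vector, and the helpers `tokenize_line` and `timecode_to_seconds` are made up for illustration, and the timecode is assumed to be in "HH:MM:SS.mmm" form. The real `unnest_tokens()` and the full `stop_words` table handle many more cases.

```r
# Hypothetical stand-in for tidytext's stop_words table (a tiny subset)
toy_stop_words <- c("i'll", "have", "what", "she's", "the", "a")

# Lowercase a line, split it on anything that is not a letter or apostrophe,
# then drop empty strings and stop words
tokenize_line <- function(line, stop_words) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stop_words]
}

# Convert an "HH:MM:SS.mmm" timecode into seconds
timecode_to_seconds <- function(timecode) {
  parts <- as.numeric(strsplit(timecode, ":", fixed = TRUE)[[1]])
  sum(parts * c(3600, 60, 1))
}

tokens <- tokenize_line("I'll have what she's having.", toy_stop_words)
duration <- timecode_to_seconds("00:45:12.500") - timecode_to_seconds("00:45:10.000")
```

Here only "having" survives the stop word filter, and the line lasts 2.5 seconds.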
After tokenizing the data, I need to classify each word’s sentiment. In this analysis I simply want to know whether a word has a positive or a negative sentiment. The tidytext package provides access to several sentiment lexicons. The bing lexicon categorizes words as positive or negative, so I use it to assign the extracted words to the desired classes.
bing <- tidytext::get_sentiments("bing")
assign_sentiment <- function(tokenize_subtitles, bing) {
  tokenize_subtitles %>%
    left_join(bing, by = "word") %>%
    mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
    mutate(score = ifelse(sentiment == "positive", 1, ifelse(sentiment == "negative", -1, 0)))
}
tokenize_romantic_comedies_with_sentiment <- tokenize_romantic_comedies %>%
map(~ assign_sentiment(., bing))
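As a rough illustration of the lookup above, the following base R sketch scores three words against a hypothetical mini-lexicon. The words in `toy_lexicon` are assumptions chosen for the example and are not taken from the actual bing lexicon; only the shape (a word column and a sentiment column) mirrors it.

```r
# Hypothetical mini-lexicon in the same shape as the bing lexicon
toy_lexicon <- data.frame(
  word = c("love", "happy", "hate", "terrible"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

tokens <- c("love", "terrible", "coffee")

# Look up each token; words missing from the lexicon come back as NA
sentiment <- toy_lexicon$sentiment[match(tokens, toy_lexicon$word)]
sentiment[is.na(sentiment)] <- "neutral"

# Map labels to numeric scores: positive = 1, negative = -1, neutral = 0
score_map <- c(positive = 1, negative = -1, neutral = 0)
scored <- data.frame(
  word = tokens,
  sentiment = sentiment,
  score = unname(score_map[sentiment]),
  stringsAsFactors = FALSE
)
```

Words that are not in the lexicon, such as "coffee" here, end up neutral with a score of 0, which is what the `left_join()` plus `ifelse()` combination does in the function above.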
Since I want to determine the sentiment of each movie line, I need to aggregate the scores at the line level. I use a simple rule: a line is positive if its overall sentiment score is >= 1, negative if it is <= -1, and neutral otherwise.
summarized_movie_sentiment <- function(tokenize_subtitles_with_sentiment) {
  tokenize_subtitles_with_sentiment %>%
    group_by(line) %>%
    # dplyr:: is explicit so the grouped summary still works if plyr masks summarise()
    dplyr::summarise(
      sentiment_per_minute = sum(score),
      sentiment_per_minute = ifelse(sentiment_per_minute >= 1, 1, ifelse(sentiment_per_minute <= -1, -1, 0)),
      line_duration = max(duration),
      line_text = dplyr::first(Text_content),
      movie_title = dplyr::first(movie_title)
    ) %>%
    ungroup() %>%
    mutate(perc_line_duration = line_duration / sum(line_duration))
}
summarized_sentiment_romantic_comedies <- tokenize_romantic_comedies_with_sentiment %>%
map(~ summarized_movie_sentiment(.))
summarized_sentiment_romantic_comedies
Image 3 – Summarised sentiment
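The aggregation rule can be tried out on a toy example. This is a base R sketch with made-up scores and durations for three lines (A, B, and C), not the actual movie data:

```r
# Hypothetical word-level scores for three subtitle lines
toy_scores <- data.frame(
  line = c("A", "A", "A", "B", "B", "C"),
  score = c(1, 1, -1, -1, 0, 0),
  duration = c(2, 2, 2, 3, 3, 5),
  stringsAsFactors = FALSE
)

# Sum the word scores per line, then collapse to -1, 0 or 1.
# Because the scores are whole numbers, sign() matches the rule
# ">= 1 is positive, <= -1 is negative, otherwise neutral".
line_sentiment <- sign(tapply(toy_scores$score, toy_scores$line, sum))

# Each line's share of the total duration
line_duration <- tapply(toy_scores$duration, toy_scores$line, max)
perc_line_duration <- line_duration / sum(line_duration)
```

Line A has two positive words and one negative word, so it ends up positive. Line B is negative, and line C, with no scored words, stays neutral.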
Crème de la crème: data visualization
With the data preparation and munging done, the fun begins and I get to visualize the data. To achieve a look similar to “Based on a True True Story?”, I use stacked horizontal bar charts in plotly. Each bar segment’s length represents that line’s share of the movie’s total subtitled time.
Hint: Hover on the chart to see the actual line and time.
Image 4 – Sentiment analysis visualization
# Combine the per-movie data frames; plyr:: and dplyr:: are explicit to avoid masking
all_movies <- plyr::ldply(summarized_sentiment_romantic_comedies, data.frame)
n_movies <- length(summarized_sentiment_romantic_comedies)
# Average share of subtitled time per sentiment class (-1, 0, 1), in percent
sentiment_freq <- all_movies %>%
  dplyr::group_by(sentiment = factor(sentiment_per_minute)) %>%
  dplyr::summarize(duration = sum(perc_line_duration) / n_movies) %>%
  dplyr::pull(duration)
sentiment_freq <- round(sentiment_freq * 100, 0)
plot_title <- paste(
  "<b>Top 5 romantic comedies</b>",
  '<span style="color: #FA0771">Positive', paste0(sentiment_freq[3], "%</span>"),
  '<span style="color: #01A8F1">Negative', paste0(sentiment_freq[1], "%</span>")
)
plot_ly(all_movies,
  y = ~movie_title, x = ~perc_line_duration,
  type = "bar", orientation = "h", color = ~sentiment_per_minute,
  text = ~ paste("Time:", line, "<br>", "Line:", line_text),
  hoverinfo = "text", colors = c("#01A8F1", "#f7f7f7", "#FA0771"),
  width = 800, height = 200
) %>%
  layout(
    xaxis = list(
      title = "", showgrid = FALSE, showline = FALSE,
      showticklabels = FALSE, zeroline = FALSE,
      domain = c(0, 1)
    ),
    yaxis = list(title = "", showticklabels = FALSE),
    barmode = "stack",
    title = plot_title
  ) %>%
  hide_colorbar()
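The percentages in the chart title come from summing the duration shares per sentiment class. The base R sketch below uses invented shares for a single hypothetical movie to show that calculation. It also shows why index 1 is the negative share and index 3 the positive one: the factor levels sort as -1, 0, 1.

```r
# Hypothetical line-level results for one movie (the shares sum to 1)
toy_lines <- data.frame(
  sentiment = c(1, 0, -1, 1, 0),
  perc_line_duration = c(0.30, 0.25, 0.15, 0.20, 0.10)
)

# Sum the shares per sentiment class; the classes sort as -1, 0, 1
shares <- tapply(toy_lines$perc_line_duration, factor(toy_lines$sentiment), sum)
sentiment_freq <- round(shares * 100)

negative_pct <- unname(sentiment_freq[1])
positive_pct <- unname(sentiment_freq[3])
```

Note that this indexing assumes all three classes occur in the data. If a movie had no negative lines, for example, the positions would shift.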
Next steps
Recently, I learned about the sentimentr package, which lets you analyze sentiment at the sentence level. It would be interesting to repeat the analysis that way and compare the resulting sentiment scores.
If you enjoyed this post spread the ♥ and share this post with someone who loves R as much as you!