Investigating word distributions with R - Zipf’s law
Hello again! Typically I would start by describing a complicated problem that can be solved using machine or deep learning methods, but today I want to do something different: I want to show you an interesting probabilistic phenomenon! Have you heard of <strong>Zipf’s law</strong>? I hadn't until recently.

Zipf’s law is an empirical law stating that many different datasets found in nature can be described using Zipf’s distribution. Most notably, word frequencies in books, documents and even whole languages can be described in this way. Simplified, Zipf’s law states that if we take a document, a book or any other collection of words and count how many times each word occurs, the resulting frequencies will closely follow Zipf’s distribution.

Let’s say that the number of occurrences of the most frequently occurring word is: <p style="text-align: center;">X</p> Zipf’s law states that the number of occurrences of the second most frequently occurring word will be equal to: <p style="text-align: center;">X/2</p> So basically this word will occur half as many times as the most frequent word. The number of occurrences of the third most frequently occurring word would be: <p style="text-align: center;">X/3</p> And so on. The number of occurrences of the Nth most frequent word would therefore be: <p style="text-align: center;">X/N</p> More recent studies of this phenomenon show that in the case of words there is typically some exponent α, and the frequency of the Nth word is described as: <p style="text-align: center;">X/N<sup>α</sup></p> The simple form above corresponds to α = 1.

To check the theory I downloaded a set of the 50,000 most frequent Polish words in subtitles from <em>OpenSubtitles.org</em> (<a href="https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/pl/pl_50k.txt">https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/pl/pl_50k.txt</a>). Here’s a visualization of the real and theoretical frequencies. To see it more clearly we can use logarithmic scales.
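To make the formula a bit more concrete, here is a minimal sketch in base R. The counts are made-up toy values (not taken from the Polish subtitle list) and `zipf_expected` is a hypothetical helper name, so treat it as an illustration rather than the exact code behind the plots. It also hints at why logarithmic scales help: taking logs gives log(count) = log(X) - α · log(N), which is a straight line with slope -α.

```r
# Hypothetical toy counts, already sorted from most to least frequent.
observed <- c(1000, 480, 350, 240, 210)

# Zipf's prediction: the top count X divided by rank^alpha.
zipf_expected <- function(top_count, ranks, alpha = 1) {
  top_count / ranks^alpha
}

ranks <- seq_along(observed)
expected <- zipf_expected(observed[1], ranks, alpha = 1)

# Side-by-side comparison of observed and predicted counts.
print(data.frame(rank = ranks, observed = observed, expected = expected))

# On log-log axes the prediction is a straight line with slope -alpha.
slope <- unname(coef(lm(log(expected) ~ log(ranks)))[2])

# Base R plot with both axes on a logarithmic scale.
plot(ranks, observed, log = "xy", xlab = "rank", ylab = "count")
points(ranks, expected, col = "red")
```

With real data, the observed points will scatter around the red line rather than sit on it, and a different value of `alpha` may fit better.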
Try it out yourself: a list of example datasets can be found here: <a href="https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists">https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists</a>

You can use this example code to create a similar visualization: <figure class="highlight"> <pre class="language-r"><code class="language-r" data-lang="r">
library(ggplot2)
library(dplyr)
library(gganimate)

# Data frame containing words and their frequency.
# Assumption: the frequency list linked above was saved locally as
# "pl_50k.txt", with one space-separated "word count" pair per line.
word_count <- read.table("pl_50k.txt", sep = " ", quote = "",
                         comment.char = "", fileEncoding = "UTF-8",
                         stringsAsFactors = FALSE)
colnames(word_count) <- c("word", "count")

alpha <- 1 # Change if needed

# Rows are assumed to be sorted by decreasing count, so row number = rank.
word_count <- word_count %>%
  mutate(word = factor(word, levels = word),
         rank = row_number(),
         zipfs_freq = dplyr::first(count) / rank^alpha)

zipfs_plot <- ggplot(word_count, aes(x = rank, y = count)) +
  geom_point(aes(color = "observed")) +
  theme_bw() +
  geom_point(aes(y = zipfs_freq, color = "theoretical")) +
  transition_reveal(rank) + # current gganimate takes a single `along` variable
  labs(x = "rank", y = "count", title = "Zipf's law visualization") +
  scale_colour_manual(name = "Word count",
                      values = c("theoretical" = "red", "observed" = "black")) +
  theme(legend.position = "top")

zipfs_animation <- animate(zipfs_plot)
</code></pre> </figure>

This experiment is amazing because language is very complicated: the words in a text are not random in any sense, and each one depends on the words that came before it. That’s why it's so surprising to see such a regular pattern here. We should always remember that the world can astonish us in many different ways! See you next time :)