Investigating word distributions with R - Zipf’s law

tutorial

natural language processing

ai&research

By:

Appsilon Team

February 27, 2019

Hello again! Typically I would start by describing a complicated problem that can be solved using machine or deep learning methods, but today I want to do something different: I want to show you an interesting probabilistic phenomenon! Have you heard of Zipf’s law? I hadn't until recently. Zipf’s law is an empirical law stating that many different datasets found in nature can be described using Zipf’s distribution. Most notably, word frequencies in books, documents and even whole languages can be described in this way.

Simplified, Zipf’s law states that if we take a document, book or any collection of words and then count how many times each word occurs, the resulting frequencies will be very close to Zipf’s distribution. Let’s say that the number of occurrences of the most frequently occurring word is:

X

Zipf’s law states that the number of occurrences of the second most frequently occurring word will be approximately:

X/2

So basically this word will occur half as many times as the most frequent word. The number of occurrences of the third most frequently occurring word would be:

X/3

And so on. So the number of occurrences of the Nth most frequent word would be:

X/N

Recent studies of this phenomenon show that in the case of words there is typically a fixed exponent α, and the frequency of the Nth word is described as:

X/N^α

Setting α = 1 gives back the simple X/N rule above.

To check the theory I downloaded a set of the 50,000 most frequent Polish words in subtitles (<a href="https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/pl/pl_50k.txt">https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/pl/pl_50k.txt</a>) from OpenSubtitles.org. Here’s a visualization of real and theoretical frequencies. To see it more clearly we can use logarithmic scales.

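To make the X/N^α rule concrete, here is a minimal sketch in base R. The counts are made up for illustration and are not taken from the Polish subtitle list, and the helper names `zipf_expected` and `estimate_alpha` are my own. It also shows why logarithmic scales help: on log-log axes the rule becomes a straight line with slope -α, so a simple linear fit gives a rough estimate of the exponent. Treat this as one possible quick check, not as a rigorous way of fitting a power law.

```r
# Expected counts under Zipf's law: X / N^alpha for ranks N = 1..n_ranks.
# `top_count` plays the role of X (count of the most frequent word).
zipf_expected <- function(top_count, n_ranks, alpha = 1) {
  ranks <- seq_len(n_ranks)
  top_count / ranks^alpha
}

# Hypothetical example: the top word occurs 1200 times, alpha = 1.
expected <- zipf_expected(top_count = 1200, n_ranks = 5)
print(expected)
# 1200 600 400 300 240

# log(count) = log(X) - alpha * log(rank), so the slope of a linear fit
# of log(count) on log(rank) is roughly -alpha.
# Assumes `counts` is already sorted from most to least frequent.
estimate_alpha <- function(counts) {
  ranks <- seq_along(counts)
  fit <- lm(log(counts) ~ log(ranks))
  -unname(coef(fit)[2])
}

# Sanity check on synthetic data generated with a known exponent.
alpha_hat <- estimate_alpha(zipf_expected(1200, 100, alpha = 1.3))
print(round(alpha_hat, 3))
# 1.3
```

On real word counts the fit will not be exact, and the estimate can be sensitive to the long tail of rare words, so it is worth comparing the fitted line with the observed points on a log-log plot.
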
Try it out yourself: a list of example datasets can be found here: <a href="https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists">https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists</a>

You can use this example code to create a similar visualization:

<figure class="highlight"> <pre class="language-r"><code class="language-r" data-lang="r">
library(ggplot2)
library(dplyr)
library(gganimate)

# word_count must be a data frame containing words and their frequency.
# Assumption for illustration: a local copy of a frequency list with one
# "word count" pair per line, such as the pl_50k.txt file linked above.
word_count <- read.table("pl_50k.txt", sep = " ", quote = "", comment.char = "",
                         encoding = "UTF-8", stringsAsFactors = FALSE)
colnames(word_count) <- c("word", "count")

alpha <- 1 # Change if needed

word_count <- word_count %>%
  arrange(desc(count)) %>%  # make sure rank 1 is the most frequent word
  mutate(word = factor(word, levels = word),
         rank = row_number(),
         # Theoretical count X / N^alpha, where X is the top word's count
         zipfs_freq = dplyr::first(count) / rank^alpha)

zipfs_plot <- ggplot(word_count, aes(x = rank, y = count)) +
  geom_point(aes(color = "observed")) +
  theme_bw() +
  geom_point(aes(y = zipfs_freq, color = "theoretical")) +
  # Reveal the data along the rank axis (gganimate >= 1.0 takes one variable)
  transition_reveal(rank) +
  labs(x = "rank", y = "count", title = "Zipf's law visualization") +
  scale_colour_manual(name = "Word count",
                      values = c("theoretical" = "red", "observed" = "black")) +
  theme(legend.position = "top")

# Render the animation from the plot object defined above
zipfs_animation <- animate(zipfs_plot)
</code></pre> </figure>

This experiment is amazing, because language is very complicated: words in a text are not random in any sense, and each one depends on the words before it. That’s why it's so surprising to see such patterns here. We should always remember that the world can astonish us in many different ways! See you next time :)
