Customer Segmentation for R Users

September 13, 2019

<h2>TL;DR</h2> I used a <a href="https://www.kaggle.com/carrie1/ecommerce-data" target="_blank" rel="noopener noreferrer">Kaggle</a> dataset to show you how to separate your customers into distinct groups based on their purchase behavior. With this method, store managers can customize interactions with existing and potential customers to increase loyalty and, eventually, reap all of the goodies that come with consistent purchases. For the R enthusiasts out there, I demonstrate what you can do with R's stats package, <a href="https://github.com/ricardo-bion/ggradar" target="_blank" rel="noopener noreferrer">ggradar</a>, <a href="https://github.com/tidyverse/ggplot2" target="_blank" rel="noopener noreferrer">ggplot2</a>, <a href="https://github.com/yihui/animation" target="_blank" rel="noopener noreferrer">animation</a>, and <a href="https://github.com/kassambara/factoextra" target="_blank" rel="noopener noreferrer">factoextra</a>. <h2>Intro</h2> We live in a time when the importance of data collection is widely recognized. Every financial transaction, every trip, or meeting with friends can be registered in one of the billions of databases. The tools to collect and store data points have improved drastically in the last several years, as have the tools to make sense of quantitative and qualitative data. That is what we do at Appsilon: we help organizations understand and visualize data. We have found that even businesses that collect data points carefully and deliberately are often still sitting on a potential treasure chest of undiscovered and, consequently, unleveraged business intelligence. Imagine that you run an online shop. Wouldn’t it be useful to identify separate groups of clients that show different shopping behaviors? For this blog post, I have put myself in the role of an online shop owner. Spoiler alert: based on the available data and <b>Machine Learning</b> methods, I extracted three specific customer profiles. 
You can sneak a peek at the profiles in the radar charts below. <img class="aligncenter size-full wp-image-2580" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65aac34e04e50f159e667b38_1radar1.webp" alt="1st radar plot for customer segmentation ex. " width="1397" height="1040" /> <img class="wp-image-2579 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022cb2dc98004783b7c0a_2radar2.webp" alt="2nd radar plot customer segmentation ex." width="512" height="369" /> Made with <a href="https://github.com/ricardo-bion/ggradar" target="_blank" rel="noopener noreferrer">https://github.com/ricardo-bion/ggradar</a> I detected that my customers fall into three groups. Each group can be characterized by product choice, as well as by the frequency, amount, and type of purchases. I was even able to propose some promotional strategies to encourage each group to visit my shop in the future. If you want to learn the magic behind turning data into a pricing and promotion strategy, as well as what hides behind the radar charts above, I encourage you to read the next sections. <h2>What is customer segmentation?</h2> The magic that allowed me to detect customer profiles is called customer segmentation. Customer segmentation is a series of activities that aim to divide clients (retail or business) into homogeneous sub-groups based on their purchase behavior. As a rule, each of the resulting groups reacts differently to the product offered, which gives us the opportunity to tailor the offer to each of them. <h2>Customer segmentation - analysis</h2> Before each analysis, it’s essential to explicitly state questions and expectations about the data and results. 
During my analysis I’d like to answer the following questions: <ol><li style="font-weight: 400;">What do I expect from my analysis?</li><li style="font-weight: 400;">Is the data I have sufficient for my analysis expectations?</li><li style="font-weight: 400;">Which algorithm should I use?</li><li style="font-weight: 400;">Hypothesis: Extracted groups allow me to differentiate customers in a visible way.</li><li style="font-weight: 400;">How am I going to use the results?</li></ol> <h2>What do I expect from my analysis?</h2> In this case, I’m the owner of an online shop. I store details about each order and transaction. I’d like to learn more about my customers and find out how I can attract them and encourage them to use my online shop in the future. My first idea is to find groups of similar customers based on shopping behavior, then analyze each group separately and find out what matters to its members when they place an order. <h2>Is the data sufficient for my analysis expectations?</h2> For my analysis, I’m going to use the E-commerce data that you can find here: <a href="https://www.kaggle.com/carrie1/ecommerce-data" target="_blank" rel="noopener noreferrer">https://www.kaggle.com/carrie1/ecommerce-data</a>. A quick peek at the data: <img class="aligncenter size-full wp-image-2601" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022cd6738007ca3672d1a_quick-spike-into-data.webp" alt="quick data spike" width="1002" height="512" /> Based on such data I can extract lots of information about a customer’s shopping behavior. 
Useful info for my analysis can be: <ol><li style="font-weight: 400;">Average basket value.</li><li style="font-weight: 400;">Basket value range (min, max).</li><li style="font-weight: 400;">Order frequency.</li><li style="font-weight: 400;">Tendency to cancel an order.</li><li style="font-weight: 400;">User’s activity (first and last purchase time).</li></ol> To extract the required information, I aggregated the data twice. The first aggregation is by “InvoiceNo” and the second by “CustomerID,” so each row describes one customer. I left out the “StockCode” and “Country” variables. The “Description” column will be used later. The result is presented in the table below: <img class="aligncenter size-full wp-image-2584" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022cec497a144bd9e4ccc_3.data-after-two-aggregations.webp" alt="data table after 2 aggregations" width="933" height="305" /> I still haven’t used the very important variable “Description”. It stores information about which products interest my customers the most. How can we use this information in the analysis? There are currently 3883 distinct products within the data. It would be useful to group the products by category, but this data point wasn’t included in the set. Since I didn’t want to come up with product categories on my own, I decided to scrape the data from a popular online shop that has the notion of a “product category” (I decided to use eBay). If you want to learn how you can scrape such data, check out Paweł Przytuła's post on <a href="https://appsilon.com/how-to-monitor-real-estate-market-data/" target="_blank" rel="noopener noreferrer">How to hack competition in the real estate market with data monitoring</a>. Assuming that entering a product category for each item would take 15 seconds, I saved 14 hours with this technique. Maybe I’ll blog about it in the future. 
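Going back to the two-step aggregation for a moment, here is a minimal sketch of how it could look in base R. It is not the code used for this post. The toy table and its values are made up, the “Quantity” and “UnitPrice” columns and the derived names (line_value, basket_value, cancel_rate, and so on) are assumptions for illustration, and so is the rule that an invoice number starting with “C” marks a cancelled order.

```r
# Toy stand-in for the transaction table (values are made up).
transactions <- data.frame(
  InvoiceNo  = c("1001", "1001", "1002", "C1003", "1004", "1004"),
  CustomerID = c(1, 1, 1, 1, 2, 2),
  Quantity   = c(2, 1, 5, -1, 3, 4),
  UnitPrice  = c(2.5, 10, 1, 10, 4, 0.5),
  stringsAsFactors = FALSE
)
transactions$line_value <- transactions$Quantity * transactions$UnitPrice

# Step 1: aggregate by invoice, so each row describes one basket.
baskets <- aggregate(line_value ~ InvoiceNo + CustomerID,
                     data = transactions, FUN = sum)
names(baskets)[names(baskets) == "line_value"] <- "basket_value"

# Assumption: an invoice number starting with "C" is a cancellation.
baskets$cancelled <- startsWith(as.character(baskets$InvoiceNo), "C")

# Step 2: aggregate by customer, so each row describes one customer.
by_customer <- split(baskets, baskets$CustomerID)
customer_data <- do.call(rbind, lapply(by_customer, function(b) {
  placed <- b[!b$cancelled, ]  # baskets that were not cancelled
  data.frame(
    CustomerID  = b$CustomerID[1],
    n_baskets   = nrow(placed),
    avg_basket  = mean(placed$basket_value),
    min_basket  = min(placed$basket_value),
    max_basket  = max(placed$basket_value),
    cancel_rate = mean(b$cancelled)
  )
}))

customer_data
```

Splitting the work into two steps keeps basket-level indicators (value, cancellation) separate from customer-level ones (frequency, averages), which mirrors the list of useful indicators above.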
<img class="aligncenter size-full wp-image-2613" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022cf2c75ed484ace0779_product-categories.webp" alt="product categories from ebay" width="537" height="200" /> Below is a list of selected products and the categories we matched them to after scraping: <img class="aligncenter size-full wp-image-2605" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022d3aae8a26201c2ea90_after-scraping-product-category-list.webp" alt="after scraping product category list" width="787" height="281" /> Now we can switch from 3883 “Description” values to 41 “Category” values. Let’s use this information to create a new set of variables that store how much each customer spends in each category. We now have our final dataset: <img class="aligncenter size-full wp-image-2589" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022d3df1312180459d345_final-dataset-1.webp" alt="final dataset for customer segmentation ex. " width="944" height="357" /> Going back to the original question: Is the data I have sufficient for my analysis expectations? The answer is yes. We now store information about the users’ spending behavior, their products of interest, and some basic information about the users’ activity. <h2>Which algorithm should I use?</h2> There are plenty of algorithms that are commonly used for segmentation. You might have heard about the very popular k-means, hierarchical clustering, latent class analysis, or even self-organizing maps. The question is which algorithm is best for my particular data set. The standard approach is to test out each algorithm and compare them according to existing measures. 
You can find an example of such validation in “<a href="https://www.datanovia.com/en/lessons/choosing-the-best-clustering-algorithms/" target="_blank" rel="noopener noreferrer">Choosing the Best Clustering Algorithms</a>.” The most popular algorithm used for partitioning a given data set into a set of k groups is k-means. We’ll use this in our case. The parameter k has to be specified in advance, but a post-hoc analysis (the silhouette or gap statistic) can help you choose the best value. The algorithm tries to minimize intra-cluster (within-cluster) variation, which should result in homogeneous, well-separated groups. The way the algorithm works is shown below: <img class="wp-image-2590 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022d44a80805de208c546_ANIMATION-packagey_ihui.gif" alt="gif k-means customer segmentation" width="480" height="480" /> Made with <a href="https://github.com/yihui/animation" target="_blank" rel="noopener noreferrer">https://github.com/yihui/animation</a> I used the standard <a href="https://www.jstor.org/stable/2346830" target="_blank" rel="noopener noreferrer">Hartigan-Wong algorithm (1979)</a> as implemented in the <a href="https://www.r-project.org/contributors.html" target="_blank" rel="noopener noreferrer">R stats package</a>; it is based on Euclidean distance. It is restricted to numerical (non-categorical) data, so it works with our particular dataset. <h2>Applying the k-means algorithm </h2> As we learned before, the k-means algorithm doesn’t choose the optimal number of clusters upfront, but there are different techniques to make the selection. The most popular ones are the <b>within-cluster sum of squares</b>, the <b>average silhouette,</b> and the <b>gap statistic</b>. The silhouette statistic for a single element compares its mean intra-cluster distance to its mean distance from the nearest neighboring cluster. 
It varies from -1 to 1, where high positive values mean the element is correctly assigned to the current cluster, while negative values signify it’s better to assign it to a neighboring one. Here we present the average silhouette across all data points: <img class="wp-image-2591 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022d61f8aab1bceb967a9_avg-silhouette-width-.webp" alt="average silhouette width" width="700" height="432" /> Made with <a href="https://github.com/kassambara/factoextra" target="_blank" rel="noopener noreferrer">https://github.com/kassambara/factoextra</a> As you can see above, the optimal number of clusters is 2 or 3. Although 2 clusters give the maximum of the average silhouette statistic, 3 clusters give a similar value, and we would like to find more groups in our analysis. So let’s choose 3. To sum up, we’re going to use the k-means algorithm with 3 clusters. We can do it with one line of code: <figure class="highlight"> <pre class="language-r"><code class="language-r" data-lang="r">model <- kmeans(customer_data_scaled, centers = 3) </code></pre> </figure> Let’s extract the chosen clusters from the created model and take a look at the data again: <figure class="highlight"> <pre class="language-r"><code class="language-r" data-lang="r">customer_data$cluster <- model$cluster
head(customer_data) </code></pre> </figure> <img class="aligncenter size-full wp-image-2593" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022d7fbced7e6143b4985_extract-clusters-data.webp" alt="extract clusters data" width="933" height="361" /> How can we verify if the clusters were extracted correctly? Our dataset stores 47 variables, so it’s impossible to compare assigned clusters across all variables at once (readable visualizations are restricted to a maximum of 3 dimensions). 
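Before moving on to visual checks, here is a rough sketch of how the average-silhouette selection and the k-means fit could be reproduced with base R only. It runs on synthetic data, not on the shop dataset. The table toy, the helper avg_silhouette(), and the choices of set.seed(42) and nstart = 25 are my assumptions for illustration. The silhouette plot in this post was made with factoextra, not with this code.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for reproducibility

# Synthetic stand-in for the customer table: three well-separated groups of 30.
toy <- data.frame(
  avg_basket = c(rnorm(30, 20, 3), rnorm(30, 100, 6), rnorm(30, 200, 10)),
  n_baskets  = c(rnorm(30, 2, 0.5), rnorm(30, 8, 0.8), rnorm(30, 15, 1))
)

# Standardize so that no variable dominates the Euclidean distance.
toy_scaled <- scale(toy)

# Average silhouette width for a given cluster assignment.
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  labels <- sort(unique(cluster))
  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)  # common convention for singleton clusters
    # a: mean distance to the other members of the element's own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: mean distance to the members of the nearest other cluster
    b <- min(vapply(setdiff(labels, cluster[i]),
                    function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Compare candidate values of k by their average silhouette.
ks <- 2:6
sil <- vapply(ks, function(k) {
  fit <- kmeans(toy_scaled, centers = k, nstart = 25)
  avg_silhouette(toy_scaled, fit$cluster)
}, numeric(1))
best_k <- ks[which.max(sil)]

# Final fit with the selected k; nstart reruns from several random starts.
final_fit <- kmeans(toy_scaled, centers = best_k, nstart = 25)
table(final_fit$cluster)
```

On real data the maximum is rarely this clear-cut, which is why the choice between 2 and 3 clusters above remains a judgment call.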
One of the most popular approaches to inspecting clusters in high-dimensional data is <a href="https://towardsdatascience.com/understanding-pca-fae3e243731d" target="_blank" rel="noopener noreferrer">Principal Component Analysis (PCA)</a>. PCA combines variables of a provided dataset to create new ones, called principal components, ordered so that the first few capture as much of the dataset’s variation as possible. Plotting the distribution of clusters across the first principal components should allow us to see if the clusters are separated or not. For this case, let’s plot how the clusters are distributed, comparing the 1st vs. the 2nd, as well as the 1st vs. the 3rd principal components. <img class="aligncenter size-full wp-image-2595" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022d91f8aab1bceb968ec_1st-vs-the-2nd.webp" alt="1st vs 2nd pca" width="512" height="316" /> <img class="wp-image-2596 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022da08482900b427aae3_1st-vs-3rd-pca.webp" alt="1st vs 3rd principal component analysis" width="512" height="316" /> Made with <a href="https://github.com/kassambara/factoextra" target="_blank" rel="noopener noreferrer">https://github.com/kassambara/factoextra</a> From the above plots, we can conclude that the 2nd (yellow) cluster is clearly separate from the remaining ones. Clusters 1 and 3 overlap slightly, but each one covers a dense concentration of data points, which is a good sign for this analysis. As the first three principal components cover only 21% of the variance, the remaining dimensions may separate the clusters even more clearly. To sum up, we’re happy with this result and we can now move to the next part of our analysis. <h2>Hypothesis: Extracted groups will allow me to differentiate customers in a visible way</h2> The plots above show cluster assignments across the first three principal components (dim1, dim2, and dim3). 
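For readers who want to try this kind of check themselves, here is a small sketch that uses prcomp() from the stats package on synthetic data. The matrix x, the variable names, and the two-group setup are made up for illustration. The plots in this post were produced with factoextra, so treat this as an approximation of the idea, not as the original code.

```r
set.seed(1)

# Synthetic stand-in: 60 "customers", 5 numeric variables, two shifted groups.
x <- rbind(
  matrix(rnorm(30 * 5, mean = 0), ncol = 5),
  matrix(rnorm(30 * 5, mean = 3), ncol = 5)
)
colnames(x) <- paste0("var", 1:5)

# PCA on centered and standardized variables.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component, and cumulatively.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Cluster the standardized data, then attach cluster labels to the PCA scores.
fit <- kmeans(scale(x), centers = 2, nstart = 25)
scores <- data.frame(
  PC1     = pca$x[, 1],
  PC2     = pca$x[, 2],
  cluster = fit$cluster
)

head(scores)
cum_explained
```

Calling plot(scores$PC1, scores$PC2, col = scores$cluster) draws the scatter plot of the first two components. The third entry of cum_explained corresponds to the kind of figure quoted above: the share of variance covered by the first three components.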
How can we detect which of the 47 variables distinguish our customers? In general, it’s necessary to analyze the distribution of each variable grouped by the assigned cluster. A good tool for this is the <b>violin plot</b>. Below we present a violin plot to show the differences in “avg_basket” between the clusters: <img class="wp-image-2597 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022dcf20e4b4e372c38b3_violin-plot.webp" alt="violin plot " width="700" height="432" /> Made with <a href="https://github.com/tidyverse/ggplot2" target="_blank" rel="noopener noreferrer">https://github.com/tidyverse/ggplot2</a> For this variable, we can see clear differences in “avg_basket” spending between the groups. Nevertheless, comparing the profiles of all 47 variables this way would be burdensome. For simplification and the needs of this blog post, we’ll just check how the average value of each variable differs between the groups; to do so I created radar charts that show all of the variables at once. <img class="aligncenter size-full wp-image-2580" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65aac34e04e50f159e667b38_1radar1.webp" alt="1st radar plot for customer segmentation ex." width="1397" height="1040" /> <img class="wp-image-2579 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b022cb2dc98004783b7c0a_2radar2.webp" alt="2nd radar plot customer segmentation ex." width="512" height="369" /> Made with <a href="https://github.com/ricardo-bion/ggradar" target="_blank" rel="noopener noreferrer">https://github.com/ricardo-bion/ggradar</a> The first chart sums up basket indicators (such as the average basket value or the total number of baskets) across the 3 groups of customers. The second one shows the tendency to buy products in a specific category. From the above summary, we can detect a few simple characteristics of customers in each group. 
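The per-group averages behind radar charts like these can be computed in a few lines of base R. The sketch below uses a made-up table. The column names other than avg_basket, the values, and the helper rescale01() are assumptions for illustration, and this is not the code that produced the charts above.

```r
# Toy customer table with cluster labels already attached (values are made up).
customer_data <- data.frame(
  cluster     = c(1, 1, 2, 2, 3, 3),
  avg_basket  = c(10, 14, 220, 180, 60, 80),
  n_baskets   = c(2, 4, 20, 16, 6, 8),
  cancel_rate = c(0.10, 0.30, 0.00, 0.10, 0.05, 0.15)
)

# Mean of every variable within each cluster: one row per cluster.
profiles <- aggregate(. ~ cluster, data = customer_data, FUN = mean)

# Rescale each variable to [0, 1] across clusters so the radar axes are comparable.
# Note: a variable that is identical in all clusters would divide by zero here.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
profiles_scaled <- profiles
profiles_scaled[-1] <- lapply(profiles[-1], rescale01)

profiles_scaled
```

A table shaped like profiles_scaled, with one row per group and every variable on a common 0 to 1 scale, is the kind of input a radar chart needs. Because of the rescaling, the charts show which group is relatively high or low on each variable, not absolute values.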
Group no. 1 (red): <ul><li style="font-weight: 400;">Tends to spend a low amount of money on each basket. </li><li style="font-weight: 400;">Is mostly focused on ordering electronics, tickets/travel, and jewelry. </li><li style="font-weight: 400;">On average, these clients have also been the least active in the recent past.</li><li style="font-weight: 400;">We can classify the group as typical <b>bargain hunters.</b> </li></ul> Group no. 2 (yellow): <ul><li style="font-weight: 400;">Tends to spend a lot of money on each basket.</li><li style="font-weight: 400;">Orders the highest number of baskets.</li><li style="font-weight: 400;">Products of interest for the group are varied. </li><li style="font-weight: 400;">On average, these clients have also been the most active in the recent past.</li><li style="font-weight: 400;">We can classify the group as <b>regular customers</b>.</li></ul> Group no. 3 (green): <ul><li style="font-weight: 400;">Shows average values for the basket-based indicators (no specific behavior shown).</li><li style="font-weight: 400;">Shows a strong interest in the product category “Collectibles and Art.”</li><li style="font-weight: 400;">We can classify the group as <b>collectors</b>.</li></ul> <h2>How am I going to use the results?</h2> We were able to group our customers based on their purchase behavior and we managed to detect meaningful factors for each group. The best way forward is to prepare specific interactions for each one. Here are some ideas: <ol><li style="font-weight: 400;">For group no. 1, we can offer selected promotions for products from their categories of interest. We could periodically send the discount offers by email or show a message right after the user logs in to our shop.</li><li style="font-weight: 400;">Many regular customers may be entrepreneurs who order wholesale quantities of products. 
We can prepare an offer for them to get an extra discount when they buy in bulk.</li><li style="font-weight: 400;">Collectors might be encouraged to return if we inform them about new and/or unique products from our line. We could even include recommendations from the appropriate influencers. </li></ol> <h2>Summary</h2> To sum up, by answering a few questions about the data and applying the most popular clustering method, we managed to get interesting information about our clients. With this small effort, we were able to propose promotion strategies to encourage the customers to make purchases in our online shop. Thanks for reading!
