Monte Carlo Tree Search: An Introduction


<h2 id="introduction">Introduction to Monte Carlo Tree Search</h2> Just a few short years ago, we witnessed one of the most important game-related AI events in history – Alpha Go became the first computer program to beat the world champion in a game of Go. The publication can be found <a href="https://www.nature.com/articles/nature24270">here</a>. Different techniques from machine learning and tree search have been combined by developers from DeepMind to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind MCTS algorithm and briefly tell you about how it was used at the European Space Agency for planning interplanetary flights. <h2 id="perfect-information-games">Perfect Information Games</h2> Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player has perfect information about all event actions that have previously taken place. Examples of such games are Chess, Go or Tic-Tac-Toe. But just because every move is known, doesn’t mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over <script type="math/tex">10^{170}</script>. <label class="margin-toggle" style="font-size: 0.8em; text-decoration: underline;" for="‘source"><i class="fa fa-sticky-note" aria-hidden="true"></i> sticky note</label><input id="‘source" class="margin-toggle" type="checkbox" /><span class="marginnote"><a href="https://tromp.github.io/go/gostate.pdf"><em>Source</em></a> </span> Every perfect information game can be represented in the form of a tree data structure in the following way. At first, you have a root which encapsulates the beginning state of the game. For Chess that would be 16 white figures and 16 black figures placed in the proper places on the chessboard. For Tic-Tac-Toe it would be simply 3x3 empty matrix. The first player has some number <script type="math/tex">n_1</script> of possible choices to make. In the case of Tic-Tac-Toe this would be 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of <script type="math/tex">n_1</script> children, the next player has <script type="math/tex">n_2</script> possible moves to consider, each of them generating another state of the game - generating a child node. Note that <script type="math/tex">n_2</script> might differ for each of <script type="math/tex">n_1</script> nodes. For instance in chess you might make a move which forces your enemy to make a move with their king or consider another move which leaves your opponent with many other options. An outcome of a play is a path from the root node to one of the leaves. Each leaf consist a definite information which player (or players) have won and which have lost the game. <h2 id="making-a-decision-based-on-a-tree">Making a Decision Based on a Tree</h2> There are two main problems we face when making a decision in perfect information game. The first, and main one is the size of the tree. This doesn’t bother us with very limited games such as Tic-Tac-Toe. We have at most 9 children nodes (at the beginning) and this number gets smaller and smaller as we continue playing. It’s a completely different story with Chess or Go. 
Here the corresponding tree is so huge that you cannot hope to search it entirely. The way to approach this is to do a random walk down the tree for some time and obtain a subtree of the original decision tree. This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don't care at all about the quality of our moves and do not learn from our previous games. Anyone who has played Chess knows that making random moves on a chessboard won't get you far. It might be good for a beginner to get an understanding of how the pieces move, but game after game it's better to learn how to distinguish good moves from bad ones. So, is there a way to use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.

<h2 id="multi-armed-bandit-problem">Multi-Armed Bandit Problem</h2> Imagine that you are at a casino and would like to play a slot machine. You can choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn't compare yourself to the other guy; it's just luck. But still, it's natural to ask whether you could do better next time. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time? The problem you are facing is the Multi-Armed Bandit Problem. It was already known during World War II, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are <script type="math/tex">N</script> slot machines, each one with a different expected return value (what you expect to net from a given machine). You don't know the expected return value of any machine. You are allowed to change machines at any time and play each machine as many times as you'd like. What is the optimal strategy for playing? And what does "optimal" mean in this scenario? Clearly your best option would be to play only on the machine with the highest expected return value. An optimal strategy is a strategy for which you do as well as possible compared to that best machine. It was <a href="https://ac.els-cdn.com/0196885885900028/1-s2.0-0196885885900028-main.pdf?_tid=bebd97a8-bda1-11e7-8cfc-00000aab0f6c&amp;acdnat=1509388984_087d17f273327115f07d7cff1d0d294b">actually proven</a> that, on average, even the best strategy must fall behind the best machine by a margin that grows like <script type="math/tex">O( \ln n )</script> with the number of plays <script type="math/tex">n</script>. So that's the best you can hope for. Luckily, it was also proven that this logarithmic bound can be achieved (again, on average). One way to do it is the following (read <a href="http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf">this paper</a> if you are interested in the proof). For each machine <script type="math/tex">i</script> we keep track of two things: how many times we have tried it (<script type="math/tex">n_i</script>) and the mean return value (<script type="math/tex">x_i</script>) observed so far. We also keep track of how many times (<script type="math/tex">n</script>) we have played overall. Then for each <script type="math/tex">i</script> we compute a confidence interval around <script type="math/tex">x_i</script>: <script type="math/tex; mode=display">x_i \pm \sqrt{ 2 \cdot \frac{\ln n}{n_i} }</script> Each time, we choose to play the machine with the highest upper bound for <script type="math/tex">x_i</script> (so "+" in the formula above).
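As a quick illustration, here is this strategy in a few lines of R. It is a sketch under made-up assumptions: three machines with hidden win probabilities, which the strategy itself does not know; all variable names are ours.

<figure class="highlight"> <pre><code class="language-r" data-lang="r">set.seed(42)
true_p <- c(0.40, 0.55, 0.35)   # hidden expected pay-outs (unknown to the player)
n_arms <- length(true_p)
pulls  <- rep(0, n_arms)        # n_i: how many times machine i was tried
wins   <- rep(0, n_arms)        # total pay-out collected from machine i

for (n in seq_len(10000)) {
  if (n <= n_arms) {
    arm <- n                                        # try each machine once first
  } else {
    ucb <- wins / pulls + sqrt(2 * log(n) / pulls)  # upper confidence bounds
    arm <- which.max(ucb)
  }
  payout <- rbinom(1, 1, true_p[arm])               # 1 if we won, 0 otherwise
  pulls[arm] <- pulls[arm] + 1
  wins[arm]  <- wins[arm] + payout
}

pulls   # the best machine (here the second one) should dominate the play counts</code></pre> </figure>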
This strategy, known as UCB1, is a solution to the Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game. Just treat each possible next move (child node) as a slot machine. Each time we choose to play a move, we end up winning, losing or drawing. This is our pay-out. For simplicity, I will assume that we are only interested in winning, so the pay-out is 1 if we have won and 0 otherwise.

<h2 id="real-world-application-example">Real World Application Example</h2> MAB algorithms have multiple practical applications in the real world, for example price engine optimization or finding the best online campaign. Let's focus on the first one and see how we can implement it in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: <code class="highlighter-rouge">99$</code>, <code class="highlighter-rouge">100$</code>, <code class="highlighter-rouge">115$</code> and <code class="highlighter-rouge">120$</code>. Now you want to test how those prices perform and decide which one to choose eventually. During the first day of your experiment, 4000 people visited your shop while the first price (<code class="highlighter-rouge">99$</code>) was tested, and 368 of them bought the product. For the rest of the prices we have the following outcomes: <ul><li><code class="highlighter-rouge">100$</code>: 4060 visits and 355 purchases,</li><li><code class="highlighter-rouge">115$</code>: 4011 visits and 373 purchases,</li><li><code class="highlighter-rouge">120$</code>: 4007 visits and 230 purchases.</li></ul> Now let's look at the calculations in R and check which price performed best during the first day of our experiment.
<figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">bandit</span><span class="p">)</span> <br><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span> <br><span class="n">visits_day1</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">4000</span><span class="p">,</span> <span class="m">4060</span><span class="p">,</span> <span class="m">4011</span><span class="p">,</span> <span class="m">4007</span><span class="p">)</span> <span class="n">purchase_day1</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">368</span><span class="p">,</span> <span class="m">355</span><span class="p">,</span> <span class="m">373</span><span class="p">,</span> <span class="m">230</span><span class="p">)</span> <span class="n">prices</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">99</span><span class="p">,</span> <span class="m">100</span><span class="p">,</span> <span class="m">115</span><span class="p">,</span> <span class="m">120</span><span class="p">)</span> <br><span class="n">post_distribution</span> <span class="o">=</span> <span class="n">sim_post</span><span class="p">(</span><span class="n">purchase_day1</span><span class="p">,</span> <span class="n">visits_day1</span><span class="p">,</span> <span class="n">ndraws</span> <span class="o">=</span> <span class="m">10000</span><span class="p">)</span> <span class="n">probability_winning</span> <span class="o">&lt;-</span> <span class="n">prob_winner</span><span class="p">(</span><span class="n">post_distribution</span><span class="p">)</span> <span class="nf">names</span><span class="p">(</span><span class="n">probability_winning</span><span class="p">)</span> <span class="o">&lt;-</span> <span class="n">prices</span> <br><span class="n">probability_winning</span></code></pre> </figure> <figure class="highlight"> <pre><code class="language-text" data-lang="text">##     99    100    115    120 ## 0.3960 0.0936 0.5104 0.0000</code></pre> </figure> We calculated the Bayesian probability that the price performed the best and can see that the price <code class="highlighter-rouge">115$</code> has the highest probability (0.5). On the other hand <code class="highlighter-rouge">120$</code> seems bit too much for the customers. The experiment continues for a few more days. 
The experiment continues for a few more days. Day 2 results: <figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">visits_day2</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">8030</span><span class="p">,</span> <span class="m">8060</span><span class="p">,</span> <span class="m">8027</span><span class="p">,</span> <span class="m">8037</span><span class="p">)</span> <span class="n">purchase_day2</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">769</span><span class="p">,</span> <span class="m">735</span><span class="p">,</span> <span class="m">786</span><span class="p">,</span> <span class="m">420</span><span class="p">)</span> <br><span class="n">post_distribution</span> <span class="o">=</span> <span class="n">sim_post</span><span class="p">(</span><span class="n">purchase_day2</span><span class="p">,</span> <span class="n">visits_day2</span><span class="p">,</span> <span class="n">ndraws</span> <span class="o">=</span> <span class="m">1000000</span><span class="p">)</span> <span class="n">probability_winning</span> <span class="o">&lt;-</span> <span class="n">prob_winner</span><span class="p">(</span><span class="n">post_distribution</span><span class="p">)</span> <span class="nf">names</span><span class="p">(</span><span class="n">probability_winning</span><span class="p">)</span> <span class="o">&lt;-</span> <span class="n">prices</span> <br><span class="n">probability_winning</span></code></pre> </figure> <figure class="highlight"> <pre><code class="language-text" data-lang="text">##       99      100      115      120 ## 0.308623 0.034632 0.656745 0.000000</code></pre> </figure> After the second day the price <code class="highlighter-rouge">115$</code> still shows the best results, with <code class="highlighter-rouge">99$</code> and <code class="highlighter-rouge">100$</code> performing very similarly. Using the <code class="highlighter-rouge">bandit</code> package we can also perform a significance analysis, which is handy for an overall proportion comparison using <code class="highlighter-rouge">prop.test</code>. <figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">significance_analysis</span><span class="p">(</span><span class="n">purchase_day2</span><span class="p">,</span> <span class="n">visits_day2</span><span class="p">)</span></code></pre> </figure> <figure class="highlight"> <pre><code class="language-text" data-lang="text">##   successes totals estimated_proportion        lower      upper ## 1       769   8030           0.09576588 -0.004545319 0.01369494 ## 2       735   8060           0.09119107  0.030860453 0.04700507 ## 3       786   8027           0.09791952 -0.007119595 0.01142688 ## 4       420   8037           0.05225831           NA         NA ##   significance rank best       p_best ## 1 3.322143e-01    2    1 3.086709e-01 ## 2 1.437347e-21    3    1 2.340515e-06 ## 3 6.637812e-01    1    1 6.564434e-01 ## 4           NA    4    0 1.548068e-39</code></pre> </figure> At this point we can see that <code class="highlighter-rouge">120$</code> is still performing badly, so we drop it from the experiment and continue for the next day. According to <code class="highlighter-rouge">p_best</code>, the chance that this alternative is the best one is negligible.
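If you prefer base R, the same kind of overall proportion comparison can be run directly with <code class="highlighter-rouge">prop.test</code>, a chi-squared test of the hypothesis that all four conversion rates are equal:

<figure class="highlight"> <pre><code class="language-r" data-lang="r"># A small p-value suggests at least one price converts differently
prop.test(purchase_day2, visits_day2)</code></pre> </figure>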
Day 3 results: <figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">visits_day3</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">15684</span><span class="p">,</span> <span class="m">15690</span><span class="p">,</span> <span class="m">15672</span><span class="p">,</span> <span class="m">8037</span><span class="p">)</span> <span class="n">purchase_day3</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">1433</span><span class="p">,</span> <span class="m">1440</span><span class="p">,</span> <span class="m">1495</span><span class="p">,</span> <span class="m">420</span><span class="p">)</span> <br><span class="n">post_distribution</span> <span class="o">=</span> <span class="n">sim_post</span><span class="p">(</span><span class="n">purchase_day3</span><span class="p">,</span> <span class="n">visits_day3</span><span class="p">,</span> <span class="n">ndraws</span> <span class="o">=</span> <span class="m">1000000</span><span class="p">)</span> <span class="n">probability_winning</span> <span class="o">&lt;-</span> <span class="n">prob_winner</span><span class="p">(</span><span class="n">post_distribution</span><span class="p">)</span> <span class="nf">names</span><span class="p">(</span><span class="n">probability_winning</span><span class="p">)</span> <span class="o">&lt;-</span> <span class="n">prices</span> <span class="n">probability_winning</span></code></pre> </figure> <figure class="highlight"> <pre><code class="language-text" data-lang="text">##       99      100      115      120 ## 0.087200 0.115522 0.797278 0.000000</code></pre> </figure> <figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">value_remaining</span> <span class="o">=</span> <span class="n">value_remaining</span><span class="p">(</span><span class="n">purchase_day3</span><span class="p">,</span> <span class="n">visits_day3</span><span class="p">)</span> <span class="n">potential_value</span> <span class="o">=</span> <span class="n">quantile</span><span class="p">(</span><span class="n">value_remaining</span><span class="p">,</span> <span class="m">0.95</span><span class="p">)</span> <span class="n">potential_value</span></code></pre> </figure> <figure class="highlight"> <pre><code class="language-text" data-lang="text">##        95% ## 0.02670002</code></pre> </figure> Day 3 results lead us to conclude that <code class="highlighter-rouge">115$</code> will generate the highest conversion rate and revenue. We are still unsure about the exact conversion probability for the best price <code class="highlighter-rouge">115$</code>, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of the value remaining). The histograms below show what happens to the value-remaining distribution, that is, the distribution of improvement amounts that another price might have over the current best price, as the experiment continues. With a larger sample we are much more confident about the conversion rate, and over time the other prices have lower and lower chances of beating price <code class="highlighter-rouge">115$</code>. <img src="/blog/figs/2017-11-29-monte-carlo-tree-search/unnamed-chunk-5-1.png" alt="" /> If this example was interesting to you, check out our other <a href="https://appsilon.com/blog/business/2017/11/14/dynamic-pricing.html" target="_blank" rel="noopener noreferrer">post about dynamic pricing</a>.
<h2 id="monte-carlo-tree-search">Monte Carlo Tree Search</h2> We are ready to learn how the Monte Carlo Tree Search algorithm works. As long as we have enough information to treat child nodes as slot machines, we choose the next node (move) as we would have when solving Multi-Armed Bandit Problem. This can be done when we have some information about the pay-outs for each child node. <img src="/blog-old/assets/article_images/2017-11-23-monte-carlo-tree-search/multi-tree.jpeg" alt="At the first node it's black's turn. The node with highest upper bound is chosen. Then it's white's turn and again the node with the highest upper bound is chosen." /> At some point we reach the node where we can no longer proceed in this manner because there is at least one node with no statistic present. It’s time to explore the tree to get new information. This can be done either completely randomly or by applying some heuristic methods when choosing child nodes (in practice this might be necessary for games with high branching factor - like chess or Go - if we want to achieve good results). <img src="/blog-old/assets/article_images/2017-11-23-monte-carlo-tree-search/explore-tree.jpeg" alt="At some point we cannot apply Multi-Armed Bandit procedure, because there is a node which has no stats. We explore new part of the tree." /> Finally, we arrive at a leaf. Here we can check whether we have won or lost. <img src="/blog-old/assets/article_images/2017-11-23-monte-carlo-tree-search/new-leaf-tree.jpeg" alt="We arrive at a leaf. This determines the outcome of the game." /> It’s time to update the nodes we have visited on our way to the leaf. If the player making a move at the corresponding node turned out to be the winner we increase the number of wins by one. Otherwise we keep it the same. Whether we have won or not, we always increase the number of times the node was played (in the corresponding picture we can automatically deduce it from the number of loses and wins). <img src="/blog-old/assets/article_images/2017-11-23-monte-carlo-tree-search/updated-tree.jpeg" alt="It's time to update the tree." /> That’s it! We repeat this process until some condition is met: timeout is reached or the confidence intervals we mentioned earlier stabilize (convergence). Then we make our final decision based on the information we have gathered during the search. We can choose a node with the highest upper bound for pay-out (as we would have in each iteration of the Multi-Armed Bandit). Or, if you prefer, choose the one with the highest mean pay-out. The decision is made, a move was chosen. Now it’s time for our opponent (or opponents). When they’ve finished we arrive at a new node, somewhere deeper in the tree, and the story repeats. <h2 id="not-just-for-games">Not Just For Games</h2> As you might have noticed, Monte Carlo Tree Search can be used as a general technique for making decisions in perfect information scenarios. Therefore its use does not have to be restrained to games only. The most amazing use case I have heard of is to use it for planning interplanetary flights. You can read about it at <a href="http://www.esa.int/gsp/ACT/ai/projects/mcts_traj.html" target="_blank" rel="noopener noreferrer">this website</a> but I will summarize it briefly. Think of an interplanetary flight as of a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars. 
<h2 id="not-just-for-games">Not Just For Games</h2> As you might have noticed, Monte Carlo Tree Search can be used as a general technique for making decisions in perfect information scenarios, so its use does not have to be restricted to games. The most amazing use case I have heard of is planning interplanetary flights. You can read about it at <a href="http://www.esa.int/gsp/ACT/ai/projects/mcts_traj.html" target="_blank" rel="noopener noreferrer">this website</a>, but I will summarize it briefly. Think of an interplanetary flight as a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars. An efficient way to do this is to make use of these planets' gravitational fields (like they did in ‘The Martian' movie), so you can take less fuel with you. The question is when the best time is to arrive at and leave each planet's surface or orbit (for the first and last planet it's only leaving and arriving, respectively). You can treat this problem as a decision tree. If you divide time into intervals, at each planet you make a decision: in which time slot to arrive and in which to leave. Each such choice determines the next. First of all, you cannot leave before you arrive. Second, your previous choices determine how much fuel you have left and, consequently, which places in the universe you can reach. Each such set of consecutive choices determines where you arrive at the end. If you have visited all the required checkpoints, you've won. Otherwise, you've lost. It's just like a perfect information game: there is no opponent, and you make a move by choosing the time slot for departure or arrival. This can be handled with the Monte Carlo Tree Search described above. As you can read <a href="https://fenix.tecnico.ulisboa.pt/downloadFile/563345090414562/Thesis.pdf" target="_blank" rel="noopener noreferrer">here</a>, it fares quite well in comparison with other known approaches to this problem, and at the same time it does not require any domain-specific knowledge to implement. <hr /> Write your questions and comments below. We'd love to hear what you think!
