How to sample from multidimensional distributions using Gibbs sampling
<strong>We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.</strong>

<hr />

<small>Random sampling with a rabbit on the bed plane</small> <iframe class="giphy-embed" src="https://giphy.com/embed/49euH7z1UKBy0" width="480" height="480" frameborder="0" allowfullscreen="allowfullscreen"></iframe> <a href="https://giphy.com/gifs/rabbit-49euH7z1UKBy0">via GIPHY</a>

<hr />

<strong>To start, what are MCMC algorithms and what are they based on?</strong> Suppose we are interested in generating a random variable <script type="math/tex">X</script> with distribution <script type="math/tex">\pi</script> over <script type="math/tex">\mathcal{X}</script>. If we are not able to do this directly, we will be satisfied with generating a sequence of random variables <script type="math/tex">X_0, X_1, \ldots</script> which, in some sense, tends to the distribution <script type="math/tex">\pi</script>. The MCMC approach tells us how to do so: build a Markov chain <script type="math/tex">X_0, X_1, \ldots</script> on <script type="math/tex">\mathcal{X}</script> whose stationary distribution is <script type="math/tex">\pi</script>. If the chain is constructed correctly, we can expect the distribution of <script type="math/tex">X_n</script> to converge to <script type="math/tex">\pi</script>. In addition, for a function <script type="math/tex">f: \mathcal{X} \rightarrow \mathbb{R}</script>, we can expect that <script type="math/tex">\mathbb{E}_{\pi}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i})</script> with probability 1. In essence, we want the above equality to hold for an arbitrary initial variable <script type="math/tex">X_0</script>.

<hr />

<strong>One of the MCMC algorithms that guarantees the above properties is the so-called Gibbs sampler.</strong>

<strong>Gibbs Sampler - description of the algorithm.</strong> Assumptions: <ul><li><script type="math/tex">\pi</script> is defined on the product space <script type="math/tex">\mathcal{X} = \Pi_{i=1}^{d}\mathcal{X_{i}}</script></li><li>We are able to draw from the conditional distributions <script type="math/tex">\pi(X_{i}|X_{-i})</script>, where <script type="math/tex">X_{-i} = (X_{1}, \ldots X_{i-1}, X_{i+1}, \ldots X_{d})</script></li></ul> Algorithm steps: <ol><li>Select the initial values <script type="math/tex">X_{k}^{(0)}, k = 1, \ldots d</script></li><li>For <script type="math/tex">t=1,2,\ldots</script> repeat:<blockquote>For <script type="math/tex">k = 1, \ldots d</script> sample <script type="math/tex">X_{k}^{(t)}</script> from the distribution <script type="math/tex">\pi(X_{k}^{(t)}\rvert X_{1}^{(t)}, \ldots X_{k-1}^{(t)}, X_{k+1}^{(t-1)}, \ldots X_{d}^{(t-1)})</script></blockquote></li><li>Repeat step 2 until the distribution of the vector <script type="math/tex">(X_{1}^{(t)}, \ldots X_{d}^{(t)})</script> stabilizes.</li><li>Subsequent iterations of step 2 then provide draws from <script type="math/tex">\pi</script>.</li></ol> How do we understand step 3? In practice, we are satisfied when the running means of the coordinates stabilize to some chosen accuracy. A minimal sketch of a single sweep from step 2 is given below.
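To make step 2 concrete, here is a minimal R sketch of one full sweep over the coordinates; <code>sample_conditional(k, x)</code> is a hypothetical, user-supplied function that draws <script type="math/tex">X_{k}</script> from <script type="math/tex">\pi(X_{k}|X_{-k})</script>:

<figure class="highlight"> <pre><code class="language-r"># One Gibbs sweep: update each coordinate in turn, conditioning on the
# freshest values of all the other coordinates.
# sample_conditional(k, x) is assumed to draw X_k from pi(X_k | X_{-k}).
gibbs_sweep <- function(x, sample_conditional) {
  for (k in seq_along(x)) {
    x[k] <- sample_conditional(k, x)
  }
  x
}</code></pre> </figure>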
<hr />

<strong>Intuition in the two-dimensional case:</strong> <img src="/blog-old/assets/article_images/2017-10-02-gibbs-sampling/gibbs.png" alt="Gibbs sampling on the plane." width="100%" align="center" /> <small>Source: <a href="http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm" target="_blank" rel="noopener noreferrer">[3]</a></small>

<hr />

<strong>Gibbs sampling from a two-dimensional normal distribution.</strong> We will sample from the distribution of <script type="math/tex">\theta \sim \mathcal{N}(0, [\sigma_{ij}])</script>, where <script type="math/tex">\sigma_{11} = \sigma_{22} = 1</script> and <script type="math/tex">\sigma_{12} = \sigma_{21} = \rho</script>. Knowing the joint density of <script type="math/tex">\theta</script>, it is easy to show that: <script type="math/tex">\theta_{1}|\theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1-\rho^{2})</script> <script type="math/tex">\theta_{2}|\theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1-\rho^{2})</script> R implementation:

<figure class="highlight"> <pre><code class="language-r">gibbs_normal_sampling <- function(n_iteration, init_point, rho) {
  # init_point is a numeric vector of length 2
  theta_1 <- c(init_point[1], numeric(n_iteration))
  theta_2 <- c(init_point[2], numeric(n_iteration))
  for (i in 2:(n_iteration + 1)) {
    # each draw conditions on the most recent value of the other coordinate
    theta_1[i] <- rnorm(1, rho * theta_2[i - 1], sqrt(1 - rho^2))
    theta_2[i] <- rnorm(1, rho * theta_1[i], sqrt(1 - rho^2))
  }
  list(theta_1, theta_2)
}</code></pre> </figure>
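As a quick usage example, we can draw 10,000 points at <script type="math/tex">\rho = 0.5</script> and verify that the empirical correlation between the two coordinates is close to <script type="math/tex">\rho</script> (the burn-in length of 100 below is an arbitrary choice):

<figure class="highlight"> <pre><code class="language-r">set.seed(123)
samples <- gibbs_normal_sampling(10000, init_point = c(0, 0), rho = 0.5)
# drop a short burn-in before estimating moments
burn_in <- 100
cor(samples[[1]][-(1:burn_in)], samples[[2]][-(1:burn_in)])
# should be close to 0.5</code></pre> </figure>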
Using the above function, let's see how the first 10 points were sampled at <script type="math/tex">\rho = 0.5</script>: <img src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d717c6822ae9d7485d26_277860fe_animation.gif" alt="Sampling ten points via gibbs sampler - animation." /> And for 10,000 iterations: <img style="align: center;" src="/blog-old/assets/article_images/2017-10-02-gibbs-sampling/sampler.png" alt="Sampling from bivariate normal distribution." width="100%" /> We leave it to you to compare how the stability of the parameters changes with the choice of <script type="math/tex">\rho</script>.

<hr />

<strong>Let's move on to using the Gibbs sampler to estimate density parameters.</strong> We will show how to use the Gibbs sampler and Bayesian statistics to estimate the mean parameters in a mixture of normal distributions. Assumptions (simplified case): <ul><li>An i.i.d. sample <script type="math/tex">Y=Y_{1},\ldots Y_{N}</script> comes from a mixture of normal distributions <script type="math/tex">(1-\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2})</script>, where <script type="math/tex">\pi</script>, <script type="math/tex">\sigma_{1}</script> and <script type="math/tex">\sigma_{2}</script> are known.</li><li>For <script type="math/tex">i=1,2</script>: <script type="math/tex">\mu_{i} \sim \mathcal{N}(0, 1)</script> (prior distributions), and <script type="math/tex">\mu_{1}</script> and <script type="math/tex">\mu_{2}</script> are independent.</li><li><script type="math/tex">\Delta = (\Delta_{1}, \ldots, \Delta_{N})</script> is the (unobserved) classification vector indicating the component each <script type="math/tex">Y_{i}</script> is drawn from (when <script type="math/tex">\Delta_{i} = 0</script>, <script type="math/tex">Y_{i}</script> is drawn from <script type="math/tex">\mathcal{N}(\mu_{1}, \sigma_{1}^{2})</script>; when <script type="math/tex">\Delta_{i} = 1</script>, from <script type="math/tex">\mathcal{N}(\mu_{2}, \sigma_{2}^{2})</script>).</li><li><script type="math/tex; mode=display">\mathbb{P}(\Delta_{i} = 1) = \pi</script></li></ul> With the above assumptions it can be shown that (a sketch of the underlying conjugate update follows the list): <ul><li><script type="math/tex; mode=display">f(\Delta) = (1-\pi)^{N-\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}</script></li><li><script type="math/tex; mode=display">(\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1-\Delta_{i})})</script></li><li><script type="math/tex; mode=display">(\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})</script></li><li><script type="math/tex; mode=display">\mathbb{P} (\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_1, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_2, \sigma_{2})}(y_{i})}</script></li></ul>
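To see where the conditional distributions of <script type="math/tex">\mu_{1}</script> and <script type="math/tex">\mu_{2}</script> come from, note that the formulas above match the standard normal-normal conjugate update taken with unit observation variance (consistent with the simplified case). For a <script type="math/tex">\mathcal{N}(0, 1)</script> prior on <script type="math/tex">\mu</script> and <script type="math/tex">n</script> observations <script type="math/tex">y_{1}, \ldots, y_{n}</script> treated as <script type="math/tex">\mathcal{N}(\mu, 1)</script>: <script type="math/tex; mode=display">(\mu \rvert y) \sim \mathcal{N}\left(\frac{\sum_{i=1}^{n} y_{i}}{1 + n}, \frac{1}{1 + n}\right)</script> Substituting the observations currently assigned to the first component (those with <script type="math/tex">\Delta_{i} = 0</script>, so <script type="math/tex">n = \sum_{i=1}^{N}(1-\Delta_{i})</script>) gives the formula for <script type="math/tex">(\mu_{1}\rvert\Delta,Y)</script>, and analogously for <script type="math/tex">\mu_{2}</script>.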
The form of the algorithm: <ol><li>Choose a starting point for the means <script type="math/tex">(\mu_{1}^{(0)}, \mu_{2}^{(0)})</script></li><li>In the <script type="math/tex">k</script>-th iteration do:<blockquote><ul><li>With the probability <script type="math/tex">\frac{\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}^{(k-1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}</script> draw <script type="math/tex">\Delta_{i}^{(k)}</script> for <script type="math/tex">i = 1, \ldots, N</script></li><li>Calculate: <script type="math/tex">\hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1-\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})}</script> <script type="math/tex">\hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}</script></li><li>Generate: <script type="math/tex">\mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})})</script> <script type="math/tex">\mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})</script></li></ul></blockquote></li><li>Repeat step 2 until the distribution of the vector <script type="math/tex">(\Delta, \mu_{1}, \mu_{2})</script> stabilizes.</li></ol>

<hr />

How to do this in R? To start, let's generate a random sample from a mixture of normals with parameters <script type="math/tex">(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2)</script>.

<figure class="highlight"> <pre><code class="language-r">set.seed(12345)
mu_1 <- 10
mu_2 <- 2
sigma_1 <- 1
sigma_2 <- 2
pi_known <- 0.7
n <- 2000

# component indicators and the resulting mixture sample
pi_sampled <- rbinom(n, 1, pi_known)
y_1 <- rnorm(n, mu_1, sigma_1)
y_2 <- rnorm(n, mu_2, sigma_2)
y <- (1 - pi_sampled) * y_1 + pi_sampled * y_2</code></pre> </figure>

Take a quick look at a histogram of our data: <div id="htmlwidget-09148139d13cd95c30ee" class="plotly html-widget" style="width: 504px; height: 504px;"></div>
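As a quick sanity check, we can confirm that the generated sample matches the assumed parameters (base R only; splitting by the known indicators <code>pi_sampled</code>):

<figure class="highlight"> <pre><code class="language-r">mean(pi_sampled)             # close to pi_known = 0.7
tapply(y, pi_sampled, mean)  # component means, close to 10 and 2
tapply(y, pi_sampled, sd)    # component standard deviations, close to 1 and 2</code></pre> </figure>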
The next task is to estimate <script type="math/tex">\mu_1</script> and <script type="math/tex">\mu_2</script> in the above model.

<figure class="highlight"> <pre><code class="language-r">library(magrittr)  # for %>%
library(purrr)     # for map_int()

mu_init <- c(12, 0)
n_iterations <- 300
delta <- vector("list", n_iterations)
mu_1_vec <- c(mu_init[1], numeric(n_iterations))
mu_2_vec <- c(mu_init[2], numeric(n_iterations))

# P(Delta_i = 1 | mu_1, mu_2, y_i) from the formula above
delta_probability <- function(i, y, mu_1_vec, mu_2_vec, sigma_1, sigma_2, pi_known) {
  pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2) /
    ((1 - pi_known) * dnorm(y, mu_1_vec[i - 1], sigma_1) +
       pi_known * dnorm(y, mu_2_vec[i - 1], sigma_2))
}

# posterior means of mu_1 and mu_2 given the current classification
mu_1_mean <- function(delta, i, y) {
  sum((1 - delta[[i - 1]]) * y) / (1 + sum(1 - delta[[i - 1]]))
}

mu_2_mean <- function(delta, i, y) {
  sum(delta[[i - 1]] * y) / (1 + sum(delta[[i - 1]]))
}

for (j in 2:(n_iterations + 1)) {
  # draw the classification vector given the current means
  delta[[j - 1]] <- y %>%
    map_int(~ rbinom(1, 1, delta_probability(j, ., mu_1_vec, mu_2_vec,
                                             sigma_1, sigma_2, pi_known)))
  # draw the means given the current classification
  mu_1_vec[j] <- rnorm(1, mu_1_mean(delta, j, y),
                       sqrt(1 / (1 + sum(1 - delta[[j - 1]]))))
  mu_2_vec[j] <- rnorm(1, mu_2_mean(delta, j, y),
                       sqrt(1 / (1 + sum(delta[[j - 1]]))))
}</code></pre> </figure>
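Once the loop has finished, one way to summarize the chains is to average them after discarding a short burn-in (a sketch; the burn-in length of 50 is an arbitrary choice):

<figure class="highlight"> <pre><code class="language-r">burn_in <- 50
mean(mu_1_vec[-(1:burn_in)])  # estimate of mu_1, close to 10
mean(mu_2_vec[-(1:burn_in)])  # estimate of mu_2, close to 2</code></pre> </figure>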
Let's see how the sampled values relate to the known ones. The following plot presents the mean of the <script type="math/tex">\Delta</script> vector at each iteration: <div id="htmlwidget-40bed787eb289217bb47" class="plotly html-widget" style="width: 504px; height: 200px;"></div> Let's check how the means of the parameters <script type="math/tex">\mu_1</script> and <script type="math/tex">\mu_2</script> stabilize during the algorithm: <div id="htmlwidget-6ddf2507d65d69358578" class="plotly html-widget" style="width: 504px; height: 400px;"></div> Note how few iterations were enough to stabilize the parameters. Finally, let's see the estimated density together with our initial sample: <div id="htmlwidget-f04073a4f5ef54a41c66" class="plotly html-widget" style="width: 504px; height: 504px;"></div>

<hr />

Those interested in the topic may refer to <a href="http://gauss.stat.su.se/rr/RR2006_1.pdf" target="_blank" rel="noopener noreferrer">[1]</a>, where you can find a generalization of the normal mixture model that extends the prior distributions to the other parameters. It is also worth comparing the above algorithm with its deterministic counterpart, the Expectation-Maximization (EM) algorithm; see <a href="http://web.stanford.edu/~hastie/ElemStatLearn/" target="_blank" rel="noopener noreferrer">[2]</a>.

<hr />

Write your questions and comments below. We'd love to hear what you think.

<hr />

Resources:

[1] http://gauss.stat.su.se/rr/RR2006_1.pdf
[2] http://web.stanford.edu/~hastie/ElemStatLearn/
[3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm