How to sample from multidimensional distributions using Gibbs sampling

By:
Krystian Igras
October 9, 2017

**We will show how to perform multivariate random sampling using one of the Markov Chain Monte Carlo (MCMC) algorithms, called the Gibbs sampler.**

**To start, what are MCMC algorithms and what are they based on?**

Suppose we want to generate a random variable $X$ with distribution $\pi$ over $\mathcal{X}$. If we are not able to do this directly, we will be satisfied with generating a sequence of random variables $X_0, X_1, \ldots$ that, in some sense, tends to the distribution $\pi$. The MCMC approach tells us how to do so: build a Markov chain $X_0, X_1, \ldots$ on $\mathcal{X}$ whose stationary distribution is $\pi$. If the chain is constructed correctly, we can expect the random variables to converge to $\pi$. In addition, for a function $f: \mathcal{X} \rightarrow \mathbb{R}$ we can expect that

$$\mathbb{E}(f) = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=0}^{n-1} f(X_{i})$$

holds with probability 1. Importantly, we want this equality to hold for an arbitrary starting variable $X_0$.

**One of the MCMC algorithms that guarantees the above properties is the so-called Gibbs sampler.**

**Gibbs sampler - description of the algorithm.**

Assumptions:

- $\pi$ is defined on the product space $\mathcal{X} = \prod_{i=1}^{d}\mathcal{X}_{i}$,
- we are able to draw from the conditional distributions $\pi(X_{i} \mid X_{-i})$, where $X_{-i} = (X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{d})$.

Algorithm steps:

1. Select initial values $X_{k}^{(0)}$, $k = 1, \ldots, d$.
2. For $t = 1, 2, \ldots$ repeat: for $k = 1, \ldots, d$ sample $X_{k}^{(t)}$ from the distribution $\pi(X_{k}^{(t)} \mid X_{1}^{(t)}, \ldots, X_{k-1}^{(t)}, X_{k+1}^{(t-1)}, \ldots, X_{d}^{(t-1)})$.
3. Repeat step 2 until the distribution of the vector $(X_{1}^{(t)}, \ldots, X_{d}^{(t)})$ stabilizes.
4. Subsequent iterations of step 2 then provide draws from $\pi$.

How do we understand step 3 in practice? We are satisfied when, for example, the running averages of the coordinates stabilize to the desired accuracy.
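To make step 2 concrete, here is a minimal sketch of one systematic-scan sweep and of running the resulting chain. It is only an illustration of the update pattern, not code from the examples below: `sample_conditionals` is a hypothetical list in which element $k$ draws $X_k$ from $\pi(X_k \mid X_{-k})$, and you would supply those functions for your own model.

```r
# One systematic-scan Gibbs sweep: update each coordinate in turn,
# conditioning on the most recent values of all the other coordinates.
gibbs_sweep <- function(x, sample_conditionals) {
  for (k in seq_along(x)) {
    x[k] <- sample_conditionals[[k]](x[-k])  # draw X_k | X_{-k}
  }
  x
}

# Run the chain: repeat the sweep and store every visited state.
run_gibbs <- function(x0, sample_conditionals, n_iterations) {
  states <- matrix(NA_real_, nrow = n_iterations + 1, ncol = length(x0))
  states[1, ] <- x0
  for (t in 2:(n_iterations + 1)) {
    states[t, ] <- gibbs_sweep(states[t - 1, ], sample_conditionals)
  }
  states
}
```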
width="100%" align="center" /> <small>Source: <a href="http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm" target="_blank" rel="noopener noreferrer">[3]</a></small> <hr /> <strong>Gibbs sampling for randomization with a two-dimensional normal distribution.</strong> We will sample from the distribution of <script type="math/tex">\theta \sim \mathcal{N}(0, [\sigma_{ij}])</script>, where <script type="math/tex">\sigma_{11} = \sigma_{22} = 1</script> and <script type="math/tex">\sigma_{12} = \sigma_{21} = \rho</script>. Knowing joint density of <script type="math/tex">\theta</script>, it’s easy to show, that: <script type="math/tex">\theta_{1}|\theta_{2} \sim \mathcal{N}(\rho\theta_{2}, 1-\rho^{2})</script> <script type="math/tex">\theta_{2}|\theta_{1} \sim \mathcal{N}(\rho\theta_{1}, 1-\rho^{2})</script> R implementation: <figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">gibbs_normal_sampling</span> <span class="o">&lt;-</span> <span class="k">function</span><span class="p">(</span><span class="n">n_iteration</span><span class="p">,</span> <span class="n">init_point</span><span class="p">,</span> <span class="n">ro</span><span class="p">)</span> <span class="p">{</span>  <span class="c1"># init point is some numeric vector of length equals to 2 </span>  <span class="n">theta_1</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">init_point</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">numeric</span><span class="p">(</span><span class="n">n_iteration</span><span class="p">))</span>  <span class="n">theta_2</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">init_point</span><span class="p">[</span><span class="m">2</span><span class="p">],</span> <span class="n">numeric</span><span class="p">(</span><span class="n">n_iteration</span><span class="p">))</span>  <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="k">in</span> <span class="m">2</span><span class="o">:</span><span class="p">(</span><span class="n">n_iteration</span><span class="m">+1</span><span class="p">))</span> <span class="p">{</span>    <span class="n">theta_1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">ro</span> <span class="o">*</span> <span class="n">theta_2</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">],</span> <span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">ro</span><span class="o">^</span><span class="m">2</span><span class="p">))</span>    <span class="n">theta_2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">ro</span> <span class="o">*</span> <span class="n">theta_1</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">ro</span><span class="o">^</span><span class="m">2</span><span class="p">))</span>  <span class="p">}</span>  <span class="nf">list</span><span class="p">(</span><span 
class="n">theta_1</span><span class="p">,</span> <span class="n">theta_2</span><span class="p">)</span> <span class="p">}</span></code></pre> </figure> Using the above function, let’s see how the 10 points were scored at <script type="math/tex">\rho = 0.5</script>: <img src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d717c6822ae9d7485d26_277860fe_animation.gif" alt="Sampling ten points via gibbs sampler - animation." /> And for 10000 iterations: <img style="align: center;" src="/blog-old/assets/article_images/2017-10-02-gibbs-sampling/sampler.png" alt="Sampling from bivariate normal distribution." width="100%" /> We leave you a comparison of how the stability of the parameters changes depending on the selected <script type="math/tex">\rho</script> parameter. <hr /> <strong>Let’s move on to use the Gibbs sampler to estimate the density parameters.</strong> We will show the use of the Gibbs sampler and bayesian statistics to estimate the mean parameters in the mix of normal distributions. Assumptions (simplified case): <ul><li>iid. sample <script type="math/tex">Y=Y_{1},\ldots Y_{N}</script> comes from a mixture of normal distributions <script type="math/tex">(1-\pi)\mathcal{N}(\mu_{1}, \sigma_{1}^{2})+\pi\mathcal{N}(\mu_{2}, \sigma_{2}^{2})</script>, where <script type="math/tex">\pi</script>, <script type="math/tex">\sigma_{1}</script> i <script type="math/tex">\sigma_{2}</script> are known.</li><li>For i=1,2 <script type="math/tex">\mu_{i} \sim \mathcal{N}(0, 1)</script> (a priori distributions) and <script type="math/tex">\mu_{1}</script> with <script type="math/tex">\mu_{2}</script> are independent.</li><li><script type="math/tex">\Delta = (\Delta_{1}, \ldots, \Delta_{N})</script> - the classification vector (unobserved) from which Y is derived (when <script type="math/tex">\Delta_{i} = 0</script>, <script type="math/tex">Y_{i}</script> is drawn from <script type="math/tex">\mathcal{N}(\mu_{1}, \sigma_{1}^{2})</script>, when <script type="math/tex">\Delta_{i} = 1</script> then drawn from <script type="math/tex">\mathcal{N}(\mu_{2}, \sigma_{2}^{2})</script>).</li><li><script type="math/tex; mode=display">\mathbb{P}(\Delta_{i} = 1) = \pi</script></li></ul> With above assumptions it can be shown that: <ul><li><script type="math/tex; mode=display">f(\Delta) = (1-\pi)^{N-\sum_{i=1}^{N}\Delta_{i}}\cdot\pi^{\sum_{i=1}^{N}\Delta_{i}}</script></li><li><script type="math/tex; mode=display">(\mu_{1}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}(1-\Delta_{i})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i})}, \frac{1}{1+\sum_{i=1}^{N}(1-\Delta_{i})})</script></li><li><script type="math/tex; mode=display">(\mu_{2}\rvert\Delta,Y) \sim \mathcal{N}(\frac{\sum_{i=1}^{N}\Delta_{i}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}}, \frac{1}{1+\sum_{i=1}^{N}\Delta_{i}})</script></li><li><script type="math/tex; mode=display">\mathbb{P} (\Delta_{i} = 1\rvert\mu_{1}, \mu_{2}, Y) = \frac{\pi \phi_{(\mu_{2}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_1, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_2, \sigma_{2})}(y_{i})}</script></li></ul> The form of the algorithm: <ol><li>Choose starting point for the mean <script type="math/tex">(\mu_{1}^{0}, \mu_{2}^{0})</script></li><li>In the <script type="math/tex">k</script>-th iteration do:<blockquote><ul><li>With the probability:<script type="math/tex">\frac{\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}{(1-\pi) \phi_{(\mu_{1}^{(k-1)}, \sigma_{1})}(y_{i})+\pi \phi_{(\mu_{2}^{(k-1)}, \sigma_{2})}(y_{i})}</script> draw <script 
type="math/tex">\Delta_{i}^{(k)}</script> for <script type="math/tex">i = 1, \ldots, N</script></li><li>Calculate:<script type="math/tex">\hat{\mu_{1}} = \frac{\sum_{i=1}^{N}(1-\Delta_{i}^{(k)})y_{i}}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})}</script><script type="math/tex">\hat{\mu_{2}} = \frac{\sum_{i=1}^{N}\Delta_{i}^{(k)}y_{i}}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}}</script></li><li>Generate:<script type="math/tex">\mu_{1}^{(k)} \sim \mathcal{N}(\hat{\mu_{1}}, \frac{1}{1 + \sum_{i=1}^{N}(1-\Delta_{i}^{(k)})})</script><script type="math/tex">\mu_{2}^{(k)} \sim \mathcal{N}(\hat{\mu_{2}}, \frac{1}{1 + \sum_{i=1}^{N}\Delta_{i}^{(k)}})</script></li></ul></blockquote></li><li>Perform step 2. until the distribution of vector <script type="math/tex">(\Delta, \mu_{1}, \mu_{2})</script> stabilizes.</li></ol> <hr /> How to do this in R? At start let’s generate random sample from mixture of normals with parameters <script type="math/tex">(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = (0.7, 10, 2, 1, 2)</script>. <figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">12345</span><span class="p">)</span> <span class="n">mu_1</span> <span class="o">&lt;-</span> <span class="m">10</span> <span class="n">mu_2</span> <span class="o">&lt;-</span> <span class="m">2</span> <span class="n">sigma_1</span> <span class="o">&lt;-</span> <span class="m">1</span> <span class="n">sigma_2</span> <span class="o">&lt;-</span> <span class="m">2</span> <span class="n">pi_known</span> <span class="o">&lt;-</span> <span class="m">0.7</span> <span class="n">n</span> <span class="o">&lt;-</span> <span class="m">2000</span> <span class="n">pi_sampled</span> <span class="o">&lt;-</span> <span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">pi_known</span><span class="p">)</span> <span class="n">y_1</span> <span class="o">&lt;-</span> <span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">mu_1</span><span class="p">,</span> <span class="n">sigma_1</span><span class="p">)</span> <span class="n">y_2</span> <span class="o">&lt;-</span> <span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">mu_2</span><span class="p">,</span> <span class="n">sigma_2</span><span class="p">)</span> <span class="n">y</span> <span class="o">&lt;-</span> <span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">pi_sampled</span><span class="p">)</span> <span class="o">*</span> <span class="n">y_1</span> <span class="o">+</span> <span class="n">pi_sampled</span> <span class="o">*</span> <span class="n">y_2</span></code></pre> </figure> Take a quick look at a histogram of our data: <div id="htmlwidget-09148139d13cd95c30ee" class="plotly html-widget" style="width: 504px; height: 504px;"></div> The following task is to estimate <script type="math/tex">\mu_1</script> and <script type="math/tex">\mu_2</script> from above model. 
<figure class="highlight"> <pre><code class="language-r" data-lang="r"><span class="n">mu_init</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">12</span><span class="p">,</span> <span class="m">0</span><span class="p">)</span> <span class="n">n_iterations</span> <span class="o">&lt;-</span> <span class="m">300</span> <span class="n">delta</span> <span class="o">&lt;-</span> <span class="n">vector</span><span class="p">(</span><span class="s2">"list"</span><span class="p">,</span> <span class="n">n_iterations</span><span class="p">)</span> <span class="n">mu_1_vec</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">mu_init</span><span class="p">[</span><span class="m">1</span><span class="p">],</span> <span class="n">numeric</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">))</span> <span class="n">mu_2_vec</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">mu_init</span><span class="p">[</span><span class="m">2</span><span class="p">],</span> <span class="n">numeric</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">))</span> <br><span class="n">delta_probability</span> <span class="o">&lt;-</span> <span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">mu_1_vec</span><span class="p">,</span> <span class="n">mu_2_vec</span><span class="p">,</span> <span class="n">sigma_1</span><span class="p">,</span> <span class="n">sigma_2</span><span class="p">,</span> <span class="n">pi_known</span><span class="p">)</span> <span class="p">{</span>  <span class="n">pi_known</span> <span class="o">*</span> <span class="n">dnorm</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">mu_2_vec</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">],</span> <span class="n">sigma_2</span><span class="p">)</span> <span class="o">/</span>      <span class="p">((</span><span class="m">1</span><span class="o">-</span> <span class="n">pi_known</span><span class="p">)</span> <span class="o">*</span> <span class="n">dnorm</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">mu_1_vec</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">],</span> <span class="n">sigma_1</span><span class="p">)</span> <span class="o">+</span> <span class="n">pi_known</span> <span class="o">*</span> <span class="n">dnorm</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">mu_2_vec</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">],</span> <span class="n">sigma_2</span><span class="p">))</span> <span class="p">}</span> <br><span class="n">mu_1_mean</span> <span class="o">&lt;-</span> <span class="k">function</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span>  <span class="nf">sum</span><span class="p">((</span><span class="m">1</span> <span class="o">-</span> <span class="n">delta</span><span class="p">[[</span><span class="n">i</span> <span 
class="o">-</span> <span class="m">1</span><span class="p">]])</span> <span class="o">*</span> <span class="n">y</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="m">1</span> <span class="o">+</span> <span class="nf">sum</span><span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">delta</span><span class="p">[[</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">]]))</span> <span class="p">}</span> <br><span class="n">mu_2_mean</span> <span class="o">&lt;-</span> <span class="k">function</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span>  <span class="nf">sum</span><span class="p">(</span><span class="n">delta</span><span class="p">[[</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">]]</span> <span class="o">*</span> <span class="n">y</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="m">1</span> <span class="o">+</span> <span class="nf">sum</span><span class="p">(</span><span class="n">delta</span><span class="p">[[</span><span class="n">i</span> <span class="o">-</span> <span class="m">1</span><span class="p">]]))</span> <span class="p">}</span> <br><span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="k">in</span> <span class="m">2</span><span class="o">:</span><span class="p">(</span><span class="n">n_iterations</span> <span class="o">+</span> <span class="m">1</span><span class="p">))</span> <span class="p">{</span>  <span class="n">delta</span><span class="p">[[</span><span class="n">j</span> <span class="o">-</span> <span class="m">1</span><span class="p">]]</span> <span class="o">&lt;-</span> <span class="n">y</span> <span class="o">%&gt;%</span> <span class="n">map_int</span><span class="p">(</span>    <span class="o">~</span> <span class="n">rbinom</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="n">delta_probability</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">.</span><span class="p">,</span> <span class="n">mu_1_vec</span><span class="p">,</span> <span class="n">mu_2_vec</span><span class="p">,</span> <span class="n">sigma_1</span><span class="p">,</span> <span class="n">sigma_2</span><span class="p">,</span> <span class="n">pi_known</span><span class="p">)))</span>  <span class="n">mu_1_vec</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">mu_1_mean</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span> <span class="o">/</span> <span class="p">(</span><span class="m">1</span> <span class="o">+</span> <span class="nf">sum</span><span class="p">(</span><span class="m">1</span> <span class="o">-</span> <span class="n">delta</span><span class="p">[[</span><span class="n">j</span> <span class="o">-</span> <span class="m">1</span><span class="p">]]))))</span>  <span 
class="n">mu_2_vec</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">mu_2_mean</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span> <span class="o">/</span> <span class="p">(</span><span class="m">1</span> <span class="o">+</span> <span class="nf">sum</span><span class="p">(</span><span class="n">delta</span><span class="p">[[</span><span class="n">j</span> <span class="o">-</span> <span class="m">1</span><span class="p">]]))))</span> <span class="p">}</span></code></pre> </figure> Let’s see the relation of sampled data to known one: The following plot presents the mean of the <script type="math/tex">\Delta</script> vector at each iteration: &nbsp; <div id="htmlwidget-40bed787eb289217bb47" class="plotly html-widget" style="width: 504px; height: 200px;"></div> &nbsp; Let’s check how mean of parameters <script type="math/tex">\mu_1</script> and <script type="math/tex">\mu_2</script> stabilize at used algorithm: <div id="htmlwidget-6ddf2507d65d69358578" class="plotly html-widget" style="width: 504px; height: 400px;"></div> Note how little iteration was enough to stabilize the parameters. &nbsp; Finally let’s see estimated density with our initial sample: <div id="htmlwidget-f04073a4f5ef54a41c66" class="plotly html-widget" style="width: 504px; height: 504px;"></div> <hr /> To those concerned about the topic, refer to <a href="http://gauss.stat.su.se/rr/RR2006_1.pdf" target="_blank" rel="noopener noreferrer">[1]</a> where you can find a generalization of normal distribution mixture by extending a priori distributions to other parameters. It is also worth to compare the above algorithm with its deterministic counterpart, Expectation Maximization (EM) algorithm see <a href="http://web.stanford.edu/~hastie/ElemStatLearn/" target="_blank" rel="noopener noreferrer">[2]</a>. <hr /> Write your question and comments below. We’d love to hear what you think. <hr /> Resources: [1] http://gauss.stat.su.se/rr/RR2006_1.pdf [2] http://web.stanford.edu/~hastie/ElemStatLearn/ [3] http://vcla.stat.ucla.edu/old/MCMC/MCMC_tutorial.htm
