Machine Learning - A Probabilistic Perspective Exercises - Chapter 4<p>This is a continuation of the exercises in "<a href="https://www.cs.ubc.ca/~murphyk/MLbook/" target="_blank">Machine learning - a probabilistic perspective</a>" by Kevin Murphy. Chapter 4 is on "Gaussian Models". Let's get started!</p>
<p><span style="text-decoration: underline;"><strong>4.1 Uncorrelated does not imply independent</strong></span></p>
<p>Let \( X \sim U(-1,1) \) and \(Y = X^2\). Clearly Y is dependent on X; show that \(\rho(X,Y)=0\).</p>
<p>\(\rho(X,Y)\) is just a normalised version of the covariance, so we just need to show the covariance is zero, i.e.:</p>
<p>\(\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\)</p>
<p>Clearly \( \mathbb{E}[X] = 0\) and so we just need to calculate \(\mathbb{E}[XY]\) and show this is zero. We can write:</p>
<p>\( \mathbb{E}[XY] = \int_{-1}^1 dx \int_0^1 dy \ xy p(x,y) \)</p>
<p>Then we say \(p(x,y) = p(y|x) p(x)\), but \(p(y|x) = \delta(y - x^2)\), i.e. a dirac-delta function, and \(p(x)=1/2\), i.e. just a constant. This means we can evaluate the integral over y to get:</p>
<p>\( \mathbb{E}[XY] = 1/2 \int_{-1}^1 x^3 \, dx\)</p>
<p>This is the integral of an odd function and so is clearly equal to zero.</p>
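<p>As a quick numerical sanity check (my own sketch in numpy, not part of the exercise - the seed and sample size are arbitrary), the sample covariance of X and \(X^2\) should vanish despite the dependence:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)   # X ~ Uniform(-1, 1)
y = x ** 2                               # Y = X^2: fully determined by X

# Sample covariance E[XY] - E[X]E[Y]: should be close to zero
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
```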
<p> </p>
<p><span style="text-decoration: underline;"><strong>4.2 Uncorrelated and Gaussian does not imply independent, unless jointly Gaussian</strong></span></p>
<p>Let \(X \sim \mathcal{N}(0,1)\) and \(Y=WX\), where W takes values \( \pm 1\) with equal probability. Clearly X and Y are not independent, as Y is a function of X.</p>
<p><span style="text-decoration: underline;"><strong>(a) Show</strong></span> \(Y \sim \mathcal{N}(0,1)\)</p>
<p>This is kind of obvious from symmetry because \(\mathcal{N}(0,1)\) is symmetric, i.e. \( \mathcal{N}(x|0,1) = \mathcal{N}(-x|0,1) \). This means we can write:</p>
<p>\(p_Y(y) = P(W=1)p_X(y) + P(W=-1)p_X(-y) = p_X(y) = \mathcal{N}(y|0,1) \)</p>
<p><span style="text-decoration: underline;"><strong>(b) Show covariance between X and Y is zero</strong></span></p>
<p>We know that \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\), so we just need to evaluate \( \mathbb{E}[XY]\):</p>
<p>\( \mathbb{E}[XY] = \int \int dx \ dy \ xy p(x,y)\)</p>
<p>But again \(p(x,y) = p(y|x)p(x)\), and we can write \(p(y|x) = 0.5 \delta(y-x) + 0.5 \delta(y+x)\). This means we are left with:</p>
<p>\(\mathbb{E}[XY] = \int_{-\infty}^{\infty} x \mathcal{N}(x|0,1)(0.5(x-x)) dx = 0\)</p>
<p>which proves the result.</p>
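<p>Again we can check this by simulation (a sketch with an arbitrary seed and sample size): Y should look standard normal and the sample covariance should be close to zero:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.standard_normal(n)               # X ~ N(0, 1)
w = rng.choice([-1.0, 1.0], size=n)      # W = +/-1 with equal probability
y = w * x                                # Y = WX

cov_xy = np.mean(x * y)                  # E[X] = E[Y] = 0, so this estimates Cov(X, Y)
mean_y, var_y = y.mean(), y.var()        # should be close to 0 and 1
```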
<p> </p>
<p><span style="text-decoration: underline;"><strong>4.3 Prove </strong></span>\(-1 \le \rho(X,Y) \le 1\)</p>
<p>Let us start with the definitions:</p>
<p>\(\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\)</p>
<p>\(\text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\)</p>
<p>\(\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] \)</p>
<p>Let us write \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\) for notational convenience. Now, for any constants a and b, consider:</p>
<p>\( \mathbb{E}[(a(X-\mu_X) + b(Y-\mu_Y))^2] \)</p>
<p>which is clearly greater than or equal to zero. Multiplying out, this inequality gives:</p>
<p>\(a^2 \mathbb{E}[(X-\mu_X)^2] + b^2 \mathbb{E}[(Y-\mu_Y)^2] + 2ab \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] \ge 0 \)</p>
<p>Which we can re-write as:</p>
<p>\(2ab \text{Cov}(X,Y) \ge -a^2 \text{Var}(X) - b^2 \text{Var}(Y)\)</p>
<p>Now let us substitute in \(a^2 = \text{Var}(Y)\) and \(b^2 = \text{Var}(X)\):</p>
<p>\(2 \sqrt{\text{Var}(X) \text{Var}(Y)} \text{Cov}(X,Y) \ge -2 \text{Var}(X) \text{Var}(Y) \)</p>
<p>\( \implies \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} = \rho(X,Y) \ge -1\)</p>
<p>If we do the same thing, but instead now consider \(\mathbb{E}[(a(X-\mu_X) - b(Y-\mu_Y))^2]\), with the same definitions of a and b, it's easy to show that \( \rho(X,Y) \le 1\) as well.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>4.4 Correlation coefficient for linearly related variables</strong></span></p>
<p>If \(Y=aX + b\), then if \( a > 0 \) show that \( \rho(X,Y)=1\), and if \(a < 0\) that \( \rho(X,Y) = -1\).</p>
<p>Let's say \(\mathbb{E}[X] = \mu_X\) and \(\text{Var}(X) = \sigma_X^2\). It follows that:</p>
<p>\( \mathbb{E}[Y] = a \mu_X + b\) and \( \text{Var}(Y) = a^2 \sigma_X^2\).</p>
<p>Now, to evaluate the correlation we need \(\mathbb{E}[XY] = \mathbb{E}[aX^2 + bX] = a \mathbb{E}[X^2] + b \mu_X\)</p>
<p>This means that the covariance is:</p>
<p>\( \text{Cov}(X,Y) = a \mathbb{E}[X^2] + b \mu_X - \mu_X(a \mu_X + b) = a \sigma_X^2\)</p>
<p>This allows us to get the correlation:</p>
<p>\( \rho(X,Y) = \frac{ \text{Cov}(X,Y)}{ \sqrt{\sigma_X^2 \sigma_Y^2}} = \frac{a \sigma_X^2}{\sqrt{a^2 \sigma_X^4}} = \frac{a \sigma_X^2}{|a| \sigma_X^2} = \text{sgn}(a)\)</p>
<p>Which is all we were asked to show!</p>
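<p>This is easy to confirm numerically (my own sketch; the particular values of a and b below are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)

# Y = aX + b with a > 0 and with a < 0
rho_pos = np.corrcoef(x, 2.0 * x + 3.0)[0, 1]    # expect +1
rho_neg = np.corrcoef(x, -0.5 * x + 1.0)[0, 1]   # expect -1
```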
<p> </p>
<p><span style="text-decoration: underline;"><strong>4.5 Normalization constant for MV Gaussian</strong></span></p>
<p>Prove that: \( (2 \pi)^{d/2} | \mathbf{\Sigma}|^{1/2} = \int \exp\left(-\frac{1}{2} (\mathbf{x-\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x-\mu})\right) d \mathbf{x} \)</p>
<p>We are told to diagonalize the covariance matrix, which can always be done since it is symmetric. That is, we can write:</p>
<p>\(D = P^{-1} \Sigma P\)</p>
<p>Where D is a diagonal matrix where the entries are the eigenvalues of \(\Sigma\) and the columns of P are the eigenvectors. In fact, since \(\Sigma\) is symmetric the eigenvectors can form an orthogonal basis, and it is possible to make P an orthogonal matrix, such that \(P^{-1} = P^T\). This allows us to say:</p>
<p>\(D^{-1} = P^T \Sigma^{-1} P \implies \Sigma^{-1} = P D^{-1} P^T\)</p>
<p>As such, we can write the integral as:</p>
<p>\( \int \exp(-\frac{1}{2}(x-\mu)^T P D^{-1} P^T(x-\mu)) dx = \int \exp(-\frac{1}{2} (P^T(x-\mu))^T \begin{bmatrix} \frac{1}{\lambda_1} & & \\ & \ddots & \\ & & \frac{1}{\lambda_d} \end{bmatrix} (P^T(x-\mu))) dx \)</p>
<p>Now let us define \(y = P^T(x-\mu)\). Because \(P^T\) is an orthogonal matrix (whose determinant is \(\pm 1\)), the absolute value of the Jacobian is 1 and we can replace \(dx\) with \(dy\). The term inside the exponential is then:</p>
<p>\( \sum_{ij} y_i \delta_{ij} \frac{1}{\lambda_i} y_j = \sum_i \frac{y_i^2}{\lambda_i}\). Effectively by transforming to the eigenbasis we have decoupled the components of y, so we can write:</p>
<p>\( = \int_{-\infty}^{\infty} dy_1 e^{-\frac{y_1^2}{2 \lambda_1}} \dots \int_{-\infty}^{\infty} dy_d e^{-\frac{y_d^2}{2 \lambda_d}}\)</p>
<p>i.e. just the product of many one-dimensional Gaussians. This is equal to:</p>
<p>\( \sqrt{2 \pi \lambda_1} \sqrt{2 \pi \lambda_2} \dots \sqrt{2 \pi \lambda_d} = (2 \pi)^{d/2} \sqrt{\lambda_1 \dots \lambda_d}\)</p>
<p>We then use that \(det(\Sigma) = \prod_{i=1}^d \lambda_i\), which gives us the final answer we want!</p>
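<p>We can sanity-check the normalisation constant numerically in \(d=2\) (my own sketch: the covariance matrix, grid width and resolution below are arbitrary choices that comfortably cover the probability mass):</p>

```python
import numpy as np

# An arbitrary 2x2 positive-definite covariance matrix
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Riemann sum of exp(-1/2 x^T Sigma^{-1} x) over a wide grid
ts = np.linspace(-12, 12, 601)
dx = ts[1] - ts[0]
X, Y = np.meshgrid(ts, ts)
pts = np.stack([X.ravel(), Y.ravel()], axis=1)
quad = np.einsum('ni,ij,nj->n', pts, Sigma_inv, pts)
integral = np.sum(np.exp(-0.5 * quad)) * dx * dx

# the claimed closed form: (2 pi)^(d/2) |Sigma|^(1/2) with d = 2
closed_form = (2 * np.pi) ** (2 / 2) * np.sqrt(np.linalg.det(Sigma))
```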
Continuous Blackjack
<p>Recently I had an interview where as an extra "bonus" question at the end I was asked an interesting maths problem. With a couple of hints from the interviewer, I was able to sketch out a rough solution, however afterwards I wanted to look up a proper solution just to verify it. Interestingly, I wasn't able to find one (I'm sure it's out there, I just need to look harder). Anyway, I thought it was a nice little problem and so I thought it was worth posting what I believe to be the correct solution.</p>
<p>The problem is this - consider that you are playing a game with one other person, and you are going to be going first. The rules are you pick a number randomly from the Uniform(0,1) distribution (i.e. a random number between 0 and 1). You then decide either to stick with this total, or you play on and choose another such random number. You can do this as many times as you like, however if the sum total of the numbers you pick goes over 1, you go bust and automatically lose. If you decide to stick with a number less than 1, the other player has a go and plays by the same rules. The person who sticks at the higher number (or doesn't go bust when their opponent does) is the winner. Clearly it is an advantage to go second, and the optimal strategy for player 2 is extremely simple - keep playing until you get a higher total than your opponent or until you go bust. The question is, given that you are player 1 what is the best strategy you can adopt?</p>
<p>The first thing to realise is that our optimal strategy will be divided at some number t, which I shall call the "decision boundary", and where if we have a sum less than t we will draw a new number, and if we have a sum greater than t we will stick. We can then think about what the probability of winning is, given that we stick at a particular value t. This is equal to 1 minus the probability that we lose - and the probability that we lose is the probability that the second player is able to land their sum within the interval \([t,1]\), given that they play on until they either reach this interval or go bust. To go about calculating this, let us define \(P_t[x]\) to be the probability that we are able to land in the interval \([t,1]\), given that we are currently at x and definitely going to play on if we have not yet reached t. We can write down an equation for this as follows:</p>
<p>\(P_t[x] = (1-t) + \int_x^t P_t[y]dy\)</p>
<p>where the first term is the probability that we reach the interval on the next draw, and the second term integrates over first landing at some \(y < t\) (the density of which is 1, since we are drawing Uniform(0,1) random variables) multiplied by the probability of reaching the interval \([t,1]\) starting from y. We can convert this from an integral equation into an ODE:</p>
<p>\(\frac{dP_t[x]}{dx} = -P_t[x] \ \ \ \implies P_t[x] = A e^{-x} \)</p>
<p>We can obtain the constant A by noting that \(P_t[t] = 1-t \), and hence that \(A = (1-t)e^t\). This means that:</p>
<p>\(P_t[x] = (1-t)e^{t-x}\). Now, player 2 starts from \(x=0\), and so the probability of losing given that we stuck at a value t is simply \(P_t[0] = (1-t)e^t\). The probability that we win is \(1-P_t[0] = 1-(1-t)e^t\).</p>
<p>The final step we need to solve this is to say that the threshold at which our strategy changes should be the following point: where the probability of winning <em>given that we stick at t</em> is exactly the same as the <em>probability that we win given that we choose one more number</em>. We can write this condition as:</p>
<p>\(1-(1-t)e^t = \int_t^1 \left[ 1-(1-t')e^{t'}\right]dt' = (1-t) - e^t(t-2) - e \)</p>
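<p>This condition can be solved numerically - a small bisection sketch of my own, under the assumption (easily checked from the signs at the endpoints) that the root lies in (0, 1):</p>

```python
import math

def gap(t):
    # difference between P(win | stick at t) and P(win | draw again at t)
    stick = 1 - (1 - t) * math.exp(t)
    play_on = (1 - t) - math.exp(t) * (t - 2) - math.e
    return stick - play_on

lo, hi = 0.0, 1.0          # gap(0) < 0 < gap(1), so a root lies in between
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if gap(mid) < 0:
        lo = mid
    else:
        hi = mid

t_star = 0.5 * (lo + hi)   # the optimal decision boundary, ~0.57
```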
<p>This gives a non-linear equation for the optimal decision boundary t, which cannot be re-arranged nicely but numerically we can solve to find that \(t \approx 0.57\). That is, if our sum is less than approximately 0.57 we should pick another number, and if it's more we should stick!</p>Machine Learning - A Probabilistic Perspective Exercises - Chapter 3<p>This is a continuation of the exercises in "<a href="https://www.cs.ubc.ca/~murphyk/MLbook/" target="_blank">Machine learning - a probabilistic perspective</a>" by Kevin Murphy. Chapter 3 is on "Generative Models for Discrete Data".</p>
<p> <span style="text-decoration: underline;"><strong>3.1 MLE for the Bernoulli/ binomial model</strong></span></p>
<p>We start off with a nice simple one. If we have data D consisting of \(N_1\) heads out of a total of N trials, then the likelihood is \(P(D|\theta) = \theta^{N_1} (1-\theta)^{N-N_1} \). It's a lot easier to work with the log likelihood here:</p>
<p>\( \log(P(D|\theta)) = N_1 \log(\theta) + (N-N_1) \log(1-\theta) \)</p>
<p>Taking the derivative:</p>
<p>\(\frac{d}{d \theta} \log(P(D|\theta)) = \frac{N_1}{\theta} - \frac{N-N_1}{1-\theta} = 0 \ \ \implies \frac{N_1}{\theta} = \frac{N-N_1}{1-\theta} \)</p>
<p>Rearrange this and we find \( \theta = \frac{N_1}{N} \)</p>
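<p>A tiny brute-force check (my own sketch; the values of N and \(N_1\) are arbitrary) confirms the grid maximiser of the log likelihood sits at \(N_1/N\):</p>

```python
import numpy as np

N, N1 = 20, 7
thetas = np.linspace(0.001, 0.999, 9999)
log_lik = N1 * np.log(thetas) + (N - N1) * np.log(1 - thetas)

theta_hat = thetas[np.argmax(log_lik)]   # should sit next to N1 / N = 0.35
```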
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.2 Marginal likelihood for the Beta-Bernoulli model.</strong></span></p>
<p>This question is looking at deriving the marginal likelihood, \(P(D) = \int P(D|\theta) P(\theta) d\theta \). We are told to use the chain rule of probability: \(P(x_{1:N}) = p(x_1) p(x_2 | x_1) p(x_3 | x_{1:2})\dots \)</p>
<p>and reminded that in the chapter we derived the posterior predictive distribution:</p>
<p>\( P(X=k | D_{1:N}) = \frac{N_k + \alpha_k}{\sum_i (N_i + \alpha_i)} \)</p>
<p>We are then given an example - suppose D = H,T,T,H,H (or D=1,0,0,1,1). It follows that:</p>
<p>\(P(D) = \frac{\alpha_1}{\alpha} \frac{\alpha_0}{\alpha + 1} \frac{\alpha_0+1}{\alpha+2} \frac{\alpha_1 + 1}{\alpha + 3} \frac{\alpha_1+2}{\alpha+4} \)</p>
<p>where we have just applied the chain rule, using the posterior predictive distribution after each data point has been collected. It's clear that if we do this more generally (for any collection of data), that we will be left with:</p>
<p>\(P(D) = \frac{\left[ \alpha_0 \dots (\alpha_0 + N_0 - 1) \right] \left[ \alpha_1 \dots (\alpha_1 + N_1 - 1) \right]}{\alpha \dots (\alpha + N-1)} \)</p>
<p>We then note that this can be re-written using factorials as follows:</p>
<p>\(P(D) = \frac{(\alpha_0+N_0-1)! (\alpha_1 + N_1 -1)! (\alpha-1)!}{(\alpha_0-1)! (\alpha_1-1)! (\alpha + N -1)!} \)</p>
<p>Remembering that \( \Gamma(N) = (N-1)! \), and that \( \alpha = \alpha_0 + \alpha_1 \), we get the result which is given in the question:</p>
<p>\(P(D) = \frac{ \Gamma(\alpha_0 + N_0) \Gamma(\alpha_1 + N_1) \Gamma(\alpha_1 + \alpha_0) }{\Gamma(\alpha_1 + \alpha_0 + N) \Gamma(\alpha_1) \Gamma(\alpha_0) } \)</p>
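<p>We can verify that the sequential chain-rule product and the closed form agree on the example sequence (a sketch of my own; the prior pseudo-counts \(\alpha_1 = 2\), \(\alpha_0 = 3\) are made up for illustration):</p>

```python
from math import gamma

a1, a0 = 2.0, 3.0            # hypothetical prior pseudo-counts for heads/tails
D = [1, 0, 0, 1, 1]          # the example sequence H,T,T,H,H

# Chain rule: multiply the posterior predictive after each observation
p_seq, n1, n0 = 1.0, 0, 0
for x in D:
    if x == 1:
        p_seq *= (a1 + n1) / (a1 + a0 + n1 + n0)
        n1 += 1
    else:
        p_seq *= (a0 + n0) / (a1 + a0 + n1 + n0)
        n0 += 1

# Closed form in terms of Gamma functions
N = len(D)
p_closed = (gamma(a0 + n0) * gamma(a1 + n1) * gamma(a1 + a0)) / \
           (gamma(a1 + a0 + N) * gamma(a1) * gamma(a0))
```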
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.3 Posterior predictive for Beta-Binomial model</strong></span></p>
<p>In the text the posterior predictive distribution for the Beta-Binomial model was derived for the case of predicting the outcome of multiple future trials given the data:</p>
<p>\(P(x|n,D) = \frac{B(x+\alpha_1', n-x+\alpha_0')}{B(\alpha_1', \alpha_0')} \binom{n}{x} \)</p>
<p>where \(\alpha_1'\) and \(\alpha_0'\) involve the prior parameters and the data. The question simply asks to show that when \(n=1\) that we have: \(P(x=1|D) = \frac{\alpha_1'}{\alpha_0' + \alpha_1'}\).</p>
<p>We need to remember that by definition, \(B(a,b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \), hence:</p>
<p>\( \frac{B(1+\alpha_1', \alpha_0')}{B(\alpha_1', \alpha_0')} = \frac{\Gamma(1+\alpha_1') \Gamma(\alpha_0')}{\Gamma(1 + \alpha_0' + \alpha_1')} \frac{\Gamma(\alpha_0' + \alpha_1')}{\Gamma(\alpha_0') \Gamma(\alpha_1')} \)</p>
<p> But then we simply note that \(\Gamma(1+\alpha_1') = \alpha_1' \Gamma(\alpha_1')\), by the Gamma function identity \(\Gamma(x+1) = x \Gamma(x)\). Using this and simplifying clearly leaves us with the desired result.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.4 Beta updating from censored likelihood</strong></span></p>
<p>Suppose we toss a coin \(n=5\) times. Let X be the number of heads. We observe that there are fewer than 3 heads, but we don't know precisely how many. The prior we use is \(P(\theta) = \text{Beta}(\theta|1,1)\). Compute the posterior, \(P(\theta | X < 3) \).</p>
<p>Now \( \text{Beta}(\theta|1,1)\) is an uninformative prior, so this is just a constant. So the posterior, \(P(\theta | X<3) \propto P(X < 3|\theta) P(\theta) \propto P(X<3 | \theta) \). So we need to consider the likelihood, \(P(X<3|\theta)\). This is straightforward to calculate as it is the sum of the probability of no heads, one head and two heads, i.e.:</p>
<p>\(P(X<3 | \theta) = (1-\theta)^5 + \binom{5}{1}\theta(1-\theta)^4 + \binom{5}{2} \theta^2 (1-\theta)^3 = \text{Bin}(0|\theta, 5) + \text{Bin}(1|\theta,5) + \text{Bin}(2|\theta,5) \)</p>
<p>It follows that the posterior is:</p>
<p>\(P(\theta | X<3) \propto \text{Beta}(\theta|1,6) + \text{Beta}(\theta|2,5) + \text{Beta}(\theta|3,4) \)</p>
<p>which is a mixture distribution. In fact, each term \(\binom{5}{k}\theta^k (1-\theta)^{5-k}\) can be written as \(\binom{5}{k} B(k+1, 6-k) \, \text{Beta}(\theta|k+1, 6-k)\), and the constants \(\binom{5}{k} B(k+1, 6-k)\) all happen to equal \(1/6\) here, so the posterior is an equally-weighted mixture of the three Beta distributions.</p>
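<p>The fact that the three mixture weights come out equal is easy to check (my own sketch, writing the Beta function in terms of Gamma functions):</p>

```python
from math import comb, gamma

def beta_fn(a, b):
    # the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

# Each binomial term C(5, k) theta^k (1-theta)^(5-k), k = 0, 1, 2, equals
# C(5, k) * B(k+1, 6-k) * Beta(theta | k+1, 6-k), so the (unnormalised)
# mixture weights are:
weights = [comb(5, k) * beta_fn(k + 1, 6 - k) for k in range(3)]
# all three come out as 1/6, i.e. an equally-weighted mixture
```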
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.5 Uninformative prior for log-odds ratio</strong></span></p>
<p>Let \( \phi = \text{logit}(\theta) = \log(\frac{\theta}{1-\theta}) \). If \(p(\phi) \propto 1\), show \(p(\theta) \propto \text{Beta}(\theta| 0,0) \).</p>
<p>If we just apply the change of variables formula from chapter 2:</p>
<p>\(p(\theta) = \bigg| \frac{d\phi}{d\theta} \bigg| p(\phi) \)</p>
<p>but \(p(\phi)\) is a constant, and \(\phi = \log(\theta) - \log(1-\theta)\), so:</p>
<p>\( \frac{d\phi}{d\theta} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)} \)</p>
<p>Remembering the definition for the Beta distribution: \(\text{Beta}(x|a,b) = \frac{1}{B(a,b)} x^{a-1}(1-x)^{b-1}\), and so clearly \(p(\theta) \propto \text{Beta}(\theta|0,0) \).</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.6 MLE for Poisson distribution</strong></span></p>
<p>Definition: \( \text{Poi}(x|\lambda) = \frac{\lambda^x}{x!} e^{-\lambda}\)</p>
<p>So the likelihood of a set of data \( \{x_i\} \) is:</p>
<p>\(L(\lambda ; \{x_i\}) =\prod_i \frac{\lambda^{x_i}}{x_i!} e^{-\lambda} \)</p>
<p>Unsurprisingly, it's easier to work with the log-likelihood:</p>
<p>\( l(\lambda; \{x_i\}) = \sum_i \left[ -\lambda + x_i \log(\lambda) - \log(x_i !) \right]\)</p>
<p>If we ignore the term that doesn't depend on \(\lambda\) then we are left with \(-\lambda N + \log(\lambda) \left( \sum_i x_i \right) \). Differentiating w.r.t \(\lambda\) and setting equal to zero:</p>
<p>\( -N + \frac{1}{\lambda} \sum_i x_i = 0 \ \ \ \implies \hat{\lambda} = \frac{1}{N} \sum_i x_i \)</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.7 Bayesian analysis of the Poisson distribution</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(a) Derive the posterior assuming a Gamma prior</strong></span></p>
<p>The prior is: \(P(\lambda) = Ga(\lambda |a,b) \propto \lambda^{a-1} e^{-\lambda b}\).</p>
<p>and the posterior is proportional to the likelihood times the prior, i.e. \(P(\lambda | D) \propto P(D | \lambda) P(\lambda) \).</p>
<p>We already looked at the likelihood for the Poisson distribution in the previous section, so:</p>
<p>\(P(\lambda | D) \propto \prod_{i=1}^N \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} \lambda^{a-1} e^{-\lambda b} \propto e^{-\lambda(N+b)}\lambda^{a-1+\sum_i x_i} = Ga(\lambda | a + \sum_i x_i, N+b) \)</p>
<p>So we see that the posterior is also a Gamma distribution, making the Gamma distribution a conjugate prior for the Poisson distribution.</p>
<p><span style="text-decoration: underline;"><strong>(b) Posterior mean as a->0, b->0</strong></span></p>
<p>We use that the fact that we derived the mean of a Gamma distribution in the text, finding that it is equal to a/b. So clearly it's just:</p>
<p>\( \frac{1}{N} \sum_{i=1}^N x_i \), which is the MLE we found in the previous section.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.8 MLE for the uniform distribution</strong></span></p>
<p>Consider a uniform distribution centered on 0 with width 2a. \(p(x) = \frac{1}{2a} I(x \in [-a,a]) \) is the density function.</p>
<p><span style="text-decoration: underline;"><strong>(a) Given a data set x1, ..., xn, what is the MLE estimate of a?</strong></span></p>
<p>The key point here is that \(P(D|a) = 0\) for any a which is less than the data point with the largest magnitude, and equal to \(\frac{1}{(2a)^n}\) for any a at least this large. Since \(\frac{1}{(2a)^n}\) is a decreasing function of a, the likelihood is maximised when a is made as small as possible, i.e. \(\hat{a} = \text{max}_i|x_i|\).</p>
<p><span style="text-decoration: underline;"><strong>(b) What probability would the model assign to a new data point \(x_{n+1}\) using the MLE estimate for a?</strong></span></p>
<p>Clearly \( \frac{1}{2\hat{a}}\) if \(|x_{n+1}| \le \hat{a}\) and 0 otherwise.</p>
<p><span style="text-decoration: underline;"><strong>(c) Do you see a problem with this approach?</strong></span></p>
<p>Clearly there is an issue here as any value with an absolute value larger than \(max |x_i|\) is assigned zero probability. For relatively small data sets this will be a big issue, but even for larger data sets it seems far from ideal.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.9 Bayesian analysis of the uniform distribution</strong></span></p>
<p>This is very much a continuation of the previous question, although here we are told to consider a \(Unif(0,\theta)\) distribution. The MLE is now max(D). Overall I'm fairly sure the question is either extremely poorly worded or has some mistakes, so I'm just going to go through it in the way that makes sense to me.</p>
<p>We are told to use a Pareto prior - the Pareto distribution is defined as:</p>
<p>\(p(x | k,m) = k m^k x^{-(k+1)} I(x \ge m) \)</p>
<p>where I is an indicator function. So a \(\text{Pareto}(\theta | b, K)\) prior is:</p>
<p>\( P(\theta) = K b^K \theta^{-(K+1)} I(\theta \ge b) \)</p>
<p>We more or less established the likelihood in the previous section is given by:</p>
<p>\(P(D | \theta) = \frac{1}{\theta^N} I(\theta \ge max(D)) \)</p>
<p>This means that the joint distribution, \(P(D,\theta)\) is given by \(P(D,\theta) = P(D | \theta) P(\theta) = \frac{K b^K}{\theta^{N+K+1}} I(\theta \ge m)\) where \(m = \text{max}(b, \text{max}(D))\).</p>
<p>We can use this to write the marginal likelihood:</p>
<p>\(P(D) = \int P(D,\theta) d\theta = \int_m^{\infty} \frac{K b^K}{\theta^{N+K+1}} d\theta = \frac{K b^K}{(N+K) m^{N+K}} \)</p>
<p>Now the posterior is given by:</p>
<p>\(P(\theta | D) = \frac{P(\theta, D)}{P(D)} = \frac{(N+K) m^{N+K}}{\theta^{N+K+1}} I(\theta \ge m) = \text{Pareto}(\theta | N+K, m=\text{max}(D,b))\)</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.11 Bayesian analysis of the exponential distribution</strong></span></p>
<p>\(p(x|\theta) = \theta e^{-\theta x}\) for \(x \ge 0, \theta > 0\) defines the exponential distribution.</p>
<p><span style="text-decoration: underline;"><strong>(a) Derive the MLE</strong></span></p>
<p>We can write the likelihood function as:</p>
<p>\(L(\mathbf{x};\theta) = \theta^N e^{-\theta \sum_{i=1}^N x_i} \)</p>
<p>Clearly working with the log-likelihood will be better here:</p>
<p>\(l(\mathbf{x}; \theta) = N \log(\theta) - \theta \sum_i x_i \)</p>
<p>Taking the derivative wrt theta and setting it equal to zero:</p>
<p>\(0 = \frac{N}{\theta} - \sum_i x_i \)</p>
<p>and so clearly the MLE is: \( \hat{\theta} = \frac{1}{\frac{1}{N} \sum_i x_i} = \frac{1}{\bar{x}} \)</p>
<p><span style="text-decoration: underline;"><strong>(b) Suppose we observe X1=5, X2=6, X3=4. What is the MLE?</strong></span></p>
<p>The mean is 5, so \(\hat{\theta} = 1/5\).</p>
<p><span style="text-decoration: underline;"><strong>(c) Assume a prior \(p(\theta) = Expon(\theta | \lambda) \). What value should lambda take to give \( \mathbb{E} [\theta] = 1/3\)?</strong></span></p>
<p>The exponential distribution is a special case of the Gamma distribution where \(a=1\) and \(b=\lambda\). We derived that the mean of a Gamma distribution is just given by a/b, and so we want:</p>
<p>\( 1/3 = \frac{1}{\hat{\lambda}}\) and hence \(\hat{\lambda}=3\).</p>
<p><span style="text-decoration: underline;"><strong>(d) What is the posterior?</strong></span></p>
<p>\(P(\theta | D) \propto \theta^N e^{-N \theta \bar{x}} \lambda e^{-\lambda \theta} \propto \theta^N e^{-\theta(N \bar{x} + \lambda)} = Ga(\theta | N+1, N\bar{x}+\lambda)\)</p>
<p><span style="text-decoration: underline;"><strong>(e) Is exponential prior conjugate to exponential likelihood?</strong></span></p>
<p>Kind of, in the sense that both the prior and the posterior are Gamma distributions. But the posterior is not also an Exponential distribution.</p>
<p><span style="text-decoration: underline;"><strong>(f) What is the posterior mean?</strong></span></p>
<p>Again, mean of Gamma is a/b so we have \(\frac{N+1}{N \bar{x} + \lambda}\). If \(\lambda = 0\) and as \(N \to \infty\) we recover the MLE.</p>
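<p>Plugging in the numbers from parts (b) and (c) makes the comparison concrete (a small sketch of my own):</p>

```python
# Data from part (b) and the prior rate lambda = 3 from part (c)
xs = [5, 6, 4]
lam = 3.0
N = len(xs)
xbar = sum(xs) / N

theta_mle = 1 / xbar                      # = 0.2
post_mean = (N + 1) / (N * xbar + lam)    # = 4/18, slightly above the MLE
```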
<p><span style="text-decoration: underline;"><strong>(g) Why do they differ?</strong></span></p>
<p>Bit of a stupid question - they differ because the posterior mean is pulled towards the prior (the \(+1\) and \(+\lambda\) terms), whereas the MLE uses the data alone. We should expect the posterior mean to be less prone to overfitting though, and generally be a bit better, especially for small N.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.12 MAP estimation for Bernoulli with non-conjugate priors</strong></span></p>
<p>In the book we discussed Bayesian inference of a Bernoulli rate parameter when we used a \(\text{Beta}(\theta|\alpha, \beta)\) prior. In this case, the MAP estimate was given by:</p>
<p>\( \theta_{MAP} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2} \)</p>
<p><span style="text-decoration: underline;"><strong>(a) Now consider the following prior:</strong></span></p>
<p>\(p(\theta) = 0.5\) if \(\theta = 0.5\), \(p(\theta) = 0.5\) if \(\theta = 0.4\) and 0 otherwise. Clearly this is a weird prior to use, but I guess it's just for an exercise to make a point so let's go with it. The question is to derive the MAP estimate as a function of N1 and N.</p>
<p>We can write the posterior as:</p>
<p>\(P(\theta | D) \propto \theta^{N_1} (1-\theta)^{N-N_1} I( \theta \in \{0.4, 0.5 \}) \)</p>
<p>So the MAP is simply: \(\text{max}(0.5^{N_1} 0.5^{N-N_1}, 0.4^{N_1} 0.6^{N-N_1}) = \text{max}(0.5^{N}, 0.4^{N_1} 0.6^{N-N_1})\).</p>
<p>With some algebraic manipulations we can show that the MAP is 0.5 if: \(N_1 > \frac{\log(6/5)}{\log(6/4)} N \approx 0.4496 N\). There's no reason to expect a round number here - the threshold is simply where the two likelihoods \(0.5^N\) and \(0.4^{N_1} 0.6^{N-N_1}\) are exactly equal, and the constant \(\log(6/5)/\log(6/4)\) just happens to land close to 0.45.</p>
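<p>A quick sketch of my own confirms the closed-form threshold against a direct comparison of the two likelihoods:</p>

```python
import math

# Closed-form threshold on N1 / N above which the MAP switches to 0.5
c = math.log(6 / 5) / math.log(6 / 4)     # ~0.4496

def map_estimate(N1, N):
    # compare the posterior mass at the only two allowed values
    # (the equal prior weights of 0.5 cancel)
    return 0.5 if 0.5 ** N > 0.4 ** N1 * 0.6 ** (N - N1) else 0.4
```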
<p><span style="text-decoration: underline;"><strong>(b) Suppose the true theta is 0.41. Which prior leads to the better estimate?</strong></span></p>
<p>When N is small, we would expect this second approach to work better (as it is choosing only between 0.4 and 0.5), however as N becomes larger eventually the other prior, where \(\theta\) can take any value, will give a better estimate.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.13 Posterior predictive for a batch of data using the dirichlet-multinomial model</strong></span></p>
<p>Derive an expression for \(P(\tilde{D} | D, \alpha)\), i.e. use the posterior to predict the results for a whole batch of data. Now, the definition of the Dirichlet distribution is:</p>
<p>\(\text{Dir}(\mathbf{\theta} | \mathbf{\alpha}) = \frac{1}{B(\mathbf{\alpha})} \prod_{k=1}^K \theta_k^{\alpha_k-1} I(\mathbf{\theta} \in S_K) \) (the indicator function ensures that \(\theta\) lies in the probability simplex, i.e. its components sum to one.)</p>
<p>where \(B(\mathbf{\alpha}) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma[\sum_k \alpha_k]}\).</p>
<p>Using a \(\text{Dir}(\mathbf{\theta} | \mathbf{\alpha})\) prior for \(\theta\) we showed in the book that the posterior was given by:</p>
<p>\( P(\mathbf{\theta} | D) = \text{Dir}(\mathbf{\theta} | \alpha_1 + N_1, \dots, \alpha_K + N_K) \)</p>
<p>This means that we can write:</p>
<p>\(P(\tilde{D} | D, \alpha) = \int P(\tilde{D} | \theta) P(\theta | D) d\theta = \int \prod_{k=1}^K \theta_k^{x_k} \text{Dir}(\theta | \alpha_1 + N_1, \dots, \alpha_K + N_K) d\theta\)</p>
<p>where \( \mathbf{x} = \{x_k\} \) is the numbers of each class in the data we are predicting. This is equal to:</p>
<p>\( \int \prod_k \theta_k^{x_k} \frac{1}{B(\mathbf{\alpha + N})} \prod_k \theta_k^{\alpha_k + N_k - 1} I(\theta \in S_K) d\theta \)</p>
<p>\( = \int \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \text{Dir}(\theta | \alpha_1 + N_1 + x_1, \dots, \alpha_K + N_K + x_K) d\theta = \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \)</p>
<p>where we just converted the parameters in the Dirichlet distribution and introduced the correct normalisation parameter. Since it is a probability distribution, it then of course integrates to 1 (the final step).</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.14 Posterior predictive for the Dirichlet-Multinomial</strong></span></p>
<p>(a) Suppose we compute the empirical distribution over letters of the Roman alphabet plus the space character (27 values), from 2000 samples. Suppose we see the letter "e" 260 times. What is \(P(x_{2001} = e | D)\), assuming a Dirichlet prior with alpha_k = 10 for all k?</p>
<p>We showed in the text that the posterior predictive is given by:</p>
<p>\(P(X=j | D) = \frac{\alpha_j + N_j}{\sum_k (\alpha_k + N_k)} = \frac{10 + 260}{270 + 2000} \simeq 0.119 \)</p>
<p>(b) Suppose actually we saw "e" 260 times, "a" 100 times and "p" 87 times. What is \(P(x_{2001}=p, x_{2002} = a | D)\) under the same assumptions?</p>
<p>We basically just derived what we need for this in the previous question. We are looking for the probability of the data vector \(\mathbf{X} = (1,0,\dots, 1, 0, \dots, 0)\), where the non-zero components are at indices 1 ("a") and 16 ("p"). We showed:</p>
<p>\(P(X | D) = \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \)</p>
<p> This is equal to:</p>
<p>\( \frac{\prod_k \Gamma(\alpha_k + N_k + x_k) \Gamma(\sum_k \alpha_k + N_k)}{\Gamma(\sum_k \alpha_k + N_k + x_k) \prod_k \Gamma(\alpha_k + N_k)} \)</p>
<p>Now in the product terms, everything cancels except the components corresponding to p and a, where we pick up factors of \(\frac{\Gamma(98)}{\Gamma(97)}\) and \(\frac{\Gamma(111)}{\Gamma(110)} \) respectively. Overall, we are left with:</p>
<p>\(P(X|D) = \frac{ \Gamma(111) \Gamma(98) \Gamma(2270)}{\Gamma(110) \Gamma(97) \Gamma(2272)} = \frac{(110!)(97!)(2269!)}{(109!)(96!)(2271!)} = \frac{110 \times 97}{2270 \times 2271} \simeq 0.002 \)</p>
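<p>Since \(\Gamma(2270)\) would overflow directly, working in log-space is the natural way to check this (my own sketch using log-Gamma functions):</p>

```python
from math import lgamma, exp

alpha, N_total = 10, 2000
N_a, N_p = 100, 87
S = 27 * alpha + N_total                  # sum_k (alpha_k + N_k) = 2270

# log of B(alpha + N + X) / B(alpha + N): after cancellation only the
# "a", "p" and total terms survive
log_p = (lgamma(alpha + N_a + 1) - lgamma(alpha + N_a)
         + lgamma(alpha + N_p + 1) - lgamma(alpha + N_p)
         + lgamma(S) - lgamma(S + 2))
p = exp(log_p)                            # = 110 * 97 / (2270 * 2271)
```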
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.17 Marginal likelihood for beta-binomial under uniform prior</strong></span></p>
<p>Suppose we toss a coin N times and observe N1 heads. Let \(N_1 \sim Bin(N,\theta)\) and \(\theta \sim Beta(1,1)\). Show that the marginal likelihood is given by: \(P(N_1 | N) = \frac{1}{N+1}\).</p>
<p>We can write:</p>
<p>\(P(N_1 | N) = \int P(N_1 | N, \theta) P(\theta) d\theta \)</p>
<p>But a \(Beta(1,1)\) distribution is just uniform, so we don't need to take this into account. So we can say:</p>
<p>\(P(N_1 | N) = \int_0^1 \text{Bin}(N_1 | N, \theta) d\theta = \int_0^1 \frac{N!}{N_1! (N-N_1)!} \theta^{N_1} (1-\theta)^{N-N_1} d\theta \)</p>
<p>It helps to re-write the factorials in terms of Gamma functions (\(\Gamma(n+1) = n!\)):</p>
<p>\( P(N_1 | N) = \int_0^1 \frac{\Gamma(N+1)}{\Gamma(N_1+1) \Gamma(N-N_1 + 1)} \theta^{N_1} (1-\theta)^{N-N_1} d\theta \)</p>
<p>Now, by definition \(\text{Beta}(\theta | N_1 + 1, N-N_1+1) = \frac{\Gamma(N+2)}{\Gamma(N_1+1) \Gamma(N-N_1+1)} \theta^{N_1} (1-\theta)^{N-N_1}\). This means that we can say:</p>
<p>\(P(N_1 | N) = \int_0^1 \frac{\Gamma(N+1)}{\Gamma(N+2)} \text{Beta}(\theta | N_1+1, N-N_1+1) d\theta = \frac{N!}{(N+1)!} = \frac{1}{N+1}\)</p>
<p>since a probability distribution must integrate to 1. This result kind of surprised me to be honest - I guess it kind of makes sense although intuitively I would have expected a uniform prior over \(\theta\) to lead to it being most likely to have \(N_1 = N/2\), rather than being completely uniform!</p>
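<p>That slightly surprising uniformity is easy to check numerically for every value of \(N_1\) at once (my own sketch; the midpoint-grid resolution is an arbitrary choice):</p>

```python
import numpy as np
from math import comb

N = 10
M = 200_000
thetas = (np.arange(M) + 0.5) / M         # midpoint grid on (0, 1)

# integrate Bin(N1 | N, theta) over theta for every possible N1
marginals = [np.mean(comb(N, k) * thetas ** k * (1 - thetas) ** (N - k))
             for k in range(N + 1)]
# every outcome is equally likely: P(N1 | N) = 1 / (N + 1)
```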
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.18 Bayes factor for coin tossing</strong></span></p>
<p>Suppose we toss a coin \(N=10\) times and observe \(N_1 = 9\) heads. Let the null hypothesis be that the coin is fair, and the alternative be that the coin can have any bias - \(p(\theta) = Unif(0,1)\). Derive the Bayes factor in favour of the biased coin hypothesis. What if \(N_1=90\) and \(N=100\)?</p>
<p>I think this just means we need to look at the ratios of the likelihoods under each assumption. Under the fair assumption:</p>
<p>\(P(N_1 | \theta = 1/2) = \frac{N!}{N_1!(N-N_1)!} \left(\frac{1}{2}\right)^N\)</p>
<p>Under the biased assumption, we just calculated this in the previous exercise:</p>
<p>\(P(N_1 | \theta \sim Unif(0,1)) = \frac{1}{N+1} \)</p>
<p>So \(BF = \frac{N_1! (N-N_1)! 2^N}{(N+1)!} \) which for \(N_1 = 9\) and \(N=10\) I find \(BF \simeq 9.31 \). For \(N_1 = 90\), \(N=100\) I find: \(BF \simeq 7.25 \times 10^{14} \) - clearly the coin is amazingly unlikely to be fair in the second case.</p>
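These numbers are easy to check directly (a small sketch; `bayes_factor` is just a hypothetical helper name, not anything from the book):

```python
from math import comb

def bayes_factor(N1, N):
    """Marginal likelihood of the biased-coin hypothesis, 1/(N+1), divided by
    the likelihood under a fair coin, C(N, N1) * (1/2)^N."""
    p_fair = comb(N, N1) * 0.5 ** N
    p_biased = 1.0 / (N + 1)
    return p_biased / p_fair

bf_small = bayes_factor(9, 10)
bf_large = bayes_factor(90, 100)
print(bf_small, bf_large)  # ~9.31 and ~7.25e14
```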
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.19 Irrelevant features with Naive Bayes</strong></span></p>
<p>Let \(x_{iw}=1\) if word w occurs in document i, and be 0 otherwise. Let \(\theta_{cw}\) be the estimated probability that word w occurs in documents of class c. The log-likelihood that document x belongs to class c is:</p>
<p>\( \log(p(\mathbf{x_i}|c,\theta)) = \log \prod_w \theta_{cw}^{x_{iw}}(1-\theta_{cw})^{1-x_{iw}} \)</p>
<p>\( = \sum_w x_{iw} \log(\frac{\theta_{cw}}{1-\theta_{cw}}) + \sum_w \log(1-\theta_{cw}) \)</p>
<p>This can be written more succinctly as \( \log p(\mathbf{x_i}|c,\theta) = \mathbf{\phi(x_i)}^T \mathbf{\beta_c} \), where \( \mathbf{\phi(x_i)} = (\mathbf{x_i},1) \) and:</p>
<p>\(\mathbf{\beta_c} = (\log \frac{\theta_{c,1}}{1-\theta_{c,1}}, \dots, \log \frac{\theta_{c,W}}{1-\theta_{c,W}} , \sum_w \log(1-\theta_{cw}))^T \)</p>
<p>i.e. a linear classifier, as the class-conditional density is a linear function of the params \(\mathbf{\beta_c}\).</p>
<p><span style="text-decoration: underline;"><strong>(a) Assuming P(C=1) = P(C=2) = 0.5, write an expression for the log posterior odds ratio in terms of the features and the parameters.</strong></span></p>
<p>We just use Bayes' theorem: \( P(C=1 | \mathbf{x_i}) = P(\mathbf{x_i} | C=1) P(C=1) / P(\mathbf{x_i}) \), and likewise for C=2. However, as \(P(C=1) = P(C=2)\) we get a cancellation such that:</p>
<p>\( \log \frac{P(C=1 | \mathbf{x_i})}{P(C=2 | \mathbf{x_i})} = \mathbf{\phi(x_i)}^T(\mathbf{\beta_1 - \beta_2}) \)</p>
<p><span style="text-decoration: underline;"><strong>(b) Consider a particular word w. State the conditions on \(\theta_{1,w}\) and \(\theta_{2,w}\) under which the presence or absence of the word will have no effect on the class posterior.</strong></span></p>
<p>For this, we want word w's contribution to the log posterior odds ratio to be zero whether or not the word is present, which requires \( \beta_{1,w} = \beta_{2,w} \), i.e. \( \theta_{1,w} = \theta_{2,w} \).</p>
<p><span style="text-decoration: underline;"><strong>(c) The posterior mean estimate of theta, using a Beta(1,1) prior, is given by:</strong></span></p>
<p>\( \hat{\theta_{cw}} = \frac{1+ \sum_{i \in c} x_{iw}}{2 + n_c} \)</p>
<p><span style="text-decoration: underline;"><strong>where the sum is over the nc documents in class c. Consider a word w, and suppose it occurs in every document, regardless of class. Let there be n1 documents of class 1, and n2 of class 2, with n1 not equal to n2. Will this word be ignored by our classifier?</strong></span></p>
<p>Clearly not: here \(\hat{\theta_{1,w}} = \frac{1+n_1}{2+n_1}\) and \(\hat{\theta_{2,w}} = \frac{1+n_2}{2+n_2}\), which are not equal as \(n_1 \neq n_2\), and hence the condition derived in (b) does not hold.</p>
<p><span style="text-decoration: underline;"><strong>(d) What other ways can you think of to encourage "irrelevant" words to be ignored?</strong></span></p>
<p>I guess things like preprocessing and feature selection would be good for this - e.g. removing common stop words, keeping only the words with the highest mutual information with the class label, or using a stronger prior that shrinks the \(\theta_{cw}\) estimates towards a common value.</p>
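To make (c) concrete, here is a toy sketch (the document counts are made up) showing that a word occurring in every document of both classes still contributes to the log posterior odds:

```python
from math import log

n1, n2 = 10, 20  # hypothetical numbers of documents in class 1 and class 2

# Posterior mean estimates when word w occurs in every document of each class
theta1 = (1 + n1) / (2 + n1)
theta2 = (1 + n2) / (2 + n2)

# Word w's contribution to the log posterior odds, beta_{1,w} - beta_{2,w}
beta_diff = log(theta1 / (1 - theta1)) - log(theta2 / (1 - theta2))
print(beta_diff)  # nonzero whenever n1 != n2, so the word is not ignored
```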
<p> </p>
<p><span style="text-decoration: underline;"><strong>3.21 Mutual information for Naive Bayes with binary features</strong></span></p>
<p>The result was stated in the chapter, here we are asked to derive it. We are looking for the mutual information between feature j and the class label Y, i.e. \(I(X_j, Y)\). By definition, this is equal to:</p>
<p>\(I(X_j;Y) = \sum_{x_j \in \{0,1\}} \sum_{y \in C} P(x_j, y) \log \left( \frac{P(x_j, y)}{p(x_j) p(y)} \right) \)</p>
<p>To get the joint values, we can say \(P(x_j = 1, y=c) = P(x_j=1 | y=c) P(y=c) = \theta_{jc} \pi_c \)</p>
<p>and then \( P(x_j=0, y=c) = P(x_j=0 | y=c) P(y=c) = (1-\theta_{jc}) \pi_c \).</p>
<p>By definition, \(P(y=c) = \pi_c\), and then we can say:</p>
<p>\(P(x_j=1) = \sum_{c'} P(x_j=1, y=c') = \sum_{c'} \theta_{jc'} \pi_{c'} \equiv \theta_j\)</p>
<p>\(P(x_j=0) = \sum_{c'} P(x_j=0, y=c') = \sum_{c'} (1-\theta_{jc'}) \pi_{c'} = 1-\theta_j \)</p>
<p>Putting this together we get the desired result:</p>
<p>\( I(X_j, Y) = \sum_c \left[ \theta_{jc} \pi_c \log \left( \frac{\theta_{jc}}{\theta_j} \right) + (1-\theta_{jc})\pi_c \log \left( \frac{1-\theta_{jc}}{1-\theta_j} \right) \right] \)</p>
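As a quick check, we can evaluate the mutual information both from the direct definition and from this formula, for some made-up values of \(\theta_{jc}\) and \(\pi_c\), and confirm they agree:

```python
from math import log

pi = [0.3, 0.7]     # hypothetical class priors pi_c
theta = [0.9, 0.2]  # hypothetical theta_{jc} = P(x_j = 1 | y = c)

theta_j = sum(t * p for t, p in zip(theta, pi))  # P(x_j = 1)

# Direct definition: sum over the joint of P(x_j, y) log[P(x_j, y) / (P(x_j) P(y))]
joint = {(1, c): theta[c] * pi[c] for c in range(2)}
joint.update({(0, c): (1 - theta[c]) * pi[c] for c in range(2)})
marg_x = {1: theta_j, 0: 1 - theta_j}
direct = sum(p * log(p / (marg_x[x] * pi[c])) for (x, c), p in joint.items())

# The formula derived above
formula = sum(theta[c] * pi[c] * log(theta[c] / theta_j)
              + (1 - theta[c]) * pi[c] * log((1 - theta[c]) / (1 - theta_j))
              for c in range(2))
print(direct, formula)
```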
<p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML" async="">// <![CDATA[
// ]]></script>
</p>http://www.henrycharlesworth.com/blog/index.php?controller=post&action=view&id_post=82018-09-17T18:07:03+00:00Machine Learning - A Probabilistic Perspective Exercises - Chapter 2<p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML" async="">// <![CDATA[
// ]]></script>
</p>
<p>Recently I've been becoming more and more interested in machine learning, and so far my attention has been primarily focused on reinforcement learning. I had a lot of fun working on the Big 2 AI, but I feel like I really need to invest more time in studying the fundamentals of machine learning. I've got myself a copy of "Machine Learning - A Probabilistic Perspective", which seems like a great text book, and so I'm going to work my way through it. I've decided to make a decent attempt at doing as many of the exercises as possible, and I feel like actually writing up an explanation for them is quite useful for me in making sure I actually understand what's going on. It might also be useful for other people, so I thought I would post my answers as I go! I may skip some exercises if I think they're boring (or too "wordy"), or of course if I'm unable to do them (although I will probably mention them in this case)!</p>
<p><strong><span style="text-decoration: underline;">2.1 My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbour has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.<br /></span> (<span style="text-decoration: underline;">a) Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?</span></strong></p>
<p> This is a standard probability puzzle. The key here is in the wording of the question we ask him - whether he has <em>any boys</em>? A priori, there are four possible options for child 1 and child 2: BB (boy boy), BG, GB and GG, each with equal probability. If he says yes to our question, this is compatible with three of the initial four options: BB, BG and GB. That is, we can only exclude the possibility GG. In two out of these three remaining possibilities he has one girl, and so the probability is 2/3.</p>
<p><br /><strong> (<span style="text-decoration: underline;">b) Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?</span></strong></p>
<p>In this case the question is more specific with its wording - we have seen one specific child, but this tells us nothing about the other child, and so we get the more intuitive answer of 1/2.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.3 Show Var[X+Y] = Var[X] + Var[Y] + 2 Cov[X,Y] for any two random variables X and Y<br /></strong></span></p>
<p>This one is relatively straightforward and just involves starting from a basic relationship for the variance:</p>
<p>\( Var[X+Y] = \mathbb{E}[(X+Y)^2] - \mathbb{E}[X+Y]^2 = \mathbb{E}[(X+Y)^2] - (\mathbb{E}[X] + \mathbb{E}[Y])^2 \)</p>
<p>(using the linearity of expectation). Expanding this out:</p>
<p>\( \mathbb{E}[X^2] + \mathbb{E}[Y^2] + 2 \mathbb{E}[XY] - \mathbb{E}[X]^2 - 2\mathbb{E}[X] \mathbb{E}[Y] - \mathbb{E}[Y]^2 \)</p>
<p>which clearly gives the required result (as \( Cov[X,Y] = \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] \) ).</p>
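The identity is easy to verify exactly on a small discrete joint distribution (the particular joint here is made up), using exact rational arithmetic:

```python
from fractions import Fraction as F

# An arbitrary correlated joint distribution over (x, y)
joint = {(0, 0): F(1, 2), (1, 1): F(1, 4), (1, 0): F(1, 8), (0, 1): F(1, 8)}

def E(f):
    """Expectation of f(x, y) under the joint distribution."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

var_sum = E(lambda x, y: (x + y) ** 2) - E(lambda x, y: x + y) ** 2
var_x = E(lambda x, y: x ** 2) - E(lambda x, y: x) ** 2
var_y = E(lambda x, y: y ** 2) - E(lambda x, y: y) ** 2
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
print(var_sum == var_x + var_y + 2 * cov)  # True, exactly
```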
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.4 After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you actually have the disease?</strong></span></p>
<p>This is another famous example of the use of Bayes' theorem. If we define two events as follows:</p>
<p>A - you have the disease</p>
<p>B - the test returns a positive result</p>
<p>If you go and take the test, and you get a positive result saying that you have the disease, then what you want to calculate is \(p(A|B)\). That is, the probability that you have the disease given that you tested positive. By Bayes' theorem, this is equivalent to:</p>
<p>\( p(A|B) = \frac{p(B|A) p(A)}{p(B)} \)</p>
<p>Now, since the test is 99% accurate, \( P(B|A) = 0.99 \), i.e. if you have the disease you will test positive 99% of the time. But p(A) = 1/10000, i.e. the probability of an "average" person in the population having the disease (with no other information about them). Finally, we can calculate p(B) as follows:</p>
<p>\(P(B) = 0.99 \frac{1}{10000} + 0.01 \frac{9999}{10000} \)</p>
<p>where the first term is the probability of being positive if you have the disease*prob of actually having the disease and the second term is prob of testing positive if you don't have the disease (false positive) * probability of not having the disease.</p>
<p>Putting the numbers in you find that \(P(A|B) \approx 0.98 \% \), which makes it extremely unlikely that you have the disease despite testing positive! This is one example of why it would be a good idea if doctors learned some stats!</p>
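The arithmetic in full (a trivial sketch):

```python
p_disease = 1 / 10000       # P(A): the base rate of the disease
p_pos_given_sick = 0.99     # P(B|A)
p_pos_given_healthy = 0.01  # false positive rate

# Total probability of testing positive
p_pos = p_pos_given_sick * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem
posterior = p_pos_given_sick * p_disease / p_pos
print(posterior)  # ~0.0098, i.e. about 0.98%
```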
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.5 Monty Hall problem - On a game show, a contestant is told the rules as follows: There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will open one of the other two doors, and he will do so in such a way as not to reveal the prize. For example, if you first choose door 1, he will then open one of doors 2 and 3, and it is guaranteed that he will choose which one to open so that the prize will not be revealed. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door. Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference? You may assume that initially, the prize is equally likely to be behind any of the 3 doors.</strong></span></p>
<p>This is another famous problem which has quite an interesting <a href="http://www.stayorswitch.com/history.html">history</a>, apparently leading to many genuine mathematicians complaining about the correct solution which was given in a newspaper column (although this was a little while ago, I still find this extremely surprising!) It also comes up online every now and then with people arguing/trying to explain why you should switch, so I'll add my best attempt at explaining it here I guess!</p>
<p>For me, the crucial detail that helps in understanding it is that the host has <em>information about where the prize is</em>, which means that he is not choosing a door at random to open, since he will never open the door with the prize behind. If you take the starting point as the prize having equal probability of being behind each door, then you only need to consider two possibilities - 1) the initial door you chose does not have the prize behind it (which happens 2/3 of the time), and 2) the initial door does have the prize behind it. In case 1), you have selected one of the doors without the prize behind it - which means the host has no choice but to open the only other door that doesn't have the prize behind it. This means in this case, if you switch, you win the prize - and remember this happens 2/3 of the time. Obviously if you did initially pick the winning door and you switch then you lose, but this only happens 1/3 of the time. So it's better to switch!</p>
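A simulation backs this up (a sketch; the seed and trial count are arbitrary, and note the host's choice between two losing doors doesn't affect the win probabilities, so picking the first valid door deterministically is fine):

```python
import random

random.seed(0)

def play(switch, trials=100000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # Host opens a door that is neither our choice nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

stick_rate, switch_rate = play(False), play(True)
print(stick_rate, switch_rate)  # ~1/3 vs ~2/3
```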
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.6 Conditional Independence</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(a) Let \(H \in \{1,\dots,K\}\) be a discrete RV, and let e1 and e2 be the observed values of two other RVs E1 and E2. Suppose we wish to calculate the vector:</strong></span></p>
<p><span style="text-decoration: underline;"><strong>\( \vec{P}(H|e_1, e_2) = (P(H=1|e_1, e_2), \dots, P(H=K|e_1, e_2)) \)</strong></span></p>
<p><span style="text-decoration: underline;"><strong>Which of the following are sufficient for the calculation?</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(i) P(e1,e2), P(H), P(e1|H), P(e2|H)</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(ii) P(e1,e2), P(H), P(e1,e2|H)</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(iii) P(e1|H), P(e2|H), P(H)</strong></span></p>
<p>We can use Bayes' theorem to write:</p>
<p>\( P(H| E_1, E_2) = \frac{P(E_1, E_2 | H) P(H)}{ P(E_1, E_2)} \)</p>
<p>so clearly (ii) is sufficient. The others are not in general sufficient.</p>
<p><span style="text-decoration: underline;"><strong>(b) Now suppose we assume \(E_1 \bot E_2 | H\). Now which are sufficient?</strong></span></p>
<p>Well, clearly (ii) still is. But conditional independence of E1 and E2 given H means that we can write:</p>
<p>\( P(E_1, E_2 | H) = P(E_1 | H) P(E_2 | H) \)</p>
<p>which means that (i) is also sufficient now.</p>
<p>However, we can also write:</p>
<p>\( P(e_1, e_2) = \sum_h p(e_1, e_2, h) = \sum_h p(e_1, e_2 | h) p(h) \)</p>
<p>which means that (iii) is also sufficient too!</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.7 Show that pairwise independence does not imply mutual independence</strong></span></p>
<p>The best way to do this is to show an example where we have pairwise independence, but not mutual independence. Consider rolling a fair, four-sided die. If we define 3 events, A = {1,2}, B={1,3} and C={1,4}, then clearly P(A) = P(B) = P(C) = 1/2. But also, P(A,B) = P(A,C) = P(B,C) = P({1}) = 1/4 = P(A)P(B) = P(A)P(C) = P(B)P(C). This means we have pairwise independence.</p>
<p>However, P(A,B,C) = P({1}) also, which is not equal to P(A)P(B)P(C) = 1/8, and so we do not have mutual independence.</p>
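We can confirm this by brute-force enumeration with exact fractions:

```python
from fractions import Fraction as F

A, B, C = {1, 2}, {1, 3}, {1, 4}  # events on a fair four-sided die

def P(event):
    return F(len(event), 4)  # each outcome in {1, 2, 3, 4} has probability 1/4

pairwise = all(P(X & Y) == P(X) * P(Y) for X, Y in [(A, B), (A, C), (B, C)])
mutual = P(A & B & C) == P(A) * P(B) * P(C)
print(pairwise, mutual)  # True False
```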
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.9 Conditional Independence. Are the following properties true?</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(a) \( (X \bot W|Z,Y) \ AND \ (X \bot Y|Z) \implies (X \bot Y , W|Z) \)</strong></span></p>
<p>OK initially this looks pretty confusing, but once you break it down it's not too bad.</p>
<p>The first condition tells us that:</p>
<p>\( P(X,W | Z,Y) = P(X | Z,Y) P(W| Z,Y) \)</p>
<p>and the second tells us:</p>
<p>\( P(X,Y | Z) = P(X | Z) P(Y | Z) \)</p>
<p>The condition we need to show is:</p>
<p>\( P(X,Y,W | Z) = P(X | Z) P(Y, W | Z) \)</p>
<p>Completely generally, we can use the chain rule of probability to say:</p>
<p>\( P(X,Y,W | Z) = P(W| X,Y,Z) P(Y| X,Z) P(X | Z) \)</p>
<p>However, since X and W are conditionally independent given Z,Y, we can say that \(P(W| X,Y,Z) = P(W | Y,Z) \) (this is because generally \(P(X,W | Y,Z) = P(W | X, Y, Z) P(X | Y, Z) \), so if conditional independence holds then the previous identity must too). Similarly, \(P(Y|X,Z) = P(Y|Z)\) since X and Y are conditionally independent given Z. This means that:</p>
<p>\(P(X,Y,W | Z) = P(W|Y,Z)P(Y|Z) P(X|Z) \)</p>
<p>Then we can say that \(P(W|Y,Z)P(Y|Z) = P(W,Y|Z) \), which gives us exactly what we need.</p>
<p>I couldn't be bothered to do part (b), but I expect it's kind of similar.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.10 Deriving the inverse Gamma density</strong></span></p>
<p>Let \( X \sim Ga(a,b) \), such that:</p>
<p>\( p(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb} \).</p>
<p>If \( Y = 1/X \), show that \( Y \sim IG(a,b) \), where:</p>
<p>\(IG(x |a,b) = \frac{b^a}{\Gamma(a)} x^{-(a+1)}e^{-b/x} \)</p>
<p>Let's start from the CDF of Y:</p>
<p>\( P(Y \le y) = P(\frac{1}{X} \le y) = P(X \ge \frac{1}{y}) = 1 - P(X \le \frac{1}{y}) \)</p>
<p>If we let \( u = 1/y \), then:</p>
<p>\( p(y) = \frac{d}{dy} P(Y \le y) = \frac{du}{dy} \frac{d}{du}(1-P(X \le u)) = \frac{1}{y^2} Ga(u | a, b) \)</p>
<p>Then we substitute y back in to find:</p>
<p>\( p(y) = \frac{b^a}{\Gamma(a)} y^{-(a+1)} e^{-b/y} \)</p>
<p>which is the result we require.</p>
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.11 Basically asking to evaluate: \( \int_{-\infty}^{\infty} e^{-\frac{x^2}{2 \sigma^2}} dx \)</strong></span></p>
<p>I remember seeing how to do this before - the trick is to consider the squared value and then convert to circular polar coordinates. Writing \( I^2 = \int \int e^{-(x^2+y^2)/2\sigma^2} dx dy \) and changing from \( dx dy \) to \(r dr d\theta \), the \(\theta\) integral just gives a factor of \(2\pi\) and we are left with \( I^2 = 2 \pi \int_0^\infty r e^{-r^2/2\sigma^2} dr = 2 \pi \sigma^2 \), so \( I = \sigma \sqrt{2 \pi} \).</p>
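A numerical sketch backing up the standard result \( \int_{-\infty}^{\infty} e^{-x^2/2\sigma^2} dx = \sigma \sqrt{2\pi} \) (the integration range and step count here are arbitrary choices of mine):

```python
from math import exp, pi, sqrt

def gaussian_integral(sigma, half_width=40.0, steps=200000):
    """Midpoint-rule approximation of the integral of exp(-x^2 / (2 sigma^2))."""
    a = -half_width * sigma
    h = 2 * half_width * sigma / steps
    return h * sum(exp(-(a + (i + 0.5) * h) ** 2 / (2 * sigma ** 2))
                   for i in range(steps))

for sigma in [0.5, 1.0, 2.0]:
    print(gaussian_integral(sigma), sigma * sqrt(2 * pi))  # pairs agree
```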
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.12 Expressing mutual information in terms of entropies: show that \( I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \)<br /></strong></span></p>
<p>If we start from the definition in the textbook of the MI being equal to the KL divergence between \(P(X,Y)\) and \(P(X)P(Y)\) then we have:</p>
<p>\( I(X;Y) = \sum_{x,y} P(x,y) \log(\frac{P(x,y)}{P(x)P(y)}) = \sum_{x,y} P(x,y) [ \log(P(x,y)) - \log(P(x)) - \log(P(y)) ] \)</p>
<p>If we write \(P(x,y) = P(x | y) p(y) \) then we have:</p>
<p>\(I(X;Y) = \sum_{x,y} P(x|y)P(y) [ \log(P(x|y)) - \log(P(x))] \)</p>
<p>We can then use that the conditional entropy \(H(X|Y) = -\sum_y p(y) \sum_x p(x|y) \log(p(x|y)) \), which is the negative of the first term we have. The second term, \(-\sum_{x,y} P(x,y) \log(P(x)) = H(X)\), clearly, and so we have found that \(I(X;Y) = H(X) - H(X|Y)\). It is easy to show the second identity, replacing \(P(x,y) = P(y|x)p(x)\) instead.</p>
<p> </p>
<p> <span style="text-decoration: underline;"><strong>2.13 Mutual information for correlated normals. Find the mutual information I(X1; X2) where X has a bivariate normal distribution:</strong></span></p>
<p><span style="text-decoration: underline;"><strong>\( \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{bmatrix} \sigma^2 & \rho \sigma^2 \\ \rho \sigma^2 & \sigma^2 \end{bmatrix} \right) \)</strong></span></p>
<p><span style="text-decoration: underline;"><strong>Evaluate when \(\rho = -1, 0, 1 \)</strong></span></p>
<p>The question actually gives you the form for the differential entropy for a multivariate normal, but I found a derivation which I thought was pretty nice and so I'm going to include it here (this is for a completely general MV normal of any dimension).</p>
<p>The pdf of a MV normal is:</p>
<p>\( p(\mathbf{x}) = \frac{1}{\sqrt{det(2 \pi \Sigma)}} e^{-\frac{(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu})}{2}} \)</p>
<p>Now the differential entropy is:</p>
<p>\( H(\mathbf{x}) = \int \int d\mathbf{x} \left[ \log(\sqrt{det(2 \pi \Sigma)}) + \frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] \mathcal{N}(\mathbf{x}; \mathbf{\mu}, \Sigma) \)</p>
<p>Now the nice trick here is that, by definition, the covariance matrix \( \Sigma = \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu}) (\mathbf{x}-\mathbf{\mu}) ^T \right] \), and the second term in the differential entropy is 1/2 times \( \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] \). Crucially, since this is a scalar we can say it's equal to it's trace, i.e.</p>
<p>\( \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] = \mathbb{E} \left[ Tr\left( (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right) \right] \)</p>
<p>Then we can use the cyclic properties of traces, i.e. \( Tr(ABC) = Tr(BCA) = Tr(CAB) \), such that this is equal to:</p>
<p>\( \mathbb{E} \left[ Tr\left( (\mathbf{x} -\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} \right) \right] = Tr \left[ \Sigma \Sigma^{-1} \right] = d \)</p>
<p>where d is the number of random variables (i.e. the dimension). Now \( d = \log(e^d) \) and \(det(2\pi \Sigma) = (2 \pi)^d det(\Sigma) \), so we are left with:</p>
<p>\(H(\mathbf{x}) = \frac{1}{2} \log((2 \pi e)^d det(\Sigma)) \)</p>
<p>Now that we have this the rest of the question is relatively straightforward as we can rewrite the mutual information as \(I(X;Y) = H(X) + H(Y) - H(X,Y)\), and so in this case:</p>
<p>\(I(X_1; X_2) = \log(2 \pi e \sigma^2) - \frac{1}{2} \log((2 \pi e)^2 det(\Sigma)) \)</p>
<p>Now, \(det(\Sigma) = \sigma^4 - \rho^2 \sigma^4\), and so making a few algebraic manipulations we arrive at:</p>
<p>\(I(X_1; X_2) = \frac{1}{2} \log(\frac{1}{1-\rho^2}) \)</p>
<p>As \( \rho \to 0\), \(I \to 0 \), and as \( \rho \to \pm 1 \), \(I \to \infty \). Intuitively this makes sense - if \(\rho = 0\), there is no correlation and so the variables give us no information about each other. If they are perfectly correlated, we learn everything about \(X_1\) from \(X_2\), and an infinite amount of information can be stored in a real number.</p>
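We can sanity check the closed form by Monte Carlo, estimating \( \mathbb{E}[\log p(x_1,x_2) - \log p(x_1) - \log p(x_2)] \) directly from samples (a sketch; \(\rho\), the seed and the sample size are arbitrary, and since \(\sigma\) cancels in the log ratio the variables are taken standardised):

```python
import random
from math import log, sqrt

random.seed(1)
rho, n = 0.6, 200000

total = 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    u, v = z1, rho * z1 + sqrt(1 - rho ** 2) * z2  # correlated standard normals
    # log p(u, v) - log p(u) - log p(v) for the bivariate normal
    total += (-0.5 * log(1 - rho ** 2)
              - (u * u - 2 * rho * u * v + v * v) / (2 * (1 - rho ** 2))
              + (u * u + v * v) / 2)
mc_estimate = total / n

closed_form = 0.5 * log(1 / (1 - rho ** 2))
print(mc_estimate, closed_form)  # both ~0.223
```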
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.14 Normalised mutual information. Let X and Y be discrete and identically distributed RVs (so H(X) = H(Y)). Let:</strong></span></p>
<p><span style="text-decoration: underline;"><strong>\(r=1 - \frac{H(Y|X)}{H(X)}\)</strong></span></p>
<p><span style="text-decoration: underline;"><strong>(a) Show \( r = \frac{I(X;Y)}{H(X)} \):</strong></span></p>
<p>This is quite straightforward: \(r = \frac{H(X) - H(Y|X)}{H(X)} = \frac{I(X;Y)}{H(X)} \)</p>
<p><span style="text-decoration: underline;"><strong>(b) Show \( 0 \le r \le 1 \)</strong></span></p>
<p>Entropy is always non-negative, so \(H(Y|X)/H(X) \ge 0\) and clearly \( r \le 1 \) from its initial definition. Then, since the mutual information is \( \ge 0 \) too, part (a) gives \( r \ge 0 \). I guess we should prove that \(I(X;Y) \ge 0 \). We do this starting from the initial definition:</p>
<p>\(I(X;Y) = -\sum_{x,y} p(x,y) \log(\frac{p(x)p(y)}{p(x,y)}) \)</p>
<p>Now, since the negative logarithm is convex we can apply Jensen's inequality \(\sum_i \lambda_i f(x_i) \ge f(\sum_i \lambda_i x_i) \) (where \( \sum_i \lambda_i = 1\)):</p>
<p>\( I(X;Y) \ge -\log \left( \sum_{x,y} p(x,y) \frac{p(x) p(y)}{p(x,y)} \right) = 0 \)</p>
<p><span style="text-decoration: underline;"><strong>(c) When is r=0?</strong></span></p>
<p>When \(I(X;Y) = 0\), i.e. when the variables give us no information about each other.</p>
<p><span style="text-decoration: underline;"><strong>(d) When is r=1?</strong></span></p>
<p>When \(H(Y|X) = 0\), i.e. when the variables tell us everything about each other (i.e. once you know x, you get no information by learning y). I guess this is why r can be thought of as a normalised mutual information!</p>
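A toy joint distribution with identical marginals (the probabilities here are made up) confirms the two expressions for r agree and lie in [0, 1]:

```python
from math import log

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = dict(px)  # by symmetry the marginals are identical, so H(X) = H(Y)

H_X = -sum(p * log(p) for p in px.values())
H_Y_given_X = -sum(p * log(p / px[x]) for (x, _), p in joint.items())
I = sum(p * log(p / (px[x] * py[y])) for (x, y), p in joint.items())

r1 = 1 - H_Y_given_X / H_X
r2 = I / H_X
print(r1, r2)  # equal, and between 0 and 1
```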
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.15 MLE minimises the KL divergence with empirical distribution</strong></span></p>
<p>\(P_{emp}(x_i) = N_i / N\) (sticking with the discrete case for now, where \(N_i\) is the number of occurrences of \(x_i\) and N is the total number of data points).</p>
<p>\(KL(p_{emp} || q(x;\theta)) = \sum_i p_{emp}(x_i) \log(\frac{p_{emp}(x_i)}{q(x_i;\theta)}) = \sum_i p_{emp}(x_i) \log(p_{emp}(x_i)) - \sum_i p_{emp}(x_i) \log(q(x_i;\theta)) \)</p>
<p>The first term here is fixed by the data and so it is clear that:</p>
<p>\( argmin_{\theta} KL (p_{emp} || q) = argmax_{\theta} \frac{1}{N} \sum_i N_i \log(q(x_i;\theta)) \)</p>
<p>which is the same as maximising the log-likelihood, and hence the likelihood.</p>
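A grid-search sketch for Bernoulli data (the data counts are made up) shows both objectives pick the same \(\theta\):

```python
from math import log

counts = {1: 7, 0: 3}  # hypothetical data: 7 heads, 3 tails
N = sum(counts.values())
p_emp = {x: n / N for x, n in counts.items()}

def kl(theta):
    """KL(p_emp || Bernoulli(theta))."""
    q = {1: theta, 0: 1 - theta}
    return sum(p * log(p / q[x]) for x, p in p_emp.items())

def log_lik(theta):
    return sum(n * log(theta if x == 1 else 1 - theta) for x, n in counts.items())

grid = [i / 1000 for i in range(1, 1000)]  # avoid theta = 0 and 1
theta_kl = min(grid, key=kl)
theta_mle = max(grid, key=log_lik)
print(theta_kl, theta_mle)  # both 0.7
```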
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.16 Derive the mean, mode and variance of the Beta(a,b) distribution.</strong></span></p>
<p>\(Beta(x; a,b) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1} \equiv \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)} \)</p>
<p>To get the mode you just differentiate with respect to x and set equal to zero, which gives \( \frac{a-1}{a+b-2} \) (for \(a, b > 1\)).</p>
<p>For the mean:</p>
<p>\( \text{mean} = \mathbb{E}[X] = \frac{1}{B(a,b)} \int_0^1 x^a (1-x)^{b-1} dx \)</p>
<p>Integrating by parts we find:</p>
<p>\( \mathbb{E}[X] = \frac{a}{B(a,b) b} \int_0^1 x^{a-1} (1-x)^b dx = \frac{a}{B(a,b) b} \left[ \int_0^1 x^{a-1} (1-x)^{b-1}dx - \int_0^1 x^a (1-x)^{b-1} dx \right]\)</p>
<p>However, one of these just integrates to 1 because it is a Beta distribution, and the second is exactly what we started with, i.e. \( \mathbb{E}[X]\), and hence:</p>
<p>\( \mathbb{E}[X] = \frac{a}{b}(1-\mathbb{E}[X])\)</p>
<p>rearranging we find: \( \mathbb{E}[X] = \frac{a}{a+b} \). You should be able to get the variance using the same kind of trick (for the record, it comes out to \( \frac{ab}{(a+b)^2 (a+b+1)} \)), but it looked like a lot more algebra and I was too tired to attempt it!</p>
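The mean formula checks out numerically (a sketch; midpoint-rule integration, with arbitrary parameter choices):

```python
from math import gamma

def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_mean(a, b, steps=100000):
    """Midpoint-rule estimate of E[X] under Beta(a, b)."""
    h = 1.0 / steps
    return h * sum((i + 0.5) * h * beta_pdf((i + 0.5) * h, a, b) for i in range(steps))

for a, b in [(2, 3), (5, 1), (3, 7)]:
    print(beta_mean(a, b), a / (a + b))  # pairs agree
```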
<p> </p>
<p><span style="text-decoration: underline;"><strong>2.17 Expected value of the minimum. Let X, Y be iid U(0,1) RVs. What is the expected value of min(X,Y).</strong></span></p>
<p>If we let the CDF of X (and Y) be \(F(x) = x\) (for x between 0 and 1), then:</p>
<p>\(P(\text{min}(X,Y) \ge x) = 1-P(\text{min}(X,Y) \le x) = P( X \ge x, Y \ge x) = (1-F(x))^2 = (1-x)^2\)</p>
<p>Therefore \(P(\text{min}(X,Y) \le x) = 1 - (1-x)^2 \), and so we can write the probability density of the minimum as:</p>
<p>\(p_{min}(x) = 2(1-x)\)</p>
<p>as such, the expected value of the minimum, which is what we want, is:</p>
<p>\( \mathbb{E}[min] = 2 \int_0^1 x(1-x) dx = 1/3 \)</p>
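A quick simulation agrees (seed and sample size arbitrary):

```python
import random

random.seed(2)
n = 200000
mc = sum(min(random.random(), random.random()) for _ in range(n)) / n
print(mc)  # ~1/3
```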
<p> </p>
<p>There are a couple of questions I skipped, maybe at some point I'll get back to them but I don't think they were especially interesting/informative. If there are any questions, or if you spot any errors, please leave a comment! I have done most of the CH3 exercises now, and so will be writing them up soon as well!</p>http://www.henrycharlesworth.com/blog/index.php?controller=post&action=view&id_post=72018-09-12T22:42:23+00:00