Yulong Niu

Personal blog

A brief introduction to mutual information, with a demonstration in R

Posted at — Nov 10, 2013

$\newcommand{\entropfrac}[2]{\frac{#1}{#2} \log \left( \frac{#1}{#2} \right)}$

Mutual Information (MI)

Introduction

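Mutual information (MI) measures the statistical dependence between two gene vectors, and so indicates how closely related they are. Take for example $x_1 = (1, 0, 1, 1, 1, 1, 0)$ and $y_1 = (0, 1, 1, 1, 1, 1, 0)$. The two vectors are easily summarized in a binary table, where each cell counts the positions showing that combination of symbols: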

| X \ Y | 1 (Presence) | 0 (Absence) | Sum |
|---|---|---|---|
| 1 (Presence) | a | b | a+b |
| 0 (Absence) | c | d | c+d |
| Sum | a+c | b+d | n=a+b+c+d |
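
As a minimal sketch, the table can be built in R with `table()`. The variable names `tab`, `n_a`, `n_b`, `n_c` and `n_d` are illustrative only, and the factor levels are set by hand so that presence (1) comes before absence (0), as in the table above.

```r
x1 <- c(1, 0, 1, 1, 1, 1, 0)
y1 <- c(0, 1, 1, 1, 1, 1, 0)

# Fix the level order so rows and columns read 1 (presence) first,
# then 0 (absence); by default table() would sort 0 before 1.
tab <- table(X = factor(x1, levels = c(1, 0)),
             Y = factor(y1, levels = c(1, 0)))

# Cell counts corresponding to a, b, c and d (illustrative names).
n_a <- tab["1", "1"]  # X = 1, Y = 1
n_b <- tab["1", "0"]  # X = 1, Y = 0
n_c <- tab["0", "1"]  # X = 0, Y = 1
n_d <- tab["0", "0"]  # X = 0, Y = 0

# Append the row sums, the column sums and the total n.
addmargins(tab)
```

For these two vectors the counts are a = 4 and b = c = d = 1, with n = 7.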

Here we consider the case of two discrete variables X and Y, represented by the gene vectors $x_1$ and $y_1$. Their mutual information is

$$ \begin{align} \begin{split} I(X;Y) &= H(X) + H(Y) - H(X,Y)\newline &= -\sum_{x \in \lbrace 0, 1 \rbrace}p(x)\log(p(x)) - \sum_{y \in \lbrace 0, 1 \rbrace}p(y)\log(p(y))\newline & \quad -\left( -\sum_{x \in \lbrace 0, 1 \rbrace}\sum_{y \in \lbrace 0, 1 \rbrace}p(x,y)\log(p(x,y)) \right) \end{split} \label{eq:1} \end{align} $$

Equation $\eqref{eq:1}$ is equivalent to

$$ \begin{equation} I(X;Y) = \sum_{x \in \lbrace 0, 1 \rbrace}\sum_{y \in \lbrace 0, 1 \rbrace} p(x,y) \log\left(\frac{p(x,y)}{p(x)p(y)}\right) \label{eq:2} \end{equation} $$
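
The step between the two forms can be sketched as follows. It relies only on the marginalization identities $p(x) = \sum_{y} p(x,y)$ and $p(y) = \sum_{x} p(x,y)$, and all sums run over the symbols 0 and 1.

$$ \begin{aligned} H(X) &= -\sum_{x} p(x)\log p(x) = -\sum_{x}\sum_{y} p(x,y)\log p(x)\newline H(Y) &= -\sum_{y} p(y)\log p(y) = -\sum_{x}\sum_{y} p(x,y)\log p(y)\newline I(X;Y) &= H(X) + H(Y) - H(X,Y)\newline &= \sum_{x}\sum_{y} p(x,y)\left[ \log p(x,y) - \log p(x) - \log p(y) \right]\newline &= \sum_{x}\sum_{y} p(x,y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right) \end{aligned} $$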

$p(x)$ is the probability that a symbol (here 0 or 1) appears in gene vector X, regardless of the symbol in gene vector Y; $p(y)$ is defined in the same way for gene vector Y. $p(x, y)$ is the probability that a given combination of symbols appears at the same position in gene vectors X and Y. In this example there are four possible combinations: $(1, 1)$, $(1, 0)$, $(0, 1)$ and $(0, 0)$.
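
As an illustration, $\eqref{eq:2}$ can be evaluated from a matrix of joint probabilities. The sketch below is one possible way to write it. The names `mi_from_joint` and `p_xy` are hypothetical, rows are assumed to index the symbols of X and columns those of Y, and combinations with zero probability are taken to contribute 0.

```r
# Mutual information (natural log, so in nats) from a joint
# probability matrix: rows index symbols of X, columns symbols of Y.
mi_from_joint <- function(p_xy) {
  p_x <- rowSums(p_xy)        # marginal p(x)
  p_y <- colSums(p_xy)        # marginal p(y)
  p_indep <- outer(p_x, p_y)  # p(x) * p(y) for every combination
  nonzero <- p_xy > 0         # skip cells where p(x, y) = 0
  sum(p_xy[nonzero] * log(p_xy[nonzero] / p_indep[nonzero]))
}

# Joint probabilities of the example vectors x1 and y1:
# the combinations (1,1), (1,0), (0,1), (0,0) occur 4, 1, 1, 1 times out of 7.
p_xy <- matrix(c(4, 1,
                 1, 1), nrow = 2, byrow = TRUE) / 7
mi_from_joint(p_xy)
```

When the joint probabilities factor exactly as $p(x,y) = p(x)p(y)$, every log term is 0 and the function returns 0.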

Expressed in terms of the counts in the binary table, $\eqref{eq:1}$ becomes:

$$ \begin{align} \begin{split} I(X; Y) &= -\left( \entropfrac{a+b}{n} + \entropfrac{c+d}{n} \right)\newline & \quad -\left( \entropfrac{a+c}{n} + \entropfrac{b+d}{n} \right)\newline & \quad - \left( - \left( \entropfrac{a}{n} + \entropfrac{b}{n} + \entropfrac{c}{n} + \entropfrac{d}{n} \right) \right) \end{split} \label{eq:3} \end{align} $$

Equation $\eqref{eq:3}$ is mathematically equivalent to:

$$ \begin{align} \begin{split} I(X; Y) &= \frac{a}{n}\log\frac{na}{(a+b)(a+c)} + \frac{b}{n}\log\frac{nb}{(a+b)(b+d)}\newline & \quad + \frac{c}{n}\log\frac{nc}{(c+d)(a+c)} + \frac{d}{n}\log\frac{nd}{(c+d)(b+d)} \end{split} \label{eq:4} \end{align} $$
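
The agreement between the entropy form $\eqref{eq:3}$ and the per-cell form $\eqref{eq:4}$ can be checked numerically. The sketch below uses the counts of the example vectors (a = 4, b = c = d = 1). The helper `entropy` and the names `n_a` to `n_d` are illustrative, not part of any package.

```r
# Cell counts a, b, c, d of the binary table for the example vectors.
n_a <- 4; n_b <- 1; n_c <- 1; n_d <- 1
n <- n_a + n_b + n_c + n_d

# Entropy (natural log) of a vector of counts; empty cells contribute 0.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Entropy form: H(X) + H(Y) - H(X, Y)
mi_entropy <- entropy(c(n_a + n_b, n_c + n_d)) +
  entropy(c(n_a + n_c, n_b + n_d)) -
  entropy(c(n_a, n_b, n_c, n_d))

# Per-cell form: one log-ratio term for each cell of the table
mi_closed <- n_a / n * log(n * n_a / ((n_a + n_b) * (n_a + n_c))) +
  n_b / n * log(n * n_b / ((n_a + n_b) * (n_b + n_d))) +
  n_c / n * log(n * n_c / ((n_c + n_d) * (n_a + n_c))) +
  n_d / n * log(n * n_d / ((n_c + n_d) * (n_b + n_d)))

all.equal(mi_entropy, mi_closed)
```

Note that the per-cell expression as written here returns `NaN` if any cell count is 0, because R evaluates `0 * log(0)` as `NaN`. The `entropy` helper avoids this by dropping empty cells first.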

Example

We can use R to directly calculate the MI between two gene vectors mentioned above.

  1. Use base R functions

    x1 <- c(1, 0, 1, 1, 1, 1, 0)
    y1 <- c(0, 1, 1, 1, 1, 1, 0)
    table(x1, y1)
    #>    y1
    #> x1  0 1
    #>   0 1 1
    #>   1 1 4

    # calculate MI cell by cell; log() is the natural logarithm
    4/7 * log(28/25) + 1/7 * log(7/10) + 1/7 * log(7/10) + 1/7 * log(7/4)
    #> [1] 0.04279723

  2. Use the R package bioDist

    library('bioDist')
    mutualInfo(rbind(x1, y1))
    #>            x1
    #> y1 0.04279723

Update record

02/11/2016