$\newcommand{\entropfrac}[2]{\frac{#1}{#2} \log \left( \frac{#1}{#2} \right)}$
Mutual Information (MI) can be used to measure the distance between two gene vectors, for example $x_1 = (1, 0, 1, 1, 1, 1, 0)$ and $y_1 = (0, 1, 1, 1, 1, 1, 0)$. It is easy to summarize the two vectors in a binary (contingency) table:
X/Y | 1 (Presence) | 0 (Absence) | Sum |
---|---|---|---|
1 (Presence) | a | b | a+b |
0 (Absence) | c | d | c+d |
Sum | a+c | b+d | n = a+b+c+d |
Treating $x_1$ and $y_1$ as two discrete random variables, the mutual information between them is
$$ \begin{align} \begin{split} I(X;Y) &= H(X) + H(Y) - H(X,Y)\newline &= -\sum_{x \in \{0, 1\}}p(x)\log(p(x)) - \sum_{y \in \{0, 1\}}p(y)\log(p(y))\newline & \quad -\left( -\sum_{x \in \{0, 1\}}\sum_{y \in \{0, 1\}}p(x,y)\log(p(x,y)) \right) \end{split} \label{eq:1} \end{align} $$
Equation $\eqref{eq:1}$ is equivalent to
$$ \begin{equation} I(X;Y) = \sum_{x \in \{0, 1\}}\sum_{y \in \{0, 1\}} p(x,y) \log\left(\frac{p(x,y)}{p(x)p(y)}\right) \label{eq:2} \end{equation} $$
Here $p(x)$ is the probability that a symbol (0 or 1) appears in gene vector X, regardless of the corresponding symbol in gene vector Y; $p(y)$ is defined analogously. $p(x, y)$ is the probability that a symbol combination appears jointly in X and Y. In this example there are four possible combinations: $(1, 1)$, $(1, 0)$, $(0, 1)$ and $(0, 0)$.
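To make these probabilities concrete, here is a minimal R sketch that estimates them from the two vectors and evaluates $\eqref{eq:2}$ directly (the variable names `pxy`, `px` and `py` are ours):

```r
x1 <- c(1, 0, 1, 1, 1, 1, 0)
y1 <- c(0, 1, 1, 1, 1, 1, 0)
n  <- length(x1)

pxy <- table(x1, y1) / n   # joint probabilities p(x, y)
px  <- rowSums(pxy)        # marginal p(x)
py  <- colSums(pxy)        # marginal p(y)

# equation (2) with the natural logarithm; assumes no empty cell,
# otherwise the 0 * log(0) = 0 convention would be needed
sum(pxy * log(pxy / outer(px, py)))
#> [1] 0.04279723
```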
Expressed in terms of the binary table, $\eqref{eq:1}$ becomes:
$$ \begin{align} \begin{split} I(X; Y) &= -\left( \entropfrac{a+b}{n} + \entropfrac{c+d}{n} \right)\newline & \quad -\left( \entropfrac{a+c}{n} + \entropfrac{b+d}{n} \right)\newline & \quad - \left( - \left( \entropfrac{a}{n} + \entropfrac{b}{n} + \entropfrac{c}{n} + \entropfrac{d}{n} \right) \right) \end{split} \label{eq:3} \end{align} $$
Equation $\eqref{eq:3}$ is mathematically equivalent to:
$$ \begin{align} \begin{split} I(X; Y) &= \frac{a}{n}\log\frac{na}{(a+b)(a+c)} + \frac{b}{n}\log\frac{nb}{(a+b)(b+d)}\newline & \quad + \frac{c}{n}\log\frac{nc}{(a+c)(c+d)} + \frac{d}{n}\log\frac{nd}{(c+d)(b+d)} \end{split} \label{eq:4} \end{align} $$
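For illustration, $\eqref{eq:4}$ translates into a small helper function (a sketch; the name `mi_from_counts` is ours, and the natural logarithm is assumed here):

```r
# MI from the four cells of the binary table; a term whose count is
# zero evaluates to NaN (0 * -Inf) and is dropped, which implements
# the 0 * log(0) = 0 convention
mi_from_counts <- function(a, b, c, d) {
  n <- a + b + c + d
  terms <- c(a * log(n * a / ((a + b) * (a + c))),
             b * log(n * b / ((a + b) * (b + d))),
             c * log(n * c / ((a + c) * (c + d))),
             d * log(n * d / ((c + d) * (b + d))))
  sum(terms[is.finite(terms)]) / n
}

mi_from_counts(a = 4, b = 1, c = 1, d = 1)
#> [1] 0.04279723
```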
We can compute the MI between the two gene vectors directly in an R session, both by hand and with the bioDist package:
```r
x1 <- c(1, 0, 1, 1, 1, 1, 0)
y1 <- c(0, 1, 1, 1, 1, 1, 0)

# build the binary table, giving a = 4, b = 1, c = 1, d = 1
table(x1, y1)
#>    y1
#> x1  0 1
#>   0 1 1
#>   1 1 4

# calculate MI by plugging the counts into equation (4), natural log
4/7 * log(28/25) + 1/7 * log(7/10) + 1/7 * log(7/10) + 1/7 * log(7/4)
#> [1] 0.04279723

# the same value from the bioDist package
library('bioDist')
mutualInfo(rbind(x1, y1))
#>            x1
#> y1 0.04279723
```
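As a cross-check of the decomposition in $\eqref{eq:1}$, the same value can be recovered from the three entropy terms (a minimal sketch; the `entropy()` helper is ours, not part of bioDist):

```r
# Shannon entropy of a probability vector or table (natural log)
entropy <- function(p) -sum(p * log(p))

pxy <- table(x1, y1) / length(x1)
entropy(rowSums(pxy)) + entropy(colSums(pxy)) - entropy(pxy)
#> [1] 0.04279723
```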