ML/AI: Probability and statistics-V. Expectation and correlation

Expectation (also called mean or average) is computed as follows

    \[E[X]=\sum_{x:p(x)>0} x \, p(x)\]

To compute this for a discrete distribution, introduce the generating function

    \[G(t)=\sum_n p(n) \, t^n, \qquad E[n]=\sum_n n \, p(n)={d\over dt}G(t)\Big{|}_{t=1}\]

Binomial:

    \[G(t)=\Big(pt+(1-p)\Big)^n=\sum_{m=0}^n{n\choose m} p^m (1-p)^{n-m} \, t^m,\]

    \[E[n]={d\over dt}\Big(pt+(1-p)\Big)^n\Big{|}_{t=1}=np\]

Geometric:

    \[\wp\{X=n\}=p(1-p)^{n-1},\]

    \[G(t)=\sum_{n=1}^\infty p(1-p)^{n-1} \, t^n=pt \, \sum_{n=1}^\infty ((1-p)t)^{n-1}={pt\over 1-(1-p)t}\]

    \[E[n]=p\Big({1\over 1-(1-p)}\Big)+p\Big({1-p\over (1-(1-p))^2}\Big)={1\over p}\]

Note that \wp\{X=n\} is the probability that the first success occurs on the n^{th} try (failures on every try before it) for independent trials.

Poisson:

    \[\wp\{X=n\}={\lambda^n\over n!} \, e^{-\lambda}\]

    \[G(t)=\sum_{n=0}^\infty{\lambda^n t^n\over n!} \, e^{-\lambda}=e^{-\lambda +\lambda t}, \qquad E[n]={d\over dt}e^{-\lambda +\lambda t}\Big{|}_{t=1}=\lambda\]
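
The three means above can be checked symbolically; the following is a minimal sketch, assuming the sympy library is available:

    import sympy as sp

    t, p, lam = sp.symbols('t p lambda', positive=True)
    n = sp.symbols('n', positive=True, integer=True)

    # Binomial: G(t) = (p t + 1 - p)^n  ->  G'(1) = n p
    G_binom = (p*t + 1 - p)**n
    print(sp.simplify(sp.diff(G_binom, t).subs(t, 1)))    # n*p

    # Geometric: G(t) = p t / (1 - (1-p) t)  ->  G'(1) = 1/p
    G_geom = p*t/(1 - (1 - p)*t)
    print(sp.simplify(sp.diff(G_geom, t).subs(t, 1)))     # 1/p

    # Poisson: G(t) = exp(-lambda + lambda t)  ->  G'(1) = lambda
    G_pois = sp.exp(-lam + lam*t)
    print(sp.simplify(sp.diff(G_pois, t).subs(t, 1)))     # lambda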

For the continuous distributions we perform integrations to compute expectations.
Exponential:

    \[\wp\{X=x\}=\lambda \, e^{-\lambda x}, \qquad E[x]=\int_0^\infty \lambda \, x \, e^{-\lambda x} \, dx={1\over \lambda}\]

Beta:

    \[\wp\{X=x\}=\beta^{-1}(a,b) \, x^{a-1}(1-x)^{b-1}\]

    \[E[x]=\beta^{-1}(a,b)\int_0^1 x^{a}(1-x)^{b-1} \, dx={\beta(a+1,b)\over \beta(a,b)}={a\over a+b}\]

Normal:

    \[E[x]={1\over \sqrt{2\pi \sigma^2}}\int_{-\infty}^\infty x \, e^{-(x-\mu)^2/2\sigma^2} \, dx\]

    \[={1\over \sqrt{2\pi}}\int_{-\infty}^\infty(\sigma\xi+\mu) \, e^{-\xi^2/2} \, d\xi=\mu\]
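
These continuous expectations are easy to sanity-check by Monte Carlo; a sketch assuming numpy is available (the parameter values and sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000_000
    lam, a, b, mu, sigma = 2.0, 3.0, 5.0, 1.5, 0.7

    print(rng.exponential(1/lam, N).mean(), 1/lam)      # exponential: 1/lambda
    print(rng.beta(a, b, N).mean(), a/(a + b))          # beta: a/(a+b)
    print(rng.normal(mu, sigma, N).mean(), mu)          # normal: mu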

Expectations of functions y=g(x)

    \[E[g(x)]=E[y]=\int y \, f_{Y}(y) \, dy, \qquad F_Y(y)=\wp\{Y\le y\}\]

    \[=\wp\{g(x)\le y\}=\wp\{x\le g^{-1}(y)\}=\int_{-\infty}^{g^{-1}(y)} f_X(x) \, dx \qquad (g \mbox{ monotone increasing})\]

and of course

    \[f_Y(y)={dF_Y(y)\over dy}\]

but we have the Law of the Unconscious Statistician

    \[E[g(x)]=\int g(x) \, f_X(x) \, dx\]
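
In practice this is what lets us estimate E[g(x)] directly from samples of X, without ever constructing f_Y. A small sketch assuming numpy; the choice g(x)=x^2 with a standard normal X is arbitrary:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 1_000_000)     # samples of X ~ N(0, 1)

    g = lambda u: u**2
    print(g(x).mean())    # ~1.0, since E[X^2] = var(X) = 1 for a standard normal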

Note that the expectation is linear

    \[E[ax+b]=a\int x f_X(x) \, dx+b\int f_X(x) \, dx=a \, E[x]+b\]

and for sums of jointly distributed random variables

    \[E[x+y]=\int(x+y)f_{X,Y}(x,y) \, dx \, dy\]

    \[=\int x \Big(\int f_{X,Y}(x,y) \, dy\Big) \, dx+\int y \Big(\int f_{X,Y}(x,y) \, dx\Big) \, dy\]

    \[=\int x \, f_X(x) \, dx+\int y \, f_Y(y) \, dy=E[x]+E[y]\]

Variance
Variance (expectation of the squared deviation from expected value) is defined as

    \[var(X)=E[(X-\mu)^2]=E[X^2]-E^2[X]\]

    \[\mbox{Normal:} \quad var(X)=\sigma^2, \qquad \mbox{Poisson:} \quad var(X)=\lambda\]

    \[\mbox{Binomial:} \quad var(X)=np(1-p)\]

For example, for the binomial mass function

    \[E[X^2]=\sum_{m=0}^n {n\choose m} m^2 \, p^m \,(1-p)^{n-m}=\sum_{m=0}^n {n\choose m} m^2 \, p^m \,(1-p)^{n-m} \, e^{-mx}\Big{|}_{x=0}\]

    \[={d^2\over dx^2}\sum_{m=0}^n {n\choose m} \, p^m \,(1-p)^{n-m} \, e^{-mx}\Big{|}_{x=0}\]

    \[={d^2\over dx^2}\Big(1-p+p \, e^{-x}\Big)^n\Big{|}_{x=0}=np(np+1-p)\]

and for the Poisson

    \[E[x^2]=\sum_{m=0}^\infty m^2 {\lambda^m\over m!} \, e^{-\lambda}=\sum_{m=0}^\infty m^2 {\lambda^m\over m!} \, e^{-\lambda} \, e^{-mx}\Big{|}_{x=0}\]

    \[={d^2\over dx^2} \sum_{m=0}^\infty {\lambda^m \over m!} e^{-mx}\, e^{-\lambda }\Big{|}_{x=0}\]

    \[={d^2\over dx^2}\Big(e^{-\lambda}e^{\lambda e^{-x}}\Big)\Big{|}_{x=0}=\lambda(\lambda+1)\]
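
Both second-moment tricks can be verified symbolically; a sketch assuming sympy is available:

    import sympy as sp

    x, p, lam = sp.symbols('x p lambda', positive=True)
    n = sp.symbols('n', positive=True, integer=True)

    # Binomial: d^2/dx^2 (1 - p + p e^{-x})^n at x = 0  ->  n p (n p + 1 - p)
    EX2_binom = sp.diff((1 - p + p*sp.exp(-x))**n, x, 2).subs(x, 0)
    print(sp.simplify(sp.expand(EX2_binom)))

    # Poisson: d^2/dx^2 e^{-lambda} e^{lambda e^{-x}} at x = 0  ->  lambda (lambda + 1)
    EX2_pois = sp.diff(sp.exp(-lam)*sp.exp(lam*sp.exp(-x)), x, 2).subs(x, 0)
    print(sp.simplify(EX2_pois))

    # var = E[X^2] - E[X]^2: n p (1-p) for the binomial, lambda for the Poisson
    print(sp.simplify(EX2_binom - (n*p)**2), sp.simplify(EX2_pois - lam**2))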

The properties of the variance are simple:

    \[var(aX+b)=a^2 \, var(X)\]

Relevance to quantum computation
Expectations and unbiased estimators of the mean are important because of the
Law of large numbers: Let \{x_1,x_2,\cdots, x_n\} be a sample of i.i.d. random variables (independent repetitions of the same experiment); then the sample mean converges to the expectation,

    \[ \overline{x}_n={1\over n}\sum_{i=1}^n x_i\rightarrow E(x) \quad \mbox{as}\quad n\rightarrow\infty\]

The rate of convergence is governed by the variance of x.
Weak law of large numbers: for any specified margin \epsilon>0, as the sample grows the probability that the sample mean lies within the margin of the expectation tends to one,

    \[\lim_{n\rightarrow\infty} \wp(|\overline{x}_n-E(x)|>\epsilon)=0\]

Proof. Let var(x_i)=\sigma^2 for all i; then by independence

    \[var(\overline{x}_n)=var({1\over n}(x_1+\cdots+x_n))={1\over n^2}var(x_1+\cdots+x_n)={\sigma^2\over n}\]

Chebyshev’s inequality (a weak bound, see Ross)

    \[\wp(|x-E(x)|\ge k\sigma)\le {1\over k^2}, \quad k\quad \mbox{any positive real number}\]

then says

    \[\wp(|\overline{x}_n-E(x)|\ge \epsilon)\le {\sigma^2\over n\epsilon^2}, \qquad \wp(|\overline{x}_n-E(x)|< \epsilon)\ge 1-{\sigma^2\over n\epsilon^2}\rightarrow 1\]
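
A quick numerical illustration of both the convergence of the sample mean and the \sigma^2/n decay of its variance; a sketch assuming numpy (the normal population and parameter values are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma = 3.0, 2.0

    for n in (10, 100, 1_000):
        # 5000 independent sample means, each from a sample of size n
        means = rng.normal(mu, sigma, size=(5_000, n)).mean(axis=1)
        print(n, means.mean(), means.var(), sigma**2/n)   # empirical variance vs sigma^2/n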

Strong law of large numbers: the probability that the sample mean converges to the expectation as n\rightarrow\infty is one

    \[\wp\Big(\lim_{n\rightarrow\infty} \overline{x}_n=E(x)\Big)=1\]

To extract an answer from a quantum computer something must be measured, and the outcome of a quantum measurement is a random number, so the answer obtained is “correct” only with some probability. The strategy, since such computers can be very fast, is to repeat the experiment enough times to reach a high probability of obtaining a correct answer.
Suppose that n measurements x_1,x_2,\cdots,x_n are made (independently), each with a probability 1-p of being correct within some tolerance. If the outcome can be verified to be correct, as in the case of the factorization of a number, the probability that the algorithm must be run exactly k times in order to get a correct answer (within the acceptable tolerance) is

    \[\wp(k)=p^{k-1}(1-p)\]

(the number of repetitions needed is a discrete geometric random variable). The cumulative probability, the probability that we get a correct answer within n trials, is

    \[\wp_n=\sum_{k=1}^n \wp(k)={1-p\over p}\sum_{k=1}^np^k=\Big({1-p\over p}\Big)\Big({1-p^{n+1}\over 1-p}-1\Big)=1-p^n\]

In order to attain a correct answer with probability \sim 0.999 with an algorithm that produces a verifiably correct answer with probability 0.7, we need about 5 runs:

    \[1-0.3^5=0.99757\]
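
For reference, a plain-Python tabulation of the cumulative success probability 1-p^n (p=0.3 is the per-run failure probability used above):

    p = 0.3
    for n in range(1, 7):
        print(n, 1 - p**n)
    # n = 5 gives 0.99757 (roughly the 0.999 target); a strict 0.999 threshold needs n = 6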

If the answer can’t be verified, we make many measurements x_1,x_2,\cdots,x_n and work under the assumption that their average is an estimator of the “correct” value x. Apply an alpha-trimmed mean filter, as in digital signal processing (DSP), to get rid of outliers (a code sketch follows the list below):

\bullet Place a window over the elements;
\bullet Pick up the elements in the window;
\bullet Order the elements;
\bullet Discard elements at the beginning and at the end of the ordered set;
\bullet Take the unweighted average: sum the remaining elements and divide by their number.
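
A minimal sketch of such a filter, assuming numpy is available (scipy.stats.trim_mean performs the same operation):

    import numpy as np

    def alpha_trimmed_mean(x, alpha):
        """Sort x, drop the smallest and largest floor(alpha*len(x)) elements,
        and average what remains."""
        x = np.sort(np.asarray(x, dtype=float))
        k = int(alpha*len(x))                    # number trimmed from each end
        trimmed = x[k:len(x) - k] if k > 0 else x
        return trimmed.mean()

    # example: a few outliers among otherwise consistent measurements
    samples = [1.02, 0.98, 1.01, 0.99, 1.00, 5.7, -3.2, 1.03, 0.97, 1.01]
    print(alpha_trimmed_mean(samples, alpha=0.2))   # close to 1.0
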
So in our case we order the set \{x_1,x_2,\cdots, x_n\} from smallest to largest and discard the first \alpha n elements and the last \alpha n elements. The distribution of the number of correct results among the n measurements is binomial

    \[\wp(k)={n\choose k}p^{n-k}(1-p)^k\approx {1\over \sqrt{2\pi np(1-p)}}e^{-{(k-n(1-p))^2\over 2np(1-p)}}\]

so the probability that more than n-2n\alpha of the measurements are within tolerance is

    \[\wp\approx \int_{n-2n\alpha}^n {1\over \sqrt{2\pi np(1-p)}}e^{-{(k-n(1-p))^2\over 2np(1-p)}} dk\]

    \[=1-\int_{-\infty}^{n-2n\alpha} {1\over \sqrt{2\pi np(1-p)}}e^{-{(k-n(1-p))^2\over 2np(1-p)}} dk\]

    \[=1-{1\over \sqrt{2\pi}}\int_{-\infty}^{{np-2n\alpha\over \sqrt{np(1-p)}}} e^{-\xi^2/2}d\xi=1-\Phi\Big({np-2n\alpha\over \sqrt{np(1-p)}}\Big)\]

with

    \[\Phi(x)={1\over \sqrt{2\pi}}\int_{-\infty}^x e^{-\xi^2/2}d\xi={1\over 2}\Big(1+erf(x/\sqrt{2})\Big)\]

for which \Phi(-x)=1-\Phi(x). To obtain a correct answer with 0.999 probability from a quantum algorithm in this case, with 0.7 probability of a correct result per run (p=0.3) and \alpha=0.2, we must have

    \[3\approx {2n\alpha-np\over \sqrt{np(1-p)}}, \qquad n\approx {9\,p(1-p)\over (2\alpha-p)^2}\approx 189\]

so we run the algorithm about 190 times (0.9986={1\over 2}(1+erf(3/\sqrt{2}))).
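
As a numerical cross-check of this estimate, a sketch assuming scipy is available (the particular n values printed are arbitrary):

    import math
    from scipy.stats import binom

    p, alpha = 0.3, 0.2          # per-run failure probability, trim fraction

    # normal-approximation estimate n = 9 p (1-p) / (2 alpha - p)^2 from above
    print(9*p*(1 - p)/(2*alpha - p)**2)             # ~189

    # exact binomial probability that at least n - 2*alpha*n of n runs are correct
    for n in (50, 100, 189, 250):
        k_needed = math.ceil(n*(1 - 2*alpha))       # correct runs required
        print(n, binom.sf(k_needed - 1, n, 1 - p))  # P(K >= k_needed), K ~ Bin(n, 1-p)
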
Moment generators
In many examples of counting problems used to find the size of a sample space, or to construct a simple means of computing probabilities or expectations, we used generating functions such as

    \[G(t)=\sum_k \wp\{X=k\} \, t^k\]

which is a standard combinatorial method. Ross’s section 7.7 introduces such a gadget as the generating function for moments

    \[\phi(t)=\sum_k \wp\{X=k\} \, e^{\pm kt}\equiv E[e^{\pm tX}],\]

    \[\phi(t)=\int_{-\infty}^\infty f_X(x) \, e^{\pm tx} \, dx\equiv E[e^{\pm tX}]\]

(we select the sign needed to make the sum or integral converge). Then

    \[{d^m\over dt^m}\phi(t)={d^m\over dt^m}E[e^{\pm tX}]=E\big[{d^m\over dt^m}e^{\pm tX}\big]\]

    \[=E\big[(\pm X)^m \, e^{\pm tX}\big], \qquad E\big[(\pm X)^m\big]=E\big[(\pm X)^m \, e^{\pm tX}\big]_{t=0}\]

This expectation is called the m^{th} moment of f_X(x). The (suitably normalized) m=3 central moment is called the skewness, and the m=4 central moment the kurtosis of the distribution. A table of moment generating functions is given on pages 223-229 of Ross; it is very useful in the problems, and you should use it to compute variances.
Bernoulli.

    \[\phi(t)=(1-p)\,e^{0 \cdot t}+p \, e^{1\cdot t}=1-p+p\,e^t\]

Binomial.

    \[\phi(t)=\sum_{k=0}^n {n\choose k} (p^k e^{kt})(1-p)^{n-k}=\Big(p \, e^{t}+(1-p)\Big)^n\]

Poisson

    \[\phi(t)=\sum_{k=0}^\infty {\lambda^k\over k!} e^{kt} e^{-\lambda}=e^{-\lambda}e^{\lambda e^{t}}\]

Exponential

    \[\phi(t)=\int_0^\infty \lambda e^{tx}e^{-\lambda x} \, dx={\lambda\over \lambda-t}, \qquad t<\lambda\]

Normal

    \[\phi(t)={1\over \sqrt{2\pi\sigma^2}} \, \int_{-\infty}^\infty e^{tx} \, e^{-(x-\mu)^2/2\sigma^2} \, dx\]

    \[={1\over \sqrt{2\pi\sigma^2}} \, \int_{-\infty}^\infty e^{-(x^2-2x(\sigma^2 t+\mu)+\mu^2)/2\sigma^2} \, dx=e^{\sigma^2 t^2/2+\mu t}\]
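
Moments follow from differentiating these at t=0; a symbolic sketch (assuming sympy) for the exponential and normal cases:

    import sympy as sp

    t, lam, mu, sigma = sp.symbols('t lambda mu sigma', positive=True)

    # Exponential: phi(t) = lambda/(lambda - t)
    phi_exp = lam/(lam - t)
    print(sp.diff(phi_exp, t).subs(t, 0))                 # E[X]   = 1/lambda
    print(sp.diff(phi_exp, t, 2).subs(t, 0))              # E[X^2] = 2/lambda^2

    # Normal: phi(t) = exp(sigma^2 t^2/2 + mu t)
    phi_norm = sp.exp(sigma**2*t**2/2 + mu*t)
    print(sp.diff(phi_norm, t).subs(t, 0))                # E[X]   = mu
    print(sp.expand(sp.diff(phi_norm, t, 2).subs(t, 0)))  # E[X^2] = sigma^2 + mu^2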

Properties
If X and Y are independent

    \[\phi_{X+Y}(t)=E\big[e^{t(X+Y)}\big]=E\big[e^{tX}\big]E\big[e^{tY}\big]=\phi_X(t) \, \phi_Y(t)\]

This generalizes to sums Y=\sum_{i=1}^n X_i of independent random variables (the second form applies when the X_i are identically distributed as well)

    \[\phi_Y(t)=\prod_i \, \phi_{X_i}(t), \qquad \phi_Y(t)=\Big(\phi_X(t)\Big)^n\]

The generating function lets us say something sensible about the Cauchy distribution: when it was introduced in the text it was shown that you can’t compute its expectation by standard methods, but consider

    \[\phi(it)=\int_{-\infty}^\infty e^{itx} {dx\over \pi(1+(x-\theta)^2)}\]

    \[=\int_{-\infty}^\infty e^{itx} {dx\over \pi(x-\theta+i)(x-\theta-i)}=e^{-|t|+it\theta}\]

(closing the contour and using Cauchy’s residue theorem from complex variables), which converges for every real t. The ordinary moments still do not exist, because e^{-|t|} is not differentiable at t=0, but the symmetric (principal-value) derivative is well defined

    \[i \, E_{PV}[X]={1\over 2}\Big({d\over dt}\phi(it)\Big{|}_{t\rightarrow 0^+}+{d\over dt}\phi(it)\Big{|}_{t\rightarrow 0^-}\Big)=i\theta\]

which in fact should be obvious from a graph of the distribution: it is symmetric about and sharply peaked at x\approx \theta, so \theta is its median (and principal-value mean). If you need to work with the Cauchy distribution, this characteristic function lets you do it.
You can see that the moment generating function is either a Laplace or Fourier transformation of the distribution function

    \[\phi(-t)=\mathcal{L}[f](t), \quad \mbox{or}\quad \phi(it)=\mathcal{F}[f](t)\]

which are invertible transformations (as you discover in your DEQ courses). If you are given \phi(t), you can recover f(x) by inverting the transform. If you are given only the integer moments E[X^m]={d^m\phi\over dt^m}\Big{|}_{t=0}, you {\bf cannot in general} reconstruct f(x) from them unless, for example, f(x) has finite support: f(x)>0 only for a\le x\le b with a,b finite. This is the classical moment problem.
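
For instance, the exponential density can be recovered from its generating function by an inverse Laplace transform; a sketch assuming sympy is available:

    import sympy as sp

    s, x, lam = sp.symbols('s x lambda', positive=True)

    phi_minus = lam/(lam + s)     # phi(-s) = E[e^{-sX}] for the exponential
    f = sp.inverse_laplace_transform(phi_minus, s, x)
    print(f)                      # lambda*exp(-lambda*x), the exponential density
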
Correlation
Recall the definitions of variance (standard deviation squared) and covariance

    \[var(X)=E[X^2]-E^2[X]=E\big[(X-E[X])^2\big],\]

    \[cov(X,Y)=E[XY]-E[X]E[Y]\]

and correlation

    \[cor(X,Y)=\rho(X,Y)={cov(X,Y)\over \sqrt{var(X) \cdot var(Y)}}, \qquad -1\le \rho(X,Y)\le 1\]

\rho(X,Y) measures the extent to which Y is a linear function Y=aX+b of X: for \rho near +1 an increase in X accompanies an increase in Y, and for \rho near -1 an increase in X accompanies a decrease in Y.

Example (Ross theoretical exercise 7.20). Let Y=aX+b, then

    \[E[XY]=E[aX^2+bX]=aE[X^2]+bE[X], \quad E[Y]=aE[X]+b\]

    \[cov(X,Y)=aE[X^2]+bE[X]-\Big(aE[X]+b\Big) E[X]=a \, var(X)\]

    \[var(Y)=E[a^2X^2+2abX+b^2]-\Big(aE[X]+b\Big)^2\]

    \[=a^2E[X^2]+2abE[X]+b^2-\Big(a^2E^2[X]+2abE[X]+b^2\Big)=a^2 \, var(X)\]

therefore

    \[\rho(X,Y)={a \, var(X)\over \sqrt{var(X)\cdot a^2\, var(X)}}={a\over |a|}=\pm 1\]

For two variables X,Y, if \rho(X,Y)\approx \pm 1 then they are strongly linearly related (Y an increasing (+1) or decreasing (-1) function of X); if \rho(X,Y)\approx 0 they are not so related. In statistical and quantum physics we call the covariance itself the correlation function (we don’t use the denominator): \langle XY\rangle=cov(X,Y).
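
A numerical illustration of these statements, assuming numpy (the constants a, b and the sample sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100_000)

    for a, b in [(2.0, 1.0), (-0.5, 3.0)]:
        y = a*x + b
        print(a, np.corrcoef(x, y)[0, 1])    # +1 for a > 0, -1 for a < 0

    z = rng.normal(size=100_000)             # independent of x
    print(np.corrcoef(x, z)[0, 1])           # ~ 0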

An important example
For two jointly distributed variables

    \[E[XY]=\sum_{x,y} xy \, \wp_{X,Y}\{X=x,Y=y\}\]

and if they are Bernoulli, then x=0,1 and y=0,1 so

    \[E[XY]=\wp_{X,Y}\{X=1,Y=1\}=\wp_{X|Y}\{X=1|Y=1\}\wp_Y\{Y=1\}\]

and of course in the previous (AB-pairing) example we saw how the pool is diminished by two with each pairing, and if a pair is an AB-pair, the pool of B-types is diminished as well, so for our pairing problem

    \[E[X_i X_j]=\Big({N_1-1\over N_1+N_2-2}\Big)\Big({N_1\over N_1+N_2}\Big)\]

    \[cov(X_i,X_j)=\Big({N_1-1\over N_1+N_2-2}\Big)\Big({N_1\over N_1+N_2}\Big)-\Big({N_1\over N_1+N_2}\Big)^2\]

    \[={N_1(N_1-N_2)\over (N_1+N_2)^2(N_1+N_2-2)}\]

and we can use this to get

    \[var(X)=var(\sum_i X_i)=N_2\Big({N_1\over N_1+N_2}\Big)\Big(1-{N_1\over N_1+N_2}\Big)\]

    \[+2{N_2\choose 2}\Big({N_1(N_1-N_2)\over (N_1+N_2)^2(N_1+N_2-2)}\Big)\]

Another important example

    \begin{eqnarray*}
    cov(a+bX, c+dY)&=&E[ac+ad\, Y+bc\, X+bd\, XY]-\Big(a+b E[X]\Big)\Big(c+d E[Y]\Big)\\
    &=&ac+ad\, E[Y]+bc\, E[X]+bd\, E[XY]-\Big(ac+ad\, E[Y]+ bc\, E[X]+bd\, E[X]E[Y]\Big)\\
    &=&bd \Big(E[XY]-E[X]E[Y]\Big)=bd \, cov(X,Y)\\
    cov(X+Y,Z)&=&E[XZ+YZ]-E[X+Y]E[Z]\\
    &=&E[XZ]+E[YZ]-E[X]E[Z]-E[Y]E[Z]\\
    &=&cov(X,Z)+cov(Y,Z)
    \end{eqnarray*}

This linearity extends

    \[cov(\sum_i X_i, Z)=\sum_i cov(X_i,Z), \qquad cov(X,\sum_j Z_j)=\sum_j cov(X,Z_j)\]

and substituting one into the other gives

    \[cov(\sum_i X_i, \sum_j Z_j)=\sum_i\sum_j cov(X_i,Z_j)\]

Ross problem 7.32
Let X_i be the indicator that die roll i produces a 1 and Z_i the indicator that die roll i produces a 2; then \wp\{X_i=1\}=1/6=\wp\{Z_i=1\}=p, both are Bernoulli variables, and X_i and Z_j are independent (uncorrelated) if i\ne j.

    \[E[X_i Z_i]=0, \qquad E[X_i Z_j]=E[X_i]E[Z_j] \quad (i\ne j), \quad\implies\quad cov(X_i,Z_i)=0-E[X_i]E[Z_i]\]

    \[cov(X_i, Z_j)=0, \quad i\ne j\]

    \[cov(\sum_i X_i, \sum_j Z_j)=\sum_i cov(X_i,Z_i)=n\, cov(X_1, Z_1)\]

    \[=-n \, E[X_1]E[Z_1]=-np^2=-{n\over 6^2}\]

This is a very good example of simply (and negatively) correlated variables.
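
A Monte Carlo check of this covariance, assuming numpy (n and the number of trials are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    n, trials = 10, 200_000

    rolls = rng.integers(1, 7, size=(trials, n))   # n die rolls per trial
    ones = (rolls == 1).sum(axis=1)                # sum_i X_i
    twos = (rolls == 2).sum(axis=1)                # sum_j Z_j

    print(np.cov(ones, twos)[0, 1], -n/36)         # sample covariance vs -n/36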
