We have a sample space, \(\Omega \) consisting of elementary events.

All elementary events are disjoint sets.

We have a \(\sigma\)-algebra over \(\Omega \) called \(F\). A \(\sigma\)-algebra over a set is a collection of its subsets that contains the set itself and is closed under complement and countable unions. The power set is an example.

All events \(E\) are subsets of \(\Omega\)

\(\forall E\in F\ [E\subseteq \Omega ]\)

Events are mutually exclusive if they are disjoint sets.

For each event \(E\), there is a complementary event \(E^C\) such that:

\(E\lor E^C=\Omega\)

\(E\land E^C=\varnothing\)

The complementary event exists by construction, because \(F\) is closed under complement.

As events are sets, we can apply set operations to them. For example, for two events \(E_i\) and \(E_j\) we can define:

\(E_i\land E_j\)

\(E_i\lor E_j\)

For all events \(E\) in \(F\), the probability function \(P\) is defined.

This gives us the following probability space:

\((\Omega, F, P)\)

First axiom

The probability of all events is a non-negative real number.

\(\forall E \in F [(P(E)\ge 0)\land (P(E)\in \mathbb{R})]\)

Second axiom

The probability that one of the elementary events occurs is \(1\).

The probability of the outcome set is \(1\).

\(P(\Omega )=1\)

Third axiom

The probability of the union of countably many mutually exclusive events is the sum of their probabilities:

\(P(\cup^\infty_{i=1}E_i)=\sum_{i=1}^\infty P(E_i)\)

We can use these axioms to show that \(P(\varnothing )=0\). We know that:

\(P(\Omega )=1\)

As \(\Omega =\Omega \lor \varnothing \):

\(P(\Omega \lor \varnothing )=1\)

\(\Omega \) and \(\varnothing \) are disjoint, so by the third axiom:

\(P(\Omega )+P(\varnothing )=1\)

\(P(\varnothing )=0\)

Now consider \(E_i\subseteq E_j\). We can write:

\(E_j=E_i\lor E_k\), where \(E_k=E_j\land E_i^C\)

\(P(E_j)=P(E_i\lor E_k)\)

\(E_i\) and \(E_k\) are disjoint, so by the third axiom:

\(P(E_j)=P(E_i)+P(E_k)\)

We know that \(P(E_k)\ge 0\) from axiom \(1\) so:

\(P(E_j)\ge P(E_i)\)

As all events are subsets of the sample space:

\(P(\Omega )\ge P(E)\)

\(1\ge P(E)\)

From axiom \(1\) we then know:

\(\forall E\in F [0\le P(E)\le 1]\)
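As a minimal numerical sketch of these axioms and bounds, we can construct an explicit finite probability space (a fair six-sided die, with \(F\) taken to be the power set):

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space for a fair six-sided die; each elementary event has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p_elem = {w: Fraction(1, 6) for w in omega}

def P(event):
    """Probability measure: sum of the elementary probabilities in the event."""
    return sum(p_elem[w] for w in event)

# F is the power set of omega (the largest possible sigma-algebra over omega).
F = [set(s) for s in chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))]

assert P(omega) == 1                      # second axiom
assert P(set()) == 0                      # derived: P(empty set) = 0
assert all(0 <= P(E) <= 1 for E in F)     # first axiom plus the derived upper bound
assert all(P(E) <= P(omega) for E in F)   # monotonicity: every event is a subset of omega
print("All", len(F), "events satisfy 0 <= P(E) <= 1")
```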

Some further identities follow directly:

\(P(E\land \varnothing )=P(\varnothing )=0\)

\(P(E\lor \Omega )=P(\Omega )=1\)

\(P(E\lor \varnothing)=P(E)\)

\(P(E\land \Omega )=P(E)\)

We now derive the separation rule. Firstly:

\(P(E_i)=P(E_i\land \Omega)\)

\(P(E_i)=P(E_i\land (E_j\lor E_j^C))\)

\(P(E_i)=P((E_i\land E_j)\lor (E_i\land E_j^C))\)

As the latter are disjoint:

\(P(E_i)=P(E_i\land E_j)+P(E_i\land E_j^C)\)

We know that:

\(P(E_i\lor E_j)=P((E_i\lor E_j)\land (E_j\lor E_j^C))\)

By the distributive law of sets:

\(P(E_i\lor E_j)=P((E_i\land E_j^C)\lor E_j)\)

\(P(E_i\lor E_j)=P((E_i\land E_j^C)\lor (E_j\land (E_i\lor E_i^C)))\)

By the distributive law of sets:

\(P(E_i\lor E_j)=P((E_i\land E_j^C)\lor (E_j\land E_i)\lor (E_j\land E_i^C))\)

As these are disjoint:

\(P(E_i\lor E_j)=P(E_i\land E_j^C)+ P(E_j\land E_i)+P(E_j\land E_i^C)\)

From the separation rule:

\(P(E_i\lor E_j)=P(E_i)-P(E_i\land E_j)+ P(E_j\land E_i)+P(E_j)-P(E_j\land E_i)\)

\(P(E_i\lor E_j)=P(E_i)+P(E_j)-P(E_i\land E_j)\)
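A quick numerical check of this addition rule on the die space above (the particular events chosen are arbitrary):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = lambda E: Fraction(len(E), len(omega))   # uniform measure on a fair die

E_i = {2, 4, 6}        # "even"
E_j = {1, 2, 3}        # "at most 3"

lhs = P(E_i | E_j)                       # probability of the union
rhs = P(E_i) + P(E_j) - P(E_i & E_j)     # addition rule
assert lhs == rhs
print(lhs)  # 5/6
```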

From the addition rule:

\(P(E_i\lor E_j)=P(E_i)+P(E_j)-P(E_i\land E_j)\)

Consider \(E\) and \(E^C\):

\(P(E\lor E^C)=P(E)+P(E^C)-P(E\land E^C)\)

We know that \(E\) and \(E^C\) are disjoint, that is:

\(E\land E^C=\varnothing\)

Similarly by construction:

\(E\lor E^C=\Omega \)

So:

\(P(\Omega )=P(E)+P(E^C)-P(\varnothing)\)

\(1=P(E)+P(E^C)\)

We define conditional probability as:

\(P(E_i|E_j):=\dfrac{P(E_i\land E_j)}{P(E_j)}\)

We can show this is between \(0\) and \(1\).

\(P(E_j)=P(E_i\land E_j)+P(E_i^C\land E_j)\)

\(P(E_i|E_j)=\dfrac{P(E_i\land E_j)}{ P(E_i\land E_j)+P(E_i^C\land E_j)}\)

Both terms in the denominator are non-negative, so this ratio lies between \(0\) and \(1\).

We know:

\(P(x_i|y_j):=\dfrac{P(x_i \land y_j)}{P(y_j)}\)

\(P(y_j|x_i):=\dfrac{P(x_i \land y_j)}{P(x_i)}\)

So:

\(P(x_i|y_j)P(y_j)=P(y_j|x_i) P(x_i)\)

\(P(x_i|y_j)=\dfrac{P(y_j|x_i) P(x_i)}{P(y_j)}\)

Note that this is undefined when \(P(y_j)=0\)
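As a numerical sketch of this rearranged form, consider a hypothetical diagnostic test; the prevalence, sensitivity and false positive rate below are invented for illustration:

```python
# Hypothetical diagnostic test: x = "has condition", y = "tests positive".
p_x = 0.01                    # prior P(x): prevalence
p_y_given_x = 0.95            # P(y | x): sensitivity
p_y_given_not_x = 0.05        # P(y | not x): false positive rate

# Marginal P(y) via the law of total probability.
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' theorem: P(x | y) = P(y | x) P(x) / P(y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # about 0.161
```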

Note that for two different outcomes \(x_i\) and \(x_j\) of the same variable, which are mutually exclusive,

\(P(x_i|x_j)=\dfrac{P(x_i\land x_j)}{P(x_j)}\)

\(P(x_i|x_j)=0\)

For the same outcome:

\(P(x_i|x_i)=\dfrac{P(x_i\land x_i)}{P(x_i)}\)

\(P(x_i|x_i)=\dfrac{P(x_i)}{P(x_i)}\)

\(P(x_i|x_i)=1\)

From the definition of conditional probability we know that:

\(P(E_i|E_j):=\dfrac{P(E_i\land E_j)}{P(E_j)}\)

\(P(E_j|E_i):=\dfrac{P(E_i\land E_j)}{P(E_i)}\)

So:

\(P(E_i\land E_j)=P(E_i|E_j)P(E_j)\)

\(P(E_i\land E_j)=P(E_j|E_i)P(E_i)\)

So:

\(P(E_i|E_j)P(E_j)=P(E_j|E_i)P(E_i)\)

Events are independent if:

\(P(E_i|E_j)=P(E_i)\)

Note that:

\(P(E_i\land E_j)=P(E_i|E_j)P(E_j)\)

And so for independent events:

\(P(E_i\land E_j)=P(E_i)P(E_j)\)

If the prior \(P(\theta)\) and the posterior \(P(\theta | X)\) are in the same family of distributions (eg both Gaussian), then they are called conjugate distributions, and the prior is a conjugate prior for the likelihood.
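A standard example (not derived here) is a Beta prior with a binomial likelihood: the posterior is again a Beta distribution, so the pair is conjugate. A minimal sketch of the parameter update, with illustrative numbers:

```python
# Beta(a, b) prior on theta, binomial likelihood with k successes in n trials.
# Conjugacy: the posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0          # prior parameters (illustrative)
n, k = 10, 7             # observed data: 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)
prior_mean = a / (a + b)
posterior_mean = a_post / (a_post + b_post)
print(prior_mean, posterior_mean)   # 0.5 -> about 0.643
```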

For an event \(E\), the odds of the event are defined as:

\(o_f=\dfrac{P(E)}{P(E^C)}\)

For example, the odds of rolling a \(6\) with a fair die are \(\dfrac{1/6}{5/6}=\dfrac{1}{5}\).

We know that:

\(\sum_yP(X\land Y=y)=P(X)\)

So for the continuous case, using density functions:

\(f_X(x)=\int_{-\infty }^{\infty }f_{X,Y}(x,y)dy\)

The result behaves like the distribution of a single variable or, if there were more than two variables to start with, like a joint distribution over one fewer variable.
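A sketch of discrete marginalisation over a small joint table; the joint probabilities are invented for illustration:

```python
# Joint distribution P(X = x and Y = y) as a dictionary.
joint = {
    ("sunny", "hot"): 0.3, ("sunny", "mild"): 0.2,
    ("rainy", "hot"): 0.1, ("rainy", "mild"): 0.4,
}

# Marginalise out Y by summing over its values.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

print(p_x)  # {'sunny': 0.5, 'rainy': 0.5}
```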

We have a sample space, \(\Omega \). A random variable \(X\) is a mapping from the sample space to the real numbers:

\(X: \Omega \rightarrow \mathbb{R}\)

The random variable then lets us define subsets of \(\Omega \). As an example, take a coin toss and a die roll. The sample space is:

\(\{H1,H2,H3,H4,H5,H6,T1,T2,T3,T4,T5,T6\}\)

A random variable could give us just the die value, such that:

\(X(H1)=X(T1)=1\)

We can define this more precisely using set-builder notation, by saying the following is defined for all \(c\in \mathbb{R}\):

\(\{\omega |X(\omega )\le c\}\)

That is, for any real number \(c\) and random variable \(X\), there is a corresponding subset of \(\Omega \) containing the \(\omega \)s in \(\Omega \) which map to a value of at most \(c\). For \(X\) to be a random variable, each such set must be an event in \(F\).
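A sketch of this construction for the coin-and-die example, with \(X\) returning the die value and \(c=2\):

```python
from itertools import product

# Sample space for one coin toss and one die roll.
omega = [c + str(d) for c, d in product("HT", range(1, 7))]   # ['H1', ..., 'T6']

def X(w):
    """Random variable: map each outcome to its die value."""
    return int(w[1:])

# The event {omega : X(omega) <= c} for c = 2.
c = 2
event = {w for w in omega if X(w) <= c}
print(sorted(event))   # ['H1', 'H2', 'T1', 'T2']
```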

Multiple variables can be defined on the same sample space. If we rolled a die we could define variables for:

Whether it was odd/even

Number on the die

Whether it was less than 3

With more dice we could add even more variables.

If we define a variable \(X\), we can also define another variable \(Y=X^2\).

\(P(X=x)=P(\{\omega |X(\omega)=x\})\)

For a discrete random variable this is a useful quantity: for example, for a fair die, \(P(X=1)=\dfrac{1}{6}\).

This is not helpful for continuous probability, where the chance of any specific outcome is \(0\).

Random variables are all real-valued, and so we can write:

\(P(X\le x)=P(\{\omega |X(\omega)\le x\})\)

Or, writing \(F_X\) for the cumulative distribution function, in the continuous case:

\(F_X(x)=\int_{-\infty}^x f_X(u)du\)

And in the discrete case:

\(F_X(x)=\sum_{x_i\le x}P(X=x_i) \)

\(P(X\le x)+P(X\ge x)-P(X=x)=1\)

\(P(a< X\le b)=F_X(b)-F_X(a)\)

If continuous, probability at any point is \(0\). We instead look at probability density.

The density is derived from the cumulative distribution function:

\(F_X(x)=\int_{-\infty}^x f_X(u)du\)

The density function is \(f_X(x)\).

For probability mass functions:

\(P(Y=y|X=x)=\dfrac{P(Y=y\land X=x)}{P(X=x)}\)

For probability density functions:

\(f_Y(y|X=x)=\dfrac{f_{X,Y}(x,y)}{f_X(x)}\)

The joint probability of two variables is written \(P(X=x\land Y=y)\). Marginalising over \(Y\):

\(P(X=x)=\sum_{y}P(X=x\land Y=y)\)

\(P(X=x)=\sum_{y}P(X=x|Y=y)P(Y=y)\)

\(x\) is independent of \(y\) if:

\(\forall x_i \in x,\forall y_j \in y\ [P(x_i|y_j)=P(x_i)]\)

If \(P(x_i|y_j)=P(x_i)\) then:

\(P(x_i\land y_j)=P(x_i).P(y_j)\)

This logic extends beyond just two events. If the events are mutually independent then:

\(P(x_i\land y_j \land z_k)=P(x_i).P(y_j \land z_k)=P(x_i).P(y_j).P(z_k)\)

Note that because:

\(P(x_i|y_j)=\dfrac{P(x_i\land y_j)}{P(y_j)}\)

If two variables are independent

\(P(x_i|y_j)=\dfrac{P(x_i)P(y_j)}{P(y_j)}\)

\(P(x_i|y_j)=P(x_i)\)

Events \(A\) and \(B\) are conditionally independent given \(X\) if:

\(P(A\land B|X)=P(A|X)P(B|X)\)

This is equivalent (when \(P(B\land X)>0\)) to:

\(P(A|B\land X)=P(A|X)\)

A functional \(\phi (P)\in \mathbb{R} \) maps a distribution \(P(X)\) to a real number.

Examples include the expectation and variance.

We can define derivatives on these functionals.

\(\phi (P)\approx \phi (P^0)+D_\phi (P-P^0)\)

Where \(D_\phi \) is linear.

For a random variable (or vector of random variables) \(x\), we define the expected value of \(f(x)\) as:

\(E[f(x)]:=\sum_i f(x_i) P(x_i)\)

The expected value of the random variable \(x\) itself is the case \(f(x)=x\):

\(E(x)=\sum_i x_i P(x_i)\)

We can show that \(E(x+y)=E(x)+E(y)\):

\(E[x+y]=\sum_i \sum_j (x_i+y_j) P(x_i \land y_j)\)

\(E[x+y]=\sum_i \sum_j x_i [P(x_i \land y_j)]+\sum_i \sum_j [y_j P(x_i \land y_j)]\)

\(E[x+y]=\sum_i x_i \sum_j [P(x_i \land y_j)]+\sum_j y_j \sum_i [P(x_i \land y_j)]\)

\(E[x+y]=\sum_i x_i P(x_i)+\sum_j y_j P(y_j)\)

\(E[x+y]=E[x]+E[y]\)
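A quick numerical check of this identity on a small, arbitrary joint distribution (the two variables need not be independent):

```python
# Joint distribution P(x_i and y_j) over a few values of each variable.
joint = {
    (1, 10): 0.2, (1, 20): 0.1,
    (2, 10): 0.3, (2, 20): 0.4,
}

E_xy = sum((x + y) * p for (x, y), p in joint.items())       # E[x + y]
E_x = sum(x * p for (x, _), p in joint.items())              # E[x] from the marginal
E_y = sum(y * p for (_, y), p in joint.items())              # E[y] from the marginal

print(E_xy, E_x + E_y)   # both approximately 16.7
```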

Expectations

\(E(cx)=\sum_i cx_i P(x_i)\)

\(E(cx)=c\sum_i x_i P(x_i)\)

\(E(cx)=cE(x)\)

\(E(c)=\sum_i c P(x_i)\)

\(E(c)= c\sum_i P(x_i)\)

\(E(c)= c\)

If \(Y\) is a variable we are interested in understanding, and \(X\) is a vector of other variables, we can create a model for \(Y\) given \(X\).

This is the conditional expectation.

\(E[Y|X]\)

In the discrete case this is

\(E[Y|X]=\sum_y yP(y|X)\)

In the continuous case this is

\(E(Y|X)=\int_{-\infty }^{\infty }yP(y|X)dy\)

We can then identify an error vector.

\(\epsilon :=Y-E(Y|X)\)

So:

\(Y=E(Y|X)+\epsilon \)

Here \(Y\) is called the dependent variable, and \(X\) is called the independent variable.

\(E[E[Y]]=E[Y]\)

\(E[E[Y|X]]=E[Y]\)
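A simulation sketch of the second identity, the law of iterated expectations, assuming a made-up model in which \(Y\) is a die roll plus coin-flip noise:

```python
import random

random.seed(0)

def sample_xy():
    """Hypothetical model: X is a fair die, and Y is X plus coin-flip noise."""
    x = random.randint(1, 6)
    y = x + random.choice([0, 1])
    return x, y

samples = [sample_xy() for _ in range(200_000)]

# E[Y] directly.
e_y = sum(y for _, y in samples) / len(samples)

# E[E[Y|X]]: average Y within each X group, then average over the distribution of X.
by_x = {}
for x, y in samples:
    by_x.setdefault(x, []).append(y)
e_e_y_given_x = sum(len(ys) / len(samples) * (sum(ys) / len(ys)) for ys in by_x.values())

print(round(e_y, 3), round(e_e_y_given_x, 3))   # equal, and close to 4.0
```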

The variance of a random variable is given by:

\(Var(x)=E((x-E(x))^2)\)

\(Var(x)=E(x^2+E(x)^2-2xE(x))\)

\(Var(x)=E(x^2)+E(E(x)^2)-E(2xE(x))\)

\(Var(x)=E(x^2)+E(x)^2-2E(x)^2\)

\(Var(x)=E(x^2)-E(x)^2\)

\(Var(c)=E(c^2)-E(c)^2\)

\(Var(c)= c^2-c^2\)

\(Var(c)=0\)

\(Var(cx)=E((cx)^2)-E(cx)^2\)

\(Var(cx)=E(c^2x^2)-[\sum_i cx_i P(x_i)]^2\)

\(Var(cx)=c^2E(x^2)-c^2[\sum_i x_i P(x_i)]^2\)

\(Var(cx)=c^2[E(x^2)- E(x)^2]\)

\(Var(cx)=c^2Var(x)\)

\(E(x)^2+Var(x)=E(x)^2+E((x-E(x))^2)\)

\(E(x)^2+Var(x)=E(x)^2+E(x^2+E(x)^2-2xE(x))\)

\(E(x)^2+Var(x)=E(x)^2+E(x^2)+E(E(x)^2)-E(2xE(x))\)

\(E(x)^2+Var(x)=E(x)^2+E(x^2)+E(x)^2-2E(x)E(x)\)

\(E(x)^2+Var(x)=E(x^2)\)

\(Var(x+y)=E((x+y)^2)-E(x+y)^2\)

\(Var(x+y)=E(x^2+y^2+2xy)-E(x+y)^2\)

\(Var(x+y)=E(x^2)+E(y^2)+E(2xy)-E(x+y)^2\)

\(Var(x+y)=E(x^2)+E(y^2)+E(2xy)-[E(x)+E(y)]^2\)

\(Var(x+y)=E(x^2)+E(y^2)+E(2xy)-E(x)^2-E(y)^2-2E(x)E(y)\)

\(Var(x+y)=[E(x^2)-E(x)^2]+[E(y^2)-E(y)^2]+E(2xy)-2E(x)E(y)\)

\(Var(x+y)=Var(x) +Var(y)+2[E(xy)-E(x)E(y)]\)

We then define:

\(Cov(x,y):=E(xy)-E(x)E(y)\)

Noting that:

\(Cov(x,x)=E(xx)-E(x)E(x)\)

\(Cov(x,x)=Var(x) \)

So:

\(Var(x+y)=Var(x)+Var(y)+2Cov(x,y)\)

\(Var(x+y)=Cov(x,x)+Cov(x,y)+Cov(y,x)+Cov(y,y) \)

\(Cov(x,c)=E(xc)-E(x)E(c)\)

\(Cov(x,c)=cE(x)-cE(x)\)

\(Cov(x,c)=0\)
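A simulation sketch checking \(Var(x+y)=Var(x)+Var(y)+2Cov(x,y)\) for two deliberately correlated variables; the model is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)     # y is correlated with x by construction

cov_xy = (x * y).mean() - x.mean() * y.mean()     # Cov(x, y) = E(xy) - E(x)E(y)
lhs = (x + y).var()
rhs = x.var() + y.var() + 2 * cov_xy
print(round(lhs, 4), round(rhs, 4))               # equal up to floating-point error
```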

The \(n\)th moment of variable \(X\) is defined as:

\(E[X^n]=\sum_i x_i^n P(x_i)\)

The mean is the first moment.

The \(n\)th central moment of variable \(X\) is defined as:

\(\mu_n=E[(X-E[X])^n]=\sum_i (x_i-E[X])^n P(x_i)\)

The variance is the second central moment.

The \(n\)th standardised moment of variable \(X\) is defined as:

\(\dfrac{E[(X-E[X])^n]}{(E[(X-E[X])^2])^\frac{n}{2}}=\dfrac{\mu_n}{\sigma^n}\)

Skewness is the third standardised moment.

Kurtosis is the fourth standardised moment.

With multiple variables, covariance can be defined between each pair of variables, including a variable with itself.

The covariance between \(2\) variables is:

\(Cov(x_i,x_j):=E(x_ix_j)-E(x_i)E(x_j)\)

Which is equal to:

\(Cov(x_i,x_j)=E\{[x_i-E(x_i)][x_j-E(x_j)]\}\)

We can therefore generate a covariance matrix through:

\(\Sigma =E[(X-E[X])(X-E[X])^T]\)
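A sketch computing the covariance matrix directly from this outer-product formula, using numpy's np.cov (with bias=True so the \(1/n\) normalisation matches) only as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
X = rng.normal(size=(3, n))              # three variables, n observations (rows are variables)
X[1] += 0.8 * X[0]                       # introduce some correlation

mean = X.mean(axis=1, keepdims=True)     # E[X] for each variable
centred = X - mean
sigma = centred @ centred.T / n          # Sigma = E[(X - E[X])(X - E[X])^T]

print(np.allclose(sigma, np.cov(X, bias=True)))   # True
```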

If \(\phi\) is convex then, by Jensen's inequality:

\(\phi (E[X])\le E[\phi (X)]\)

We first show that:

\(E[I_{X\ge a}]=P(X\ge a)\)

Consider the indicator function.

\(I_{X\ge a}\)

This is equal to \(0\) if \(X\) is below \(a\) and \(1\) otherwise.

We can take expectations of this.

\(E[I_{X\ge a}]=P(X\ge a).1+P(X<a).0=P(X\ge a)\)

\(E[I_{X\ge a}]=P(X\ge a)\)

For a non-negative random variable \(X\) and \(a>0\):

\(aI_{X\ge a}\le X\)

While \(X\) is below \(a\) the left side is equal to \(0\), which holds because \(X\ge 0\).

While \(X\) is equal to \(a\) the left side is equal to \(X\), which holds.

While \(X\) is above \(a\) the left side is equal to \(a\), which holds.

Markov's inequality states that, for such an \(X\) with mean \(\mu \):

\(P(X\ge a)\le \dfrac{\mu }{a}\)

From above:

\(aI_{X\ge a}\le X\)

We can take expectations of both sides:

\(E[aI_{X\ge a}]\le E[X]\)

\(aP(X\ge a)\le E[X]\)

\(P(X\ge a)\le \dfrac{\mu }{a}\)
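A simulation sketch of this bound, using an exponential variable with mean \(1\), chosen arbitrarily as a non-negative example:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=1_000_000)    # non-negative, mean 1

mu = x.mean()
for a in (1.0, 2.0, 5.0):
    empirical = (x >= a).mean()                   # P(X >= a) estimated from the sample
    bound = mu / a                                # Markov bound
    print(a, round(empirical, 4), "<=", round(bound, 4))
```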

We know from Markov’s inequality that:

\(P(X\ge a)\le \dfrac{\mu }{a}\)

Apply this to the non-negative variable \((X-\mu )^2\):

\(P((X-\mu )^2\ge a)\le \dfrac{E[(X-\mu )^2]}{a}\)

\(P((X-\mu )^2\ge a)\le \dfrac{\sigma^2}{a}\)

\(P(|X-\mu | \ge \sqrt{a})\le \dfrac{\sigma^2}{a}\)

Take \(a\) to be a multiple \(k^2\) of the variance \(\sigma^2\).

\(a=k^2\sigma^2\)

\(P(|X-\mu | \ge k\sigma )\le \dfrac{\sigma^2}{k^2\sigma^2}\)

\(P(|X-\mu | \ge k\sigma )\le \dfrac{1}{k^2}\)
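A corresponding sketch for Chebyshev's inequality, using a standard normal purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

mu, sigma = x.mean(), x.std()
for k in (1, 2, 3):
    empirical = (np.abs(x - mu) >= k * sigma).mean()   # P(|X - mu| >= k sigma)
    print(k, round(empirical, 4), "<=", round(1 / k**2, 4))
```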

Cumulative probability function

\(F(x)=\int_{-\infty }^x P(u)du\)

Moment generating function

\(M_X(t)=\int_{-\infty }^\infty e^{tx}P(x)dx\)

Characteristic function

\(\phi_X(t)=\int_{-\infty }^\infty e^{itx}P(x)dx\)

Take random variable \(X\). This has moments we wish to calculate.

We can transform the distribution into other forms which retain all of the required information; for example, the cumulative probability function could also be used to calculate moments. We now look for an alternative form of the probability density function which allows us to calculate moments easily.

One method is to use the probability density function and the definitions of moments, but there are other options. For example, consider the function:

\(E[e^{tX}]\)

Which expands to:

\(E[e^{tX}]=\sum_{j=0}^\infty \dfrac{t^jE[X^j]}{j!}\)

By taking the \(m\)th derivative of this, we get

\(E[X^m]+\sum_{j=m+1}^\infty \dfrac{t^{j-m}E[X^j]}{(j-m)!}\)

We can then set \(t=0\) to get

\(E[X^m]\)

Alternatively, see that differentiating \(m\) times gets us

\(E[X^me^{tX}]\)

If we can get this function, we can then easily generate moments.

The function we need to get is:

\(E[e^{tX}]\)

In the discrete case this is:

\(E[e^{tX}]=\sum_{i}e^{tx_i}p_i\)

In the continuous case:

\(E[e^{tX}]=\int_{-\infty }^\infty e^{tx}P(x) dx\)
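A symbolic sketch of this, using sympy and assuming the known moment generating function of a standard normal, \(e^{t^2/2}\): differentiating \(m\) times and setting \(t=0\) recovers the \(m\)th moment.

```python
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)          # MGF of a standard normal (assumed known)

# m-th moment = m-th derivative of the MGF, evaluated at t = 0.
moments = [sp.diff(M, t, m).subs(t, 0) for m in range(1, 5)]
print(moments)                # [0, 1, 0, 3], the first four standard normal moments
```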

It may not be possible to calculate the integral for the moment generating function. We now look for an alternative formula with which we can generate the same moments.

Consider

\(E[e^{itX}]\)

As \(e^{itX}\) is bounded and can be broken down into sinusoidal functions, this expectation can more readily be integrated, and it exists even when the moment generating function does not.

This expands to

\(E[e^{itX}]=\sum_{j=0}^\infty \dfrac{i^jt^jE[X^j]}{j!}\)

By taking the \(m\)th derivative we get.

\(E[X^m]i^m+\sum_{j=m+1}^\infty \dfrac{i^jt^{j-m}E[X^j]}{(j-m)!}\)

By setting \(t=0\) we then get:

\(E[X^m]i^m\)

Alternatively see that differentiating \(m\) times gets us

\(E[(iX)^me^{itX}]\)

So we can get the moment by differentiating \(m\) times, and multiplying by \(i^{-m}\).

Characteristic function

\(\phi_{X+c}(t)=E[e^{it(X+c)}]\)

\(\phi_{X+c}(t)=E[e^{itX}e^{itc}]\)

\(\phi_{X+c}(t)=e^{itc}E[e^{itX}]\)

\(\phi_{X+c}(t)=e^{itc}\phi_X(t)\)

\(\phi_{X}(t)=e^{-itc}\phi_{X+c}(t)\)

\(\phi_{cX}(t)=E[e^{itcX}]\)

\(\phi_{cX}(t) = \phi_{X}(ct)\)

\(\phi_X(t)=E[e^{itX}]\)

Taking a Taylor expansion around \(a\):

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{\phi_X^{(j)}(a)(t-a)^j}{j!}\)

Around \(a=0\)

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{\phi_X^{(j)}(0)t^j}{j!}\)

The characteristic function is now given in terms of the moments of \(X\).

We know:

\(\phi_X^{(j)}(0)=E[X^j]i^j\)

So:

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{E[X^j]i^j(t)^j}{j!}\)

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{E[X^j](it)^j}{j!}\)

We know:

\(\dfrac{E[X^0](it)^0}{0!}=E[1]=1\)

\(\dfrac{E[X^1](it)^1}{1!}=E[X](it)=it\mu_X \)

\(\dfrac{E[X^2](it)^2}{2!}=\dfrac{-E[X^2]t^2}{2}=\dfrac{-(\mu_X^2 +\sigma_X^2 )t^2}{2}\)

So:

\(\phi_X(t)=1+it\mu_X -\dfrac{(\mu_X^2 +\sigma_X^2 )t^2}{2} +\sum_{j=3}^{\infty }\dfrac{E[X^j](it)^j}{j!}\)
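A numerical sketch comparing this truncated expansion with a directly computed characteristic function, for a fair die and a small \(t\) (both choices arbitrary):

```python
import numpy as np

# Fair die: values 1..6 with probability 1/6 each.
values = np.arange(1, 7)
p = np.full(6, 1 / 6)

mu = (values * p).sum()                     # E[X]
var = ((values - mu) ** 2 * p).sum()        # Var(X) = sigma^2

t = 0.05
phi_exact = (p * np.exp(1j * t * values)).sum()           # E[e^{itX}]
phi_approx = 1 + 1j * t * mu - (mu**2 + var) * t**2 / 2   # expansion up to j = 2

print(phi_exact)
print(phi_approx)   # close for small t; the two differ at order t^3
```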