Basics of probability


Elementary events

We have a sample space, \(\Omega \) consisting of elementary events.

All elementary events are disjoint sets.

Non-elementary events

We have a \(\sigma\)-algebra over \(\Omega \) called \(F\). A \(\sigma\)-algebra over a set is a collection of its subsets that contains the set itself and is closed under complement and countable union. The power set is an example.

All events \(E\) are subsets of \(\Omega\)

\(\forall E\in F,\ E\subseteq \Omega\)

Mutually exclusive events

Events are mutually exclusive if they are disjoint sets.


Complementary events

For each event \(E\), there is a complementary event \(E^C\) such that:

\(E\lor E^C=\Omega\)

\(E\land E^C=\varnothing\)

The complement \(E^C\) is itself an event in \(F\), because a \(\sigma\)-algebra is closed under complement.

Union and intersection

As events are sets, we can define algebra on sets. For example for two events \(E_i\) and \(E_j\) we can define:

  • \(E_i\land E_j\), the intersection, where both events occur

  • \(E_i\lor E_j\), the union, where at least one of the events occurs

The probability function

For all events \(E\) in \(F\), the probability function \(P\) is defined.

Measure space

This gives us the following measure space:

\((\Omega, F, P)\)

Kolmogorov axioms

First axiom

The probability of all events is a non-negative real number.

\(\forall E \in F [(P(E)\ge 0)\land (P(E)\in \mathbb{R})]\)

Second axiom

The probability of one of the elementary events occurring is \(1\).

The probability of the outcome set is \(1\).

\(P(\Omega )=1\)

Third axiom

For a countable sequence of mutually exclusive events, the probability of the union is the sum of the probabilities:

\(P(\cup^\infty_{i=1}E_i)=\sum_{i=1}^\infty P(E_i)\)
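The three axioms can be checked directly on a small finite example. The sketch below assumes a fair six-sided die, uses the power set as \(F\), and only tests the finite form of the third axiom; the names `omega`, `power_set` and `prob` are illustrative and not part of the notes.

```python
from fractions import Fraction
from itertools import chain, combinations

# Assumed example: a fair six-sided die.
omega = frozenset(range(1, 7))

def power_set(s):
    """Return every subset of s; used here as the sigma-algebra F."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]

def prob(event):
    """P(E) for a fair die: each elementary event has probability 1/6."""
    return Fraction(len(event), len(omega))

F = power_set(omega)

# First axiom: every probability is a non-negative real number.
axiom_1 = all(prob(E) >= 0 for E in F)

# Second axiom: P(Omega) = 1.
axiom_2 = prob(omega) == 1

# Third axiom (finite case only): additivity for disjoint pairs of events.
axiom_3 = all(
    prob(A | B) == prob(A) + prob(B)
    for A in F for B in F if not (A & B)
)

print(axiom_1, axiom_2, axiom_3)  # True True True
```

Exact fractions are used so that the equality in the third axiom is not disturbed by floating point rounding.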

Basic results

Probability of null

\(P(\Omega )=1\)

\(P(\Omega \lor \varnothing )=1\)

\(P(\Omega )+P(\varnothing )=1\)

\(P(\varnothing )=0\)


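Monotonicity

Consider \(E_i\subseteq E_j\). Let \(E_k=E_j\land E_i^C\), so that:

\(E_j=E_i\lor E_k\)

\(P(E_j)=P(E_i\lor E_k)\)

\(E_i\) and \(E_k\) are disjoint, so:

\(P(E_j)=P(E_i)+P(E_k)\)

We know that \(P(E_k)\ge 0\) from axiom \(1\) so:

\(P(E_j)\ge P(E_i)\)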

Bounds of probabilities

As all events are subsets of the sample space:

\(P(\Omega )\ge P(E)\)

\(1\ge P(E)\)

From axiom \(1\) we then know:

\(\forall E\in F [0\le P(E)\le 1]\)

Union and intersection for null and universal

\(P(E\land \varnothing )=P(\varnothing )=0\)

\(P(E\lor \Omega )=P(\Omega )=1\)

\(P(E\lor \varnothing)=P(E)\)

\(P(E\land \Omega )=P(E)\)

Separation rule


\(P(E_i)=P(E_i\land \Omega)\)

\(P(E_i)=P(E_i\land (E_j\lor E_j^C))\)

\(P(E_i)=P((E_i\land E_j)\lor (E_i\land E_j^C))\)

As the two sets in the union are disjoint:

\(P(E_i)=P(E_i\land E_j)+P(E_i\land E_j^C)\)

Addition rule

We know that:

\(P(E_i\lor E_j)=P((E_i\lor E_j)\land (E_j\lor E_j^C))\)

By the distributive law of sets:

\(P(E_i\lor E_j)=P((E_i\land E_j^C)\lor E_j)\)

\(P(E_i\lor E_j)=P((E_i\land E_j^C)\lor (E_j\land (E_i\lor E_i^C)))\)

By the distributive law of sets:

\(P(E_i\lor E_j)=P((E_i\land E_j^C)\lor (E_j\land E_i)\lor (E_j\land E_i^C))\)

As these are disjoint:

\(P(E_i\lor E_j)=P(E_i\land E_j^C)+ P(E_j\land E_i)+P(E_j\land E_i^C)\)

From the separation rule:

\(P(E_i\lor E_j)=P(E_i)-P(E_i\land E_j)+ P(E_j\land E_i)+P(E_j)-P(E_j\land E_i)\)

\(P(E_i\lor E_j)=P(E_i)+P(E_j)-P(E_i\land E_j)\)
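As a quick numerical check of the separation and addition rules, the sketch below assumes a fair die and two overlapping events chosen purely for illustration ("even" and "at most 3"). Python's set operators `&`, `|` and `-` play the roles of \(\land\), \(\lor\) and intersection with the complement.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die.
omega = set(range(1, 7))

def prob(event):
    """P(E) when all elementary events are equally likely."""
    return Fraction(len(event), len(omega))

E_i = {2, 4, 6}  # the roll is even
E_j = {1, 2, 3}  # the roll is at most 3

# Separation rule: P(E_i) = P(E_i and E_j) + P(E_i and not E_j).
separation_lhs = prob(E_i)
separation_rhs = prob(E_i & E_j) + prob(E_i - E_j)

# Addition rule: P(E_i or E_j) = P(E_i) + P(E_j) - P(E_i and E_j).
addition_lhs = prob(E_i | E_j)
addition_rhs = prob(E_i) + prob(E_j) - prob(E_i & E_j)

print(separation_lhs, separation_rhs)  # 1/2 1/2
print(addition_lhs, addition_rhs)      # 5/6 5/6
```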

Probability of complements

From the addition rule:

\(P(E_i\lor E_j)=P(E_i)+P(E_j)-P(E_i\land E_j)\)

Consider \(E\) and \(E^C\):

\(P(E\lor E^C)=P(E)+P(E^C)-P(E\land E^C)\)

We know that \(E\) and \(E^C\) are disjoint, that is:

\(E\land E^C=\varnothing\)

Similarly by construction:

\(E\lor E^C=\Omega \)


So:

\(P(\Omega )=P(E)+P(E^C)-P(\varnothing)\)

\(1=P(E)+P(E^C)\)

\(P(E^C)=1-P(E)\)


Bayes theorem

Conditional probability

We define conditional probability, for \(P(E_j)>0\), as:

\(P(E_i|E_j):=\dfrac{P(E_i\land E_j)}{P(E_j)}\)

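We can show this is between \(0\) and \(1\). From the separation rule:

\(P(E_j)=P(E_i\land E_j)+P(E_i^C\land E_j)\)

\(P(E_i|E_j)=\dfrac{P(E_i\land E_j)}{ P(E_i\land E_j)+P(E_i^C\land E_j)}\)

Both terms in the denominator are non-negative, so the ratio lies between \(0\) and \(1\).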

We know:

\(P(x_i|y_j):=\dfrac{P(x_i \land y_j)}{P(y_j)}\)

\(P(y_j|x_i):=\dfrac{P(x_i \land y_j)}{P(x_i)}\)


\(P(x_i|y_j)P(y_j)=P(y_j|x_i) P(x_i)\)

\(P(x_i|y_j)=\dfrac{P(y_j|x_i) P(x_i)}{P(y_j)}\)

Note that this is undefined when \(P(y_j)=0\)
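The rearrangement above is easy to try with numbers. The following is a minimal sketch, not a library API: the function name `bayes` and the die example are assumptions made for illustration. It also shows the undefined case when \(P(y_j)=0\).

```python
from fractions import Fraction

def bayes(p_y_given_x, p_x, p_y):
    """Return P(x | y) = P(y | x) P(x) / P(y).

    Raises ZeroDivisionError when P(y) = 0, where P(x | y) is undefined.
    """
    if p_y == 0:
        raise ZeroDivisionError("P(y) = 0, so P(x | y) is undefined")
    return p_y_given_x * p_x / p_y

# Assumed example with a fair die:
#   x = "the roll is a 6", y = "the roll is even".
p_x = Fraction(1, 6)
p_y = Fraction(1, 2)
p_y_given_x = Fraction(1)  # a 6 is always even

posterior = bayes(p_y_given_x, p_x, p_y)
print(posterior)  # 1/3
```

The result matches counting directly: of the three even faces, one is a 6.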

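Note that for two different outcomes of the same variable, which are mutually exclusive:

\(P(x_i|x_j)=\dfrac{P(x_i\land x_j)}{P(x_j)}=\dfrac{P(\varnothing )}{P(x_j)}=0\)

For the same outcome:

\(P(x_i|x_i)=\dfrac{P(x_i\land x_i)}{P(x_i)}=\dfrac{P(x_i)}{P(x_i)}=1\)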



Bayes theorem

From the definition of conditional probability we know that:

\(P(E_i|E_j):=\dfrac{P(E_i\land E_j)}{P(E_j)}\)

\(P(E_j|E_i):=\dfrac{P(E_i\land E_j)}{P(E_i)}\)


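Rearranging:

\(P(E_i\land E_j)=P(E_i|E_j)P(E_j)\)

\(P(E_i\land E_j)=P(E_j|E_i)P(E_i)\)

Equating the two expressions:

\(P(E_i|E_j)P(E_j)=P(E_j|E_i)P(E_i)\)

\(P(E_i|E_j)=\dfrac{P(E_j|E_i)P(E_i)}{P(E_j)}\)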



Independent events

Events are independent if:

\(P(E_i|E_j)=P(E_i)\)


Note that:

\(P(E_i\land E_j)=P(E_i|E_j)P(E_j)\)

And so for independent events:

\(P(E_i\land E_j)=P(E_i)P(E_j)\)

Conjugate priors

If the prior \(P(\theta)\) and the posterior \(P(\theta | X)\) are in the same family of distributions (e.g. both Gaussian), then the prior and posterior are conjugate distributions, and the prior is called a conjugate prior for the likelihood.



Odds

Given a set of outcomes for a variable, the odds of an outcome \(x\) are defined as:

\(o(x):=\dfrac{P(x)}{1-P(x)}\)

For example, the odds of rolling a \(6\) on a fair die are \(\dfrac{1/6}{5/6}=\dfrac{1}{5}\).

Discrete and continuous probability

We know that:

\(\sum_yP(X\land Y)=P(X)\)

So for the continuous case

\(P(X)=\int_{-\infty }^{\infty }P(X\land Y)dy\)

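The result is the marginal distribution of \(X\), which behaves like the probability of a single variable. If there were more than \(2\) variables to start with, summing or integrating one out leaves a joint distribution over the remaining variables.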


Random variables

Defining variables

We have a sample space, \(\Omega \). A random variable \(X\) is a mapping from the sample space to the real numbers:

\(X: \Omega \rightarrow \mathbb{R}\)

We can list the elements of \(\Omega \) explicitly. As an example, take a coin toss and a die roll. The sample space is:

\(\Omega =\{H1,H2,H3,H4,H5,H6,T1,T2,T3,T4,T5,T6\}\)

A random variable could give us just the die value, such that:

\(X(H3)=X(T3)=3\)


We can define this more precisely using set-builder notation, by requiring that the following set is an event in \(F\) for all \(c\in \mathbb{R}\):

\(\{\omega |X(\omega )\le c\}\)

That is, for any random variable \(X\) and any real number \(c\), there is a corresponding subset of \(\Omega \) containing the \(\omega \)s in \(\Omega \) which map to at most \(c\).

Multiple variables

Multiple variables can be defined on the sample space. If we rolled a die we could define variables for

  • Whether it was odd/even

  • Number on the die

  • Whether it was less than 3

With more dice we could add even more variables.

Derivative variables

If we define a variable \(X\), we can also define another variable \(Y=X^2\).

Probability mass functions

\(P(X=x)=P(\{\omega |X(\omega)=x\})\)

For discrete probability this is a helpful number, for example when rolling a die.

This is not helpful for continuous probability, where the chance of any specific outcome is \(0\).
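To make the mapping from sample space to probability mass function concrete, the sketch below assumes the coin-and-die sample space with all twelve outcomes equally likely. The names `omega`, `X`, `Y` and `pmf` are illustrative choices: `X` returns the die value, and `Y` is a derived variable equal to the square of `X`.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Assumed sample space: a coin toss paired with a die roll, 12 outcomes.
omega = list(product("HT", range(1, 7)))

def X(outcome):
    """Random variable: keep only the die value."""
    coin, die = outcome
    return die

def Y(outcome):
    """Derived random variable: the square of X."""
    return X(outcome) ** 2

def pmf(random_variable):
    """P(X = x): total probability of the outcomes mapped to each value x."""
    masses = defaultdict(Fraction)
    for w in omega:
        masses[random_variable(w)] += Fraction(1, len(omega))
    return dict(masses)

p_X = pmf(X)
p_Y = pmf(Y)

print(p_X[3])   # 1/6, from the outcomes H3 and T3
print(p_Y[36])  # 1/6, from the outcomes H6 and T6
```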

Cumulative distribution functions


Random variables are all real-valued, and so we can write:

\(F_X(x):=P(X\le x)=P(\{\omega |X(\omega)\le x\})\)

In the continuous case:

\(F_X(x)=\int_{-\infty}^x f_X(u)du\)

In the discrete case:

\(F_X(x)=\sum_{x_i\le x}P(X=x_i) \)


\(P(X\le x)+P(X\ge x)-P(X=x)=1\)


\(P(a< X\le b)=F_X(b)-F_X(a)\)
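The discrete form of the cumulative distribution function and the interval identity can be checked on a fair die. This is a small sketch under that assumption; `pmf` and `cdf` are illustrative names.

```python
from fractions import Fraction

# Assumed example: probability mass function of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F_X(x) = sum of P(X = x_i) over all x_i <= x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

a, b = 2, 5

# P(a < X <= b) computed directly from the mass function.
direct = sum((p for value, p in pmf.items() if a < value <= b), Fraction(0))

# The same probability from the cumulative distribution function.
from_cdf = cdf(b) - cdf(a)

print(cdf(2), cdf(5))    # 1/3 5/6
print(direct, from_cdf)  # 1/2 1/2
```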

Probability density functions


If the variable is continuous, the probability at any single point is \(0\). We instead look at probability density.

Derived from cumulative distribution function:

\(F_X(x)=\int_{-\infty}^x f_X(u)du\)

The density function is \(f_X(x)\).

Conditional probability distributions

For probability mass functions:

\(P(Y=y|X=x)=\dfrac{P(Y=y\land X=x)}{P(X=x)}\)

For probability density functions:

\(f_{Y|X}(y|x)=\dfrac{f_{X,Y}(x,y)}{f_X(x)}\)


Multiple variables

Joint and marginal probability

Joint probability

\(P(X=x\land Y=y)\)

Marginal probability

\(P(X=x)=\sum_{y}P(X=x\land Y=y)\)


Independence and conditional independence


\(x\) is independent of \(y\) if:

\(\forall x_i \in x,\forall y_j \in y\ [P(x_i|y_j)=P(x_i)]\)

If \(P(x_i|y_j)=P(x_i)\) then:

\(P(x_i\land y_j)=P(x_i).P(y_j)\)

This logic extends beyond just two events. If the events are independent then:

\(P(x_i\land y_j \land z_k)=P(x_i).P(y_j \land z_k)=P(x_i).P(y_j).P(z_k)\)
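The product condition gives a direct test for independence from a joint probability table. The sketch below is a hedged illustration: the two joint tables (two separate fair coin tosses, and a second value that always copies the first) and the helper names `marginal` and `is_independent` are assumptions, not part of the notes.

```python
from fractions import Fraction
from itertools import product

def marginal(joint, axis):
    """Sum the joint probabilities over the other variable."""
    out = {}
    for outcome, p in joint.items():
        key = outcome[axis]
        out[key] = out.get(key, Fraction(0)) + p
    return out

def is_independent(joint):
    """Check P(x and y) = P(x) P(y) for every pair of outcomes."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return all(
        joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
        for x in p_x for y in p_y
    )

# Two separate fair coin tosses: every pair has probability 1/4.
two_coins = {(x, y): Fraction(1, 4) for x, y in product("HT", repeat=2)}

# The second value always copies the first: only HH and TT can occur.
copied = {("H", "H"): Fraction(1, 2), ("T", "T"): Fraction(1, 2)}

print(is_independent(two_coins))  # True
print(is_independent(copied))     # False
```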

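Note that because:

\(P(x_i|y_j)=\dfrac{P(x_i\land y_j)}{P(y_j)}\)

If two variables are independent, the conditional probability reduces to the marginal probability:

\(P(x_i|y_j)=\dfrac{P(x_i)P(y_j)}{P(y_j)}=P(x_i)\)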



Conditional independence

\(A\) and \(B\) are conditionally independent given \(X\) if:

\(P(A\land B|X)=P(A|X)P(B|X)\)

This is the same as:

\(P(A|B\land X)=P(A|X)\)


Functionals of probabilities

\(\phi (P)\in \mathbb{R} \) is a functional on \(P(X)\).

Examples include the expectation and variance.

We can define derivatives on these functionals.

\(\phi (P)\approx \phi (P^0)+D_\phi (P-P^0)\)

Where \(D_\phi \) is linear.

Expected value


For a random variable (or vector of random variables), \(x\), we define the expected value of \(f(x)\) as:

\(E[f(x)]:=\sum_i f(x_i) P(x_i)\)

The expected value of random variable \(x\) is therefore this where \(f(x)=x\).

\(E(x)=\sum_i x_i P(x_i)\)

Linearity of expectation

We can show that \(E(x+y)=E(x)+E(y)\):

\(E[x+y]=\sum_i \sum_j (x_i+y_j) P(x_i \land y_j)\)

\(E[x+y]=\sum_i \sum_j x_i [P(x_i \land y_j)]+\sum_i \sum_j [y_j P(x_i \land y_j)]\)

\(E[x+y]=\sum_i x_i \sum_j [P(x_i \land y_j)]+\sum_j y_j \sum_i [P(x_i \land y_j)]\)

\(E[x+y]=\sum_i x_i P(x_i)+\sum_j y_j P(y_j)\)

\(E[x+y]=E[x]+E[y]\)
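The derivation never uses independence, so linearity should hold even for dependent variables. The sketch below checks this on an assumed example: both variables are functions of the same fair die roll, with `x` the face value and `y` an indicator of an even face.

```python
from fractions import Fraction

# Assumed example: one fair die roll drives both variables.
omega = range(1, 7)
p = Fraction(1, 6)

def expectation(f):
    """E[f] = sum over outcomes of f(w) P(w)."""
    return sum(f(w) * p for w in omega)

def x(w):
    """The face value of the die."""
    return w

def y(w):
    """1 if the face is even, otherwise 0; clearly dependent on x."""
    return 1 if w % 2 == 0 else 0

def x_plus_y(w):
    return x(w) + y(w)

print(expectation(x))         # 7/2
print(expectation(y))         # 1/2
print(expectation(x_plus_y))  # 4
```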


Expectations of multiples


\(E(cx)=\sum_i cx_i P(x_i)\)

\(E(cx)=c\sum_i x_i P(x_i)\)

\(E(cx)=cE(x)\)


Expectations of constants

\(E(c)=\sum_i c_i P(c_i)\)

\(E(c)= cP(c)\)

\(E(c)= c\)

Conditional expectation

If \(Y\) is a variable we are interested in understanding, and \(X\) is a vector of other variables, we can create a model for \(Y\) given \(X\).

This is the conditional expectation, \(E(Y|X)\).

In the discrete case this is

\(E(Y|X)=\sum_y yP(y|X)\)



In the continuous case this is

\(E(Y|X)=\int_{-\infty }^{\infty }yP(y|X)dy\)

We can then identify an error term.

\(\epsilon :=Y-E(Y|X)\)

So:

\(Y=E(Y|X)+\epsilon \)

Here \(Y\) is called the dependent variable, and \(X\) is called the independent variable.

Iterated expectation

\(E[E(Y|X)]=E(Y)\)





Variance

The variance of a random variable is given by:

\(Var(x):=E[(x-E(x))^2]\)

\(Var(x)=E(x^2)-E(x)^2\)






Variance of a constant


\(Var(c)=E(c^2)-E(c)^2\)

\(Var(c)= c^2-c^2\)

\(Var(c)=0\)


Variance of multiple


\(Var(cx)=E(c^2x^2)-[\sum_i cx_i P(x_i)]^2\)

\(Var(cx)=c^2E(x^2)-c^2[\sum_i x_i P(x_i)]^2\)

\(Var(cx)=c^2[E(x^2)- E(x)^2]\)

\(Var(cx)=c^2Var(x)\)














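Covariance

\(Var(x+y)=E[(x+y)^2]-[E(x+y)]^2\)

\(Var(x+y)=E(x^2)+2E(xy)+E(y^2)-E(x)^2-2E(x)E(y)-E(y)^2\)

\(Var(x+y)=Var(x) +Var(y)+2[E(xy)-E(x)E(y)]\)

We then define:

\(Cov(x,y):=E(xy)-E(x)E(y)\)

Noting that:

\(Cov(x,y)=Cov(y,x)\)

\(Cov(x,x)=Var(x) \)

We can write:

\(Var(x+y)=Cov(x,x)+Cov(x,y)+Cov(y,x)+Cov(y,y) \)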
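The decomposition of the variance of a sum into four covariance terms can be verified numerically. This sketch assumes a fair die with `x` the face value and `y` an indicator of an even face, and takes the covariance to be \(E(xy)-E(x)E(y)\); the helper names are illustrative.

```python
from fractions import Fraction

# Assumed example: one fair die roll drives both variables.
omega = range(1, 7)
p = Fraction(1, 6)

def E(f):
    """Expectation of f over the die outcomes."""
    return sum(f(w) * p for w in omega)

def x(w):
    return w

def y(w):
    return 1 if w % 2 == 0 else 0

def cov(f, g):
    """Cov(f, g) = E(fg) - E(f) E(g)."""
    return E(lambda w: f(w) * g(w)) - E(f) * E(g)

def var(f):
    """Var(f) = Cov(f, f)."""
    return cov(f, f)

def total(w):
    return x(w) + y(w)

lhs = var(total)
rhs = cov(x, x) + cov(x, y) + cov(y, x) + cov(y, y)

print(var(x), var(y), cov(x, y))  # 35/12 1/4 1/4
print(lhs, rhs)                   # 11/3 11/3
```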






Moments

The \(n\)th moment of variable \(X\) is defined as:

\(E[X^n]=\sum_i x_i^n P(x_i)\)

The mean is the first moment.

Central moments

The \(n\)th central moment of variable \(X\) is defined as:

\(\mu_n=E[(X-E[X])^n]=\sum_i (x_i-E[X])^n P(x_i)\)

The variance is the second central moment.

Standardised moments

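The \(n\)th standardised moment of variable \(X\), where \(\sigma \) is the standard deviation, is defined as:

\(\dfrac{\mu_n}{\sigma^n}=\dfrac{E[(X-E[X])^n]}{\sigma^n}\)

Skewness is the third standardised moment.

Kurtosis is the fourth standardised moment.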

Covariance matrix

With multiple variables, covariance can be defined between each pair of variables, including a variable with itself.

The covariance between \(2\) variables is:

\(Cov(X_i,X_j)=E[(X_i-E[X_i])(X_j-E[X_j])]\)

Which is equal to:

\(Cov(X_i,X_j)=E[X_iX_j]-E[X_i]E[X_j]\)

We can therefore generate a covariance matrix for a random vector \(X\) through:

\(\Sigma =E[(X-E[X])(X-E[X])^T]\)

Jensen’s inequality

If \(\phi\) is convex then:

\(\phi (E[X])\le E[\phi (X)]\)

Bayesian networks



Markov’s inequality and Chebyshev’s inequality

Lemma 1

\(E[I_{X\ge a}]=P(X\ge a)\)

Consider the indicator function.

\(I_{X\ge a}\)

This is equal to \(0\) if \(X\) is below \(a\) and \(1\) otherwise.

We can take expectations of this.

\(E[I_{X\ge a}]=P(X\ge a).1+P(X<a).0=P(X\ge a)\)

\(E[I_{X\ge a}]=P(X\ge a)\)

Lemma 2

For non-negative \(X\) and \(a>0\):

\(aI_{X\ge a}\le X\)

While \(X\) is below \(a\) the left side is equal to \(0\), which holds as \(X\ge 0\).

While \(X\) is equal to \(a\) the left side is equal to \(X\), which holds.

While \(X\) is above \(a\) the left side is equal to \(a\), which holds.

Markov’s inequality

For non-negative \(X\) with mean \(\mu =E[X]\), and \(a>0\):

\(P(X\ge a)\le \dfrac{\mu }{a}\)

From above:

\(aI_{X\ge a}\le X\)

We can take expectations of both sides:

\(E[aI_{X\ge a}]\le E[X]\)

\(aP(X\ge a)\le E[X]\)

\(P(X\ge a)\le \dfrac{\mu }{a}\)

Chebyshev’s inequality

We know from Markov’s inequality that:

\(P(X\ge a)\le \dfrac{\mu }{a}\)

Let’s apply this to the non-negative variable \((X-\mu )^2\) in place of \(X\):

\(P((X-\mu )^2\ge a)\le \dfrac{E[(X-\mu )^2]}{a}\)

\(P((X-\mu )^2\ge a)\le \dfrac{\sigma^2}{a}\)

\(P(|X-\mu | \ge \sqrt{a})\le \dfrac{\sigma^2}{a}\)

Take \(a\) to be a multiple \(k^2\) of the variance \(\sigma^2\).


\(P(|X-\mu | \ge k\sigma )\le \dfrac{\sigma^2}{k^2\sigma^2}\)

\(P(|X-\mu | \ge k\sigma )\le \dfrac{1}{k^2}\)
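Both bounds can be checked on a small distribution. The sketch below assumes a fair die (a non-negative variable, as Markov's inequality requires) and a handful of arbitrary thresholds. The Chebyshev check compares squared deviations so that no square roots are needed.

```python
from fractions import Fraction

# Assumed example: probability mass function of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())               # mean, 7/2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # variance, 35/12

def tail(a):
    """P(X >= a)."""
    return sum((p for x, p in pmf.items() if x >= a), Fraction(0))

def deviation(k):
    """P(|X - mu| >= k sigma), tested as (X - mu)^2 >= k^2 sigma^2."""
    return sum(
        (p for x, p in pmf.items() if (x - mu) ** 2 >= k ** 2 * var),
        Fraction(0),
    )

# Markov: P(X >= a) <= mu / a for each positive threshold a.
markov_holds = all(tail(a) <= mu / a for a in range(1, 8))

# Chebyshev: P(|X - mu| >= k sigma) <= 1 / k^2 for each multiple k.
ks = [Fraction(1, 2), Fraction(1), Fraction(6, 5), Fraction(2)]
chebyshev_holds = all(deviation(k) <= 1 / k ** 2 for k in ks)

print(tail(5), mu / 5)                  # 1/3 7/10
print(markov_holds, chebyshev_holds)    # True True
```

The bounds are loose here, which is typical: they hold for every distribution with the given mean and variance, so they cannot be tight for each one.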

Characteristic functions



Cumulative probability function

\(F_X(x)=\int_{-\infty }^x P(u)du\)

Moment generating function

\(E[e^{tX}]=\int_{-\infty }^\infty e^{tx}P(x)dx\)

Characteristic function

\(E[e^{itX}]=\int_{-\infty }^\infty e^{itx}P(x)dx\)

Moment generating function

Take random variable \(X\). This has moments we wish to calculate.

We can transform the probability density function into other forms which keep all of the required information. For example, we could also use the cumulative probability function to calculate moments. We now look for an alternative form of the probability density function which allows us to calculate moments easily.

One method is to use the probability density function and the definitions of moments, but there are other options. For example, consider the function:

\(E[e^{tX}]\)

Which expands to:

\(E[e^{tX}]=\sum_{j=0}^\infty \dfrac{t^jE[X^j]}{j!}\)

By taking the \(m\)th derivative of this with respect to \(t\), we get:

\(E[X^m]+\sum_{j=m+1}^\infty \dfrac{t^{j-m}E[X^j]}{(j-m)!}\)

We can then set \(t=0\) to get:

\(E[X^m]\)

Alternatively, see that differentiating \(m\) times gets us:

\(E[X^me^{tX}]\)

This is also equal to \(E[X^m]\) at \(t=0\).

If we can get this function, we can then easily generate moments.

The function we need to get is:

\(E[e^{tX}]\)

In the discrete case this is:

\(E[e^{tX}]=\sum_i e^{tx_i}P(x_i)\)

In the continuous case:

\(E[e^{tX}]=\int_{-\infty }^\infty e^{tx}P(x) dx\)
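The link between derivatives of the moment generating function at \(t=0\) and the moments can be illustrated numerically. The sketch below assumes a fair die and approximates the derivatives with central finite differences; the step size `h` is an arbitrary choice, so the results are approximate rather than exact.

```python
import math

# Assumed example: probability mass function of a fair six-sided die.
pmf = {x: 1 / 6 for x in range(1, 7)}

def mgf(t):
    """Discrete moment generating function: sum of e^{t x_i} P(x_i)."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

h = 1e-3  # finite-difference step, an arbitrary small value

# First derivative at t = 0 approximates E[X].
first_moment = (mgf(h) - mgf(-h)) / (2 * h)

# Second derivative at t = 0 approximates E[X^2].
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2

# Moments computed directly from the definition, for comparison.
exact_first = sum(x * p for x, p in pmf.items())        # 3.5
exact_second = sum(x ** 2 * p for x, p in pmf.items())  # 91/6

print(first_moment, exact_first)
print(second_moment, exact_second)
```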

Characteristic function

It may not be possible to calculate the integral for the moment generating function. We now look for an alternative formula with which we can generate the same moments. Consider:

\(E[e^{itX}]=E[\cos (tX)]+iE[\sin (tX)]\)

As this can be broken down into bounded sinusoidal functions, the integral always exists.

This expands to:

\(E[e^{itX}]=\sum_{j=0}^\infty \dfrac{i^jt^jE[X^j]}{j!}\)

By taking the \(m\)th derivative we get:

\(E[X^m]i^m+\sum_{j=m+1}^\infty \dfrac{i^jt^{j-m}E[X^j]}{(j-m)!}\)

By setting \(t=0\) we then get:

\(E[X^m]i^m\)

Alternatively see that differentiating \(m\) times gets us:

\(E[(iX)^me^{itX}]\)

So we can get the moment by differentiating \(m\) times, setting \(t=0\), and multiplying by \(i^{-m}\).
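The same numerical experiment works for the characteristic function, using complex arithmetic. This is a sketch under the assumption of a fair die, with the first derivative at \(t=0\) approximated by a central finite difference and then multiplied by \(i^{-1}\); the step size `h` is an arbitrary choice.

```python
import cmath

# Assumed example: probability mass function of a fair six-sided die.
pmf = {x: 1 / 6 for x in range(1, 7)}

def cf(t):
    """Discrete characteristic function: sum of e^{i t x_j} P(x_j)."""
    return sum(cmath.exp(1j * t * x) * p for x, p in pmf.items())

h = 1e-3  # finite-difference step, an arbitrary small value

# First derivative at t = 0 is approximately i E[X].
first_derivative = (cf(h) - cf(-h)) / (2 * h)

# Multiply by i^{-1} to recover the first moment.
first_moment = (first_derivative / 1j).real

print(abs(cf(0.7)) <= 1.0)  # True: the modulus never exceeds 1
print(first_moment)         # approximately 3.5
```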

Inverses of these functions

Moment generating function

Characteristic function

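Moments of constants added to variables

Writing \(\phi_X(t)=E[e^{itX}]\) for the characteristic function:

\(\phi_{X+c}(t)=E[e^{it(X+c)}]=e^{itc}\phi_X(t)\)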






Moments of constants multiplied by variables

\(\phi_{cX}(t)=E[e^{itcX}]\)

\(\phi_{cX}(t) = \phi_{X}(ct)\)

Taylor series of a characteristic function


\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{\phi_X^{(j)}(a)(t-a)^j}{j!}\)

Around \(a=0\)

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{\phi_X^{(j)}(0)t^j}{j!}\)

We know:

\(\phi_X^{(j)}(0)=E[X^j]i^j\)

So the characteristic function can be given in terms of its moments:

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{E[X^j]i^jt^j}{j!}\)

\(\phi_X(t)=\sum_{j=0}^{\infty }\dfrac{E[X^j](it)^j}{j!}\)

We know:

\(\dfrac{E[X^0](it)^0}{0!}=1\)

\(\dfrac{E[X^1](it)^1}{1!}=E[X](it)=it\mu_X \)

\(\dfrac{E[X^2](it)^2}{2!}=\dfrac{-E[X^2]t^2}{2}=\dfrac{-(\mu_X^2 +\sigma_X^2 )t^2}{2}\)

So:

\(\phi_X(t)=1+it\mu_X -\dfrac{(\mu_X^2 +\sigma_X^2 )t^2}{2} +\sum_{j=3}^{\infty }\dfrac{E[X^j](it)^j}{j!}\)