# Fully observable Markov Decision Processes (MDP)

## Stochastic environments

### Markov chain recap

In a Markov chain we move from state to state randomly, following a transition matrix.

We have:

\(S\) - the state space

\(s_1\) - the initial state

\(P\) - the transition matrix
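The components above can be sketched in code. A minimal simulation with made-up numbers (two states, an illustrative transition matrix):

```python
import numpy as np

# Illustrative 2-state chain: P[i, j] is the probability of
# moving from state i to state j. Each row sums to 1.
P = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])

def simulate(P, s1, steps, rng):
    """Follow the transition matrix from the initial state s1."""
    states = [s1]
    for _ in range(steps):
        s = states[-1]
        # Sample the next state from row s of the transition matrix.
        states.append(int(rng.choice(len(P), p=P[s])))
    return states

rng = np.random.default_rng(0)
path = simulate(P, s1=0, steps=10, rng=rng)
```

The function names and numbers here are illustrative, not part of any standard API.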

### The value function

#### State payoffs

An agent gets a payoff which depends on the current state. We now have:

\(S\) - the state space

\(s_1\) - the initial state

\(P\) - the transition model

\(R\) - the reward distribution

#### Discounting

We maximise the expected sum of discounted rewards, using a discount factor \(\gamma \in [0, 1)\):

\(E[\sum_{t=1}^\infty \gamma^{t-1}r_t|s_1]\)
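For a given reward sequence the inner sum is easy to compute directly. A small sketch with illustrative numbers:

```python
# The discounted return of a reward sequence r_1, r_2, ... is
# sum over t of gamma^(t-1) * r_t.
def discounted_return(rewards, gamma):
    # enumerate starts at t = 0, which matches the gamma^(t-1) exponent
    # for rewards indexed from 1.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma = 0.5 and rewards [1, 1, 1]:
# 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1, 1, 1], gamma=0.5)
```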

## Markov Decision Processes (MDPs)

### Markov Decision Processes (MDPs)

#### Introduction

#### Actions

In an MDP the agent can choose an action in each state. The action determines which transition matrix is followed.

This means that the reward depends on the actions taken. We now have:

\(S\) - the state space

\(s_1\) - the initial state

\(A\) - the action space

\(P\) - the transition model

\(R\) - the reward distribution
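These pieces can be written down concretely. A minimal tabular sketch with made-up numbers (2 states, 2 actions):

```python
import numpy as np

# P[s, a] is a distribution over next states given state s and action a;
# R[s, a] is the expected immediate reward for taking action a in state s.
n_states, n_actions = 2, 2
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0, 1
    [[0.6, 0.4], [0.3, 0.7]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
```

Fixing an action per state (a deterministic policy) reduces this back to a Markov chain with rewards: each choice of \(a\) selects one row of \(P[s]\).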

### Policies

The decision maker needs to decide which action to take. For an MDP we assume that the agent knows:

The current state

The transition matrix

The payoffs

If the current state is not known, we have a Partially Observable Markov Decision Process.

If the transition matrix or payoffs are not known we have a reinforcement learning problem.

### The value function of a policy

#### The Markov property

The transition and reward distributions depend on the history \(H\) only through the current state and action:

\(P(s_{t+1}|H)=P(s_{t+1}|s_t=s, a_t=a)=P_{s,a}\)

\(E[r_t|H]=E[r_t|s_t=s, a_t=a]=R_{s,a}\)

#### Optimal policies

A policy is optimal if its expected return is greater than or equal to that of any other policy, in every state: \(v_{\pi^*}(s) \ge v_\pi(s)\) for all \(s\) and all \(\pi\).

## Identifying policies

### Solving Markov Decision Processes with linear programming
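A sketch of the standard formulation, in the notation above: choose \(v\) to minimise \(\sum_s v(s)\) subject to \(v(s) \ge R_{s,a} + \gamma \sum_{s'} P_{s,a}(s')\, v(s')\) for every state \(s\) and action \(a\). The optimal value function is the smallest feasible \(v\), and in each state the constraint is tight for at least one action.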

### The Bellman equation for Markov Decision Processes
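A sketch of the optimality equation, in the notation above: the optimal value of a state is the best achievable one-step reward plus the discounted value of what follows,

\(v^*(s) = \max_a \left[R_{s,a} + \gamma \sum_{s'} P_{s,a}(s')\, v^*(s')\right]\)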

### Policy iteration

#### Policy iteration method

We start with a random policy.

We then loop over two steps:

First, we evaluate the current policy \(\pi\) to obtain its value function \(v_\pi\).

Then we improve the policy: in each state \(s\) we change the action \(a\) to maximise

\(R_{s,a} + \gamma \sum_{s'} P_{s,a}(s')\, v_\pi(s')\)

For example, if the current policy takes action \(a\) in state \(s\), we check whether switching to some other action \(a'\) increases the value.
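The loop above can be sketched for a tabular MDP. This is a minimal implementation under assumed shapes (`P` is `(n_states, n_actions, n_states)`, `R` is `(n_states, n_actions)`); the example MDP at the end uses made-up numbers:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: alternate exact policy evaluation and greedy improvement."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # start with an arbitrary policy
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v as a linear system.
        P_pi = P[np.arange(n_states), policy]   # (n_states, n_states)
        r_pi = R[np.arange(n_states), policy]   # (n_states,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: in each state, pick the action maximising
        # R[s, a] + gamma * sum_s' P[s, a, s'] * v[s'].
        q = R + gamma * P @ v                   # (n_states, n_actions)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v                    # policy is stable: done
        policy = new_policy

# Illustrative 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
policy, v = policy_iteration(P, R)
```

Note that the evaluation step inverts (solves) a matrix of size `n_states`, which is the cost value iteration avoids.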

### Value iteration

#### Value iteration method

Value iteration doesn’t directly calculate a policy, and doesn’t require inverting a matrix: we repeatedly apply the Bellman optimality update to the value function until it converges, then read off a greedy policy at the end.
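A minimal sketch, under the same assumed tabular shapes as above (`P` is `(n_states, n_actions, n_states)`, `R` is `(n_states, n_actions)`), with an illustrative example MDP:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration: repeat the Bellman optimality backup until convergence.

    No linear system is solved and no policy is maintained during the loop.
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup: max over actions of the one-step lookahead.
        v_new = (R + gamma * P @ v).max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            # Read off the greedy policy only once the values have converged.
            greedy = (R + gamma * P @ v_new).argmax(axis=1)
            return v_new, greedy
        v = v_new

# Illustrative 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
v_star, greedy = value_iteration(P, R)
```

Each sweep only multiplies and maximises, which is why no matrix inversion is needed; convergence is guaranteed for \(\gamma < 1\) because the backup is a contraction.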

## Markov Decision Processes with infinite states

### Markov Decision Processes with infinite states