# Fully observable Markov Decision Processes (MDP)

## Stochastic environments

### Markov chain recap

In a Markov chain we move from state to state randomly, following a transition matrix.

We have:

• $$S$$ - the state space

• $$s_1$$ - the initial state

• $$P$$ - the transition model
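
As a concrete illustration, here is a minimal Python sketch of following a transition matrix from an initial state; the two-state chain and its probabilities are invented for the example.

```python
import numpy as np

# Invented two-state chain: P[s, s'] is the probability of moving
# from state s to state s'. Each row must sum to 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)

def sample_trajectory(P, s1, steps):
    """Move from state to state randomly, following the transition matrix."""
    states = [s1]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_trajectory(P, s1=0, steps=10))
```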

### The value function

#### State payoffs

An agent gets a payoff which depends on the current state. We now have:

• $$S$$ - the state space

• $$s_1$$ - the initial state

• $$P$$ - the transition model

• $$R$$ - the reward distribution

#### Discounting

We maximise the expected discounted sum of rewards, where the discount factor $$\gamma \in [0, 1)$$ weights immediate rewards more heavily than distant ones:

$$E\left[\sum_{t=1}^\infty \gamma^{t-1}r_t \mid s_1\right]$$
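
As a sanity check on this objective, the sketch below estimates the expectation by averaging sampled rollouts; the chain, the rewards, and $$\gamma = 0.9$$ are invented for illustration, and the infinite sum is truncated at a finite horizon.

```python
import numpy as np

P = np.array([[0.9, 0.1],   # invented transition matrix
              [0.5, 0.5]])
R = np.array([1.0, 0.0])    # invented reward for each state
gamma = 0.9
rng = np.random.default_rng(0)

def sampled_return(s1, horizon=200):
    """One sample of sum_{t=1..horizon} gamma^(t-1) * r_t, starting at s1."""
    s, total = s1, 0.0
    for t in range(horizon):
        total += gamma**t * R[s]
        s = rng.choice(len(P), p=P[s])
    return total

# Monte Carlo estimate of E[sum_t gamma^(t-1) r_t | s_1 = 0].
print(np.mean([sampled_return(0) for _ in range(1000)]))
```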

## Markov Decision Processes (MDPs)

### Markov Decision Processes (MDPs)

#### Actions

In an MDP the agent can choose an action in each state, and the chosen action changes the transition probabilities.

This means that the rewards depend on the actions taken. We now have:

• $$S$$ - the state space

• $$s_1$$ - the initial state

• $$A$$ - the action space

• $$P$$ - the transition model

• $$R$$ - the reward distribution
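
One minimal way to represent these five components in code is as a pair of arrays, sketched below; the two-state, two-action numbers are invented, not from the notes.

```python
import numpy as np

# S = {0, 1}, A = {0, 1}, initial state s1 = 0.
s1 = 0

# Transition model: P[a, s, s'] = probability of s -> s' under action a.
P = np.array([[[0.9, 0.1],    # action 0
               [0.4, 0.6]],
              [[0.2, 0.8],    # action 1
               [0.1, 0.9]]])

# Reward: R[s, a] = expected immediate reward for taking a in s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

assert np.allclose(P.sum(axis=2), 1.0)  # every row is a distribution
```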

### Policies

The decision maker needs to decide which action to take. For an MDP we assume that the agent knows:

• The current state

• The transition matrix

• The payoffs

If the current state is not known, we have a Partially Observable Markov Decision Process (POMDP).

If the transition matrix or payoffs are not known, we have a reinforcement learning problem.

### The value function of a policy

#### The Markov property

The transition and reward distributions depend on the history $$H$$ only through the current state and action:

$$P(s_{t+1}|H)=P(s_{t+1}|s_t=s, a_t=a)=P_{s,a}$$

$$E[r_t|H]=E[r_t|s_t=s, a_t=a]=R_{s,a}$$

#### Optimal policies

A policy $$\pi^*$$ is optimal if its expected return is greater than or equal to that of any other policy: $$v_{\pi^*}(s) \geq v_\pi(s)$$ for every state $$s$$ and every policy $$\pi$$.

## Identifying policies

### Policy iteration

#### Policy iteration method

We start with an arbitrary initial policy and then loop:

• Evaluate the policy, using the Bellman equations.

• Update the policy

We update the policy by changing $$a$$ to maximise:

$$v_{\pi'}(s) = R_{s,a} + \gamma \sum_{s'} P_{s,a}(s')\, v_\pi(s')$$

For example, if we have a policy of doing $$a$$ in state $$s$$, we would check whether changing $$a$$ to $$a'$$ increases the value.
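
Putting the two steps together, here is a minimal policy-iteration sketch that reuses the $$P[a, s, s']$$ and $$R[s, a]$$ array layout from the MDP sketch above. The evaluation step solves the Bellman equations exactly as a linear system (this is the matrix inversion that value iteration avoids).

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[a, s, s'] transitions, R[s, a] rewards -> (optimal policy, values)."""
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        # Evaluate: solve (I - gamma * P_pi) v = r_pi for v_pi exactly.
        P_pi = P[policy, np.arange(n_states)]     # row s is P[policy[s], s, :]
        r_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Improve: in each state pick the action maximising
        # R[s, a] + gamma * sum_s' P[a, s, s'] * v(s').
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy stable => optimal
            return policy, v
        policy = new_policy
```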

### Value iteration

#### Value iteration method

Value iteration doesn't directly calculate a policy, and it doesn't require inverting a matrix: it repeatedly applies the Bellman optimality update to the value function and reads off a greedy policy once the values converge.
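
A matching value-iteration sketch, under the same invented array layout as above: the values are updated in place with no linear solve, and the policy is extracted only at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [R[s, a] + gamma * sum_s' P[a, s, s'] v(s')]."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_new = q.max(axis=1)                     # Bellman optimality update
        if np.max(np.abs(v_new - v)) < tol:       # values have converged
            return q.argmax(axis=1), v_new        # greedy policy, values
        v = v_new
```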