Notations

- $S_t\in\mathcal{S}$: the environment state at step $t$, where $\mathcal{S}$ is the set of possible states
- $A_t\in\mathcal{A}(S_t)$: the action taken at step $t$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
- $R_{t+1}\in\mathcal{R}\subset\mathbb{R}$: the reward received after taking action $A_t$ in state $S_t$
- $\pi_t$: the agent's policy, where $\pi_t(a|s)$ is the probability that $A_t=a$ if $S_t=s$
Expected discounted return

The discounted return is $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$, where $0\le\gamma\le 1$ is the discount rate. The agent's goal is to maximize the expected discounted return.
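The discounted return can be computed from a finite reward sequence by accumulating backwards via $G_t=R_{t+1}+\gamma G_{t+1}$. A minimal sketch (the reward sequence and discount rate below are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```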
A reinforcement learning task that satisfies the Markov property is called a Markov Decision Process (MDP).
A finite MDP is specified by its state and action sets and by the one-step dynamics of the environment, $p(s',r|s,a)=\Pr(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$.
All other quantities can be computed from these dynamics, including:

- the expected rewards for state-action pairs: $r(s,a)=\mathbb{E}(R_{t+1}|S_t=s,A_t=a)$
- the state transition probabilities: $p(s'|s,a)=\Pr(S_{t+1}=s'|S_t=s,A_t=a)$
- the expected rewards for state-action-next-state triples: $r(s,a,s')=\mathbb{E}(R_{t+1}|S_t=s,A_t=a,S_{t+1}=s')$
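These derived quantities can be computed mechanically from the joint dynamics $p(s',r|s,a)$. A minimal sketch, assuming a dict-based representation; the two-state MDP below is a made-up example:

```python
# Joint dynamics p(s', r | s, a) for a hypothetical two-state MDP:
# keys are (s, a); values map (s', r) -> probability.
dynamics = {
    (0, "stay"): {(0, 0.0): 0.9, (1, 1.0): 0.1},
    (0, "go"):   {(1, 1.0): 1.0},
    (1, "stay"): {(1, 0.0): 1.0},
    (1, "go"):   {(0, 5.0): 0.5, (1, 0.0): 0.5},
}

def transition_prob(s, a, s_next):
    """p(s'|s,a): marginalize the joint dynamics over rewards."""
    return sum(p for (sp, _), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s,a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * p for (_, r), p in dynamics[(s, a)].items())

def expected_reward_triple(s, a, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / p(s'|s,a)."""
    num = sum(r * p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)
    return num / transition_prob(s, a, s_next)

print(expected_reward(1, "go"))               # 0.5 * 5.0 + 0.5 * 0.0 = 2.5
print(transition_prob(0, "stay", 1))          # 0.1
print(expected_reward_triple(1, "go", 0))     # 5.0
```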
An alternative definition [1]: an MDP is defined by

- a set of states $\mathcal{S}$
- a start state or initial state $s_0\in\mathcal{S}$
- a set of actions $\mathcal{A}$
- transition probabilities $P(S_{t+1}=s'|S_t=s,A_t=a)$
- reward probabilities $P(R_{t+1}=r|S_t=s,A_t=a)$
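The components of this alternative definition can be bundled into one container. A minimal sketch (field names and the tiny example MDP are illustrative, not from the source):

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: set          # S
    start_state: int     # s0, an element of S
    actions: set         # A
    transition: dict     # (s, a) -> {s': P(S_{t+1}=s' | S_t=s, A_t=a)}
    reward: dict         # (s, a) -> {r: P(R_{t+1}=r | S_t=s, A_t=a)}

# A hypothetical two-state MDP where "go" deterministically pays reward 1.
mdp = MDP(
    states={0, 1},
    start_state=0,
    actions={"stay", "go"},
    transition={(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
                (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}},
    reward={(s, a): {(1.0 if a == "go" else 0.0): 1.0}
            for s in {0, 1} for a in {"stay", "go"}},
)
assert mdp.start_state in mdp.states
```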
For an MDP, the state-value function for a policy $\pi$ is $v_\pi(s)=\mathbb{E}_\pi(G_t|S_t=s)$, the expected return when starting in $s$ and following $\pi$ thereafter.

In a similar way we can define the action-value function for policy $\pi$: $q_\pi(s,a)=\mathbb{E}_\pi(G_t|S_t=s,A_t=a)$.
- Q-learning
The Bellman equation for $v_\pi$ expresses the value of a state in terms of the values of its possible successor states: $v_\pi(s)=\sum_a\pi(a|s)\sum_{s'}p(s'|s,a)\left[r(s,a,s')+\gamma v_\pi(s')\right]$.
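The Bellman equation can be turned into an iterative policy-evaluation update, sweeping $v(s)\leftarrow\sum_a\pi(a|s)\sum_{s'}p(s'|s,a)[r(s,a,s')+\gamma v(s')]$ until convergence. A minimal sketch; the two-state MDP and the deterministic "go" policy are hypothetical:

```python
def policy_evaluation(states, actions, policy, p, r, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman equation for v_pi until convergence.

    policy[s][a] = pi(a|s); p[(s, a)][s'] = p(s'|s,a); r[(s, a, s')] = r(s,a,s').
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(policy[s][a] * sum(prob * (r[(s, a, sp)] + gamma * v[sp])
                                           for sp, prob in p[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Hypothetical two-state MDP: "go" switches state and pays 1, "stay" pays 0.
states, actions = [0, 1], ["stay", "go"]
p = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}}
r = {(0, "stay", 0): 0.0, (0, "go", 1): 1.0,
     (1, "stay", 1): 0.0, (1, "go", 0): 1.0}
policy = {s: {"stay": 0.0, "go": 1.0} for s in states}  # always "go"
v = policy_evaluation(states, actions, policy, p, r, gamma=0.9)
# Always "go" earns reward 1 every step, so v(s) = 1/(1-0.9) = 10 for both states.
```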
Reinforcement learning can solve Markov decision processes without explicit specification of the transition probabilities.
In finite MDPs, value functions define a partial ordering over policies:

- A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states, i.e., $\pi\ge\pi'$ if and only if $v_\pi(s)\ge v_{\pi'}(s)$ for all $s\in\mathcal{S}$.
The optimal state-value function, denoted $v_*$, is defined as $v_*(s)=\max_\pi v_\pi(s)$ for all $s\in\mathcal{S}$.
For MDPs, $v_*$ satisfies the Bellman optimality equation: $v_*(s)=\max_a\sum_{s'}p(s'|s,a)\left[r(s,a,s')+\gamma v_*(s')\right]$.

This equation means that any policy that is greedy with respect to the optimal evaluation function $v_*$ is an optimal policy.
- This equation also has a special structure that dynamic programming can exploit to find the optimal solution [2].
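The Bellman optimality equation yields value iteration, a dynamic-programming scheme that repeatedly applies the backup $v(s)\leftarrow\max_a\sum_{s'}p(s'|s,a)[r(s,a,s')+\gamma v(s')]$ and then reads off a greedy, hence optimal, policy. A minimal sketch on a hypothetical two-state MDP:

```python
def value_iteration(states, actions, p, r, gamma=0.9, tol=1e-8):
    """Solve the Bellman optimality equation by repeated greedy backups."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(sum(prob * (r[(s, a, sp)] + gamma * v[sp])
                            for sp, prob in p[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Any policy greedy with respect to v* is optimal.
    policy = {s: max(actions,
                     key=lambda a: sum(prob * (r[(s, a, sp)] + gamma * v[sp])
                                       for sp, prob in p[(s, a)].items()))
              for s in states}
    return v, policy

# Hypothetical two-state MDP: "go" switches state and pays 1, "stay" pays 0.
states, actions = [0, 1], ["stay", "go"]
p = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}}
r = {(0, "stay", 0): 0.0, (0, "go", 1): 1.0,
     (1, "stay", 1): 0.0, (1, "go", 0): 1.0}
v_star, pi_star = value_iteration(states, actions, p, r)
# The greedy policy chooses "go" in both states; v*(s) = 1/(1-0.9) = 10.
```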
- Q-learning
An example of using an MDP for information extraction.
- [1] Mohri, Rostamizadeh, and Talwalkar. Foundations of Machine Learning. 2012.
- [2] Kleinberg and Tardos. "Chapter 6: Dynamic Programming." Algorithm Design. 2005.