[Review] UCL_RL Lecture08 Integrating Learning and Planning


Preface

This is a review of the UCL reinforcement learning course, and this is part 08 in the series.
In this lecture, we learn a model directly from experience and use planning to construct a value function or policy.
Previously, we have either learnt a policy directly from experience or learnt to estimate a value function directly from experience.
In other words, we will see learning and planning integrated into a single architecture.

Content

  1. Introduction
  2. Model-Based Reinforcement Learning
  3. Integrated Architectures
  4. Simulation-Based Search

Introduction

As preparation for this lecture, let me briefly recap the two types of RL we have seen so far.

  • Model-Free RL: no model; learn a value function directly from experience
  • Model-Based RL: learn a model from experience and plan a value function from the model, as in a simulation

Model-Based RL

To understand the single architecture that combines planning and learning, let's review Model-Based RL in this section.

The architecture forms a relatively simple loop: starting by acting according to the initially generated value function/policy, the agent gathers real experience, uses that experience for model learning, and then uses the learned model to plan, i.e. to evaluate and improve the value function/policy.
Now that we understand the architecture, you might wonder what model-based RL is actually good for.
The answer is below.

  • Advantages
    • Can efficiently learn a model by supervised learning methods
    • Can reason about model uncertainty
  • Disadvantages
    • First learn a model, then construct a value function. Hence there are two sources of approximation error.

Now, to understand it more deeply, let us dive further into model-based RL.

What is a Model?

As we have seen in previous lectures, a model is a representation of the MDP (Markov Decision Process) $< S, A, P, R >$, parametrised by $\eta$. At this point we assume that the state space $S$ and the action space $A$ are known. Hence, a model $M = < P_{\eta}, R_{\eta} >$ represents the state transitions $P_{\eta} \approx P$ and the rewards $R_{\eta} \approx R$:

S_{t+1} \sim P_{\eta}(S_{t+1} | S_t, A_t)\\
R_{t+1} \sim R_{\eta}(R_{t+1} | S_t, A_t)\\
P[S_{t+1}, R_{t+1} | S_t, A_t] = P[S_{t+1}| S_t, A_t] P[R_{t+1} | S_t, A_t]
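To make this concrete, here is a minimal sketch of what such a model exposes in code: a way to draw $S_{t+1}$ and $R_{t+1}$ given $S_t$ and $A_t$. The class and attribute names are mine (not from the lecture), and the tabular representation is just an assumption for illustration.

import numpy as np

class Model:
    # A parametric model M = <P_eta, R_eta> over discrete states/actions (sketch).
    def __init__(self, n_states, n_actions):
        # P_eta[s, a] is a distribution over next states; R_eta[s, a] an expected reward
        self.P_eta = np.full((n_states, n_actions, n_states), 1.0 / n_states)
        self.R_eta = np.zeros((n_states, n_actions))

    def sample(self, s, a):
        # Draw S_{t+1} ~ P_eta(. | s, a) and pair it with R_{t+1} = R_eta(s, a)
        s_next = np.random.choice(self.P_eta.shape[2], p=self.P_eta[s, a])
        return s_next, self.R_eta[s, a]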

Model Learning

Based on the definition of the model, let us move on to model learning in this section.
Purpose of learning: estimate a model $M_{\eta}$ from experience { $S_1, A_1, R_2, ... , S_T$ }.
Simply put, this can be treated as a supervised learning problem, as sketched in the code example below.

  • Learning $s, a \rightarrow r$ (reward) is a regression problem
  • Learning $s, a \rightarrow s'$ (next state) is a density estimation problem
  • Select a loss function to optimise, e.g. MSE, Cross Entropy, KL divergence and so on.
  • Find parameters $\eta$ that minimise empirical loss by using optimisation algorithms, e.g. SGD, Momentum SGD, Adam and so on.

Wikipedia (density estimation):
In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population.
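As a rough sketch of this supervised-learning view, assuming experience stored as (s, a, r, s') tuples over discrete states/actions, one-hot features, and scikit-learn estimators (the linear and logistic choices are illustrative, not prescribed by the lecture):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def one_hot(s, a, n_states, n_actions):
    # Encode a state-action pair as a one-hot feature vector
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def fit_model(experience, n_states, n_actions):
    X = np.array([one_hot(s, a, n_states, n_actions) for s, a, r, s2 in experience])
    rewards = np.array([r for s, a, r, s2 in experience])
    next_states = np.array([s2 for s, a, r, s2 in experience])

    reward_model = LinearRegression().fit(X, rewards)             # regression: s, a -> r
    transition_model = LogisticRegression().fit(X, next_states)   # density estimation: s, a -> P(s' | s, a)
    return reward_model, transition_model

In this sketch, `transition_model.predict_proba` gives an estimate of $P_{\eta}(S_{t+1} | S_t, A_t)$ and `reward_model.predict` an estimate of the expected reward.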

With regard to examples of models, here is a list:

  • Table Lookup Model
  • Linear Expectation Model
  • Linear Gaussian Model
  • Gaussian Process Model
  • Deep Belief Network Model
  • and so on.

Since I am not familiar with some of these models yet, let us focus on a running example: the Table Lookup Model.

Table Lookup Model

As we saw in the previous section, a model consists of estimated transition probabilities and rewards. In the table lookup case, we simply count visits to each state-action pair and average over them:
$N(s,a)$: count of visits to each state-action pair

\hat{P}^a_{s,s'} = \frac{1}{N(s,a)} \sum^T_{t=1} \mathbf{1}(S_t, A_t, S_{t+1} = s,a,s')\\
\hat{R}^a_s = \frac{1}{N(s,a)} \sum^T_{t=1} \mathbf{1}(S_t, A_t = s,a) R_t
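A minimal sketch of this table-lookup model in code, mirroring the counting formulas above (class and method names are my own):

from collections import defaultdict

class TableLookupModel:
    def __init__(self):
        self.N = defaultdict(int)                   # N(s, a): visit counts
        self.transition_counts = defaultdict(int)   # counts of (s, a, s')
        self.reward_sum = defaultdict(float)        # summed rewards for (s, a)

    def update(self, s, a, r, s_next):
        # Record one real transition (S_t, A_t, R_{t+1}, S_{t+1})
        self.N[(s, a)] += 1
        self.transition_counts[(s, a, s_next)] += 1
        self.reward_sum[(s, a)] += r

    def P_hat(self, s, a, s_next):
        # \hat{P}^a_{s,s'} = count(s, a, s') / N(s, a)
        return self.transition_counts[(s, a, s_next)] / self.N[(s, a)]

    def R_hat(self, s, a):
        # \hat{R}^a_s = summed reward for (s, a) / N(s, a)
        return self.reward_sum[(s, a)] / self.N[(s, a)]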

Planning

Given the model from the last section, $M =$ <$P_{\eta}, R_{\eta}$>, we can solve the MDP <$S, A, P_{\eta}, R_{\eta}$> using any planning algorithm we have learnt, e.g. value iteration, policy iteration, tree search, and so on.
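For instance, assuming the learned model is stored as tabular NumPy arrays (an assumption for this sketch), planning by value iteration on the learned MDP could look like this:

import numpy as np

def value_iteration(P_eta, R_eta, gamma=0.9, theta=1e-6):
    # P_eta: (n_states, n_actions, n_states) estimated transition probabilities
    # R_eta: (n_states, n_actions) estimated expected rewards
    V = np.zeros(P_eta.shape[0])
    while True:
        # Q(s, a) = R_eta(s, a) + gamma * sum_s' P_eta(s' | s, a) V(s')
        Q = R_eta + gamma * (P_eta @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)   # value function and greedy policy
        V = V_new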

Next, let's consider a basic example of planning: sample-based planning.

Sample-Based Planning

This is a simple yet quite powerful approach to planning.
First, we sample experience from the model, then apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, or Q-learning.
Visualising this approach on the simple AB example from an earlier lecture makes the idea more concrete; a code sketch of sample-based planning follows below.
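Here is a sketch of sample-based planning under the same assumptions as the earlier model-interface example (a `model.sample(s, a)` method returning the next state and reward); the function name and sampling scheme are mine:

import numpy as np

def sample_based_planning(model, n_states, n_actions,
                          n_samples=10000, alpha=0.1, gamma=0.9):
    # Sample-based planning (sketch): draw transitions from the learned model
    # and apply model-free Q-learning to the simulated experience.
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_samples):
        s = np.random.randint(n_states)      # pick a state-action pair to simulate from
        a = np.random.randint(n_actions)
        s_next, r = model.sample(s, a)       # simulate one step with the model
        # Q-learning update on the simulated transition
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q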

By contrast, let us consider the situation where the model is inaccurate. Unfortunately, this is where we see a limitation of model-based RL. First, let me clarify the situation.
Let the imperfect model be <$P_{\eta}, R_{\eta}$> $\neq$ <$P, R$>.
Since we sample states and rewards from this inaccurate model, the simulated experience will be wrong or noisy as well. Consequently, planning can only compute a policy that is optimal with respect to the approximate MDP; in other words, model-based RL is only as good as the estimated model.
So we need solutions to this issue:

  1. When the model is wrong, use model-free RL
  2. Reason explicitly about model uncertainty

Next, I would like to think about an architecture that integrates model-free and model-based RL.

Dyna (Integrated Architecture)

This time, let me first describe the contrast among model-free RL, model-based RL, and Dyna.

  • Model-Free RL: no model; learn a value function directly from real experience
  • Model-Based RL: learn a model from real experience and plan a value function from the model (i.e. from simulated experience)
  • Dyna: learn a model from real experience, and learn and plan the value function from both real and simulated experience. Within this Dyna architecture, I will show the simplest algorithm, Dyna-Q, as a running example below.

In this algorithm, we first initialise $Q(s,a)$ and $Model(s,a)$, then repeat the update process for the model and the action-value function until $Q$ converges.
Inside the loop, the agent takes a real step and uses that sample to update both the model and $Q$; then, for $n$ planning steps, it samples previously observed states and actions, queries the updated model for the resulting reward and next state, and applies the same Q-learning update to further improve the action-value function. A minimal sketch of this Dyna-Q loop is given below.
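Here is a minimal Dyna-Q sketch, assuming a Gymnasium-style environment with discrete observation and action spaces and, for simplicity, a deterministic table-lookup model; the function and variable names are my own, not the lecture's pseudocode.

import random
from collections import defaultdict

def dyna_q(env, n_episodes=100, n_planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)      # action-value table Q(s, a)
    model = {}                  # Model(s, a) -> (reward, next state, done)
    n_actions = env.action_space.n

    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection from the current Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # (a) direct RL: Q-learning update from the real transition
            target = r if done else r + gamma * max(Q[(s_next, x)] for x in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # (b) model learning: deterministic table-lookup model
            model[(s, a)] = (r, s_next, done)

            # (c) planning: n simulated updates from previously observed (s, a)
            for _ in range(n_planning_steps):
                ps, pa = random.choice(list(model.keys()))
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * max(Q[(ps_next, x)] for x in range(n_actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s_next
    return Q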