CS 285 DRL notes
Basic Terminology
- $s$: state.
- $o$: observation, the part of the state that the agent actually perceives (the things we care about).
- $a$: action.
- $\pi$: a policy, which is what the RL algorithm produces after learning. $\pi_{\theta}$ is the policy learned from data, where $\theta$ denotes the policy's parameters.
- The subscript $t$ (as in $s_t, o_t, a_t$) denotes the time step.
Imitation Learning
The machine simply imitates what it "sees", which usually means the experts' behaviors (actions) in the training data.
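Concretely, this reduces to supervised learning on expert (observation, action) pairs. A minimal sketch in PyTorch, assuming a continuous action space and placeholder data (`expert_obs` and `expert_acts` are made-up tensors, not a real dataset):

```python
import torch
import torch.nn as nn

# Hypothetical expert dataset: observations and continuous actions.
obs_dim, act_dim = 4, 2
expert_obs = torch.randn(1000, obs_dim)
expert_acts = torch.randn(1000, act_dim)

# A simple deterministic policy pi_theta(o_t) -> a_t.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavioral cloning = ordinary supervised regression onto the expert's actions.
for epoch in range(100):
    pred_acts = policy(expert_obs)
    loss = nn.functional.mse_loss(pred_acts, expert_acts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```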
Problem
Because of a small mistake made by the learned policy $\pi_{\theta}$, the agent may end up in a state that never occurred during training. This can lead to even worse actions, and the errors accumulate.
Moreover, the i.i.d. assumption cannot be met; this is the distribution shift problem: the distribution of states seen at test time differs from the distribution in the training dataset. For example, an autonomous driving system trained in good weather but tested in rain or fog.
Analysis
Some key points:
- Cost function: $c(s_t, a_t) = 0$ if the action matches the expert's behavior, $1$ otherwise.
- Assumption: supervised learning works well, i.e. for states $s$ that appear in $D_{train}$, the probability of a mistake is at most $\epsilon$.
- Once the model takes a wrong action, the agent may enter an unfamiliar state from which it cannot recover, so in the worst case the cost from that point on is $1 \times T$ (where $T$ is the number of remaining time steps).

In the slides, the professor derives an upper bound on the expected total cost: it grows as $O(\epsilon T^2)$. As the horizon $T$ gets longer, the error grows quadratically rather than linearly, which is bad.
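A rough sketch of why the bound is quadratic (my reconstruction of the standard argument, not the exact derivation from the slide): at step $t$, the policy has made at least one mistake so far with probability at most $\epsilon t$, and once it is off the training distribution it may pay cost 1 at every remaining step. Summing over the horizon:

$$
\mathbb{E}\left[\sum_{t=1}^{T} c(s_t, a_t)\right] \;\lesssim\; \sum_{t=1}^{T} \epsilon\, t \;=\; \epsilon\,\frac{T(T+1)}{2} \;=\; O(\epsilon T^2)
$$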
Solution
1. Intentionally add mistakes and corrections
The added mistakes may dilute the dataset with suboptimal actions, but the benefit of teaching the policy how to recover from them outweighs the cost.
2. Data augmentation
For example, NVIDIA's autonomous driving system adds two side-facing cameras whose frames are labeled with corrected steering commands.
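A sketch of the trick, with hypothetical names and a made-up correction offset (the real value depends on camera geometry): frames from the side cameras are treated as if the car were off-center and paired with corrective steering labels.

```python
from collections import namedtuple

# Hypothetical log entry: three camera frames plus the recorded steering angle.
Sample = namedtuple("Sample", ["center_img", "left_img", "right_img", "steering"])
driving_log = [Sample("c0.png", "l0.png", "r0.png", 0.05)]  # placeholder data

STEER_CORRECTION = 0.2  # hypothetical offset; tuned per camera geometry

# Treat the side cameras as if the car were off-center, and pair them with
# a corrective steering label, so the policy learns to steer back to center.
augmented = []
for s in driving_log:
    augmented.append((s.center_img, s.steering))
    augmented.append((s.left_img, s.steering + STEER_CORRECTION))
    augmented.append((s.right_img, s.steering - STEER_CORRECTION))
```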
3. Using historical data
In some circumstances, one cause of bad performance is non-Markovian behavior: human experts do not make decisions based only on the current observation, but also on what they have experienced or seen before (a small sketch of a history-conditioned policy follows this item).
E.g., when you are driving, a car may disappear from your field of view, yet you know it is now in your blind spot because you saw it in your rearview mirror a few seconds ago. Judging only from the current observation, you would not have this information.
However, using history may cause causal confusion: the machine may conclude that it should brake because the car is slowing down (an effect of braking), when the real cause is that you brake upon seeing a person a few meters in front of the car.
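A minimal sketch of one way to condition on history, using an LSTM over the last few observations (dimensions are placeholders; stacking frames and feeding them to a feedforward network is a simpler alternative):

```python
import torch
import torch.nn as nn

# History-conditioned policy: instead of pi_theta(a_t | o_t), condition on the
# last k observations via an RNN. Dimensions below are placeholders.
obs_dim, act_dim, hidden = 16, 4, 64

class RecurrentPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):
        # obs_seq: (batch, k, obs_dim) -- the last k observations.
        out, _ = self.rnn(obs_seq)
        return self.head(out[:, -1])  # predict the action from the history summary

policy = RecurrentPolicy()
actions = policy(torch.randn(8, 5, obs_dim))  # batch of 8, history length 5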
4. Expressive continuous distributions
For multimodal action distributions (e.g., an expert who sometimes goes left and sometimes right around the same obstacle). Options include:
- Mixture of Gaussians (a sketch follows this item)
- Latent variable models
- Diffusion models
- Discretization
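A minimal sketch of the mixture-of-Gaussians option, assuming a PyTorch policy trained by maximizing the log-likelihood of expert actions (dimensions and the number of components are placeholders):

```python
import torch
import torch.nn as nn

# Mixture-of-Gaussians policy head: the network outputs k mixture weights,
# k means, and k (log) standard deviations, so pi_theta(a_t | o_t) can be multimodal.
obs_dim, act_dim, k, hidden = 16, 2, 5, 64

class GMMPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, k)             # mixture weights
        self.means = nn.Linear(hidden, k * act_dim)    # component means
        self.log_stds = nn.Linear(hidden, k * act_dim) # component (log) scales

    def forward(self, obs):
        h = self.trunk(obs)
        mix = torch.distributions.Categorical(logits=self.logits(h))
        comp = torch.distributions.Independent(
            torch.distributions.Normal(
                self.means(h).view(-1, k, act_dim),
                self.log_stds(h).view(-1, k, act_dim).exp(),
            ), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)

policy = GMMPolicy()
dist = policy(torch.randn(8, obs_dim))
loss = -dist.log_prob(torch.randn(8, act_dim)).mean()  # BC = maximize expert log-likelihood
```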
5. Multi-task learning
Different goals, learned together (i.e., a goal-conditioned policy).
For example, several experts drive to different destinations; the machine learns from all of them, even though its final goal is to reach just one of those destinations.
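A minimal sketch of a goal-conditioned policy, assuming the goal can be represented as a vector and simply concatenated with the observation (dimensions are placeholders):

```python
import torch
import torch.nn as nn

# Goal-conditioned policy: pi_theta(a_t | o_t, g). Concatenating the goal with
# the observation lets one network imitate experts heading to different goals.
obs_dim, goal_dim, act_dim = 16, 3, 2

policy = nn.Sequential(
    nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)

obs, goal = torch.randn(8, obs_dim), torch.randn(8, goal_dim)
action = policy(torch.cat([obs, goal], dim=-1))
```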
6. Relabeling method
I don't fully understand this yet. I regard it as a way of relabeling the data so that the goal attached to a trajectory may not be the one you originally wanted, but the one the trajectory actually reached (in the context of imitation learning?).
7. DAgger
Dataset Aggregation: instead of training only on expert data, collect data from the distribution the learned policy actually induces, $p_{\pi_\theta}(\mathbf{o}_t)$.
Run the policy to get data $D_{\pi}$, ask a human to label $D_{\pi}$ with expert actions, then aggregate $D \leftarrow D \cup D_{\pi}$ and retrain.
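A structural sketch of the loop in pseudo-Python; `train_policy`, `run_policy`, and `expert_label` are hypothetical helpers standing in for supervised training, environment rollouts, and human labeling:

```python
def dagger(expert_dataset, num_iters):
    D = list(expert_dataset)
    policy = train_policy(D)           # 1. train pi_theta on D
    for _ in range(num_iters):
        D_pi = run_policy(policy)      # 2. run pi_theta, collect observations o_t
        labeled = expert_label(D_pi)   # 3. ask the human/expert to label them with actions a_t
        D.extend(labeled)              # 4. aggregate: D <- D ∪ D_pi
        policy = train_policy(D)       # retrain and repeat
    return policy
```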
Restrictions
- It is unnatural for humans to label actions offline, outside the control loop.
- In some cases humans are simply unable to provide such labels.