Reinforcement Learning Basics
Introduction
This is my takeaway note for the Fundamentals of Reinforcement Learning course, the first course of the Reinforcement Learning specialization by the University of Alberta. The lecturers are all from Sutton's group, and Sutton is one of the founders of reinforcement learning! The reference text for this course is Reinforcement Learning: An Introduction, which you can download freely. The most practical and important way to study this course is via programming! You can try OpenAI's Gym and the code examples from Sutton's book.
Differences from Supervised Learning and Unsupervised Learning
Supervised Learning: learns a mapping from the input distribution to the output distribution using labels. It is like homework in primary or middle school that has correct answers.
Unsupervised Learning: learns the inherent structure of the data from the input distribution without labels. It is like open-ended exploration homework with no fixed answer; everyone (each different neural network) gives a different answer to the same question.
Reinforcement Learning: quite different from the end-to-end learning frameworks above. It can be treated as a learning-based control system. Given an agent that can interact with an environment, reinforcement learning optimizes the agent's actions so as to maximize value (the goal/objective). Much like working toward a grand goal in life, we consider each decision (action) carefully.
K-Armed Bandit Problem
Reading: Chapter 2, Sections 2.1-2.7 (pages 25-36) of Sutton's book.
Decision-making under uncertainty can be formulated as a k-armed bandit problem. That is, we have an agent who chooses among k actions and receives a reward that depends on the chosen action.
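As a concrete starting point, below is a minimal sketch of a k-armed bandit environment in Python. It assumes Gaussian rewards in the style of the 10-armed testbed from Sutton's book; the class name `KArmedBandit` and its interface are illustrative assumptions, not something defined by the course.

```python
import numpy as np

class KArmedBandit:
    """A k-armed bandit with Gaussian rewards (illustrative sketch)."""

    def __init__(self, k=10, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)
        # True action values q_*(a); hidden from the agent.
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def step(self, action):
        # Reward is drawn around the true value of the chosen arm.
        return self.rng.normal(loc=self.q_star[action], scale=1.0)
```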
Action-values
The value of an action is defined as the expected reward of that action.
\begin{equation} q_\ast(a) \equiv \mathbb{E}[R_t \mid A_t = a] = \sum_r p(r \mid a)\, r, \quad \forall a \in \{1, \dots, K\}. \end{equation}
The goal is to select the action with the maximum expected reward, \(\mathop{argmax}\limits_{a}\, q_\ast(a)\). Since we do not know \(q_\ast(a)\), we need to estimate it. The simplest method is the sample-average method: we sample the arms and estimate each action value as the average of the rewards observed for that action. Therefore, the estimated action value at time \(t\) is:
\begin{equation} \label{eq:valu-action} Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}} \end{equation}
Here \(\mathbf{1}_{A_i = a}\) is the indicator function: it equals 1 if action \(a\) was taken at step \(i\) and 0 otherwise. If the denominator is zero, \(Q_t(a)\) is defined as some default value, such as 0.
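Below is a minimal sketch of the sample-average estimate in Python, reusing the hypothetical `KArmedBandit` class from the earlier sketch. Arms are chosen uniformly at random here only to keep the example short; the course reading also covers smarter exploration rules such as epsilon-greedy.

```python
import numpy as np

def sample_average_estimates(bandit, steps=1000, seed=0):
    """Estimate q_*(a) with the sample-average method (sketch only)."""
    rng = np.random.default_rng(seed)
    reward_sum = np.zeros(bandit.k)   # sum of rewards when a was taken
    counts = np.zeros(bandit.k)       # number of times a was taken

    for _ in range(steps):
        a = rng.integers(bandit.k)    # pick an arm uniformly at random
        r = bandit.step(a)            # observe the reward R_t
        reward_sum[a] += r
        counts[a] += 1

    # Q_t(a) = sum of rewards / number of times taken; default 0 if never taken.
    return np.divide(reward_sum, counts,
                     out=np.zeros(bandit.k), where=counts > 0)
```

Once the estimates are computed, the greedy action is simply `np.argmax(Q)`.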