This post is part of the Deep Reinforcement Learning Blog Series. For a detailed introduction to the series, the curriculum, and the sources and references used, please check Part 0 - Getting Started.
Structure of the blog
- Part 0 - Getting Started
- Part 1 - Multi-Arm Bandits
- Part 2 - Finite MDP | Prerequisites: python, gymnasium, numpy
- Part 3 - Dynamic Programming | Prerequisites: python, gymnasium, numpy
- Part 4 - Monte Carlo, Temporal Difference & Bootstrapping Methods | Prerequisites: python, gymnasium, numpy
- Part 5 - Deep SARSA and Q-Learning | Prerequisites: python, gymnasium, numpy, torch
- Part 6 - REINFORCE, AC, DDPG, TD3, SAC
- Part 7 - A2C, PPO, TRPO, GAE, A3C
- TBA (HER, PER, Distillation)
Reinforcement learning loosely refers to the area of machine learning where an agent is tasked with learning about the consequences of its actions and, in the process, maximizing the numerical reward signal collected over time.
K-Arm Bandits
In this simplified setting we assume that the problem is non-associative and therefore does not involve learning to act in more than one situation.
Consider the following problem: you are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
Each of the actions has an expected (mean) reward associated with it, called the value of the action. The action selected at time step $t$ is denoted by $A_t$ and the corresponding reward by $R_t$. The value of an arbitrary action $a$ can then be represented as:
\[q_*(a) \ \dot{=} \ \mathbb{E}[R_t|A_t = a ]\]The value of the action is defined as the expected reward associated with choosing the action.
Given the values of all actions, the k-arm bandit problem is trivial to solve: one would simply choose the action with the highest value every time. But in most realistic scenarios the action values are not given and need to be estimated first. Let $Q_t(a)$ denote the estimated value of action $a$ at time $t$.
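To make the setup concrete, here is a minimal sketch of a stationary k-armed testbed in numpy (the class name `Bandit`, the Gaussian true values, and the unit-variance rewards are my own choices for illustration, mirroring the 10-armed testbed in Sutton & Barto):

```python
import numpy as np

class Bandit:
    """A stationary k-armed bandit: each arm's mean reward q_*(a) is fixed."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values, drawn once and then held fixed (stationary).
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        # Reward ~ N(q_*(a), 1), so its expectation is exactly q_*(a).
        return self.rng.normal(self.q_star[action], 1.0)
```

For example, `Bandit(k=10).pull(3)` returns one noisy reward whose long-run average is the (unknown) value of arm 3.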
We have two possible moves here:
- Improve our estimates by sampling all actions, aka Exploration
- Choose the action with the highest estimated value, aka Exploitation
The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement learning. A few methods for doing so are described below.
Action-value Methods
We begin by looking more closely at methods for estimating the values of actions and for using the estimates to make action selection decisions, which we collectively call action-value methods.
Sample Average
Recall that the value of an action is the mean reward received when the action is chosen; one approach therefore is to average all the rewards received for that action, given by:
\[Q_{n+1}(a) \ \dot{=} \ \dfrac{R_1+R_2+\dots+R_{n}}{n}\]This is the estimated value of an action after it has been selected $n$ times. For easier implementation it can be written as a recursive formula as follows:
\[\begin{align*} Q_{n+1}(a) &= \dfrac{1}{n} \left( \sum_{i=1}^{n}{R_i} \right)\\ &= \dfrac{1}{n} \left( R_n + (n-1)\dfrac{1}{(n-1)}\sum_{i=1}^{n-1}{R_i} \right)\\ &= \dfrac{1}{n} \left( R_n + (n-1)Q_n(a) \right)\\ &= \dfrac{1}{n} \left( R_n + nQ_n(a) - Q_n(a) \right)\\ &= Q_n(a) + \dfrac{1}{n}\left[ R_n - Q_n(a)\right] \end{align*}\]The above update rule can be summarised as follows:
\[NewEstimate = OldEstimate + stepSize[Target-OldEstimate]\]A few observations based on the above:
- $[Target-OldEstimate]$ is the error in the estimate that we wish to minimize by taking a step towards the target.
- The stepSize parameter is not necessarily constant (e.g., in the sample average it is $1/n$) and is often denoted by $\alpha_t(a)$
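As a rough sketch (the array names `q_est` and `counts` are mine), the incremental sample-average update is only a couple of lines of numpy:

```python
import numpy as np

k = 10
q_est = np.zeros(k)    # Q(a): current value estimates
counts = np.zeros(k)   # N(a): number of times each action has been selected

def update_sample_average(action, reward):
    """Incremental sample average: Q_{n+1} = Q_n + (1/n) * [R_n - Q_n]."""
    counts[action] += 1
    step_size = 1.0 / counts[action]                      # the 1/n stepSize
    q_est[action] += step_size * (reward - q_est[action])
```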
Given the estimates we can now focus on exploitation. A simple, rather “greedy”, method would be to choose the action with the highest estimated value, which can be represented as:
\[A_t = \text{argmax}_{a}Q_t(a)\]The expression $\text{argmax}_{a}$ denotes that $a$ is chosen such that $Q_t(a)$ is maximised, with ties broken arbitrarily.
The above method of action selection is quite weak as it does no exploration and can therefore act on false assumptions about the action values. A better approach is to exploit most of the time but every now and then explore as well. Methods that do this are called $\varepsilon$-greedy methods, where the decision to explore is made with a small probability $\varepsilon$.
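A minimal sketch of $\varepsilon$-greedy selection (breaking ties by a random choice among the maxima is my own implementation detail):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_est, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_est)))         # explore: any action
    best = np.flatnonzero(q_est == q_est.max())      # all greedy actions
    return int(rng.choice(best))                     # exploit, ties broken arbitrarily
```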
Exponential Recency-Weighted Average
The combination of $\varepsilon$-greedy with sample averages works well for stationary problems, where the reward probabilities do not change with time. But in many real problems this assumption isn’t valid. For problems with non-stationary reward distributions, the solution is to give more weight to recent rewards than to old ones, and one possible way to achieve this is to use a constant stepSize parameter that lies in $(0,1]$.
The estimated value from the sample-average method can be re-written as follows:
\[\begin{align*} Q_{n+1}(a) &= Q_n(a) + \alpha\left[ R_n - Q_n(a)\right]\\ &= \alpha R_n + (1-\alpha)Q_n(a)\\ &= (1-\alpha)^nQ_1(a) + \sum_{i=1}^n \alpha(1-\alpha)^{n-i}R_i \end{align*}\]The final step follows by recursively expanding the expression, in the same spirit as the derivation in the sample-average section, and has therefore been shortened.
The above is called a weighted average because the weights sum to 1; in fact the weight given to a reward decreases exponentially with its age, which is why the method is called the Exponential Recency-Weighted Average.
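In code the only difference from the sample average is the fixed step size; a sketch, assuming a constant `alpha`:

```python
def update_constant_alpha(q_est, action, reward, alpha=0.1):
    """Exponential recency-weighted average: Q_{n+1} = Q_n + alpha * [R_n - Q_n].

    With a constant alpha, a reward received j steps ago carries weight
    alpha * (1 - alpha)**j, so older rewards fade away exponentially.
    """
    q_est[action] += alpha * (reward - q_est[action])
```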
Initial Values & Bias
When an action is selected for the first time in the sample-average method, the denominator is 0, so a default value has to be assumed in that case; similarly, in the Exponential Recency-Weighted Average method the value of $Q_2$ depends on $Q_1$, which again has to be assumed.
Therefore these methods are biased by their initial estimates. For the sample-average methods, the bias disappears once all actions have been selected at least once, but for methods with constant stepSize, the bias is permanent, though decreasing over time as seen in the equation derived above.
In practice this kind of bias is helpful when the initial values chosen are based on expert knowledge. Initial values can also be used to encourage exploration: instead of setting the initial values to 0, we can set them to a very high, “optimistic” value; the agent then tries each action, gets disappointed, and keeps on trying until all the actions have been sampled at least once. This method, often referred to as optimistic initial values, ensures there is exploration at the start even when used with purely greedy action selection, as the sketch below illustrates.
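A sketch of optimistic initialization with a purely greedy learner; the initial value of +5 and the Gaussian testbed are assumptions borrowed from the 10-armed testbed example in Sutton & Barto:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
q_true = rng.normal(0.0, 1.0, size=k)    # true action values (roughly within [-3, 3])
q_est = np.full(k, 5.0)                  # wildly optimistic initial estimates
counts = np.zeros(k)

for t in range(1000):
    action = int(np.argmax(q_est))                       # purely greedy selection
    reward = rng.normal(q_true[action], 1.0)
    counts[action] += 1
    q_est[action] += (reward - q_est[action]) / counts[action]   # sample-average update
# Every untried arm still looks better (5.0) than any arm already sampled, so even
# this purely greedy agent tries each arm at least once before settling down.
```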
Exponential Recency-Weighted Average without Initial Bias
The Exponential Recency-Weighted Average works on non-stationary problems but suffers from initial bias, while the sample-average method is less affected by initial bias but is not effective on non-stationary problems. There is therefore a need for a method that works well on non-stationary problems without any initial bias.
One such method can be formulated with the use of a new stepSize parameter defined as
\[\beta_n \ \dot{=} \ \dfrac{\alpha}{\bar{o}_n}\]used to process the $n$th reward for a particular action, where $\alpha>0$ is the conventional stepSize and $\bar{o}_n$ is a trace of one that starts at 0:
\[\bar{o}_n = \bar{o}_{n-1} + \alpha(1-\bar{o}_{n-1}), \text{ for } n>0, \text{ with } \bar{o}_0 \ \dot{=} \ 0\]
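A sketch of this trick with per-action traces (the array name `o_bar` is mine). Because $\bar{o}_1=\alpha$, the first step size is $\beta_1=1$ and the initial estimate is completely overwritten, removing the bias; as $n$ grows, $\bar{o}_n \to 1$ and $\beta_n \to \alpha$, so the method behaves like the ordinary recency-weighted average in the long run:

```python
import numpy as np

k = 10
alpha = 0.1
q_est = np.zeros(k)    # Q(a); the initial value no longer matters (beta_1 = 1)
o_bar = np.zeros(k)    # per-action trace of one, starting at 0

def update_unbiased(action, reward):
    """Constant-alpha update with bias-correcting stepSize beta_n = alpha / o_bar_n."""
    o_bar[action] += alpha * (1.0 - o_bar[action])
    beta = alpha / o_bar[action]                  # beta_1 = alpha / alpha = 1
    q_est[action] += beta * (reward - q_est[action])
```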
Upper-Confidence-Bound
In the $\varepsilon$-greedy methods, randomly picking an action every once in a while to encourage exploration does help, but it makes much more sense to pick these exploratory actions based on some guided heuristic rather than by random sampling.
One possible alternative, when picking among the non-greedy actions, is to opt for actions that have a higher degree of uncertainty in their estimates; such a mechanism allows us to obtain better overall estimates of all the action values.
The idea of upper-confidence-bound (UCB) action selection is that the square-root term in the selection rule below is a measure of the uncertainty or variance in the estimate of $a$’s value. The quantity being max’ed over is thus a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level.
The action is therefore selected as follows:
\[A_t \ \dot{=} \ \text{argmax}_a \left[Q_t(a) + c\sqrt{\dfrac{\ln t}{N_t(a)}} \right]\]where $N_t(a)$ denotes the number of times action $a$ has been selected prior to time $t$ and $c>0$ controls the degree of exploration. If an action has not yet been selected ($N_t(a)=0$), it is considered to be a maximizing action.
Each time $a$ is selected the uncertainty is presumably reduced: $N_t(a)$ increments, and, as it appears in the denominator, the uncertainty term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate increases.
The use of the natural logarithm means that the increases get smaller over time, but are unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected frequently, will be selected with decreasing frequency over time.
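A sketch of UCB action selection; returning an untried action directly is one common way of treating $N_t(a)=0$ as maximizing:

```python
import numpy as np

def ucb_select(q_est, counts, t, c=2.0):
    """Pick argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                  # untried actions are maximizing by definition
    bonus = c * np.sqrt(np.log(t) / counts)     # uncertainty term
    return int(np.argmax(q_est + bonus))
```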
Gradient-Bandit
So far we have considered methods that estimate action values and use those estimates to select actions. This is often a good approach, but it is not the only one possible.
In this section we consider learning a numerical preference for each action $a$, which we denote as $H_t(a) \in \mathbb{R}$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another is important; if we add 1000 to all the action preferences there is no effect on the action probabilities, which are determined according to a soft-max distribution (i.e., Gibbs or Boltzmann distribution) as follows:
\[\text{Pr}\{A_t=a\} \ \dot{=} \ \dfrac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}} \ \dot{=} \ \pi_t(a)\]where $\pi_t(a)$ is the probability of choosing action $a$ at time $t$. Initially all action preferences are equal (e.g., $H_1(a)=0$ for all $a$), so that every action starts with the same probability of being selected.
There is a natural learning algorithm for soft-max action preferences based on the idea of stochastic gradient ascent. On each step, after selecting action $A_t$ and receiving the reward $R_t$, the action preferences are updated by:
\[\begin{align*} H_{t+1}(A_t) \ &\dot{=} \ H_t(A_t) + \alpha(R_t - \bar{R}_t)(1-\pi_t(A_t)) \\ H_{t+1}(a) \ &\dot{=} \ H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a) \text{ for all } a\neq A_t \\ \end{align*}\]where $\alpha > 0$ is a step-size parameter, and $\bar{R}_t \in \mathbb{R}$ is the average of the rewards up to but not including time t (with $\bar{R}_1 = R_1$), which can be computed incrementally. The $\bar{R}_t$ term serves as a baseline with which the reward is compared. If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased, and if the reward is below baseline, then the probability is decreased. The non-selected actions move in the opposite direction.
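A sketch of one gradient-bandit step (using an incremental running average as the baseline, and including the current reward in it so that $\bar{R}_1=R_1$ holds automatically, are my own implementation choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
alpha = 0.1
h = np.zeros(k)     # preferences H_t(a), all equal initially
r_bar = 0.0         # baseline: running average of rewards
t = 0

def gradient_bandit_step(reward_fn):
    """Sample an action from the soft-max over preferences, then update the preferences."""
    global r_bar, t
    pi = np.exp(h - h.max())
    pi /= pi.sum()                              # soft-max probabilities pi_t(a)
    action = int(rng.choice(k, p=pi))
    reward = reward_fn(action)
    t += 1
    r_bar += (reward - r_bar) / t               # incremental average reward
    one_hot = np.zeros(k)
    one_hot[action] = 1.0
    # Both update rules at once: the selected action gets (1 - pi), the rest get (-pi).
    h += alpha * (reward - r_bar) * (one_hot - pi)
    return action, reward
```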
Convergence Conditions
Not all stepSize sequences guarantee that the action-value estimates converge to the true values. A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:
- The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations.
- The second condition guarantees that eventually the steps become small enough to assure convergence.
These can be expressed as follows:
\[\begin{align*} \sum_{n=1}^{\infty}\alpha_n(a) &= \infty\\ \sum_{n=1}^{\infty}\alpha_n^2(a) &< \infty\\ \end{align*}\]Note that the sample-average step size $\alpha_n(a)=1/n$ satisfies both conditions, whereas a constant step size violates the second one; constant-$\alpha$ estimates therefore never fully converge but keep tracking the most recent rewards, which is exactly what we want for non-stationary problems.
Associative Search
In the methods discussed above, the learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary. However, in a general reinforcement learning task there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations. To set the stage for the full problem, we briefly discuss the simplest way in which nonassociative tasks extend to the associative setting.
As an example, suppose there are several different k-armed bandit tasks, and that on each step you confront one of these chosen at random. Thus, the bandit task changes randomly from step to step. If the probabilities with which each task is selected for you do not change over time, this would appear as a single stationary k-armed bandit task, and you could use one of the methods described in this chapter. Now suppose, however, that when a bandit task is selected for you, you are given some distinctive clue about its identity (but not its action values). Maybe you are facing an actual slot machine that changes the color of its display as it changes its action values. Now you can learn a policy associating each task, signaled by the color you see, with the best action to take when facing that task—for instance, if red, select arm 1; if green, select arm 2. With the right policy you can usually do much better than you could in the absence of any information distinguishing one bandit task from another. This is an example of an associative search task, so called because it involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best.
Associative search tasks are intermediate between the k-armed bandit problem and the full reinforcement learning problem. They are like the full reinforcement learning problem in that they involve learning a policy, but they are also like our version of the k-armed bandit problem in that each action affects only the immediate reward. If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
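To make the idea concrete, here is a minimal sketch of an associative (contextual) version of $\varepsilon$-greedy that simply keeps one row of estimates per situation; the two-dimensional `q_est` indexed by a context id is my own framing of the colour-clue example above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, k = 3, 10                    # e.g. 3 display colours, 10 arms each
q_est = np.zeros((n_contexts, k))        # one row of value estimates per situation
counts = np.zeros((n_contexts, k))

def act(context, epsilon=0.1):
    """Epsilon-greedy, but the estimates consulted depend on the observed context."""
    if rng.random() < epsilon:
        return int(rng.integers(k))
    return int(np.argmax(q_est[context]))

def learn(context, action, reward):
    """Sample-average update for the (context, action) pair that was just tried."""
    counts[context, action] += 1
    q_est[context, action] += (reward - q_est[context, action]) / counts[context, action]
```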