RL Notes

Why is RL important?

The underlying goal of RL is to develop a learning system that can learn on its own with little or no external supervision, so that the system’s capability is not limited by the ability of an external supervisor. If we use supervised learning to train an agent to play Go from human demonstrations, then by definition the agent’s abilities can be no better than those of the demonstrations it was trained on. With RL, the agent is made to explore different paths in search of a reward. Doing this enables the agent to learn strategies that could not even have been thought of by the human benchmark/supervisor.

Methodology (A game)

Deep RL can be framed with a supervised-learning-style architecture: the image frames of the game are fed into a neural network whose output layer consists of actions, say one neuron for up and another for down in a 2-output layer. Just like in a supervised learning network, each output neuron carries some probability of its action being executed. Instead of always going for the action with the highest probability (acting greedily), the agent can be made to sample actions according to these probabilities in order to increase the chances of exploration.
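As a rough sketch of this setup (the flattened-frame input, layer sizes and two-action output below are illustrative assumptions, not taken from any particular implementation):

```python
import numpy as np

# Minimal sketch of the setup above: a small network maps a (flattened) game
# frame to probabilities over 2 actions. Frame size and layer sizes are
# illustrative assumptions only.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((200, 80 * 80)) * 0.01   # hidden layer weights
W2 = rng.standard_normal((2, 200)) * 0.01         # output layer: one neuron per action

def policy_forward(frame):
    """Return [p_up, p_down] for a flattened frame."""
    h = np.maximum(0.0, W1 @ frame)                # ReLU hidden layer
    logits = W2 @ h
    exp = np.exp(logits - logits.max())            # numerically stable softmax
    return exp / exp.sum()

frame = rng.random(80 * 80)                        # stand-in for a preprocessed frame
probs = policy_forward(frame)

# Sample from the probabilities instead of always taking the argmax (acting
# greedily), so the agent keeps exploring.
action = rng.choice(2, p=probs)
```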

If the agent reaches the goal it is rewarded with +1, and if it doesn’t it is given a -1 reward. The objective is to amass as much positive reward as possible. Reward is only given at the end of the game, and each complete run of the game is called an episode. Initially, since the agent is acting randomly, it will probably lose the first few games. But as the agent keeps restarting, at some point a collection of random actions will result in the agent winning the game. When this happens, we can associate those specific frames with the specific actions taken in them: we compute gradients that make the winning sequence of actions more likely in the future. So what policy gradients do is this: whenever we encounter any of the frames from a winning sequence, we increase the probability of its corresponding action being chosen, and for frames from a losing episode we decrease the probability of the corresponding action being chosen if that frame is ever encountered again.
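A minimal sketch of this policy-gradient idea, using a linear policy so the gradient of the log-probability can be written out by hand; the `DummyEnv` environment, the frame size and the learning rate are stand-ins, and a real implementation would use a deep network and an actual game:

```python
import numpy as np

# Policy-gradient sketch: winning episodes (+1) make all their (frame, action)
# pairs more likely, losing episodes (-1) make them less likely.
rng = np.random.default_rng(0)
n_actions, n_features, alpha = 2, 80 * 80, 1e-3
W = np.zeros((n_actions, n_features))              # linear policy weights

class DummyEnv:
    """Toy stand-in environment: random frames, +1/-1 outcome after 10 steps."""
    def reset(self):
        self.t = 0
        return rng.random(n_features)
    def step(self, action):
        self.t += 1
        done = self.t >= 10
        reward = (1.0 if rng.random() < 0.5 else -1.0) if done else 0.0
        return rng.random(n_features), reward, done

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_episode(env):
    """Play one episode, storing a log-probability gradient for every action taken."""
    grads, total_reward = [], 0.0
    frame, done = env.reset(), False
    while not done:
        probs = softmax(W @ frame)
        action = rng.choice(n_actions, p=probs)
        dlogits = -probs                   # d log pi(a | s) / d logits = one_hot(a) - probs
        dlogits[action] += 1.0
        grads.append(np.outer(dlogits, frame))
        frame, reward, done = env.step(action)
        total_reward += reward
    return grads, total_reward

env = DummyEnv()
for _ in range(100):
    grads, total_reward = run_episode(env)
    # Scale every stored gradient by the episode outcome and apply it.
    for g in grads:
        W += alpha * total_reward * g
```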

As a result, as training progresses, the actions that lead to negative reward are slowly going to be filtered out and those that lead to positive rewards are going to remain. All the knowledge of the good and bad actions given a frame is stored in the weights of the neural network.

The obvious problem with policy gradients as described above is that a single bad action at the end of an episode causes every action in that episode to be tagged as bad, even if all the others were good. This is called the Credit Assignment Problem. Its root cause is our sparse reward scheme, where the agent is only rewarded after an entire episode, and it is also behind the sample inefficiency of RL. A sparse reward scheme might eventually work for games like Pong, but for a game like Montezuma’s Revenge it fails completely: the sequence of actions the agent must perform before it sees any reward is so complex that an initial random set of actions is essentially never going to end in a positive reward. The same goes for getting a robot arm to perform a stacking operation. The 7 joints of the robot arm create a high-dimensional action space, so simply sending random commands to the joints would almost never reach the goal.

What makes computer vision work so well with supervised learning is that for every input frame you have the correct output (the ground-truth label) to compare it with; you compute the error and back-propagate it through the layers to update the weights, and as a result it works very well. With reinforcement learning, however, you usually don’t have this ground truth to compare the input frame against.

A way to get around the sparse reward problem is to engage in Reward Shaping: manually designing a reward function that guides the agent to the goal. In the case of Montezuma’s Revenge, this would mean giving the agent a positive reward for performing sub-tasks like avoiding the skull or collecting a key, and a negative reward for being killed. Even though this guides the policy towards the goal more easily, it has some obvious downsides, like having to manually design a custom reward function for every new environment or game, which is simply not scalable. Another downside is that reward shaping suffers from the Alignment Problem: in some cases, the agent finds a way to keep collecting your designed reward without actually doing what you want it to do. It can also end up constraining your agent to act just like a human and, in effect, learn nothing new.
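A hypothetical sketch of what such a hand-designed reward function might look like; the event names and reward magnitudes are made up purely for illustration:

```python
def shaped_reward(events):
    """Hypothetical hand-designed reward for a Montezuma's-Revenge-like level.

    `events` is an assumed dict of booleans describing what happened this step.
    The magnitudes are arbitrary and would need re-tuning for every new game,
    which is exactly why reward shaping does not scale.
    """
    reward = 0.0
    if events.get("collected_key"):
        reward += 1.0
    if events.get("avoided_skull"):
        reward += 0.1
    if events.get("died"):
        reward -= 1.0
    if events.get("reached_goal"):
        reward += 10.0
    return reward

# Example: the agent grabbed the key this step but then got killed.
print(shaped_reward({"collected_key": True, "died": True}))   # 0.0
```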

 

Description of Basic RL algorithms

  1. Value Iteration:
    1.  The algorithm is as follows;
      1. The Q value Q(s,a) is the expected discounted total reward when an agent in state s takes action a. The optimal value function is learnt by mapping each state to the best possible Q value: for each state, evaluate the Q value of every action and take the largest as that state’s value.
      2. The learnt optimal value function is then used to improve the policy: for each state, use the value function to calculate the Q value of every action and map the state to the action with the highest Q value, forming the improved policy.
      3. This improved policy is then used to solve the problem (see the value iteration sketch after this list).
  2. Policy Iteration
    1. The algorithm is as follows;
      1. First, a random policy is initialised
      2. Next, the policy is evaluated; in other words, a value function is generated from the policy by computing the value of each state under that policy and mapping each state to that value.
      3. Then, the computed value function is used to improve the policy: for each state, use it to calculate the Q value of every action and map the state to the action with the highest Q value, forming the improved policy. Steps 2 and 3 are repeated until the policy stops changing (see the policy iteration sketch after this list).
  3. Monte Carlo Learning
    1. This works for episodic tasks only. It is a model-free approach, it works for non-Markovian tasks, and it converges. The value function is updated only after every episode.
    2. Algorithm works as follows;
      1. Roll out the policy multiple times.
      2. For each state, average the returns observed across the sampled trajectories.
      3. This average is your value function w.r.t. the policy (see the Monte Carlo sketch after this list).
    3. There are 2 flavors; first-visit MC and every-visit MC.
    4. First-visit MC is unbiased, whilst every-visit MC is biased but has lower variance.
    5. Both flavors are consistent.
    6. AlphaGo used a flavor of MC (Monte Carlo tree search).
  4. Temporal Difference Learning
    1. It combines ideas from MC and dynamic programming. It is a model-free approach. It bootstraps (like DP) and samples (like MC), so it works in both finite-episode and infinite-horizon settings. It also updates the estimate of the value function immediately rather than waiting until the end of the episode. It assumes the world is Markovian, and its estimate is biased.
    2. TD error is the difference between the TD target and the current value function at a particular state.
    3. Algorithm works as follows
      1. Initialize V(s) = 0 for all states
      2. For each step of each episode, after observing (s_t, r_t, s_t+1):
        1. V(s_t) = V(s_t) + alpha * (r_t + gamma * V(s_t+1) – V(s_t))  (see the TD(0) sketch after this list)
  5. Q learning
    1. It involves learning a Q function through interaction with the world. It is a model-free approach, used in worlds with discrete states. It is an off-policy algorithm: the sampled targets are not necessarily from the policy being followed. The update target takes the max over actions (the greedy choice), while the actions actually executed can come from a different behavior policy, e.g. an epsilon-greedy one. This lets the agent keep trying a variety of actions outside the greedy policy while the Q function still converges.
    2. Algorithm is as follows;
      1. Initialize Q(s,a) for all states and all actions arbitrarily
      2. Get the epsilon-greedy policy w.r.t Q
      3. for each episode:
        1. Set s_1 as starting state
        2. t=1
        3. Til episode terminates
          1. Sample action a_t from the epsilon-greedy policy at state s_t
          2. Take action a_t and observe reward r_t, and next state s_t+1
          3. Update Q: Q(s_t, a_t) = Q(s_t, a_t) + alpha * (r_t + gamma * max_a’ Q(s_t+1, a’) – Q(s_t, a_t))
          4. Choose epsilon-greedy policy w.r.t Q
          5. t=t+1
          6. (a’ ranges over all actions available in s_t+1; the max picks the greedy one. See the Q-learning sketch after this list.)
  6. SARSA
    1. It is an on-policy algorithm: the target is computed from the next action actually sampled from the same policy being followed, rather than from a max over actions. Unless the exploration in that policy is decayed appropriately, it is not guaranteed to converge to the optimal Q function.
    2. Algorithm is as follows (see the SARSA sketch after this list);
      1. Initialize Q(s,a) for all states and all actions arbitrarily
      2. Get the epsilon-greedy policy w.r.t Q
      3. for each episode:
        1. Set s_1 as starting state
        2. t=1
        3. Til episode terminates
          1. Sample action a_t from the epsilon-greedy policy at state s_t
          2. Take action a_t and observe reward r_t, and next state s_t+1
          3. Choose action a_t+1 from the same epsilon-greedy policy at state s_t+1
          4. Update Q: Q(s_t, a_t) = Q(s_t, a_t) + alpha * (r_t + gamma * Q(s_t+1, a_t+1) – Q(s_t, a_t))
          5. Choose epsilon-greedy policy w.r.t Q
          6. t=t+1
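
Below are minimal tabular sketches of the algorithms described above. The toy MDPs and environments, as well as all the constants (sizes, learning rates, discount factors), are arbitrary illustrative choices rather than part of any standard implementation. First, value iteration on a small MDP with a known transition model P(s’ | s, a) and reward function R(s, a):

```python
import numpy as np

# Tabular value iteration on a small, randomly generated MDP.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward;
# both are illustrative stand-ins for a known model of the environment.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)          # map each state to its best Q value
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

# Extract the improved (greedy) policy from the converged value function.
policy = (R + gamma * P @ V).argmax(axis=1)
```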
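Policy iteration on the same kind of known-model MDP, alternating exact policy evaluation with greedy policy improvement until the policy stops changing:

```python
import numpy as np

# Tabular policy iteration: evaluate the current policy exactly, then act
# greedily with respect to the resulting value function, and repeat.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

policy = rng.integers(n_actions, size=n_states)    # 1. start from a random policy
while True:
    # 2. Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
    P_pi = P[np.arange(n_states), policy]          # (n_states, n_states)
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # 3. Policy improvement: act greedily w.r.t. the Q values under V.
    Q = R + gamma * P @ V
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```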
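First-visit Monte Carlo prediction; the `fake_episode` roll-out is a stand-in for running the fixed policy in a real episodic environment:

```python
from collections import defaultdict
import random

# First-visit Monte Carlo: estimate V(s) by averaging, for each state, the
# returns observed the first time that state appears in each episode.
gamma = 0.9
random.seed(0)

def fake_episode():
    """Stand-in for rolling out the policy: a list of (state, reward) pairs."""
    return [(random.randrange(5), random.random()) for _ in range(10)]

returns = defaultdict(list)
for _ in range(1000):                        # roll out the policy many times
    episode = fake_episode()
    G, first_visit_return = 0.0, {}
    for state, reward in reversed(episode):  # compute returns backwards
        G = reward + gamma * G
        first_visit_return[state] = G        # ends up holding the FIRST visit's return
    for state, G_first in first_visit_return.items():
        returns[state].append(G_first)

# Average the sampled returns: this is the value function w.r.t. the policy.
V = {s: sum(g) / len(g) for s, g in returns.items()}
```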
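TD(0) prediction on a trivial 5-state chain, updating the value estimate after every single step rather than at the end of the episode:

```python
import random

# TD(0): update V(s) towards the bootstrapped target r + gamma * V(s')
# immediately after every step. The deterministic chain is only for illustration.
gamma, alpha, n_states = 0.9, 0.1, 5
random.seed(0)
V = [0.0] * n_states

for _ in range(1000):                  # episodes
    s = 0
    while s < n_states - 1:            # the last state is terminal
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        target = r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])    # TD error = target - V(s)
        s = s_next
```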
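Tabular Q-learning with an epsilon-greedy behavior policy on a toy 1-D grid; note that the update target takes the max over actions regardless of which action the behavior policy actually takes next:

```python
import random

# Tabular Q-learning on a toy grid: move left/right, reward on reaching the end.
# Grid size, epsilon, alpha and gamma are illustrative choices.
random.seed(0)
n_states, n_actions = 6, 2             # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r, s_next == n_states - 1

def epsilon_greedy(s):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

for _ in range(500):                   # episodes
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)                      # behavior policy
        s_next, r, done = step(s, a)
        # Off-policy target: greedy max over a', regardless of what the
        # behavior policy will actually do in s_next.
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```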
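SARSA differs only in the target: it uses the next action actually sampled from the same epsilon-greedy policy instead of the greedy max. This sketch reuses the toy `step` and `epsilon_greedy` helpers defined in the Q-learning sketch above:

```python
# SARSA on the same toy grid as the Q-learning sketch; only the inner loop changes.
Q = [[0.0] * n_actions for _ in range(n_states)]

for _ in range(500):
    s, done = 0, False
    a = epsilon_greedy(s)
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)            # sampled from the SAME policy
        # On-policy target: bootstrap from the action that will actually be taken.
        target = r if done else r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])
        s, a = s_next, a_next
```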
