Reinforcement Learning (RL) is a distinct form of machine learning used for generally open-ended problems. It can be compared to Supervised Learning (SL), which is essentially a high-dimensional regression task. In SL, we are given a dataset of inputs and outputs, and our task is to learn a function that minimizes the mean squared error (MSE) over our dataset. However, many useful problems cannot be expressed in this framework. Suppose you wanted to teach a virtual humanoid agent to do a flip. What would the dataset be? How could you collect such a dataset? It’s not straightforward. What is straightforward, though, is knowing if the agent actually flipped. We just check:
- Did the agent’s feet leave the ground?
- Were the agent’s feet pointed up at some point in time?
- Did the agent’s feet come back to the ground?
We see that for this task, success can be easily defined, but the exact joint controls required for success cannot. It is in such circumstances that we use RL.
At a high level, RL is a method of trial and error. In policy gradient (PG) methods, we start out with some policy, try a bunch of tactics, see which ones were successful, and guide the policy to be more like the successful ones. In this article, we’re going to explore RL and Policy Gradient Methods from the ground up, culminating in PPO (Schulman et al., 2017), which remains the foundational baseline for PG methods. We’ll also touch on earlier trust-region approaches that inspired PPO.
Preliminaries
Markov Decision Processes
We first want to formalize the idea of agents, environments, and rewards so we can deal with them mathematically. The most common formalization is that of Markov Decision Processes (MDPs) (Bellman, 1957; Sutton and Barto, 2018).
An MDP is defined by the following:
- $\mathcal{S}$: the set of all possible states of the environment
- $\mathcal{A}$: the set of all actions the agent can perform
- $\mathcal{R}$: the set of all rewards the agent may receive
- $P(s', r \mid s, a)$: a transition function, which tells us: if we took action $a$ at state $s$, what is the probability of receiving reward $r$ and transitioning to the next state $s'$?
- $\gamma \in [0, 1]$: the time discount factor, i.e. a reward at time $t+1$ is worth $\gamma$ times less than at time $t$.
- $\rho_0$: the initial state distribution
Within an environment defined by an MDP, we may have a policy $\pi_\theta(a \mid s)$, which tells us the probability distribution we’re currently sampling actions from. For the rest of this article, we’ll assume this is a neural network parameterized by weights $\theta$.
We can generate a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, which is simply a sequence of states, actions, and rewards over time.
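To make these definitions concrete, here’s a minimal sketch of a toy two-state MDP and a rollout from a uniform-random policy. The MDP itself (its transition probabilities and rewards) is made up purely for illustration and has nothing to do with the environments used later.

import random

# A hypothetical 2-state MDP, just to name the (S, A, P, γ, ρ_0) pieces in code.
states = [0, 1]        # S
actions = [0, 1]       # A: action 1 tries to move us to state 1
gamma = 0.99           # γ, the discount factor

def rho_0():
    # ρ_0: always start in state 0
    return 0

def P(s, a):
    # Transition function: returns (next_state, reward).
    # Action 1 reaches state 1 with probability 0.8; landing in state 1 pays +1.
    s_next = 1 if (a == 1 and random.random() < 0.8) else 0
    r = 1.0 if s_next == 1 else 0.0
    return s_next, r

def rollout(policy, T=5):
    # Sample a trajectory τ = (s_0, a_0, r_0, ..., s_T, a_T, r_T).
    s, tau = rho_0(), []
    for _ in range(T + 1):
        a = policy(s)
        s_next, r = P(s, a)
        tau.append((s, a, r))
        s = s_next
    return tau

def uniform_policy(s):
    return random.choice(actions)

print(rollout(uniform_policy))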
Once you’ve ingested all those variables, our goal should be clear. We want to learn the weights $\theta$ for $\pi_\theta$ such that the expected future discounted reward, which we denote $J(\theta)$, is maximized. Formally, we want to maximize

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t.$$

Since we want to use backprop on $\theta$ to optimize this objective, we need to express $\nabla_\theta J(\theta)$ in terms of gradients of $\pi_\theta$. However, the sampling operation $\tau \sim \pi_\theta$ is not differentiable, i.e. gradients cannot pass through the action of sampling from $\pi_\theta$ back into $\theta$. We want the $\nabla_\theta$ to be inside the expectation, not outside.
This is possible once we notice that

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].$$
This lets us simply sample trajectories from the policy, multiply the log-probability of the trajectory by its discounted total reward, and backprop without issue.
We can’t compute the gradient of an estimate, but we can compute the estimate of a gradient!
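For completeness, here is the standard derivation of that identity, writing $p_\theta(\tau)$ for the probability of trajectory $\tau$ under the policy and the environment dynamics:

$$
\nabla_\theta J(\theta)
= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right].
$$

Since

$$
\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log P(s_{t+1}, r_t \mid s_t, a_t),
$$

only the middle sum depends on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which recovers the expression above.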
Vanilla Policy Gradient
OK, enough with all that theory. Let’s see if this thing actually works.
We’re going to use gymnasium’s CartPole-v1 environment, where the goal is to keep a pole attached to the cart in the upright position for as long as possible (Towers et al., 2024).
At every timestep, the environment gives us its state, which in this case is a 4-vector of cart position, cart velocity, pole angle, and pole angular velocity. We then run the state through a policy network, which outputs a distribution over two actions: push left or push right. We receive a reward of $+1$ for every timestep we survive.
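The pseudocode below compresses the gymnasium API a little. For reference, the real reset/step calls look like this (a random action stands in for the policy):

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # obs is the 4-vector of cart/pole state described above
done = False
while not done:
    action = env.action_space.sample()  # 0 = push left, 1 = push right; the policy will pick this later
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated      # the pseudocode folds these into a single `done`
env.close()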
Here’s the pseudocode. The full runnable file is linked here.
policy = Policy()  # a linear layer + softmax
optimizer = optim.Adam(policy.parameters(), lr=alpha)
for iteration in range(num_iterations):
    rewards, action_log_probs = [], []
    state = env.reset()  # initial state from ρ_0
    done = False
    while not done:
        probs = policy(state)                    # (2,); has grad
        action_dist = Categorical(probs)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)  # (1,); has grad
        state, r, done = env.step(action.item())
        action_log_probs.append(log_prob)
        rewards.append(r)
    action_log_probs = torch.stack(action_log_probs)
    # compute the trajectory return R = ∑_{t=0}^T γ^t r_t
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    # to maximize J, we minimize -J
    loss = -(action_log_probs.sum() * R)  # grad only flows through log probs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
At a high level, this will
- Do a rollout using the current policy
- Compute the log-likelihood of that rollout
- Maximize $R$ times that log-likelihood (see the short sketch after this list)
- If $R$ is positive, increase the likelihood of the trajectory
- If $R$ is negative, decrease the likelihood of the trajectory
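To see the last two bullets concretely, here’s a tiny standalone check (the initial log-probability value is arbitrary): the gradient of the loss with respect to a single log-probability is $-R$, so a gradient-descent step pushes that log-probability up when $R > 0$ and down when $R < 0$.

import torch

log_prob = torch.tensor(-0.7, requires_grad=True)  # hypothetical log π(a|s) of one sampled action
for R in (2.0, -2.0):
    loss = -(log_prob * R)
    (grad,) = torch.autograd.grad(loss, log_prob)
    # grad = -R, so the update log_prob -= lr * grad raises log_prob when R > 0 and lowers it when R < 0
    print(f"R={R:+.1f}  dloss/dlog_prob={grad.item():+.1f}")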
Results
The results are… quite bad. If you squint really hard the curve barely trends up, but it still sucks. There are a few obvious problems with this algorithm. First, this environment only gives positive rewards, so $R$ is always positive. This means that no matter how bad the trajectory in an iteration was, the algorithm will always increase its likelihood. The only difference rewards make is the magnitude of the increase. Second, our credit assignment is sequence-level. Every action in a trajectory is rewarded equally based on the entire trajectory’s success, so the algorithm lacks fine-grained control over action choices.
Let’s first focus on the credit assignment issue.
REINFORCE
Instead of rewarding every action in $\tau$ with the full trajectory return $R(\tau)$, we can instead assign timestep-level rewards to each action $a_t$. Looking back at our gradient estimate, we can split it up into timesteps:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'} r_{t'} \right] + \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=0}^{t-1} \gamma^{t'} r_{t'} \right] \tag{4}$$

Why is the second term zero? Intuitively, this is because rewards from the past ($r_{t'}$ for $t' < t$) are effectively constants at time $t$. Once this is factored out, the expectation evaluates to zero cleanly. Let $h_t = (s_0, a_0, r_0, \ldots, s_t)$ be the history up to time $t$. The second term is equivalent to

$$\sum_{t=0}^{T} \mathbb{E}_{h_t}\!\left[ \left( \sum_{t'=0}^{t-1} \gamma^{t'} r_{t'} \right) \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \right].$$

We note that the inner expectation reduces to 0,

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = \sum_{a} \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) = \nabla_\theta 1 = 0,$$

and so does the entire expression. Thus, we see Eq. 4 is equal to

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'} r_{t'} \right].$$

Okay, so what does this tell us? “The policy gradient is preserved under step-level credit assignment if we use the reward-to-go, $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.” (The derivation above weights step $t$ by $\gamma^t G_t$; as is standard, and as the implementation below does, we drop the leading $\gamma^t$.) This lets us assign credit at a finer grain without compromising on correctness. This algorithm is known as REINFORCE (Williams, 1992).
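As a quick sanity check with toy numbers (a 3-step episode with reward $1$ at every step and $\gamma = 0.9$, not taken from the post), the backward recursion $G_t = r_t + \gamma G_{t+1}$ used in the implementation below gives:

import torch

rewards = [1.0, 1.0, 1.0]          # toy 3-step episode
gamma = 0.9
G = torch.zeros(len(rewards))
R = 0.0
for t in reversed(range(len(rewards))):
    R = rewards[t] + gamma * R     # G_t = r_t + γ * G_{t+1}
    G[t] = R
print(G)                           # tensor([2.7100, 1.9000, 1.0000])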
Here’s a trained REINFORCE policy in action. Press Tab to compete against it yourself!
Implementation
We only need to change a few lines of code from our vanilla PG implementation:
policy = Policy()
optimizer = optim.Adam(policy.parameters(), lr=alpha)
for iteration in range(num_iterations):
    rewards, action_log_probs = [], []
    state = env.reset()
    done = False
    while not done:
        probs = policy(state)
        action_dist = Categorical(probs)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        state, r, done = env.step(action.item())
        action_log_probs.append(log_prob)
        rewards.append(r)
    action_log_probs = torch.stack(action_log_probs)
    # compute the reward-to-go G_t for every timestep instead of a single trajectory return
    G = torch.zeros_like(action_log_probs)
    R = 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        G[t] = R
    # weight each action's log prob by its own reward-to-go
    loss = -(action_log_probs * G).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Results
Just this one change appears to make a significant improvement in performance.
This shows that credit assignment matters! However, we’re still quite a way off from solving CartPole, which happens if we keep the pole up for 500 timesteps, or equivalently collect a total reward of 500. The final improvement we need is baselines.
Baselines
If we look back, the other problem we had was that the probability of every action taken was always increased. Our algorithm essentially has no way to tell if an action is bad; it can only characterize how good an action was.
A common-sense way to determine whether an action is good or bad is to measure how well it performed relative to some baseline. One way to do this is to run several rollouts each iteration, take the average of $G_t$ across rollouts (call this $\bar{G}_t$), and use that quantity as a baseline.
So, our new policy gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^{(n)} \mid s_t^{(n)}\right) \left( G_t^{(n)} - \bar{G}_t \right).$$

There is one caveat: our PG estimate will be biased if the baseline is not independent of $G_t^{(n)}$, the return it is subtracted from. So, we need to use the leave-one-out (LOO) mean as the baseline. This means, for trajectory $n$, we only consider the average return across all other rollouts. So, for $N$ rollouts,

$$b_t^{(n)} = \frac{1}{N-1} \sum_{m \neq n} G_t^{(m)}.$$
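To see why independence matters: for any baseline $b$ that does not depend on the action sampled at time $t$ (or on rollout $n$’s own returns), the extra term it introduces vanishes in expectation,

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[ b\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
= b \sum_{a} \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)}
= b\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
= b\, \nabla_\theta 1 = 0,
$$

so the baseline changes the variance of the gradient estimate but not its expectation. If $b$ included rollout $n$’s own return, it could not be pulled out of the expectation like this, which is exactly why we leave rollout $n$ out of its own baseline.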
Another caveat: each rollout need not have the same length, since they fail at different times. So, for each timestep $t$, the mean must be taken only across the rollouts that are still alive at time $t$. This is easier to illustrate in code than in equations, so I’ll just drop that here.
N = args.n_rollouts
env = gym.make_vec("CartPole-v1", num_envs=N)
policy = Policy()
optimizer = optim.Adam(policy.parameters(), lr=alpha)
for iteration in range(num_iterations):
    state, info = env.reset()
    done_envs = np.zeros((N,), dtype=bool)
    rewards, log_probs_history, entropies, alive_masks_history = [], [], [], []
    done = False
    while not done:
        alive_mask = torch.as_tensor(~done_envs, dtype=torch.float32)
        state_tensor = torch.as_tensor(state, dtype=torch.float32)
        log_probs = policy(state_tensor)
        dist = torch.distributions.Categorical(logits=log_probs)
        action = dist.sample()
        state, reward, terminated, truncated, info = env.step(action.numpy())
        rewards.append(torch.as_tensor(reward, dtype=torch.float32) * alive_mask)
        log_probs_history.append(dist.log_prob(action) * alive_mask)
        entropies.append(dist.entropy().detach())
        alive_masks_history.append(alive_mask)
        done_envs |= terminated | truncated
        done = np.all(done_envs)
    T = len(log_probs_history)
    action_log_probs = torch.stack(log_probs_history)  # (T, N)
    alive_masks = torch.stack(alive_masks_history)     # (T, N)
    rewards = torch.stack(rewards)                     # (T, N)
    # reward-to-go per timestep, per rollout
    G = torch.zeros_like(action_log_probs)
    running = torch.zeros(N, dtype=action_log_probs.dtype)
    for t in reversed(range(T)):
        running = rewards[t, :] + gamma * running
        G[t, :] = running
    # Leave-one-out mean baseline for unbiased gradient estimation:
    # each rollout's baseline excludes its own return to maintain independence
    alive_count = alive_masks.sum(dim=-1, keepdim=True)  # (T, 1)
    total_G = G.sum(dim=-1, keepdim=True)                # (T, 1)
    baseline = (total_G - G) / (alive_count - 1).clamp(min=1)
    A = G - baseline
    loss = -(action_log_probs * A).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Most of the diff between this and the previous version comes from using the vectorized env,
which requires us to (a bit tediously) keep track of the done state of multiple trajectories running in parallel.
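As a quick sanity check of the leave-one-out arithmetic, here’s a toy example with a single timestep and three live rollouts (numbers made up):

import torch

G = torch.tensor([[3.0, 1.0, 2.0]])            # (T=1, N=3): returns of three rollouts at one timestep
alive = torch.ones_like(G)                     # all three rollouts still alive
total_G = G.sum(dim=-1, keepdim=True)          # 6.0
alive_count = alive.sum(dim=-1, keepdim=True)  # 3
baseline = (total_G - G) / (alive_count - 1).clamp(min=1)
print(baseline)  # tensor([[1.5000, 2.5000, 2.0000]]): each return compared to the mean of the other two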
Results
The mean baseline blows the other two out of the park! It typically solves the task (achieves 500 reward) in less than 200 steps. Also note that while the previous plots ran for 1k steps, this one only runs for 250!
Demo
Here’s a policy trained with this algorithm, fully running in your browser.
Hit tab to compete with it.
References
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347.
- Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics. Vol. 6, No. 5, pp. 679-684. doi:10.1512/iumj.1957.6.56038. https://doi.org/10.1512/iumj.1957.6.56038.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning.
- Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J. J., Tan, H., and Younis, O. G. (2024). Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv preprint arXiv:2407.17032. doi:10.48550/arXiv.2407.17032. https://arxiv.org/abs/2407.17032.
- Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. Vol. 8, No. 3-4, pp. 229-256. doi:10.1007/BF00992696. https://doi.org/10.1007/BF00992696.