PPO from the ground up

Reinforcement Learning (RL) is a distinct form of machine learning used for generally open-ended problems. It can be compared to Supervised Learning (SL), which is essentially a high-dimensional regression task. In SL, we are given a dataset of inputs and outputs, and our task is to learn a function that minimizes the mean squared error (MSE) over our dataset. However, many useful problems cannot be expressed in this framework. Suppose you wanted to teach a virtual humanoid agent to do a flip. What would our dataset be? How could you collect such a dataset? It’s not straightforward. What is straightforward, though, is knowing whether the agent actually flipped. We just check:

  • Did the agent’s feet leave the ground?
  • Were the agent’s feet pointed up at some point in time?
  • Did the agent’s feet come back to the ground?

We see that for this task, success can be easily defined, but the exact joint controls required for success cannot. It is in such circumstances we use RL.

At a high level, RL is a method of trial and error. In Policy Gradient (PG) methods, we start out with some policy, try a bunch of tactics, see which ones were successful, and guide the policy to be more like the successful ones. In this article, we’re going to explore RL and Policy Gradient methods from the ground up, culminating in PPO (Schulman et al., 2017), which has remained the foundational baseline for PG methods. We’ll also touch on earlier trust-region approaches that inspired PPO.

Preliminaries

Markov Decision Processes

We first want to formalize the idea of agents, environments, and rewards so we can deal with them mathematically. The most common formalization is that of Markov Decision Processes (MDPs) (Bellman, 1957; Sutton and Barto, 2018).

An MDP is defined by the following:

  • $\mathcal{S}$: the set of all possible states of the environment
  • $\mathcal{A}$: the set of all actions the agent can perform
  • $\mathcal{R}$: the set of all rewards the agent may receive
  • $P(s', r \mid s, a)$: a transition function, which tells us: if we took action $a$ at state $s$, what is the probability of receiving reward $r$ and transitioning to the next state $s'$?
  • $\gamma$: the time discount factor, i.e. a reward at time $t+1$ is worth $\gamma$ times less than a reward at time $t$.
  • $\rho_0$: the initial state distribution
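To make these pieces concrete, here is a small made-up MDP written out as plain Python data. The states, actions, transition probabilities, and rewards below are invented purely for illustration:

# A toy thermostat MDP. P maps (state, action) -> list of (prob, next_state, reward).
P = {
    ("cold", "heat"): [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
    ("cold", "wait"): [(1.0, "cold", 0.0)],
    ("warm", "heat"): [(0.8, "hot", -1.0), (0.2, "warm", 1.0)],
    ("warm", "wait"): [(1.0, "warm", 1.0)],
    ("hot",  "heat"): [(1.0, "hot", -1.0)],
    ("hot",  "wait"): [(1.0, "warm", 0.0)],
}
states = {"cold", "warm", "hot"}   # S
actions = {"heat", "wait"}         # A
rewards = {-1.0, 0.0, 1.0}         # R
gamma = 0.99                       # discount factor
rho_0 = {"cold": 1.0}              # always start cold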

Within an environment defined by an MDP, we may have a policy $\pi_\theta(a \mid s)$, which tells us the probability distribution we’re currently sampling actions from. For the rest of this article, we’ll assume this is a neural network parameterized by weights $\theta$.

We can generate a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, which is simply a sequence of states, actions, and rewards over time.

Once you’ve ingested all those variables, our goal should be clear. We want to learn the weights $\theta$ for $\pi_\theta$ such that the expected future discounted rewards, which we denote $J(\theta)$, are maximized. Formally, we want to maximize

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$

Since we want to use backprop on $\pi_\theta$ to optimize this objective, we need to express $\nabla_\theta J(\theta)$ in terms of the gradient of $\pi_\theta$. However, the sampling operation $\tau \sim \pi_\theta$ is not differentiable, i.e. gradients cannot pass through the act of sampling from $\pi_\theta$ back into $\theta$. We want the $\nabla_\theta$ to be inside the expectation, not outside.
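As a quick PyTorch illustration (not from the original code): the sampled action itself carries no gradient, while its log-probability does.

import torch
from torch.distributions import Categorical

logits = torch.tensor([0.2, -0.1], requires_grad=True)
dist = Categorical(logits=logits)

action = dist.sample()             # an integer-valued tensor; sampling detaches from the graph
print(action.requires_grad)        # False: no gradient flows through the sample itself

log_prob = dist.log_prob(action)   # differentiable with respect to the logits
log_prob.backward()
print(logits.grad)                 # populated: gradients flow through log_prob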

Moving the gradient inside the expectation is possible once we notice that

$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right]$$
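For completeness, here is the standard log-derivative (score function) derivation behind this identity, written out under the definitions above:

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int \nabla_\theta \pi_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right],
\end{aligned}
$$

using $\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$. Since $\pi_\theta(\tau) = \rho_0(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, P(s_{t+1}, r_t \mid s_t, a_t)$ and the dynamics terms don’t depend on $\theta$, the score $\nabla_\theta \log \pi_\theta(\tau)$ reduces to $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which is exactly the sum of action log-probabilities accumulated in the code below.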

This lets us simply sample trajectories from the policy, multiply the log-probability of the trajectory by its discounted total reward, and backprop without issue.

We can’t compute the gradient of an estimate, but we can compute the estimate of a gradient!

Vanilla Policy Gradient

OK, enough with all that theory. Let’s see if this thing actually works.

We’re going to use gymnasium’s CartPole-v1 environment, where the goal is to keep a pole attached to the cart in the upright position for as long as possible (Towers et al., 2024).

At every timestep, the environment gives us its state, which in this case is a 4-vector of pole angle, pole angular velocity, cart position, and cart velocity. We then run the state through a policy network, which outputs a distribution over two actions: push left or push right. We receive a reward of +1 for every timestep we survive.
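The pseudocode below assumes a Policy module that the full file defines. A minimal sketch consistent with the “linear layer + softmax” comment might look like the following (the real architecture and hyperparameters may differ); it outputs log-probabilities, which Categorical(logits=...) accepts directly:

import torch
import torch.nn as nn

class Policy(nn.Module):
    """Map a 4-dim CartPole state to log-probabilities over the two actions."""
    def __init__(self, obs_dim: int = 4, n_actions: int = 2):
        super().__init__()
        self.linear = nn.Linear(obs_dim, n_actions)

    def forward(self, state):
        # log-softmax for numerical stability; Categorical(logits=...) accepts log-probs
        return torch.log_softmax(self.linear(state), dim=-1)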

Here’s the pseudocode. The full runnable file is here.

policy = Policy()  # a linear layer + softmax
optimizer = optim.Adam(policy.parameters(), lr=alpha)

for iteration in range(num_iterations):
    rewards, action_log_probs = [], []
    state, _ = env.reset()  # initial state from ρ_0
    done = False
    while not done:
        state = torch.as_tensor(state, dtype=torch.float32)
        log_probs = policy(state)  # (2,); has grad
        action_dist = Categorical(logits=log_probs)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)  # scalar; has grad
        state, r, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        action_log_probs.append(log_prob)
        rewards.append(r)
    action_log_probs = torch.stack(action_log_probs)
    # compute the trajectory return R(τ) = ∑_{t=0}^T γ^t r_t
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    # to maximize J, we minimize -J
    loss = -(action_log_probs.sum() * R)  # grad only flows through the log probs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

At a high level, this will

  • Do a rollout using the current policy
  • Compute the likelihood of that rollout
  • Maximize R times the log-probability of the trajectory
    • If R is positive, increase the likelihood of the trajectory
    • If R is negative, decrease the likelihood of the trajectory

Results

Average rewards over 100 runs, each running for 1000 iterations.

The results are… quite bad. If you squint really hard, the curve barely trends up, but it still sucks. There are a few obvious problems with this algorithm. First, this environment only gives positive rewards, so $R(\tau)$ is always positive. This means that no matter how bad the trajectory in an iteration was, the algorithm will always increase its likelihood. The only difference rewards make is the magnitude of the increase. Second, our credit assignment is sequence-level. Every action in a trajectory is rewarded equally based on the entire trajectory’s success, so the algorithm lacks fine-grained control over action choices.

Let’s first focus on the credit assignment issue.

REINFORCE

Instead of rewarding every action in $\tau$ with $R(\tau)$, we can instead assign timestep-level rewards to each action $a_t$. Looking back at our gradient estimate, we can split it up into timesteps:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'} r_{t'} \right] + \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=0}^{t-1} \gamma^{t'} r_{t'} \right]$$

Why is the second term zero? Intuitively, this is because rewards from the past ($r_{t'}$ for $t' < t$) are effectively constants at time $t$. Once this is factored out, the expectation evaluates to zero cleanly. Let $h_t = (s_0, a_0, r_0, \ldots, s_t)$ be the history up to time $t$. The second term is equivalent to

$$\sum_{t=0}^{T} \mathbb{E}_{h_t}\!\left[ \left( \sum_{t'=0}^{t-1} \gamma^{t'} r_{t'} \right) \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \right]$$

We note that the inner expectation reduces to 0

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = \sum_{a} \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) = \nabla_\theta 1 = 0$$

And so does the entire expression.

Thus, we see the split gradient above reduces to its first term:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} \gamma^{t'} r_{t'} \right]$$

Okay, so what does this tell us? “The policy gradient is preserved under step-level credit assignment if we use the reward-to-go, $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.” This lets us assign credit at a finer grain without compromising correctness. This algorithm is known as REINFORCE (Williams, 1992).
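As a tiny, made-up sanity check of the reward-to-go computation: with three timesteps of reward 1 and $\gamma = 0.9$,

rewards, gamma = [1.0, 1.0, 1.0], 0.9

# G_t = r_t + γ * G_{t+1}, computed by sweeping backwards over the trajectory
G, R = [0.0] * len(rewards), 0.0
for t in reversed(range(len(rewards))):
    R = rewards[t] + gamma * R
    G[t] = R

print(G)  # approximately [2.71, 1.9, 1.0]: earlier actions are credited with more future reward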

Here’s a trained REINFORCE policy in action. Press Tab to compete against it yourself!

Implementation

We only need to change a few lines of code from our vanilla PG implementation:

policy = Policy()
optimizer = optim.Adam(policy.parameters(), lr=alpha)

for iteration in range(num_iterations):
    rewards, action_log_probs = [], []
    state, _ = env.reset()
    done = False
    while not done:
        state = torch.as_tensor(state, dtype=torch.float32)
        log_probs = policy(state)
        action_dist = Categorical(logits=log_probs)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        state, r, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        action_log_probs.append(log_prob)
        rewards.append(r)
    action_log_probs = torch.stack(action_log_probs)
    # reward-to-go G_t = ∑_{t'=t}^T γ^{t'-t} r_{t'}, computed by sweeping backwards
    G = torch.zeros_like(action_log_probs)
    R = 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        G[t] = R
    # weight each action's log-prob by its own reward-to-go
    loss = -(action_log_probs * G).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Results

Just this one change appears to make a significant improvement in performance.

Average rewards over 100 runs, each running for 1000 iterations.

This shows that credit assignment matters! However, we’re still quite a way off from solving CartPole, which happens when we keep the pole up for 500 timesteps, or equivalently collect a trajectory reward of 500. The final improvement we need is a baseline.

Baselines

If we look back, the other problem we had was that the probability of an action being taken was always increased. Our algorithm essentially has no way to tell if an action is bad; it can only characterize how good it was.

A common-sense way to determine whether an action is good or bad is to measure how well it performed relative to some baseline. One way to do this is to run several rollouts each iteration, take the average of the reward-to-go $G_t$ across rollouts, call this $b_t$, and use that quantity as a baseline.

So, our new policy gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b_t \big) \right]$$

There is one caveat: our PG estimate will be biased if $b_t$ is not independent of the trajectory it is applied to. So, we need to use the leave-one-out (LOO) mean as the baseline. This means, for trajectory $i$, we only consider the average return across all other rollouts. So, for $N$ rollouts,

$$b_t^{(i)} = \frac{1}{N-1} \sum_{j \neq i} G_t^{(j)}$$
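In code, the LOO mean can be computed for all rollouts at once with a subtract-and-divide trick (the same one used in the snippet below). A tiny standalone illustration with made-up returns:

import torch

G = torch.tensor([[10.0, 20.0, 30.0]])  # returns of 3 rollouts at a single timestep
baseline = (G.sum(dim=-1, keepdim=True) - G) / (G.shape[-1] - 1)
print(baseline)  # tensor([[25., 20., 15.]]): each entry is the mean of the *other* rollouts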

Another caveat: rollouts need not all have the same length, since they fail at different times. So, for each timestep $t$, the mean must be taken only over the rollouts that are still alive at $t$. This is easier to illustrate in code than in equations, so I’ll just drop that here.

N = args.n_rollouts
env = gym.make_vec("CartPole-v1", num_envs=N)
for iteration in range(num_iterations):
    state, _ = env.reset()
    done_envs = np.zeros((N,), dtype=bool)
    rewards, log_probs_history, entropies, alive_masks_history = [], [], [], []
    done = False
    while not done:
        alive_mask = torch.as_tensor(~done_envs, dtype=torch.float32)
        state_tensor = torch.as_tensor(state, dtype=torch.float32)
        log_probs = policy(state_tensor)
        dist = torch.distributions.Categorical(logits=log_probs)
        action = dist.sample()
        state, reward, terminated, truncated, info = env.step(action.numpy())
        rewards.append(torch.as_tensor(reward, dtype=torch.float32) * alive_mask)
        log_probs_history.append(dist.log_prob(action) * alive_mask)
        entropies.append(dist.entropy().detach())  # entropy tracking (not used in this snippet)
        alive_masks_history.append(alive_mask)
        done_envs |= terminated | truncated
        done = np.all(done_envs)

    T = len(log_probs_history)
    action_log_probs = torch.stack(log_probs_history)  # (T, N)
    alive_masks = torch.stack(alive_masks_history)  # (T, N)
    rewards = torch.stack(rewards)  # (T, N)
    # reward-to-go per rollout, computed by sweeping backwards over time
    G = torch.zeros_like(action_log_probs)
    running = torch.zeros(N, dtype=action_log_probs.dtype)
    for t in reversed(range(T)):
        running = rewards[t, :] + gamma * running
        G[t, :] = running

    # Leave-one-out mean baseline for unbiased gradient estimation.
    # Each rollout's baseline excludes its own return to maintain independence;
    # dead rollouts contribute 0 to total_G, so the mean is over alive rollouts only.
    alive_count = alive_masks.sum(dim=-1, keepdim=True)  # (T, 1)
    total_G = G.sum(dim=-1, keepdim=True)  # (T, 1)
    baseline = (total_G - G) / (alive_count - 1).clamp(min=1)
    A = G - baseline
    loss = -(action_log_probs * A).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Most of the diff between this and the previous version comes from using the vectorized env, which requires us to (a bit tediously) keep track of the done state of multiple trajectories running in parallel.

Results

The mean baseline blows the other two out of the park! It typically solves the task (achieves the 500 reward cap) in fewer than 200 iterations. Also note that while the previous plots ran for 1000 iterations, this one only runs for 250!

Demo

Here’s a policy trained with this algorithm, fully running in your browser. Hit tab to compete with it.


References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347.
  2. Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics. Vol. 6, No. 5, pp. 679-684. doi:10.1512/iumj.1957.6.56038. https://doi.org/10.1512/iumj.1957.6.56038.
  3. Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press. https://mitpress.mit.edu/9780262039246/reinforcement-learning.
  4. Towers, M., Kwiatkowski, A., Terry, J., Balis, J. U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J. J., Tan, H., and Younis, O. G. (2024). Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv preprint arXiv:2407.17032. doi:10.48550/arXiv.2407.17032. https://arxiv.org/abs/2407.17032.
  5. Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning. Vol. 8, No. 3-4, pp. 229-256. doi:10.1007/BF00992696. https://doi.org/10.1007/BF00992696.
LLM transparency

Codex CLI and Claude Code wrote the browser CartPole implementation and plotting code. All writing, training code, research, and organization are my own.