Introduction
The REINFORCE algorithm is a classic method in policy gradient reinforcement learning. It directly optimizes the policy by maximizing the expected cumulative reward. To understand how REINFORCE uses stochastic gradient ascent to update policy parameters, we need to delve into the problem of computing gradients of the expected reward and the derivation of the policy gradient theorem.
The development of the REINFORCE algorithm and the associated policy gradient theorem was driven by a need to optimize policies in reinforcement learning (RL) environments where the outcome (reward) is uncertain and depends on a sequence of actions. The challenges in computing gradients directly from the expected reward and the eventual adoption of the logderivative trick stem from deep insights into how learning from rewards works in probabilistic environments.
Historical Context and Intuition
The Challenge of Policy Optimization in RL
Before policy gradient methods like REINFORCE were developed, many reinforcement learning approaches focused on valuebased methods, like Qlearning, where the objective is to learn a value function that estimates the expected reward for a given stateaction pair. However, these methods don’t directly optimize the policy itself.
Researchers realized that directly optimizing the policy—learning the parameters that dictate which actions to take—could offer a more direct approach to solving RL problems. The policy \(\pi_\theta(a \mid s)\) is a function that gives the probability of taking action \(a\) given state \(s\), and it is parameterized by \(\theta\).
The goal is to find the optimal policy parameters \(\theta^*\) that maximize the expected cumulative reward:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]\]Where:
 \(\tau\) represents a trajectory (sequence of states, actions, and rewards).
 \(R(\tau)\) is the total reward of the trajectory \(\tau\).
 \(\pi_\theta\) is the policy parameterized by \(\theta\).
 \(\mathbb{E}_{\tau \sim \pi_\theta}\) denotes the expectation over trajectories sampled from the policy \(\pi_\theta\).
The challenge is that \(J(\theta)\) is an expectation, and directly computing its gradient \(\nabla_\theta J(\theta)\) is difficult because the expectation involves a distribution over trajectories, which depends on the policy parameters \(\theta\). The trajectory distribution itself is a complex, implicit function of \(\theta\), making it nontrivial to compute the gradient directly.
Ok, here I said a lot of things regarding the challenges of computing gradients of an expectation but what does it actually mean? Let’s break it down.
But Why is Computing the Gradient of the Expected Reward Difficult?
The primary challenge is that the expected reward \(J(\theta)\) is an expectation over a distribution of trajectories, which are indirectly dependent on the policy parameters \(\theta\).
 Complex Dependency:
 The expectation \(\mathbb{E}_{\tau \sim \pi_\theta}\) involves summing over all possible trajectories \(\tau\), each of which is a sequence of states and actions influenced by the policy \(\pi_\theta\).
 The distribution over trajectories itself depends on the policy parameters \(\theta\), making the expectation highly nonlinear and difficult to differentiate directly.
 Infeasibility of Direct Differentiation:
 To compute the gradient \(\nabla_\theta J(\theta)\) directly, you would need to compute how changes in \(\theta\) affect the entire distribution of trajectories and how these changes in turn affect the expected reward.
 For most practical problems, directly calculating this gradient is infeasible due to the combinatorial explosion of possible trajectories and the implicit, complex relationship between the policy parameters and the trajectories.
Okkkk….? But let’s take another step back and try to understand in more depth:
Understanding the Infeasibility of Direct Gradient Calculation
1. Combinatorial Explosion of Trajectories
Just as mentioned earlier, in reinforcement learning, a trajectory \(\tau\) refers to a sequence of states, actions, and rewards experienced by the agent from the start to the end of an episode. For example, a trajectory might look like this:
\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T).\]
Combinatorial Explosion: The number of possible trajectories grows exponentially with the length of the episode and the number of possible actions at each state. If an environment has \(S\) states, \(A\) actions, and episodes of length \(T\), the number of possible trajectories could be as large as \((S \times A)^T\). This rapid growth in the number of trajectories is what we mean by combinatorial explosion.
 Example: Suppose an environment has 100 states, 10 possible actions per state, and an episode length of 50 steps. The number of possible trajectories is \((100 \times 10)^{50} = 10^{100}\), an astronomically large number.

Implicit Dependence on Policy Parameters: Each trajectory’s probability depends on the policy parameters \(\theta\). Changing \(\theta\) alters the probabilities of taking specific actions in given states, which in turn alters the distribution of trajectories. The relationship between \(\theta\) and the trajectories is implicit and complex because small changes in \(\theta\) can lead to different sequences of states and actions, which accumulate over time.
 Implication: Calculating the gradient of the expected reward directly would require summing over all possible trajectories, each weighted by its probability under the current policy. This would involve evaluating and differentiating an enormous number of terms, which is computationally infeasible.
Now… I hope things are a bit more clear as to why computing the gradient of the expected reward is difficult. But how do we actually go about solving this problem? This is where the logderivative trick and the REINFORCE algorithm come into play.
The LogDerivative Trick
When early researchers realized the challenge of directly computing \(\nabla_\theta J(\theta)\), they sought an alternative way to express the gradient that would be easier to work with. They turned to the field of statistical estimation, where the logderivative (likelihood ratio) trick was already known.
Derivation of the LogDerivative Trick
We’re starting with the goal of computing the gradient of the expected reward \(\mathbb{E}[R_t \mid \pi_\theta]\) with respect to the policy parameters \(\theta\).
The key steps in the derivation are:
 Expectation of the Reward:
 \[\nabla_\theta \mathbb{E}[R_t \mid \pi_\theta] = \nabla_\theta \sum_a \pi_\theta(a) \mathbb{E}[R_t \mid A_t = a]\]
 This expression shows the expectation as a sum over all possible actions \(a\), weighted by the probability \(\pi_\theta(a)\) of taking action \(a\) under the current policy.
 Introducing the \(q(a)\) Function:
 \(q(a) = \mathbb{E}[R_t \mid A_t = a]\) is the expected reward when taking action \(a\).
 Substituting this into the original expression gives us:
 Gradient Expansion:
 Using the product rule (since \(\pi_\theta(a)\) depends on \(\theta\)), we expand the gradient: \(\nabla_\theta \sum_a \pi_\theta(a) q(a) = \sum_a q(a) \nabla_\theta \pi_\theta(a)\)
 Rewriting the Gradient Using \(\pi_\theta(a)\):
 Notice that \(\nabla_\theta \pi_\theta(a) = \pi_\theta(a) \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)}\).
 This allows us to rewrite the gradient as: \(\sum_a q(a) \nabla_\theta \pi_\theta(a) = \sum_a \pi_\theta(a) q(a) \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)}\)
 Factoring out \(\pi_\theta(a)\):
 The expression simplifies to: \(\sum_a \pi_\theta(a) q(a) \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)}\)

Here, we can see that \(\frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)}\) is the score function, often denoted as \(\nabla_\theta \log \pi_\theta(a)\).
 This is still a sum over all possible actions \(a\). The term \(\pi_\theta(a) q(a)\) represents the contribution of action \(a\) to the expected reward, weighted by the likelihood of taking that action under the policy \(\pi_\theta\).

Transition to Expectation:
 Recognize that \(\pi_\theta(a)\) is the probability of taking action \(a\). Therefore, summing over all actions can be viewed as taking an expectation over actions sampled from the policy distribution \(\pi_\theta\).
 We replace the sum with an expectation over the distribution of actions \(A_t\) under policy \(\pi_\theta\): \(\sum_a \pi_\theta(a) \left[ q(a) \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)} \right] = \mathbb{E}_{A_t \sim \pi_\theta} \left[ q(A_t) \frac{\nabla_\theta \pi_\theta(A_t)}{\pi_\theta(A_t)} \right]\)  Here, \(q(A_t) = R_t\) is the reward associated with taking action \(A_t\).
 Finally: \(\mathbb{E}_{A_t \sim \pi_\theta} \left[ R_t \frac{\nabla_\theta \pi_\theta(A_t)}{\pi_\theta(A_t)} \right]\)
The logderivative trick allows us to express the gradient of the expectation in a different form that is more tractable:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \nabla_\theta \log \pi_\theta(\tau) \right]\]Benefits of This Transformation
 SamplingBased Estimation:
 Original Problem: Directly computing \(\nabla_\theta J(\theta)\) would require summing over all possible trajectories and computing their derivatives, which is infeasible.

Transformed Problem: The new expression is an expectation that we can estimate by sampling trajectories from the policy. Instead of needing to consider all possible trajectories, we can generate a few sample trajectories \(\tau \sim \pi_\theta\) and use these samples to approximate the gradient. We turn the problem of computing a complicated expectation over an enormous set into a simpler problem of estimating that expectation by sampling. This is much more feasible in practice, especially in environments where the state and action spaces are large.
 Formally: By expressing the gradient as an expectation, we can approximate it using Monte Carlo methods. We only need to sample actions from the policy, compute the reward, and update the policy based on those samples.
 Gradient Computation Becomes Tractable:
 The transformed expression \(\mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \nabla_\theta \log \pi_\theta(\tau) \right]\) can be computed using the chain rule on the policy network. This is because \(\nabla_\theta \log \pi_\theta(\tau)\) is just the gradient of the logprobability of the trajectory, which is directly computable if we know the policy \(\pi_\theta\). Therefore, we can do stochastic gradient ascent in the space of policy parameters.
Bonus Lesson on Monte Carlo and Gradient Estimation
Monte Carlo (MC) methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. In the context of reinforcement learning (RL) and specifically in policy gradient methods like REINFORCE, Monte Carlo methods are used to estimate the expectation of a function with respect to a probability distribution.
What Happens When You Sample?
 Expectation of a Function:
 Suppose you have a random variable \(X\) with a probability distribution \(p(X)\), and you want to compute the expected value of a function \(f(X)\) under this distribution: \(\mathbb{E}_{X \sim p(X)}[f(X)] = \int f(X) p(X) dX\)
 In practice, for complex functions or highdimensional spaces, computing this expectation analytically is difficult or impossible.
 Monte Carlo Estimation:
 Instead of calculating the integral directly, you can estimate the expectation by taking random samples \(X_i\) from the distribution \(p(X)\) and averaging the function values: \(\mathbb{E}_{X \sim p(X)}[f(X)] \approx \frac{1}{N} \sum_{i=1}^N f(X_i)\)
 As \(N\) (the number of samples) increases, the average converges to the true expectation due to the Law of Large Numbers.
Applying This to Policy Gradients
In the context of the policy gradient theorem:
 Gradient of the Expected Reward:
 The gradient of the expected reward can be expressed as: \(\nabla_\theta J(\theta) = \mathbb{E}_{A_t \sim \pi_\theta} \left[ R_t \frac{\nabla_\theta \pi_\theta(A_t)}{\pi_\theta(A_t)} \right]\)
 This expectation is over the distribution of actions \(A_t\) given by the policy \(\pi_\theta\).
 Sampling from the Policy:
 To estimate this gradient, you don’t need to evaluate every possible action (which would be computationally infeasible in large or continuous action spaces).
 Instead, you draw samples of actions \(A_t\) from the policy \(\pi_\theta\), compute the associated reward \(R_t\) for those actions, and then compute the policy gradient for those specific samples.
 Monte Carlo Estimation of the Gradient:
 The policy gradient can be approximated by averaging the gradients computed for each sample: \(\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N R_{t,i} \frac{\nabla_\theta \pi_\theta(A_{t,i})}{\pi_\theta(A_{t,i})}\)
 Here, \(A_{t,i}\) are the actions sampled from the policy \(\pi_\theta\), and \(R_{t,i}\) are the corresponding rewards.
 As you draw more samples, the average of these gradients converges to the true gradient of the expected reward, allowing you to use this approximation to update the policy parameters \(\theta\).
Key Takeaways:

Why It Works: By expressing the gradient as an expectation, you transform a potentially intractable sum (over all possible actions) into something you can approximate using a manageable number of samples. The Monte Carlo method is powerful because it allows you to estimate this expectation (and thus the gradient) by averaging over a relatively small number of samples from the policy.

Key Insight: The crucial insight is that you don’t need to consider every possible action at every step. Instead, by randomly sampling actions according to the policy and averaging the results, you get an unbiased estimate of the true gradient. This is what enables policy gradient methods to scale to complex, highdimensional action spaces.

Implication for Learning: This approach is why we can effectively use stochastic gradient ascent to optimize policies in reinforcement learning. Even though each step in the optimization process is based on a noisy estimate of the gradient, as long as you sample enough and follow the gradient on average, the policy will improve over time.
Code Implementation
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
from torch import nn, relu, softmax
from torch.distributions import Categorical
class PolicyNetwork(nn.Module):
"""
Policy Network that defines the policy π_θ(a  s) as a neural network.
This network takes in the state as input and outputs a probability distribution
over actions.
Args:
 state_dim: Dimensionality of the state space.
 action_dim: Dimensionality of the action space.
 hidden_dim: Number of hidden units in the hidden layer.
"""
def __init__(self, state_dim, action_dim, hidden_dim=128):
super(PolicyNetwork, self).__init__()
# First fully connected layer
self.fc1 = nn.Linear(state_dim, hidden_dim)
# Second fully connected layer (output layer)
self.fc2 = nn.Linear(hidden_dim, action_dim)
def forward(self, x):
"""
Forward pass through the network. The output is a probability distribution
over actions, given the input state.
This corresponds to computing π_θ(a  s).
"""
# Apply ReLU activation to the first layer's output
x = relu(self.fc1(x))
# Compute action probabilities using softmax
x = self.fc2(x)
return softmax(x, dim=1) # Output π_θ(a  s)
def select_action(self, policy_net, state):
"""
Select an action based on the current policy. This involves sampling from the
distribution π_θ(a  s) and returning both the action and its logprobability.
Args:
 policy_net: The policy network (self).
 state: The current state of the environment.
Returns:
 action: The action sampled from the policy.
 log_prob: The logprobability of the selected action.
"""
# Convert state to tensor
state = torch.from_numpy(state).float()
# Get action probabilities from the policy network
probs = policy_net(state)
# Create a categorical distribution over actions
m = Categorical(probs)
# Sample an action from the distribution
action = m.sample()
# Return the action and its logprobability (log π_θ(a  s))
return action.item(), m.log_prob(action)
def compute_returns(self, rewards, gamma=0.99):
"""
Compute the return R(τ) for each time step in a trajectory. This involves
calculating the cumulative discounted reward.
Args:
 rewards: List of rewards obtained during the episode.
 gamma: Discount factor.
Returns:
 returns: List of discounted returns for each time step.
"""
R = 0
returns = []
# Iterate over rewards in reverse order to calculate returns
for r in reversed(rewards):
R = r + gamma * R
returns.insert(0, R) # Insert at the beginning
return returns # This is R(τ) in the theory
# Set up the environment
env = gym.make('CartPolev1')
# Set the seed for reproducibility
torch.manual_seed(0)
np.random.seed(0)
# Initialize the policy network with state and action dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
policy_net = PolicyNetwork(state_dim, action_dim)
def train(policy_net, optimizer, num_episodes=1000, gamma=0.99):
"""
Train the policy network using the REINFORCE algorithm.
Args:
 policy_net: The policy network to be trained.
 optimizer: The optimizer to update the network parameters.
 num_episodes: Number of episodes to train the network.
 gamma: Discount factor for future rewards.
"""
for episode in range(num_episodes):
# Reset the environment to get the initial state
state = env.reset()[0]
log_probs = [] # Store log π_θ(a  s)
rewards = [] # Store rewards
for t in range(1500): # Cap episode length
# Select action and get its logprobability
action, log_prob = policy_net.select_action(policy_net, state)
# Take the action in the environment
state, reward, done, *res = env.step(action)
# Store logprobability and reward
log_probs.append(log_prob)
rewards.append(reward)
if done:
break # End episode if done
# Compute returns R(τ) from rewards
returns = policy_net.compute_returns(rewards, gamma)
returns = torch.tensor(returns)
# Normalize returns (optional but helps with stability)
returns = (returns  returns.mean()) / (returns.std() + 1e8)
# Calculate the loss: sum(log π_θ(a  s) * R(τ))
# MAXIMIZING SOMETHING = MINIMIZING ITS NEGATIVE
#(hence we are still doing gradient ascent)
loss = torch.sum(torch.stack(log_probs) * returns)
# Perform backpropagation to compute gradients
optimizer.zero_grad() # Clear existing gradients
loss.backward() # Compute gradients w.r.t. policy parameters
optimizer.step() # Update policy parameters θ
# Log progress every 100 episodes
if episode % 100 == 0:
print(f"Episode {episode}, Total Reward: {sum(rewards)}")
# Save the trained policy network
torch.save(policy_net.state_dict(), 'policy_net.pth')
# Initialize optimizer (Adam) with learning rate
optimizer = optim.Adam(policy_net.parameters(), lr=0.01)
# Train the policy network using REINFORCE
train(policy_net, optimizer, num_episodes=1000)
def test_policy(policy_net, env, num_episodes=10):
"""
Test the trained policy network by running it in the environment.
Args:
 policy_net: The trained policy network.
 env: The environment to test in.
 num_episodes: Number of episodes to run the policy.
"""
for episode in range(num_episodes):
# Reset environment for each episode
state = env.reset()[0]
total_reward = 0
done = False
while not done:
env.render() # Render the environment
# Select action based on the trained policy
action, _ = policy_net.select_action(policy_net, state)
# Take the action and observe the outcome
state, reward, done, *_ = env.step(action)
total_reward += reward
print(f"Test Episode {episode}, Total Reward: {total_reward}")
env.close()
# Reinitialize the environment for testing with rendering
env = gym.make('CartPolev1', render_mode='human')
test_policy(policy_net, env)