RL Ch1: Q-Learning

RL and Its Components

RL is an online learning paradigm based on trial and error. The RL process can be thought of as a game in which the player (the agent) interacts with the environment to achieve a goal. The agent takes actions and receives rewards from the environment; its goal is to maximize the total reward it receives over time.

RL can be built on both traditional (tabular) ML methods and NN-based methods. This series of notes focuses only on the latter.

We usually call each round of a game an episode.

Gymnasium

A popular library for RL is Gymnasium, the maintained fork of OpenAI's Gym. It provides a wide range of environments for testing RL algorithms, grouped into families such as classic control, Box2D, Atari, and MuJoCo.

You can get started with pip install "gymnasium[box2d]" swig; we will also need PyTorch (the torch package).

To create a game (an environment) in Gymnasium, you can use the following code:

import gymnasium as gym

env = gym.make("CartPole-v1")

There is an option, render_mode, that can be set to "human" to visualize the game. We usually set render_mode="human" only when we want to watch the game, and leave it unrendered (None, the default) otherwise, e.g. during training.
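For example (a minimal sketch):

env = gym.make("CartPole-v1", render_mode="human")  # opens a window and renders every frame
# env = gym.make("CartPole-v1")                     # render_mode defaults to None: no rendering during training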

Useful methods of the environment object include (a minimal loop combining them appears right after the list):

  • observation, info = env.reset() resets the environment and returns the initial observation (a state tensor) along with an info dict.
  • observation, reward, terminated, truncated, info = env.step(action) takes an action and returns the next observation, the reward (a scalar), whether the episode reached a terminal state such as a failure (terminated), whether it was cut off by the time limit (truncated), and some additional information.
  • env.action_space.shape reports the expected shape of an action.
  • action = env.action_space.sample() generates a random action.
  • env.observation_space supports the same methods described above for the action space.
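Putting these together, a minimal random-agent loop looks like this (a sketch using CartPole-v1; any environment works the same way):

import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset()
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()  # pick a random action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated      # stop on failure or on the time limit

print(f"Episode reward: {total_reward}")
env.close()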

Q-Learning

The first RL algorithm we will discuss is Q-Learning. Q-Learning is a model-free, off-policy algorithm that can be used to find the optimal action-selection policy for any given Markov decision process (MDP). The Q-Learning algorithm is based on the Bellman equation, which is used to calculate the Q-value of a state-action pair.

Q-Learning was originally proposed by Chris Watkins in 1989 as a tabular method. Here, however, we use a NN to approximate the Q-function instead.

Q-Learning aims to assign every action a Q-value, which represents the expected total reward the agent will receive if it takes that action in a given state. The agent then selects the action with the highest Q-value; this is known as greedy exploitation. During training, however, acting purely greedily prevents the model from trying other options, so the strategy used during training is usually ε-greedy: with a small probability, choose a random action; otherwise, choose the action greedily.
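A minimal sketch of ε-greedy selection, assuming a q_network that maps a state tensor to one Q-value per action and an integer action_dim (both are placeholders for whatever model and environment you use):

import random
import torch

def select_action(q_network, state, epsilon, action_dim):
    # Explore: with probability epsilon, pick a uniformly random action.
    if random.random() < epsilon:
        return random.randrange(action_dim)
    # Exploit: otherwise pick the action with the highest predicted Q-value.
    with torch.no_grad():
        q_values = q_network(state)
    return torch.argmax(q_values).item()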

The Q-Learning algorithm is based on the following update equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where:

  • $Q(s, a)$ is the Q-value of the state-action pair $(s, a)$.
  • $\alpha$ is the learning rate.
  • $r$ is the reward received after taking action $a$ in state $s$.
  • $\gamma$ is the discount factor.
  • $s'$ is the next state.
  • $a'$ is the next action.
This equation means that the target value for the current pair $(s, a)$ is $r + \gamma \max_{a'} Q(s', a')$: the current reward plus the best Q-value achievable from the next state, discounted by $\gamma$.
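In the tabular setting this update is a single line on a Q table. A minimal sketch, assuming a hypothetical NumPy array Q of shape (num_states, num_actions) and discrete states:

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target: current reward plus the discounted best Q-value of the next state.
    td_target = r + gamma * np.max(Q[s_next])
    # Move Q(s, a) toward the target by a step of size alpha.
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q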

This is the form used by the more traditional, tabular method. For the NN method, we use the following loss function instead:

$$\mathcal{L} = \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2$$

This is just a rewritten form of the previous equation: if we assume the model outputs a tensor of Q-values and take the gradient of this loss (holding the target fixed), the resulting update step matches the tabular update.

The term $\max_{a'} Q_\theta(s', a')$ is approximated by the current model, i.e., what the current model predicts for the next state $s'$.
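As a sketch of what this loss looks like in PyTorch for a single transition (q_network, state, action, reward, next_state, and terminated are placeholders; the full training script below follows the same pattern):

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()

def td_loss(q_network, state, action, reward, next_state, terminated, gamma=0.99):
    # The target is treated as a constant: no gradient flows through the next-state Q-values.
    with torch.no_grad():
        max_q_next = q_network(next_state).max()
        target = reward + gamma * max_q_next * (1.0 - float(terminated))
    current_q = q_network(state)[action]
    return loss_fn(current_q, target)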

Implementation

DQN stands for Deep Q-Network, a NN-based Q-learning algorithm. We will implement it here.

The model below is just a simple MLP; you can use any other architecture you like. The training process follows the equations above, with the $\max_{a'} Q(s', a')$ term estimated by the current network itself (serving as the reference model).

In RL the loss curve may bounce up and down, but you should see a steady increase in the episode reward, which is the more important signal.

For this scenario, the Q-Learning method yields acceptable results.

import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
from torch.optim.lr_scheduler import CosineAnnealingLR
import wandb
from tqdm import trange
import os

wandb.init(project="rl")
device = "cpu"
print(f"Using device: {device}")

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)  
        self.fc2 = nn.Linear(128, 128) 
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

num_episodes = 1000
learning_rate = 3e-4
gamma = 0.99          # discount factor
epsilon = 1.0         # initial exploration rate for the epsilon-greedy policy
epsilon_decay = 0.99  # multiplicative decay applied after each episode
min_epsilon = 0.05    # floor so the agent never stops exploring entirely

env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n 

q_network = QNetwork(state_dim, action_dim).to(device)
if os.path.exists("q_network.pth"):
    q_network.load_state_dict(torch.load("q_network.pth"))
optimizer = optim.Adam(q_network.parameters(), lr=learning_rate)
scheduler = CosineAnnealingLR(optimizer, num_episodes)
loss_fn = nn.MSELoss()

for episode in trange(num_episodes):
    state, _ = env.reset() 
    state = torch.FloatTensor(state).to(device) 
    episode_reward = 0 
    done = False

    while not done:
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = q_network(state)
            action = torch.argmax(q_values).item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # the episode ends on failure or when the time limit is hit
        next_state_tensor = torch.FloatTensor(next_state).to(device) 
        episode_reward += reward

        with torch.no_grad():
            q_next = q_network(next_state_tensor)
            max_q_next = torch.max(q_next).item()
            # Bootstrap from the next state unless it is a true terminal state (not merely truncated).
            target_q = reward + (gamma * max_q_next if not terminated else 0.0)

        q_values = q_network(state)
        current_q = q_values[action]

        loss = loss_fn(current_q, torch.tensor(target_q).to(device)) 
        wandb.log({"loss": loss.item(), "episode": episode, "lr": scheduler.get_last_lr()[0], "epsilon": epsilon, "reward": episode_reward})
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state_tensor
    scheduler.step()
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

env.close()

q_network.eval()
torch.save(q_network.state_dict(), "q_network.pth")
test_env = gym.make("CartPole-v1", render_mode="human")
state, _ = test_env.reset()
done = False

while not done:
    test_env.render()
    state_tensor = torch.FloatTensor(state).to(device)
    with torch.no_grad():
        q_values = q_network(state_tensor)
    action = torch.argmax(q_values).item()
    state, reward, terminated, truncated, _ = test_env.step(action)
    done = terminated or truncated

test_env.close()