RL Ch1 QLearning
RL and Its Components
RL is an online learning paradigm based on trial and error. The RL process can be thought of as a game in which the player (the agent) interacts with the environment to achieve a goal. The agent takes actions and receives rewards from the environment, and its goal is to maximize the total reward it collects over time.
RL can be combined with both traditional ML algorithms and NN-based methods. This series of notes focuses only on the latter.
We usually call each round of a game an episode.
Gymnasium
A popular library for RL is Gymnasium, the maintained fork of OpenAI's Gym (now developed by the Farama Foundation). It provides a wide range of environments for testing RL algorithms, from simple classic-control tasks to much harder ones.
You can simply install gymnasium[box2d] and swig (e.g. via pip install "gymnasium[box2d]" swig) to get started; we will also need PyTorch.
To create a game (an environment) in Gymnasium, you can use the following code:
import gymnasium as gym
env = gym.make("CartPole-v1")
There is an option, render_mode, that can be set to "human" to visualize the game. We usually only use "human" mode to watch the game, and leave it unrendered (None) in other cases.
Useful methods for the environment object include:
- observation, info = env.reset() resets the environment and returns the initial observation (a state array) plus an info dict.
- observation, reward, terminated, truncated, info = env.step(action) takes an action and returns the next observation, the reward (a scalar), whether the episode has ended (terminated, e.g. the game was failed), whether it was cut off (truncated, i.e. ran out of time), and some additional information.
- env.action_space.shape checks the accepted shape of the action tensor.
- action = env.action_space.sample() generates a random action.
- env.observation_space provides the same methods mentioned above for the action space.
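To see how these pieces fit together, here is a minimal sketch that plays one episode with random actions (using CartPole-v1, the same environment as in the implementation later in this chapter):
import gymnasium as gym

env = gym.make("CartPole-v1")            # unrendered (render_mode=None) for speed
observation, info = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()   # random action, no learning yet
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated       # the episode failed or ran out of time
env.close()
print(f"Episode reward: {total_reward}")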
Q-Learning
The first RL algorithm we will discuss is Q-Learning. Q-Learning is a model-free, off-policy algorithm that can be used to find the optimal action-selection policy for any given MDP. The Q-Learning algorithm is based on the Bellman equation, which is used to calculate the Q-value of a state-action pair.
Q-Learning was originally proposed by Chris Watkins in 1989 as a tabular, traditional-ML method. Here, however, we use a NN to approximate the Q-function instead.
Q-Learning aims to assign every action a Q-value, which represents the expected total reward the agent will receive if it takes that action in a given state. The agent then selects the action with the highest Q-value, known as greedy exploitation. During training, however, acting purely greedily prevents the model from trying other options, so the strategy used during training is usually a randomized greedy one (often called ε-greedy): with a small probability the agent picks a random action, and the rest of the time it chooses greedily.
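As a small sketch, ε-greedy selection can be written as a helper like the one below (select_action is a hypothetical name; the full training loop later inlines the same logic):
import random
import torch

def select_action(q_values: torch.Tensor, epsilon: float, num_actions: int) -> int:
    # With probability epsilon explore with a random action, otherwise exploit greedily
    if random.random() < epsilon:
        return random.randrange(num_actions)
    return torch.argmax(q_values).item()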
The Q-Learning algorithm is based on the following update equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where:
- $Q(s, a)$ is the Q-value of the state-action pair $(s, a)$.
- $\alpha$ is the learning rate.
- $r$ is the reward received after taking action $a$ in state $s$.
- $\gamma$ is the discount factor.
- $s'$ is the next state.
- $a'$ is the next action.
This equation means that the Q-value for the current pair $(s, a)$ is pushed toward the target $r + \gamma \max_{a'} Q(s', a')$, i.e. the current reward plus the most optimal Q-value of the future, decayed by $\gamma$.
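As a toy illustration of this update rule, a tabular version (with made-up state/action counts, not the NN version used later) looks like this:
import numpy as np

alpha, gamma = 0.1, 0.99                 # learning rate and discount factor
Q = np.zeros((16, 4))                    # hypothetical Q-table: 16 states, 4 actions

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])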
This is the more traditional, tabular ML form. For the NN method, we use the following loss function instead:

$$L(\theta) = \left( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right)^2$$

This is just a rewritten form of the previous equation, assuming the model outputs a tensor of Q-values (one per action): taking the gradient of this loss and applying a gradient step yields the same kind of update as before.
The term $\max_{a'} Q_\theta(s', a')$ can be approximated by the current model, i.e. by what the current model predicts for the next state $s'$.
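A minimal PyTorch sketch of this loss (assuming q_network maps a state tensor to a vector of Q-values, as in the implementation below; dqn_loss is just an illustrative helper):
import torch
import torch.nn.functional as F

def dqn_loss(q_network, state, action, reward, next_state, terminated, gamma=0.99):
    # The target is computed without gradients: it acts as a fixed reference value
    with torch.no_grad():
        max_q_next = q_network(next_state).max()
        target = reward + gamma * max_q_next * (1.0 - float(terminated))
    current_q = q_network(state)[action]  # Q_theta(s, a) for the action actually taken
    return F.mse_loss(current_q, target)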
Implementation
DQN stands for Deep Q-Network, a NN-based Q-learning algorithm. We will implement it here.
The model below is just a simple MLP; you can use any other architecture you like. The training process follows the previous equations directly, with the $\max_{a'} Q(s', a')$ term approximated by the current model itself (acting as the reference model) and computed without gradients.
The loss curve may bounce up and down in RL, but you should see a steady increase in the reward, which is what matters.
For this scenario, the Q-Learning method yields acceptable results.
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
from torch.optim.lr_scheduler import CosineAnnealingLR
import wandb
from tqdm import trange
import os
wandb.init(project="rl")
device = "cpu"
print(f"Using device: {device}")
class QNetwork(nn.Module):
def __init__(self, state_dim, action_dim):
super(QNetwork, self).__init__()
self.fc1 = nn.Linear(state_dim, 128)
self.fc2 = nn.Linear(128, 128)
self.fc3 = nn.Linear(128, action_dim)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = torch.relu(self.fc2(x))
x = self.fc3(x)
return x
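# Hyperparameters: epsilon starts fully random and decays toward min_epsilon each episode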
num_episodes = 1000
learning_rate = 3e-4
gamma = 0.99
epsilon = 1.0
epsilon_decay = 0.99
min_epsilon = 0.05
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
q_network = QNetwork(state_dim, action_dim).to(device)
if os.path.exists("q_network.pth"):
q_network.load_state_dict(torch.load("q_network.pth"))
optimizer = optim.Adam(q_network.parameters(), lr=learning_rate)
scheduler = CosineAnnealingLR(optimizer, num_episodes)
loss_fn = nn.MSELoss()
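# Training loop: act ε-greedily, compute the TD target, and update the network after every step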
for episode in trange(num_episodes):
state, _ = env.reset()
state = torch.FloatTensor(state).to(device)
episode_reward = 0
done = False
while not done:
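        # ε-greedy action selection: explore with probability epsilon, otherwise act greedily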
if random.random() < epsilon:
action = env.action_space.sample()
else:
with torch.no_grad():
q_values = q_network(state)
action = torch.argmax(q_values).item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # stop on failure or on the time limit
next_state_tensor = torch.FloatTensor(next_state).to(device)
episode_reward += reward
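        # TD target: reward plus the discounted max Q-value of the next state, computed without gradients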
with torch.no_grad():
q_next = q_network(next_state_tensor)
max_q_next = torch.max(q_next).item()
        # Only bootstrap from the next state if the episode did not actually terminate (fail)
        target_q = reward + (gamma * max_q_next if not terminated else 0.0)
q_values = q_network(state)
current_q = q_values[action]
loss = loss_fn(current_q, torch.tensor(target_q).to(device))
wandb.log({"loss": loss.item(), "episode": episode, "lr": scheduler.get_last_lr()[0], "epsilon": epsilon, "reward": episode_reward})
optimizer.zero_grad()
loss.backward()
optimizer.step()
state = next_state_tensor
scheduler.step()
epsilon = max(min_epsilon, epsilon * epsilon_decay)
env.close()
q_network.eval()
torch.save(q_network.state_dict(), "q_network.pth")
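# Watch the trained agent play one rendered episode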
test_env = gym.make("CartPole-v1", render_mode="human")
state, _ = test_env.reset()
done = False
while not done:
test_env.render()
state_tensor = torch.FloatTensor(state).to(device)
with torch.no_grad():
q_values = q_network(state_tensor)
action = torch.argmax(q_values).item()
    state, reward, terminated, truncated, _ = test_env.step(action)
    done = terminated or truncated
test_env.close()