Solving Blackjack
Introduction
In this tutorial, we’ll explore and solve the Blackjack-v1 environment
(this means we’ll have an agent learn an optimal policy).
This tutorial is part of the Gymnasium
documentation. A more detailed version with training plots can be found on the Gymnasium website.
The documentation for the blackjack environment is available here.
Blackjack is one of the most popular casino card games that is also infamous for being beatable under certain conditions. This version of the game uses an infinite deck (we draw the cards with replacement), so counting cards won’t be a viable strategy in our simulated game.
Objective: To win, your card sum should be greater than the dealer's without exceeding 21.
Approach: To solve this environment by yourself, you can pick your favorite discrete RL algorithm. The presented solution uses Q-learning (a model-free RL algorithm).
# Imports and Environment Setup
# Author: Till Zemann
# License: MIT License
import gymnasium as gym
import numpy as np
import matplotlib
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch
from collections import defaultdict
matplotlib.use('TkAgg')  # interactive backend for displaying the plots
plt.rcParams['text.usetex'] = True  # requires a local LaTeX installation; set to False if you don't have one
# Let's start by creating the blackjack environment.
# Note: We are going to follow the rules from Sutton & Barto.
# Other versions of the game can be found below for you to experiment with.
env = gym.make('Blackjack-v1', sab=True)
Other possible environment configurations:
env = gym.make('Blackjack-v1', natural=True, sab=False)
env = gym.make('Blackjack-v1', natural=False, sab=False)
The natural flag controls whether an additional reward is given for starting with a natural blackjack (an ace and a ten-card); when sab=True, the environment follows the exact rules from Sutton & Barto and the natural flag is ignored.
Basics: Interacting with the environment
Observing the environment
First of all, we call env.reset() to start an episode. This function resets the environment to a starting position and returns an initial observation. We usually also set done = False. This variable will be useful later to check whether a game is terminated. In this tutorial we will use the terms observation and state synonymously, but in more complex problems a state might differ from the observation it is based on.
# reset the environment to get the first observation
done = False
observation, info = env.reset()
print(observation)
Note that our observation is a 3-tuple consisting of 3 discrete values:
- The player's current sum
- The value of the dealer's face-up card
- A boolean indicating whether the player holds a usable ace (an ace is usable if it counts as 11 without busting)
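If you want to double-check what the environment expects, you can also print its observation and action spaces:
# inspect the spaces of the environment
print(env.observation_space)  # a tuple of three discrete values, as described above
print(env.action_space)       # two discrete actions: 0 = stick, 1 = hit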
Executing an action
After receiving our first observation, we are only going to use the env.step(action) function to interact with the environment. This function takes an action as input and executes it in the environment. Because that action changes the state of the environment, it returns five useful variables to us. These are:
- next_state: the observation that the agent receives after taking the action.
- reward: the reward that the agent receives after taking the action.
- terminated: a boolean variable that indicates whether or not the episode is over (a terminal state was reached).
- truncated: a boolean variable that indicates whether the episode ended early, e.g. by hitting a time limit.
- info: a dictionary that might contain additional information about the environment.
The next_state, reward, terminated and truncated variables are self-explanatory, but the info variable requires some additional explanation. This variable contains a dictionary that might have some extra information about the environment, but in the Blackjack-v1 environment you can ignore it. For example, in Atari environments the info dictionary has an ale.lives key that tells us how many lives the agent has left. If the agent has 0 lives, then the episode is over.
Blackjack-v1 doesn't have an env.render() function to render the environment, but in other environments you can use this function to watch the agent play. It is important to note that using env.render() is optional - the environment works even if you don't render it, but it can be helpful to watch an episode to get an idea of how the current policy behaves. Note that it is not a good idea to call this function in your training loop, because rendering slows down training considerably. Instead, try to build an extra loop to evaluate and showcase the agent after training (a small evaluation loop is sketched after the training section below).
# sample a random action from all valid actions
action = env.action_space.sample()
# execute the action in our environment and receive info from the environment
observation, reward, terminated, truncated, info = env.step(action)
print('observation:', observation)
print('reward:', reward)
print('terminated:', terminated)
print('truncated:', truncated)
print('info:', info)
Once terminated=True or truncated=True, we should stop the current episode and begin a new one with env.reset(). If you continue executing actions without resetting the environment, it will still respond, but the output won't be useful for training (it might even be harmful if the agent learns from invalid data).
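Putting these pieces together, here is a small sketch (using random actions, since we haven't built an agent yet) of a complete interaction loop that resets the environment whenever an episode ends:
# play a few episodes with random actions, resetting whenever an episode ends
for _ in range(3):
    observation, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # random policy, just for illustration
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    print('episode finished with reward:', reward)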
Building an agent
Let's build a Q-learning agent to solve Blackjack-v1! We'll need some functions for picking an action and updating the agent's action values. To ensure that the agent explores the environment, one possible solution is the epsilon-greedy strategy, where we pick a random action with probability epsilon and the greedy action (the one currently valued as the best) with probability 1 - epsilon.
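The update method of the agent implements the standard one-step (tabular) Q-learning rule. With learning rate alpha and discount factor gamma, the update is

Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))

where the maximum over the next state's action values is taken to be 0 once the episode has ended.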
class BlackjackAgent:
    def __init__(self, lr=1e-3, epsilon=0.1, epsilon_decay=1e-4, discount_factor=0.95):
        """
        Initialize a Reinforcement Learning agent with an empty dictionary
        of state-action values (q_values), a learning rate, a discount factor
        and an epsilon.
        """
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))  # maps a state to action values
        self.lr = lr
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.discount_factor = discount_factor

    def get_action(self, state):
        """
        Returns the best action with probability (1 - epsilon)
        and a random action with probability epsilon to ensure exploration.
        """
        # with probability epsilon return a random action to explore the environment
        if np.random.random() < self.epsilon:
            action = env.action_space.sample()
        # with probability (1 - epsilon) act greedily (exploit)
        else:
            action = np.argmax(self.q_values[state])
        return action

    def update(self, state, action, reward, next_state, done):
        """
        Updates the Q-value of an action.
        """
        old_q_value = self.q_values[state][action]
        # the value of the next state is zero if the episode has ended
        max_future_q = (1 - done) * np.max(self.q_values[next_state])
        # one-step Q-learning target: r + gamma * max_a' Q(s', a')
        target = reward + self.discount_factor * max_future_q
        self.q_values[state][action] = (1 - self.lr) * old_q_value + self.lr * target

    def decay_epsilon(self):
        # linearly reduce exploration over time, never going below zero
        self.epsilon = max(self.epsilon - self.epsilon_decay, 0)
To train the agent, we will let it play one episode (one complete game is called an episode) at a time and update its Q-values after each step. The agent will have to experience a lot of episodes to explore the environment sufficiently.
Now we should be ready to build the training loop.
# hyperparameters
epsilon = 0.6
n_episodes = 300_000
epsilon_decay = epsilon / n_episodes  # epsilon decay facilitates less exploration over time

agent = BlackjackAgent(lr=1e-3, epsilon=epsilon, epsilon_decay=epsilon_decay)

def train(agent, n_episodes):
    for episode in range(n_episodes):
        # reset the environment
        state, info = env.reset()
        done = False

        # play one episode
        while not done:
            action = agent.get_action(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated  # the episode ends if it terminated or was truncated
            agent.update(state, action, reward, next_state, done)
            state = next_state

        # reduce exploration after every episode
        agent.decay_epsilon()
Great, let’s train!
train(agent, n_episodes)
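After training, we can build the extra evaluation loop mentioned earlier. Below is a minimal sketch that plays the learned greedy policy (epsilon set to 0) and reports the average reward per episode; the number of evaluation episodes is an arbitrary choice.
# evaluate the greedy policy and report the average reward per episode
n_eval_episodes = 10_000  # arbitrary choice
agent.epsilon = 0.0       # act greedily during evaluation
total_reward = 0.0
for _ in range(n_eval_episodes):
    state, info = env.reset()
    done = False
    episode_reward = 0.0
    while not done:
        action = agent.get_action(state)
        state, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        done = terminated or truncated
    total_reward += episode_reward
print('average reward per episode:', total_reward / n_eval_episodes)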
Visualizing the results
def create_grids(agent, usable_ace=False):
    # convert our state-action values to state values
    # and build a policy dictionary that maps observations to actions
    V = defaultdict(float)
    policy = defaultdict(int)
    for obs, action_values in agent.q_values.items():
        V[obs] = np.max(action_values)
        policy[obs] = np.argmax(action_values)

    X, Y = np.meshgrid(
        np.arange(12, 22),  # player's count
        np.arange(1, 11))   # dealer's face-up card

    # create the value grid for plotting
    Z = np.apply_along_axis(
        lambda obs: V[(obs[0], obs[1], usable_ace)], axis=2, arr=np.dstack([X, Y])
    )
    value_grid = X, Y, Z

    # create the policy grid for plotting
    policy_grid = np.apply_along_axis(
        lambda obs: policy[(obs[0], obs[1], usable_ace)], axis=2, arr=np.dstack([X, Y])
    )
    return value_grid, policy_grid
def create_plots(value_grid, policy_grid, title='N/A'):
    # create a new figure with 2 subplots (left: state values, right: policy)
    X, Y, Z = value_grid
    fig = plt.figure(figsize=plt.figaspect(0.4))
    fig.suptitle(title, fontsize=16)

    # plot the state values
    ax1 = fig.add_subplot(1, 2, 1, projection='3d')
    ax1.plot_surface(X, Y, Z, rstride=1, cstride=1,
                     cmap='viridis', edgecolor='none')
    plt.xticks(range(12, 22), range(12, 22))
    plt.yticks(range(1, 11), ['A'] + list(range(2, 11)))
    ax1.set_title('State values: ' + title)
    ax1.set_xlabel("Player sum")
    ax1.set_ylabel("Dealer showing")
    ax1.zaxis.set_rotate_label(False)
    ax1.set_zlabel(r'$V_{\pi}$', fontsize=14, rotation=0)
    ax1.view_init(20, 220)

    # plot the policy
    fig.add_subplot(1, 2, 2)
    ax2 = sns.heatmap(policy_grid, linewidth=0, annot=True, cmap="Accent_r", cbar=False)
    ax2.set_title('Policy: ' + title)
    ax2.set_xlabel("Player sum")
    ax2.set_ylabel("Dealer showing")
    ax2.set_xticklabels(range(12, 22))
    ax2.set_yticklabels(['A'] + list(range(2, 11)), fontsize=12)

    # add a legend
    legend_elements = [Patch(facecolor='lightgreen', edgecolor='black', label='Hit'),
                       Patch(facecolor='grey', edgecolor='black', label='Stick')]
    ax2.legend(handles=legend_elements, bbox_to_anchor=(1.3, 1))
    return fig
# state values & policy with usable ace (ace counts as 11)
value_grid, policy_grid = create_grids(agent, usable_ace=True)
fig1 = create_plots(value_grid, policy_grid, title='With usable ace')
plt.show()
# state values & policy without usable ace (ace counts as 1)
value_grid, policy_grid = create_grids(agent, usable_ace=False)
fig2 = create_plots(value_grid, policy_grid, title='Without usable ace')
plt.show()
It's good practice to call env.close() at the end of your script so that any resources used by the environment are released.
env.close()
Optimal policy
For reference, the optimal policy and value function for this setting are given in Sutton & Barto; the policy and value plots we obtained above should closely match them.
I hope this tutorial helped you get a grip on how to interact with Gymnasium environments and sets you on a journey to solve many more RL challenges.
It is recommended that you solve this environment by yourself (project-based learning is really effective!). You can apply your favorite discrete RL algorithm or give Monte Carlo ES a try (covered in Sutton & Barto, section 5.3) - this way you can compare your results directly to the book.
Have fun!