Playing TicTacToe with a DQN

Written on August 25, 2019

With the possible exception of computer vision, reinforcement learning has probably captured more of the public imagination than any other area of data science, artificial intelligence, or machine learning. It perhaps most closely mirrors what we think of as intelligence: the machine observes an environment, takes action, and learns from the consequences of those actions. In recent years, huge advances in reinforcement learning (such as DeepMind’s AlphaZero) have allowed machines to match or exceed the greatest human players in difficult games like Chess, Go, or DotA. With this in mind, I turned my attention toward the most challenging game of all: tic tac toe. In this post, I’ll train a deep Q network (DQN) to play tic tac toe. It will learn to play entirely by playing against itself.

A little background

Reinforcement learning has a somewhat-deserved reputation for having a rather high barrier to entry (even relative to other areas of machine learning). A large part of this stems from the fact that the training process follows a very different pattern than, say, supervised learning. In this post, I’ll focus on deep Q networks, but I’ll try to keep the content broadly applicable to other reinforcement learning models (e.g. DDPGs) as well.

The models we’ll build are called agents, and they define functions that map states to actions. This mapping (the strategy used to determine the next action based on the current state) is called a policy. Often, these policies are learned through Q-learning, a process for estimating the optimal discounted reward that results from taking a given action in a given state. The “Q” in Q-learning refers to the Q-value, or discounted future reward, for an action given the current state of the agent and environment. It is defined recursively through the Bellman equation:

\[Q(s, a) := r(s,a) + \gamma \max_{a'} Q(s', a')\]

In this equation, $s$ and $a$ are the current state and action, respectively, $s'$ is the state resulting from the action $a$, and $\gamma$ is a “discount factor” (a real number typically between 0 and 1) used to decrease the weight of future rewards. The equation is greedy and recursive, so if a reward $r$ could be obtained in three actions (and each of those actions optimized the discounted future reward), that reward would appear in $Q(s, a)$ discounted by a factor of $\gamma^{2}$.
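To make the recursion concrete, here is a minimal numpy sketch (purely illustrative, and not the Keras-RL training code we’ll use later) of how a one-step Bellman target is computed from a reward and the estimated Q-values of the next state:

import numpy as np

def bellman_target(reward, next_q_values, gamma=0.9, done=False):
    # terminal states have no future reward to discount
    if done:
        return reward
    # r(s, a) + gamma * max_a' Q(s', a')
    return reward + gamma * np.max(next_q_values)

# a reward of 1 earned on the third action contributes gamma**2 to Q(s, a):
q3 = bellman_target(1.0, np.array([0.0, 0.0]), done=True)  # 1.0
q2 = bellman_target(0.0, np.array([q3, 0.0]))              # 0.9
q1 = bellman_target(0.0, np.array([q2, 0.0]))              # 0.81 == 0.9 ** 2
print(q1)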

If we had perfect knowledge of the Q function, we would be able to choose an optimal action in any given state. In practice, the space of state-action combinations we’d have to explore is often impracticably large. Even a very simple game like tic tac toe has $3^{9} = 19{,}683$ possible board configurations (each of the nine squares is empty, an X, or an O; far fewer are reachable in legal play) and up to $9$ possible actions in any given state, for roughly $1.8 \times 10^{5}$ different pairs of $(s,a)$ we would have to explore. More complicated games like chess have about $10^3$ possibilities for a pair of moves (one move by White and one move by Black) alone, and probably upwards of $10^{40}$ possible legal board configurations. For games with a continuous space of actions, a brute-force search is not just computationally expensive but impossible in principle.

The minor miracle that powers deep Q learning is that, at least under certain conditions, deep neural networks can learn to approximate the Q function!

Getting up and running with RL

There’s been a lot of activity recently focused on making techniques like DQNs and DDPGs more accessible. Two particular things we’ll focus on in this post are OpenAI’s gym, which defines a standard interface for training environments (and includes many example games), and Keras-RL, which allows developers to implement a variety of reinforcement learning algorithms using Keras’s high-level model API. Both libraries can be installed with pip and are reasonably well-documented.

Defining our environment

One interesting aspect of how agents like AlphaZero were trained is that they are essentially “self-taught”: the environment only describes what moves are and aren’t legal, and what constitutes a winning game; AlphaZero did not learn to play Go by studying previous games. We’ll take a similar approach here.

The various environments packaged with the OpenAI gym are generally “single player” games, so we’ll have to be a little clever to define a multiplayer game using the same framework. Each OpenAI gym environment has to define four methods, as well as document some characteristics of its observation space, action space, and reward range. Here’s a stub of this:

import gym
import numpy as np
from gym import spaces


class TicTacToe(gym.Env):
    reward_range = (-np.inf, np.inf)
    # 27 binary indicators: a one-hot encoding of {empty, player 1, player 2}
    # for each of the nine squares
    observation_space = spaces.MultiDiscrete([2 for _ in range(0, 3*3*3)])
    action_space = spaces.Discrete(9)

    # the eight ways to win: three rows, three columns, two diagonals
    winning_streaks = [
        [0, 1, 2], [3, 4, 5], [6, 7, 8],
        [0, 3, 6], [1, 4, 7], [2, 5, 8],
        [0, 4, 8], [2, 4, 6],
    ]

    def seed(self, seed=None):
        pass

    def reset(self):
        # [...]
        return obs

    def step(self, action):
        # [...]
        return obs, reward, done, {}

    def render(self, mode="human"):
        pass

We’ll represent our tic tac toe board to the agent as a flattened $3 \times 3 \times 3$ tensor of binary indicators. The first two dimensions are the board position, and the final dimension indicates whether a space is unoccupied, occupied by player 1, or occupied by player 2. (Note that in principle we would only need two binary variables to encode whether a space is unoccupied, occupied by player 1, or occupied by player 2, but by one-hot encoding each of these possibilities we save the agent from having to spend time learning that more compact representation.) Internally, we track the board state as a flat array of nine integers for convenience.

The agent’s action space consists of the integers 0 through 8, indicating the board position the agent wants to play. These correspond to board positions in the obvious way, i.e.

[ 0, 1, 2,
  3, 4, 5,
  6, 7, 8 ]
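In other words (the helper below isn’t part of the environment; it just illustrates the indexing), the flat action index maps to a (row, column) pair via integer division:

def action_to_coords(action):
    # e.g. action 5 -> row 1, column 2
    return divmod(action, 3)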

The seed function is irrelevant for our purposes since tic tac toe is entirely deterministic.

The function reset resets the game to the initial state and returns the initial observation of the environment. In our case, this means zeroing out the board and returning the corresponding (empty) observation:

    def reset(self):
        # players are tracked internally as 0 and 1; their marks on the board
        # are stored as 1 and 2 (0 means an empty square)
        self.current_player = 0
        self.board = np.zeros(9, dtype='int')
        return self._one_hot_board()

Here, _one_hot_board is a function that transforms the environment’s internal representation of the board state (an array of integers) into the tensor expected by the agent.
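For completeness, here is a minimal sketch of what such a function might look like, assuming the flat length-9 integer representation described above (0 for an empty square, 1 for player 1, 2 for player 2); the actual implementation may differ in details like ordering:

    def _one_hot_board(self):
        # one row per square, one column per state (empty, player 1, player 2)
        one_hot = np.zeros((9, 3), dtype='int')
        one_hot[np.arange(9), self.board] = 1
        return one_hot.flatten()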

The step function is probably the most important of these four: it takes in an action, and returns the new state and reward. Its implementation is also the most complicated:

    def step(self, action):
        exp = {"state": "in progress"}
        
        reward = 0
        done = False

        # handle illegal moves
        if self.board[action] != 0:
            reward = -10 # illegal moves are really bad
            exp = {"state": "done", 
                   "reason":"Illegal move"}
            done = True
            return self._one_hot_board(), reward, done, exp
        
        self.board[action] = self.current_player + 1
        
        # check if the other player could win on their next turn
        for streak in self.winning_streaks:
            if ((self.board[streak] == 2 - self.current_player).sum() >= 2) \
                 and (self.board[streak] == 0).any():
                reward = -1
                exp = {
                    "state": "in progress",
                    "reason": "{} can lose next turn".format(self.current_player),
                }

        # check if we won
        for streak in self.winning_streaks:
            if (self.board[streak] == self.current_player + 1).all():
                reward = 1  # player wins!
                exp = {"state": "done",
                       "reason": "{} has won".format(self.current_player)}
                done = True
        # check if we tied, which we also reward as a win
        if (self.board != 0).all():
            reward = 1
            exp = {"state": "done",
                   "reason": "{} has tied".format(self.current_player)}
            done = True
        
        # move to the next player
        self.current_player = 1 - self.current_player
        
        return self._one_hot_board(), reward, done, exp

The step function returns a tuple with the following information:

  • The new observed state (i.e. the one-hot-encoded board)
  • The reward resulting from the agent’s action
  • A boolean indicating whether or not the game is over
  • Any additional debug information (used in our case to indicate what happened that turn)

In addition, we also update the environment’s internal state with each step. In particular, each call to step represents one player’s move, and the environment alternates which player is moving on each call. (We’ll have to write a special kind of “interleaved agent” to do this; see below.)
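As a quick sanity check (illustrative only; during training, the loop is driven by Keras-RL), two consecutive calls to step correspond to one move by each player:

sanity_env = TicTacToe()
obs = sanity_env.reset()

# player 0 (whose marks render as 1) takes the centre square...
obs, reward, done, exp = sanity_env.step(4)
# ...and the very next call to step is player 1's move (marks render as 2)
obs, reward, done, exp = sanity_env.step(0)
sanity_env.render()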

Picking reward functions is one of the most difficult things to get right when doing reinforcement learning. The reward function I settled on is the following:

  • If the agent wins the game, they get a reward of 1, and the game ends.
  • If the agent makes an illegal move, they get a reward of -10, and the game ends.
  • If the agent makes a move that allows the opposing player to win on the next turn, they get a reward of -1.

The last item (penalizing losing moves) required some trial and error to discover: initially it seemed cleaner to me not to include such a penalty. However, I noticed that this resulted in the two players racing to get three in a row without any concern for the other player’s moves. This is because without a penalty for making a “losing” move, the agent does not learn to pay attention to how close the other player is to winning.

It’s possible that including a reward for “staying alive” might be another way to incentivize avoiding losing moves, and it might be an interesting experiment to see how two agents with alternative reward structures play against one another!

Defining our agent

Recall that we’re trying to train our agent to learn to play tic tac toe by playing against itself. This is a kind of “multi-agent” (i.e. multiplayer) game. Keras-RL doesn’t include any multi-agent functionality out of the box. However, there are some examples in the wild implementing this—I ended up shamelessly copying an interleaved agent implementation from here.

Adding a new type of agent to Keras-RL is actually not especially difficult, and it’s a testament to the library that I (or rather, GitHub user velochy) was able to bend it to this purpose. The main work involved is defining a new class that inherits from the base Keras-RL Agent class and overrides a few methods (specifically the forward and backward passes). I’ll direct the interested reader to velochy’s implementation, and only show the forward and backward passes here:

    def forward(self, observation):
        # forward updates the current agent first
        self.current_agent = (self.current_agent + 1) % self.n
        
        return self.agents[self.current_agent].forward(observation)
    
    def backward(self, reward, terminal):
        # backward trains whichever agent most recently acted
        return (self.agents[self.current_agent]
                .backward(reward, terminal)[:len(self.m_names)])

What’s going on here is that we’re rotating through the individual agents in the MultiAgent with each call to forward. Since our environment rotates through players as well, each time we train on a step we are training the appropriate agent. The fit method is implemented in the base Keras-RL agent (see here), and on each step it calls forward and then backward for the same agent.
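For orientation, here is a bare-bones skeleton of such a wrapper, using the attribute names that appear in the snippet above (the real implementation also has to proxy things like compile, the metric names behind m_names, and weight saving to the underlying agents, all of which are omitted here):

from rl.core import Agent

class MultiAgent(Agent):
    def __init__(self, agents, **kwargs):
        super(MultiAgent, self).__init__(**kwargs)
        self.agents = agents
        self.n = len(agents)       # number of players
        self.current_agent = -1    # bumped to 0 on the first call to forward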

Building the model

For the rest of it, we can build our model just like one normally would with Keras-RL:

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = TicTacToe()
nb_actions = env.action_space.n  # 9 possible moves

def make_dqn():
    model = Sequential()
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
    model.add(Activation('relu'))
    model.add(Dense(27))
    model.add(Activation('relu'))
    model.add(Dense(27))
    model.add(Activation('relu'))
    model.add(Dense(27))
    model.add(Activation('relu'))
    model.add(Dense(nb_actions))
    model.add(Activation('linear'))

    memory = SequentialMemory(limit=50000, window_length=1)
    policy = EpsGreedyQPolicy(eps=0.2)
    dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
                   nb_steps_warmup=100,
                   target_model_update=1e-2,
                   policy=policy)
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])

    return dqn

dqn = make_dqn()

agent = MultiAgent([dqn, dqn])

Here, I’ve wrapped the model definition in a simple function to make it easy to compare training a single shared model against training separate models for player 1 and player 2. (For this post, I’ve focused on the former case.)
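If you do want to experiment with the latter, the only change (at least in this sketch) is to construct two independent DQNs:

# two separately trained networks, one per player
agent = MultiAgent([make_dqn(), make_dqn()])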

Next, let’s compile and train our agent:

agent.compile(Adam(lr=1e-3), metrics=['mae'])

agent.fit(env, nb_steps=100000, visualize=False, verbose=1)

After about 13 minutes on my machine, training completes. The loss, error, and reward values reported are not especially informative given that we’re training our DQN to play itself.
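If you’d like to keep the trained network around (nothing below depends on this), the underlying DQN can persist its weights in the usual Keras-RL fashion; the filename here is just a placeholder:

dqn.save_weights('dqn_tictactoe_weights.h5f', overwrite=True)
# ...and later, after rebuilding the same architecture:
# dqn.load_weights('dqn_tictactoe_weights.h5f')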

Testing it out!

Now, let’s see how our agent does in a “real” game. First, the setup:

test_env = TicTacToe()

done = False
observation = test_env.reset()

agent.training = False

It’s very important to set agent.training = False: in our agent definition, we used an epsilon-greedy policy with $\epsilon = 0.2$, meaning that during training the agent takes a random action with probability 0.2. Setting the training parameter to false disables this random exploration, which is generally what you want at test time. Note that in certain circumstances it can be desirable to keep some randomness in your agent’s actions (e.g. to avoid getting “stuck” in some games), but tic tac toe is definitely not one of those cases.
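As an aside, Keras-RL also lets you declare the test-time behavior up front by passing a separate test_policy when constructing the DQN, which makes the intent a bit more explicit (this isn’t required for the run above; model, memory, and nb_actions refer to the objects defined inside make_dqn):

from rl.policy import EpsGreedyQPolicy, GreedyQPolicy

dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=100, target_model_update=1e-2,
               policy=EpsGreedyQPolicy(eps=0.2),   # exploration during training
               test_policy=GreedyQPolicy())        # purely greedy at test time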

Next let’s play the game:

step = 0
while not done:
    print("Turn: {}".format(step))
    action = agent.forward(observation)
    observation, reward, done, exp = test_env.step(action)
    test_env.render()
    print("\n")
    step += 1

print("A strange game. The only winning move is not to play.")

On my machine, this resulted in:

Turn: 0
0|0|0
-----
0|0|0
-----
0|1|0


Turn: 1
0|0|0
-----
0|0|0
-----
0|1|2


Turn: 2
0|0|1
-----
0|0|0
-----
0|1|2


Turn: 3
0|0|1
-----
0|0|2
-----
0|1|2


Turn: 4
0|0|1
-----
1|0|2
-----
0|1|2


Turn: 5
0|0|1
-----
1|0|2
-----
2|1|2


Turn: 6
0|0|1
-----
1|1|2
-----
2|1|2


Turn: 7
0|2|1
-----
1|1|2
-----
2|1|2


Turn: 8
1|2|1
-----
1|1|2
-----
2|1|2

Note that our game ends in a tie (although we play until the board is full). Also note that a very interesting play happened with player 2’s move in turn 5:

0|0|1    0|0|1
-----    -----
1|0|2 -> 1|0|2
-----    -----
0|1|2    2|1|2

Player 2 chooses a move that successfully blocks player 1 from completing a winning streak on either the diagonal or left-hand vertical axis, even though this move will not directly lead to any 3-in-a-row opportunities for player 2. This is an interesting move, but likely non-optimal: playing in the upper-left corner instead would have similarly blocked player 1, but also left player 2 open to possibly get 3 in a row along the diagonal. Oh well!