Sunday, July 31, 2011

TD Gammon neural network strategy

Once I got the basic backgammon framework working, I needed to build a player that uses the TD gammon neural net to make board value decisions, and that can evolve itself through training.

In terms of the TD gammon setup itself, the best source online I found is actually an assignment from a computer science class at Rice University. It goes through the basic learning algo in some detail and gives some tips on how to make it work.

For inputs I followed the standard Tesauro setup that doesn't encode anything about backgammon strategy - it is purely a sensible way of encoding the board layout. That's probably the coolest thing about the neural net players - there was very little game wisdom put into the game by human experts, which meant that the trained nets do not suffer from any biases that human players may have. In the case of backgammon, that turned out to be a huge deal: the neural net players ended up superior to all human players, and ended up changing the way humans played.


There are 196 inputs, 98 for each player:

  • For each spot on the board, we use four inputs to encode the number of checkers. The first of the four inputs has value 1 if the spot has at least 1 checker, 0 otherwise. The second input equals 1 if there are at least 2 checkers, 0 otherwise. The third input equals 1 if there are at least 3 checkers, 0 otherwise. The fourth input equals (number of checkers-3)/2 if there at least 4 checkers, 0 otherwise. For one player that gives us 24*4=96 inputs.
  • The 97th input represents the number of checkers hit / 2. The factor of 2 is meant to give it a sensible unit-level scale.
  • The 98th input represents the number of checkers borne in / 15.
Note that this is a little different to the standard Tesauro inputs - it does not have the last two inputs that represent whose turn it is. That's due to the symmetry I tried to enforce: everything is calculated from the perspective of the player whose turn it is, whoever that is.

The network has three layers: the inputs, the hidden middle nodes, and the outputs. 

For my first player, I differed again from the standard Tesauro approach and used only one output node, representing the odds of winning the game. So this first player knows nothing about gammons or backgammons, and does not optimize the expected number of points - just the probability of any win. This is of course suboptimal, but it is simpler, and after two failed attempts in the past I wanted to be a bit conservative.

The hidden nodes depend on the inputs in the standard way:




where y_i is the value of the i'th hidden node, w_ij is the weight of that node with respect to the j'th input, and x_j is the j'th input. N is the number of inputs, fixed as above at 198.

The output node value z depends on the middle node values y_i in a similar way:




where v_i is the weight of the output to the i'th hidden node and M is the number of hidden nodes (which is a parameter we can tune, but is normally something like 40).

One interesting consequence of the symmetry in perspective I put into the framework is that it's easy to take a board from the perspective of player 0 and calculate the probability of a win, then flip perspective and calculate the probability again.

These two probabilities should of course sum to 1, but they do not in general. So I looked for a way to enforce that.

The easiest and most elegant way I found was to define a relationship between the weights. The details are slightly tricky and left as an exercise to the reader, but if you add constraints on the middle weights such that the weight for the reference player is negative the corresponding weight for the opponent, then the middle nodes flip properly. To get the output to respect the perspective shift you need one more constraint: all the output weights must sum to zero.

So these constraints effectively reduce the number of weights from M*196+M to M*98+(M-1). For M=40 this reduces the number of weights from 7,880 to 3,959. This should help speed up the numerical search.

I'm not sure whether other neural nets follow this same approach to reducing the number of weights based on enforced symmetry - I assume it's a pretty standard strategy though.

Anyways, given the weights and the inputs (based on the board layout) you can calculate the hidden node values, and then use those to calculate the output, which represents the probability of a win (remember, any win - it doesn't distinguish between single wins and gammons/backgammons). That is exactly what the strategy returns as the boardValue.

For learning, I followed the usual approach. I did alpha splitting (so using alpha for the learning speed for the output weights, and another speed beta for the hidden node weights). My symmetry rules changed the equations a bit, since I need to only update half the hidden node weights, and the other half automatically update. Also you get another term in the expressions when you take the derivative of output with respect to a given middle weight. Similarly for the output weights, if you represent the constraint by making the final output weight equal the negative of the sum of all the previous ones.

To initialize the weights everything I've read just says "use small random numbers". I couldn't find anything that noted what "small" meant exactly, so I used weights uniformly distributed between -0.1 and +0.1.


No comments:

Post a Comment