Saturday, August 27, 2011

Some more results

The earlier work was not in vain: I generalized the networks so that they no longer make the symmetry assumptions, but used the symmetric-network weights as starting points for training the generic networks. That allowed relatively fast training.

Here are some results showing the training performance for 20-, 40-, and 80-hidden node networks:

The training worked as before, except that the benchmark was no longer the pubEval player, as that was too weak for sensible comparisons. Instead, each generic network used the maximal-performance version of the symmetric network with the same number of hidden nodes as its benchmark. So for example, the 80-hidden-node training used as its benchmark the 80-hidden-node symmetric network - specifically, the instance of that network that performed best against pubEval during its own training.

In all cases the trained networks performed around +0.2-0.3ppg vs their benchmarks, and in the maximal cases got to about +0.4-0.5ppg. (In the 200-game comparison runs done every 1,000 training iterations.)

These results show that the training does continue to improve the performance of the bots, even without adding a separate race network. However, there are some mysteries that I am still trying to decipher:

  • At each training step, I have to train against both the board and its flipped-perspective counterpart. That makes sense for the final step, where the actual win-or-lose result at the end of the game is used: the update function is called only from the winner's perspective (the winner always makes the last play in a game), so without the flipped board the probability-of-gammon-loss node would never be trained. But I do not understand why the training fails to work if I do not also train on the flipped board at every intermediate step.
  • The network should know which player holds the dice. As mentioned before, this is key to knowing the probability of winning or losing at every point, especially in some end-game scenarios. I tried adding a new input which is 1 if the player holds the dice and 0 otherwise, and initialized the weights for those inputs to zero. However, the training fails if I actually train those weights. The results above were generated without training them.
"Training fails" above means that the weights become large and positive for the probability of win and probability of gammon win nodes, and large and negative for the probability of gammon loss node. And of course the measured performance of the network vs its benchmark is terrible.
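The both-perspectives update in the first bullet can be sketched as follows. This is my own toy illustration, with hypothetical names and a single-layer stand-in for the real network; the outputs are assumed to be [P(win), P(gammon win), P(gammon loss)]:

```python
import numpy as np

def flip_board(board):
    """Opponent's perspective: hypothetical encoding where positive counts
    are the player's checkers, negative the opponent's, points reversed."""
    return -board[::-1]

def train_toward(net, board, targets, alpha=0.1):
    """One gradient step of a toy single-layer stand-in for the real net,
    moving the sigmoid outputs toward the target probabilities."""
    x = board.astype(float)
    out = 1.0 / (1.0 + np.exp(-(net["w"] @ x)))
    grad = (out - targets) * out * (1.0 - out)  # squared-error gradient
    net["w"] -= alpha * np.outer(grad, x)
    return out

rng = np.random.default_rng(0)
net = {"w": rng.normal(0.0, 0.1, size=(3, 6))}
board = np.array([2, 0, -1, 1, 0, -2])

# Game over: this player won a gammon. Training only the winner's view
# never gives the gammon-loss node a nonzero target...
train_toward(net, board, np.array([1.0, 1.0, 0.0]))
# ...so the loser's view is trained too, via the flipped board:
train_toward(net, flip_board(board), np.array([0.0, 0.0, 1.0]))
```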

I am not sure why this is happening, and I want to sort that out before I move on to adding a separate race network.

Tuesday, August 23, 2011

Not so smart again

One trick I used was to assume that when the perspective of the board flips, the probability of win flips to the probability of loss - ie, that the probability player 1 wins is the same as the probability player 2 loses.

However, that's not really the case. The network returns the probability of win given that it's your turn with the dice - so flipping the perspective does not just flip the perspective, but also hands the dice to the other player!

The starting board is one way to see this, but it is true for any symmetric setup: the probability of a win is forced to be 50%. The most extreme example: both players have only two checkers left to bear off, both on their 1-points. The player with the dice then has a 100% chance of winning and the other 0%. In my setup, the network would think both have a 50% chance of winning.
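That extreme example is easy to verify by brute force. A quick sketch of my own (not the bot's code): with all remaining checkers on the 1-point, a non-double roll (30 of 36) bears off two checkers and a double (6 of 36) bears off four, so the win probability for the player on roll can be computed exactly:

```python
def p_win_on_roll(mine, theirs):
    """Probability the player on roll wins when both players' remaining
    checkers all sit on their respective 1-points."""
    p_double = 6.0 / 36.0
    p = 0.0
    for prob, left in ((1.0 - p_double, mine - 2), (p_double, mine - 4)):
        if left <= 0:
            p += prob  # we bear off our last checker and win immediately
        else:
            p += prob * (1.0 - p_win_on_roll(theirs, left))
    return p

# Two checkers each: the roller always wins, but a flip-symmetric
# network is forced to call this position 50/50.
print(p_win_on_roll(2, 2))
```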

So my whole initial assumption about symmetry on board flips is now in question.

But at least it explains one of those issues: the probability of winning the game from the starting position was always 50%. Even adding the bias node didn't change that; the symmetry I enforced always made flipping the perspective equivalent, when it isn't really.

Have to think about how to tweak the setup. This could require some relatively deep changes.

Not so smart

Turns out that the first innovation I mentioned in the last post - training on the conditional probability of gammon loss in the case that the game ends in a gammon - is irrelevant.

That's because the training only ever ends a game on a win for the player, since the winner is always the last side to play in a game.

As part of playing with this, though, I noticed that the probability of a gammon win from the starting board changes quite a bit during the training: anywhere from 10% to 40%. And the network does not recognize that the probability of a gammon is zero once the other player has borne off any checkers. That's one innovation that should help things: force the bot to notice that game fact.
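One way to force that game fact, as a sketch (my own hypothetical helper, not code from the bot): hard-override the raw network outputs with what the rules guarantee, instead of hoping the training learns it:

```python
def apply_game_facts(p_win, p_gammon_win, p_gammon_loss,
                     my_borne_off, opp_borne_off):
    """Override raw network outputs with rule-guaranteed facts: no gammon
    win is possible once the opponent has borne off a checker, and no
    gammon loss once we have."""
    if opp_borne_off > 0:
        p_gammon_win = 0.0
    if my_borne_off > 0:
        p_gammon_loss = 0.0
    return p_win, p_gammon_win, p_gammon_loss
```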

Also, the probability of win from the starting board is exactly 50%. That is by construction, since I require that flipping the board perspective must give the same value. Of course, the probability of win given that a player has the first move is a bit more than 50%, and that should come out of this.

The right thing here, I think, is to add a bias node to the probability of win. That changes the symmetry relationship so that the probability of win is not always 50%.
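A tiny numeric illustration of the point (my own sketch with the hidden layer omitted, assuming flipping the board negates the input vector): without a bias, the sigmoid output at any self-symmetric position is pinned to exactly 0.5; with a bias it can move off 50% to reflect the advantage of being on roll:

```python
import math

def p_win(w, x, b=0.0):
    """Single-output sketch: value = sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w = [0.3, -0.7, 0.2]
x = [0.0, 0.0, 0.0]          # a position that flips to itself
flipped = [-xi for xi in x]

# No bias: symmetry pins the value to exactly 50%.
assert p_win(w, x) == 0.5 and p_win(w, flipped) == 0.5
# With a bias, the self-symmetric position is no longer forced to 50%.
assert p_win(w, x, b=0.2) > 0.5
```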

Monday, August 22, 2011

The next steps

I've finally got a bit of free time to spend on the backgammon bot again, so I'm trying two quite different next steps.

The first is a bit of a tweak on what was already there. It came up because I noticed something unappealing about my current setup. The second output node is the probability of a gammon win conditioned on a win. When a game is over, I set that to 1 if the player wins a gammon (or backgammon). However, I would set it exactly the same way if the second node represented the unconditional probability of a gammon, which seems a bit weird.

One solution to this is to ignore the second node altogether (for training its weights or the middle weights) in the case where the player loses the game. But that seems a bit wasteful. In any case, I wasn't ignoring the second node before in losses, and I think that was incorrect. Perhaps that was why many of my trained networks ended up with artificially large estimates for probability of gammon.

I can do better than just ignoring those games though: if the player loses, instead of setting the "true" value of the second node to zero and training on that, I train on the conditional probability of a gammon loss conditioned on a loss. That's a kind of virtual third node in my setup; its value is entirely determined by the weights of the second node, to satisfy symmetry on flipping the board perspective.
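In code, the target logic for the second node then looks something like this sketch (hypothetical names; the real update of course also propagates through the hidden layer):

```python
def second_node_target(player_won, won_gammon, lost_gammon):
    """Training target for the conditional-gammon output node.
    On a win: train the node on the player's own board, target = 1 iff
    the win was a gammon. On a loss: train the 'virtual third node' --
    the same node evaluated on the FLIPPED board -- target = 1 iff the
    loss was a gammon. Returns (use_flipped_board, target)."""
    if player_won:
        return False, 1.0 if won_gammon else 0.0
    return True, 1.0 if lost_gammon else 0.0
```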

I'm not sure how big a difference that will make, but I'm running some tests now to see.

The second change is more significant: I am adding a second neural network that is used in races, in addition to the one that's already there which I use in contact games. Those two game regimes are significantly different, and all the literature I've read suggests that generalizing to different networks for different qualitative game phases improves things significantly. (This version also includes the train-on-gammon-loss change described above.)
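The dispatch between the two networks needs a test for when contact is broken. A sketch under an assumed encoding (board[i] is the signed checker count on point i+1 from the player's view, positive for the player, who moves from high points toward point 1; checkers on the bar, which always mean contact, are ignored here):

```python
def is_race(board):
    """True when no contact remains: every player checker has passed
    every opponent checker, so nothing can ever be hit."""
    player_points = [i for i, c in enumerate(board) if c > 0]
    opp_points = [i for i, c in enumerate(board) if c < 0]
    if not player_points or not opp_points:
        return True                # one side has borne everything off
    return max(player_points) < min(opp_points)
```

The evaluator would then do something like `net = race_net if is_race(board) else contact_net` before scoring candidate moves.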

Hopefully I'll have some results in a couple of days for both extensions.

Wednesday, August 17, 2011

Stanford Machine Learning course

Interesting - Stanford is offering a free online course in Machine Learning this fall. I expect it'll cover some of the neural network stuff that the backgammon bots are based on.

They're also doing one on Artificial Intelligence and one on Introduction to Databases.

Monday, August 8, 2011

An example game

Here is the output of an example game I played against the bot. I was white and the bot was black:

New results post-bug

The results after fixing the bug are actually not terribly different from the ones before, except shifted down a little in points per game. It is a little surprising how much it looks like the bug just changed the gammon payout from two points to three without affecting the training at all - I would have expected the training path to change a little as well.

Here are the results:

I also added a strategy that is a "human" strategy - ie one that lets me interactively play against the trained neural net strategies. I had it print out the network's estimate of probability of win and gammon at each step in the games I played to get an idea how reasonable they are.

I played a 10-game match against the strongest player - the 80-node network that during training performed best against pubEval (+0.885ppg in the 200-game benchmark match). It actually beat me! I played -0.2ppg against it.

Note: I discovered later on that my pubEval benchmark implementation was wrong; the results above are invalid. PubEval is a much stronger player than it would seem from the numbers above. In any case my network setup also had a significant bug, which I later discovered and fixed.

I'm not a great player. I play around -0.4ppg against the best backgammon bots, with an ER (error rating) around 9. So solidly intermediate according to the tables I've seen.

The 80-node bot isn't perfect - it left a few puzzling shots. For example, it played a 63 off the starting board as 13/4, leaving a shot in its home board right in front of my two back checkers. And its estimate of the probability of a gammon win doesn't swing to zero when the other player has borne off a checker.

But these are really promising results! I feel quite proud that I've built a bot that can beat me (even though I suspect it was just a little lucky in that match...).

Friday, August 5, 2011

Another bug!

Arg - another bug. The game framework was accidentally calling almost all gammons backgammons.

That's most likely why the pubEval comparison was so high compared to the published results, and why so many of the simulation runs ended in backgammons.

Again this invalidates all the previous results, because the networks were effectively trained to treat a gammon as three points instead of two. Perhaps that's also why the trained networks left shots more often than they should have - they were over-incentivized to try for a gammon.
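For reference, the scoring rule the framework should have implemented (the bug effectively collapsed the gammon case into the backgammon case):

```python
def game_points(loser_borne_off, loser_has_checker_back):
    """Winner's score: 1 for a single game; 2 for a gammon (loser bore
    off nothing); 3 for a backgammon (loser bore off nothing AND still
    has a checker on the bar or in the winner's home board)."""
    if loser_borne_off > 0:
        return 1
    return 3 if loser_has_checker_back else 2
```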

So... bug fixed, and off we go again to re-train the networks. <Sigh>

An example game

I took the 80-hidden-node network with the maximum score against pubEval and played it against itself to see the quality of play.

Overall it's pretty solid. I'd say a weak intermediate level of play now, though still making some clear mistakes with leaving shots.

Two output networks results

I finished 2M training runs (again with alpha=beta=0.1, no alpha/beta damping or backtracking) on the new network which has two output nodes, after correcting the bug. Here are the results:

The blue dots are the benchmark runs for the 10-hidden-node network, and the blue line is the rolling 10-point average. Red is the same for the 20-hidden-node network; green is 40 nodes; and purple is 80 nodes.

The bug ended up not making that much difference; the trained networks still perform extremely well against the pubEval benchmark.

So much so that pubEval probably isn't a great benchmark anymore. The trained networks backgammon pubEval almost a quarter of the time, which is not realistic.

Note: I discovered (much) later that I had a significant bug in my pubEval implementation. After fixing that it's a much stronger player. I also had some significant bugs in my network design that held the networks back. Check out posts from Jan 2012 to see proper results.

Now to look at some example games that the trained networks play against themselves to get a feel for how good they are...

Thursday, August 4, 2011

An interesting perspective from the academy

An interesting article I found on training neural nets with different approaches - directly relevant to the approach I'm trying here.

A few observations from this:
  • Using a single neural net isn't optimal - you should use different nets for different parts of the game. For example, one net for bearing in, one for a race, and one for the rest. I've seen this discussed elsewhere as well. The article goes a bit nuts and has nine separate networks for different varieties of game position. I'll get to multiple nets eventually.
  • The optimal value for lambda they found was around 0.6. Not sure how this squares with comments from Tesauro in his later papers that lambda=0 is the best to train with. I've been using lambda=0.
  • They didn't evaluate against pubEval - they created their "expert" network (actually group of nine networks) and trained that for a while, and then used that as the benchmark. While training the expert they chose the rolling best performing version as the benchmark for subsequent training, and decided when to stop training based on when the performance had converged on that basis.
Pretty interesting. The meat of the paper, though, is on how different training methods compare (self-learning vs using the expert at each step to specify the best move vs training against expert v expert matches). The conclusion was that training against the expert's evaluation at each step was the most efficient, followed closely by self-learning, and that training by watching expert v expert games was not very efficient.
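For concreteness, the lambda being compared here is the eligibility-trace decay in the TD(lambda) weight update. A minimal sketch of one step (my own illustration): the trace accumulates exponentially decayed gradients of past value estimates, and lambda=0 collapses it to just the current gradient, i.e. plain TD(0):

```python
import numpy as np

def td_lambda_step(w, grad_y, delta, trace, alpha=0.1, lam=0.6):
    """One TD(lambda) update: decay the eligibility trace, add the current
    value gradient, then step the weights along the trace scaled by the
    TD error `delta` and learning rate `alpha`."""
    trace = lam * trace + grad_y
    return w + alpha * delta * trace, trace
```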

Wednesday, August 3, 2011

A bug!

Uh oh - I discovered a bug deep in the guts of the backgammon engine, in the piece that determines the available moves given a roll. I found it when running my trained network against itself for a 1,000 game benchmark - and found that player 1 would consistently lose 0.3ppg against player 2, even though they were the same strategy. It didn't matter if the strategy was a neural net or pubEval - either way it was broken.

Fixed now, but it does invalidate all my earlier results.

I suspect the really strong performance (stronger than the published TD gammon results) I recorded against pubEval was an accidental consequence of this. I'm re-running the training now for the two-output network (10, 20, 40, and 80 hidden node versions) and we'll see how those come out.

Note: I discovered much later that I had a few significant bugs. The most relevant here was in my implementation of pubEval - after fixing that it was a much stronger player.

Tuesday, August 2, 2011

Extending the outputs

The first network I set up had just one output node, which represents the probability of a win - any kind of win, ignoring the distinction between single wins and gammons/backgammons.

This was a simplification of the original Tesauro work which had four outputs: one for a win, one for a gammon win, one for a loss, and one for a gammon loss. There was no knowledge in the network of backgammons, under the assumption that in practice they happen so rarely that they can be ignored (this is true in regular games between experienced human players).

So the next step in the evolution of my net is to add an understanding of gammons. As with the single-output network, I want the network to do the right thing when you flip the perspective of a board and pass it in. For a multi-output network this means that prob of win->prob of loss, prob of gammon win->prob of gammon loss, and so on.
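Concretely, the consistency condition I want on a perspective flip is (my own sketch, with outputs assumed ordered [P(win), P(gammon win), P(gammon loss)]):

```python
def flipped_outputs(p_win, p_gammon_win, p_gammon_loss):
    """The same position seen from the opponent's perspective: win and
    loss swap, and gammon win and gammon loss swap."""
    return 1.0 - p_win, p_gammon_loss, p_gammon_win
```

Flipping twice should round-trip back to the original outputs, which makes a handy unit test for the symmetry.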

Example of learned gameplay

I tried a game where I played the 40-hidden-node player against itself, and printed out the results (using a fairly simple board representation - no pretty graphics yet!).

In general the player was pretty good. Not great; perhaps better than a beginner human player, but certainly not close to intermediate (which I can judge because I'm intermediate myself and can see lots of places where it made the wrong move).

Particularly, looking at a couple examples of the gameplay, it looks like it's not very good in races or efficiently bearing in. It also leaves shots more than it should.

Monday, August 1, 2011

The next set of results

I turned off backtracking and alpha/beta decay, since that seemed to have minimal effect in my previous experiments. I also dropped alpha and beta to 0.1 to avoid jumping around too much.

Then I let the training run for 2M iterations to see if I could get something that is plausibly converged. Honestly I don't even know if a realistic neural net does converge, but looking at other articles on TD gammon training it looks like they do.
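For reference, alpha and beta here are the two learning rates in the Tesauro-style update: alpha scales the step on the output-layer weights and beta the step on the hidden-layer weights. A toy one-hidden-layer sketch of a single TD step (my own illustration, not the bot's actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def td0_step(params, x, target, alpha=0.1, beta=0.1):
    """One TD(0) update: `target` is the value estimate of the next
    position (or the actual result at game end)."""
    W, v = params["W"], params["v"]       # hidden weights, output weights
    h = sigmoid(W @ x)                     # hidden activations
    y = sigmoid(v @ h)                     # value estimate for x
    delta = target - y                     # TD error
    dy = y * (1.0 - y)                     # sigmoid derivative at the output
    params["v"] = v + alpha * delta * dy * h
    params["W"] = W + beta * delta * dy * np.outer(v * h * (1.0 - h), x)
    return y
```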

Here are the results:

The blue line and blue points are for a network with 10 hidden nodes; green is 20; and orange is 40. The points are the average points per game playing the network against the pubEval benchmark for 200 games. The lines are the rolling 10-point averages.