Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> Bot Development >> Stochastic tree-building with deterministic eval
(Message started by: Fritzlein on Feb 16th, 2007, 6:08pm)

Title: Stochastic tree-building with deterministic eval
Post by Fritzlein on Feb 16th, 2007, 6:08pm
Omar recently pointed me to an article on MoGo, which revived my interest in something 99of9 proposed in an earlier thread.  MoGo uses stochastic methods both for building a search tree, and for evaluating the leaves on the tree.  Stochastic evaluation (i.e. random playouts) makes sense for go on a small board, but performs terribly for Arimaa for reasons 99of9 pointed out: flinging rabbits forward always looks good because they might get through to goal against random defense.  In fact, JDB's bot with random playouts was a darned strong player within a few moves of the end of the game, but that's just because that's the one time when flinging rabbits forward actually is good.  In the opening that bot just gave up rabbits for nothing.  For the most part I think we can rule out playouts as a method of evaluating the leaves on our search tree, as 99of9 said.

But if our evaluation is deterministic, that raises the question of why we would ever search stochastically.  It seemed, to my intuition, that we might as well just do forward pruning.  For example, if you are eventually going to examine all the moves at the root, then start by examining all of them.  Why wait?  If at the next level of depth, you are eventually going to expand 50% of the nodes into a deeper search tree, then just do that right away.  I couldn't see any benefit of building a search tree randomly and incrementally as compared to simply doing stochastic forward pruning in a breadth-first fashion.

Now, however, the possible benefit finally dawns on me.  We don't really know in advance how many moves will seem to be worth investigating further.  It may be that many moves look equally promising, in which case we want more breadth to find a hidden pitfall or a hidden treasure.  It may be that only a couple of moves look good, in which case we want more depth to make a finer distinction between those few good options.  In other words, we need a function to manage the breadth/depth tradeoff based on actual evaluations, rather than based on pre-determined percentages.

Furthermore, by analogy to how I actually analyze moves as a human, the depth/breadth tradeoff varies based on what my search indicates.  I may quickly settle on three moves as being promising, but when I look a bit deeper, it seems that none of the three are good after all, so I go back to the root and find a fourth option.  On the other hand, if two of the three seem to be crushing, I want to play out only those two a bit deeper to be really sure, and will probably not seriously consider anything else.  Thus I want forward pruning, but not forward pruning where I can't later go back and open up a new line of investigation.

Reading about how MoGo builds a search tree convinced me that, even with a deterministic evaluation function, I might still want to build my search tree stochastically.  If I expand the root node and do static eval on each leaf, I get a deterministic number, but that number is not reliable.  It will change based on further examination, and I can't predict whether it will change for the better or for the worse.  So even with a deterministic evaluation it is as if it contained a random component.  I can't rely on the leaf evaluation for irrevocable forward pruning: I have to be ready for the idea that, the more I look at what seems to be a good move, the worse it may later appear, in which case I need to look at other moves as well.

This completely dovetails with the problem of the gambler who has a choice of many slot machines to play.  He can experiment by playing all the machines for a while, but before long he will want to concentrate on the machines with the higher payout, and play those more often.  On the other hand, he doesn't want to get stuck playing only the "best" machine when he isn't really sure that it is best.  Particularly if the slot machine he thought was best doesn't pay out for a while, he might want to revisit machines he had tried before and rejected, but even if his favorite machine is as good as expected, he might want to experiment a little bit.

I was unaware that this gambling problem had been mathematically "solved".  By solved I mean that there is a way of deciding which machine to play next on every pull that guarantees you will eventually play the best machine way more than any other.  Applied to building a search tree, it means that you will eventually search the best (i.e. most important) lines the most deeply.
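To make that concrete, the "solved" form of this is the multi-armed bandit problem, and the standard answer is the UCB1 rule: keep playing the machine with the best average payout so far plus an optimism bonus that shrinks the more that machine has been tried.  Here is a minimal Python sketch of the idea; the payout rates and the exploration constant are made up purely for illustration, not taken from any paper or bot.

Code:
import math
import random

# Minimal UCB1 sketch for the slot-machine (multi-armed bandit) problem.
# The payout probabilities below are invented purely for illustration.
payouts = [0.2, 0.5, 0.35]        # hidden "true" payout rate of each machine
pulls   = [0] * len(payouts)      # how often each machine has been played
rewards = [0.0] * len(payouts)    # total reward collected from each machine

def ucb1_choice(total_pulls, c=1.4):
    # Play every machine once before playing any machine twice.
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    # Otherwise pick the best average payout plus an optimism bonus that
    # shrinks as a machine gets tried more often.
    def score(i):
        return rewards[i] / pulls[i] + c * math.sqrt(math.log(total_pulls) / pulls[i])
    return max(range(len(payouts)), key=score)

for t in range(1, 10001):
    i = ucb1_choice(t)
    pulls[i] += 1
    rewards[i] += 1.0 if random.random() < payouts[i] else 0.0

print(pulls)  # the best machine (index 1) should end up played far more than the others

The guarantee referred to above is that the fraction of pulls wasted on inferior machines keeps shrinking as play goes on, while no machine is ever permanently abandoned.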

If you imagine a full-width search tree with evaluations at the leaves, then the minimax path will be a single line down that tree, i.e. the best move, the best response, the best response to that response, etc.  That's what bots give as a PV, i.e. principal variation.  However, with further searching, part or all of the minimax path might change.  To use our time really efficiently, we would like our search tree to be bushy where the minimax path will probably ultimately be, and sparse where it probably won't be.  Alpha-beta pruning makes the tree 100% complete where the minimax path might be and totally empty where the minimax path can't be, but we want to prune both more and less than this.  We don't want to totally give up on lines which alpha-beta pruning may have to resurrect in a further iteration, so we keep them around to a low depth, and we also want to look deeper than alpha-beta can, so we selectively expand some nodes much deeper than others, particularly at places along our minimax path, or where we expect the minimax path to be in the future.

In order to use the solution of the gambling problem to build a tree, we have to fully expand any node we expand at all.  That is to say, the gambler must try every slot machine once before trying any one twice.  This makes sense to me.  So to start building our search tree we have to fully expand the root node and statically evaluate all the leaves.  Then the gambler will do the logical thing and play the move with the best static eval, expanding the tree to include all the responses to that best move, and doing static eval on each.  The minimum static evaluation bubbles up the tree (since it was the opponent's move).

Now if the move we thought was best no longer appears to be best, the gambler will certainly play the move that previously appeared second best, and expand that node.  But even if the previously best move still looks best, the gambler will not necessarily play it again.  He will consult his mathematically proven formula to see which is better between exploring the best move again, or exploring the second-best move.  Of course, he would never want to look at the third-best move at this point.

Let's just suppose our breadth/depth tradeoff formula tells us to play the best move again.  Then we will descend into our node which was already expanded.  Since none of the moves at this level have been expanded yet, the gambler will of course pick the one with the minimum static evaluation (opponent's move) to play, and will do a static evaluation of all responses to that move.  The maximum bubbles up (our move again), and the minimum bubbles up from there, etc.  Note that we have built a PV of depth three after expanding only one node to depth two.  We've taken depth over breadth in our depth/breadth tradeoff.  This won't always happen, of course, but it will happen when the best move at the root seems waaaay better than any other move.  In this fashion we can continue to build a search tree that is only as bushy as it needs to be, and examine to great depth if a line seems "forced".
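To make that walkthrough concrete, here is a rough Python sketch of the loop I have in mind.  The evaluate(), legal_moves(), and make_move() functions are only stand-ins for a real Arimaa evaluation and move generator (dummies here so the skeleton runs), and the "value plus uncertainty bonus" selection rule with its constant C is illustrative, not anything tuned or taken from MoGo.

Code:
import math

# Stand-ins for a real engine; a real bot replaces these.
def evaluate(pos):       # deterministic static eval (material, traps, goal threats...)
    return 0.0
def legal_moves(pos):    # move generation
    return []
def make_move(pos, m):   # apply a move to a position
    return pos

C = 50.0  # optimism constant, in the same units as the eval; purely illustrative

class Node:
    def __init__(self, pos, to_move):
        self.pos = pos
        self.to_move = to_move      # +1 if it is our turn here, -1 if the opponent's
        self.children = {}          # move -> Node, filled in when this node is expanded
        self.visits = 0
        self.value = evaluate(pos)  # static eval, later overwritten by minimax backups

def expand(node):
    for m in legal_moves(node.pos):
        node.children[m] = Node(make_move(node.pos, m), -node.to_move)

def select_child(node):
    # Best child for the side to move, plus a bonus for children examined less often.
    def optimism(child):
        bonus = C * math.sqrt(math.log(node.visits + 1) / (child.visits + 1))
        return node.to_move * child.value + bonus
    return max(node.children.values(), key=optimism)

def grow_tree(root, iterations):
    expand(root)
    for _ in range(iterations):
        # 1. Walk down the tree, always choosing the most "interesting" child.
        path = [root]
        while path[-1].children:
            path.append(select_child(path[-1]))
        # 2. Expand the leaf we reached; its children get static evals.
        expand(path[-1])
        # 3. Bubble the min/max values back up along the path we walked.
        for node in reversed(path):
            node.visits += 1
            if node.children:
                best = max if node.to_move == +1 else min
                node.value = best(c.value for c in node.children.values())

A real implementation would of course need real move generation and evaluation, some way to store the tree (or a transposition table), and a much better-considered bonus term; this is only meant to show the shape of the loop.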

I'm very excited about this possibility.  One drawback I see, however, is that I'm not sure how to handle transpositions.  Without a good way to handle transpositions, a step-based search becomes unrealistic, and one would have to use a turn-based search instead.  Furthermore I'm not sure whether "killer moves" would still be useful, i.e. whether one could somehow transfer knowledge of what worked well in one line to a different branch.  Probably other optimizations to alpha-beta searching also become useless.

On the other hand quiescence search can work as before: it's just part of the "static" eval.  Overall, it's worth thinking about, eh?

Title: Re: Stochastic tree-building with deterministic ev
Post by Janzert on Feb 16th, 2007, 9:11pm
Aww Fritz, there ya go tellin' everyone what I want to try. Well, except that I didn't actually tell you that was what I want to try. :P

The algorithm Fritz is referring to and partially describing is UCT (Upper Confidence bounds applied to Trees). The original paper describing it is at http://zaphod.aml.sztaki.hu/papers/ecml06.pdf. The Mogo homepage is http://www.lri.fr/~gelly/MoGo.htm.

To slightly expand on Fritz' great explanation: at every position two numbers are tracked, the estimated score and a confidence bound on the score. You then choose the move to explore by taking the most optimistic move, i.e. the move with the highest estimated score plus confidence bound. Once you reach a leaf node you then perform an evaluation to update the score and confidence value. In Go this has consisted of a random or semi-random playout of the game.
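Written out (up to the exact constant, which the paper derives), the selection rule is to try the move i that maximizes

\[
\bar{X}_i + C \sqrt{\frac{\ln N}{n_i}}
\]

where X̄_i is the average result of move i so far, n_i is how many times move i has been tried, N is how many times the parent position has been visited, and C is a tuning constant.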

In regards to transpositions: many (most?) of the go programs are using a transposition table in the same way you would with alpha-beta instead of building an actual tree. I don't think killer moves make sense for UCT since you aren't doing iterative deepening, or rather you're constantly doing iterative deepening and never a full-width search, i.e. you take only one path from the root to a leaf, never backing up only part of the way.

I'm not sure how well UCT will combine with standard board evaluation. Random playout gives very little information about a position, almost certainly an information theoretic bit or less, but the information it does give is always, at least in some sense, correct. But I do think it should be a pretty interesting thing to try out.

Janzert

Title: Re: Stochastic tree-building with deterministic ev
Post by JacquesB on Feb 17th, 2007, 4:41pm
Hi. I am also a Go programmer. (I say also, because I can see many known names from the computer-go list.)

Your description of UCT is very good. I have some experience with UCT. (After being for a long time among the skeptics, I have finally written my own UCT engine for research.) This subject has been discussed a great deal on the list in the last 8-10 months. Let me point out some things I think are relevant to this discussion:

1. You say " ... expanding the tree to include all the responses to that best move, ... ". This is very memory consuming in Go and even worse in Arimaa. The good news is, it is not necessary. Today, all UCT engines only expand a node when it has been visited n times. And in Go, values of n as high as 100 do not weaken the program significantly, while the memory requirements drop spectacularly. (See the computer-go list at http://computer-go.org/pipermail/computer-go/ )

2. It is very important to understand that the random playouts do not need to be uniformly random. Uniformly random playouts are not bad in the sense that they avoid the worst "vices" (being biased or systematically insisting on an error), but they produce very slow convergence. The best playout (in Go) proved by experience is pattern driven. Bear in mind that the "random" positions always follow the current position in the game. If a playout is "too smart" and produces some error (for example in Go, not defending something believed to be dead which isn't really dead), that error will be repeated systematically in all (or most) nodes. Therefore, a "stupid" playout is frequently better than a "smart" playout. I think that an unbiased random playout that does not prune any options (unless you are 100.0000% sure that you can prune them), but plays "better" options more frequently than "worse" options is the best.

3. Reasons "against" UCT (it my own post, it is short enough to copy here)

Matt Gokey wrote:

> But what are some of the reasons MC is not even better?
> 1-Since MC engines don't deal with tactics directly, they're not likely
> going to play tactical sequences well for low liberty strings, securing
> eye space, cutting and connecting, ko fights, or ladders, etc.
> 2-Also because most of the play-outs are usually nonsense, they may
> have trouble dealing with meaningful nuances because the positions that
> will lead to these distinctions just don't arise with enough statistical
> frequency in the play-outs to affect the result.  Yet when very
> selective moves are used in the play-outs, too many possibilities can be
> missed.
> 3-Finally, with 19x19 anyway, the size of the board and game tree
> probably limits the practical effectiveness of the sampling and move
> ordering. I don't try to address this last point any further in this
> message.

My own 4th reason:

As Luke Gustafson pointed out, we have to contemplate the simulation as a _stochastic process_. We want to determine the conditional probability of a win given a particular move is made. And that depends on the _length of the simulation_. Dramatically! If the simulation is 500 moves long (Chinese rules with recaptures, etc.) the observed variance at an early move blurs out everything.

Just a simple stochastic process: Count a dollar each time you correctly predict a p=1/2 coin thrown n=500 times. The expected average is (obviously) 250, but the variance of that count is n·p·(1-p) = 125, which grows proportionally to n.
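Spelled out, with S the number of correct predictions out of n = 500 tosses:

\[
E[S] = np = 500 \cdot \tfrac{1}{2} = 250, \qquad \mathrm{Var}(S) = np(1-p) = 500 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = 125, \qquad \sigma_S = \sqrt{125} \approx 11.2
\]

so the noise grows with the length of the simulation, which is exactly the problem with very long playouts.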


4. Reasons why I like it:

a. Alpha-beta is very sensitive to erroneous evaluation; in fact it seeks outliers. In contrast, UCT is very "noise-resistant" and stable.
b. UCT evaluations can be interrupted at any time and you always have a reasonable move.
c. UCT finds very solid and robust moves, and does that by searching an insignificant fraction of the nodes. It adapts very naturally to handicap games, neither overplaying nor underplaying.

One could say that when the search space grows too large for "deterministic" search (alpha-beta, df-pn, etc.), only UCT can go a step further. But just *one* step. ;-) I don't think UCT will "solve" Go, and probably not Arimaa either.

It has been a very interesting advance in Go in 2006, and the same may happen in 2007 in Arimaa. Good luck!

Jacques.

Title: Re: Stochastic tree-building with deterministic ev
Post by Fritzlein on Feb 17th, 2007, 9:17pm
Janzert, sorry to spill the beans.  I didn't know you were seriously designing a bot.  I thought you just liked to think and talk about such issues, as I do.  I'm sure I will never write my own bot, but I still find the underlying technologies fascinating.

JacquesB, thank you for your long and thoughtful response.  It is interesting to know more of what is going on in the Go programming world than my meager reading can provide.


on 02/17/07 at 16:41:42, JacquesB wrote:
1. You say " ... expanding the tree to include all the responses to that best move, ... ". This is very memory consuming in Go and even worse in Arimaa. The good news is, it is not necessary. Today, all UCT engines only expand a node when it has been visited n times.

I can see the point in not immediately expanding a node when one is using stochastic evaluation.  However, if we are really going to use deterministic evaluation, then there is no point evaluating a node twice.  No new information can be gained except by expanding the tree under the node.

I hadn't considered the memory-consumption issue.  I guess alpha-beta searching doesn't need to save the whole tree, but for our UCT-built tree we would have to save every node we expand.  Running out of memory would place a premium on better static evaluation.  A deterministic static evaluation can't do better by running twice, but it could be made smarter at the expense of speed.  The equivalent of running a stochastic evaluation 100 times might be making the deterministic evaluation 100 times slower (and hopefully correspondingly smarter).

For the sake of conserving memory one might also want to build the search tree step by step rather than move by move.  At the step level Arimaa's branching factor is probably only 20-something.


Quote:
I think that an unbiased random playout that does not prune any options (unless you are 100.0000% sure that you can prune them), but plays "better" options more frequently than "worse" options is the best.

I don't understand: in what sense is a random playout "unbiased" if it plays better options more frequently than worse ones?

I appreciate your argument about the robustness of evaluation by semi-random playout, but I remain skeptical that it can be even close to accurate for Arimaa.  Admittedly, I would have been skeptical it could be useful in Go as well, and I would have been wrong in that case, but that still doesn't persuade me for Arimaa.

Arimaa has something Go does not, and that is reasonably accurate material evaluation.  For Go it isn't easy to tell whether some stones are alive or dead, but for Arimaa a dead piece is dead, no doubt about it.  This gives a solid foundation for static evaluation in Arimaa.  I expect static evaluation will be distinctly more reliable than playouts in Arimaa, except when one or both sides have an imminent goal threat.


Quote:
As Luke Gustafson pointed out, we have to contemplate the simulation as a _stochastic process_. We want to determine the conditional probability of a win given a particular move is made. And that depends on the _length of the simulation_. Dramatically!

Yes, it's a good point that the longer the simulation, the less reliable it is.  If you are 40 moves for each player from the end of the game, with 4 steps per move, you have 320 steps to the end of the game.  (Generating random moves rather than random steps, although it cuts down the length of the simulation, is probably too costly.)  This is another reason for using deterministic evaluation rather than random playouts, except very close to the end of the game.


Quote:
a. Alpha-beta is very sensitive to erroneous evaluation; in fact it seeks outliers.

Yes, this point is well made.  We know that Bomb, our strongest program, intentionally seeks out some positions that are known to be bad.  If you angle for such a position, Bomb will actually cooperate with you to get there faster, due to its erroneous evaluation.

That said, I don't believe the robustness of random playouts can manifest in a reasonable amount of time.  A primary weakness of evaluation by playout is that a move may be good "on average", but bad in light of the opponent's best response.  Maybe in Go these moves are the exception rather than the rule, but in Arimaa they are the rule rather than the exception.

I am speaking here of rabbit advances.  The few precise moves that kill and otherwise exploit the enemy's advanced rabbits will be statistically overwhelmed by the multitude of moves that don't punish the advances at all, and indeed perhaps let the rabbits through to goal.  In Arimaa defense is stronger than offense (in the opening), but only with accurate play.  When the play is inaccurate, offense is stronger than defense.


Quote:
b. UCT evaluations can be interrupted at any time and you always have a reasonable move.

Yes, like alpha-beta searching with iterative deepening.  I love the "anytime" feature of UCT tree-building.  The part I am skeptical of is the stochastic evaluation.


Quote:
c. UCT finds very solid and robust moves, and does that by searching an insignificant fraction of the nodes. It adapts very naturally to handicap games, neither overplaying nor underplaying.

The fact that this is true for Go gives me hope that it might be true for Arimaa.  Bots have a problem that in some positions they play ridiculous moves.  If they could play a somewhat reasonable move in every position (forget about the best move) then they would probably overwhelm humans simply by failing to make tactical errors.


Quote:
I don't think UCT will "solve" Go, and probably not Arimaa either.

I quite agree.  I don't immediately see how even searching deeper in promising lines is going to produce reasonable moves.  We don't have a good way of determining which lines are worth further investigation.  (Deterministic evaluation is poor, stochastic evaluation is worse.)  At the end of the day I remain skeptical about bots' ability to play Arimaa well.  I think only that UCT is an intriguing idea for making bots play a little better than they do now.


Quote:
It has been a very interesting advance in Go in 2006, and the same may happen in 2007 in Arimaa.

I hope so!  But it will fall on Janzert to actually implement a bot of this type.  ;)

Title: Re: Stochastic tree-building with deterministic ev
Post by Fritzlein on Feb 17th, 2007, 9:26pm
I have belatedly realized that my title for this thread is incorrect.  UCT is not stochastic tree-building.  At each point the decision of what node to expand next is deterministic.  Perhaps a better description is "uneven" tree-building, although that is already done in many ways different from UCT, e.g. selective extensions.

Also, I suddenly wonder whether UCT is appropriate with a deterministic evaluation.  Yes, there is an element of uncertainty (which maps somewhat to randomness) in deterministic evaluations, but the changing evaluation is not like repeated independent trials of the same random variable.  The standard deviation of the average of a random variable decreases with 1/sqrt(n) over n trials.  Does the evaluation of a node increase in accuracy at the same speed in the number of trials?  Maybe some different breadth/depth decision function is appropriate.
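To spell out the mismatch: the confidence term in UCB rests on the fact that for n independent trials of the same random variable the uncertainty of the average shrinks like

\[
\sigma_{\bar{X}_n} = \frac{\sigma}{\sqrt{n}}
\]

whereas revisiting a node that has a deterministic evaluation does not give n independent samples of anything, so there is no obvious reason its error should shrink at that rate.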

Title: Re: Stochastic tree-building with deterministic ev
Post by jdb on Feb 17th, 2007, 9:58pm
The random player I made up a while ago used UCT. It expanded the tree via steps not turns.

Using random playouts it did OK in the endgame, but in the rest of the game it was very weak.

I tried using the static eval. One way I tried this was to run the random playout for, say, 10 steps and then apply the static eval. This way the static eval score is not deterministic. This bot played at about the level of Arimaazilla, if I recall.
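Something like the following sketch, where the helper functions are stand-ins for the real engine's step generator and evaluation (dummies here so it runs), and the 10-step cutoff is just the example figure above:

Code:
# Leaf evaluation = short random playout followed by the static eval,
# so repeated visits to the same leaf give different (noisy) scores.
# game_over, random_legal_step, make_step, and evaluate are stand-ins
# for the real engine's functions, not actual bot code.
def game_over(pos):
    return False
def random_legal_step(pos):
    return None
def make_step(pos, step):
    return pos
def evaluate(pos):
    return 0.0

def noisy_eval(pos, playout_steps=10):
    for _ in range(playout_steps):
        if game_over(pos):
            break
        pos = make_step(pos, random_legal_step(pos))
    return evaluate(pos)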


Title: Re: Stochastic tree-building with deterministic ev
Post by Fritzlein on Feb 18th, 2007, 9:35am
JDB, if I remember correctly, that bot was actually quite fearsome in the endgame.  I would be more afraid of it than of Bomb, because Bomb assumes that its opponent will make the best available move, whereas UCT-Clueless allowed for the possibility of the human cracking under pressure.  I'd probably rather face the "objectively best" move than the move which gives me the greatest opportunity to mess up my goal defense.

Do you recall exactly when UCT-Clueless was playing?  I'd love to go back and look at those games again.

Also, did you ever try relying entirely on static evaluation at the leaves with no random component?  I would be curious whether that works at all.  Maybe since you have the UCT part written, that wouldn't be too tough to test?

Title: Re: Stochastic tree-building with deterministic ev
Post by JacquesB on Feb 19th, 2007, 5:20am
I take my examples from Go, because my Arimaa knowledge is very limited (trap tactics are enough to beat the bots below 1500, and I am there for the moment). Of course, there are many things that have to be totally different. I have the impression that in an Arimaa game you play like a Roman army for a while (organized, controlling some targets: traps, material, etc.) and you finish like a Barbarian invasion ;-) For two possible reasons: either because you are winning or because you are desperate. Many games are won by the player who was behind. It seems that if you start the Barbarian invasion too early, you don't have real chances. Maybe a good Arimaa bot should play both Roman and Barbarian. It should also estimate the probability of winning the Barbarian invasion to decide when to start, probably before it is 100% clear. I don't know about Arimaa, but in Go, doing things only when they are already clear is too slow.

The paradox is that in Go UCT plays Roman, not Barbarian. It plays slow but strong and sound moves; it does not try speculative invasions or such things. In Arimaa, as you say, it's the opposite.

Now, answering your post:

1. The memory issue: It's clear, I think. Expanding all legal moves at any terminal node reached, as was done by the first UCT engines, is very memory consuming. No matter whether you build trees of moves or a transposition table of board images.

2. The bias: At each terminal node you are using a stochastic process to evaluate your chances of winning following that branch of the tree. There is a random variable (the "real" probability) which is unknown. And you have an estimator of that variable computed from your sample (the random games you have tried). It is crucial that the estimator is an unbiased estimator of that variable. Of course, you may say that the estimator defined as p-hat = # wins / (# wins + # losses) is unbiased, so what is the point? The point is that if you try "smart tricks" (in Go it is very tempting to make the playout shorter using tricks), as soon as these tricks make the estimator biased, the result is ruined. Nobody has afaik given a mathematical proof. I am talking from experience and what I have read mostly in the list and in Sylvain Gelly's paper "Modifications of UCT with Patterns in Monte-Carlo Go" (Sylvain's page is http://www.lri.fr/~gelly/Publications.htm). It is also very dangerous to prune moves. If it happens that the moves you pruned are necessary moves, the evaluation will be flawed.

3. Maybe in Arimaa, since the two players are not both seeking the same thing (territory) as in Go, but rather defending vs. attacking, it is hard to define what is unbiased. Attacking vs a random defense is too easy. But is random attack vs random defense biased? (Not in the sense that it is at 50%, but in the sense that the probability of the attack succeeding is about the same as among balanced players. Of course, that depends on the players' level.) Also, both players attack and both defend, but from a given position that does not necessarily compensate the bias. Some positions are offensive, some are not, some require defense urgently. It's complicated.

Jacques.


