Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
(Message started by: IdahoEv on Nov 7th, 2006, 1:58pm)

Title: Spooky material evaluation result
Post by IdahoEv on Nov 7th, 2006, 1:58pm
I picked up some code I started back during the summer to come up with an "optimal" material evaluation function.   The idea is to generate some formulas that might serve as good material eval functions, then measure how well they predict the ultimate winner of a game just by looking at mid-game material states.  

I'd curve-fit the coefficients of said formula with a gradient descent technique, scoring them by what fraction of the time the material eval formula correctly predicts the winner of the game.  

I was planning to report in once I had some nice solid results - comparing FAME and other existing systems to ones of my own devising.  But the results I got last night were so strange I just had to share them.  Maybe they'll turn out to be the result of a bug.

I started testing the system last night with a very naive function: all rabbits are worth 1 point.   Two coefficients determine officer value:  A is the cat:rabbit ratio, and B is the ratio of dog:cat, horse:dog, camel:horse and elephant:camel.   (The system also collapses categories when they become equivalent, e.g. cats get promoted to "dogs" if the opponent has no cats or dogs, etc.)
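
In rough terms, the formula looks like this (a Python sketch only, not my actual code; the names here are just for illustration):

```python
def material_score(counts, A, B):
    """counts = [rabbits, cats, dogs, horses, camels, elephants] after collapsing."""
    score = counts[0] * 1.0      # rabbits are fixed at 1 point each
    value = A                    # cat value = A * rabbit
    for n in counts[1:]:
        score += n * value
        value *= B               # dog = B * cat, horse = B * dog, and so on
    return score

# The side with the higher material_score is predicted to win.
# With A = B = 1 this degenerates into simply counting the pieces.
```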

I expected the best performance to be around A=1.5 B=1.5, just intuitively; each rank of pieces is worth 50% more than the last.

But remarkably, I find the best fit when A and B are both exactly 1.0 ... in other words, with this formula, the absolute best I can do is to simply count the number of pieces each side has and predict that the player with the larger number of pieces is winning.   If either of the coefficients changes by even 1%, there is a noticeable increase in the number of states for which the eval incorrectly predicts the eventual winner.   By simply counting the number of pieces and assuming the player with more pieces is winning, this evaluator can predict the outcome of the game correctly over 85% of the time.  

There's a secondary minimum in the error around B=1.4, corresponding to my intuition, but that form of the formula only correctly predicts about 82% of outcomes.   Counting the pieces definitely does slightly better.

I don't know if that's caused by a bug in my system (seems unlikely; the code is very thoroughly unit-tested and all my spot checks show it is analyzing states correctly) or some artifact of the particular games in the archive and/or the way decent players play.   Maybe because good players generally only allow captures of weaker pieces, counting the extant pieces is as good as an "actually correct" eval formula when examining the actual database of games.  

I'm only considering rated games where both players are rated at least 1600 and which progressed to a natural win by one player.    I tried excluding h-b games because so many players bait Bomb by throwing away a cat, but it didn't change the result.

I'll post again when I figure out what's going on.   But "simply counting the pieces is the best material eval you can do" was too weird a result not to share.


Title: Re: Spooky material evaluation result
Post by Janzert on Nov 7th, 2006, 3:31pm
Interesting, have you seen the old thread from earlier this year (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1145935577) discussing some similar things in regards to first capture and FAME?

It would seem that you need to get up to a FAME score of around 7 to have the same prediction success? If I'm correctly remembering the meaning of my own graph and understanding your prediction percentage correctly.

Janzert

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 7th, 2006, 6:23pm
Interesting; I've read many discussions about related topics (and started a few) but had missed that particular one for whatever reason.

The biggest difference between my system and the ones described in that thread is that mine is very careful not to include material states from the middle of trades and bloodbaths.   I figure there's no point in counting player A's advantage for having just captured a dog if player B captures a horse on the following turn.

So I only quantify material states that persist for at least 2 turns.   Also, I'm only looking at games where both players were rated 1600 or greater.  

In combination, those two criteria most likely account for the increased accuracy of the predictions.
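
Roughly, that filter looks like this (a Python sketch only; material_state() and game.positions() are stand-ins for my actual classes):

```python
def stable_material_states(game, min_turns=2):
    """Yield only material states that persist for at least min_turns consecutive turns."""
    states = [material_state(pos) for pos in game.positions()]   # one state per turn
    run_start = 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[run_start]:
            if i - run_start >= min_turns:       # this state survived any trade in progress
                yield states[run_start]
            run_start = i
```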

Title: Re: Spooky material evaluation result
Post by Janzert on Nov 7th, 2006, 6:43pm
Ahh, yes. That makes sense.

Now you got me wanting to plot out a 2d surface, percentage after having at least x score for y moves. :P

Hope you can run FAME through your exact setup sometime soon as well.

Janzert

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 7th, 2006, 9:15pm
I am first looking at alternate error functions for my gradient-descent routine.

I think the fact that it counts all mispredictions the same way is why I have that strange minimum at A=B=1.0.   Right now the formula is punished just as much for saying white is 0.5 points ahead after capturing a single rabbit (if black won) as it would be if black had lost 10 pieces and white was 350 points ahead when black won.   So I want to score formulae in proportion to how right or wrong they were and see what that does to my functions.
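
For example, one margin-weighted error I'm considering would look something like this (a Python sketch only, reusing the material_score sketch from my first post; this isn't what I've actually implemented yet):

```python
def margin_weighted_error(states, A, B):
    """states = list of (gold_counts, silver_counts, gold_won) tuples."""
    total = 0.0
    for gold, silver, gold_won in states:
        margin = material_score(gold, A, B) - material_score(silver, A, B)
        actual = 1.0 if gold_won else -1.0
        if margin * actual < 0:          # predicted the wrong winner
            total += abs(margin)         # bigger claimed advantage = bigger penalty
    return total
```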

I do intend to code FAME in at some point soon though and will tell you what I find.


Title: Re: Spooky material evaluation result
Post by seanick on Nov 7th, 2006, 11:03pm
I have a couple of theories about this. Note: I don't really think either of them comes close to explaining the phenomenon; they're more an attempt at rationalization.

1. Unlike chess, in Arimaa there is very little benefit to a sacrifice. It always means one less piece between your home row and the opponent's rabbits, so keeping your pieces is a good strategy in all cases.

2. One person who keeps all of his pieces whenever I play him (give or take 1 or 2, at most) is Fritzlein. Now, he also plays a lot of games. And wins them.
so, either:
A. Fritzlein wins a lot of games because the strategy of keeping as many pieces as possible works, or
B. the evaluation is affected by the fact that a very active and very good player happens to try to keep all of his pieces.

2 is slightly in jest, but slightly believable. 1, I am sure, has some effect, but not an 85% effect.


A question to those who process games, like IdahoEv and Fritzlein (and whoever else does so and is willing to admit it): what method do you use to process the games? Do you parse the txt files into a database first, or process them directly? I was thinking about doing some game analysis myself, but got sidetracked when the schema definition I was writing for a planned database of games started getting too complex :)

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 7th, 2006, 11:25pm
IdahoEv, do you really wonder why we missed you so much?

This is an extremely interesting result, and I will reply more thoughtfully later, when it isn't so near my bedtime.

First question: Do you let A and B fall below 1.0?  For example, might there be a minimum at A=0.8 and B=1.5?

My second question is one that you anticipated.  Any material evaluation function will say that the side with a free piece of any type is winning.  If I get a camel for nothing, then everyone predicts it as a win for me, whether the camel is worth one rabbit or five.  Counting who is right or wrong provides very little discrimination.  How often will "count the pieces" disagree with FAME about who is winning?  Maybe only 2% of positions are differentiated, so the relative merits of the two systems on the other 98% of positions are ignored by your metric.  Thus you may not find "what is the relative value of pieces" so much as "how can you tell who is winning in positions that are almost dead even".

We had a similar issue when designing the WC prediction contest.  If you are only allowed to bet for or against someone, then there is no discrimination between someone who thinks robinson is 55% to win and someone else who thinks robinson is 95% to win.   Both of them bet on robinson, and indeed, both should bet the maximum possible dollar amount on robinson to maximize their expected return.

I recall from a previous thread that an initial capture of a camel resulted in higher win percentage for the capturing side than an initial capture of a rabbit, so I imagine the data must somehow support the value of heavy pieces versus rabbits, even if it doesn't prove cats are worth more than rabbits.  (Which brings me back to allowing A < 1)

I would be interested to see the optimal values when the material advantage is converted into a winning percentage, and then scored by square error, just like in the prediction contest.  So if the material function predicted a 55% chance of winning, it wouldn't get much for being right or be penalized much for being wrong.

One way to convert a linear material advantage M into a percentage prediction is to take 1/(1+exp(-M/k)), where k is something like 4.  Unfortunately, since k must now be allowed to vary as well, that gives your model three parameters instead of two.  On the bright side, when you find the optimum for the three-parameter system, you will also have a percentage prediction for a given material imbalance, e.g. "a free horse gives a 65% chance of winning" rather than merely "a free horse is worth 2.347 rabbits".
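
In other words, something like this (a Python sketch of what I have in mind; the k=4 is just a guess):

```python
import math

def win_probability(M, k=4.0):
    """Convert a linear material advantage M into a predicted winning chance."""
    return 1.0 / (1.0 + math.exp(-M / k))

def prediction_penalty(M, gold_won, k=4.0):
    """Square-error penalty for one position, scored like the prediction contest."""
    outcome = 1.0 if gold_won else 0.0
    return (outcome - win_probability(M, k)) ** 2
```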

Thanks again for your fascinating commentary and analysis.

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 8th, 2006, 12:35am
After I went to bed, I thought of a likely bug in your code that would produce the spooky result.  I tried to go to sleep anyway, but I just couldn't relax.

I'll bet that when the evaluation sees the material as equal, you let it get away with not making any prediction at all.  Thus the more often the function can say a position is even, the less often it has to predict, and the more often it is right among the predictions it does make.

Suppose there is an initial trade of a horse for a rabbit.  A reasonable material function picks the side up a horse to win, and is right, say, 65% of the time.  This is pretty good, much better than saying the position is equal.  But the "count the pieces" algorithm, in contrast, makes no prediction whatsoever, and therefore doesn't bring down its 85% average of being right in its predictions.

If this is in fact the bug, it makes sense that the secondary minimum was with A=1, because that at least allows the function to say cats and rabbits are equal, avoiding predictions when there has been a cat-for-rabbit trade.

To fix this problem, force every function to pick a winner in every position where the pieces left on each side are not identical.  If each function has to make the same number of predictions, then A=B=1 will not get the most predictions correct, unless I miss my guess.
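
I suspect the scoring loop effectively does something like this (a hypothetical Python sketch; I haven't seen the actual code):

```python
def prediction_accuracy(states, score, force_pick=False):
    """states = (gold_counts, silver_counts, gold_won) for positions with unequal material."""
    right = wrong = 0.0
    for gold, silver, gold_won in states:
        margin = score(gold) - score(silver)
        if margin == 0:
            if not force_pick:
                continue           # silently skipping "even" calls inflates the average
            right += 0.5           # forced to guess: right about half the time
            wrong += 0.5
        elif (margin > 0) == gold_won:
            right += 1
        else:
            wrong += 1
    return right / (right + wrong)
```

With force_pick left off, A=B=1 gets to sit out every horse-for-rabbit trade; with force_pick on, it has to guess, and its 85% should come back down to earth.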

Phew, now maybe I can sleep.

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 8th, 2006, 2:25am
Well I'll be d**ned, Karl.   That's a very good point:  my error function (as is) only punishes wrong predictions.   Failures to predict anything (i.e. scores within some small delta of zero) don't generate any contribution to the error function.

This squares with another result I had tonight.   The "best score" of A=B=1 was found by hand; I hadn't coded the gradient descent algorithm yet.  

When I actually coded the gradient descent function today, I found that the well in the error function at A=B=1 was so incredibly narrow that the optimizer couldn't find it no matter what my annealing parameters were; I had only found it because I tried those values by hand.  In fact, it pretty reliably settled on a minimum where A ~= 1.1 and B ~= 1.35, values nicely matching my intuition of the piece values.

In fact, I am now willing to hazard that the width of that fitness well is the same as the delta around 0 within which I consider the score to actually be zero.

Fritzlein, I think you just figured out what was wrong with my code, without ever looking at my code.   How am I supposed to compete with a brain like that come Saturday?    ;)

(Incidentally my attempts at a different error function tonight were dramatic failures.)  

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 8th, 2006, 3:21am

on 11/07/06 at 23:25:37, Fritzlein wrote:
First question: Do you let A and B fall below 1.0?  For example, might there be a minimum at A=0.8 and B=1.5?


There are no constraints on the coefficients whatsoever; they can even go negative.   If cats are worth less than rabbits, the optimizer should find it.


Quote:
I would be interested to see the optimal values when the material advantage is converted into a winning percentage,  ...  One way ... is to take 1/(1+exp(-M/k)), where k is something like 4.


Yes, I've considered using a sigmoid to convert the score into a probability, much like a typical NN cell.   The only problem is that I have quantified something like 11,000 different material states that have existed, but < 200 of them have occurred more than once and < 50 of them have occurred more than 10 times.   So "probability" becomes a pretty sketchy question for most of the dataset.


Quote:
Unfortunately, since k must now be allowed to vary as well, that gives your model three parameters instead of two.


That's no worry.   The two-parameter system was just for testing things out.   I have formulae in mind I want to try with up to 5 coefficients.   In particular I think considering all 8 rabbits to be of the same value is highly unlikely to be the best solution and some of the formulae take this into account at some level.   Also, it seems unlikely that a material state with 0 or 1 rabbits should score well even if all the other pieces are still in place.

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 8th, 2006, 8:31am

on 11/08/06 at 02:25:46, IdahoEv wrote:
In fact, it pretty reliably settled on a minimum where A ~= 1.1 and B ~= 1.35, values nicely matching my intuition of the piece values.

Well, that's less spooky than A=B=1, but still a bit of a slap against my proposed static values.  I have said

R=1
C=1.5
D=2
H=3
M=5

but this suggests

R=1
C=1.1
D=1.5
H=2
M=2.7

It would re-write Arimaa theory once again if an initial camel for three rabbits trade favored the side with three rabbits.

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 8th, 2006, 8:36am

on 11/08/06 at 03:21:06, IdahoEv wrote:
Yes, I've considered using a sigmoid to convert the score into a probability, much like a typical NN cell.   The only problem is that I have quantified something like 11,000 different material states that have existed, but < 200 of them have occurred more than once and < 50 of them have occurred more than 10 times.   So "probability" becomes a pretty sketchy question for most of the dataset.

You don't have to compare a probability to a probability.  You can score each prediction individually, even if it is a one-of-a-kind situation.

If you predict 55% and you are right, the penalty is (1-0.55)^2 = 0.2025 for that single prediction.  If you predict 55% and you are wrong, the penalty is (0-0.55)^2 = 0.3025 for that single prediction.  You add up all the penalties for each system without ever having to calculate, "how often am I right in this exact situation".
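
For example (a small Python illustration of the bookkeeping, with made-up numbers):

```python
# Each prediction is (predicted probability, whether that outcome actually happened).
predictions = [(0.55, True), (0.55, False), (0.80, True)]

total_penalty = sum(((1.0 if happened else 0.0) - p) ** 2
                    for p, happened in predictions)
# 0.2025 + 0.3025 + 0.04 = 0.545; the lower the total, the better the system.
```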

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 8th, 2006, 1:48pm

on 11/08/06 at 08:31:40, Fritzlein wrote:
It would re-write Arimaa theory once again if an initial camel for three rabbits trade favored the side with three rabbits.


I'm going to go back and rework my material state collapse function before I go forward.   I'm using the one I proposed last summer (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1146199776) that collapses levels when entire levels are missing for some players; for example, EMHRR vs. edccrrr is expressed as 120002-103003, with the MH collapsed into a single level for white and the dcc collapsed into a single level for black, because the pieces are functionally equivalent.

However, expressed the way it is, the dcc have been collapsed *up* two levels, and this two-parameter system would value them at A*B*B each, the normal value for horses. But is that state really more like an elephant and two camels vs. an elephant and three horses, or is it more like a horse and two dogs vs. a horse and three cats?   (These are *NOT* equivalent, because the value of rabbits is fixed!)  Intuition tells me the latter.

If there are enough such states in the dataset,  collapsing upward would probably tend to drive B somewhat lower to avoid overvaluing the small pieces relative to the rabbits when fitting the data.  

So instead I want to try collapsing downward, expressing that same state as 001202-001033.  The relative value of the officers is preserved, but the rabbits themselves become relatively more valuable in the endgame, which is intuitively satisfying for me.

Yes this would mean the elephant is less valuable in an absolute sense on such an emptied board, but I suspect that's actually accurate.   If the downward-collapse allows the function to fit the data as well or better, then I will go with it and start comparing the two-parameter system to other formulae.
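
For the curious, here is roughly how I picture the downward collapse working (a Python sketch of my interpretation, not my actual code; counts are indexed R, C, D, H, M, E = 0..5):

```python
def collapse_down(gold, silver):
    # Group the officer ranks (1..5) into functional levels: a run of adjacent
    # occupied ranks merges only while every piece in the run belongs to one side.
    groups = []          # (gold_count, silver_count, both_sides_present), weakest first
    for rank in range(1, 6):
        g, s = gold[rank], silver[rank]
        if g == 0 and s == 0:
            continue
        both = g > 0 and s > 0
        if groups and not both and not groups[-1][2]:
            pg, ps, _ = groups[-1]
            if (s == 0 and ps == 0) or (g == 0 and pg == 0):
                groups[-1] = (pg + g, ps + s, False)   # extend a single-side run
                continue
        groups.append((g, s, both))
    # Re-number the functional levels compactly from the bottom; rabbits stay at level 0.
    new_gold = [gold[0], 0, 0, 0, 0, 0]
    new_silver = [silver[0], 0, 0, 0, 0, 0]
    for level, (g, s, _) in enumerate(groups, start=1):
        new_gold[level], new_silver[level] = g, s
    return new_gold, new_silver

# EMHRR vs. edccrrr:  collapse_down([2,0,0,1,1,1], [3,2,1,0,0,1])
# gives ([2,0,2,1,0,0], [3,3,0,1,0,0]), i.e. 001202-001033 as above.
```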


Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 8th, 2006, 2:24pm

on 11/08/06 at 08:36:16, Fritzlein wrote:
You don't have to compare a probability to a probability.  You can score each prediction individually, even if it is a one-of-a-kind situation.


This is true.   The other drawback of the k-sigmoid probability formula, however, is that filtering the score through that function will mask the "true meaning" of the other coefficients for humans who want to make sense of the resulting values.

Title: Re: Spooky material evaluation result
Post by RonWeasley on Nov 8th, 2006, 2:28pm
Collapsing down also, in your example, reduces the material expression to four significant figures.  This may make calculations more intuitive since it reflects the fact that there are indeed exactly four distinct piece levels remaining in the game.  To me this is an indicator of a promising approach.

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 8th, 2006, 7:51pm

on 11/07/06 at 23:03:21, seanick wrote:
question to those who process games like IdahoEv and Fritzlein (and whoever else does that is willing to admit it): what method do you use to process the games? do you parse the txt files into a database first, or process them directly?

I import the text files to Microsoft Access, and then do queries.  Only rarely do I write a bit of VB code to process the fields.  Janzert and IdahoEv clearly do more sophisticated processing of the move data; I feel rather like a dilettante in comparison.

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 8th, 2006, 7:52pm
As it turns out, experimentation today demonstrates that collapsing down vs. collapsing up makes very little difference, either in the ability of the optimizer to fit the coefficients or in the coefficients that are actually derived.

So, since collapsing downward feels intuitively better to me, I will use it from now on.  

The next batch of results look to be interesting; I will create a separate, appropriately-named thread for them.

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 8th, 2006, 7:59pm

on 11/08/06 at 13:48:03, IdahoEv wrote:
So instead I want it try collapsing downward, expressing that same state as 001202-001033.  The relative value of the officers is preserved, but the rabbits themselves become relatively more valuable in the endgame, which is intuitively satisfying for me.

My intuition agrees with yours and Ron's, that collapsing states downward is likely to work better than collapsing them upward.  Elephants are less valuable in the endgame than in the opening, although that is beside the point since they are so rarely lost.  More to the point is that camels and horses get collapsed down to a lower value relative to rabbits.

After all the (justified) criticisms of FAME, I am convinced that something better can be done.  In fact, I've toyed with an idea or two for bettering FAME, but haven't done the work to fine tune it.  I hope that you come up with a material function demonstrably better than FAME by any reasonable metric.  Not only will that advance the state of the art, it will also motivate me to get off my butt and work further in this area myself.

Title: Re: Spooky material evaluation result
Post by IdahoEv on Nov 8th, 2006, 8:49pm

on 11/08/06 at 19:51:56, Fritzlein wrote:
I import the text files to Microsoft Access, and then do queries.  Only rarely do I write a bit of VB code to process the fields.  Janzert and IdahoEv clearly do more sophisticated processing of the move data; I feel rather like a dilettante in comparison.


Apologies, I missed Seanick's question until Fritzlein answered it.

On my Mac, I maintain a mirror of Omar's game database in MySQL. I have a PHP script that queries the Arimaa server for any new games, downloads them, and inserts them into my DB.  (Anyone who would like a copy of that script, let me know and I can email it to you.)  I run this script as a cron job at 3:00am to keep my database updated.

Then I have a  library of Java classes I've written that can query this DB,  step through games, model the board and material states, and display the information in various ways.   My display tool isn't quite as sophisticated as the one on the site (for playing back games), but it's not far off and it displays a bunch of information useful for programming (like hex maps of the board, current collapsed material state, etc.)

Ultimately these classes will be used to develop a bot or bots, hopefully.  Right now I'm applying them to simpler problems first.    

Title: Re: Spooky material evaluation result
Post by jdb on Nov 8th, 2006, 9:40pm
IdahoEv,

If you (or anyone else for that matter) want a Java implementation of FAME, or of hostage recognition, frame recognition, or elephant blockades, just let me know and I'll let you have it.

I find this thread very interesting, keep up the good work.

Title: Re: Spooky material evaluation result
Post by Janzert on Nov 8th, 2006, 10:29pm

on 11/07/06 at 23:03:21, seanick wrote:
what method do you use to process the games?


I use some python (http://www.python.org/) scripts for parsing and analysing the games, with sqlite (http://www.sqlite.org/) serving as the database storage.

The gamegrapher is a turbogears (http://www.turbogears.org/) app and uses postgresql (http://www.postgresql.org/) for the database.

A script runs every 4 hours and updates the sqlite database, then pulls the relevant information back out and updates the gamegrapher db.

More simply, it's a hacked-together mess of python that I'd be embarrassed to actually let anyone see. :) The FAME implementation isn't too nasty to look at, though, and it took a few attempts to get correct. So if anyone wants that, feel free to ask, or take the javascript one out of the FAME calculator (http://arimaa.janzert.com/fame.html).

Janzert

Title: Re: Spooky material evaluation result
Post by Fritzlein on Nov 8th, 2006, 11:25pm

on 11/08/06 at 02:25:46, IdahoEv wrote:
Fritzlein, I think you just figured out what was wrong with my code, without ever looking at my code.   How am I supposed to compete with a brain like that come Saturday?    ;)

Ha!  As the above posts amply demonstrate, I would be in big trouble if Arimaa skill were a function of programming prowess.  It's clear to me that purposeful practice and study are key to doing well at Arimaa, as for most other things.  (And it's clear that chessandgo's advantage in recent practice makes him the most dangerous player in the World Championship.) Thanks for the compliment, though, and thanks for sharing your algorithms.  I like to think about algorithms even when I lack the skills to implement them myself.


