Arimaa Forum « Spooky material evaluation result »

IdahoEv
Spooky material evaluation result
« on: Nov 7th, 2006, 1:58pm »

I picked up some code I started back during the summer to come up with an "optimal" material evaluation function.   The idea is to generate some formulas that might serve as good material eval functions, then measure how well they predict the ultimate winner of a game just by looking at mid-game material states.  
 
I'd curve-fit the coefficients of said formula with a gradient descent technique, scoring them by what fraction of the time the material eval formula correctly predicts the winner of the game.  
 
I was planning to report in once I had some nice solid results - comparing FAME and other existing systems to ones of my own devising.  But the results I got last night were so strange I just had to share them.  Maybe they'll turn out to be the result of a bug.
 
I started testing the system last night with a very naive function: all rabbits are worth 1 point.   Two coefficients determine officer value:  A is the ratio of cat:rabbit, and B is the ratio of dog:cat,  horse:dog, camel:horse and elephant:camel.   (the system also collapses categories when they become equivalent, i.e. cats get promoted to "dogs" if the opponent has no cats or dogs, etc.)
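For illustration, the two-parameter formula amounts to something like the following Python sketch (the names are made up for this post, and the category-collapse step is left out):

Code:
# Illustrative sketch only: the two-parameter material formula, without level collapse.
# counts = (elephants, camels, horses, dogs, cats, rabbits) for one side.
def material_value(counts, A, B):
    e, m, h, d, c, r = counts
    cat = A                # a cat is worth A rabbits
    dog = cat * B          # each higher rank is worth B times the rank below
    horse = dog * B
    camel = horse * B
    elephant = camel * B
    return r + c * cat + d * dog + h * horse + m * camel + e * elephant

def material_advantage(gold_counts, silver_counts, A, B):
    # Positive means the formula thinks Gold is winning, negative means Silver.
    return material_value(gold_counts, A, B) - material_value(silver_counts, A, B)

# With A = B = 1 every piece is worth 1, so this reduces to just counting pieces:
# material_value((1, 1, 2, 2, 2, 8), 1.0, 1.0) == 16.0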
 
I expected the best performance to be around A=1.5 B=1.5, just intuitively; each rank of pieces is worth 50% more than the last.
 
But remarkably, I find the best fit when A and B are both exactly 1.0 ... in other words, with this formula, the absolute best I can do is to simply count the number of pieces each side has and predict that the player with the largest number of pieces is winning.   If either of the coefficients changes by even 1%, there is a noticeable increase in the number of states for which the eval incorrectly predicts the eventual winner.   By simply counting the number of pieces and assuming the player with more pieces is winning, this evaluator can predict the outcome of the game correctly over 85% of the time.
 
There's a secondary minimum in the error around where B=1.4, corresponding to my intuition, but that form of the formula only correctly predicts about 82% of outcomes.   Counting the pieces definitely does slightly better.
 
I don't know if that's caused by a bug in my system (seems unlikely; the code is very thoroughly unit-tested and all my spot checks show it is analyzing states correctly) or some artifact of the particular games in the archive and/or the way decent players play.   Maybe because good players generally only allow captures of weaker pieces, counting the extant pieces is as good as an "actually correct" eval formula when examining the actual database of games.
 
I'm only considering rated games where both players are rated at least 1600 and which progressed to a natural win by one player.    I tried excluding human-vs-bot games because so many players bait Bomb by throwing away a cat, but it didn't change the result.
 
I'll post again when I figure out what's going on.   But "simply counting the pieces is the best material eval you can do" was too weird a result not to share.
 

Janzert
Re: Spooky material evaluation result
« Reply #1 on: Nov 7th, 2006, 3:31pm »

Interesting; have you seen the old thread from earlier this year discussing some similar things regarding first capture and FAME?
 
It would seem that you need to get up to a FAME score of around 7 to have the same prediction success, if I'm remembering the meaning of my own graph and understanding your prediction percentage correctly.
 
Janzert

IdahoEv
Re: Spooky material evaluation result
« Reply #2 on: Nov 7th, 2006, 6:23pm »

Interesting; I've read many discussions about related topics (and started a few) but had missed that particular one for whatever reason.
 
The biggest difference between my system and the ones described in that thread is that mine is very careful not to include material states from the middle of trades and bloodbaths.   I figure there's no point in counting player A's advantage for having just captured a dog if player B captures a horse on the following turn.
 
So I only quantify material states that persist for at least 2 turns.   Also, I'm only looking at games where both players were rated 1600 or greater.    
 
In combination, those two criteria most likely account for the increased accuracy of the predictions.
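
In rough Python terms, the persistence filter is just this kind of thing (a sketch with invented names, not my actual code):

Code:
# Sketch: keep only material states that persist for at least min_turns turns,
# so positions from the middle of a trade are never counted.
def stable_states(material_by_turn, min_turns=2):
    # material_by_turn: one material-state tuple per turn, in game order.
    kept = []
    run_start = 0
    for i in range(1, len(material_by_turn) + 1):
        if i == len(material_by_turn) or material_by_turn[i] != material_by_turn[run_start]:
            if i - run_start >= min_turns:
                kept.append(material_by_turn[run_start])
            run_start = i
    return kept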

Janzert
Re: Spooky material evaluation result
« Reply #3 on: Nov 7th, 2006, 6:43pm »

Ahh, yes. That makes sense.
 
Now you've got me wanting to plot out a 2D surface: percentage after having at least x score for y moves. :P
 
Hope you can run FAME through your exact setup sometime soon as well.
 
Janzert

IdahoEv
Re: Spooky material evaluation result
« Reply #4 on: Nov 7th, 2006, 9:15pm »

I am first looking at alternate error functions for my gradient-descent routine.
 
I think the fact that it counts all mispredictions the same way is why I have that strange minimum at A=B=1.0.   Right now the formula is punished just as much for saying white is 0.5 points ahead after capturing a single rabbit (if black won) as it would be if black had lost 10 pieces and white was 350 points ahead when black won.   So I want to score formulae in proportion to how right or wrong they were and see what that does to my functions.
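
Something along these lines, in sketch form (names invented here; the scale of the penalty is arbitrary):

Code:
# Sketch of a margin-weighted error: a wrong call costs more the bigger the
# claimed material advantage was, instead of every misprediction costing 1.
def margin_error(score, gold_won):
    # score: material advantage for Gold under the current coefficients.
    predicted_gold = score > 0
    return abs(score) if predicted_gold != gold_won else 0.0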
 
I do intend to code FAME in at some point soon though and will tell you what I find.
 

seanick
Re: Spooky material evaluation result
« Reply #5 on: Nov 7th, 2006, 11:03pm »

I have a couple of theories about this. Note, I don't really think either of them even comes close to explaining the phenomenon; they're more an attempt at rationalization.
 
1. Unlike chess, in Arimaa there is very little benefit to a sacrifice. It always means one less piece between your home row and the opponent's rabbits, so keeping your pieces is a good strategy in all cases.
 
2. One person who keeps all of his pieces whenever I play him (+/- 1 or 2, at most) is Fritzlein. Now, he also plays a lot of games. And wins them.
so, either:
A. Fritzlein wins a lot of games because the strategy of keeping as many pieces as possible works, or
B. the evaluation is affected by the fact that a very active and very good player happens to try to keep all of his pieces.  
 
2 is slightly in jest, but slightly believable. 1, I am sure, has some effect, but not an 85% effect.
 
 
Question to those who process games like IdahoEv and Fritzlein (and whoever else does that and is willing to admit it): what method do you use to process the games? Do you parse the txt files into a database first, or process them directly? I was thinking about doing some game analysis myself, but got sidetracked when the schema definition I was writing for a planned database of games started getting too complex. :)

Fritzlein
Re: Spooky material evaluation result
« Reply #6 on: Nov 7th, 2006, 11:25pm »

IdahoEv, do you really wonder why we missed you so much?
 
This is an extremely interesting result, and I will reply more thoughtfully later, when it isn't so near my bedtime.
 
First question: Do you let A and B fall below 1.0?  For example, might there be a minimum at A=0.8 and B=1.5?
 
My second question is one that you anticipated.  Any material evaluation function will say that the side with a free piece of any type is winning.  If I get a camel for nothing, then everyone predicts it as a win for me, whether the camel is worth one rabbit or five.  Counting who is wrong/right provides very little discrimination.  How often will "count the pieces" disagree with FAME about who is winning?  Maybe only 2% of positions are differentiated, so the relative merits of the two systems on the other 98% of positions are ignored by your metric.  Thus you may not find "what is the relative value of pieces" so much as "how can you tell who is winning in positions that are almost dead even".
 
We had a similar issue when designing the WC prediction contest.  If you are only allowed to bet for or against someone, then there is no discrimination between someone who thinks robinson is 55% to win and someone else who thinks robinson is 95% to win.   Both of them bet on robinson, and indeed, both should bet the maximum possible dollar amount on robinson to maximize their expected return.
 
I recall from a previous thread that an initial capture of a camel resulted in higher win percentage for the capturing side than an initial capture of a rabbit, so I imagine the data must somehow support the value of heavy pieces versus rabbits, even if it doesn't prove cats are worth more than rabbits.  (Which brings me back to allowing A < 1)
 
I would be interested to see the optimal values when the material advantage is converted into a winning percentage, and then scored by square error, just like in the prediction contest.  So if the material function predicted a 55% chance of winning, it wouldn't get much for being right or be penalized much for being wrong.
 
One way to convert a linear material advantage M into a percentage prediction is to take 1/(1+exp(-M/k)), where k is something like 4.  Unfortunately, since k must now be allowed to vary as well, that gives your model three parameters instead of two.  On the bright side, when you find the optimum for the three-parameter system, you will also have a percentage prediction for a given material imbalance, e.g. "a free horse gives a 65% chance of winning" rather than merely "a free horse is worth 2.347 rabbits".
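
As a sketch of what I mean (the function names and k = 4 are only illustrative):

Code:
import math

def win_probability(material_advantage, k=4.0):
    # Convert a linear material advantage M into a winning chance: 1/(1+exp(-M/k)).
    return 1.0 / (1.0 + math.exp(-material_advantage / k))

def squared_error(predicted_prob, gold_won):
    # Score one prediction the way the prediction contest does:
    # (outcome - prediction)^2, where outcome is 1 if Gold won and 0 otherwise.
    outcome = 1.0 if gold_won else 0.0
    return (outcome - predicted_prob) ** 2

# A prediction of 55% costs (1 - 0.55)^2 = 0.2025 when right and
# (0 - 0.55)^2 = 0.3025 when wrong, so near-even calls are penalized gently.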
 
Thanks again for your fascinating commentary and analysis.

Fritzlein
Re: Spooky material evaluation result
« Reply #7 on: Nov 8th, 2006, 12:35am »

After I went to bed, I thought of a likely bug in your code that would produce the spooky result.  I tried to go to sleep anyway, but I just couldn't relax.
 
I'll bet that when the evaluation sees the material as equal, you let it get away with not making any prediction at all.  Thus the more often the function can say a position is even, the less often it has to predict, and the more often it is right among the predictions it does make.
 
Suppose there is an initial trade of a horse for a rabbit.  A reasonable material function picks the side that is up a horse to win, and is right, say, 65% of the time.  This is pretty good, much better than saying the position is equal.  But the "count the pieces" algorithm, in contrast, makes no prediction whatsoever, and therefore doesn't bring down its 85% average of being right in its predictions.
 
If this is in fact the bug, it makes sense that the secondary minimum was with A=1, because that at least allows the function to say cats and rabbits are equal, avoiding predictions when there has been a cat-for-rabbit trade.
 
To fix this problem, force every function to pick a winner in every position where the pieces left on each side are not identical.  If each function has to make the same number of predictions, then A=B=1 will not get the most predictions correct, unless I miss my guess.
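
In sketch form, the scoring would look something like this (invented names; the point is only that every formula is graded on the same set of positions):

Code:
# Sketch: every formula must make a call whenever the two sides' pieces differ,
# so "saying the position is even" no longer lets a formula skip hard positions.
def prediction_accuracy(states, evaluate):
    # states: list of (gold_counts, silver_counts, gold_won) triples.
    predictions, correct = 0, 0
    for gold, silver, gold_won in states:
        if gold == silver:
            continue                      # literally identical material: nobody predicts
        score = evaluate(gold) - evaluate(silver)
        predicted_gold = score > 0        # a score of exactly 0 still forces a call
        predictions += 1
        correct += predicted_gold == gold_won
    return correct / predictions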
 
Phew, now maybe I can sleep.

IdahoEv
Re: Spooky material evaluation result
« Reply #8 on: Nov 8th, 2006, 2:25am »

Well I'll be d**ned, Karl.   That's a very good point:  my error function (as is) only punishes wrong predictions.   Failures to predict anything (i.e. scores within some small delta of zero) don't generate any contribution to the error function.
 
This fits with another result I had tonight.   The "best score" of A=B=1 was found by hand; I hadn't coded the gradient descent algorithm yet.
 
When I actually coded the gradient descent function today, I found that the well in the error function at A=B=1 was so incredibly narrow that the optimizer couldn't find it no matter what annealing parameters I used; I had only found it because I tried those values by hand.  In fact, it pretty reliably settled on a minimum where A ~= 1.1 and B ~= 1.35, values nicely matching my intuition of the piece values.
 
In fact, I am willing to hazard that the width of that fitness well is the same as the delta I allow around 0, within which I consider the score to actually be zero.
 
Fritzlein, I think you just figured out what was wrong with my code, without ever looking at my code.   How am I supposed to compete with a brain like that come Saturday?    ;)
 
(Incidentally my attempts at a different error function tonight were dramatic failures.)

IdahoEv
Re: Spooky material evaluation result
« Reply #9 on: Nov 8th, 2006, 3:21am »

on Nov 7th, 2006, 11:25pm, Fritzlein wrote:
First question: Do you let A and B fall below 1.0?  For example, might there be a minimum at A=0.8 and B=1.5?

 
There are no constraints on the coefficients whatsoever; they can even go negative.   If cats are worth less than rabbits, the optimizer should find it.
 
Quote:
I would be interested to see the optimal values when the material advantage is converted into a winning percentage,  ...  One way ... is to take 1/(1+exp(-M/k)), where k is something like 4.

 
Yes, I've considered using a sigmoid to convert the score into a probability, much like a typical NN cell.   The only problem is that I have quantified something like 11,000 different material states that have existed, but < 200 of them have occurred more than once and < 50 of them have occurred more than 10 times.   So "probability" becomes a pretty sketchy question for most of the dataset.
 
Quote:
Unfortunately, since k must now be allowed to vary as well, that gives your model three parameters instead of two.

 
That's no worry.   The two-parameter system was just for testing things out.   I have formulae in mind I want to try with up to 5 coefficients.   In particular I think considering all 8 rabbits to be of the same value is highly unlikely to be the best solution and some of the formulae take this into account at some level.   Also, it seems unlikely that a material state with 0 or 1 rabbits should score well even if all the other pieces are still in place.

Fritzlein
Re: Spooky material evaluation result
« Reply #10 on: Nov 8th, 2006, 8:31am »

on Nov 8th, 2006, 2:25am, IdahoEv wrote:
In fact, it pretty reliably settled on a minimum where A ~= 1.1 and B ~= 1.35, values nicely matching my intuition of the piece values.

Well, that's less spooky than A=B=1, but still a bit of a slap against my proposed static values.  I have said
 
R=1
C=1.5
D=2
H=3
M=5
 
but this suggests
 
R=1
C=1.1
D=1.5
H=2
M=2.7
 
It would re-write Arimaa theory once again if an initial camel for three rabbits trade favored the side with three rabbits.
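
For what it's worth, those rounded numbers are just A = 1.1 and B = 1.35 chained together, which also makes the camel-versus-three-rabbits comparison explicit:

Code:
# Piece values implied by A = 1.1, B = 1.35, with the rabbit fixed at 1.
A, B = 1.1, 1.35
cat   = A            # 1.10
dog   = cat * B      # ~1.49
horse = dog * B      # ~2.00
camel = horse * B    # ~2.71
# camel (~2.7) < three rabbits (3.0), whereas my static values say M = 5 > 3.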

Fritzlein
Re: Spooky material evaluation result
« Reply #11 on: Nov 8th, 2006, 8:36am »

on Nov 8th, 2006, 3:21am, IdahoEv wrote:
Yes, I've considered using a sigmoid to convert the score into a probability, much like a typical NN cell.   The only problem is that I have quantified something like 11,000 different material states that have existed, but < 200 of them have occurred more than once and < 50 of them have occurred more than 10 times.   So "probability" becomes a pretty sketchy question for most of the dataset.

You don't have to compare a probability to a probability.  You can score each prediction individually, even if it is a one-of-a-kind situation.
 
If you predict 55% and you are right, the penalty is (1-0.55)^2 = 0.2025 for that single prediction.  If you predict 55% and you are wrong, the penalty is (0-0.55)^2 = 0.3025 for that single prediction.  You add up all the penalties for each system without ever having to calculate, "how often am I right in this exact situation".

IdahoEv
Re: Spooky material evaluation result
« Reply #12 on: Nov 8th, 2006, 1:48pm »

on Nov 8th, 2006, 8:31am, Fritzlein wrote:

It would re-write Arimaa theory once again if an initial camel for three rabbits trade favored the side with three rabbits.

 
I'm going to go back and rework my material state collapse function before I go forward.   I'm using the one I proposed last summer that collapses levels when entire levels are missing for some players, for example EMHRRedccrrr is expressed as  120002-103003, with the MH collapsed into a single level for white and the dcc collapsed into a single level for black because the pieces are functionally equivalent.
 
However, expressed the way it is, the dcc have been collapsed *up* two levels, and this two-parameter system would value them at A*B*B each - the normal value for horses. But is that state really more like an elephant and two camels vs. an elephant and three horses, or is it more like a horse and two dogs vs. a horse and three cats?   (These are *NOT* equivalent, because the value of rabbits is fixed!)  Intuition tells me the latter.
 
If there are enough such states in the dataset,  collapsing upward would probably tend to drive B somewhat lower to avoid overvaluing the small pieces relative to the rabbits when fitting the data.  
 
So instead I want to try collapsing downward, expressing that same state as 001202-001033.  The relative value of the officers is preserved, but the rabbits themselves become relatively more valuable in the endgame, which is intuitively satisfying for me.
 
Yes, this would mean the elephant is less valuable in an absolute sense on such an emptied board, but I suspect that's actually accurate.   If the downward collapse allows the function to fit the data as well or better, then I will go with it and start comparing the two-parameter system to other formulae.
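
To make the comparison concrete, here is a rough sketch of how that example state scores under the two collapse directions with the two-parameter formula (the digit order E M H D C R is my own convention for this post):

Code:
# Sketch: score EMHRR vs. edccrrr under both collapse directions.
# counts = (E, M, H, D, C, R); rabbit = 1, cat = A, each level above is B times the one below.
def value(counts, A, B):
    weights = (A * B**4, A * B**3, A * B**2, A * B, A, 1.0)
    return sum(n * w for n, w in zip(counts, weights))

A, B = 1.1, 1.35

# Collapsed upward, 120002-103003: MH counted as camels, dcc counted as horses.
up = value((1, 2, 0, 0, 0, 2), A, B) - value((1, 0, 3, 0, 0, 3), A, B)

# Collapsed downward, 001202-001033: E counted as a horse, MH as dogs, dcc as cats.
down = value((0, 0, 1, 2, 0, 2), A, B) - value((0, 0, 1, 0, 3, 3), A, B)

# The officer ratios are identical either way, but the fixed-value rabbits carry
# a larger share of the total in the downward encoding.
print(up, down)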
 

IdahoEv
Re: Spooky material evaluation result
« Reply #13 on: Nov 8th, 2006, 2:24pm »

on Nov 8th, 2006, 8:36am, Fritzlein wrote:

You don't have to compare a probability to a probability.  You can score each prediction individually, even if it is a one-of-a-kind situation.

 
This is true.   The other drawback of the k-sigmoid probability formula, however, is that filtering the score through that function will mask the "true meaning" of the other coefficients for humans who want to make sense of the resulting values.

RonWeasley
Re: Spooky material evaluation result
« Reply #14 on: Nov 8th, 2006, 2:28pm »

Collapsing down also, in your example, reduces the material expression to four significant figures.  This may make calculations more intuitive since it reflects the fact that there are indeed exactly four distinct piece levels remaining in the game.  To me this is an indicator of a promising approach.