Arimaa Forum - Empirically derived material evaluators Part II

Welcome, Guest. Please Login or Register.
Jul 12^th, 2025, 1:16am

Home

Help

Members

Arimaa Forum « Empirically derived material evaluators Part II »

   Arimaa Forum
   Arimaa
   General Discussion (Moderator: supersamu)
   Empirically derived material evaluators Part II

« Previous topic | Next topic »

Pages: 1 2 3

Notify of replies

Send Topic

Author

Topic: Empirically derived material evaluators Part II (Read 12108 times)

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Empirically derived material evaluators Part II
« on: Nov 16^th, 2006, 4:43pm »

Quote

Modify

I tested the ability of nine material eval functions to give a probability estimate of who was winning games in the historical database. The best performer was DAPE, but not when using its default coefficients.

Each algorithm produces a score that is positive for a gold advantage and negative for a silver advantage. I convert this number into a probability of a gold win with this formula:

P_{gold win}=1.0/(1.0+exp(-score/K))

And in each case I curve-fit the constant K to generate the best score for that particular material eval algorithm. The optimized K can be viewed as just a measure of the scale of that algorithm.

The error measured is the sum-squared-difference between estimated P_{gold win} and 1 (for a gold win) or 0 (for a silver win) over all states that appear in the database subject to the same criteria I defined in the last posting (players rated 1600+, exchanges only measured at the end of the exchange, etc.). The fit optimizer I used attempts to minimize that error. Algorithms which more accurately predict who is winning most often will have a lower error value.

For some of the algorithms, I also performed a best-fit on the coefficients of the algorithm itself. Here's a summary of the results, from worst to best. I'll explain the particulars of each one afterwards. The error +/- column gives the standard deviation over 5 runs the Total Error column.

RESULTS TABLE

Algorithm	K	Total Error	Error +/-
Count Pieces	1.40	2223.27	0.026
Linear AB (Default coefficients)	3.13	2145.69	0.007
DAPE (Default coefficients)	3.90	2124.19	0.000
FAME (Default coefficients)	2.92	2042.81	0.002
FAME-X (Optimized Coefficients	6.06	1908.44	0.085
FAME (Optimized Coefficients)	5.58	1895.01	0.037
RabbitCurveABC (Optimized Coefficients)	2.82	1891.21	0.646
LinearAB (Optimized Coefficients)	1.77	1889.10	0.015
DAPE (Optimized Coefficients)	3.931	1871.10	1.063

The DAPE Algorithm was clearly the best performer - but only once the coefficients were optimized by gradient descent over the database. And those optimized coefficients were decidedly unusual.

The standard deviations of the errors show that even though I did 5 runs of the optimization for each algorithm often with slightly different results, the overall performance of the different algorithms listed here is non-overlapping. There was some variance to the performance of LinearAB, for example, but it always beat RabbitCurve ABC and always lost to DAPE (Optimized Coefficients).

ALGORITHM RUNDOWN

Count Pieces

Just what you'd expect; the score is +1.0 point for each piece gold has, -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline. Most of the time, a player with more pieces is ahead. The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.

LinearAB (Default Coefficients)

As described in my previous post, this is a simple increasing value for each functional rank of pieces.

Rabbits are worth 1.0 points, A is the value of a cat and B is the ratio of each higher officer level to the last. Cats are worth A points. Dogs are worth A*B points, horses are worth A*B*B points, etc. (With the exception that these points are actually applied to collapsed functional levels. If cats have been eliminated from the game entirely, dogs are worth A points and horses are worth A*B points, etc.)

Here the default coefficients (my original guess last week) are:
A = 2.0 (A cat is worth two rabbits)
B = 1.5 (A dog is worth 1.5 cats, etc.)

I intended from the start for these coefficients "official version" to be optimized automatically, but I put this here as an illustration of the guessing ability of humans. As with the other human guesses about piece value, it performed miserably.

DAPE (Default Coefficients)

DAPE implemented exactly as on Janzert's calculator, including the constant division to make initial rabbit worth 1.0. The only thing I varied was the K value in order to optimize the probability function.
The other coefficients are: A=30, S=0.375, E=0.3, AR=200.0 BR=0.4, AN=200.0, BN=0.75.

FAME (Default Coefficients)

FAME implemented exactly as on Janzert's calculator, including the constant division to make initial rabbit worth 1.0. The only thing I varied was the K value in order to optimize the probability function.
The coefficients are: Level_1=256.0, Level_2=87.0, A=1.5, B=600, C=2.0. (Where A is the ratio of Level 2 to Level 3, and residual rabbits are scored at B/(R+CP) for opposing rabbits R and opposing officers P.

FAME-X (Optimized Coefficients)

FAME-X is a slight variation on FAME I tried for which the rabbit score is applied to all the rabbits a player has, instead of the "residual rabbits" after others are used to fill in vs. opposing officers. It did not improve on FAME, so I'll spare you further analysis for now.

FAME (Optimized Coefficients)

With the values of Level 1 and Level 2 fixed at 256 and 87, I allowed the other three coefficients A, B, and C to float and the optimizer to find best values for them. The results were a fairly consistent and dramatic improvement over the default FAME coefficients, and the resulting coefficients were also quite consistent. They were:

A = 1.249 +/- 0.005
B = 900.1 +/- 14.13
C = 0.96 +/- 0.023

The most dramatic result is that C is 2.0 in Fritzl's formulation but 0.96 when optimized. My best-guess explanation is this: C attempts to devalue rabbits in inverse proportion to the number of opposing officers. However, in FAME, opponents with many officers already punish rabbit score by consuming some of the rabbits in the "filling in" part of the function, preventing those rabbits from being scored entirely. It is my guess that in the average case this "filling in" effect does more than is necessary to punish a player's rabbit score for being behind in officers, and so C is dropping below 1.0 to compensate.

FAME-X was an attempt to test this hypothesis, and while it did result in a much higher C value (~4.9) it didn't improve on FAME's ability to score the database.

RabbitCurveABC

Described in the previous post, this is identical to LinearAB except that the 8 points of the rabbits as a whole are distributed according to
this function: http://idahoev.com/arimaa/images/rabbitcurve.png

The coefficient A scales the value of a cat to the average rabbit, and B scales subsequent officers to cats. B was very consistent at 1.316 +/- 0.003, but A and C covaried and ranged pretty widely. C seems to prefer values from 0.2 to 0.3, indicating some usefulness of the curve shape, but A varied inversely to C and needed to range wider to compensate from ~0.5 to ~2.5! I believe A is attempting to scale the value of the officers relative to the value of the first rabbits lost rather than the average rabbits lost, especially given how well-represented these states are in the database. Since the value of the initial rabbit lost is very dependent on C, A is extremely sensitive to that
value.

LinearAB

The optimized values:
A=1.241 +/- 0.004
B=1.316 +/- 0.002

I'm continually surprised at how well this evaluator has performed. I originally implemented it as a toss-off algorithm just to test my optimizer. In particular, the fact that it simply values all rabbits at 1.0 points seems incredibly naive to me, as I believe that the 8th rabbit lost should be much more expensive than the 1st rabbit lost.

The empirical optimizer seems to disagree. What can I say.

DAPE (Optimized Coefficients)

With coefficients optimized by my system, DAPE outperformed the other algorithms and not by a small amount! How much of this is due to DAPE being a better description of Arimaa's mechanics (pieces evaluated with respect to how many other pieces they can beat) and how much is due to the fact that DAPE has seven tweakable coefficients is not clear. While the end performance was fairly consistent, the actual coefficients varied quite a bit; each time I ran the optimizer it found a slightly different solution for DAPE. With seven interdependent coefficients, there are bound to be many local minima in the error function.

Here are the coefficients (using Toby's terminology from this post and the code on Janzert's calculator) at the end of the best-performing optimizing run:

A=27.9307
S=0.7833
E=0.4282
AR = -0.46
BR = -1.0683
AN = 428.7
BN = 0.9755
(Error = 1870.34)

The higher values of S and E (relative to the defaults) are telling us that, like with FAME, having a larger number of pieces is more important than having stronger pieces (relative to what we think, anyhow.) AN and AR are not described in 99of9's original post on DAPE but are alluded to later in the thread and can be found in the code on Janzert's FAME/DAPE web page.

The value of AR is the coefficient of the "Low-rabbits punisher" function. That value is fascinating, and it was not particular to this run. In every run I performed, the optimizer immediately reduced AR to something very near zero, in fact going slightly negative in 4 out of 5 runs. Essentially, the low-rabbits optimizer with any positive value hurts the ability of DAPE to accurately score the database, and the optimizer eliminated it. However, the overall "low piece punisher" function was reliably deemed to be more important than Toby had originally estimated it: AN optimized to 434.8 +/- 53.0 (compared to the default of 200).

So, performance-wise DAPE is the best algorithm we have so far but it needs better coefficients. The "low rabbits" function could be removed entirely without reducing DAPE's ability to predict the outcome.

ANALYSIS

What's most interesting to me is how large of an improvement algorithms like LinearAB, FAME and DAPE get when an automatic system tweaks their coefficients. What this says to me is that our intuition about the value of the pieces, despite years of experience and thousands of games played, is still not quite correct and we as humans aren't very good at guessing.

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Empirically derived material evaluators Part I
« Reply #1 on: Nov 16^th, 2006, 9:56pm »

Quote

Modify

Thanks for running all the optimizations and giving detailed description of each. I will respond more thoughtfully soon, but my first question is: Which of the systems is the IdahoEv System? My vote would be for the optimized linearAB to take a place next to the "official" FAME and DAPE scores on Janzert's material calculator. Or maybe you aren't done yet?

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #2 on: Nov 16^th, 2006, 10:06pm »

Quote

Modify

on Nov 16^th, 2006, 9:56pm, Fritzlein wrote:

Which of the systems is the IdahoEv System? My vote would be for the optimized linearAB to take a place next to the "official" FAME and DAPE scores on Janzert's material calculator. Or maybe you aren't done yet?

I have an entirely separate one that I'm working on, but I haven't figured out how I want to score rabbits yet. I'll call it my official one if it outperforms LinearAB. Of course, I probably won't stop tweaking it until it does.

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Empirically derived material evaluators Part I
« Reply #3 on: Nov 19^th, 2006, 9:50pm »

Quote

Modify

Again, I want to come back to this in more detail, but I thought I would throw out an observation: Shouldn't the CurveABC always perform at least as well as the LinearAB? If nothing else, it can set C=0 in CurveABC and then it is linear. If it is performing worse, then it means your optimizer is getting caught in local minima that aren't as good as the ones LinearAB is finding. In other words, curving the rabbits makes it harder to make good predictions.

Now, I absolutely believe that the last few rabbits are worth more than the first few. Curving the rabbits must be better than having them linear. But since the results of curving are worse, that casts some doubt on the whole procedure. Something I don't understand is affecting the results, and I'll be suspicious until I can think of a good explanation.

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #4 on: Nov 20^th, 2006, 3:23am »

Quote

Modify

Sadly, being the terrible mathematician I am, the first function I found for RabbitCurveABC that had the shape I wanted is discontinuous at C=0, has a zero denominator somewhere I believe. (I don't have it in front of me ATM).

So actually, RabbitCurveABC can't set C=0. Stupid, I know, and if any mathematicians present want to craft a better curve function (hint) i'd be happy to run it. But I don't think that's what's going on in any case.

Given the observed interdependence of A and C, I suspect the local minimum issue is what's really happening. I probably could have run it a dozen more times, or annealed more slowly, and found better solutions. In "Part I", with the simpler error function, RabbitCurveABC generally did outperform LinearAB when run long enough, but also often fell into local minima that were slightly worse.

In any case I was trying to be more-or-less scientifially honest and so peformed 5 runs and for each and left it at that for now. I suspect RabbitCurveABC could slightly outperform LinearAB if given enough time, but since optimized DAPE did such a lovely job I wanted to move on to other things rather than spend another 6 hours of CPU time trying to squeeze a few more points out of an algorithm I didn't like much anyway.

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Empirically derived material evaluators Part I
« Reply #5 on: Nov 20^th, 2006, 9:41pm »

Quote

Modify

on Nov 20^th, 2006, 3:23am, IdahoEv wrote:

if any mathematicians present want to craft a better curve function (hint) i'd be happy to run it.

A pretty standard family of curves from (0,0) to (1,1) is f(x) = x^p where p>0. Since you are trying to get from (0,0) to (8,8), you could use f(x) = 8*(x/8)^p. Then f(8), i.e. the value of all 8 rabbits, is always 8. f(1) is the value of having one rabbit rather than zero, whereas 8 - f(7) is the value of having eight rabbis rather than seven.

When p=1 it is a straight line of all rabbits being equally valued; p<1 suggests the last few rabbits on the board are worth more than the first few to be captured, while p>1 suggests the reverse. With this parameter you can start out the rabbits as linear and let the line curve in either direction. I would laugh if the optimizer got stuck in a curve of the wrong direction! But one possible explanation is that later rabbits are already getting promoted in value when you collapse the strong pieces downward, so the optimizer has to compensate by curving the other way.

« Last Edit: Nov 20^th, 2006, 9:43pm by Fritzlein »

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #6 on: Nov 20^th, 2006, 11:39pm »

Quote

Modify

99of9 asked me to re-run the analysis on DAPE but constraining the system to only those games where the opponents were within 100 ratings points of each other. I did so, here are the results. I'm showing the full results for all 5 runs both without (first) and with that constraint.

Without the constraint (these are the same results summarized above, but all 5 runs are shown instead of only the best one).

51513 different states from 7913 games evaluated

Err	A	S	E	AR	BR	AN	BN	K
1870.3447	27.9307	0.7833	0.4282	-0.46	-1.0683	428.7	0.9755	4.7811
1872.9452	28.0058	0.7924	0.5608	-5.64	-0.0848	412.24	0.9746	4.5903
1871.0219	31.156	0.7874	0.5253	-0.88	-0.9979	434.96	0.974	5.0967
1870.6683	31.0662	1.1669	0.6471	55.96	0.976	377.52	0.9854	3.7867
1870.507	28.1805	0.8544	0.452	-1.74	-0.8613	520.76	0.9818	4.6596

With the additional constraint (ABS(wrating-brating)<100):

18563 different states from 2967 games evaluated

Err	A	S	E	AR	BR	AN	BN	K
1020.1945	31.2871	2.1109	-0.0433	139.74	0.9758	422.7	0.354	3.1555
1020.0298	29.3184	2.1287	-0.0221	174.32	0.9829	1195.58	0.293	2.9082
1019.6417	27.5985	0.8862	0.0631	36.14	0.9221	457.68	0.984	5.4417
1019.7013	29.1201	1.2575	0.0688	89.56	0.9708	363.58	0.9892	4.4402
1019.7303	29.3904	0.8043	0.0466	37.66	0.9267	505.8	0.9821	6.2032

The changes I see - AR/BR is now a factor, but E is not. AN/BN now varies much more widely, as does S.

I'll let 99 provide the interpretation for why this might be and/or what these numbers mean, but this does demonstrate that the empirical approach is noticeably dependent on *which* historical games you are studying.

« Last Edit: Nov 20^th, 2006, 11:39pm by IdahoEv »

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Empirically derived material evaluators Part I
« Reply #7 on: Nov 20^th, 2006, 11:44pm »

Quote

Modify

Ah, constraining the games to even matches is a good idea I didn't think of. If the games are lopsided in rating it might encourage the optimizer to over-react along the lines of predicting that whoever takes the first piece will win. If the stronger player almost always takes the first piece, it suddenly looks like a huge advantage...

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #8 on: Nov 21^st, 2006, 12:44am »

Quote

Modify

on Nov 20^th, 2006, 11:44pm, Fritzlein wrote:

If the stronger player almost always takes the first piece, it suddenly looks like a huge advantage...

Do we have a sense of how true this is?

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #9 on: Nov 21^st, 2006, 3:23am »

Quote

Modify

on Nov 20^th, 2006, 9:41pm, Fritzlein wrote:

A pretty standard family of curves from (0,0) to (1,1) is f(x) = x^p where p>0. Since you are trying to get from (0,0) to (8, Cool

, you could use f(x) = 8*(x/ Cool

^p.

Thank you, Fritz!

I wrote RabbitCurve2ABC which uses this function. Here's a plot of the function, for the curious. It has a noticeably different shape than the one I used in RabbitCurveABC. The new one more closely approximates a linear region with the first rabbits lost all costing a similar amount followed by a sudden increase in value for the last rabbit. It seems reasonable that it might more closely fit the actual behavior of arimaa games.

Tested on the same data, RabbitCurve2ABC was an improvement on RabbitCurveABC but only a tiny improvement on LinearAB:

Algorithm	K	Total Error	Error +/-
RabbitCurveABC (Optimized Coefficients)	2.82	1891.21	0.646
LinearAB (Optimized Coefficients)	1.77	1889.10	0.015
RabbitCurve2ABC (Optimized Coefficients)	1.80	1889.06	0.004

Moreover, it essentially made itself into LinearAB:
A = 1.2427 +/- .0096
B = 1.3185 +/- .0010
C = 1.0205 +/- .0092

« Last Edit: Nov 21^st, 2006, 3:24am by IdahoEv »

IP Logged

99of9
Forum Guru

Gnobby's creator (player #314)

Gender:

Posts: 1413

Re: Empirically derived material evaluators Part I
« Reply #10 on: Nov 21^st, 2006, 4:17am »

Quote

Modify

on Nov 20^th, 2006, 11:39pm, IdahoEv wrote:

Great, I like these results much more. Even the E value 0.06 is not unfeasible. Thanks for all this work Idaho, it certainly gives a different perspective on material eval.

I think the fact that DAPE can fit your data better than the other methods is mostly due to the fact that it has extra fitting parameters, and only partly due to the functional form.

This is borne out in the fact that the parameters are so sensitive to the training set, and are still so different from my intuitive values. (Not that my intuitive values are necessarily good... but the optimized ones still don't look quite like the Arimaa I play.)

... but, maybe familiarity will convince me. I'd appreciate if Janzert could put Idaho's 3rd set (lowest error) into his calculator side by side with my coefficients. Interestingly, of the 5 you got... I agree with your error function that the 3rd one best (because it has the lowest BR).

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #11 on: Nov 21^st, 2006, 8:15am »

Quote

Modify

on Nov 21^st, 2006, 4:17am, 99of9 wrote:

... but, maybe familiarity will convince me.

And I'll re-run them periodically as more games accumulate in the DB, and we can experiment with different constraints on the games considered. As it is, the most recent results are using a game DB that is two weeks old. I have not updated my db since I ran the first experiment, because I wanted the number of states to remain constant so the error functions would be comparable.

Quote:

I'd appreciate if Janzert could put Idaho's 3rd set (lowest error) into his calculator side by side with my coefficients.

Incidentally, by using the K values listed and the sigmoid formula above, Janzert could add a probability output to all three, so it would show for example "FAME thinks gold is ahead by 1.53 (74% win probability)".

For unaltered FAME and DAPE algorithms, use the K value for "FAME (default coefficients)" and "DAPE (default coefficients)" in the first post in this thread. In those cases K was the only variable that was empirically fit, so it essentially functions as a measure of the scale of FAME/DAPE's output.

IP Logged

Janzert
Forum Guru

Arimaa player #247

Gender: male

Posts: 1016

Re: Empirically derived material evaluators Part I
« Reply #12 on: Nov 21^st, 2006, 9:16pm »

Quote

Modify

Ok, I added DAPE with the constants 99of9 mentioned to the calculator.

One thing that struck me is how close it is in evaluating rabbits to FAME.

Janzert

« Last Edit: Nov 21^st, 2006, 9:41pm by Janzert »

IP Logged

99of9
Forum Guru

Gnobby's creator (player #314)

Gender:

Posts: 1413

Re: Empirically derived material evaluators Part I
« Reply #13 on: Nov 21^st, 2006, 9:33pm »

Quote

Modify

on Nov 21^st, 2006, 9:16pm, Janzert wrote:

Ok, I added DAPE with the constants 99of9 mentioned to the calculator.

Thanks Janzert!

Quote:

One thing that struck me is how close it is in evaluating rabbits.

Yeah, it's amazingly good agreement when all 3 methods value one rabbit as 1.000

.

Hey Idaho. Since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening. Could you access your actual stats and tell us how often each has won (say for the Rating Diff <100 criteria)?

I'll make a bold prediction that the guy with the horse wins more often.

IP Logged

Janzert
Forum Guru

Arimaa player #247

Gender: male

Posts: 1016

Re: Empirically derived material evaluators Part I
« Reply #14 on: Nov 21^st, 2006, 9:42pm »

Quote

Modify

hehe, oops forgot the words "to FAME" the first time. Tongue

Janzert

IP Logged

Pages: 1 2 3

Notify of replies

Send Topic


« Previous topic \| Next topic »