Arimaa Forum - Print Page


    
      
        Arimaa Forum
        (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
      

        Arimaa >> General Discussion >> Empirically derived material evaluators Part II
        
(Message started by: IdahoEv on Nov 16^th, 2006, 4:43pm)

Title: Empirically derived material evaluators Part II
Post by IdahoEv on Nov 16^th, 2006, 4:43pm

I tested the ability of nine material eval functions to give a probability estimate of who was winning games in the historical database. The best performer was DAPE, but not when using its default coefficients.

Each algorithm produces a score that is positive for a gold advantage and negative for a silver advantage. I convert this number into a probability of a gold win with this formula:

P_{gold win}=1.0/(1.0+exp(-score/K))

And in each case I curve-fit the constant K to generate the best score for that particular material eval algorithm. The optimized K can be viewed as just a measure of the scale of that algorithm.

The error measured is the sum-squared-difference between estimated P_{gold win} and 1 (for a gold win) or 0 (for a silver win) over all states that appear in the database subject to the same criteria I defined in the last posting (players rated 1600+, exchanges only measured at the end of the exchange, etc.). The fit optimizer I used attempts to minimize that error. Algorithms which more accurately predict who is winning most often will have a lower error value.

For some of the algorithms, I also performed a best-fit on the coefficients of the algorithm itself. Here's a summary of the results, from worst to best. I'll explain the particulars of each one afterwards. The error +/- column gives the standard deviation over 5 runs the Total Error column.

RESULTS TABLE

Algorithm	K	Total Error	Error +/-
Count Pieces	1.40	2223.27	0.026
Linear AB (Default coefficients)	3.13	2145.69	0.007
DAPE (Default coefficients)	3.90	2124.19	0.000
FAME (Default coefficients)	2.92	2042.81	0.002
FAME-X (Optimized Coefficients	6.06	1908.44	0.085
FAME (Optimized Coefficients)	5.58	1895.01	0.037
RabbitCurveABC (Optimized Coefficients)	2.82	1891.21	0.646
LinearAB (Optimized Coefficients)	1.77	1889.10	0.015
DAPE (Optimized Coefficients)	3.931	1871.10	1.063

The DAPE Algorithm was clearly the best performer - but only once the coefficients were optimized by gradient descent over the database. And those optimized coefficients were decidedly unusual.

The standard deviations of the errors show that even though I did 5 runs of the optimization for each algorithm often with slightly different results, the overall performance of the different algorithms listed here is non-overlapping. There was some variance to the performance of LinearAB, for example, but it always beat RabbitCurve ABC and always lost to DAPE (Optimized Coefficients).

ALGORITHM RUNDOWN

Count Pieces

Just what you'd expect; the score is +1.0 point for each piece gold has, -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline. Most of the time, a player with more pieces is ahead. The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.

LinearAB (Default Coefficients)

As described in my previous post (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1163044066), this is a simple increasing value for each functional rank of pieces.

Rabbits are worth 1.0 points, A is the value of a cat and B is the ratio of each higher officer level to the last. Cats are worth A points. Dogs are worth A*B points, horses are worth A*B*B points, etc. (With the exception that these points are actually applied to collapsed functional levels. If cats have been eliminated from the game entirely, dogs are worth A points and horses are worth A*B points, etc.)

Here the default coefficients (my original guess last week) are:
A = 2.0 (A cat is worth two rabbits)
B = 1.5 (A dog is worth 1.5 cats, etc.)

I intended from the start for these coefficients "official version" to be optimized automatically, but I put this here as an illustration of the guessing ability of humans. As with the other human guesses about piece value, it performed miserably. :)

DAPE (Default Coefficients)

DAPE implemented exactly as on Janzert's calculator, including the constant division to make initial rabbit worth 1.0. The only thing I varied was the K value in order to optimize the probability function.
The other coefficients are: A=30, S=0.375, E=0.3, AR=200.0 BR=0.4, AN=200.0, BN=0.75.

FAME (Default Coefficients)

FAME implemented exactly as on Janzert's calculator, including the constant division to make initial rabbit worth 1.0. The only thing I varied was the K value in order to optimize the probability function.
The coefficients are: Level_1=256.0, Level_2=87.0, A=1.5, B=600, C=2.0. (Where A is the ratio of Level 2 to Level 3, and residual rabbits are scored at B/(R+CP) for opposing rabbits R and opposing officers P.

FAME-X (Optimized Coefficients)

FAME-X is a slight variation on FAME I tried for which the rabbit score is applied to all the rabbits a player has, instead of the "residual rabbits" after others are used to fill in vs. opposing officers. It did not improve on FAME, so I'll spare you further analysis for now.

FAME (Optimized Coefficients)

With the values of Level 1 and Level 2 fixed at 256 and 87, I allowed the other three coefficients A, B, and C to float and the optimizer to find best values for them. The results were a fairly consistent and dramatic improvement over the default FAME coefficients, and the resulting coefficients were also quite consistent. They were:

A = 1.249 +/- 0.005
B = 900.1 +/- 14.13
C = 0.96 +/- 0.023

The most dramatic result is that C is 2.0 in Fritzl's formulation but 0.96 when optimized. My best-guess explanation is this: C attempts to devalue rabbits in inverse proportion to the number of opposing officers. However, in FAME, opponents with many officers already punish rabbit score by consuming some of the rabbits in the "filling in" part of the function, preventing those rabbits from being scored entirely. It is my guess that in the average case this "filling in" effect does more than is necessary to punish a player's rabbit score for being behind in officers, and so C is dropping below 1.0 to compensate.

FAME-X was an attempt to test this hypothesis, and while it did result in a much higher C value (~4.9) it didn't improve on FAME's ability to score the database.

RabbitCurveABC

Described in the previous post, this is identical to LinearAB except that the 8 points of the rabbits as a whole are distributed according to
this function: http://idahoev.com/arimaa/images/rabbitcurve.png

The coefficient A scales the value of a cat to the average rabbit, and B scales subsequent officers to cats. B was very consistent at 1.316 +/- 0.003, but A and C covaried and ranged pretty widely. C seems to prefer values from 0.2 to 0.3, indicating some usefulness of the curve shape, but A varied inversely to C and needed to range wider to compensate from ~0.5 to ~2.5! I believe A is attempting to scale the value of the officers relative to the value of the first rabbits lost rather than the average rabbits lost, especially given how well-represented these states are in the database. Since the value of the initial rabbit lost is very dependent on C, A is extremely sensitive to that
value.

LinearAB

The optimized values:
A=1.241 +/- 0.004
B=1.316 +/- 0.002

I'm continually surprised at how well this evaluator has performed. I originally implemented it as a toss-off algorithm just to test my optimizer. In particular, the fact that it simply values all rabbits at 1.0 points seems incredibly naive to me, as I believe that the 8th rabbit lost should be much more expensive than the 1st rabbit lost.

The empirical optimizer seems to disagree. What can I say.

DAPE (Optimized Coefficients)

With coefficients optimized by my system, DAPE outperformed the other algorithms and not by a small amount! How much of this is due to DAPE being a better description of Arimaa's mechanics (pieces evaluated with respect to how many other pieces they can beat) and how much is due to the fact that DAPE has seven tweakable coefficients is not clear. While the end performance was fairly consistent, the actual coefficients varied quite a bit; each time I ran the optimizer it found a slightly different solution for DAPE. With seven interdependent coefficients, there are bound to be many local minima in the error function.

Here are the coefficients (using Toby's terminology from this post (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=devTalk;action=display;num=1163129051) and the code on Janzert's calculator (http://arimaa.janzert.com/fame.html)) at the end of the best-performing optimizing run:

A=27.9307
S=0.7833
E=0.4282
AR = -0.46
BR = -1.0683
AN = 428.7
BN = 0.9755
(Error = 1870.34)

The higher values of S and E (relative to the defaults) are telling us that, like with FAME, having a larger number of pieces is more important than having stronger pieces (relative to what we think, anyhow.) AN and AR are not described in 99of9's original post on DAPE but are alluded to later in the thread and can be found in the code on Janzert's FAME/DAPE web page.

The value of AR is the coefficient of the "Low-rabbits punisher" function. That value is fascinating, and it was not particular to this run. In every run I performed, the optimizer immediately reduced AR to something very near zero, in fact going slightly negative in 4 out of 5 runs. Essentially, the low-rabbits optimizer with any positive value hurts the ability of DAPE to accurately score the database, and the optimizer eliminated it. However, the overall "low piece punisher" function was reliably deemed to be more important than Toby had originally estimated it: AN optimized to 434.8 +/- 53.0 (compared to the default of 200).

So, performance-wise DAPE is the best algorithm we have so far but it needs better coefficients. The "low rabbits" function could be removed entirely without reducing DAPE's ability to predict the outcome.

ANALYSIS

What's most interesting to me is how large of an improvement algorithms like LinearAB, FAME and DAPE get when an automatic system tweaks their coefficients. What this says to me is that our intuition about the value of the pieces, despite years of experience and thousands of games played, is still not quite correct and we as humans aren't very good at guessing.

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Nov 16^th, 2006, 9:56pm

Thanks for running all the optimizations and giving detailed description of each. I will respond more thoughtfully soon, but my first question is: Which of the systems is the IdahoEv System? My vote would be for the optimized linearAB to take a place next to the "official" FAME and DAPE scores on Janzert's material calculator. Or maybe you aren't done yet?

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 16^th, 2006, 10:06pm

on 11/16/06 at 21:56:50, Fritzlein wrote:

Which of the systems is the IdahoEv System? My vote would be for the optimized linearAB to take a place next to the "official" FAME and DAPE scores on Janzert's material calculator. Or maybe you aren't done yet?

I have an entirely separate one that I'm working on, but I haven't figured out how I want to score rabbits yet. I'll call it my official one if it outperforms LinearAB. Of course, I probably won't stop tweaking it until it does. :-)

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Nov 19^th, 2006, 9:50pm

Again, I want to come back to this in more detail, but I thought I would throw out an observation: Shouldn't the CurveABC always perform at least as well as the LinearAB? If nothing else, it can set C=0 in CurveABC and then it is linear. If it is performing worse, then it means your optimizer is getting caught in local minima that aren't as good as the ones LinearAB is finding. In other words, curving the rabbits makes it harder to make good predictions.

Now, I absolutely believe that the last few rabbits are worth more than the first few. Curving the rabbits must be better than having them linear. But since the results of curving are worse, that casts some doubt on the whole procedure. Something I don't understand is affecting the results, and I'll be suspicious until I can think of a good explanation.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 20^th, 2006, 3:23am

Sadly, being the terrible mathematician I am, the first function I found for RabbitCurveABC that had the shape I wanted is discontinuous at C=0, has a zero denominator somewhere I believe. (I don't have it in front of me ATM).

So actually, RabbitCurveABC can't set C=0. Stupid, I know, and if any mathematicians present want to craft a better curve function (hint) i'd be happy to run it. But I don't think that's what's going on in any case.

Given the observed interdependence of A and C, I suspect the local minimum issue is what's really happening. I probably could have run it a dozen more times, or annealed more slowly, and found better solutions. In "Part I", with the simpler error function, RabbitCurveABC generally did outperform LinearAB when run long enough, but also often fell into local minima that were slightly worse.

In any case I was trying to be more-or-less scientifially honest and so peformed 5 runs and for each and left it at that for now. I suspect RabbitCurveABC could slightly outperform LinearAB if given enough time, but since optimized DAPE did such a lovely job I wanted to move on to other things rather than spend another 6 hours of CPU time trying to squeeze a few more points out of an algorithm I didn't like much anyway.

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Nov 20^th, 2006, 9:41pm

on 11/20/06 at 03:23:25, IdahoEv wrote:

if any mathematicians present want to craft a better curve function (hint) i'd be happy to run it.

A pretty standard family of curves from (0,0) to (1,1) is f(x) = x^p where p>0. Since you are trying to get from (0,0) to (8,8), you could use f(x) = 8*(x/8)^p. Then f(8), i.e. the value of all 8 rabbits, is always 8. f(1) is the value of having one rabbit rather than zero, whereas 8 - f(7) is the value of having eight rabbis rather than seven.

When p=1 it is a straight line of all rabbits being equally valued; p<1 suggests the last few rabbits on the board are worth more than the first few to be captured, while p>1 suggests the reverse. With this parameter you can start out the rabbits as linear and let the line curve in either direction. I would laugh if the optimizer got stuck in a curve of the wrong direction! But one possible explanation is that later rabbits are already getting promoted in value when you collapse the strong pieces downward, so the optimizer has to compensate by curving the other way.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 20^th, 2006, 11:39pm

99of9 asked me to re-run the analysis on DAPE but constraining the system to only those games where the opponents were within 100 ratings points of each other. I did so, here are the results. I'm showing the full results for all 5 runs both without (first) and with that constraint.

Without the constraint (these are the same results summarized above, but all 5 runs are shown instead of only the best one).

51513 different states from 7913 games evaluated

Err	A	S	E	AR	BR	AN	BN	K
1870.3447	27.9307	0.7833	0.4282	-0.46	-1.0683	428.7	0.9755	4.7811
1872.9452	28.0058	0.7924	0.5608	-5.64	-0.0848	412.24	0.9746	4.5903
1871.0219	31.156	0.7874	0.5253	-0.88	-0.9979	434.96	0.974	5.0967
1870.6683	31.0662	1.1669	0.6471	55.96	0.976	377.52	0.9854	3.7867
1870.507	28.1805	0.8544	0.452	-1.74	-0.8613	520.76	0.9818	4.6596

With the additional constraint (ABS(wrating-brating)<100):

18563 different states from 2967 games evaluated

Err	A	S	E	AR	BR	AN	BN	K
1020.1945	31.2871	2.1109	-0.0433	139.74	0.9758	422.7	0.354	3.1555
1020.0298	29.3184	2.1287	-0.0221	174.32	0.9829	1195.58	0.293	2.9082
1019.6417	27.5985	0.8862	0.0631	36.14	0.9221	457.68	0.984	5.4417
1019.7013	29.1201	1.2575	0.0688	89.56	0.9708	363.58	0.9892	4.4402
1019.7303	29.3904	0.8043	0.0466	37.66	0.9267	505.8	0.9821	6.2032

The changes I see - AR/BR is now a factor, but E is not. AN/BN now varies much more widely, as does S.

I'll let 99 provide the interpretation for why this might be and/or what these numbers mean, but this does demonstrate that the empirical approach is noticeably dependent on *which* historical games you are studying.

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Nov 20^th, 2006, 11:44pm

Ah, constraining the games to even matches is a good idea I didn't think of. If the games are lopsided in rating it might encourage the optimizer to over-react along the lines of predicting that whoever takes the first piece will win. If the stronger player almost always takes the first piece, it suddenly looks like a huge advantage...

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 21^st, 2006, 12:44am

on 11/20/06 at 23:44:41, Fritzlein wrote:

If the stronger player almost always takes the first piece, it suddenly looks like a huge advantage...

Do we have a sense of how true this is?

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 21^st, 2006, 3:23am

on 11/20/06 at 21:41:29, Fritzlein wrote:

A pretty standard family of curves from (0,0) to (1,1) is f(x) = x^p where p>0. Since you are trying to get from (0,0) to (8,8), you could use f(x) = 8*(x/8)^p.

Thank you, Fritz!

I wrote RabbitCurve2ABC which uses this function. Here's a plot of the function (http://idahoev.com/arimaa/images/rabbitcurve2.png), for the curious. It has a noticeably different shape than the one I used in RabbitCurveABC (http://idahoev.com/arimaa/images/rabbitcurve.png). The new one more closely approximates a linear region with the first rabbits lost all costing a similar amount followed by a sudden increase in value for the last rabbit. It seems reasonable that it might more closely fit the actual behavior of arimaa games.

Tested on the same data, RabbitCurve2ABC was an improvement on RabbitCurveABC but only a tiny improvement on LinearAB:

Algorithm	K	Total Error	Error +/-
RabbitCurveABC (Optimized Coefficients)	2.82	1891.21	0.646
LinearAB (Optimized Coefficients)	1.77	1889.10	0.015
RabbitCurve2ABC (Optimized Coefficients)	1.80	1889.06	0.004

Moreover, it essentially made itself into LinearAB:
A = 1.2427 +/- .0096
B = 1.3185 +/- .0010
C = 1.0205 +/- .0092

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Nov 21^st, 2006, 4:17am

on 11/20/06 at 23:39:40, IdahoEv wrote:

Great, I like these results much more. Even the E value 0.06 is not unfeasible. Thanks for all this work Idaho, it certainly gives a different perspective on material eval.

I think the fact that DAPE can fit your data better than the other methods is mostly due to the fact that it has extra fitting parameters, and only partly due to the functional form.

This is borne out in the fact that the parameters are so sensitive to the training set, and are still so different from my intuitive values. (Not that my intuitive values are necessarily good... but the optimized ones still don't look quite like the Arimaa I play.)

... but, maybe familiarity will convince me. I'd appreciate if Janzert could put Idaho's 3rd set (lowest error) into his calculator side by side with my coefficients. Interestingly, of the 5 you got... I agree with your error function that the 3rd one best (because it has the lowest BR).

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 21^st, 2006, 8:15am

on 11/21/06 at 04:17:30, 99of9 wrote:

... but, maybe familiarity will convince me.

And I'll re-run them periodically as more games accumulate in the DB, and we can experiment with different constraints on the games considered. As it is, the most recent results are using a game DB that is two weeks old. I have not updated my db since I ran the first experiment, because I wanted the number of states to remain constant so the error functions would be comparable.

Quote:

I'd appreciate if Janzert could put Idaho's 3rd set (lowest error) into his calculator side by side with my coefficients.

Incidentally, by using the K values listed and the sigmoid formula above, Janzert could add a probability output to all three, so it would show for example "FAME thinks gold is ahead by 1.53 (74% win probability)".

For unaltered FAME and DAPE algorithms, use the K value for "FAME (default coefficients)" and "DAPE (default coefficients)" in the first post in this thread. In those cases K was the only variable that was empirically fit, so it essentially functions as a measure of the scale of FAME/DAPE's output.

Title: Re: Empirically derived material evaluators Part I
Post by Janzert on Nov 21^st, 2006, 9:16pm

Ok, I added DAPE with the constants 99of9 mentioned to the calculator (http://arimaa.janzert.com/fame.html).

One thing that struck me is how close it is in evaluating rabbits to FAME.

Janzert

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Nov 21^st, 2006, 9:33pm

on 11/21/06 at 21:16:36, Janzert wrote:

Ok, I added DAPE with the constants 99of9 mentioned to the calculator (http://arimaa.janzert.com/fame.html).

Thanks Janzert!

Quote:

One thing that struck me is how close it is in evaluating rabbits.

Yeah, it's amazingly good agreement when all 3 methods value one rabbit as 1.000 :-).

Hey Idaho. Since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening. Could you access your actual stats and tell us how often each has won (say for the Rating Diff <100 criteria)?

I'll make a bold prediction that the guy with the horse wins more often.

Title: Re: Empirically derived material evaluators Part I
Post by Janzert on Nov 21^st, 2006, 9:42pm

hehe, oops forgot the words "to FAME" the first time. :P

Janzert

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 22^nd, 2006, 2:17am

on 11/21/06 at 21:42:28, Janzert wrote:

oops forgot the words "to FAME"

Fame!

I'm gonna live forever
I'm gonna learn how to fly
High!

I feel it coming together
People will see me and cry
Fame!

I'm gonna make it to heaven
Light up the sky like a flame
Fame!

I'm gonna live forever
Baby remember my name

Remember
Remember
Remember
Remember...

Or was that not what you meant...

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Nov 22^nd, 2006, 9:51am

on 11/21/06 at 03:23:26, IdahoEv wrote:

Algorithm	K	Total Error	Error +/-
RabbitCurveABC (Optimized Coefficients)	2.82	1891.21	0.646
LinearAB (Optimized Coefficients)	1.77	1889.10	0.015
RabbitCurve2ABC (Optimized Coefficients)	1.80	1889.06	0.004

Moreover, it essentially made itself into LinearAB:
A = 1.2427 +/- .0096
B = 1.3185 +/- .0010
C = 1.0205 +/- .0092

Ha, ha! Yes, when the rabbit curve was allowed to bend itself either way, it decided to stay essentially straight, but what curvature it does have goes the wrong way! Losing the first rabbit apparent hurts you more than losing the seventh rabbit...

This result has to make us stop and think: to what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated? The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question. It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 22^nd, 2006, 2:26pm

on 11/22/06 at 09:51:05, Fritzlein wrote:

To what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated? The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question. It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.

Indeed. I think the likely issue is the overrepresentation of early losses in the DB. Since there are many many examples of a single rabbit loss in the DB - and since those equate to a 55/45 win/loss or so, the optimizer has to work very hard to generate a 0.55 output for that case in order to minimize the error.

On the other hand, there are fewer cases where someone has, say, lost five rabbits. And in most of those cases they've probably lost a few officers as well and are thoroughly losing the game. So all the optimizer needs to do is punish them for having lost a lot of pieces and produce a probability near 1.0 or 0.0 as appropriate and very little error will accumulate for those cases.

This is the inevitable drawback of an empirical approach.

At the same time, one could argue that this is correct as far as a bot's material evaluator is concerned in any case. As with any player, once you're way ahead on material it matters less if you're correctly evaluating an additional exchange of D vs. RC. But correctly evaluating the early exchanges is far more important.

Put differently, correctly fitting your evaluator to the cases that appear in real games may be more useful in the practical sense of winning real games than an abstract concept of "true piece value". Especially since material evaluation alone is somewhat specious in Arimaa and is always subject to position, it may be that certain material states are more likely to be relevant than others simply because positional constraints prevent the other from actually appearing. For example, how important is it for a material eval to "correctly" evaluate M vs. RRRR .... if that exchange never actually occurs in practice? What if, in fact, it cannot occur because reasonable play simply will not create the positional circumstances for it? Then of course an empirically-derived evaluator will not necessarily evaluate that state correctly ... but maybe that doesn't actually matter.

Evaluating the increased value of the seventh rabbit lost may be a moot point if the evaluator can already tell you've lost with 99% confidence three pieces earlier. Or, if losing seven rabbits has only happened 10 times in a training set of 50,000 states. If so, we wouldn't expect it to put much effort into fitting that state well, and maybe the fact that it doesn't care about that state is telling us that we shouldn't worry about it, either, because it's not a relevant distinction to make in terms of winning games.

I have some thoughts about the difference beween the two different DAPE optimizations that I'll share when I have a moment.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 22^nd, 2006, 2:42pm

Sudden thought: now that the optimizer is scoring win probability rather than just "who is winning", we can't consider the cat loss at the beginning of a bait-and-tackle game as a simple "incorrect state" that no evaluator will get correct anymore, as we did in the analysis of Part I.

Instead, the optimizer will be desperately trying to reduce the probability error of the hundreds of examples of that technique, and that will definitely throw a wrench into the works.

I bet you that ABS(wrating-brating)<100 eliminates the vast majority of bait-and-tackle cases from the dataset.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 27^th, 2006, 10:14pm

Fritzl: to answer your question from chat last week, Optimized LinearAB prefers RH to M, in fact by quite a lot. Opt. Linear AB values (as optimized above, anyway) initial pieces as:

R: 1.0
C: 1.24
D: 1.54
H: 1.91
M: 2.38

LinearAB is strictly additive, so HR = 2.91. More than half a rabbit better than M.

"Discuss".

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Nov 27^th, 2006, 10:35pm

on 11/21/06 at 21:33:12, 99of9 wrote:

Hey Idaho. Since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening. Could you access your actual stats and tell us how often each has won (say for the Rating Diff <100 criteria)?

I'll make a bold prediction that the guy with the horse wins more often.

When you get a chance you could try this for M vs HR as well... but more tellingly you should try M vs DR, because Opt. linearAB even favours the DR there, which is totally whacky.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 28^th, 2006, 12:36am

The problem, 99, is that the data of exactly those specific states is pretty sparse. We can estimate the value of RR by how often it wins games, but the specific state of H traded for RR is pretty rare. But here's what we've got.

Using the same data set I used for training (both ratings > 1600, id>5000) (only including HvH and BvB, not HvB)

112228-112226 (gold ahead two rabbits)
52 occurrences, gold wins 45 of them (86.5%)

112226-112228 (silver ahead two rabbits
49 occurrences, silver wins 37 of them (75.5%)

Combined: RR advantage wins 82/101 or 81.2%

112228-111228 (gold ahead one horse)
78 occurrences, gold wins 54 of them (69.2%)

111228-112228 (silver ahead one horse)
82 occurrences, silver wins 67 of them (81%)

Combined: H advantage wins 121/160 or 75.6%

112226-111228 (gold down RR, silver down H)
6 occurrences, gold wins 2, silver wins 4 (67% silver win)

111228-112226 (gold down H, silver down RR)
6 occurrences, gold wins 5 (83%), silver wins 1 (16%)

Repeating the analysis, additionally constraining ABS(wrating-brating)<100 (only including HvH and BvB, not HvB):

112228-112226 (gold ahead two rabbits)
16 occurrences, gold wins 14 of them (87.5%)

112226-112228 (silver ahead two rabbits
20 occurrences, silver wins 15 of them (75.0%)

Combined: RR advantage wins 29/36 or 80.6%

112228-111228 (gold ahead one horse)
31 occurrences, gold wins 21 of them (67.7%)

111228-112228 (silver ahead one horse)
27 occurrences, silver wins 19 of them (70.4%)

Combined: H advantage wins 40/58 or 69.0%

112226-111228 (gold down RR, silver down H)
1 occurrences, gold wins 0, silver wins 1

111228-112226 (gold down H, silver down RR)
3 occurrences, gold wins 2 (67%), silver wins 1 (33%)

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Nov 28^th, 2006, 8:23am

on 11/28/06 at 00:36:38, IdahoEv wrote:

112226-111228 (gold down RR, silver down H)
6 occurrences, gold wins 2, silver wins 4 (67% silver win)

111228-112226 (gold down H, silver down RR)
6 occurrences, gold wins 5 (83%), silver wins 1 (16%)

Based on the spreadsheet you sent me earlier, I rather expected this. If RR weren't beating H most of the time (plus other similar stuff) then LinearAB wouldn't have settled on the coefficients it chose.

Can you post the game id's of those twelve games? It might be instructive to see exactly how the extra rabbits prove advantageous.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 28^th, 2006, 12:57pm

on 11/27/06 at 22:35:28, 99of9 wrote:

When you get a chance you could try this for M vs HR as well... but more tellingly you should try M vs DR, because Opt. linearAB even favours the DR there, which is totally whacky.

Well, as it turns out that is in fact what the data say when considering all games with a rating over 1600 (i.e. the dataset I trained LinearAB on).

Using the larger dataset (wrating > 1600 AND brating > 1600 AND wtype=btype)

Summary:
Camel advantage wins: 38/44 (86.4%)
HR advantage wins: 61/65 (91.0%)
DR advantage wins: 62/70 (88.6%)
Camel traded for HR: player with the camel wins 12/27 (44%)
Camel traded for DR: player with the camel wins 1/3 (33%)

State	N	gold wins	silver wins
112228-102228	20	16(80%)	4(20%)
102228-112228	24	2(8.3%)	22(91.7%)
112228-111227	30	27(90%)	3(10%)
111227-112228	31	2(6.5%)	29(93.55)
112228-112127	29	24(82.8%)	5(17.24%)
112127-112228	41	3(7.32%	38(92.68%)
111227-102228	14	3(21.4%)	11(78.6%)
102228-111227	13	4(30.8%)	9(69.2%)
112127-102228	no occurrences
102228-112127	3	2	1

Using the constrained dataset (wrating > 1600 AND brating > 1600 AND ABS(wrating-brating)<100 AND wtype=btype):

Summary:
Camel advantage wins: 13/15 (87%)
HR advantage wins: 22/26 (85%)
DR advantage wins: 18/23 (78%)
Camel vs. HR: player with the camel wins 3/12 (25%)
Camel vs. DR: player with the camel wins 1/1 (100%)

State	N	gold wins	silver wins
112228-102228	6	5	1
102228-112228	9	1	8
112228-111227	15	12	3
111227-112228	11	1	10
112228-112127	9	7	2
112127-112228	14	3	11
111227-102228	9	2	7
102228-111227	3	2	1
112127-102228	no occurrences
102228-112127	1	0	1

Using the larger data set, capturing DR and HR both seem to be better than capturing M. This does reverse when using the constrained data set, but in that case the sample sizes are sufficiently small that I don't have great confidence in the results.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 28^th, 2006, 1:39pm

on 11/28/06 at 08:23:56, Fritzlein wrote:

Can you post the game id's of those twelve games? It might be instructive to see exactly how the extra rabbits prove advantageous.

Here you go; the matching games for H vs RR, M vs. HR, and M vs. DR. "Turn" is the turn at which my system records the material state, which is generally 1 full turn (2 ply) after the capture that created the state. (This is how I avoid including mid-exchange states.)

State	game	turn	winner
H for RR:
112226-111228	7756 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=7756)	34w1	Black
112226-111228	15167 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=15167)	12b1	White
112226-111228	16004 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=16004)	27b1	Black
112226-111228	32195 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=32195)	37b1	Black
112226-111228	35376 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=35376)	19b1	Black
112226-111228	36369 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=36369)	17w1	White
111228-112226	11449 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=11449)	22b1	White
111228-112226	31668 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=31668)	26w1	White
111228-112226	33165 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=33165)	29w1	White
111228-112226	34455 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=34455)	18w1	White
111228-112226	35510 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=35510)	31b1	White
111228-112226	38603 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=38603)	18w1	Black
M for HR:
111227-102228	10480 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=10480)	16w1	White
111227-102228	11235 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=11235)	29b1	Black
111227-102228	11632 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=11632)	14b1	Black
111227-102228	13030 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=13030)	23b1	Black
111227-102228	15862 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=15862)	16b1	Black
111227-102228	21919 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=21919)	43w1	Black
111227-102228	23272 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=23272)	14b1	Black
111227-102228	23287 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=23287)	52b1	Black
111227-102228	23525 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=23525)	17b1	Black
111227-102228	24623 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=24623)	21b1	White
111227-102228	27204 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=27204)	25b1	Black
111227-102228	29619 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=29619)	43b1	Black
111227-102228	31362 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=31362)	21b1	White
111227-102228	36325 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=36325)	30b1	Black
102228-111227	9235 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=9235)	18b1	White
102228-111227	15294 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=15294)	18w1	Black
102228-111227	17649 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=17649)	37w1	White
102228-111227	18824 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=18824)	23b1	Black
102228-111227	18871 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=18871)	21w1	White
102228-111227	24682 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=24682)	25w1	Black
102228-111227	28137 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=28137)	27w1	Black
102228-111227	29508 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=29508)	24w1	Black
102228-111227	32663 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=32663)	20b1	Black
102228-111227	33262 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=33262)	26b1	Black
102228-111227	33934 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=33934)	16w1	White
102228-111227	39958 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=39958)	25w1	Black
102228-111227	40846 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=40846)	20b1	Black
M for DR:
102228-112127	19073 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=19073)	32w1	White
102228-112127	27445 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=27445)	23w1	Black
102228-112127	38494 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=38494)	27w1	White

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 28^th, 2006, 4:02pm

Argh. I just realized my last three posts seached the database with an additional constraint ... wtype=btype. So those numbers only include HvH games and BvB games; no HvB games.

I can re-run them if y'all care.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Nov 28^th, 2006, 5:20pm

Okay, the same results (summarized only - these posts are too long!), but now including HvB games as well as HvH and BvB.

Interestingly, the results more closely match 99of9's intuition when HvB games are included. Does this this mean that our feel for the game is based on thousands of games vs. bots, and isn't quite accurate when we are playing against a human opponent?

HvB included, ratings for both players > 1600. Positions in order of greatest to least statistical advantage.

HR advantage wins 183/211, 86%
DR advantage wins 327/383, 85%
M advantage wins 132/159, 83%
RR advantage wins 446/568, 79%
H advantage wins 459/621, 74%

Trade M for HR, player with camel wins 7/17, 41%
Trade M for DR, player with camel wins 54/95, 57%
Trade H for RR, player with horse wins 19/44, 43%

HvB included, both ratings >100 and abs(wrating-brating)<100

M advantage wins 64/72, 89%
HR advantage wins 59/71, 83%
DR advantage wins 94/117, 80%
RR advantage wins 167/214, 78%
H advantage wins 165/250, 66%

Trade M for HR, player with camel wins 18/33, 54.5%
Trade M for DR, player with camel wins 4/7, 57%
Trade H for RR, player with horse wins 10/18, 56%

So when HvB games are included AND we constrain to players with similar ratings, an M advantage becomes the strongest advantage we have. But in all other combinations considered, HR advantage beats M, and even DR beats M for most combinations of constraints.

Also, in every combination yet considered, RR advantage leads to win more often than H advantage. (Actual trades of H for RR contraindicate, but the number of those is low)

Title: Re: Empirically derived material evaluators Part I
Post by aaaa on May 11^th, 2007, 6:22pm

on 11/16/06 at 16:43:51, IdahoEv wrote:

ALGORITHM RUNDOWN

Count Pieces

Just what you'd expect; the score is +1.0 point for each piece gold has, -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline. Most of the time, a player with more pieces is ahead. The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.

It's kind of a shame that you didn't use the opportunity to also see how the various algorithms would stack up against the significantly less (but still quite) naive method of simply assigning point values to the different (collapsed) piece types à la chess, which would be similarly optimized. These numbers would also make for nice base values for bots to use.

I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me. If one were to wish to discover which general method (that is, ignoring the exact values used as parameters) would be best in capturing the intricacies of Arimaa, then cross-validation would be in order here.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Jun 11^th, 2007, 1:20pm

on 05/11/07 at 18:22:01, aaaa wrote:

That's not a bad idea, and if and when I re-run this analysis, I'll include something like that. The only trouble is that it would have to use collapsed piece types to be useful at all, and it's not at all clear how to assign the values to a collapsed piece list because the number of possible pieces changes.

Do we fix the elephant value and count down in levels, or fix the cat and count up? And do we float the rabbit value as well, or leave it fixed relative to the elephant or cat? We might have to try it a couple of different ways.

Quote:

I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me.

See my post in the fairy pieces thread for discussion on this. As an additional comment, though, most of the variability in the oDAPE coefficients comes because what most of mattered in some cases was the ratio between two coefficients rather than the coefficients themselves. So, for example, AR and BR vary widely, but only because they vary inversely. The ratio AR/BR was fairly consistent over multiple runs.

Title: Re: Empirically derived material evaluators Part I
Post by aaaa on Jun 12^th, 2007, 2:05pm

A case can be made both for collapsing the pieces downwards as well as upwards. On one hand, collapsing downwards makes sense from the viewpoint that as the board empties out, pieces and especially rabbits become more valuable simply by existing (i.e. that quantity becomes more important than quality), in which case you would want to minimize the discrepancy in value between a rabbit and the next weakest piece (by normalizing the latter as a cat). On the other hand, by collapsing upwards you won't have the strategically indispensable elephant inflating the base value of lower pieces after piece levels get eliminated.

This suggests adding three more systems for testing: collapsing downwards, collapsing upwards and a naive system with no collapsing at all in order to see whether it actually matters that much. I'm guessing it doesn't really.

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Apr 22^nd, 2008, 2:15am

on 11/16/06 at 21:56:50, Fritzlein wrote:

I'd recommend against putting your name to this method just yet :-). Reasons below.

on 11/16/06 at 16:43:51, IdahoEv wrote:

LinearAB

The optimized values:
A=1.241 +/- 0.004
B=1.316 +/- 0.002

I'm continually surprised at how well this evaluator has performed. I originally implemented it as a toss-off algorithm just to test my optimizer. In particular, the fact that it simply values all rabbits at 1.0 points seems incredibly naive to me, as I believe that the 8th rabbit lost should be much more expensive than the 1st rabbit lost.

The empirical optimizer seems to disagree. What can I say.

I just discovered a fairly serious problem with linearAB (and everything based on it). Sometimes it will refuse to kill free enemy pieces!!

Say you have a full army and your opponent has d8r, the collapsed notation is:
004228-000108 linearAB_score = 12.71

Now you have the opportunity to kill the dog, leaving the board as:
000088-000008 linearAB_score = 9.93

Because it reduces your score (!) you will not make the capture. Or if the sides were reversed, this could result in deliberate suicides.

The problem is related to the fact that levels collapse down, but the value of each level remains the same. Perhaps collapsing up would be better, but this might cause other problems.

PS. Under this eval, Arimaabuff's CR_only handicap is actually more valuable than an R_only handicap!!!

Title: Re: Empirically derived material evaluators Part I
Post by aaaa on Apr 22^nd, 2008, 8:49am

Best then to just forgo any collapsing at all with this system and take the lack of ability to recognize equivalent states for granted.

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Apr 22^nd, 2008, 1:03pm

on 04/22/08 at 02:15:22, 99of9 wrote:

That's definitely a very interesting point.

However, trying it out I am unable to construct any such cases (where capturing reduces your score) when both sides still possess their elephant. And even after a lost elephant, it looks like that sort of collapse fault only occurs in extreme corner cases where one side has essentially no material left.

The primary goal of a material eval function, for me, is bot development, and I don't think cases like this one occur in competitive games.

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Apr 22^nd, 2008, 8:34pm

on 04/22/08 at 13:03:07, IdahoEv wrote:

However, trying it out I am unable to construct any such cases (where capturing reduces your score) when both sides still possess their elephant. And even after a lost elephant, it looks like that sort of collapse fault only occurs in extreme corner cases where one side has essentially no material left.

When you have a full army and your opponent has ed8r the score is
[013228-010108] score 10.56

when you kill his dog it becomes:
[000178-000108] score 8.69

Now I agree these anti-capture situations are only for quite unbalanced armies (or very strange trades), but the issue of collapsing may cause other more subtle problems even in more normal situations.

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Apr 22^nd, 2008, 8:48pm

on 04/22/08 at 20:34:38, 99of9 wrote:

the issue of collapsing may cause other more subtle problems even in more normal situations.

Here's one (which according to linearAB is almost balanced):
EMHH4R vs ed8r (silver ahead by 0.34)
EMHH4R vs e8r (silver ahead by 0.28)

In this case gold may prefer some positional advantage worth more than 0.06 rather than killing the opponent's second strongest piece!!

Title: Re: Empirically derived material evaluators Part I
Post by IdahoEv on Apr 23^rd, 2008, 4:05am

Okay! You've convinced me. :-)

I still believe that understanding level collapse at some level is important, but it is clear that simple downward collapse will not suffice.

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Apr 23^rd, 2008, 5:39am

on 04/23/08 at 04:05:57, IdahoEv wrote:

I still believe that understanding level collapse at some level is important, but it is clear that simple downward collapse will not suffice.

Yes, I agree with that. DAPE accounts for it automatically because values only depend on the number of stronger and the number of equal pieces. FAME collapses up in a fancy way from memory?

Title: Re: Empirically derived material evaluators Part I
Post by Fritzlein on Apr 23^rd, 2008, 8:09am

on 04/23/08 at 05:39:32, 99of9 wrote:

FAME collapses up in a fancy way from memory?

Yes, FAME collapses the pieces up, and the rabbit value is divided by the amount of enemy material, so it goes up too as pieces are captured. One thing that FAME does generally right (although probably not accurately) is consider the same absolute material advantage to be worth more as the board empties out. One thing that FAME does generally wrong (and DAPE does right?) is fail to consider the each piece relative to all opposing pieces instead of only the piece it "lines up against".

Title: Re: Empirically derived material evaluators Part I
Post by 99of9 on Apr 23^rd, 2008, 8:42am

on 04/23/08 at 08:09:04, Fritzlein wrote:

One thing that FAME does generally right (although probably not accurately) is consider the same absolute material advantage to be worth more as the board empties out.

That is only good if the pieces that are emptying are higher in value than the place where the inequality is, or if the imbalance includes an imbalance in the number of pieces. FAME and DAPE do that right. If the pieces that are emptying are lower than the place where the inequality is, and the inequality is an equal numbered trade (say M for H), I believe the same absolute material advantage decreases in value, but both DAPE and FAME keep it constant.

Quote:

One thing that FAME does generally wrong (and DAPE does right?) is fail to consider the each piece relative to all opposing pieces instead of only the piece it "lines up against".

Yes, DAPE does this, but everything depends on the word "consider"... I'm sure there are better and worse ways to consider them.