Topic: Empirically derived material evaluators Part II (Read 11717 times) 

IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Empirically derived material evaluators Part II
« on: Nov 16^{th}, 2006, 4:43pm » 

I tested the ability of nine material eval functions to give a probability estimate of who was winning games in the historical database. The best performer was DAPE, but not when using its default coefficients.

Each algorithm produces a score that is positive for a gold advantage and negative for a silver advantage. I convert this number into a probability of a gold win with this formula:

P_gold_win = 1.0 / (1.0 + exp(-score/K))

and in each case I curve-fit the constant K to generate the best score for that particular material eval algorithm. The optimized K can be viewed as just a measure of the scale of that algorithm.

The error measured is the sum-squared difference between the estimated P_gold_win and 1 (for a gold win) or 0 (for a silver win) over all states that appear in the database, subject to the same criteria I defined in the last posting (players rated 1600+, exchanges only measured at the end of the exchange, etc.). The fit optimizer I used attempts to minimize that error, so algorithms which more accurately predict who is winning will have a lower error value. For some of the algorithms, I also performed a best-fit on the coefficients of the algorithm itself.

Here's a summary of the results, from worst to best; I'll explain the particulars of each one afterwards. The "Error +/-" column gives the standard deviation of the Total Error column over 5 runs.
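In code, the probability conversion and the error being minimized look roughly like this (a minimal sketch in Python; the function names are mine, and `states` stands in for the database of material states):

```python
import math

def p_gold_win(score, K):
    """Logistic conversion of a material score (positive = gold advantage)
    into a probability that gold wins, scaled by the fitted constant K."""
    return 1.0 / (1.0 + math.exp(-score / K))

def total_error(states, K):
    """Sum-squared difference between the predicted probability and the
    actual outcome: 1.0 for a gold win, 0.0 for a silver win."""
    return sum((p_gold_win(score, K) - outcome) ** 2
               for score, outcome in states)
```

The optimizer then fits K (and, for some algorithms, the evaluator's own coefficients) to minimize `total_error` over all states in the database.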
RESULTS TABLE

Algorithm                                  K      Total Error  Error +/-
Count Pieces                               1.40   2223.27      0.026
LinearAB (Default Coefficients)            3.13   2145.69      0.007
DAPE (Default Coefficients)                3.90   2124.19      0.000
FAME (Default Coefficients)                2.92   2042.81      0.002
FAMEX (Optimized Coefficients)             6.06   1908.44      0.085
FAME (Optimized Coefficients)              5.58   1895.01      0.037
RabbitCurveABC (Optimized Coefficients)    2.82   1891.21      0.646
LinearAB (Optimized Coefficients)          1.77   1889.10      0.015
DAPE (Optimized Coefficients)              3.931  1871.10      1.063

The DAPE algorithm was clearly the best performer, but only once the coefficients were optimized by gradient descent over the database, and those optimized coefficients were decidedly unusual. The standard deviations of the errors show that even though I did 5 runs of the optimization for each algorithm, often with slightly different results, the overall performance of the different algorithms listed here is non-overlapping. There was some variance in the performance of LinearAB, for example, but it always beat RabbitCurveABC and always lost to DAPE (Optimized Coefficients).

ALGORITHM RUNDOWN

Count Pieces

Just what you'd expect; the score is +1.0 point for each piece gold has and -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline: most of the time, a player with more pieces is ahead. The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.

LinearAB (Default Coefficients)

As described in my previous post, this is a simple increasing value for each functional rank of pieces. Rabbits are worth 1.0 point, A is the value of a cat, and B is the ratio of each higher officer level to the last: cats are worth A points, dogs are worth A*B points, horses are worth A*B*B points, etc.
(With the exception that these points are actually applied to collapsed functional levels: if cats have been eliminated from the game entirely, dogs are worth A points, horses are worth A*B points, etc.) Here the default coefficients (my original guess last week) are A = 2.0 (a cat is worth two rabbits) and B = 1.5 (a dog is worth 1.5 cats, etc.). I intended from the start for the "official version" of these coefficients to be optimized automatically, but I put this here as an illustration of the guessing ability of humans. As with the other human guesses about piece value, it performed miserably.

DAPE (Default Coefficients)

DAPE implemented exactly as on Janzert's calculator, including the constant division to make the initial rabbit worth 1.0. The only thing I varied was the K value in order to optimize the probability function. The other coefficients are: A=30, S=0.375, E=0.3, AR=200.0, BR=0.4, AN=200.0, BN=0.75.

FAME (Default Coefficients)

FAME implemented exactly as on Janzert's calculator, including the constant division to make the initial rabbit worth 1.0. The only thing I varied was the K value in order to optimize the probability function. The coefficients are: Level_1=256.0, Level_2=87.0, A=1.5, B=600, C=2.0 (where A is the ratio of Level 2 to Level 3, and residual rabbits are scored at B/(R+C*P) for opposing rabbits R and opposing officers P).

FAMEX (Optimized Coefficients)

FAMEX is a slight variation on FAME I tried, for which the rabbit score is applied to all the rabbits a player has, instead of the "residual rabbits" left after others are used to fill in against opposing officers. It did not improve on FAME, so I'll spare you further analysis for now.

FAME (Optimized Coefficients)

With the values of Level 1 and Level 2 fixed at 256 and 87, I allowed the other three coefficients A, B, and C to float and the optimizer to find the best values for them.
The results were a fairly consistent and dramatic improvement over the default FAME coefficients, and the resulting coefficients were also quite consistent. They were: A = 1.249 +/- 0.005, B = 900.1 +/- 14.13, C = 0.96 +/- 0.023.

The most dramatic result is that C is 2.0 in Fritzlein's formulation but 0.96 when optimized. My best-guess explanation is this: C attempts to devalue rabbits in inverse proportion to the number of opposing officers. However, in FAME, opponents with many officers already punish the rabbit score by consuming some of the rabbits in the "filling in" part of the function, preventing those rabbits from being scored at all. My guess is that in the average case this "filling in" effect does more than is necessary to punish a player's rabbit score for being behind in officers, and so C drops below 1.0 to compensate. FAMEX was an attempt to test this hypothesis, and while it did result in a much higher C value (~4.9), it didn't improve on FAME's ability to score the database.

RabbitCurveABC (Optimized Coefficients)

Described in the previous post, this is identical to LinearAB except that the 8 points of the rabbits as a whole are distributed according to this function: http://idahoev.com/arimaa/images/rabbitcurve.png

The coefficient A scales the value of a cat to the average rabbit, and B scales subsequent officers to cats. B was very consistent at 1.316 +/- 0.003, but A and C covaried and ranged pretty widely. C seems to prefer values from 0.2 to 0.3, indicating some usefulness of the curve shape, but A varied inversely to C and needed to range wider to compensate, from ~0.5 to ~2.5! I believe A is attempting to scale the value of the officers relative to the value of the first rabbits lost rather than the average rabbit lost, especially given how well-represented those states are in the database. Since the value of the initial rabbit lost is very dependent on C, A is extremely sensitive to that value.
LinearAB (Optimized Coefficients)

The optimized values: A = 1.241 +/- 0.004, B = 1.316 +/- 0.002.

I'm continually surprised at how well this evaluator has performed. I originally implemented it as a toss-off algorithm just to test my optimizer. In particular, the fact that it simply values all rabbits at 1.0 point seems incredibly naive to me, as I believe that the 8th rabbit lost should be much more expensive than the 1st rabbit lost. The empirical optimizer seems to disagree. What can I say?

DAPE (Optimized Coefficients)

With coefficients optimized by my system, DAPE outperformed the other algorithms, and not by a small amount! How much of this is due to DAPE being a better description of Arimaa's mechanics (pieces evaluated with respect to how many other pieces they can beat) and how much is due to the fact that DAPE has seven tweakable coefficients is not clear. While the end performance was fairly consistent, the actual coefficients varied quite a bit; each time I ran the optimizer it found a slightly different solution for DAPE. With seven interdependent coefficients, there are bound to be many local minima in the error function. Here are the coefficients (using Toby's terminology from this post and the code on Janzert's calculator) at the end of the best-performing optimization run:

A = 27.9307, S = 0.7833, E = 0.4282, AR = 0.46, BR = 1.0683, AN = 428.7, BN = 0.9755 (Error = 1870.34)

The higher values of S and E (relative to the defaults) are telling us that, as with FAME, having a larger number of pieces is more important than having stronger pieces (relative to what we thought, anyhow). AN and AR are not described in 99of9's original post on DAPE, but they are alluded to later in the thread and can be found in the code on Janzert's FAME/DAPE web page. The value of AR is the coefficient of the "low-rabbits punisher" function. That value is fascinating, and it was not particular to this run.
In every run I performed, the optimizer immediately reduced AR to something very near zero, in fact going slightly negative in 4 out of 5 runs. Essentially, the low-rabbits punisher with any positive value hurts the ability of DAPE to accurately score the database, so the optimizer eliminated it. However, the overall "low piece punisher" function was reliably deemed more important than Toby had originally estimated: AN optimized to 434.8 +/- 53.0 (compared to the default of 200). So, performance-wise, DAPE is the best algorithm we have so far, but it needs better coefficients, and the "low rabbits" function could be removed entirely without reducing DAPE's ability to predict the outcome.

ANALYSIS

What's most interesting to me is how large an improvement algorithms like LinearAB, FAME, and DAPE get when an automatic system tweaks their coefficients. What this says to me is that our intuition about the value of the pieces, despite years of experience and thousands of games played, is still not quite correct: we humans aren't very good at guessing.
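As an illustration, the collapsed-level LinearAB evaluator described above, with the optimized coefficients A = 1.241 and B = 1.316, can be sketched like this (my reading of the description, not code from the thread; piece counts are indexed rabbit through elephant):

```python
def linear_ab(gold, silver, A=1.241, B=1.316):
    """Collapsed-level LinearAB score (positive favors gold).
    gold/silver: counts as [rabbits, cats, dogs, horses, camels, elephants].
    Officer ranks absent from the whole board are collapsed out, so the
    weakest surviving officer type is always worth A rabbits."""
    score = 1.0 * (gold[0] - silver[0])    # every rabbit is worth 1.0 point
    value = A                              # value of the weakest live officer rank
    for rank in range(1, 6):               # cat .. elephant
        if gold[rank] + silver[rank] == 0:
            continue                       # rank eliminated entirely: collapse it
        score += value * (gold[rank] - silver[rank])
        value *= B                         # each higher surviving rank is B times the last
    return score
```

With full armies on both sides the score is 0; if all cats are gone, dogs are valued at A points, matching the collapsing rule above.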





Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928


Re: Empirically derived material evaluators Part I
« Reply #1 on: Nov 16^{th}, 2006, 9:56pm » 

Thanks for running all the optimizations and giving a detailed description of each. I will respond more thoughtfully soon, but my first question is: which of the systems is the IdahoEv System? My vote would be for the optimized LinearAB to take a place next to the "official" FAME and DAPE scores on Janzert's material calculator. Or maybe you aren't done yet?





IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Re: Empirically derived material evaluators Part I
« Reply #2 on: Nov 16^{th}, 2006, 10:06pm » 

on Nov 16^{th}, 2006, 9:56pm, Fritzlein wrote: Which of the systems is the IdahoEv System? My vote would be for the optimized linearAB to take a place next to the "official" FAME and DAPE scores on Janzert's material calculator. Or maybe you aren't done yet? 
 I have an entirely separate one that I'm working on, but I haven't figured out how I want to score rabbits yet. I'll call it my official one if it outperforms LinearAB. Of course, I probably won't stop tweaking it until it does.





Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928


Re: Empirically derived material evaluators Part I
« Reply #3 on: Nov 19^{th}, 2006, 9:50pm » 

Again, I want to come back to this in more detail, but I thought I would throw out an observation: shouldn't CurveABC always perform at least as well as LinearAB? If nothing else, it could set C=0 and then it is linear. If it is performing worse, then your optimizer is getting caught in local minima that aren't as good as the ones LinearAB is finding; in other words, curving the rabbits makes it harder to make good predictions. Now, I absolutely believe that the last few rabbits are worth more than the first few, so curving the rabbits must be better than having them linear. But since the results of curving are worse, that casts some doubt on the whole procedure. Something I don't understand is affecting the results, and I'll be suspicious until I can think of a good explanation.





IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Re: Empirically derived material evaluators Part I
« Reply #4 on: Nov 20^{th}, 2006, 3:23am » 

Sadly, being the terrible mathematician I am, the first function I found for RabbitCurveABC that had the shape I wanted is discontinuous at C=0 (it has a zero denominator somewhere, I believe; I don't have it in front of me at the moment). So RabbitCurveABC actually can't set C=0. Stupid, I know, and if any mathematicians present want to craft a better curve function (hint), I'd be happy to run it.

But I don't think that's what's going on in any case. Given the observed interdependence of A and C, I suspect the local-minimum issue is what's really happening. I probably could have run it a dozen more times, or annealed more slowly, and found better solutions. In "Part I", with the simpler error function, RabbitCurveABC generally did outperform LinearAB when run long enough, but it also often fell into local minima that were slightly worse. In any case, I was trying to be more-or-less scientifically honest, so I performed 5 runs for each and left it at that for now. I suspect RabbitCurveABC could slightly outperform LinearAB if given enough time, but since optimized DAPE did such a lovely job I wanted to move on to other things rather than spend another 6 hours of CPU time trying to squeeze a few more points out of an algorithm I didn't like much anyway.





Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928


Re: Empirically derived material evaluators Part I
« Reply #5 on: Nov 20^{th}, 2006, 9:41pm » 

on Nov 20^{th}, 2006, 3:23am, IdahoEv wrote:if any mathematicians present want to craft a better curve function (hint) i'd be happy to run it. 
A pretty standard family of curves from (0,0) to (1,1) is f(x) = x^p where p>0. Since you are trying to get from (0,0) to (8,8), you could use f(x) = 8*(x/8)^p. Then f(8), i.e. the value of all 8 rabbits, is always 8; f(1) is the value of having one rabbit rather than zero, whereas 8 - f(7) is the value of having eight rabbits rather than seven. When p=1 it is a straight line with all rabbits equally valued; p<1 suggests the last few rabbits on the board are worth more than the first few to be captured, while p>1 suggests the reverse. With this parameter you can start the rabbits out linear and let the line curve in either direction. I would laugh if the optimizer got stuck on a curve in the wrong direction! But one possible explanation is that later rabbits are already getting promoted in value when you collapse the strong pieces downward, so the optimizer has to compensate by curving the other way.
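Fritzlein's family of curves in code, for the curious (a sketch; the function name is mine):

```python
def rabbit_value(x, p=1.0):
    """Total value of holding x rabbits on the curve f(x) = 8*(x/8)**p.
    f(1) is the value of the last rabbit on the board; 8 - f(7) is the
    value of the eighth.  p=1 values all rabbits equally, p<1 inflates
    the last few remaining rabbits, p>1 inflates the first few captured."""
    return 8.0 * (x / 8.0) ** p
```

For example, at p = 0.5 the last rabbit (f(1) is about 2.83) is worth several times the eighth (8 - f(7) is about 0.52), while at p = 1 both are exactly 1.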

« Last Edit: Nov 20^{th}, 2006, 9:43pm by Fritzlein » 



IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Re: Empirically derived material evaluators Part I
« Reply #6 on: Nov 20^{th}, 2006, 11:39pm » 

99of9 asked me to rerun the analysis on DAPE, but constraining the system to only those games where the opponents were within 100 rating points of each other. I did so; here are the results. I'm showing the full results for all 5 runs, both without (first) and with that constraint.

Without the constraint (these are the same results summarized above, but all 5 runs are shown instead of only the best one), 51513 different states from 7913 games evaluated:

Err        A        S       E       AR     BR      AN      BN      K
1870.3447  27.9307  0.7833  0.4282  0.46   1.0683  428.7   0.9755  4.7811
1872.9452  28.0058  0.7924  0.5608  5.64   0.0848  412.24  0.9746  4.5903
1871.0219  31.156   0.7874  0.5253  0.88   0.9979  434.96  0.974   5.0967
1870.6683  31.0662  1.1669  0.6471  55.96  0.976   377.52  0.9854  3.7867
1870.507   28.1805  0.8544  0.452   1.74   0.8613  520.76  0.9818  4.6596

With the additional constraint (ABS(wrating - brating) < 100), 18563 different states from 2967 games evaluated:

Err        A        S       E       AR      BR      AN       BN      K
1020.1945  31.2871  2.1109  0.0433  139.74  0.9758  422.7    0.354   3.1555
1020.0298  29.3184  2.1287  0.0221  174.32  0.9829  1195.58  0.293   2.9082
1019.6417  27.5985  0.8862  0.0631  36.14   0.9221  457.68   0.984   5.4417
1019.7013  29.1201  1.2575  0.0688  89.56   0.9708  363.58   0.9892  4.4402
1019.7303  29.3904  0.8043  0.0466  37.66   0.9267  505.8    0.9821  6.2032

The changes I see: AR/BR is now a factor, but E is not, and AN/BN now varies much more widely, as does S. I'll let 99 provide the interpretation for why this might be and/or what these numbers mean, but this does demonstrate that the empirical approach is noticeably dependent on *which* historical games you are studying.
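The additional constraint is just a pre-filter on which games contribute states; a sketch with hypothetical field names (the real database layout isn't shown in this thread):

```python
# Hypothetical records; only the two rating fields matter for the filter.
games = [
    {"id": 1, "wrating": 1712, "brating": 1650},   # 62-point gap: kept
    {"id": 2, "wrating": 1900, "brating": 1605},   # 295-point gap: dropped
]
even_games = [g for g in games if abs(g["wrating"] - g["brating"]) < 100]
```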

« Last Edit: Nov 20^{th}, 2006, 11:39pm by IdahoEv » 



Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928


Re: Empirically derived material evaluators Part I
« Reply #7 on: Nov 20^{th}, 2006, 11:44pm » 

Ah, constraining the games to even matches is a good idea I didn't think of. If the games are lopsided in rating it might encourage the optimizer to overreact along the lines of predicting that whoever takes the first piece will win. If the stronger player almost always takes the first piece, it suddenly looks like a huge advantage...





IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Re: Empirically derived material evaluators Part I
« Reply #8 on: Nov 21^{st}, 2006, 12:44am » 

on Nov 20^{th}, 2006, 11:44pm, Fritzlein wrote:If the stronger player almost always takes the first piece, it suddenly looks like a huge advantage... 
 Do we have a sense of how true this is?





IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Re: Empirically derived material evaluators Part I
« Reply #9 on: Nov 21^{st}, 2006, 3:23am » 

on Nov 20^{th}, 2006, 9:41pm, Fritzlein wrote: A pretty standard family of curves from (0,0) to (1,1) is f(x) = x^p where p>0. Since you are trying to get from (0,0) to (8,8), you could use f(x) = 8*(x/8)^p. 
 Thank you, Fritz! I wrote RabbitCurve2ABC, which uses this function. Here's a plot of the function, for the curious. It has a noticeably different shape than the one I used in RabbitCurveABC: the new one more closely approximates a linear region, with the first rabbits lost all costing a similar amount, followed by a sudden increase in value for the last rabbits. It seems reasonable that it might more closely fit the actual behavior of Arimaa games.

Tested on the same data, RabbitCurve2ABC was an improvement on RabbitCurveABC but only a tiny improvement on LinearAB:

Algorithm                                 K     Total Error  Error +/-
RabbitCurveABC (Optimized Coefficients)   2.82  1891.21      0.646
LinearAB (Optimized Coefficients)         1.77  1889.10      0.015
RabbitCurve2ABC (Optimized Coefficients)  1.80  1889.06      0.004

Moreover, it essentially made itself into LinearAB: A = 1.2427 +/- .0096, B = 1.3185 +/- .0010, C = 1.0205 +/- .0092.

« Last Edit: Nov 21^{st}, 2006, 3:24am by IdahoEv » 



99of9
Forum Guru
Gnobby's creator (player #314)
Gender:
Posts: 1413


Re: Empirically derived material evaluators Part I
« Reply #10 on: Nov 21^{st}, 2006, 4:17am » 

on Nov 20^{th}, 2006, 11:39pm, IdahoEv wrote:The changes I see  AR/BR is now a factor, but E is not. AN/BN now varies much more widely, as does S. I'll let 99 provide the interpretation for why this might be and/or what these numbers mean, but this does demonstrate that the empirical approach is noticeably dependent on *which* historical games you are studying. 
 Great, I like these results much more. Even the E value of 0.06 is not unfeasible. Thanks for all this work, Idaho; it certainly gives a different perspective on material eval. I think the fact that DAPE can fit your data better than the other methods is mostly due to the fact that it has extra fitting parameters, and only partly due to the functional form. This is borne out by the fact that the parameters are so sensitive to the training set, and are still so different from my intuitive values. (Not that my intuitive values are necessarily good... but the optimized ones still don't look quite like the Arimaa I play.) ...but maybe familiarity will convince me. I'd appreciate it if Janzert could put Idaho's 3rd set (the one with the lowest error) into his calculator side by side with my coefficients. Interestingly, of the 5 you got, I agree with your error function that the 3rd one is best (because it has the lowest BR).





IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405


Re: Empirically derived material evaluators Part I
« Reply #11 on: Nov 21^{st}, 2006, 8:15am » 

on Nov 21^{st}, 2006, 4:17am, 99of9 wrote: ... but, maybe familiarity will convince me. 
 And I'll rerun them periodically as more games accumulate in the DB, and we can experiment with different constraints on the games considered. As it is, the most recent results are using a game DB that is two weeks old. I have not updated my db since I ran the first experiment, because I wanted the number of states to remain constant so the error functions would be comparable. Quote:I'd appreciate if Janzert could put Idaho's 3rd set (lowest error) into his calculator side by side with my coefficients. 
 Incidentally, by using the K values listed and the sigmoid formula above, Janzert could add a probability output to all three, so it would show, for example, "FAME thinks gold is ahead by 1.53 (74% win probability)". For the unaltered FAME and DAPE algorithms, use the K values for "FAME (default coefficients)" and "DAPE (default coefficients)" from the first post in this thread. In those cases K was the only variable that was empirically fit, so it essentially functions as a measure of the scale of FAME/DAPE's output.
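Concretely, a sketch of that display (the function name is mine; note that with the default-coefficient FAME K of 2.92 from the results table, a score of +1.53 comes out near 63%, so the percentage shown depends on which K is used):

```python
import math

def win_probability(score, K):
    # Logistic conversion from the first post: positive score favors gold.
    return 1.0 / (1.0 + math.exp(-score / K))

p = win_probability(1.53, 2.92)  # K for FAME (default coefficients)
print(f"FAME thinks gold is ahead by 1.53 ({p:.0%} win probability)")
# prints: FAME thinks gold is ahead by 1.53 (63% win probability)
```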





Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016


Re: Empirically derived material evaluators Part I
« Reply #12 on: Nov 21^{st}, 2006, 9:16pm » 

Ok, I added DAPE with the constants 99of9 mentioned to the calculator. One thing that struck me is how close it is in evaluating rabbits to FAME. Janzert

« Last Edit: Nov 21^{st}, 2006, 9:41pm by Janzert » 



99of9
Forum Guru
Gnobby's creator (player #314)
Gender:
Posts: 1413


Re: Empirically derived material evaluators Part I
« Reply #13 on: Nov 21^{st}, 2006, 9:33pm » 

on Nov 21^{st}, 2006, 9:16pm, Janzert wrote:Ok, I added DAPE with the constants 99of9 mentioned to the calculator. 
 Thanks Janzert! Quote:One thing that struck me is how close it is in evaluating rabbits. 
 Yeah, it's amazingly good agreement when all 3 methods value one rabbit as 1.000. Hey Idaho: since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening, could you access your actual stats and tell us how often each has won (say, for the rating-diff < 100 criterion)? I'll make a bold prediction that the guy with the horse wins more often.





Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016


Re: Empirically derived material evaluators Part I
« Reply #14 on: Nov 21^{st}, 2006, 9:42pm » 

hehe, oops forgot the words "to FAME" the first time. Janzert





