|
||||||||||||||||||||||||||||||||||||||||
Title: Empirically derived material evaluators, part 1 Post by IdahoEv on Nov 8th, 2006, 9:47pm Now that I've resolved the "spooky" problem, I have my system attempting to optimize various material eval functions so that they can correctly predict who is winning mid-game in all games in the database. This post is the results for the first, most trivial material eval function. In general, the functions I'm working with have adjustable constants which are tweaked to maximize the number of states in the historical database for which they accurately predict the winner. They are penalized one point for each inaccurate prediction, including evaluating a any position as "tied" unless the pieces on each side are in fact functionally identical. (this fixes the "spooky" problem) Here are the criteria for games & material states I considered in all cases: 1) Both players rated over 1600 2) Game resulted in a win (no draws or unfinished games) 3) Game was rated 4) Game occurred after Jan 12, 2004 5) No takebacks 6) Material state was not tied 7) Material state was not transient (i.e. the state persisted at least 2 ply with no further captures) The first eval function Same one mentioned in the previous thread. Given a collapsed material state represented like 112228-112228 (all pieces alive) or 001033-001103 (EHDCRRRemrrr), those numbers represent the number of pieces in levels 4,3,2,1,0,and "R", respectively. If L is the number of a level and NL is the number of pieces in that level, and R is the number of rabbits, and A and B are adjustable coefficients, the score for each player is then: score = R + SUMover L(A*BL*NL) Basically, A is the ratio of the value of a cat to a rabbit, and B is the ratio of each level of officer to the level below it. Rabbits are worth one point each. Results I ran this two ways: first, considering only those games where both players were both bots or both humans (no HvB games) (query: 'wtype = btype'). This was to avoid the numerous games where the use of bait-and-tackle has meant an initial loss of a cat or dog led to a win. The results were very consistent over numerous runs. 10869 states were considered from 1929 games. In every run, the best fit failed to correctly classify 2028 states (18.65%) A = 1.1403 +/- 0.004 (n=10) B = 1.3347 +/- 0.035 (n=10) This implies:
second, considering all games including HvB games 51513 different states from 7913 games evaluated In all cases, the optimized function misclassified 9518 states (18.48%). A = 1.4113 +/- 0.0044 (n=7) B = 1.2757 +/- 0.0011 (n=7) These coefficients imply:
Analysis One thing that is interesting to note is that there are really very few states available to train the system to make fine distinctions. I seeded this optimizer with A=2.0 B=1.5, just as a random guess. Even with those unoptimized values, the formula only misclassifies 10210 states (19.82%). That means the thing can only improve its' classification on about 500 states in the history of the game by tweaking the coefficients from my guess. That's not a lot of material to work with. There's probably about 75-80% of the states that are so clear that the classification is nearly 100% with any system, and about 15-17% where there was a win from an obviously losing position, like coming back from a rabbit loss, and no material evaluator will ever get those positions "right". All we have to play with is the few percent in between. I was very surprised to see that the inclusion of HvB games increased the value of the cat relative to the rabbit, because of the large number of bait-and-tackle examples in the database. But most likely, those examples simply fall in the unclassifiable 17% and so can't really affect the outcome. Further analysis I leave to other folks. I'm working on fitting other functions now to see how much a more nuanced function can improve over this one. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Fritzlein on Nov 8th, 2006, 11:44pm I say again, very cool analysis. Thanks again for all the details you share in addition to the final results. I'm not sure that bait-and-tackle games are as numerous as you fear, but in any event, I agree that they shouldn't skew these numbers. Every material function will get those games wrong, so there will be no discrimination on that basis. I am interested whether FAME beats these two results. I am more interested what happens when the discrimination isn't occurring on only 1% of the material states due to agreement on the other 99% of the material states. This would involve expressing a magnitude of preference rather than just preference. Come to think of it, I am interested in whatever you do next. :-) |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Janzert on Nov 9th, 2006, 12:18am Interesting results. Just curious though why the following cut off? Quote:
It might be something obvious if I took a look back at the archives but don't have time at the moment and was wondering. Janzert |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 9th, 2006, 12:38am on 11/09/06 at 00:18:21, Janzert wrote:
Actually nothing in particular happened at that time that I'm aware of. The earliest games were before gameplay developed to anything like its current levels, so I wanted to include primarily more recent play in my analysis. More importantly, a few really weird games happened in early experimentation that I wanted to exclude. To achieve both goals I more-or-less arbitrarily picked a cutoff of gameid > 5000. As it turns out, game 5000 occurred on 2004-01-12. I suspect it makes very little real difference in the end. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 9th, 2006, 1:18am on 11/08/06 at 23:44:14, Fritzlein wrote:
Habit as a scientist. I almost wrote this like a paper, with introduction, materials & methods, conclusion, etc... Quote:
I intend to not only quantify FAME's performance compared to these as soon as I code it, but also to curve fit FAME's coefficients to see if its' performance can be improved just by tweaking the numbers. As I see it, FAME's current coefficients are 256 (level 1 value), 3 (the level 1:level 2 value ratio), 1.5 (level 2:3, 3:4, 4:5, 5:6 etc. value) and 600 (the denominator of the rabbit function). Also on tap for quantification * A modified version of this simple evaluator where the rabbit value is a curve function; later rabbits more valuable than early rabbits (This is running now; analysis expects to take about 6 hours of CPU time so I'll give you the results tomorrow.) * Another modification where the rabbit score, ranging from 0 to 1, is used as a multiplier vs. the officer score instead of adding to it. * The 99 system * A FAME-inspired system that checks the full matrix of possible piece interactions instead of just the columnwise straight-across pairings. * A couple other things I have in mind, possibly |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by 99of9 on Nov 9th, 2006, 6:25pm on 11/09/06 at 01:18:44, IdahoEv wrote:
Ahhh... you can skip the 99 system - it's had it's day. The next version of Gnobby has a material evaluator (called DAPE - depreciated arimaa piece evaluator), that sounds a little like your "full matrix" thingy. It has about 4 tuneable coefficients, and I think it's better than FAME :-). Unfortunately Gnobby won't be ready to enter 2007, so maybe I'll just have to release the material evaluator. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 9th, 2006, 7:12pm on 11/09/06 at 18:25:34, 99of9 wrote:
Unless you want to keep it to yourself to avoid handing secret tech to your opponent developers. :-) My matrix system has two coefficients for the officer evaluator; it's essentially just a second-order FAME that considers all interactions. However I haven't at all decided how to handle rabbits yet. The last 24 hours of experience with scanning the database tells me that there probably isn't a whole lot of point putting too much effort into improving material evaluators: the actual number of cases where a better material evaluator can improve on a naive evaluator is really very small. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by 99of9 on Nov 9th, 2006, 7:28pm on 11/09/06 at 19:12:29, IdahoEv wrote:
If it's only material, it doesn't matter much, for the reason you give below! Quote:
|
||||||||||||||||||||||||||||||||||||||||
Title: Part II Post by IdahoEv on Nov 11th, 2006, 6:56pm My second step was to start trying more versatile / intuitively accurate material eval functions to see if they could fit the existing data better. In summary, I've definitively learned that there aren't enough ambiguous or interesting material states in the database for any function to improve by much on even the naive evaluator i used in the previous results. In any case, here are the results. I used a slight modification of the naive evaluator above. Coefficients A and B have the same meaning. Rabbits are worth on *average* one point, and in total eight points. However, the value of the remaining rabbits is a function that reflects that the first rabbit or two lost is not as expensive as later ones. I'll spare you the precise details of the function, which is slightly messy, but with a single tunable parameter C to alter the shape, the function looks like this: http://idahoev.com/arimaa/images/rabbitcurve-small.png In the previous experiment, the optimizer consistently settled on a solution that misclassified 9517 states out of 51513. With the new one, after 24+ hours of CPU time, the optimizer found two formulations that slightly improved that number to 9514, and one that matched the original at 9517. (and two others that were worse!) In those three states, the coefficients A and C covaried; as the rabbit curve became more steeply inflected, the ratio of the value of a cat to the value of a rabbit changed, generally to increase the relative value of a cat as the value of an initial rabbit decreased. The three solutions were:
Basically what's clear is that the number of ambiguous states is so small that the system is over-fitting the parameters to attempt to grab only one or two more categorizable states. Because the values of A of the two "best" solutions are so dramatically different, I'm not even going to try to rationalize a real meaning to these resuts. So, that's as far as I will pursue this scoring the system in terms of a point for each accurate prediction of which player is ahead. I will finish coding FAME just to see if it scores similarly to these simple functions, but then will move on to a system that tries to guage the probability of win / relative advantage, as suggested by Fritzlein above. |
||||||||||||||||||||||||||||||||||||||||
Title: Part III - FAME Post by IdahoEv on Nov 12th, 2006, 11:46am With a bit of help from Fritz in chat last night, I've coded FAME into my system. With unmodified coefficients, FAME correctly categorizes the set with 9747 errors, slightly more than the formulae above. However, it's worth pointing out that I've seen the above systems settle into a solution with exactly 9747 errors before later moving down to 9717, and in a couple of runs some formulae got stuck on 9747. That tells me that most likely there are actually 9747 cases where the player that was ahead lost, but that 30 of those cases are borderline and can be nabbed by overfitting. So I'll move on to probabilities and see what that does. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Part III - FAME Post by Fritzlein on Nov 12th, 2006, 3:36pm on 11/12/06 at 11:46:08, IdahoEv wrote:
Wait, earlier you said 9517, so it is 230 cases, not 30. Quite a few to blame on overfitting, eh? It might be worth drilling down on what exact material relationship is being flip-flopped. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Part III - FAME Post by IdahoEv on Nov 12th, 2006, 3:58pm on 11/12/06 at 15:36:06, Fritzlein wrote:
That it might. I'll do that tonight if I have energy left after work. Leave it to the mathematician to correct my arithmetic. I am terrible at math of any kind. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 13th, 2006, 3:35am Well, it isn't simple. In the database as queried described above, there appear 1357 states where the naive evaluator as optimized above (A~1.41, B~1.28 ) and FAME disagree on who is winning. Of those disagreements, in 757 cases the naive evaluator gives the advantage to the eventual winner and in 600 cases FAME gives the advantage to the eventual winner, a difference of 157. Critical: none of those 1357 states occurred more than once in the DB, so we don't have statistical strength on any of them. That 157 doesn't quite add up to 229 (the difference in score above). The scoring function also docks a point to an algorithm if it claims a tie between silver and gold unless the collapsed material state is in fact exactly tied. Both score a few of these, FAME more because its coefficients are rounded. These penalties for declaring ties balance out to make the difference of 229. In any case, with 1357 cases of disagreement I'm not going to list them all. I'll put just a couple of examples here.
The disagreed states where FAME was most confident but ended up mispredicting the outcome involved elephant losses by a player who eventually won. (!) Possibly those are games where the winning player had an extra step, and decided to suicide the elephant for fun while takingthe rabbit to goal? I'll find out which games these states come from. Maybe in future runs I will need to exclude material states that occur in the last ply or two of a game. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Fritzlein on Nov 13th, 2006, 8:36am on 11/13/06 at 03:35:01, IdahoEv wrote:
Oh, shouldn't it just dock half a point? If you forced it to flip a coin, it would still be right half the time... Quote:
This is fascinating. Thanks for digging out some examples. It is too bad that all of the cases of disagreement occur only once. (Although this seems like a warning flag for a bug: wouldn't you expect at least one case to have occurred twice?) If there is a wide variety of swing cases, it makes it harder to blame the victory of your system on overfitting. How could such a range of mismatched material states all be caught by tweaking two variables, unless in fact something systematic was going on? Anyway, taking it at face value, it looks like FAME may simply be over-valuing camels and undervaluing rabbits. The first case shows a cat and two rabbits beating a camel, with HDR also gone from each side. The third case again shows a cat and two rabbits beating a camel, with HD gone from each side. The fourth case has been collapsed, unless there was an elephant trade :P, so we don't know which category is missing, but in the eyes of the IdahoEv eval, three rabbits and a cat beat a horse and a dog in a somewhat traded-down situation. The stronger pieces might actually have been a camel plus dog if the horses were gone, or a camel plus horse if it was the dogs missing. We need to know more before this is conclusive, but I am open to the possibility that the results are not due to overfitting so much as a general mistake by FAME to think a small number of big pieces outweigh a larger number of small pieces. Or rather, call it a systematic bias. Perhaps the losing players in those swing games just don't know how to handle their heavy pieces. ;-) |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 13th, 2006, 12:05pm on 11/13/06 at 08:36:20, Fritzlein wrote:
Probably, but we're talking a hundred cases or so out of 50,000+. I'm switching to the sigmoid-probability-estimate approach for future attemts to optimize coefficients in any case. Quote:
I have all 1357 in a spreadsheet if you want 'em. :-) Quote:
Not really. A couple hundred states do occur more than once. 112227-112228 occurs several hundred times. The evaluators simply agree on those. States that occur more than once are just a minority, and it doesn't surprise me that the evaluators get them correct because the evals were designed from intuition based on experience with the states we are familiar with. Note - the AB Naive evaluator is not to be called "The IdahoEv Evaluator" because that is something different that is forthcoming. :-) Your logic is sound. In any case the fact that the AB+rabbit curve eval would find the same 9517 solution and then transition (more than once) to a 9514 solution, but with different coefficients, smelled of an overfit to me. Perhaps a solution can yet be found that is better than 9514. I refuse to accept (yet) that the AB eval is the best algorithm to be had. Quote:
Functionally, it shouldn't matter, right? |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Fritzlein on Nov 13th, 2006, 12:57pm on 11/13/06 at 12:05:15, IdahoEv wrote:
Excellent. I realize using a sigmoid throws in another source of error (i.e. the appropriate curvature), it does also open up a much bigger source of useful data. Quote:
I want 'em. yangfuli@yahoo.com Quote:
Yes, it also doesn't matter to FAME (I think) which category of piece was missing, even though FAME is collapsing up instead of down. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 13th, 2006, 1:09pm on 11/13/06 at 12:57:40, Fritzlein wrote:
Almost the biggest annoyance is that the error function of the older system was so transparent. An error of 7514 meant that 7514 states were classified "incorrectly". Easy for humans to interpret. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Fritzlein on Nov 14th, 2006, 9:29pm on 11/13/06 at 12:05:15, IdahoEv wrote:
OK, a look at the spreadsheet verifies what one would suspect from the coefficients. FAME likes heavy pieces while LinearAB likes many pieces. Of the 1357 cases of disagreement, 85 have the same number of pieces on each side. For these, FAME wins 44 by preferring, for example, MR over HD. Of the remaining disagreement cases only two had FAME favoring the side with more pieces while LinearAB favored the side with fewer pieces. Those were both midgames where FAME barely preferred RRR over DC, and was wrong both times. The other 1270 had FAME favoring the side with fewer pieces while LinearAB favored the side with more pieces. Of these, FAME only wins 556, or 44% of the disputed positions. We can further break this down by how many pieces fewer the side preferred by FAME had:
It's hard to see this as a issue of LinearAB overfitting on borderline cases so much as an issue of FAME badly fitting on extreme cases. One thing that give me pause in jumping on the bandwagon to overhaul FAME is that the A coefficient in LinearAB had such a outlier result in one trial. on 11/11/06 at 18:56:19, IdahoEv wrote:
But now that I carefully re-read that post, I see that A and C are in fact working together to keep the value of the officers as a group in constant proportion to the value of rabbits as a group. Indeed, if I am reading the results right, the curvature basically does not matter. What does matter is the ratio of big officers to small officers, and the ratio of the officers collectively to the rabbits collectively. So I guess I'm now very open to the notion that FAME needs to value quantity more and quality less. This is quite amusing given how far FAME has already gone in that direction from what I used to think. If I now have to think a dog is worth less than two rabbits and a horse is worth less than three rabbits, the only thing I have left to hold on to is that a cat is worth more than a rabbit. Don't take that away from me, please! |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 16th, 2006, 2:28pm Karl, Fascinating analysis! I certainly didn't have the patience to analyze all those conflict states so closely. What makes me the most glad is that we seem to be actually learning something from these experiments about what works in the real world. Also, I'm continually surprised by the relative strength of LinearAB as an evaluator. It was originally just a toss-off idea I used to test out my code. on 11/14/06 at 21:29:38, Fritzlein wrote:
The one you're pointing out wasn't LinearAB (which simply evaluates all rabbits as worth 1 point), it was the extension which gave values the rabbits on a curve function (Call it RabbitCurveABC for the moment). And yes, in RabbitCurveABC, B is constant but A and C tend to co-vary to keep the value of the officers more-or-less constant relative to the value of the first rabbits lost. I'm confident that it's the first rabbits lost, because if it were the rabbits collectively, A would find a constant solution. In RabbitCurveABC (like LinearAB), the rabbits as a whole are worth 8 points; it's only the distribution of those points among the rabbits that changes (by C). For a fixed B, A alone sets the collective value of the officers relative to rabbits. I see evidence of overfitting in RabbitCurveABC and some other algorithms, but not in LinearAB. What I mean by that is that the system very quickly settles into states that will go unchanged for thousands of iterations before making sudden jump-shifts to another nearby state that will gain them just a few more "correct" cases. After the error gets down to 9600 or so, the behavior of the fitting system is decided non-smooth for most of these algorithms. Quote:
Well, my loyalty is always to the facts. But, despite my early skewed result last summer I'm pretty confident that the facts bear out that a cat is worth more than an initial rabbit. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by PMertens on Nov 16th, 2006, 4:13pm Fritzl did not use the word initial ... and I am quite positive that it is not the last rabbit which is the first to be worth more than a cat ;-) |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 16th, 2006, 5:00pm on 11/16/06 at 16:13:58, PMertens wrote:
If you take my RabbitCurveABC results at face value (and using the median result of several runs) a cat is worth 1.50 points (where the average rabbit is worth 1.0 points). The first rabbit lost is worth 0.39 points, and the last is worth 1.95 points. As it turns out, the seventh rabbit is worth 1.55 points. Meanwhile, the first three rabbits added together are worth 1.51 points. Note that this system was outperformed by two others ... including the one that valued all rabbits at a flat 1.0 points and cats at 1.24. But when allowed to vary the value of rabbits, rabbit #7 = one cat is what it settles on. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Microbe on Nov 16th, 2006, 5:03pm Indeed. A rabbit on the last rank wins, so in many situations it is probably worth a cat. Maybe more. I obviously do not have the experience or ability to know this, just somehting that seems to make sense to me. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 16th, 2006, 5:09pm I don't mean a rabbit on the 7th rank. I mean the cost of the 7th rabbit lost: what does one rabbit cost when you are down to only two? |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by jdb on Nov 16th, 2006, 6:56pm Just an observation, A material evaluation score is only valid if there are no positional features that take precedence. For example, when analyzing a game, if there is a goal race going on, the material situation is no longer really relevant. So the analysis gleaned from the database at that stage of the game should probably not be used. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by Fritzlein on Nov 16th, 2006, 8:44pm on 11/16/06 at 14:28:00, IdahoEv wrote:
Yes, you are quite right. The value of the officers is kept constant relative to the value of the first few rabbits, not relative to the value of the rabbits as a whole, by A and C together. Even if there were some reason to keep the value of the officers constant relative to the value of all the rabbits, the system couldn't do it, because it is being rewarded or punished once per material state that occurs. Since the states with most rabbits on the board occur much more often than the states with most rabbits off the board, the system will necessarily tune itself to the value of the first few rabbits rather than the last few. In that sense, whatever coefficients you come up with will probably be wildly inaccurate in some endgames, for much the same reason that FAME and DAPE are wildly inaccurate in some endgames: We all tune our systems to deal with familiar situations first, and merely hope they deal with unfamiliar situations by extension. |
||||||||||||||||||||||||||||||||||||||||
Title: Re: Empirically derived material evaluators, part Post by IdahoEv on Nov 16th, 2006, 10:13pm on 11/16/06 at 20:44:42, Fritzlein wrote:
Absolutely. The drawback of an empirical approach is that it is constrained to the data available, and there's no question these results are necessarily skewed by that. I suspect that one of the reasons DAPE does so well with adjusted coefficients (in the other post) is that because every piece's value depends in some sense on the number of other pieces on the board, DAPE's coefficients can be adjusted in such a way that the terms respond more appropriately in endgame situations. |
||||||||||||||||||||||||||||||||||||||||
Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1! YaBB © 2000-2003. All Rights Reserved. |