Author |
Topic: Empirically derived material evaluators, part 1 (Read 7657 times) |
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Empirically derived material evaluators, part 1
« on: Nov 8th, 2006, 9:47pm » |
Quote Modify
|
Now that I've resolved the "spooky" problem, I have my system attempting to optimize various material eval functions so that they can correctly predict who is winning mid-game in all games in the database. This post is the results for the first, most trivial material eval function. In general, the functions I'm working with have adjustable constants which are tweaked to maximize the number of states in the historical database for which they accurately predict the winner. They are penalized one point for each inaccurate prediction, including evaluating a any position as "tied" unless the pieces on each side are in fact functionally identical. (this fixes the "spooky" problem) Here are the criteria for games & material states I considered in all cases: 1) Both players rated over 1600 2) Game resulted in a win (no draws or unfinished games) 3) Game was rated 4) Game occurred after Jan 12, 2004 5) No takebacks 6) Material state was not tied 7) Material state was not transient (i.e. the state persisted at least 2 ply with no further captures) The first eval function Same one mentioned in the previous thread. Given a collapsed material state represented like 112228-112228 (all pieces alive) or 001033-001103 (EHDCRRRemrrr), those numbers represent the number of pieces in levels 4,3,2,1,0,and "R", respectively. If L is the number of a level and NL is the number of pieces in that level, and R is the number of rabbits, and A and B are adjustable coefficients, the score for each player is then: score = R + SUMover L(A*BL*NL) Basically, A is the ratio of the value of a cat to a rabbit, and B is the ratio of each level of officer to the level below it. Rabbits are worth one point each. Results I ran this two ways: first, considering only those games where both players were both bots or both humans (no HvB games) (query: 'wtype = btype'). This was to avoid the numerous games where the use of bait-and-tackle has meant an initial loss of a cat or dog led to a win. The results were very consistent over numerous runs. 10869 states were considered from 1929 games. In every run, the best fit failed to correctly classify 2028 states (18.65%) A = 1.1403 +/- 0.004 (n=10) B = 1.3347 +/- 0.035 (n=10) This implies: Rabbit: | 1.00 | Cat: | 1.14 | Dog: | 1.52 | Horse: | 2.03 | Camel: | 2.71 | Phant: | 3.62 | second, considering all games including HvB games 51513 different states from 7913 games evaluated In all cases, the optimized function misclassified 9518 states (18.48%). A = 1.4113 +/- 0.0044 (n=7) B = 1.2757 +/- 0.0011 (n=7) These coefficients imply: Rabbit: | 1.00 | Cat: | 1.41 | Dog: | 1.80 | Horse: | 2.41 | Camel: | 3.21 | Phant: | 4.29 | Analysis One thing that is interesting to note is that there are really very few states available to train the system to make fine distinctions. I seeded this optimizer with A=2.0 B=1.5, just as a random guess. Even with those unoptimized values, the formula only misclassifies 10210 states (19.82%). That means the thing can only improve its' classification on about 500 states in the history of the game by tweaking the coefficients from my guess. That's not a lot of material to work with. There's probably about 75-80% of the states that are so clear that the classification is nearly 100% with any system, and about 15-17% where there was a win from an obviously losing position, like coming back from a rabbit loss, and no material evaluator will ever get those positions "right". All we have to play with is the few percent in between. I was very surprised to see that the inclusion of HvB games increased the value of the cat relative to the rabbit, because of the large number of bait-and-tackle examples in the database. But most likely, those examples simply fall in the unclassifiable 17% and so can't really affect the outcome. Further analysis I leave to other folks. I'm working on fitting other functions now to see how much a more nuanced function can improve over this one.
|
« Last Edit: Nov 8th, 2006, 9:49pm by IdahoEv » |
IP Logged |
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
|
|
Re: Empirically derived material evaluators, part
« Reply #1 on: Nov 8th, 2006, 11:44pm » |
Quote Modify
|
I say again, very cool analysis. Thanks again for all the details you share in addition to the final results. I'm not sure that bait-and-tackle games are as numerous as you fear, but in any event, I agree that they shouldn't skew these numbers. Every material function will get those games wrong, so there will be no discrimination on that basis. I am interested whether FAME beats these two results. I am more interested what happens when the discrimination isn't occurring on only 1% of the material states due to agreement on the other 99% of the material states. This would involve expressing a magnitude of preference rather than just preference. Come to think of it, I am interested in whatever you do next.
|
|
IP Logged |
|
|
|
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
|
|
Re: Empirically derived material evaluators, part
« Reply #2 on: Nov 9th, 2006, 12:18am » |
Quote Modify
|
Interesting results. Just curious though why the following cut off? Quote: 4) Game occurred after Jan 12, 2004 |
| It might be something obvious if I took a look back at the archives but don't have time at the moment and was wondering. Janzert
|
« Last Edit: Nov 9th, 2006, 12:19am by Janzert » |
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Re: Empirically derived material evaluators, part
« Reply #3 on: Nov 9th, 2006, 12:38am » |
Quote Modify
|
on Nov 9th, 2006, 12:18am, Janzert wrote:Interesting results. Just curious though why the following cut off? (Game occurred after January 12, 2004) It might be something obvious if I took a look back at the archives but don't have time at the moment and was wondering. |
| Actually nothing in particular happened at that time that I'm aware of. The earliest games were before gameplay developed to anything like its current levels, so I wanted to include primarily more recent play in my analysis. More importantly, a few really weird games happened in early experimentation that I wanted to exclude. To achieve both goals I more-or-less arbitrarily picked a cutoff of gameid > 5000. As it turns out, game 5000 occurred on 2004-01-12. I suspect it makes very little real difference in the end.
|
|
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Re: Empirically derived material evaluators, part
« Reply #4 on: Nov 9th, 2006, 1:18am » |
Quote Modify
|
on Nov 8th, 2006, 11:44pm, Fritzlein wrote:I say again, very cool analysis. Thanks again for all the details you share in addition to the final results. |
| Habit as a scientist. I almost wrote this like a paper, with introduction, materials & methods, conclusion, etc... Quote:I am interested whether FAME beats these two results. |
| I intend to not only quantify FAME's performance compared to these as soon as I code it, but also to curve fit FAME's coefficients to see if its' performance can be improved just by tweaking the numbers. As I see it, FAME's current coefficients are 256 (level 1 value), 3 (the level 1:level 2 value ratio), 1.5 (level 2:3, 3:4, 4:5, 5:6 etc. value) and 600 (the denominator of the rabbit function). Also on tap for quantification * A modified version of this simple evaluator where the rabbit value is a curve function; later rabbits more valuable than early rabbits (This is running now; analysis expects to take about 6 hours of CPU time so I'll give you the results tomorrow.) * Another modification where the rabbit score, ranging from 0 to 1, is used as a multiplier vs. the officer score instead of adding to it. * The 99 system * A FAME-inspired system that checks the full matrix of possible piece interactions instead of just the columnwise straight-across pairings. * A couple other things I have in mind, possibly
|
« Last Edit: Nov 9th, 2006, 1:21am by IdahoEv » |
IP Logged |
|
|
|
99of9
Forum Guru
Gnobby's creator (player #314)
Gender:
Posts: 1413
|
|
Re: Empirically derived material evaluators, part
« Reply #5 on: Nov 9th, 2006, 6:25pm » |
Quote Modify
|
on Nov 9th, 2006, 1:18am, IdahoEv wrote: * The 99 system * A FAME-inspired system that checks the full matrix of possible piece interactions instead of just the columnwise straight-across pairings. |
| Ahhh... you can skip the 99 system - it's had it's day. The next version of Gnobby has a material evaluator (called DAPE - depreciated arimaa piece evaluator), that sounds a little like your "full matrix" thingy. It has about 4 tuneable coefficients, and I think it's better than FAME . Unfortunately Gnobby won't be ready to enter 2007, so maybe I'll just have to release the material evaluator.
|
« Last Edit: Nov 9th, 2006, 6:26pm by 99of9 » |
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Re: Empirically derived material evaluators, part
« Reply #6 on: Nov 9th, 2006, 7:12pm » |
Quote Modify
|
on Nov 9th, 2006, 6:25pm, 99of9 wrote: The next version of Gnobby has a material evaluator (called DAPE - depreciated arimaa piece evaluator) ... Unfortunately Gnobby won't be ready to enter 2007, so maybe I'll just have to release the material evaluator. |
| Unless you want to keep it to yourself to avoid handing secret tech to your opponent developers. My matrix system has two coefficients for the officer evaluator; it's essentially just a second-order FAME that considers all interactions. However I haven't at all decided how to handle rabbits yet. The last 24 hours of experience with scanning the database tells me that there probably isn't a whole lot of point putting too much effort into improving material evaluators: the actual number of cases where a better material evaluator can improve on a naive evaluator is really very small.
|
|
IP Logged |
|
|
|
99of9
Forum Guru
Gnobby's creator (player #314)
Gender:
Posts: 1413
|
|
Re: Empirically derived material evaluators, part
« Reply #7 on: Nov 9th, 2006, 7:28pm » |
Quote Modify
|
on Nov 9th, 2006, 7:12pm, IdahoEv wrote:Unless you want to keep it to yourself to avoid handing secret tech to your opponent developers. |
| If it's only material, it doesn't matter much, for the reason you give below! Quote:The last 24 hours of experience with scanning the database tells me that there probably isn't a whole lot of point putting too much effort into improving material evaluators: the actual number of cases where a better material evaluator can improve on a naive evaluator is really very small. |
|
|
|
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Part II
« Reply #8 on: Nov 11th, 2006, 6:56pm » |
Quote Modify
|
My second step was to start trying more versatile / intuitively accurate material eval functions to see if they could fit the existing data better. In summary, I've definitively learned that there aren't enough ambiguous or interesting material states in the database for any function to improve by much on even the naive evaluator i used in the previous results. In any case, here are the results. I used a slight modification of the naive evaluator above. Coefficients A and B have the same meaning. Rabbits are worth on *average* one point, and in total eight points. However, the value of the remaining rabbits is a function that reflects that the first rabbit or two lost is not as expensive as later ones. I'll spare you the precise details of the function, which is slightly messy, but with a single tunable parameter C to alter the shape, the function looks like this: In the previous experiment, the optimizer consistently settled on a solution that misclassified 9517 states out of 51513. With the new one, after 24+ hours of CPU time, the optimizer found two formulations that slightly improved that number to 9514, and one that matched the original at 9517. (and two others that were worse!) In those three states, the coefficients A and C covaried; as the rabbit curve became more steeply inflected, the ratio of the value of a cat to the value of a rabbit changed, generally to increase the relative value of a cat as the value of an initial rabbit decreased. The three solutions were: Error | A | B | C | 9514 | 1.56 | 1.263 | 0.237 | 9514 | 3.12 | 1.261 | 0.203 | 9517 | 1.95 | 1.265 | 0.226 | Basically what's clear is that the number of ambiguous states is so small that the system is over-fitting the parameters to attempt to grab only one or two more categorizable states. Because the values of A of the two "best" solutions are so dramatically different, I'm not even going to try to rationalize a real meaning to these resuts. So, that's as far as I will pursue this scoring the system in terms of a point for each accurate prediction of which player is ahead. I will finish coding FAME just to see if it scores similarly to these simple functions, but then will move on to a system that tries to guage the probability of win / relative advantage, as suggested by Fritzlein above.
|
|
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Part III - FAME
« Reply #9 on: Nov 12th, 2006, 11:46am » |
Quote Modify
|
With a bit of help from Fritz in chat last night, I've coded FAME into my system. With unmodified coefficients, FAME correctly categorizes the set with 9747 errors, slightly more than the formulae above. However, it's worth pointing out that I've seen the above systems settle into a solution with exactly 9747 errors before later moving down to 9717, and in a couple of runs some formulae got stuck on 9747. That tells me that most likely there are actually 9747 cases where the player that was ahead lost, but that 30 of those cases are borderline and can be nabbed by overfitting. So I'll move on to probabilities and see what that does.
|
|
IP Logged |
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
|
|
Re: Part III - FAME
« Reply #10 on: Nov 12th, 2006, 3:36pm » |
Quote Modify
|
on Nov 12th, 2006, 11:46am, IdahoEv wrote:However, it's worth pointing out that I've seen the above systems settle into a solution with exactly 9747 errors before later moving down to 9717, and in a couple of runs some formulae got stuck on 9747. |
| Wait, earlier you said 9517, so it is 230 cases, not 30. Quite a few to blame on overfitting, eh? It might be worth drilling down on what exact material relationship is being flip-flopped.
|
« Last Edit: Nov 12th, 2006, 8:28pm by Fritzlein » |
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Re: Part III - FAME
« Reply #11 on: Nov 12th, 2006, 3:58pm » |
Quote Modify
|
on Nov 12th, 2006, 3:36pm, Fritzlein wrote: It might be worth drilling down on what exact material relationship is being flip-flopped. |
| That it might. I'll do that tonight if I have energy left after work. Leave it to the mathematician to correct my arithmetic. I am terrible at math of any kind.
|
|
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Re: Empirically derived material evaluators, part
« Reply #12 on: Nov 13th, 2006, 3:35am » |
Quote Modify
|
Well, it isn't simple. In the database as queried described above, there appear 1357 states where the naive evaluator as optimized above (A~1.41, B~1.28 ) and FAME disagree on who is winning. Of those disagreements, in 757 cases the naive evaluator gives the advantage to the eventual winner and in 600 cases FAME gives the advantage to the eventual winner, a difference of 157. Critical: none of those 1357 states occurred more than once in the DB, so we don't have statistical strength on any of them. That 157 doesn't quite add up to 229 (the difference in score above). The scoring function also docks a point to an algorithm if it claims a tie between silver and gold unless the collapsed material state is in fact exactly tied. Both score a few of these, FAME more because its coefficients are rounded. These penalties for declaring ties balance out to make the difference of 229. In any case, with 1357 cases of disagreement I'm not going to list them all. I'll put just a couple of examples here. State | Naive score | FAME score | Who won | 101127-111115 | 0.4797 | -76 | white | 012127-111124 | 1.55 | -205 | black | 111116-101128 | -0.4797 | 80 | black | 011215-010128 | -0.3113 | 113 | black | In a few examples (like #2 above), FAME outperforms the naive eval when an elephant has been lost but the other side has more pieces. This is probably because the naive evaluator scores phants as B*camelvalue, while FAME (more correctly) values the elephant much higher. I noticed several of these states where FAME has the advantage. The disagreed states where FAME was most confident but ended up mispredicting the outcome involved elephant losses by a player who eventually won. (!) Possibly those are games where the winning player had an extra step, and decided to suicide the elephant for fun while takingthe rabbit to goal? I'll find out which games these states come from. Maybe in future runs I will need to exclude material states that occur in the last ply or two of a game.
|
« Last Edit: Nov 13th, 2006, 3:39am by IdahoEv » |
IP Logged |
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
|
|
Re: Empirically derived material evaluators, part
« Reply #13 on: Nov 13th, 2006, 8:36am » |
Quote Modify
|
on Nov 13th, 2006, 3:35am, IdahoEv wrote:That 157 doesn't quite add up to 229 (the difference in score above). The scoring function also docks a point to an algorithm if it claims a tie between silver and gold unless the collapsed material state is in fact exactly tied. |
| Oh, shouldn't it just dock half a point? If you forced it to flip a coin, it would still be right half the time... Quote: State | Naive score | FAME score | Who won | 101127-111115 | 0.4797 | -76 | white | 012127-111124 | 1.55 | -205 | black | 111116-101128 | -0.4797 | 80 | black | 011215-010128 | -0.3113 | 113 | black | |
| This is fascinating. Thanks for digging out some examples. It is too bad that all of the cases of disagreement occur only once. (Although this seems like a warning flag for a bug: wouldn't you expect at least one case to have occurred twice?) If there is a wide variety of swing cases, it makes it harder to blame the victory of your system on overfitting. How could such a range of mismatched material states all be caught by tweaking two variables, unless in fact something systematic was going on? Anyway, taking it at face value, it looks like FAME may simply be over-valuing camels and undervaluing rabbits. The first case shows a cat and two rabbits beating a camel, with HDR also gone from each side. The third case again shows a cat and two rabbits beating a camel, with HD gone from each side. The fourth case has been collapsed, unless there was an elephant trade , so we don't know which category is missing, but in the eyes of the IdahoEv eval, three rabbits and a cat beat a horse and a dog in a somewhat traded-down situation. The stronger pieces might actually have been a camel plus dog if the horses were gone, or a camel plus horse if it was the dogs missing. We need to know more before this is conclusive, but I am open to the possibility that the results are not due to overfitting so much as a general mistake by FAME to think a small number of big pieces outweigh a larger number of small pieces. Or rather, call it a systematic bias. Perhaps the losing players in those swing games just don't know how to handle their heavy pieces.
|
|
IP Logged |
|
|
|
IdahoEv
Forum Guru
Arimaa player #1753
Gender:
Posts: 405
|
|
Re: Empirically derived material evaluators, part
« Reply #14 on: Nov 13th, 2006, 12:05pm » |
Quote Modify
|
on Nov 13th, 2006, 8:36am, Fritzlein wrote: Oh, shouldn't it just dock half a point? If you forced it to flip a coin, it would still be right half the time... |
| Probably, but we're talking a hundred cases or so out of 50,000+. I'm switching to the sigmoid-probability-estimate approach for future attemts to optimize coefficients in any case. Quote:This is fascinating. Thanks for digging out some examples. |
| I have all 1357 in a spreadsheet if you want 'em. Quote:(Although this seems like a warning flag for a bug: wouldn't you expect at least one case to have occurred twice?) |
| Not really. A couple hundred states do occur more than once. 112227-112228 occurs several hundred times. The evaluators simply agree on those. States that occur more than once are just a minority, and it doesn't surprise me that the evaluators get them correct because the evals were designed from intuition based on experience with the states we are familiar with. Note - the AB Naive evaluator is not to be called "The IdahoEv Evaluator" because that is something different that is forthcoming. Your logic is sound. In any case the fact that the AB+rabbit curve eval would find the same 9517 solution and then transition (more than once) to a 9514 solution, but with different coefficients, smelled of an overfit to me. Perhaps a solution can yet be found that is better than 9514. I refuse to accept (yet) that the AB eval is the best algorithm to be had. Quote:The stronger pieces might actually have been a camel plus dog if the horses were gone, or a camel plus horse if it was the dogs missing. |
| Functionally, it shouldn't matter, right?
|
|
IP Logged |
|
|
|
|