Topic: Empirically derived material evaluators, part 1 (Read 7664 times)
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Empirically derived material evaluators, part
« Reply #15 on: Nov 13th, 2006, 12:57pm »
on Nov 13th, 2006, 12:05pm, IdahoEv wrote: I'm switching to the sigmoid-probability-estimate approach for future attempts to optimize coefficients in any case.
Excellent. While I realize using a sigmoid throws in another source of error (i.e. the appropriate curvature), it also opens up a much bigger source of useful data.

Quote: I have all 1357 in a spreadsheet if you want 'em.
I want 'em. yangfuli@yahoo.com

Quote: Functionally, it shouldn't matter, right?
Yes. It also doesn't matter to FAME (I think) which category of piece was missing, even though FAME collapses up instead of down.
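To make the sigmoid-probability-estimate idea concrete, here is a minimal sketch. The function name and the steepness parameter `k` are illustrative, not from IdahoEv's actual fitter; `k` plays the role of the "appropriate curvature" that Fritzlein notes is itself a new source of error.

```python
import math

def win_probability(material_score, k=0.5):
    """Map a material advantage to an estimated win probability.

    k (steepness/curvature) is a hypothetical free parameter; the
    thread notes that choosing it is itself a source of error.
    """
    return 1.0 / (1.0 + math.exp(-k * material_score))

# A level material state should be a coin flip:
print(win_probability(0.0))          # 0.5
# A large advantage should approach certainty:
print(win_probability(10.0) > 0.95)  # True
```

The payoff, as discussed above, is that every game becomes usable data: instead of only counting whether the favored side won, the fit can score how far the predicted probability was from the actual result.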
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators, part
« Reply #16 on: Nov 13th, 2006, 1:09pm »
on Nov 13th, 2006, 12:57pm, Fritzlein wrote: Excellent. While I realize using a sigmoid throws in another source of error (i.e. the appropriate curvature), it also opens up a much bigger source of useful data.
Almost the biggest annoyance is that the error function of the older system was so transparent: an error of 7514 meant that 7514 states were classified "incorrectly". Easy for humans to interpret.
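The contrast between the two error measures can be sketched as follows. The function names and toy data are illustrative; in particular the sigmoid-style error here uses a squared gap between predicted probability and actual result, which is one common choice, not necessarily the one used in these experiments.

```python
def count_error(predictions, outcomes):
    """Old-style transparent error: number of states classified
    incorrectly.  predictions are evaluator scores (sign = predicted
    winner); outcomes are +1 if the positively-scored side won, -1
    otherwise."""
    return sum(1 for p, o in zip(predictions, outcomes) if p * o <= 0)

def sigmoid_error(probs, outcomes):
    """Sigmoid-style error: squared gap between predicted win
    probability and the actual result (1 = win, 0 = loss)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes))

scores  = [2.0, -1.0, 0.5]   # toy evaluator outputs
results = [+1, +1, -1]       # toy game results
print(count_error(scores, results))   # 2 states misclassified
```

The count-based error is directly interpretable, as noted above, but it throws away information about *how wrong* each prediction was; the probability-based error keeps it.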
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Empirically derived material evaluators, part
« Reply #17 on: Nov 14th, 2006, 9:29pm »
on Nov 13th, 2006, 12:05pm, IdahoEv wrote: I have all 1357 in a spreadsheet if you want 'em.
OK, a look at the spreadsheet verifies what one would suspect from the coefficients: FAME likes heavy pieces while LinearAB likes many pieces.

Of the 1357 cases of disagreement, 85 have the same number of pieces on each side. For these, FAME wins 44 by preferring, for example, MR over HD. Of the remaining disagreement cases, only two had FAME favoring the side with more pieces while LinearAB favored the side with fewer pieces; those were both midgames where FAME barely preferred RRR over DC, and it was wrong both times. The other 1270 had FAME favoring the side with fewer pieces while LinearAB favored the side with more pieces. Of these, FAME wins only 556, or 44% of the disputed positions. We can break this down further by how many pieces fewer the side preferred by FAME had:

Deficit | Cases | Correct | Percent
   1    |  770  |   366   |   48%
   2    |  389  |   158   |   41%
   3    |  100  |    29   |   29%
   4+   |   11  |     3   |   27%

So the more pieces FAME is behind by when it claims to actually be winning, the less we should trust FAME. It's hard to see this as an issue of LinearAB overfitting on borderline cases so much as an issue of FAME fitting badly on extreme cases.

One thing that gives me pause about jumping on the bandwagon to overhaul FAME is that the A coefficient in LinearAB had such an outlier result in one trial.

on Nov 11th, 2006, 6:56pm, IdahoEv wrote: The three solutions were:

Error | A    | B     | C
9514  | 1.56 | 1.263 | 0.237
9514  | 3.12 | 1.261 | 0.203
9517  | 1.95 | 1.265 | 0.226
But now that I carefully re-read that post, I see that A and C are in fact working together to keep the value of the officers as a group in constant proportion to the value of the rabbits as a group. Indeed, if I am reading the results right, the curvature basically does not matter. What does matter is the ratio of big officers to small officers, and the ratio of the officers collectively to the rabbits collectively.

So I guess I'm now very open to the notion that FAME needs to value quantity more and quality less. This is quite amusing given how far FAME has already gone in that direction from what I used to think. If I now have to think a dog is worth less than two rabbits and a horse is worth less than three rabbits, the only thing I have left to hold on to is that a cat is worth more than a rabbit. Don't take that away from me, please!
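The deficit breakdown above can be re-derived from the raw counts; a quick sketch confirming the quoted percentages and the 556 total:

```python
# Cases where FAME favored the side with fewer pieces, keyed by the
# size of that side's piece deficit, with the number of those cases
# FAME called correctly (numbers taken from the table above).
cases   = {"1": 770, "2": 389, "3": 100, "4+": 11}
correct = {"1": 366, "2": 158, "3": 29, "4+": 3}

total_correct = sum(correct.values())
print(total_correct)                              # 556 of 1270
for d in cases:
    print(d, round(100 * correct[d] / cases[d]))  # 48, 41, 29, 27
```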
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators, part
« Reply #18 on: Nov 16th, 2006, 2:28pm »
Karl, fascinating analysis! I certainly didn't have the patience to analyze all those conflict states so closely. What makes me most glad is that we seem to be actually learning something from these experiments about what works in the real world. Also, I'm continually surprised by the relative strength of LinearAB as an evaluator; it was originally just a toss-off idea I used to test out my code.

on Nov 14th, 2006, 9:29pm, Fritzlein wrote: One thing that gives me pause about jumping on the bandwagon to overhaul FAME is that the A coefficient in LinearAB had such an outlier result in one trial.
The one you're pointing out wasn't LinearAB (which simply values all rabbits at 1 point); it was the extension that valued the rabbits on a curve (call it RabbitCurveABC for the moment). And yes, in RabbitCurveABC, B is constant but A and C tend to co-vary to keep the value of the officers more-or-less constant relative to the value of the first rabbits lost. I'm confident that it's the first rabbits lost, because if it were the rabbits collectively, A would find a constant solution. In RabbitCurveABC (like LinearAB), the rabbits as a whole are worth 8 points; it's only the distribution of those points among the rabbits that changes (by C). For a fixed B, A alone sets the collective value of the officers relative to the rabbits.

I see evidence of overfitting in RabbitCurveABC and some other algorithms, but not in LinearAB. What I mean is that the system very quickly settles into states that go unchanged for thousands of iterations before making sudden jump-shifts to another nearby state that gains just a few more "correct" cases. After the error gets down to 9600 or so, the behavior of the fitting system is decidedly non-smooth for most of these algorithms.

Quote: If I now have to think a dog is worth less than two rabbits and a horse is worth less than three rabbits, the only thing I have left to hold on to is that a cat is worth more than a rabbit. Don't take that away from me, please!
Well, my loyalty is always to the facts. But, despite my early skewed result last summer, I'm pretty confident the facts bear out that a cat is worth more than an initial rabbit.
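The plateau-and-jump behavior described above is exactly what a count-of-misclassified-states objective produces: as a coefficient varies smoothly, the error only changes at the discrete points where some state's classification flips. A toy illustration with made-up states and results:

```python
# Each state is (officer difference, rabbit difference); winners is
# the toy game result for each state (+1 / -1).  All data invented.
states  = [(3, -2), (1, -2), (4, -1)]
winners = [+1, -1, +1]

def errors(a):
    """Misclassification count for the evaluator a*officers + rabbits."""
    return sum(1 for (o, r), w in zip(states, winners)
               if (a * o + r) * w <= 0)

# Sweeping the coefficient: long flat stretches with sudden steps,
# so there is no useful gradient for the fitter to follow.
print([errors(a / 10) for a in range(0, 21)])
```

This is why the fitter can sit unchanged for thousands of iterations and then jump: between flip points, every candidate move looks exactly as good as staying put.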
PMertens
Forum Guru
Arimaa player #692
Posts: 437
Re: Empirically derived material evaluators, part
« Reply #19 on: Nov 16th, 2006, 4:13pm »
Fritzl did not use the word initial ... and I am quite positive that it is not the last rabbit which is the first to be worth more than a cat
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators, part
« Reply #20 on: Nov 16th, 2006, 5:00pm »
on Nov 16th, 2006, 4:13pm, PMertens wrote: Fritzl did not use the word initial ... and I am quite positive that it is not the last rabbit which is the first to be worth more than a cat
If you take my RabbitCurveABC results at face value (using the median result of several runs), a cat is worth 1.50 points, where the average rabbit is worth 1.0 points. The first rabbit lost is worth 0.39 points, and the last is worth 1.95 points. As it turns out, the seventh rabbit is worth 1.55 points, while the first three rabbits added together are worth 1.51 points.

Note that this system was outperformed by two others, including the one that valued all rabbits at a flat 1.0 points and cats at 1.24. But when allowed to vary the value of rabbits, rabbit #7 = one cat is what it settles on.
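The quoted numbers are consistent with a simple exponential curve. The functional form below is my inference from those numbers, not IdahoEv's actual code: assume the k-th rabbit lost is worth proportionally exp(C*k), take C near the fitted values quoted earlier (about 0.23), and normalize so the eight rabbits are collectively worth 8 points, as stated upthread.

```python
import math

C = 0.23  # roughly the fitted C from the RabbitCurveABC runs

# Value of the k-th rabbit lost (k = 0..7), normalized so the eight
# rabbits are collectively worth 8 points.
raw = [math.exp(C * k) for k in range(8)]
values = [8 * v / sum(raw) for v in raw]

print(round(values[0], 2))   # first rabbit lost: ~0.39
print(round(values[6], 2))   # seventh: ~1.55, about one cat (1.50)
print(round(values[7], 2))   # last: ~1.95
```

Under this reading, C alone sets how steeply rabbit value rises as rabbits disappear, while the collective 8 points stay fixed, which matches the description of the algorithm above.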
Microbe
Forum Newbie
Arimaa player #1977
Posts: 4
Re: Empirically derived material evaluators, part
« Reply #21 on: Nov 16th, 2006, 5:03pm »
Indeed. A rabbit on the last rank wins, so in many situations it is probably worth a cat. Maybe more. I obviously do not have the experience or ability to know this; it's just something that seems to make sense to me.
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators, part
« Reply #22 on: Nov 16th, 2006, 5:09pm »
I don't mean a rabbit on the 7th rank. I mean the cost of the 7th rabbit lost: what does one rabbit cost when you are down to only two?
jdb
Forum Guru
Arimaa player #214
Posts: 682
Re: Empirically derived material evaluators, part
« Reply #23 on: Nov 16th, 2006, 6:56pm »
Just an observation: a material evaluation score is only valid if there are no positional features that take precedence. For example, when analyzing a game, if there is a goal race going on, the material situation is no longer really relevant, so the analysis gleaned from the database at that stage of the game should probably not be used.
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Empirically derived material evaluators, part
« Reply #24 on: Nov 16th, 2006, 8:44pm »
on Nov 16th, 2006, 2:28pm, IdahoEv wrote: And yes, in RabbitCurveABC, B is constant but A and C tend to co-vary to keep the value of the officers more-or-less constant relative to the value of the first rabbits lost. I'm confident that it's the first rabbits lost, because if it were the rabbits collectively, A would find a constant solution.
Yes, you are quite right. The value of the officers is kept constant, by A and C together, relative to the value of the first few rabbits, not relative to the value of the rabbits as a whole. Even if there were some reason to keep the value of the officers constant relative to the value of all the rabbits, the system couldn't do it, because it is rewarded or punished once per material state that occurs. Since states with most rabbits on the board occur much more often than states with most rabbits off the board, the system will necessarily tune itself to the value of the first few rabbits rather than the last few.

In that sense, whatever coefficients you come up with will probably be wildly inaccurate in some endgames, for much the same reason that FAME and DAPE are wildly inaccurate in some endgames: we all tune our systems to deal with familiar situations first, and merely hope they deal with unfamiliar situations by extension.
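The once-per-occurrence point can be seen in a toy fit. All numbers here are invented for illustration: suppose the "right" value of a rabbit is 0.4 points in common early-game states but 1.9 in rare late-game states, and each state is scored once per occurrence in the database.

```python
# 900 common early-game observations, 100 rare endgame observations.
observations = [0.4] * 900 + [1.9] * 100

def total_error(v):
    """Squared error of a single fitted rabbit value v, scored once
    per occurrence -- so frequent states dominate the total."""
    return sum((v - obs) ** 2 for obs in observations)

# Minimizing squared error per occurrence gives the frequency-weighted
# mean, pulled almost entirely toward the common states:
best = sum(observations) / len(observations)
print(round(best, 2))   # 0.55: near the early-game 0.4, far from 1.9
```

So the fitted coefficient lands near the value that suits the frequent many-rabbit states, and the rare endgame states are badly served, which is exactly the tuning bias described above.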
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators, part
« Reply #25 on: Nov 16th, 2006, 10:13pm »
on Nov 16th, 2006, 8:44pm, Fritzlein wrote: In that sense, whatever coefficients you come up with will probably be wildly inaccurate in some endgames, for much the same reason that FAME and DAPE are wildly inaccurate in some endgames: We all tune our systems to deal with familiar situations first, and merely hope they deal with unfamiliar situations by extension.
Absolutely. The drawback of an empirical approach is that it is constrained to the data available, and there's no question these results are skewed by that. I suspect one reason DAPE does so well with adjusted coefficients (in the other post) is that, because every piece's value depends in some sense on the number of other pieces on the board, DAPE's coefficients can be adjusted so that the terms respond more appropriately in endgame situations.