Topic: Empirically derived material evaluators Part II (Read 11743 times)
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #15 on: Nov 22nd, 2006, 2:17am »
on Nov 21st, 2006, 9:42pm, Janzert wrote: oops forgot the words "to FAME"
Fame! I'm gonna live forever
I'm gonna learn how to fly
High!
I feel it coming together
People will see me and cry
Fame! I'm gonna make it to heaven
Light up the sky like a flame
Fame! I'm gonna live forever
Baby remember my name
Remember Remember Remember Remember...

Or was that not what you meant...
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Empirically derived material evaluators Part I
« Reply #16 on: Nov 22nd, 2006, 9:51am »
on Nov 21st, 2006, 3:23am, IdahoEv wrote:

Algorithm | K | Total Error | Error +/-
RabbitCurveABC (Optimized Coefficients) | 2.82 | 1891.21 | 0.646
LinearAB (Optimized Coefficients) | 1.77 | 1889.10 | 0.015
RabbitCurve2ABC (Optimized Coefficients) | 1.80 | 1889.06 | 0.004

Moreover, it essentially made itself into LinearAB:
A = 1.2427 +/- .0096
B = 1.3185 +/- .0010
C = 1.0205 +/- .0092
Ha, ha! Yes, when the rabbit curve was allowed to bend itself either way, it decided to stay essentially straight, but what curvature it does have goes the wrong way! Losing the first rabbit apparently hurts you more than losing the seventh rabbit...

This result has to make us stop and think: to what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated? The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question. It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #17 on: Nov 22nd, 2006, 2:26pm »
on Nov 22nd, 2006, 9:51am, Fritzlein wrote: To what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated? The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question. It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.
Indeed. I think the likely issue is the overrepresentation of early losses in the DB. Since there are many, many examples of a single rabbit loss in the DB - and since those equate to roughly a 55/45 win/loss split - the optimizer has to work very hard to generate a 0.55 output for that case in order to minimize the error.

On the other hand, there are fewer cases where someone has, say, lost five rabbits. And in most of those cases they've probably lost a few officers as well and are thoroughly losing the game. So all the optimizer needs to do is punish them for having lost a lot of pieces and produce a probability near 1.0 or 0.0 as appropriate, and very little error will accumulate for those cases. This is the inevitable drawback of an empirical approach.

At the same time, one could argue that this is correct as far as a bot's material evaluator is concerned in any case. As with any player, once you're way ahead on material it matters less whether you're correctly evaluating an additional exchange of D vs. RC; correctly evaluating the early exchanges is far more important. Put differently, correctly fitting your evaluator to the cases that appear in real games may be more useful in the practical sense of winning real games than an abstract concept of "true piece value".

Especially since material evaluation alone is somewhat specious in Arimaa and is always subject to position, it may be that certain material states are more likely to be relevant than others simply because positional constraints prevent the others from actually appearing. For example, how important is it for a material eval to "correctly" evaluate M vs. RRRR ... if that exchange never actually occurs in practice? What if, in fact, it cannot occur because reasonable play simply will not create the positional circumstances for it? Then of course an empirically-derived evaluator will not necessarily evaluate that state correctly ... but maybe that doesn't actually matter.

Evaluating the increased value of the seventh rabbit lost may be a moot point if the evaluator can already tell you've lost with 99% confidence three pieces earlier. And if losing seven rabbits has only happened 10 times in a training set of 50,000 states, we wouldn't expect the optimizer to put much effort into fitting that state well; maybe the fact that it doesn't care about that state is telling us that we shouldn't worry about it either, because it's not a relevant distinction to make in terms of winning games.

I have some thoughts about the difference between the two different DAPE optimizations that I'll share when I have a moment.
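To put some arithmetic behind the "very little error will accumulate" point, here's a rough, purely illustrative sketch. It assumes a sum-of-squared-error objective and made-up state labels; it is not the actual optimizer code, just a toy showing how the common one-rabbit cases dominate the total.

Code:
# Rough sketch (illustrative only): why frequent material states dominate
# the total fitting error. Assumes a sum-of-squared-error objective over
# (state, outcome) pairs; the real optimizer's loss may differ in detail.

def total_error(cases, predict_win_prob):
    """cases: list of (material_state, gold_won) pairs from the game DB.
    predict_win_prob: evaluator mapping a material state to P(gold wins)."""
    return sum((predict_win_prob(s) - (1.0 if won else 0.0)) ** 2
               for s, won in cases)

# Toy data: 5000 one-rabbit-down cases won ~45% of the time, versus
# 10 five-rabbits-down cases that are essentially always lost.
common = [("down 1 rabbit", i % 100 < 45) for i in range(5000)]
rare = [("down 5 rabbits + officers", False) for _ in range(10)]

flat = lambda s: 0.5                                      # ignores material
tuned = lambda s: 0.45 if s == "down 1 rabbit" else 0.01

print(total_error(common, flat), total_error(common, tuned))  # ~1250 vs ~1238
print(total_error(rare, flat), total_error(rare, tuned))      # 2.5 vs 0.001
# The rare cases can contribute at most a couple of points of error either
# way, so the fit is driven almost entirely by the common one-rabbit cases.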
« Last Edit: Nov 22nd, 2006, 2:30pm by IdahoEv »
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #18 on: Nov 22nd, 2006, 2:42pm »
Sudden thought: now that the optimizer is scoring win probability rather than just "who is winning", we can no longer treat the cat loss at the beginning of a bait-and-tackle game as a simple "incorrect state" that no evaluator will get right, as we did in the analysis of Part I. Instead, the optimizer will be desperately trying to reduce the probability error of the hundreds of examples of that technique, and that will definitely throw a wrench into the works. I bet that ABS(wrating-brating)<100 eliminates the vast majority of bait-and-tackle cases from the dataset.
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #19 on: Nov 27th, 2006, 10:14pm »
Fritzl: to answer your question from chat last week, Optimized LinearAB prefers RH to M, in fact by quite a lot. Opt. LinearAB values (as optimized above, anyway) initial pieces as:

R: 1.0
C: 1.24
D: 1.54
H: 1.91
M: 2.38

LinearAB is strictly additive, so HR = 2.91. More than half a rabbit better than M. "Discuss".
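A strictly additive evaluator like this is trivial to write down. Here's a minimal sketch using the values quoted above; the names are mine (not from the actual optimizer code), and the elephant is omitted because no value is quoted for it here.

Code:
# Minimal sketch of a strictly additive material score using the optimized
# LinearAB per-piece values quoted above. Names are illustrative; the
# elephant is omitted because no value is quoted for it here.
PIECE_VALUE = {"R": 1.0, "C": 1.24, "D": 1.54, "H": 1.91, "M": 2.38}

def material_score(pieces):
    """pieces: iterable of piece letters one side still has on the board."""
    return sum(PIECE_VALUE[p] for p in pieces)

print(material_score("HR"))  # 2.91
print(material_score("M"))   # 2.38 -- more than half a rabbit less than H+R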
99of9
Forum Guru
Gnobby's creator (player #314)
Posts: 1413
Re: Empirically derived material evaluators Part I
« Reply #20 on: Nov 27th, 2006, 10:35pm »
on Nov 21st, 2006, 9:33pm, 99of9 wrote: Hey Idaho. Since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening, could you access your actual stats and tell us how often each has won (say for the Rating Diff <100 criterion)? I'll make a bold prediction that the guy with the horse wins more often.
When you get a chance you could try this for M vs HR as well... but more tellingly you should try M vs DR, because Opt. LinearAB even favours the DR there, which is totally whacky.
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #21 on: Nov 28th, 2006, 12:36am »
The problem, 99, is that the data for exactly those specific states is pretty sparse. We can estimate the value of RR by how often it wins games, but the specific state of H traded for RR is pretty rare. But here's what we've got.

Using the same data set I used for training (both ratings > 1600, id > 5000), only including HvH and BvB games (not HvB):

112228-112226 (gold ahead two rabbits): 52 occurrences, gold wins 45 of them (86.5%)
112226-112228 (silver ahead two rabbits): 49 occurrences, silver wins 37 of them (75.5%)
Combined: RR advantage wins 82/101, or 81.2%

112228-111228 (gold ahead one horse): 78 occurrences, gold wins 54 of them (69.2%)
111228-112228 (silver ahead one horse): 82 occurrences, silver wins 67 of them (81%)
Combined: H advantage wins 121/160, or 75.6%

112226-111228 (gold down RR, silver down H): 6 occurrences, gold wins 2, silver wins 4 (67% silver win)
111228-112226 (gold down H, silver down RR): 6 occurrences, gold wins 5 (83%), silver wins 1 (16%)

Repeating the analysis, additionally constraining ABS(wrating-brating) < 100 (again only including HvH and BvB, not HvB):

112228-112226 (gold ahead two rabbits): 16 occurrences, gold wins 14 of them (87.5%)
112226-112228 (silver ahead two rabbits): 20 occurrences, silver wins 15 of them (75.0%)
Combined: RR advantage wins 29/36, or 80.6%

112228-111228 (gold ahead one horse): 31 occurrences, gold wins 21 of them (67.7%)
111228-112228 (silver ahead one horse): 27 occurrences, silver wins 19 of them (70.4%)
Combined: H advantage wins 40/58, or 69.0%

112226-111228 (gold down RR, silver down H): 1 occurrence, gold wins 0, silver wins 1
111228-112226 (gold down H, silver down RR): 3 occurrences, gold wins 2 (67%), silver wins 1 (33%)
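For anyone who wants to reproduce this kind of tally, the filtering looks roughly like the sketch below. The table and column names (states, winner, state, wrating, brating, wtype, btype, id) are guesses at the schema based on the constraints described above, not the actual database layout.

Code:
# Rough sketch of the tally described above. The schema is an assumption
# inferred from the constraints quoted in this thread, not the real layout.
import sqlite3

def win_counts(db_path, state, max_rating_gap=None):
    """Count wins by side for one material-state code, e.g. '112228-112226'."""
    sql = ("SELECT winner, COUNT(*) FROM states "
           "WHERE wrating > 1600 AND brating > 1600 AND id > 5000 "
           "AND wtype = btype "      # HvH and BvB games only, no HvB
           "AND state = ?")
    params = [state]
    if max_rating_gap is not None:
        sql += " AND ABS(wrating - brating) < ?"
        params.append(max_rating_gap)
    sql += " GROUP BY winner"
    with sqlite3.connect(db_path) as conn:
        return dict(conn.execute(sql, params))

# e.g. win_counts("arimaa.db", "112228-112226", max_rating_gap=100)
# might return something like {"gold": 14, "silver": 2}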
« Last Edit: Nov 28th, 2006, 4:47pm by IdahoEv »
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Empirically derived material evaluators Part I
« Reply #22 on: Nov 28th, 2006, 8:23am »
on Nov 28th, 2006, 12:36am, IdahoEv wrote:
112226-111228 (gold down RR, silver down H): 6 occurrences, gold wins 2, silver wins 4 (67% silver win)
111228-112226 (gold down H, silver down RR): 6 occurrences, gold wins 5 (83%), silver wins 1 (16%)
Based on the spreadsheet you sent me earlier, I rather expected this. If RR weren't beating H most of the time (plus other similar stuff) then LinearAB wouldn't have settled on the coefficients it chose. Can you post the game id's of those twelve games? It might be instructive to see exactly how the extra rabbits prove advantageous.
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #23 on: Nov 28th, 2006, 12:57pm »
on Nov 27th, 2006, 10:35pm, 99of9 wrote: When you get a chance you could try this for M vs HR as well... but more tellingly you should try M vs DR, because Opt. LinearAB even favours the DR there, which is totally whacky.
Well, as it turns out that is in fact what the data say when considering all games with both ratings over 1600 (i.e. the dataset I trained LinearAB on).

Using the larger dataset (wrating > 1600 AND brating > 1600 AND wtype=btype):

Summary:
Camel advantage wins: 38/44 (86.4%)
HR advantage wins: 61/65 (91.0%)
DR advantage wins: 62/70 (88.6%)
Camel traded for HR: player with the camel wins 12/27 (44%)
Camel traded for DR: player with the camel wins 1/3 (33%)

State | N | gold wins | silver wins
112228-102228 | 20 | 16 (80%) | 4 (20%)
102228-112228 | 24 | 2 (8.3%) | 22 (91.7%)
112228-111227 | 30 | 27 (90%) | 3 (10%)
111227-112228 | 31 | 2 (6.5%) | 29 (93.5%)
112228-112127 | 29 | 24 (82.8%) | 5 (17.2%)
112127-112228 | 41 | 3 (7.3%) | 38 (92.7%)
111227-102228 | 14 | 3 (21.4%) | 11 (78.6%)
102228-111227 | 13 | 4 (30.8%) | 9 (69.2%)
112127-102228 | no occurrences
102228-112127 | 3 | 2 | 1

Using the constrained dataset (wrating > 1600 AND brating > 1600 AND ABS(wrating-brating) < 100 AND wtype=btype):

Summary:
Camel advantage wins: 13/15 (87%)
HR advantage wins: 22/26 (85%)
DR advantage wins: 18/23 (78%)
Camel vs. HR: player with the camel wins 3/12 (25%)
Camel vs. DR: player with the camel wins 1/1 (100%)

State | N | gold wins | silver wins
112228-102228 | 6 | 5 | 1
102228-112228 | 9 | 1 | 8
112228-111227 | 15 | 12 | 3
111227-112228 | 11 | 1 | 10
112228-112127 | 9 | 7 | 2
112127-112228 | 14 | 3 | 11
111227-102228 | 9 | 2 | 7
102228-111227 | 3 | 2 | 1
112127-102228 | no occurrences
102228-112127 | 1 | 0 | 1

Using the larger data set, capturing DR and HR both seem to be better than capturing M. This does reverse when using the constrained data set, but in that case the sample sizes are sufficiently small that I don't have great confidence in the results.
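To put a number on that lack of confidence, here is a quick binomial-interval sketch (normal approximation, which is itself rough at these sample sizes; a Wilson interval would be better). It just shows how wide the uncertainty is for the constrained counts above.

Code:
# Quick sketch: normal-approximation confidence interval on a win rate,
# to show how wide the uncertainty is for the small constrained samples.
import math

def win_rate_ci(wins, games, z=1.96):
    p = wins / games
    half = z * math.sqrt(p * (1 - p) / games)
    return p, max(0.0, p - half), min(1.0, p + half)

# Camel traded for HR in the constrained set: 3 wins out of 12.
print(win_rate_ci(3, 12))   # about (0.25, 0.005, 0.495)
# DR advantage in the constrained set: 18 wins out of 23.
print(win_rate_ci(18, 23))  # about (0.78, 0.61, 0.95)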
« Last Edit: Nov 28th, 2006, 4:46pm by IdahoEv »
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #24 on: Nov 28th, 2006, 1:39pm »
on Nov 28th, 2006, 8:23am, Fritzlein wrote: Can you post the game id's of those twelve games? It might be instructive to see exactly how the extra rabbits prove advantageous.
Here you go; the matching games for H vs. RR, M vs. HR, and M vs. DR. "Turn" is the turn at which my system records the material state, which is generally 1 full turn (2 ply) after the capture that created the state. (This is how I avoid including mid-exchange states.)

State | game | turn | winner

H for RR:
112226-111228 | 7756 | 34w1 | Black
112226-111228 | 15167 | 12b1 | White
112226-111228 | 16004 | 27b1 | Black
112226-111228 | 32195 | 37b1 | Black
112226-111228 | 35376 | 19b1 | Black
112226-111228 | 36369 | 17w1 | White
111228-112226 | 11449 | 22b1 | White
111228-112226 | 31668 | 26w1 | White
111228-112226 | 33165 | 29w1 | White
111228-112226 | 34455 | 18w1 | White
111228-112226 | 35510 | 31b1 | White
111228-112226 | 38603 | 18w1 | Black

M for HR:
111227-102228 | 10480 | 16w1 | White
111227-102228 | 11235 | 29b1 | Black
111227-102228 | 11632 | 14b1 | Black
111227-102228 | 13030 | 23b1 | Black
111227-102228 | 15862 | 16b1 | Black
111227-102228 | 21919 | 43w1 | Black
111227-102228 | 23272 | 14b1 | Black
111227-102228 | 23287 | 52b1 | Black
111227-102228 | 23525 | 17b1 | Black
111227-102228 | 24623 | 21b1 | White
111227-102228 | 27204 | 25b1 | Black
111227-102228 | 29619 | 43b1 | Black
111227-102228 | 31362 | 21b1 | White
111227-102228 | 36325 | 30b1 | Black
102228-111227 | 9235 | 18b1 | White
102228-111227 | 15294 | 18w1 | Black
102228-111227 | 17649 | 37w1 | White
102228-111227 | 18824 | 23b1 | Black
102228-111227 | 18871 | 21w1 | White
102228-111227 | 24682 | 25w1 | Black
102228-111227 | 28137 | 27w1 | Black
102228-111227 | 29508 | 24w1 | Black
102228-111227 | 32663 | 20b1 | Black
102228-111227 | 33262 | 26b1 | Black
102228-111227 | 33934 | 16w1 | White
102228-111227 | 39958 | 25w1 | Black
102228-111227 | 40846 | 20b1 | Black

M for DR:
102228-112127 | 19073 | 32w1 | White
102228-112127 | 27445 | 23w1 | Black
102228-112127 | 38494 | 27w1 | White
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #25 on: Nov 28th, 2006, 4:02pm »
Argh. I just realized my last three posts searched the database with an additional constraint ... wtype=btype. So those numbers only include HvH games and BvB games; no HvB games. I can re-run them if y'all care.
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #26 on: Nov 28th, 2006, 5:20pm »
Okay, the same results (summarized only - these posts are too long!), but now including HvB games as well as HvH and BvB. Interestingly, the results more closely match 99of9's intuition when HvB games are included. Does this mean that our feel for the game is based on thousands of games vs. bots, and isn't quite accurate when we are playing against a human opponent?

HvB included, ratings for both players > 1600. Positions in order of greatest to least statistical advantage:

HR advantage wins 183/211, 86%
DR advantage wins 327/383, 85%
M advantage wins 132/159, 83%
RR advantage wins 446/568, 79%
H advantage wins 459/621, 74%

Trade M for HR, player with camel wins 7/17, 41%
Trade M for DR, player with camel wins 54/95, 57%
Trade H for RR, player with horse wins 19/44, 43%

HvB included, both ratings > 1600 and ABS(wrating-brating) < 100:

M advantage wins 64/72, 89%
HR advantage wins 59/71, 83%
DR advantage wins 94/117, 80%
RR advantage wins 167/214, 78%
H advantage wins 165/250, 66%

Trade M for HR, player with camel wins 18/33, 54.5%
Trade M for DR, player with camel wins 4/7, 57%
Trade H for RR, player with horse wins 10/18, 56%

So when HvB games are included AND we constrain to players with similar ratings, an M advantage becomes the strongest advantage we have. But in all other combinations considered, HR advantage beats M, and even DR beats M for most combinations of constraints. Also, in every combination yet considered, RR advantage leads to a win more often than H advantage. (Actual trades of H for RR contraindicate this, but the number of those is low.)
aaaa
Forum Guru
Arimaa player #958
Posts: 768
Re: Empirically derived material evaluators Part I
« Reply #27 on: May 11th, 2007, 6:22pm »
on Nov 16th, 2006, 4:43pm, IdahoEv wrote:
ALGORITHM RUNDOWN

Count Pieces
Just what you'd expect; the score is +1.0 point for each piece gold has, -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline. Most of the time, a player with more pieces is ahead. The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.
It's kind of a shame that you didn't use the opportunity to also see how the various algorithms would stack up against the significantly less (but still quite) naive method of simply assigning point values to the different (collapsed) piece types à la chess, which would be similarly optimized. These numbers would also make for nice base values for bots to use.

I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me. If one wanted to discover which general method (that is, ignoring the exact values used as parameters) best captures the intricacies of Arimaa, then cross-validation would be in order here.
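In case it helps, here is a minimal k-fold sketch of what that cross-validation could look like. The fit/predict interface and all names are illustrative assumptions, not code from this thread.

Code:
# Minimal k-fold cross-validation sketch for comparing evaluator families by
# held-out error instead of training error. Interface and names are
# illustrative assumptions only.
import random

def cross_validate(fit, data, k=10, seed=0):
    """fit(train_cases) -> predict(state) returning P(gold wins).
    data: list of (material_state, gold_won) pairs. Returns mean test error."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    fold_errors = []
    for i in range(k):
        test = folds[i]
        train = [case for j, fold in enumerate(folds) if j != i for case in fold]
        predict = fit(train)
        fold_errors.append(sum((predict(s) - (1.0 if won else 0.0)) ** 2
                               for s, won in test) / len(test))
    return sum(fold_errors) / k

# A seven-parameter model like DAPE should only be preferred over a simpler
# one if its cross-validated (held-out) error is lower, not its training error.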
IdahoEv
Forum Guru
Arimaa player #1753
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #28 on: Jun 11th, 2007, 1:20pm »
on May 11th, 2007, 6:22pm, aaaa wrote: It's kind of a shame that you didn't use the opportunity to also see how the various algorithms would stack up against the significantly less (but still quite) naive method of simply assigning point values to the different (collapsed) piece types à la chess, which would be similarly optimized.
That's not a bad idea, and if and when I re-run this analysis, I'll include something like that. The only trouble is that it would have to use collapsed piece types to be useful at all, and it's not at all clear how to assign the values to a collapsed piece list because the number of possible pieces changes. Do we fix the elephant value and count down in levels, or fix the cat and count up? And do we float the rabbit value as well, or leave it fixed relative to the elephant or cat? We might have to try it a couple of different ways.

Quote: I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me.
See my post in the fairy pieces thread for discussion on this. As an additional comment, though, most of the variability in the oDAPE coefficients arises because what mattered most in some cases was the ratio between two coefficients rather than the coefficients themselves. So, for example, AR and BR vary widely, but only because they vary inversely. The ratio AR/BR was fairly consistent over multiple runs.
aaaa
Forum Guru
Arimaa player #958
Posts: 768
Re: Empirically derived material evaluators Part I
« Reply #29 on: Jun 12th, 2007, 2:05pm »
A case can be made both for collapsing the pieces downwards and for collapsing them upwards. On one hand, collapsing downwards makes sense from the viewpoint that as the board empties out, pieces and especially rabbits become more valuable simply by existing (i.e. quantity becomes more important than quality), in which case you would want to minimize the discrepancy in value between a rabbit and the next weakest piece (by normalizing the latter as a cat). On the other hand, by collapsing upwards you won't have the strategically indispensable elephant inflating the base value of lower pieces after piece levels get eliminated.

This suggests adding three more systems for testing: collapsing downwards, collapsing upwards, and a naive system with no collapsing at all, in order to see whether it actually matters that much. I'm guessing it doesn't really.
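For concreteness, here is one way the two conventions could be written down; the piece-letter representation and helper names are one possible reading of "collapsing", not an established definition.

Code:
# Small sketch of the two collapsing conventions discussed above, under one
# possible reading of them; piece letters and helper names are illustrative.
ORDER = ["C", "D", "H", "M"]   # non-rabbit, non-elephant types, weakest first

def collapse(types_present, upward=False):
    """Relabel the officer types still in play onto consecutive levels.
    Downward packs them against the cat end; upward packs them against the
    camel end. Rabbits and the elephant keep their own labels either way."""
    present = [t for t in ORDER if t in types_present]
    slots = ORDER[-len(present):] if upward else ORDER[:len(present)]
    return dict(zip(present, slots))

# Example: all camels and all cats have been traded off the board.
print(collapse({"D", "H"}))               # {'D': 'C', 'H': 'D'}  (downward)
print(collapse({"D", "H"}, upward=True))  # {'D': 'H', 'H': 'M'}  (upward)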