Arimaa Forum - Empirically derived material evaluators Part II

Topic: Empirically derived material evaluators Part II (Read 12110 times)

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #15 on: Nov 22nd, 2006, 2:17am »

on Nov 21st, 2006, 9:42pm, Janzert wrote:

oops forgot the words "to FAME"

Fame!

I'm gonna live forever
I'm gonna learn how to fly
High!

I feel it coming together
People will see me and cry
Fame!

I'm gonna make it to heaven
Light up the sky like a flame
Fame!

I'm gonna live forever
Baby remember my name

Remember
Remember
Remember
Remember...

Or was that not what you meant...

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Empirically derived material evaluators Part I
« Reply #16 on: Nov 22nd, 2006, 9:51am »

on Nov 21st, 2006, 3:23am, IdahoEv wrote:

Algorithm	K	Total Error	Error +/-
RabbitCurveABC (Optimized Coefficients)	2.82	1891.21	0.646
LinearAB (Optimized Coefficients)	1.77	1889.10	0.015
RabbitCurve2ABC (Optimized Coefficients)	1.80	1889.06	0.004

Moreover, it essentially made itself into LinearAB:
A = 1.2427 +/- .0096
B = 1.3185 +/- .0010
C = 1.0205 +/- .0092

Ha, ha! Yes, when the rabbit curve was allowed to bend itself either way, it decided to stay essentially straight, but what curvature it does have goes the wrong way! Losing the first rabbit apparently hurts you more than losing the seventh rabbit...

This result has to make us stop and think: to what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated? The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question. It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #17 on: Nov 22nd, 2006, 2:26pm »

on Nov 22nd, 2006, 9:51am, Fritzlein wrote:

To what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated? The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question. It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.

Indeed. I think the likely issue is the overrepresentation of early losses in the DB. Since there are many, many examples of a single rabbit loss in the DB, and since those equate to roughly a 55/45 win/loss split, the optimizer has to work very hard to generate a 0.55 output for that case in order to minimize the error.

On the other hand, there are fewer cases where someone has, say, lost five rabbits. And in most of those cases they've probably lost a few officers as well and are thoroughly losing the game. So all the optimizer needs to do is punish them for having lost a lot of pieces and produce a probability near 1.0 or 0.0 as appropriate and very little error will accumulate for those cases.

This is the inevitable drawback of an empirical approach.
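
To make the weighting effect concrete, here is a minimal sketch. It assumes the optimizer minimizes summed squared error between the predicted win probability and the game outcome (1 for a win, 0 for a loss); the game counts and win rates below are made up for illustration and are not taken from the DB.

```python
def total_squared_error(n_games, win_rate, predicted):
    """Sum of (predicted - outcome)^2 over n_games, outcome being 1 (win) or 0 (loss)."""
    wins = n_games * win_rate
    losses = n_games - wins
    return wins * (1.0 - predicted) ** 2 + losses * predicted ** 2


def excess_error(n_games, win_rate, predicted):
    """Extra error from predicting `predicted` instead of the observed win rate."""
    best = total_squared_error(n_games, win_rate, win_rate)
    return total_squared_error(n_games, win_rate, predicted) - best


# Hypothetical counts: a common early state and a rare lopsided one,
# each mispredicted by the same 0.10.
common_state = excess_error(n_games=1000, win_rate=0.55, predicted=0.65)
rare_state = excess_error(n_games=20, win_rate=0.95, predicted=0.85)

print(f"common state excess error: {common_state:.2f}")
print(f"rare state excess error:   {rare_state:.2f}")
```

Under that assumption the excess error works out to n * (predicted - win_rate)^2, so the same miss costs fifty times more on the made-up common state than on the rare one, and the fit gets pulled toward the common state.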

At the same time, one could argue that this is correct as far as a bot's material evaluator is concerned in any case. As with any player, once you're way ahead on material it matters less if you're correctly evaluating an additional exchange of D vs. RC. But correctly evaluating the early exchanges is far more important.

Put differently, correctly fitting your evaluator to the cases that appear in real games may be more useful in the practical sense of winning real games than an abstract concept of "true piece value". Especially since material evaluation alone is somewhat specious in Arimaa and is always subject to position, it may be that certain material states are more likely to be relevant than others simply because positional constraints prevent the others from actually appearing. For example, how important is it for a material eval to "correctly" evaluate M vs. RRRR .... if that exchange never actually occurs in practice? What if, in fact, it cannot occur because reasonable play simply will not create the positional circumstances for it? Then of course an empirically-derived evaluator will not necessarily evaluate that state correctly ... but maybe that doesn't actually matter.

Evaluating the increased value of the seventh rabbit lost may be a moot point if the evaluator can already tell you've lost with 99% confidence three pieces earlier. Or, if losing seven rabbits has only happened 10 times in a training set of 50,000 states. If so, we wouldn't expect it to put much effort into fitting that state well, and maybe the fact that it doesn't care about that state is telling us that we shouldn't worry about it, either, because it's not a relevant distinction to make in terms of winning games.

I have some thoughts about the difference between the two different DAPE optimizations that I'll share when I have a moment.

« Last Edit: Nov 22nd, 2006, 2:30pm by IdahoEv »

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #18 on: Nov 22nd, 2006, 2:42pm »

Sudden thought: now that the optimizer is scoring win probability rather than just "who is winning", we can't consider the cat loss at the beginning of a bait-and-tackle game as a simple "incorrect state" that no evaluator will get correct anymore, as we did in the analysis of Part I.

Instead, the optimizer will be desperately trying to reduce the probability error of the hundreds of examples of that technique, and that will definitely throw a wrench into the works.

I bet you that ABS(wrating-brating)<100 eliminates the vast majority of bait-and-tackle cases from the dataset.

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #19 on: Nov 27th, 2006, 10:14pm »

Fritzl: to answer your question from chat last week, Optimized LinearAB prefers RH to M, in fact by quite a lot. Opt. LinearAB values (as optimized above, anyway) initial pieces as:

R: 1.0
C: 1.24
D: 1.54
H: 1.91
M: 2.38

LinearAB is strictly additive, so HR = 2.91. More than half a rabbit better than M.
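
Because the model is additive, any bundle can be checked by summing. A small sketch using only the five values listed above; the function name is made up, and the numbers apply to initial pieces only, as noted.

```python
# Optimized LinearAB values for initial pieces, copied from the list above.
LINEAR_AB_VALUES = {"R": 1.0, "C": 1.24, "D": 1.54, "H": 1.91, "M": 2.38}


def bundle_value(pieces):
    """Sum the per-piece values for a string of piece letters, e.g. 'HR'."""
    return sum(LINEAR_AB_VALUES[p] for p in pieces)


for bundle, single in (("HR", "M"), ("DR", "M"), ("RR", "H")):
    diff = bundle_value(bundle) - bundle_value(single)
    print(f"{bundle} = {bundle_value(bundle):.2f}, "
          f"{single} = {bundle_value(single):.2f}, difference = {diff:+.2f}")
```

By the same sums, DR also comes out ahead of M, and RR comes out slightly ahead of H.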

"Discuss".

IP Logged

99of9
Forum Guru

Gnobby's creator (player #314)

Gender:

Posts: 1413

Re: Empirically derived material evaluators Part I
« Reply #20 on: Nov 27th, 2006, 10:35pm »

on Nov 21st, 2006, 9:33pm, 99of9 wrote:

Hey Idaho. Since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening, could you access your actual stats and tell us how often each has won (say for the Rating Diff <100 criterion)?

I'll make a bold prediction that the guy with the horse wins more often.

When you get a chance you could try this for M vs HR as well... but more tellingly you should try M vs DR, because Opt. linearAB even favours the DR there, which is totally whacky.

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #21 on: Nov 28th, 2006, 12:36am »

The problem, 99, is that the data of exactly those specific states is pretty sparse. We can estimate the value of RR by how often it wins games, but the specific state of H traded for RR is pretty rare. But here's what we've got.

Using the same data set I used for training (both ratings > 1600, id>5000) (only including HvH and BvB, not HvB)

112228-112226 (gold ahead two rabbits)
52 occurrences, gold wins 45 of them (86.5%)

112226-112228 (silver ahead two rabbits)
49 occurrences, silver wins 37 of them (75.5%)

Combined: RR advantage wins 82/101 or 81.2%

112228-111228 (gold ahead one horse)
78 occurrences, gold wins 54 of them (69.2%)

111228-112228 (silver ahead one horse)
82 occurrences, silver wins 67 of them (81%)

Combined: H advantage wins 121/160 or 75.6%

112226-111228 (gold down RR, silver down H)
6 occurrences, gold wins 2, silver wins 4 (67% silver win)

111228-112226 (gold down H, silver down RR)
6 occurrences, gold wins 5 (83%), silver wins 1 (16%)

Repeating the analysis, additionally constraining ABS(wrating-brating)<100 (only including HvH and BvB, not HvB):

112228-112226 (gold ahead two rabbits)
16 occurrences, gold wins 14 of them (87.5%)

112226-112228 (silver ahead two rabbits)
20 occurrences, silver wins 15 of them (75.0%)

Combined: RR advantage wins 29/36 or 80.6%

112228-111228 (gold ahead one horse)
31 occurrences, gold wins 21 of them (67.7%)

111228-112228 (silver ahead one horse)
27 occurrences, silver wins 19 of them (70.4%)

Combined: H advantage wins 40/58 or 69.0%

112226-111228 (gold down RR, silver down H)
1 occurrence, gold wins 0, silver wins 1

111228-112226 (gold down H, silver down RR)
3 occurrences, gold wins 2 (67%), silver wins 1 (33%)
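
For anyone reading the state codes: going by the labels above, each side appears to be six digits giving the counts of elephant, camel, horse, dog, cat and rabbit, with gold before the dash and silver after it. A small decoder sketch under that assumption (the function names are made up):

```python
PIECES = "EMHDCR"       # elephant, camel, horse, dog, cat, rabbit
INITIAL = "112228"      # full starting army for one side


def parse_side(code):
    """Turn a six-digit code such as '112226' into a {piece: count} dict."""
    if len(code) != len(PIECES) or not code.isdigit():
        raise ValueError(f"expected six digits, got {code!r}")
    return {piece: int(digit) for piece, digit in zip(PIECES, code)}


def pieces_lost(code):
    """List the pieces missing relative to the starting army, e.g. 'RR'."""
    start, now = parse_side(INITIAL), parse_side(code)
    return "".join(piece * (start[piece] - now[piece]) for piece in PIECES)


def describe(state):
    """Describe a 'gold-silver' state code in terms of what each side has lost."""
    gold, silver = state.split("-")
    return (f"gold down {pieces_lost(gold) or 'nothing'}, "
            f"silver down {pieces_lost(silver) or 'nothing'}")


print(describe("112226-111228"))
print(describe("112228-111228"))
```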

« Last Edit: Nov 28th, 2006, 4:47pm by IdahoEv »

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Empirically derived material evaluators Part I
« Reply #22 on: Nov 28th, 2006, 8:23am »

on Nov 28th, 2006, 12:36am, IdahoEv wrote:

112226-111228 (gold down RR, silver down H)
6 occurrences, gold wins 2, silver wins 4 (67% silver win)

111228-112226 (gold down H, silver down RR)
6 occurrences, gold wins 5 (83%), silver wins 1 (16%)

Based on the spreadsheet you sent me earlier, I rather expected this. If RR weren't beating H most of the time (plus other similar stuff) then LinearAB wouldn't have settled on the coefficients it chose.

Can you post the game id's of those twelve games? It might be instructive to see exactly how the extra rabbits prove advantageous.

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #23 on: Nov 28th, 2006, 12:57pm »

on Nov 27th, 2006, 10:35pm, 99of9 wrote:

When you get a chance you could try this for M vs HR as well... but more tellingly you should try M vs DR, because Opt. linearAB even favours the DR there, which is totally whacky.

Well, as it turns out that is in fact what the data say when considering all games with a rating over 1600 (i.e. the dataset I trained LinearAB on).

Using the larger dataset (wrating > 1600 AND brating > 1600 AND wtype=btype)

Summary:
Camel advantage wins: 38/44 (86.4%)
HR advantage wins: 56/61 (91.8%)
DR advantage wins: 62/70 (88.6%)
Camel traded for HR: player with the camel wins 12/27 (44%)
Camel traded for DR: player with the camel wins 1/3 (33%)

State	N	gold wins	silver wins
112228-102228	20	16(80%)	4(20%)
102228-112228	24	2(8.3%)	22(91.7%)
112228-111227	30	27(90%)	3(10%)
111227-112228	31	2(6.5%)	29(93.5%)
112228-112127	29	24(82.8%)	5(17.2%)
112127-112228	41	3(7.3%)	38(92.7%)
111227-102228	14	3(21.4%)	11(78.6%)
102228-111227	13	4(30.8%)	9(69.2%)
112127-102228	no occurrences
102228-112127	3	2	1

Using the constrained dataset (wrating > 1600 AND brating > 1600 AND ABS(wrating-brating)<100 AND wtype=btype):

Summary:
Camel advantage wins: 13/15 (87%)
HR advantage wins: 22/26 (85%)
DR advantage wins: 18/23 (78%)
Camel vs. HR: player with the camel wins 3/12 (25%)
Camel vs. DR: player with the camel wins 1/1 (100%)

State	N	gold wins	silver wins
112228-102228	6	5	1
102228-112228	9	1	8
112228-111227	15	12	3
111227-112228	11	1	10
112228-112127	9	7	2
112127-112228	14	3	11
111227-102228	9	2	7
102228-111227	3	2	1
112127-102228	no occurrences
102228-112127	1	0	1

Using the larger data set, capturing DR and HR both seem to be better than capturing M. This does reverse when using the constrained data set, but in that case the sample sizes are sufficiently small that I don't have great confidence in the results.
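
One way to put numbers on that lack of confidence is a confidence interval on each win rate. The sketch below uses the Wilson score interval at roughly 95% (z = 1.96); this is a swapped-in standard technique, not something used in the analysis above, and the function name is made up.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate; returns (low, high)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    z2 = z * z
    denom = 1.0 + z2 / games
    center = (p + z2 / (2.0 * games)) / denom
    half = z * math.sqrt(p * (1.0 - p) / games + z2 / (4.0 * games * games)) / denom
    return center - half, center + half


# Camel-for-HR results from the two summaries above.
for label, wins, games in (("larger set", 12, 27), ("constrained set", 3, 12)):
    low, high = wilson_interval(wins, games)
    print(f"{label}: camel side wins {wins}/{games}, "
          f"interval {low:.1%} to {high:.1%}")
```

For both the 12/27 and the 3/12 camel results the interval includes 50%, so neither sample settles which side of the trade is better.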

« Last Edit: Nov 28th, 2006, 4:46pm by IdahoEv »

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #24 on: Nov 28th, 2006, 1:39pm »

on Nov 28th, 2006, 8:23am, Fritzlein wrote:

Can you post the game id's of those twelve games? It might be instructive to see exactly how the extra rabbits prove advantageous.

Here you go; the matching games for H vs RR, M vs. HR, and M vs. DR. "Turn" is the turn at which my system records the material state, which is generally 1 full turn (2 ply) after the capture that created the state. (This is how I avoid including mid-exchange states.)

State	game	turn	winner
H for RR:
112226-111228	7756	34w1	Black
112226-111228	15167	12b1	White
112226-111228	16004	27b1	Black
112226-111228	32195	37b1	Black
112226-111228	35376	19b1	Black
112226-111228	36369	17w1	White
111228-112226	11449	22b1	White
111228-112226	31668	26w1	White
111228-112226	33165	29w1	White
111228-112226	34455	18w1	White
111228-112226	35510	31b1	White
111228-112226	38603	18w1	Black
M for HR:
111227-102228	10480	16w1	White
111227-102228	11235	29b1	Black
111227-102228	11632	14b1	Black
111227-102228	13030	23b1	Black
111227-102228	15862	16b1	Black
111227-102228	21919	43w1	Black
111227-102228	23272	14b1	Black
111227-102228	23287	52b1	Black
111227-102228	23525	17b1	Black
111227-102228	24623	21b1	White
111227-102228	27204	25b1	Black
111227-102228	29619	43b1	Black
111227-102228	31362	21b1	White
111227-102228	36325	30b1	Black
102228-111227	9235	18b1	White
102228-111227	15294	18w1	Black
102228-111227	17649	37w1	White
102228-111227	18824	23b1	Black
102228-111227	18871	21w1	White
102228-111227	24682	25w1	Black
102228-111227	28137	27w1	Black
102228-111227	29508	24w1	Black
102228-111227	32663	20b1	Black
102228-111227	33262	26b1	Black
102228-111227	33934	16w1	White
102228-111227	39958	25w1	Black
102228-111227	40846	20b1	Black
M for DR:
102228-112127	19073	32w1	White
102228-112127	27445	23w1	Black
102228-112127	38494	27w1	White

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #25 on: Nov 28th, 2006, 4:02pm »

Argh. I just realized my last three posts searched the database with an additional constraint ... wtype=btype. So those numbers only include HvH games and BvB games; no HvB games.

I can re-run them if y'all care.
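
For reference, the difference between the two searches can be sketched as a filter over game records. The field names wrating, brating, wtype and btype are the ones used in the queries above; the record type, the 'h'/'b' type codes, the game_id field and the sample games are assumptions for illustration.

```python
from typing import NamedTuple


class Game(NamedTuple):
    game_id: int   # hypothetical identifier
    wrating: int
    brating: int
    wtype: str     # assumed codes: 'h' for human, 'b' for bot
    btype: str


def select_games(games, min_rating=1600, max_diff=None, same_type_only=False):
    """Keep games where both ratings exceed min_rating, optionally also
    requiring a small rating gap and/or matching player types."""
    selected = []
    for g in games:
        if g.wrating <= min_rating or g.brating <= min_rating:
            continue
        if max_diff is not None and abs(g.wrating - g.brating) >= max_diff:
            continue
        if same_type_only and g.wtype != g.btype:
            continue
        selected.append(g)
    return selected


# Made-up records, only to show what each constraint removes.
sample_games = [
    Game(1, 1700, 1750, "h", "h"),
    Game(2, 1900, 1650, "h", "b"),
    Game(3, 1650, 1690, "b", "h"),
    Game(4, 1500, 1800, "b", "b"),
]

with_constraint = select_games(sample_games, same_type_only=True)
without_constraint = select_games(sample_games)
print([g.game_id for g in with_constraint])
print([g.game_id for g in without_constraint])
```

Dropping same_type_only is the only change needed to bring the HvB games back in.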

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #26 on: Nov 28th, 2006, 5:20pm »

Okay, the same results (summarized only - these posts are too long!), but now including HvB games as well as HvH and BvB.

Interestingly, the results more closely match 99of9's intuition when HvB games are included. Does this mean that our feel for the game is based on thousands of games vs. bots, and isn't quite accurate when we are playing against a human opponent?

HvB included, ratings for both players > 1600. Positions in order of greatest to least statistical advantage.

HR advantage wins 183/211, 86%
DR advantage wins 327/383, 85%
M advantage wins 132/159, 83%
RR advantage wins 446/568, 79%
H advantage wins 459/621, 74%

Trade M for HR, player with camel wins 7/17, 41%
Trade M for DR, player with camel wins 54/95, 57%
Trade H for RR, player with horse wins 19/44, 43%

HvB included, both ratings >1600 and abs(wrating-brating)<100

M advantage wins 64/72, 89%
HR advantage wins 59/71, 83%
DR advantage wins 94/117, 80%
RR advantage wins 167/214, 78%
H advantage wins 165/250, 66%

Trade M for HR, player with camel wins 18/33, 54.5%
Trade M for DR, player with camel wins 4/7, 57%
Trade H for RR, player with horse wins 10/18, 56%

So when HvB games are included AND we constrain to players with similar ratings, an M advantage becomes the strongest advantage we have. But in all other combinations considered, HR advantage beats M, and even DR beats M for most combinations of constraints.

Also, in every combination yet considered, RR advantage leads to a win more often than H advantage. (Actual trades of H for RR contraindicate, but the number of those is low.)

IP Logged

aaaa
Forum Guru

Arimaa player #958

Posts: 768

Re: Empirically derived material evaluators Part I
« Reply #27 on: May 11th, 2007, 6:22pm »

on Nov 16th, 2006, 4:43pm, IdahoEv wrote:

ALGORITHM RUNDOWN

Count Pieces

Just what you'd expect; the score is +1.0 point for each piece gold has, -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline. Most of the time, a player with more pieces is ahead. The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.

It's kind of a shame that you didn't use the opportunity to also see how the various algorithms would stack up against the significantly less (but still quite) naive method of simply assigning point values to the different (collapsed) piece types à la chess, which would be similarly optimized. These numbers would also make for nice base values for bots to use.

I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me. If one were to wish to discover which general method (that is, ignoring the exact values used as parameters) would be best in capturing the intricacies of Arimaa, then cross-validation would be in order here.
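
A minimal sketch of what k-fold cross-validation could look like here. The fit and error callables stand in for whichever evaluator and error measure are being compared; they, the toy base-rate model and the sample data are placeholders, not the optimizer used in this thread.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, fit, error, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train = [s for i, s in enumerate(samples) if i not in held]
        test = [samples[i] for i in held_out]
        scores.append(error(fit(train), test))
    return sum(scores) / len(scores)


# Toy stand-ins: each sample is (material_difference, gold_won).
def fit_base_rate(train):
    """'Model' that ignores material and predicts the overall gold win rate."""
    return sum(won for _, won in train) / len(train)


def mean_squared_error(model, test):
    return sum((model - won) ** 2 for _, won in test) / len(test)


toy_samples = [(1, 1), (1, 1), (1, 0), (-1, 0), (-1, 0), (-1, 1)] * 5
cv_error = cross_validate(toy_samples, fit_base_rate, mean_squared_error, k=5)
print(f"held-out mean squared error: {cv_error:.3f}")
```

The point of the exercise is that an evaluator with more parameters only earns its lower training error if its held-out error is lower too.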

IP Logged

IdahoEv
Forum Guru

Arimaa player #1753

Gender: male

Posts: 405

Re: Empirically derived material evaluators Part I
« Reply #28 on: Jun 11th, 2007, 1:20pm »

on May 11th, 2007, 6:22pm, aaaa wrote:

That's not a bad idea, and if and when I re-run this analysis, I'll include something like that. The only trouble is that it would have to use collapsed piece types to be useful at all, and it's not at all clear how to assign the values to a collapsed piece list because the number of possible pieces changes.

Do we fix the elephant value and count down in levels, or fix the cat and count up? And do we float the rabbit value as well, or leave it fixed relative to the elephant or cat? We might have to try it a couple of different ways.

Quote:

I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me.

See my post in the fairy pieces thread for discussion on this. As an additional comment, though, most of the variability in the oDAPE coefficients comes because what mattered most in some cases was the ratio between two coefficients rather than the coefficients themselves. So, for example, AR and BR vary widely, but only because they vary inversely. The ratio AR/BR was fairly consistent over multiple runs.

IP Logged

aaaa
Forum Guru

Arimaa player #958

Posts: 768

Re: Empirically derived material evaluators Part I
« Reply #29 on: Jun 12th, 2007, 2:05pm »

A case can be made both for collapsing the pieces downwards as well as upwards. On one hand, collapsing downwards makes sense from the viewpoint that as the board empties out, pieces and especially rabbits become more valuable simply by existing (i.e. that quantity becomes more important than quality), in which case you would want to minimize the discrepancy in value between a rabbit and the next weakest piece (by normalizing the latter as a cat). On the other hand, by collapsing upwards you won't have the strategically indispensable elephant inflating the base value of lower pieces after piece levels get eliminated.

This suggests adding three more systems for testing: collapsing downwards, collapsing upwards and a naive system with no collapsing at all in order to see whether it actually matters that much. I'm guessing it doesn't really.

IP Logged
