Arimaa Forum « Empirically derived material evaluators Part II »


   Author  Topic: Empirically derived material evaluators Part II  (Read 11743 times)
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #15 on: Nov 22nd, 2006, 2:17am »

on Nov 21st, 2006, 9:42pm, Janzert wrote:
oops forgot the words "to FAME"

Fame!
 
I'm gonna live forever
I'm gonna learn how to fly
High!
 
I feel it coming together
People will see me and cry
Fame!
 
I'm gonna make it to heaven
Light up the sky like a flame
Fame!
 
I'm gonna live forever
Baby remember my name
 
Remember
Remember
Remember
Remember...

 
Or was that not what you meant...
IP Logged
Fritzlein
Forum Guru
*****



Arimaa player #706

   
Email

Gender: male
Posts: 5928
Re: Empirically derived material evaluators Part I
« Reply #16 on: Nov 22nd, 2006, 9:51am »

on Nov 21st, 2006, 3:23am, IdahoEv wrote:
Algorithm                                  K     Total Error  Error +/-
RabbitCurveABC (Optimized Coefficients)    2.82  1891.21      0.646
LinearAB (Optimized Coefficients)          1.77  1889.10      0.015
RabbitCurve2ABC (Optimized Coefficients)   1.80  1889.06      0.004

 
Moreover, it essentially made itself into LinearAB:
A = 1.2427 +/- .0096
B = 1.3185 +/- .0010
C = 1.0205 +/- .0092

Ha, ha!  Yes, when the rabbit curve was allowed to bend itself either way, it decided to stay essentially straight, but what curvature it does have goes the wrong way!  Losing the first rabbit apparently hurts you more than losing the seventh rabbit...
 
This result has to make us stop and think: to what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated?  The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question.  It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.
IP Logged

IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #17 on: Nov 22nd, 2006, 2:26pm »

on Nov 22nd, 2006, 9:51am, Fritzlein wrote:
To what extent are the optimizations a commentary on the true value of rabbits, as opposed to a commentary on the dataset from which they are generated?  The sudden change in DAPE coefficients when the data was restricted to a subset raises the same question.  It may be as valid to wonder what is biasing the data as to wonder what is biasing our judgment about piece values.

 
Indeed.   I think the likely issue is the overrepresentation of early losses in the DB.   Since there are many many examples of a single rabbit loss in the DB - and since those equate to a 55/45 win/loss or so, the optimizer has to work very hard to generate a 0.55 output for that case in order to minimize the error.
 
On the other hand, there are fewer cases where someone has, say, lost five rabbits.  And in most of those cases they've probably lost a few officers as well and are thoroughly losing the game.   So all the optimizer needs to do is punish them for having lost a lot of pieces and produce a probability near 1.0 or 0.0 as appropriate and very little error will accumulate for those cases.
 
This is the inevitable drawback of an empirical approach.
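
To put rough numbers on that argument, here is a toy calculation.  It is only a sketch: it assumes the optimizer minimizes summed squared error between the predicted win probability and the game outcome (1 or 0), which the thread does not spell out, and the counts are made up purely for illustration.

```python
def total_squared_error(prediction, wins, losses):
    """Summed squared error when every example of a state gets the same
    predicted win probability and outcomes are scored 1 (win) or 0 (loss)."""
    return wins * (1.0 - prediction) ** 2 + losses * prediction ** 2

# Hypothetical counts (not from the database): a common state that wins
# about 55% of the time, and a rare, lopsided state.
common = {"wins": 550, "losses": 450}
rare = {"wins": 19, "losses": 1}

def penalty_for_missing(state, offset=0.10):
    """Extra error incurred by predicting `offset` below the state's
    observed win rate instead of matching it exactly."""
    n = state["wins"] + state["losses"]
    best = state["wins"] / n
    guess = best - offset
    return total_squared_error(guess, **state) - total_squared_error(best, **state)

print("common state penalty:", round(penalty_for_missing(common), 3))
print("rare state penalty:  ", round(penalty_for_missing(rare), 3))
```

Under this assumed error function, the same 0.10 miss costs in proportion to how many examples a state has, so the heavily represented early states dominate the fit.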
 
At the same time, one could argue that this is correct as far as a bot's material evaluator is concerned in any case.  As with any player, once you're way ahead on material it matters less if you're correctly evaluating an additional exchange of D vs. RC.  But correctly evaluating the early exchanges is far more important.  
 
Put differently, correctly fitting your evaluator to the cases that appear in real games may be more useful in the practical sense of winning real games than an abstract concept of "true piece value".   Especially since material evaluation alone is somewhat specious in Arimaa and is always subject to position, it may be that certain material states are more likely to be relevant than others simply because positional constraints prevent the others from actually appearing.   For example, how important is it for a material eval to "correctly" evaluate M vs. RRRR ... if that exchange never actually occurs in practice?   What if, in fact, it cannot occur because reasonable play simply will not create the positional circumstances for it?   Then of course an empirically-derived evaluator will not necessarily evaluate that state correctly ... but maybe that doesn't actually matter.
 
Evaluating the increased value of the seventh rabbit lost may be a moot point if the evaluator can already tell you've lost with 99% confidence three pieces earlier.  Or, if losing seven rabbits has only happened 10 times in a training set of 50,000 states.  If so, we wouldn't expect it to put much effort into fitting that state well, and maybe the fact that it doesn't care about that state is telling us that we shouldn't worry about it, either, because it's not a relevant distinction to make in terms of winning games.
 
I have some thoughts about the difference between the two different DAPE optimizations that I'll share when I have a moment.
« Last Edit: Nov 22nd, 2006, 2:30pm by IdahoEv » IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #18 on: Nov 22nd, 2006, 2:42pm »

Sudden thought:  now that the optimizer is scoring win probability rather than just "who is winning", we can't consider the cat loss at the beginning of a bait-and-tackle game as a simple "incorrect state" that no evaluator will get correct anymore, as we did in the analysis of Part I.
 
Instead, the optimizer will be desperately trying to reduce the probability error of the hundreds of examples of that technique, and that will definitely throw a wrench into the works.
 
I bet you that  ABS(wrating-brating)<100 eliminates the vast majority of bait-and-tackle cases from the dataset.
IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #19 on: Nov 27th, 2006, 10:14pm »

Fritzl:  to answer your question from chat last week, Optimized LinearAB prefers RH to M, in fact by quite a lot.   Opt. Linear AB values (as optimized above, anyway) initial pieces as:
 
R: 1.0
C: 1.24
D: 1.54
H: 1.91
M: 2.38
 
LinearAB is strictly additive, so HR = 2.91.  More than half a rabbit better than M.
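
The arithmetic can be checked in a few lines.  This is just a sketch using the initial-position values quoted above; the `material` helper is a name made up for illustration and is not the actual LinearAB code, which also handles non-initial states.

```python
# Initial piece values quoted in the post above (rabbit = 1.0).
VALUES = {"R": 1.0, "C": 1.24, "D": 1.54, "H": 1.91, "M": 2.38}

def material(pieces):
    """Strictly additive score: sum the value of each piece letter."""
    return sum(VALUES[p] for p in pieces)

for combo in ("M", "HR", "DR", "RR", "H"):
    print(f"{combo:>2}: {material(combo):.2f}")
```

With these numbers HR and DR both come out above M, and RR comes out above H.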
 
"Discuss".
 
IP Logged
99of9
Forum Guru
*****




Gnobby's creator (player #314)

  toby_hudson  


Gender: male
Posts: 1413
Re: Empirically derived material evaluators Part I
« Reply #20 on: Nov 27th, 2006, 10:35pm »

on Nov 21st, 2006, 9:33pm, 99of9 wrote:
Hey Idaho.  Since optimized DAPE predicts that one H is worth almost exactly 2R out of the opening, could you access your actual stats and tell us how often each has won (say for the Rating Diff <100 criterion)?
 
I'll make a bold prediction that the guy with the horse wins more often.

 
When you get a chance you could try this for  M vs HR as well... but more tellingly you should try M vs DR, because Opt. linearAB even favours the DR there, which is totally whacky.
IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #21 on: Nov 28th, 2006, 12:36am »

The problem, 99, is that the data of exactly those specific states is pretty sparse.    We can estimate the value of RR by how often it wins games, but the specific state of H traded for RR is pretty rare.   But here's what we've got.
 
Using the same data set I used for training (both ratings > 1600, id>5000)  (only including HvH and BvB, not HvB)
 
112228-112226 (gold ahead two rabbits)
     52 occurrences, gold wins 45 of them (86.5%)
 
112226-112228 (silver ahead two rabbits)
    49 occurrences, silver wins 37 of them (75.5%)
 
Combined:  RR advantage wins 82/101 or 81.2%
 
 
112228-111228 (gold ahead one horse)
    78 occurrences, gold wins 54 of them (69.2%)
 
111228-112228 (silver ahead one horse)
    82 occurrences, silver wins 67 of them (81%)
 
Combined:  H advantage wins 121/160 or 75.6%
 
 
 
112226-111228 (gold down RR, silver down H)
    6 occurrences, gold wins 2, silver wins 4 (67% silver win)
 
111228-112226 (gold down H, silver down RR)
    6 occurrences, gold wins 5 (83%), silver wins 1 (17%)
 
 
 
Repeating the analysis, additionally constraining ABS(wrating-brating)<100 (only including HvH and BvB, not HvB):
 
112228-112226 (gold ahead two rabbits)
    16 occurrences, gold wins 14 of them (87.5%)
 
112226-112228 (silver ahead two rabbits)
    20 occurrences, silver wins 15 of them (75.0%)
 
Combined: RR advantage wins 29/36 or 80.6%
 
112228-111228 (gold ahead one horse)
    31 occurrences, gold wins 21 of them (67.7%)
 
111228-112228 (silver ahead one horse)
    27 occurrences, silver wins 19 of them (70.4%)
 
Combined:  H advantage wins 40/58  or 69.0%
 
112226-111228 (gold down RR, silver down H)
    1 occurrence, gold wins 0, silver wins 1
 
111228-112226 (gold down H, silver down RR)
    3 occurrences, gold wins 2 (67%), silver wins 1 (33%)
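
For anyone reading the state codes, here is a small decoding sketch.  The digit order (elephant, camel, horse, dog, cat, rabbit) is inferred from the annotations above plus the standard starting army, and the helper names are invented for illustration.

```python
PIECES = "EMHDCR"               # assumed digit order, inferred from the annotations
FULL_ARMY = (1, 1, 2, 2, 2, 8)  # standard Arimaa starting army

def missing(side):
    """Return the pieces a side has lost, e.g. '112226' -> 'RR'."""
    return "".join(p * (full - int(d)) for p, full, d in zip(PIECES, FULL_ARMY, side))

def describe(state):
    """Describe a 'gold-silver' state string in words."""
    gold, silver = state.split("-")
    return (f"gold down {missing(gold) or 'nothing'}, "
            f"silver down {missing(silver) or 'nothing'}")

def pooled(*results):
    """Pool (wins, games) pairs from mirrored states into one win rate."""
    wins = sum(w for w, _ in results)
    games = sum(n for _, n in results)
    return wins, games, wins / games

print(describe("112226-111228"))
print(pooled((45, 52), (37, 49)))  # RR advantage, gold and silver cases combined
```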
 
 
 
« Last Edit: Nov 28th, 2006, 4:47pm by IdahoEv » IP Logged
Fritzlein
Forum Guru
*****



Arimaa player #706

   
Email

Gender: male
Posts: 5928
Re: Empirically derived material evaluators Part I
« Reply #22 on: Nov 28th, 2006, 8:23am »

on Nov 28th, 2006, 12:36am, IdahoEv wrote:

112226-111228 (gold down RR, silver down H)
    6 occurrences, gold wins 2, silver wins 4 (67% silver win)
 
111228-112226 (gold down H, silver down RR)
    6 occurrences, gold wins 5 (83%), silver wins 1 (17%)

Based on the spreadsheet you sent me earlier, I rather expected this.  If RR weren't beating H most of the time (plus other similar stuff) then LinearAB wouldn't have settled on the coefficients it chose.
 
Can you post the game id's of those twelve games?  It might be instructive to see exactly how the extra rabbits prove advantageous.
IP Logged

IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #23 on: Nov 28th, 2006, 12:57pm »

on Nov 27th, 2006, 10:35pm, 99of9 wrote:
When you get a chance you could try this for  M vs HR as well... but more tellingly you should try M vs DR, because Opt. linearAB even favours the DR there, which is totally whacky.

 
Well, as it turns out that is in fact what the data say when considering all games with a rating over 1600 (i.e. the dataset I trained LinearAB on).
 
 
Using the larger dataset (wrating > 1600 AND brating > 1600 AND wtype=btype)
 
Summary:
Camel advantage wins: 38/44 (86.4%)
HR advantage wins: 61/65 (91.0%)
DR advantage wins: 62/70 (88.6%)
Camel traded for HR: player with the camel wins 12/27 (44%)
Camel traded for DR: player with the camel wins 1/3 (33%)
 
State          N   gold wins    silver wins
112228-102228  20  16 (80%)     4 (20%)
102228-112228  24  2 (8.3%)     22 (91.7%)
112228-111227  30  27 (90%)     3 (10%)
111227-112228  31  2 (6.5%)     29 (93.55%)
112228-112127  29  24 (82.8%)   5 (17.24%)
112127-112228  41  3 (7.32%)    38 (92.68%)
111227-102228  14  3 (21.4%)    11 (78.6%)
102228-111227  13  4 (30.8%)    9 (69.2%)
112127-102228  no occurrences
102228-112127  3   2            1

 
Using the constrained dataset (wrating > 1600 AND brating > 1600 AND ABS(wrating-brating)<100 AND wtype=btype):
 
 
Summary:
Camel advantage wins: 13/15 (87%)
HR advantage wins: 22/26 (85%)
DR advantage wins: 18/23 (78%)
Camel vs. HR: player with the camel wins 3/12 (25%)
Camel vs. DR: player with the camel wins 1/1 (100%)
 
State          N   gold wins  silver wins
112228-102228  6   5          1
102228-112228  9   1          8
112228-111227  15  12         3
111227-112228  11  1          10
112228-112127  9   7          2
112127-112228  14  3          11
111227-102228  9   2          7
102228-111227  3   2          1
112127-102228  no occurrences
102228-112127  1   0          1

 
 
Using the larger data set, capturing DR and HR both seem to be better than capturing M. This does reverse when using the constrained data set, but in that case the sample sizes are sufficiently small that I don't have great confidence in the results.
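
One way to put numbers on "sufficiently small" is a confidence interval on each win rate.  The sketch below uses the Wilson score interval, which is my choice for illustration and not something used elsewhere in this analysis; the counts are the ones from the summaries above.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a win rate."""
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half

# Counts taken from the summaries above.
for label, wins, games in [
    ("Camel vs. HR, larger set", 12, 27),
    ("Camel vs. HR, constrained set", 3, 12),
    ("Camel advantage, larger set", 38, 44),
]:
    lo, hi = wilson_interval(wins, games)
    print(f"{label}: {wins}/{games}, roughly {lo:.2f} to {hi:.2f}")
```

An interval that straddles 0.50 means the sample cannot tell us which side of the trade is favoured.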
« Last Edit: Nov 28th, 2006, 4:46pm by IdahoEv » IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #24 on: Nov 28th, 2006, 1:39pm »

on Nov 28th, 2006, 8:23am, Fritzlein wrote:
Can you post the game id's of those twelve games?  It might be instructive to see exactly how the extra rabbits prove advantageous.

 
Here you go; the matching games for H vs RR, M vs. HR, and M vs. DR.  "Turn" is the turn at which my system records the material state, which is generally 1 full turn (2 ply) after the capture that created the state.  (This is how I avoid including mid-exchange states.)
 
State          game   turn  winner
H for RR:
112226-111228  7756   34w1  Black
112226-111228  15167  12b1  White
112226-111228  16004  27b1  Black
112226-111228  32195  37b1  Black
112226-111228  35376  19b1  Black
112226-111228  36369  17w1  White
111228-112226  11449  22b1  White
111228-112226  31668  26w1  White
111228-112226  33165  29w1  White
111228-112226  34455  18w1  White
111228-112226  35510  31b1  White
111228-112226  38603  18w1  Black
M for HR:
111227-102228  10480  16w1  White
111227-102228  11235  29b1  Black
111227-102228  11632  14b1  Black
111227-102228  13030  23b1  Black
111227-102228  15862  16b1  Black
111227-102228  21919  43w1  Black
111227-102228  23272  14b1  Black
111227-102228  23287  52b1  Black
111227-102228  23525  17b1  Black
111227-102228  24623  21b1  White
111227-102228  27204  25b1  Black
111227-102228  29619  43b1  Black
111227-102228  31362  21b1  White
111227-102228  36325  30b1  Black
102228-111227  9235   18b1  White
102228-111227  15294  18w1  Black
102228-111227  17649  37w1  White
102228-111227  18824  23b1  Black
102228-111227  18871  21w1  White
102228-111227  24682  25w1  Black
102228-111227  28137  27w1  Black
102228-111227  29508  24w1  Black
102228-111227  32663  20b1  Black
102228-111227  33262  26b1  Black
102228-111227  33934  16w1  White
102228-111227  39958  25w1  Black
102228-111227  40846  20b1  Black
M for DR:
102228-112127  19073  32w1  White
102228-112127  27445  23w1  Black
102228-112127  38494  27w1  White
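
The "record the state 2 ply after the capture" rule mentioned above can be sketched roughly like this.  It is not the actual extraction code: the per-ply list of state strings and the `settled_states` name are assumptions made for the example.

```python
def settled_states(states_by_ply, settle=2):
    """Return (ply, state) for each material state that is still unchanged
    `settle` plies after the capture that created it."""
    recorded = []
    for ply in range(1, len(states_by_ply)):
        state = states_by_ply[ply]
        if state == states_by_ply[ply - 1]:
            continue  # no capture on this ply
        later = ply + settle
        if later < len(states_by_ply) and all(
            states_by_ply[k] == state for k in range(ply, later + 1)
        ):
            recorded.append((later, state))
    return recorded

# Made-up game: silver loses a horse on ply 2, gold loses a rabbit on ply 3.
plies = [
    "112228-112228",
    "112228-112228",
    "112228-111228",  # mid-exchange, gone one ply later
    "112227-111228",
    "112227-111228",
    "112227-111228",
]
print(settled_states(plies))
```

Here the mid-exchange state on ply 2 is skipped because it does not survive the following plies, and only the settled state is recorded.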

IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #25 on: Nov 28th, 2006, 4:02pm »

Argh.  I just realized my last three posts searched the database with an additional constraint ... wtype=btype.  So those numbers only include HvH games and BvB games; no HvB games.
 
I can re-run them if y'all care.
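
For reference, the constraints used in these posts amount to a filter like the one below.  The field names id, wrating, brating, wtype and btype are the ones quoted in the queries above; the dict layout, the "h"/"b" type codes and the sample records are hypothetical.

```python
def keep_game(game, min_rating=1600, min_id=5000, max_gap=None, same_type=False):
    """Return True if a game record passes the constraints quoted in the thread.

    `game` is a hypothetical dict with keys id, wrating, brating, wtype, btype.
    """
    if game["id"] <= min_id:
        return False
    if game["wrating"] <= min_rating or game["brating"] <= min_rating:
        return False
    if max_gap is not None and abs(game["wrating"] - game["brating"]) >= max_gap:
        return False
    if same_type and game["wtype"] != game["btype"]:
        return False
    return True

# Made-up records, only to show the filter in action.
games = [
    {"id": 6001, "wrating": 1850, "brating": 1800, "wtype": "h", "btype": "h"},
    {"id": 6002, "wrating": 2100, "brating": 1700, "wtype": "h", "btype": "b"},
    {"id": 4000, "wrating": 1900, "brating": 1900, "wtype": "b", "btype": "b"},
]
larger = [g["id"] for g in games if keep_game(g)]
constrained = [g["id"] for g in games if keep_game(g, max_gap=100, same_type=True)]
print(larger, constrained)
```

Leaving `same_type` off is what brings the HvB games back in.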
IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #26 on: Nov 28th, 2006, 5:20pm »

Okay, the same results (summarized only - these posts are too long!), but now including HvB games as well as HvH and BvB.  
 
Interestingly, the results more closely match 99of9's intuition when HvB games are included.    Does this mean that our feel for the game is based on thousands of games vs. bots, and isn't quite accurate when we are playing against a human opponent?
 
 
HvB included, ratings for both players > 1600.  Positions in order of greatest to least statistical advantage.
 
 
HR advantage wins 183/211, 86%
DR advantage wins 327/383, 85%
M advantage wins 132/159, 83%
RR advantage wins 446/568, 79%
H advantage wins 459/621, 74%
 
Trade M for HR, player with camel wins 7/17, 41%
Trade M for DR, player with camel wins 54/95, 57%
Trade H for RR, player with horse wins 19/44, 43%
 
 
HvB included, both ratings >1600 and abs(wrating-brating)<100
 
 
M advantage wins 64/72, 89%
HR advantage wins 59/71, 83%
DR advantage wins 94/117, 80%
RR advantage wins 167/214, 78%
H advantage wins 165/250, 66%
 
Trade M for HR, player with camel wins 18/33, 54.5%
Trade M for DR, player with camel wins 4/7, 57%
Trade H for RR, player with horse wins 10/18, 56%
 
 
So when HvB games are included AND we constrain to players with similar ratings, an M advantage becomes the strongest advantage we have.   But in all other combinations considered, HR advantage beats M, and even DR beats M for most combinations of constraints.    
 
Also, in every combination yet considered, an RR advantage leads to a win more often than an H advantage.  (Actual trades of H for RR contraindicate, but the number of those is low.)
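
To gauge whether the RR vs. H gap is bigger than sampling noise, a two-proportion z-test is one rough option.  This is a sketch of my own and not part of the original analysis; it treats the two samples as independent, which may not hold exactly if one game contributes states to both groups.

```python
import math

def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """z statistic for the difference between two win rates (pooled variance)."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (p_a - p_b) / se

# RR advantage vs. H advantage, counts from the two summaries above.
print("all ratings > 1600:", round(two_proportion_z(446, 568, 459, 621), 2))
print("rating gap < 100:  ", round(two_proportion_z(167, 214, 165, 250), 2))
```

As a rule of thumb, an absolute z above about 1.96 corresponds to the usual 5% significance level.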
IP Logged
aaaa
Forum Guru
*****



Arimaa player #958

   


Posts: 768
Re: Empirically derived material evaluators Part I
« Reply #27 on: May 11th, 2007, 6:22pm »

on Nov 16th, 2006, 4:43pm, IdahoEv wrote:
ALGORITHM RUNDOWN
 
Count Pieces
 
Just what you'd expect; the score is +1.0 point for each piece gold has, -1.0 point for each piece silver has. This is here to give you a sense of the overall performance of the algorithms relative to a baseline.  Most of the time, a player with more pieces is ahead.   The performance improvement of the algorithms below is a measure of their ability to correctly evaluate the relatively few cases where a player has a numerical deficit but a functional advantage.

It's kind of a shame that you didn't use the opportunity to also see how the various algorithms would stack up against the significantly less (but still quite) naive method of simply assigning point values to the different (collapsed) piece types à la chess, which would be similarly optimized. These numbers would also make for nice base values for bots to use.
 
I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me. If one were to wish to discover which general method (that is, ignoring the exact values used as parameters) would be best in capturing the intricacies of Arimaa, then cross-validation would be in order here.
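
A minimal sketch of what I mean by cross-validation, assuming the training examples are held in a list and the caller supplies `fit` and `error` functions for whichever evaluator is being tested (those two names are placeholders, not anything from IdahoEv's code):

```python
import random

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(examples, fit, error, k=5):
    """Average held-out error over k folds; `fit` and `error` come from the caller."""
    scores = []
    for train, test in k_fold_splits(len(examples), k):
        model = fit([examples[i] for i in train])
        scores.append(error(model, [examples[i] for i in test]))
    return sum(scores) / len(scores)
```

A method with many parameters that only looks good on the data it was fitted to should lose its edge once the error is measured on the held-out folds.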
IP Logged
IdahoEv
Forum Guru
*****



Arimaa player #1753

   


Gender: male
Posts: 405
Re: Empirically derived material evaluators Part I
« Reply #28 on: Jun 11th, 2007, 1:20pm »

on May 11th, 2007, 6:22pm, aaaa wrote:

It's kind of a shame that you didn't use the opportunity to also see how the various algorithms would stack up against the significantly less (but still quite) naive method of simply assigning point values to the different (collapsed) piece types à la chess, which would be similarly optimized.

 
That's not a bad idea, and if and when I re-run this analysis, I'll include something like that.  The only trouble is that it would have to use collapsed piece types to be useful at all, and it's not at all clear how to assign the values to a collapsed piece list because the number of possible pieces changes.  
 
Do we fix the elephant value and count down in levels, or fix the cat and count up?  And do we float the rabbit value as well, or leave it fixed relative to the elephant or cat?  We might have to try it a couple of different ways.
 
Quote:
I'm not surprised in the least that an optimized DAPE performs the best here; the seven parameters just scream "overfitting" to me.
 
 
See my post in the fairy pieces thread for discussion on this.   As an additional comment, though, most of the variability in the oDAPE coefficients arises because what mattered most in some cases was the ratio between two coefficients rather than the coefficients themselves.   So, for example, AR and BR vary widely, but only because they vary inversely.   The ratio AR/BR was fairly consistent over multiple runs.  
 
 
IP Logged
aaaa
Forum Guru
*****



Arimaa player #958

   


Posts: 768
Re: Empirically derived material evaluators Part I
« Reply #29 on: Jun 12th, 2007, 2:05pm »

A case can be made both for collapsing the pieces downwards as well as upwards. On one hand, collapsing downwards makes sense from the viewpoint that as the board empties out, pieces and especially rabbits become more valuable simply by existing (i.e. that quantity becomes more important than quality), in which case you would want to minimize the discrepancy in value between a rabbit and the next weakest piece (by normalizing the latter as a cat). On the other hand, by collapsing upwards you won't have the strategically indispensable elephant inflating the base value of lower pieces after piece levels get eliminated.
 
This suggests adding three more systems for testing: collapsing downwards, collapsing upwards and a naive system with no collapsing at all in order to see whether it actually matters that much. I'm guessing it doesn't really.
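
To make the three variants concrete, here is a sketch of the relabelling step as I understand it.  It assumes a piece type counts as "present" if either side still has one, and the function name and dict layout are invented for the example; the optimized point values would then be looked up by the collapsed label.

```python
OFFICERS = "EMHDC"  # strongest to weakest; rabbits are handled separately

def collapse(gold, silver, direction):
    """Map each officer type still on the board to a collapsed label.

    direction="up":   strongest remaining type is labelled E, counting down.
    direction="down": weakest remaining type is labelled C, counting up.
    direction="none": labels are left unchanged.
    """
    present = [p for p in OFFICERS if gold.get(p, 0) + silver.get(p, 0) > 0]
    if direction == "up":
        labels = OFFICERS[:len(present)]
    elif direction == "down":
        labels = OFFICERS[len(OFFICERS) - len(present):]
    elif direction == "none":
        labels = present
    else:
        raise ValueError(direction)
    return dict(zip(present, labels))

# Example: both camels have been traded off.
gold = {"E": 1, "H": 2, "D": 1, "C": 2}
silver = {"E": 1, "H": 1, "D": 2, "C": 0}
for direction in ("up", "down", "none"):
    print(direction, collapse(gold, silver, direction))
```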
IP Logged