Arimaa Forum » Bot Development (Moderator: supersamu)
Topic: Global Algebric Material Evaluator (Read 9120 times)
Fritzlein (Arimaa player #706)
« Reply #15 on: Sep 5th, 2010, 9:41pm »

on Sep 5th, 2010, 8:28pm, Rednaxela wrote:
Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots excluded.
 
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 1933, 67.977%, 70.150%, 69.633%, 70.150%, 69.633%, 69.840%

On closer examination, this might be less data than it seems.  I notice that both FAME and DAPE score 70.150% on these 1933 positions, i.e. both were right in exactly 1356 positions.  Did they ever disagree, or did they always pick the same winner?  If they never disagreed, then obviously the dataset was inadequate to discriminate, but it also makes one wonder about the significance of the datasets where they did disagree.
 
Quote:
2nd third, 12116, 73.110%, 74.563%, 75.264%, 75.322%, 75.165%, 74.761%

Not only is there a question of how many positions they disagreed about, but how many distinct games those positions belonged to.  Here DAPE outperforms FAME by 73 positions, but those might have been clustered in a small number of games, especially if the position is counted after each ply for as long as the imbalance exists.  If FAME and DAPE only disagreed about the outcome of seven games, and DAPE was right five times to FAME's two, that doesn't allow you to conclude DAPE is better, any more than flipping a coin seven times and getting five heads allows you to conclude that the coin is unfair.  It's not statistically significant in either case.
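(For concreteness: the chance of a split at least as lopsided as five-to-two, if the two evaluators were really equally good, is easy to compute.  A minimal Python sketch of this sign test; purely illustrative, not tied to anyone's actual test harness:)

from math import comb

def sign_test_p(wins, n):
    # Two-sided sign test: probability of a win/loss split at least
    # this lopsided when both sides are equally likely to be right.
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Five "heads" in seven disagreements:
print(sign_test_p(5, 7))  # 0.453125 -- nowhere near significant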
 
I believe that IdahoEv tried to count the occurrence of imbalances without weighting an imbalance more if it persisted longer.  That is to say, if an imbalance of M for HD persisted for ten turns, whereas an imbalance of M for HC persisted for five turns, he didn't count the former as twice as meaningful.
 
I was going to point out that GAME was the worst of the six material evaluators in the first third and the middle third of the most important dataset, but now I think I would need to drill down further in the detail before being sure that the HvH over-2000 dataset says anything reliable at all.  I casually said earlier that it wouldn't be enough data, and now I am wondering again whether I chanced to be correct.

Rednaxela (Arimaa player #4674)
« Reply #16 on: Sep 5th, 2010, 10:23pm »

on Sep 5th, 2010, 9:41pm, Fritzlein wrote:

On closer examination, this might be less data than it seems.  I notice that both FAME and DAPE score 70.150% on these 1933 positions, i.e. both were right in exactly 1356 positions.  Did they ever disagree, or did they always pick the same winner?  If they never disagreed, then obviously the dataset was inadequate to discriminate, but it also makes one wonder about the significance of the datasets where they did disagree.

Very good point there! I just modified my code to make note of this. The result is the following for the 2000+ HvH results:
 
FAME and DAPE disagreed about: 14/1933 positions in first third, 246/12116 positions in second third, 400/18632 positions in the final third.
 
on Sep 5th, 2010, 9:41pm, Fritzlein wrote:

Not only is there a question of how many positions they disagreed about, but how many distinct games those positions belonged to.  Here DAPE outperforms FAME by 73 positions, but those might have been clustered in a small number of games, especially if the position is counted after each ply for as long as the imbalance exists.  If FAME and DAPE only disagreed about the outcome of seven games, and DAPE was right five times to FAME's two, that doesn't allow you to conclude DAPE is better, any more than flipping a coin seven times and getting five heads allows you to conclude that the coin is unfair.  It's not statistically significant in either case.
 
I believe that IdahoEv tried to count the occurrence of imbalances without weighting an imbalance more if it persisted longer.  That is to say, if an imbalance of M for HD persisted for ten turns, whereas an imbalance of M for HC persisted for five turns, he didn't count the former as twice as meaningful.

 
Those disagreements were in 3/590 games in the first third, 32/590 games in the second third, 61/590 games in the final third.
 
I disagree that the weighting should be removed, though; I think the more often a position occurs, the more meaningful it is. Anyway, that gives me another idea... Perhaps a more interesting statistic than "how often the material evaluator guesses the winner" would be "how often the material evaluator thinks the eventual winner improved their position, NOT counting transient positions"?
 
 
Also, for the sake of amusement, I did the following queries:
 
Fritzlein vs anyone above 1700 rating (including bots)
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 2102, 74.833%, 71.931%, 73.644%, 70.504%, 73.644%, 71.789%
2nd third, 12143, 81.767%, 81.405%, 82.220%, 81.273%, 82.311%, 81.364%
3rd third, 18408, 91.802%, 91.504%, 91.960%, 91.270%, 92.177%, 91.178%
Total, 32653, 86.978%, 86.488%, 87.159%, 86.216%, 87.315%, 86.280%
 
Chessandgo vs anyone above 1700 rating (including bots)
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 1401, 53.605%, 53.176%, 53.819%, 51.320%, 53.819%, 52.534%
2nd third, 6928, 72.113%, 69.140%, 71.579%, 69.703%, 71.897%, 69.847%
3rd third, 10805, 83.341%, 81.333%, 82.332%, 81.212%, 82.943%, 81.175%
Total, 19134, 77.098%, 74.856%, 76.351%, 74.856%, 76.811%, 74.976%
 
Fritzlein vs Chessandgo (only 45 games)
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 121, 61.157%, 63.636%, 63.636%, 57.851%, 63.636%, 61.157%
2nd third, 872, 60.321%, 60.436%, 63.073%, 59.633%, 63.303%, 59.174%
3rd third, 1407, 77.257%, 73.703%, 75.124%, 72.921%, 77.114%, 72.281%
Total, 2400, 70.292%, 68.375%, 70.167%, 67.333%, 71.417%, 66.958%
 
Apparently GAME and DAPEeo are by FAR the best predictors for the final third of a Fritzlein vs Chessandgo match... with DAPEeo being by far the best at predicting over the course of the whole game... But I'm not sure how much it means at this sample size...
pago (Arimaa player #5439)
« Reply #17 on: Sep 6th, 2010, 7:56am »


Thank you Rednaxela for these very interesting analyses.
I still need some time to read them carefully.
 
At first sight, I am happy to see that GAME is not so ridiculous compared with the other evaluators, although it has to be improved, as Fritzlein demonstrated in previous replies.
 
 
FYI, I think I may have succeeded (at least partially) in fixing the issues noted by Fritzlein with just a little change to the GAME formula.
 
With this new formula, M would be slightly better than HC but slightly worse than HD at first trade. I know that this is not totally consistent with the consensus, which would say that M is worth HD. However, it is already an improvement (GAME "says" that M < CC!).
In addition, I still need to verify its behavior in a lot of positions, as I did for GAME, before posting it on the forum.
aaaa (Arimaa player #958)
« Reply #18 on: Sep 6th, 2010, 9:23am »

on Sep 6th, 2010, 7:56am, pago wrote:
With this new formula, M would be slightly better than HC but slightly worse than HD at first trade. I know that this is not totally consistent with the consensus, which would say that M is worth HD.

Eh, last I recall, the general opinion is, if anything, indeed that the initial value of a camel lies between a horse+cat and a horse+dog. In fact, from personal experience, I've definitely gotten the impression that my bot tends to do pretty well after giving up its camel for the latter combination as a first trade.
Rednaxela (Arimaa player #4674)
« Reply #19 on: Sep 6th, 2010, 5:04pm »

@pago: Interesting. I'll be curious to see what the change to GAME is.
 
 
 
Well, I wasn't satisfied with my earlier evaluation and came up with another metric to try:
- Define a "capture sequence" as any set of captures, bordered on each side by at least one turn of no material change.
- What percentage of "capture sequences" are in favor of the eventual winner, according to each material evaluator?
 
Now, I'm aware this metric has the flaw that it gives equal weight to big changes and to small ones, but I'm unsure what would be a fair way to fix that. Perhaps use the consensus of all the evaluators to judge how important each "capture sequence" is? (A sketch of the bookkeeping follows below.)
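In rough Python, the bookkeeping looks like this (a minimal sketch: the per-turn material list, the evaluate function, and the winner encoding are placeholders, not the code I actually ran):

def capture_sequences(material_by_turn):
    # Group consecutive turns with material changes into "capture
    # sequences", bordered on each side by at least one unchanged turn.
    seqs, start = [], None
    for t in range(1, len(material_by_turn)):
        changed = material_by_turn[t] != material_by_turn[t - 1]
        if changed and start is None:
            start = t - 1                  # state just before the captures
        elif not changed and start is not None:
            seqs.append((start, t))        # state once the dust has settled
            start = None
    return seqs

def favorable_fraction(material_by_turn, winner, evaluate):
    # Fraction of capture sequences that an evaluator scores as moving
    # the material balance toward the eventual winner.  winner is +1 for
    # Gold, -1 for Silver; evaluate returns a positive-for-Gold score.
    seqs = capture_sequences(material_by_turn)
    good = sum(1 for a, b in seqs
               if (evaluate(material_by_turn[b])
                   - evaluate(material_by_turn[a])) * winner > 0)
    return good / len(seqs) if seqs else None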
 
These are the results:
 
Game must be rated and end in rabbit victory. Both players rated over 1700. Games with bots included.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 13107, 64.004%, 64.202%, 64.378%, 64.019%, 64.309%, 64.187%
Phase1, 40798, 75.813%, 77.188%, 77.259%, 77.094%, 76.702%, 76.815%
Phase2, 68341, 82.562%, 84.380%, 84.598%, 84.454%, 83.556%, 83.702%
Total, 122246, 78.320%, 79.816%, 79.981%, 79.807%, 79.205%, 79.311%
 
Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots included.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 1640, 64.085%, 64.451%, 64.817%, 64.634%, 64.756%, 64.634%
Phase1, 4949, 72.398%, 74.217%, 74.338%, 74.156%, 73.651%, 73.894%
Phase2, 8291, 79.701%, 81.570%, 82.113%, 81.655%, 80.726%, 80.931%
Total, 14880, 75.551%, 77.238%, 77.621%, 77.285%, 76.613%, 76.794%
 
Game must be rated and end in rabbit victory. Both players rated over 1700. Games with bots excluded.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 1014, 57.298%, 57.890%, 57.988%, 57.890%, 57.988%, 57.692%
Phase1, 3823, 66.100%, 68.166%, 68.585%, 68.062%, 67.800%, 67.852%
Phase2, 6476, 73.564%, 76.004%, 76.683%, 76.081%, 75.293%, 75.540%
Total, 11313, 69.584%, 71.732%, 72.271%, 71.740%, 71.210%, 71.343%
 
Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots excluded.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 359, 54.318%, 55.989%, 55.989%, 56.546%, 55.710%, 55.153%
Phase1, 1356, 64.381%, 66.888%, 67.330%, 66.962%, 66.593%, 66.740%
Phase2, 2211, 70.782%, 72.999%, 73.813%, 73.089%, 72.546%, 72.365%
Total, 3926, 67.066%, 69.333%, 69.944%, 69.460%, 68.951%, 68.849%

 
I'm not sure whether this is more relevant to the performance of a bot using the evaluator, but I think it's interesting to compare.
 
1) In the earlier tests, DAPEeo was overall the biggest winner in most cuts of the data; with this new metric, however, DAPEeo and GAME both take a real beating across all divisions. The winners now appear to be plain DAPE and FAMEeo, followed closely by plain FAME. I wonder why such a big shift exists with this different metric...
 
2) I expected the evaluators to agree with each other more about "who benefits from a capture sequence" than about the "who currently has an advantage in material" of the old tests, but clearly I was wrong... Hm.
Fritzlein (Arimaa player #706)
« Reply #20 on: Sep 6th, 2010, 5:17pm »

Hmmm, now GAME is the worst of the six evaluators in all three phases of all four datasets.
 
Let me see if I understand your methodology.  Suppose that I am ahead by a camel and two horses.  I decide to trade one of my horses for two rabbits, on my way to eventual victory.  FAME thinks trading H for RR increased my material advantage, but DAPE and HarLog think the trade made my material advantage less.  Since I won the game, FAME wins on your metric while DAPE and HarLog lose.  Is this correct?
 
If I'm understanding correctly, you are measuring how likely the ultimately winning player is to make trades that each evaluator approves of.  Is that reversing cause and effect?  This new metric says that people who are winning are more likely to make trades that GAME says they shouldn't make than trades that FAME, DAPE, and HarLog say they shouldn't make.  That might be measuring what winning players actually like to do rather than what they should like to do.
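(A toy illustration of how two evaluators can put opposite signs on the same trade.  The numbers are made up for the example, not FAME's or DAPE's real values, and real evaluators also make the answer depend on everything else left on the board:)

# Hypothetical static piece values; A prices a horse above two rabbits,
# B prices it below.
VALUES_A = {"H": 5, "R": 2}
VALUES_B = {"H": 3, "R": 2}

def trade_delta(values, lost, gained):
    # Net change in my material score after a trade.
    return sum(values[p] for p in gained) - sum(values[p] for p in lost)

# I trade one of my horses for two enemy rabbits:
print(trade_delta(VALUES_A, lost=["H"], gained=["R", "R"]))  # -1: A disapproves
print(trade_delta(VALUES_B, lost=["H"], gained=["R", "R"]))  # +1: B approves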
 
In any event, I really appreciate your willingness to slice and dice the data and share the results.  It seems to me that we don't yet have the best material evaluation formula we will ever have.  :)

Rednaxela (Arimaa player #4674)
« Reply #21 on: Sep 6th, 2010, 6:00pm »

Yep, that is indeed the methodology of those latest tests.
 
on Sep 6th, 2010, 5:17pm, Fritzlein wrote:
If I'm understanding correctly, you are measuring how likely the ultimately winning player is to make trades that each evaluator approves of.  Is that reversing cause and effect?  This new metric says that people who are winning are more likely to make trades that GAME says they shouldn't make than trades that FAME, DAPE, and HarLog say they shouldn't make.  That might be measuring what winning players actually like to do rather than what they should like to do.

Very interesting insight! As a test, I ran a couple of queries that, instead of checking whether the evaluators approve of the moves of the winning player, check whether they approve of the moves of one particular player. Just so you know, Fritzlein... it turns out you appear to have a slight bias toward moves that FAME and FAMEeo approve of, whereas 99of9 has a slight bias toward moves that DAPE and DAPEeo approve of. So... I think you're quite right that this method measures what players like to do.
 
I wonder if this suggests that players would do well to discover why GAME gave relatively good results in the endgame in the other tests, yet based on this test seems less aligned with the moves the winning player likes to make.
 
Agreed that we likely don't yet have the best material evaluation we'll ever have. :)
pago (Arimaa player #5439)
« Reply #22 on: Sep 7th, 2010, 1:28am »


I can't give my new evaluator yet, but I may have the beginning of an explanation.
 
Compared to the new evaluator, GAME seems to be biased by the rabbit density.
When the rabbit density tends to 1, the new evaluator tends to GAME.
 
This could explain:
1) why GAME is better in the second or third phase, where there are more piece exchanges than rabbit exchanges;
2) why it is consistent with Fritzlein's feeling that GAME is better in the endgame, if we consider that an endgame is a position with a lot of rabbits and few pieces.
pago (Arimaa player #5439)
« Reply #23 on: Sep 7th, 2010, 7:29am »


Quote:
Eh, last I recall, the general position is, if anything, indeed that the initial value of a camel lies between a horse+cat and a horse+dog.

 
That would be great news for me and my evaluator (not for the camels).
 
FYI, I have performed some tests with the new evaluator and it seems quite robust.
As it is, it addresses two issues:
1) HC < M < HD at first trade
2) It resolves a problem Fritzlein noted a few years ago:
EHHDDCC8R < emhhddcc4r
but EDC8R > emdc4r
 
One unexpected result from this new evaluator is that there would be some "cycles":
ECC3R > ED4R > EDC2R > ECC3R!
ECC2R > ED3R > EDCR > ECC2R!
I wonder whether these make sense to strong (or not so strong) players (i.e. do they agree with these inequalities?)...
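(For anyone who wants to hunt for such cycles systematically: given any pairwise "A beats B" judgment, a depth-first search finds them.  A small self-contained Python sketch; the prefers function here is a stand-in for a comparison between two setups by whatever evaluator you like:)

def find_cycle(setups, prefers):
    # DFS for a cycle in the directed graph whose edge A -> B means
    # prefers(A, B): the evaluator ranks setup A above setup B.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in setups}

    def dfs(node, path):
        color[node] = GRAY
        for other in setups:
            if other != node and prefers(node, other):
                if color[other] == GRAY:          # back edge: found a cycle
                    return path[path.index(other):] + [node, other]
                if color[other] == WHITE:
                    found = dfs(other, path + [node])
                    if found:
                        return found
        color[node] = BLACK
        return None

    for s in setups:
        if color[s] == WHITE:
            found = dfs(s, [])
            if found:
                return found
    return None

# Using the reported inequalities as the relation, the cycle shows up:
edges = {("ECC3R", "ED4R"), ("ED4R", "EDC2R"), ("EDC2R", "ECC3R")}
print(find_cycle(["ECC3R", "ED4R", "EDC2R"],
                 lambda a, b: (a, b) in edges))
# ['ECC3R', 'ED4R', 'EDC2R', 'ECC3R']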
Rednaxela (Arimaa player #4674)
« Reply #24 on: Sep 7th, 2010, 7:53am »

Hmm, interesting.
 
I'm far from a strong player, but personally I'm suspicious of the last link of each of those two cycles. All existing material evaluators disagree with both of those last links, though I won't entirely write them off before checking the gameroom logs. If those matchups have occurred enough times, the results might be interesting.
pago (Arimaa player #5439)
« Reply #25 on: Sep 7th, 2010, 8:29am »

Quote:
I'm far from a strong player, but personally I'm suspicious of the last link of those two cycles

 
A little clarification:
The last inequality does not hold against the other opponents; the average performance of the third setup against the rest of the field is the worst.
What my evaluator suggests is that in a three-way match there would be a cycle (although the effect would be tiny).
jdb (Arimaa player #214)
« Reply #26 on: Sep 7th, 2010, 8:07pm »

on Sep 7th, 2010, 7:29am, pago wrote:

One unexpected result from this new evaluator is that there would be some "cycles":
ECC3R > ED4R > EDC2R > ECC3R!
ECC2R > ED3R > EDCR > ECC2R!
I wonder whether these make sense to strong (or not so strong) players (i.e. do they agree with these inequalities?)...

 
The results from the reduced material tournament I played do not agree with this.
 
ECC2R vs ED3R ended 10-7
ECC2R vs EDCR ended 11-0
ED3R vs EDCR ended 9-1
 
ECC3R vs ED4R ended 8-9
ECC3R vs EDC2R ended 8-2
ED4R vs EDC2R ended 9-2
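 
(How decisive are those scores? Treating each game as an independent coin flip under the null hypothesis that the two setups are equally strong, a quick two-sided binomial check, with the match records above hard-coded:)

from math import comb

def binom_two_sided(wins, games):
    # Two-sided p-value for a win/loss record under a fair-coin null.
    k = max(wins, games - wins)
    tail = sum(comb(games, i) for i in range(k, games + 1)) / 2 ** games
    return min(1.0, 2 * tail)

matches = [("ECC2R vs ED3R", 10, 17), ("ECC2R vs EDCR", 11, 11),
           ("ED3R vs EDCR", 9, 10), ("ECC3R vs ED4R", 8, 17),
           ("ECC3R vs EDC2R", 8, 10), ("ED4R vs EDC2R", 9, 11)]
for label, w, n in matches:
    print(f"{label}: {w}-{n - w}, p = {binom_two_sided(w, n):.3f}")

(Only the 11-0 and 9-1 results come out individually significant at the usual 5% level; the close scores, 10-7 and 8-9, are exactly the kind of splits a fair coin produces.)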
 
 
 
pago (Arimaa player #5439)
« Reply #27 on: Sep 8th, 2010, 10:09am »

I would like to propose a (hopefully) improved evaluator.
 
It is a kind of extension of the GAME evaluator, based on the same ideas.
 
I modestly called it: Global Evaluator of Material / GEM  :P
 
I have rewritten my paper:
http://sd-2.archive-host.com/membres/up/208912627824851423/GEM.pdf
 
For people who would like to try the evaluator, here is an Excel sheet:
http://sd-2.archive-host.com/membres/up/208912627824851423/GEM.xls
 
The main improvements compared with GAME are the following:
1) The camel is valued between DC and DD at first trade (I was too optimistic in one of my previous posts).
2) The switch of advantage during trades emerges naturally from GEM.
 
and now the worry:
Quote:
One unexpected result from this new evaluator is that there would be some "cycles":
ECC3R > ED4R > EDC2R > ECC3R!
ECC2R > ED3R > EDCR > ECC2R!
 
Quote:
The results from the reduced material tournament I played do not agree with this.
 
Nobody is perfect!
My last (but not very credible) hope is that the use of clueless by jdb introduces a kind of bias (for example, if the positional parameters are not accurate, clueless might use some combinations of pieces in a way that is not efficient).
I am aware that it is a dubious hope.
 
 
ocmiente (Arimaa player #3996)
« Reply #28 on: Sep 8th, 2010, 12:11pm »

Whether this evaluation function turns out to be better than the others or not, I am very impressed with the clarity of your paper, the emergent behavior your evaluation demonstrates, and how such a small set of rules succeeds in getting results as accurate as they are.
 
One small detail - there may be a typo on page 22, "ECC2R would win against ecc3r".  I think it should be EDC2R.  
 
I'm skeptical about the cycles, but I would like to think that it could be true.  Having a rock/paper/scissors type of situation in a game can make it more interesting - especially if the choice is visible for everyone to see - that is, with no hidden information.  

aaaa (Arimaa player #958)
« Reply #29 on: Sep 8th, 2010, 12:52pm »

Here is an earlier thread addressing the possibility of material intransitivity.