||||||||||
Title: Global Algebric Material Evaluator Post by pago on Sep 1st, 2010, 10:21am Hello First, I hope that this topic is new. I didn't see a similar idea in the forum, although this topic obviously has a close relation with the thread "(no) absolute score value for pieces". I have written a paper about a new material evaluator that I propose. http://sd-2.archive-host.com/membres/up/208912627824851423/Global_Arimaa_Material_Evaluator.pdf This evaluator has some interesting properties:
- I have designed the evaluator with no parameters other than the number of pieces (yes, I could!)
- The evaluator takes into account ALL possible combinations of pieces
- It is intrinsically holistic. The evaluator deals with the material globally. A piece (or combination of pieces) does not have a predefined value.
- It addresses most of the issues already discussed in the forum (increasing value of the rabbits, change of piece values according to the material balance, etc.)
- It is consistent with jdb's tests with the clueless bot (CR & DCR tests)
- It could have some unexpected links with mathematics.
Before deciding that I am totally mad, please try to read the paper :P I am interested in the friendly feedback that I could get from the community, in particular from bot developers, strong players, people with more mathematical competence than me and… Arimaa game inventors.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by rbarreira on Sep 1st, 2010, 11:49am It is very elegant; maybe Janzert will add it to the evaluators page for people to play around with. If I have some time I may program it into my bot and run some tournaments against my current material evaluation (FAME).
||||||||||
Title: Re: Global Algebric Material Evaluator Post by speek on Sep 1st, 2010, 12:04pm Where is this evaluators page? |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 1st, 2010, 1:37pm on 09/01/10 at 12:04:59, speek wrote:
http://arimaa.janzert.com/eval.html Thanks for freely sharing your material evaluation formula! |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 1st, 2010, 1:56pm My first impression is that GAME is aimed more at endgame evaluation than opening evaluation. This is the opposite of FAME, which performs best in the opening and worst in the endgame. For the first trade, FAME, DAPE, and HarLog all agree that M >> CC. I think every top player would concur as well. In the opening the absence of two cats is merely annoying, whereas the absence of the camel is crippling. The player without the camel has no answer to an elephant-horse attack, which is strong enough to tie down the defensive elephant, leaving the lone camel unopposed on the rest of the board. On a much smaller point, GAME rates having one strong and one weak piece as equal to having two medium pieces, for example MD vs HH. My experience is that having the strong and weak piece is better in almost every situation. Winning the primary fight is more important than winning the secondary fight. Thanks again for sharing your formula.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by jdb on Sep 1st, 2010, 3:40pm Very nice work! Your idea does an amazingly good job. The rest of the tournament results are posted in the other thread. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by rednaxela on Sep 1st, 2010, 6:12pm Very impressive! I really like how elegant it is! Hmm... I wonder if there is a 'missing' factor that would fix the "M >> CC" situation that Fritzlein mentioned without adding numeric constants.... I have a feeling that putting an exponent on the "number of pieces dominated by this piece" factor might work, but that does add a numeric constant to guess or empirically tune. About "M >> CC", I think one other thing perhaps worth noting is that the empirically optimized evaluators all show that loss to be significantly smaller than a single-cat-at-opening loss, whereas all the hand-tuned ones show it as a significantly bigger loss than a single-cat-at-opening loss. Perhaps this suggests that while "M > CC" is true, its objective difference is smaller than what it 'feels' like to players?
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 1st, 2010, 8:00pm on 09/01/10 at 18:12:28, rednaxela wrote:
You suggest that perhaps M for CC feels like a larger advantage than it is. This is a tricky issue, because an advantage that allows me to win a rabbit for nothing could be considered as an advantage of just one rabbit, or it could be considered an infinite (game-winning) advantage. If I have M for CC, it is reasonable for me to play to not lose any pieces ever. I should be able to successfully contest all four traps, denying the opponent any space for capture. I will almost automatically be able to set up a control situation where I eventually win something for nothing. It may take a long time, but if the opponent has no way to threaten me, it doesn't matter how long the control takes to pay off, because then I can reset and make it pay off again and again and again. (OK, maybe it is too strong to insist that I would never have to lose a piece, but at a minimum I could play to never lose a piece without capturing a better piece in compensation, i.e. I would never have to accept an equal trade.) The "empirically optimized" results should be used with caution. There is apparently a correlation between the ability of the players and the value of the strong pieces. Relatively speaking, between weaker players it is more important to have numerous pieces, whereas among stronger players it is more important to have stronger pieces. I believe this is because stronger players have a better understanding of "control" positions, where the player who has lost control is doomed to eventually lose material even though it isn't superficially obvious why. Unfortunately, if one restricts "empirical optimization" to, say, games in which both players were humans rated over 2000, there aren't enough data points to get a good reading on material values. Therefore data from games between intermediate players must be included as well, potentially distorting the results. An extreme example of a useless empirical result is self-play by a randomly-moving bot. For such a player (i.e. one with no strategy), a rabbit is worth more than an elephant, as proven by experiment! Yet nobody supposes the material values for ultra-weak players should inform their values for strong players. On the other hand, it is possible that I am merely stubborn in my adherence to my intuitions in spite of the evidence. Certainly the top players of 2005 over-valued the camel. Our intuitions were wrong. Now we value it somewhat less, but perhaps still too much. Perhaps a camel is truly worth only a cat and a dog. Please, do your best to convince chessandgo of this, so that we can arrange to trade his M for my DC in our next World Championship match. ;) |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by rednaxela on Sep 1st, 2010, 10:59pm on 09/01/10 at 20:00:08, Fritzlein wrote:
Indeed, one must be quite careful with "empirical optimization" and the methodology used. One note is that schemes with fewer constants to optimize take MUCH less data to optimize well. GAME, extended by the exponent I mentioned in my earlier post, would take a relatively tiny number of data points to stabilize. My feeling is that the ideal material evaluation shouldn't need many parameters to tune. Really, all pieces except the rabbit have the same rules, and are only differentiated by what pieces on the board they dominate or are dominated by. I kind of think GAME is on the right track with the elegant way it approaches this. As far as pure material evaluation goes, there are only two things I can think of that feel like 'major omissions' from GAME: 1) the relative weight of rabbits to other pieces, due to their differing rules, and 2) non-linear effects in the number of pieces dominated. It seems to me #1 would take one constant to express, and #2 would take one to two constants to express. A set of constants as small as two is pretty easy to "empirically optimize" with relatively few data points, allowing one to be pickier about the games used. Anyway, since I already have code lying around for processing gameroom data... I should probably run some tests on it, using GAME, older material evaluators, and perhaps a variation on GAME with some tunable parameters... Too late tonight to do that, but I'll probably get something together this week to give it a shot.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 2nd, 2010, 5:10am Thank you for this very friendly feedback.
To rbarreira: I would be very interested in seeing the behavior of GAME in a bot, although I already know that it has to be improved (see the answer to Fritzlein). Unfortunately my competence in software development is worse than my competence at Arimaa (knowing that the latter is worth almost zero).
To Fritzlein (and others): As usual your comments are interesting and challenging. I agree with you when you remark in your two examples that GAME does not have perfect behaviour yet, although I find it astonishing that it is possible to design a (relatively consistent) evaluator taking into account all the combinations of pieces without hand-tuned parameters. My analysis is that although GAME intrinsically takes into account the interactions between piece combinations, it doesn't take into account the relative position of pieces on the board, and in particular the dangerousness of each piece's environment.
GAME(G;s) = F(G;s)/(F(G;s)+F(s;G))
F(G;s) = sum(Fi(G;s)) = Opportunities - Risks
As it is, GAME considers that the relative value of a cat and a camel in the middle of the enemy army equals the relative value of a cat and a camel staying quietly at home. This is obviously false. A cat feels that the middle of the enemy army is a very dangerous environment. It has to watch the enemy elephant, camel, horses and dogs (risks). It has no time to fight against enemy cats or to attack rabbits (opportunities). It would like to stay at home until the situation becomes quieter. On the contrary, a camel feels that the middle of the enemy army is not so dangerous. It has only one risk (the elephant) and a lot of opportunities. It can stay in the middle of the enemy army (as far as possible from the elephant). To summarize, F(G;s) should be weighted by the dangerousness of each piece's environment, so that the relative value between a cat and a camel would increase with dangerousness. The initial setup is a fairly dangerous situation (although it is less dangerous than the middle of the enemy army), so the relative value between a camel and a cat should be greater there. In the endgame the dangerousness decreases more for a cat than for a camel, so the relative value between a camel and a cat should decrease. I have some ideas for designing a positional evaluator based on GAME which could fix the identified issues. F(G;s) would be replaced by F(G;s;setup) = sum(Fi(G;s;ri;ci)), where ri is the rank and ci is the column of the square occupied by piece i. This evaluator would have the following additional emergent properties (keeping in mind that I want to avoid hand-tuned parameters):
a) Fi(G;0;ri;ci) increases toward the center of the board (pieces tend to seek the center)
b) Fi(G;s;ri;ci) increases with the proximity of weaker pieces (pieces tend to attack weaker pieces)
c) Fi(G;s;ri;ci) decreases with the proximity of stronger pieces (pieces tend to avoid stronger pieces)
d) For the rabbit, Fi(G;0;ri+1;ci) > Fi(G;0;ri;ci) (when there is no risk, an advanced rabbit is better)
e) Fi(G;0;ri+1;ci) > Fi(G;0;ri;ci+1) and Fi(G;0;ri+1;ci) > Fi(G;0;ri;ci-1) (when there is no risk, a rabbit should head for the 8th rank as directly as possible)
f) The relative weights between a, b, c, d, e would not be predefined (similarly, the relative values of pieces are not predefined in GAME; they emerge from the material balance).
The elephant would feel a tension between a) and b) (between centralization and attack). A camel would feel a tension between a), much b) and a little c). A cat would feel a tension between a), a little b) and much c). A rabbit would feel a tension between a little a), much c), a little d) and a little e). With such an evaluator, cats and camels would feel the influence of the enemy, and so the relative value between them would increase thanks to b) and c). Maybe it is a little too ambitious, but I will try…
||||||||||
Title: Re: Global Algebric Material Evaluator Post by speek on Sep 2nd, 2010, 12:16pm Pago, now you've described some of what I've been thinking about - evaluating pieces based on proximity to those they can dominate (= good) and proximity to those who can dominate them (= bad). Something to consider, though, is that if A dominates B, it is MOST valuable if A's strength is only 1 greater than B's, as opposed to 2+ greater. I.e., it is better to use the elephant to dominate a camel than a cat. In this way, the inverse of the difference between piece strengths should, I think, feed into your equations.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by jdb on Sep 2nd, 2010, 2:52pm I have a question. In the phase where the points are calculated based on the duels, is this an equivalent formulation? The points each piece gets is 16 - the number of stronger enemy pieces. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 4th, 2010, 7:27am [Quote from jdb, Sep 2nd, 2010, 9:52pm: In the phase where the points are calculated based on the duels, is this an equivalent formulation? The points each piece gets is 16 - the number of stronger enemy pieces.] Yes, it is equivalent. I wrote §5.1 (matrix calculation) using this equivalence.
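For readers following along without the paper, here is a minimal C++ sketch of this duel-points reading of GAME. It rests on two assumptions taken from this thread rather than from the paper itself: each piece scores 16 minus the number of stronger enemy pieces, F(G;s) is the sum of those scores over a side's pieces, and the evaluation is F(G;s)/(F(G;s)+F(s;G)). The function names and the strength encoding are mine; the paper's exact normalization and additional phases may well differ.

#include <cstdio>
#include <vector>

// Strength encoding (my choice): 1 = Rabbit, 2 = Cat, 3 = Dog, 4 = Horse, 5 = Camel, 6 = Elephant.

// Duel points for one piece, as described above: 16 minus the number of stronger enemy pieces.
static int duelPoints(int strength, const std::vector<int>& enemy) {
    int stronger = 0;
    for (int e : enemy)
        if (e > strength) ++stronger;
    return 16 - stronger;
}

// Assumed reading of GAME: F(G;s) is the sum of duel points over a side's pieces,
// and the evaluation is F(G;s) / (F(G;s) + F(s;G)); 0.5 means balanced material.
static double gameScore(const std::vector<int>& gold, const std::vector<int>& silver) {
    double fGold = 0.0, fSilver = 0.0;
    for (int p : gold)   fGold   += duelPoints(p, silver);
    for (int p : silver) fSilver += duelPoints(p, gold);
    return fGold / (fGold + fSilver);
}

int main() {
    std::vector<int> lostCamel = {6,4,4,3,3,2,2, 1,1,1,1,1,1,1,1};  // full army minus M
    std::vector<int> lostCats  = {6,5,4,4,3,3,   1,1,1,1,1,1,1,1};  // full army minus CC
    std::printf("M-for-CC trade, from the camel-less side: %.4f\n",
                gameScore(lostCamel, lostCats));
    return 0;
}

Whether this matches the paper exactly, pago would have to confirm; it is only meant to make the thread's description concrete.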
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 5th, 2010, 8:28pm I just ran some tests now, on all gameroom data from 2006 to present (1GB when uncompressed). Then I split the results based on 1) rating threshold, 2) bots included or not, and 3) which "phase" of the game it is, divided into thirds. The percent value represents "percent of turns with non-equal material where the winner was correctly guessed by the material evaluator", and the count is the number of such turns in that data sample. Sorry it's hard to read. The forum software in use here seems to not handle its table tags that well, so I couldn't use those.

Game must be rated and end in rabbit victory. Both players rated over 1700. Games with bots included.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 98369, 65.119%, 64.754%, 65.058%, 64.512%, 65.100%, 64.780%
2nd third, 378126, 77.409%, 77.597%, 78.036%, 77.332%, 78.035%, 77.691%
3rd third, 505682, 86.753%, 86.783%, 87.149%, 86.715%, 87.136%, 86.822%
Total, 982177, 80.989%, 81.040%, 81.429%, 80.879%, 81.425%, 81.099%

Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots included.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 11266, 69.466%, 68.676%, 69.270%, 68.116%, 69.253%, 68.454%
2nd third, 46259, 75.579%, 76.147%, 76.547%, 76.242%, 76.530%, 76.450%
3rd third, 63295, 85.929%, 86.146%, 86.448%, 86.027%, 86.550%, 86.136%
Total, 120820, 80.431%, 80.689%, 81.055%, 80.611%, 81.101%, 80.779%

Game must be rated and end in rabbit victory. Both players rated over 1700. Games with bots excluded.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 5326, 66.147%, 67.574%, 67.405%, 67.443%, 67.443%, 67.462%
2nd third, 34116, 74.716%, 75.070%, 76.404%, 75.152%, 76.404%, 75.378%
3rd third, 52590, 85.959%, 85.246%, 86.326%, 85.362%, 86.454%, 85.372%
Total, 92032, 80.645%, 80.451%, 81.553%, 80.540%, 81.628%, 80.631%

Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots excluded.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 1933, 67.977%, 70.150%, 69.633%, 70.150%, 69.633%, 69.840%
2nd third, 12116, 73.110%, 74.563%, 75.264%, 75.322%, 75.165%, 74.761%
3rd third, 18632, 85.793%, 84.811%, 85.804%, 84.650%, 86.174%, 84.537%
Total, 32681, 80.037%, 80.144%, 80.940%, 80.334%, 81.114%, 80.043%

It seems to show a few things, including the 'eo' variants being stronger even when excluding bots and players not rated over 2000. One interesting thing is that GAME seems to have issues judging early trades in human-only games. Overall, though, GAME seems to perform fairly well. EDIT: Interestingly, it looks like for the first third of the match GAME is always the best predictor when bots are involved, but it starts to suffer both later in the game and when it's human-only. The other interesting thing is that in the early game for human-only games, FAME and DAPE appear to outperform FAMEeo and DAPEeo, but in the late game the 'eo' variants come out on top. Then in the games with bots, the 'eo' variants seem to always come out on top. Any thoughts, either on GAME or on these results in general? :)
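For anyone wanting to reproduce this kind of counting, here is a rough sketch of the loop as I understand the description above: walk each game's turns, and on every turn with non-equal material check whether the evaluator's sign agrees with the eventual winner, bucketed by thirds of the game. The Game/Position structs, the Evaluator signature and scoreEvaluator are placeholders of mine, not the actual test code.

#include <cstdio>
#include <functional>
#include <vector>

struct Position { /* material of both sides; details omitted */ };
struct Game {
    std::vector<Position> turns;
    bool goldWon;
};

// An evaluator returns > 0 if gold is ahead on material, < 0 if silver is, 0 if equal.
using Evaluator = std::function<double(const Position&)>;

// Percent of turns with non-equal material where the eventual winner was guessed
// correctly, split into thirds of each game as in the tables above.
void scoreEvaluator(const std::vector<Game>& games, const Evaluator& eval) {
    long correct[3] = {0, 0, 0}, counted[3] = {0, 0, 0};
    for (const Game& g : games) {
        const std::size_t n = g.turns.size();
        for (std::size_t i = 0; i < n; ++i) {
            const double v = eval(g.turns[i]);
            if (v == 0.0) continue;                        // skip materially equal turns
            const int phase = static_cast<int>(3 * i / n); // 0, 1 or 2
            ++counted[phase];
            if ((v > 0.0) == g.goldWon) ++correct[phase];
        }
    }
    for (int p = 0; p < 3; ++p)
        std::printf("third %d: %ld/%ld = %.3f%%\n", p + 1, correct[p], counted[p],
                    counted[p] ? 100.0 * correct[p] / counted[p] : 0.0);
}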
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 5th, 2010, 9:22pm Very interesting analysis. Thanks for sharing. It strikes me that the optimized versions of FAME and DAPE do worse at predicting in the opening third of the game. Why wouldn't they do better in every phase? Probably it is because there are more material imbalances later in the game, so if equal weight is given to every position with a material imbalance, then late-game positions will have a greater impact on the optimizations. One reason to be suspicious of the optimized versions of FAME and DAPE was that perhaps they were over-fitting the data at hand. However, they have done quite well for themselves across the data from the several years since they were calculated. This is true predictive power, and I am impressed by it, particularly in the case of over-2000 HvH matches. In line with my reasoning about control positions earlier in this thread, it makes sense that stronger pieces would relatively lose their value as the game progresses, because it becomes harder and harder to play a control game. I hand-tuned FAME mostly with early-game positions in mind, so it isn't surprising that it performs increasingly poorly later in the game. It was clever of you to divide the data into thirds as you did, because that provides the greatest insight of all. Whenever someone suggests that my material intuitions are out of whack, I imagine the imbalance occurring on a relatively quiet, fluid board position. However, later in the game the positions tend to be messier, with more pieces strategically committed, more rabbits advanced, and more traps contested. Perhaps my intuitions are just fine in the positions I usually imagine, but are inaccurate when some factor typical of later games is present. I am not sure precisely which factor this would be, but I will now be more vigilant for late-game positional factors that change the value of a material imbalance. Thanks again for doing the spadework and sharing your results.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 5th, 2010, 9:41pm on 09/05/10 at 20:28:25, Rednaxela wrote:
On closer examination, this might be less data than it seems. I notice that both FAME and DAPE score 70.150% on these 1933 positions, i.e. both were right in exactly 1356 positions. Did they ever disagree, or did they always pick the same winner? If they never disagreed, then obviously the dataset was inadequate to discriminate, but it also makes one wonder about the significance of the datasets where they did disagree. Quote:
Not only is there a question of how many positions they disagreed about, but how many distinct games those positions belonged to. Here DAPE outperforms FAME by 73 positions, but those might have been clustered in a small number of games, especially if the position is counted after each ply as long as the imbalance exists. If FAME and DAPE only disagreed about the outcome of seven games, and DAPE was right five times to FAME being right twice, that doesn't allow you to conclude DAPE is better any more than flipping a coin seven times and getting five heads allows you to conclude that the coin is unfair. It's not statistically significant in either case. I believe that IdahoEv tried to count the occurrence of imbalances without weighting an imbalance more if it persisted longer. That is to say, if an imbalance of M for HD persisted for ten turns, whereas an imbalance of M for HC persisted for five turns, he didn't count the former as twice as meaningful. I was going to point out that GAME was the worst of the six material evaluators in the first third and the middle third of the most important dataset, but now I think I would need to drill down further into the detail before being sure that the HvH over-2000 dataset says anything reliable at all. I casually said earlier that it wouldn't be enough data, and now I am wondering again whether I chanced to be correct.
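(For scale on the coin-flip point: with a fair coin, the chance of at least five heads in seven flips is (C(7,5)+C(7,6)+C(7,7))/2^7 = (21+7+1)/128, roughly 23%, so a 5-2 split between two formulas is entirely consistent with chance.)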
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 5th, 2010, 10:23pm on 09/05/10 at 21:41:04, Fritzlein wrote:
Very good point there! I just modified my code to make note of this. The result is the following for the 2000+ HvH results: FAME and DAPE disagreed about: 14/1933 positions in first third, 246/12116 positions in second third, 400/18632 positions in the final third. on 09/05/10 at 21:41:04, Fritzlein wrote:
Those disagreements were in 3/590 games in the first third, 32/590 games in the second third, and 61/590 games in the final third. I disagree that that weighting should be canceled, though. I think the more often a position occurs, the more meaningful it is. Anyway, that gives me another idea.... Perhaps a more interesting statistic than "how often the material evaluator guesses the winner" would be "how often the material evaluator thinks the eventual winner improved their position, NOT counting transient positions"? Also, for the sake of amusement I did the following queries:

Fritzlein vs anyone above 1700 rating (including bots)
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 2102, 74.833%, 71.931%, 73.644%, 70.504%, 73.644%, 71.789%
2nd third, 12143, 81.767%, 81.405%, 82.220%, 81.273%, 82.311%, 81.364%
3rd third, 18408, 91.802%, 91.504%, 91.960%, 91.270%, 92.177%, 91.178%
Total, 32653, 86.978%, 86.488%, 87.159%, 86.216%, 87.315%, 86.280%

Chessandgo vs anyone above 1700 rating (including bots)
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 1401, 53.605%, 53.176%, 53.819%, 51.320%, 53.819%, 52.534%
2nd third, 6928, 72.113%, 69.140%, 71.579%, 69.703%, 71.897%, 69.847%
3rd third, 10805, 83.341%, 81.333%, 82.332%, 81.212%, 82.943%, 81.175%
Total, 19134, 77.098%, 74.856%, 76.351%, 74.856%, 76.811%, 74.976%

Fritzlein vs Chessandgo (only 45 games)
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
1st third, 121, 61.157%, 63.636%, 63.636%, 57.851%, 63.636%, 61.157%
2nd third, 872, 60.321%, 60.436%, 63.073%, 59.633%, 63.303%, 59.174%
3rd third, 1407, 77.257%, 73.703%, 75.124%, 72.921%, 77.114%, 72.281%
Total, 2400, 70.292%, 68.375%, 70.167%, 67.333%, 71.417%, 66.958%

Apparently GAME and DAPEeo are by FAR the best predictors for the final third of a Fritz vs Chessandgo match... with DAPEeo being by far the best at predicting over the course of the whole game... But I'm not sure how much that means with this sample size...
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 6th, 2010, 7:56am Thank you, Rednaxela, for these very interesting analyses. I still need some time to read them carefully. At first sight, I am happy to see that GAME is not so ridiculous compared with the other evaluators, although it has to be improved as Fritzlein demonstrated in previous replies. FYI, I think that I may have succeeded (at least partially) in fixing the issues Fritzlein pointed out, with just a little change in the GAME formula. With this new formula, M would be slightly better than HC but slightly worse than HD at first trade. I know that this is not totally consistent with the consensus, which would say that M is worth HD. However, it is already an improvement (GAME "says" that M < CC!). In addition, I still need to verify its behavior in a lot of positions, as I did for GAME, before posting it on the forum.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by aaaa on Sep 6th, 2010, 9:23am on 09/06/10 at 07:56:52, pago wrote:
Eh, last I recall, the general position is, if anything, indeed that the initial value of a camel lies between a horse+cat and a horse+dog. In fact, from personal experience, I've consciously gotten the impression that my bot tends to do pretty well after giving up its camel for the latter combination as a first trade. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 6th, 2010, 5:04pm @pago: Interesting. I'll be curious what the change to GAME is. Well, I wasn't satisfied with my earlier evaluation and came up with another metric to try:
- Define a "capture sequence" as any set of captures, bordered on each side by at least one turn of no material change.
- What percentage of "capture sequences" are in favor of the eventual winner, according to each material evaluator?
Now, I'm aware this metric has a flaw in that it gives equal weight to big changes and small changes, but I'm unsure what would be a fair way to fix that. Perhaps use the consensus of all the evaluators to judge how important each "capture sequence" is? These are the results:

Game must be rated and end in rabbit victory. Both players rated over 1700. Games with bots included.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 13107, 64.004%, 64.202%, 64.378%, 64.019%, 64.309%, 64.187%
Phase1, 40798, 75.813%, 77.188%, 77.259%, 77.094%, 76.702%, 76.815%
Phase2, 68341, 82.562%, 84.380%, 84.598%, 84.454%, 83.556%, 83.702%
Total, 122246, 78.320%, 79.816%, 79.981%, 79.807%, 79.205%, 79.311%

Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots included.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 1640, 64.085%, 64.451%, 64.817%, 64.634%, 64.756%, 64.634%
Phase1, 4949, 72.398%, 74.217%, 74.338%, 74.156%, 73.651%, 73.894%
Phase2, 8291, 79.701%, 81.570%, 82.113%, 81.655%, 80.726%, 80.931%
Total, 14880, 75.551%, 77.238%, 77.621%, 77.285%, 76.613%, 76.794%

Game must be rated and end in rabbit victory. Both players rated over 1700. Games with bots excluded.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 1014, 57.298%, 57.890%, 57.988%, 57.890%, 57.988%, 57.692%
Phase1, 3823, 66.100%, 68.166%, 68.585%, 68.062%, 67.800%, 67.852%
Phase2, 6476, 73.564%, 76.004%, 76.683%, 76.081%, 75.293%, 75.540%
Total, 11313, 69.584%, 71.732%, 72.271%, 71.740%, 71.210%, 71.343%

Game must be rated and end in rabbit victory. Both players rated over 2000. Games with bots excluded.
Game Phase, Count, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 359, 54.318%, 55.989%, 55.989%, 56.546%, 55.710%, 55.153%
Phase1, 1356, 64.381%, 66.888%, 67.330%, 66.962%, 66.593%, 66.740%
Phase2, 2211, 70.782%, 72.999%, 73.813%, 73.089%, 72.546%, 72.365%
Total, 3926, 67.066%, 69.333%, 69.944%, 69.460%, 68.951%, 68.849%

I'm not sure if this is more relevant to the performance of a bot using it or not, but I think it's interesting to compare. 1) In the earlier tests, DAPEeo was overall the biggest winner in most cuts of the data; however, with this new metric, DAPEeo and GAME both take a real beating across all divisions. The winners now appear to be normal DAPE and FAMEeo, followed closely by normal FAME. I wonder why such a big shift exists with this different metric... 2) I expected the evaluators to agree with each other more about "who benefits from a capture sequence" than about the "who currently has an advantage in material" of the old tests, but clearly I was wrong... Hm.
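A sketch of how the capture-sequence metric defined at the top of this post could be counted, using the same kind of placeholder Game/Position/Evaluator types as in the earlier sketch (this is my reading of the definition, not the actual test code): group consecutive capture turns into one sequence, compare the evaluator's score on the quiet turns just before and just after it, and count the sequence as a hit if the swing favours the eventual winner.

#include <functional>
#include <vector>

struct Position { /* material of both sides; details omitted */ };
struct Game {
    std::vector<Position> turns;
    std::vector<bool> captureOnTurn;   // true if material changed on this turn
    bool goldWon;
};
using Evaluator = std::function<double(const Position&)>;  // > 0 means gold ahead

// Fraction of capture sequences (maximal runs of capture turns, bordered by quiet
// turns on both sides) whose material swing favours the eventual winner.
double captureSequenceScore(const std::vector<Game>& games, const Evaluator& eval) {
    long hits = 0, total = 0;
    for (const Game& g : games) {
        const std::size_t n = g.turns.size();
        std::size_t i = 0;
        while (i < n) {
            if (!g.captureOnTurn[i]) { ++i; continue; }
            const std::size_t start = i;
            while (i < n && g.captureOnTurn[i]) ++i;   // advance past the run of captures
            if (start == 0 || i >= n) continue;        // need a quiet turn on both sides
            const double swing = eval(g.turns[i]) - eval(g.turns[start - 1]);
            if (swing == 0.0) continue;
            ++total;
            if ((swing > 0.0) == g.goldWon) ++hits;
        }
    }
    return total ? static_cast<double>(hits) / total : 0.0;
}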
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 6th, 2010, 5:17pm Hmmm, now GAME is the worst of the six evaluators in all three phases of all four datasets. Let me see if I understand your methodology. Suppose that I am ahead by a camel and two horses. I decide to trade one of my horses for two rabbits, on my way to eventual victory. FAME thinks trading H for RR increased my material advantage, but DAPE and HarLog think the trade made my material advantage less. Since I won the game, FAME wins on your metric while DAPE and HarLog lose. Is this correct? If I'm understanding correctly, you are measuring the likelihood of the ultimately winning player to make trades that each evaluator approves of. Is that reversing cause and effect? This new metric says that people who are winning are more likely to make trades that GAME says they shouldn't make than they are likely to make trades FAME, DAPE, and HarLog say they shouldn't make? That might be measuring what winning players actually like to do rather than measuring what they should like to do. In any event, I really appreciate your willingness to slice and dice the data and share the results. It seems to me that we don't yet have the best material evaluation formula we will ever have. :) |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 6th, 2010, 6:00pm Yep, that is indeed the methodology of those latest tests. on 09/06/10 at 17:17:27, Fritzlein wrote:
Very interesting insight into it there! As a test I decided to do a couple of queries where, instead of checking whether the evaluators approve of a move by the winning player, I check whether the evaluators approve of the moves of a particular player. Just so you know, Fritzlein... it turns out you appear to have a slight bias toward making moves that FAME and FAMEeo approve of, whereas 99of9 has a slight bias toward making moves that DAPE and DAPEeo approve of. Thus... I think you're quite right that this method measures what players like to do. I wonder if this suggests that players would do well to discover why GAME gave relatively good results in the endgame in the other tests, yet based on this test seems less aligned with the moves the winning player likes to make. Agreed that we likely don't yet have the best material evaluation we'll ever have. :)
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 7th, 2010, 1:28am I can't give my new evaluator yet, but I may have the beginning of an explanation. Compared to the new evaluator, GAME seems to be biased by the rabbit density. When the rabbit density tends to 1, the new evaluator tends to GAME. This could explain two things: 1) GAME is better in the second or third phase of the game because there are more piece exchanges than rabbit exchanges; 2) it is consistent with Fritzlein's feeling (GAME is better in the endgame) if we consider that an endgame is a position with a lot of rabbits and few pieces.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 7th, 2010, 7:29am Quote:
It would be great news for me and my evaluator (not for the camels). FYI, I have performed some tests with the new evaluator and it seems quite robust. As it is, it addresses new issues: 1) HC < M < HD at first trade; 2) it answers a problem Fritzlein remarked on a few years ago: EHHDDCC8R < emhhddcc4r but EDC8R > emdc4r. One unexpected result from this new evaluator is that there would be some "cycles": ECC3R > ED4R > EDC2R > ECC3R! ECC2R > ED3R > EDCR > ECC2R! I wonder if this makes sense to strong (or not so strong) players (i.e. do they agree with these inequalities?)...
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 7th, 2010, 7:53am Hmm, interesting. I'm far from a strong player, but personally I'm suspicious of the last link of those two cycles. Both of those last links are disagreed with by all existing material evaluators, though... I won't entirely write it off before checking the gameroom logs. If it's occurred enough times, the result from it might be interesting. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 7th, 2010, 8:29am Quote:
A small clarification: the last inequality is wrong against the other opponents. The average performance of the third setup is the worst against the other opponents. What my evaluator suggests is that in a three-way match there would be a cycle (although the effect would be tiny).
||||||||||
Title: Re: Global Algebric Material Evaluator Post by jdb on Sep 7th, 2010, 8:07pm on 09/07/10 at 07:29:17, pago wrote:
The results from the reduced material tournament I played do not agree with this.
ECC2R vs ED3R ended 10-7
ECC2R vs EDCR ended 11-0
ED3R vs EDCR ended 9-1
ECC3R vs ED4R ended 8-9
ECC3R vs EDC2R ended 8-2
ED4R vs EDC2R ended 9-2
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 8th, 2010, 10:09am I would like to propose a (hopefully) improved evaluator. It is a kind of extension of the GAME evaluator and is based on the same ideas. I modestly called it: Global Evaluator of Material / GEM :P I have rewritten my paper: http://sd-2.archive-host.com/membres/up/208912627824851423/GEM.pdf For people who would like to try the evaluator, here is an Excel sheet: http://sd-2.archive-host.com/membres/up/208912627824851423/GEM.xls The main improvements compared with GAME are the following: 1) The camel is rated between DC and DD at first trade (I was too optimistic in one of my previous posts). 2) Switches of advantage during trades naturally emerge from GEM. And now the worry: Quote:
Quote:
Nobody is perfect! My last (but not credible) hope is that the use of clueless by jdb induces a kind of bias (for example, if the positional factor parameters are not accurate, clueless might use some combinations of pieces in a way which is not efficient). I am aware that it is a dubious hope.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by ocmiente on Sep 8th, 2010, 12:11pm Whether this evaluation function turns out to be better than others or not, I am very impressed with the clarity of your paper, the emergent behavior your evaluation demonstrates, and how the small set of rules succeeds in getting a result as accurate as it is. One small detail - there may be a typo on page 22, "ECC2R would win against ecc3r". I think it should be EDC2R. I'm skeptical about the cycles, but I would like to think that they could be true. Having a rock/paper/scissors type of situation in a game can make it more interesting - especially if the choice is visible for everyone to see - that is, with no hidden information.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by aaaa on Sep 8th, 2010, 12:52pm Here is an earlier thread addressing the possibility of material intransitivity (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1176168668). |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 9th, 2010, 2:12pm Quote:
Interesting thread. So GEM gives a positive argument for the possibility of material intransitivity, although I think that the cycles GEM has revealed are not convincing. Quote:
Thank you for this friendly reply. Now, physicists say that an elegant theory is not a sufficient criterion for recognizing a good theory. Similarly, GEM has to be tested in different situations. I tried to perform some of them and compare them to jdb's results (thank you jdb for your nice job, which is a reference for me), but Excel has some limitations! Rednaxela's comparisons between evaluators are a much more valuable test. I would also be highly interested if someone would try to implement it in a bot. I believe that it is the best test for an evaluator, and I would be curious to see how GEM would perform. Quote:
I now have an explanation that seems more credible to me. GEM evaluates a material balance, but jdb tests the real results of matches between setups. However, a material evaluator is not a complete result predictor, although there is a strong correlation. In particular, GEM estimates the power balance between pieces without taking into account that the goal of Arimaa is taking the last rabbit or reaching the last rank with a rabbit. GEM evaluates material as if the goal were simply to take the greater number of pieces. In general this does not have a great impact, except when only one or two rabbits remain, as in the cycles given in the paper. I expect that other examples of intransitivity could exist with a greater number of rabbits. In that case (if I am right) they would be much more convincing.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 10th, 2010, 1:01pm on 09/06/10 at 18:00:49, Rednaxela wrote:
Ah, nice, that was a very interesting test to run. I'm surprised (although I shouldn't be) that DAPE predicts what 99of9 actually likes to do better than FAME does. I guess that when he invented DAPE he was just putting his mouth where his money is. :) |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by tize on Sep 12th, 2010, 4:55am @Rednaxela: May I ask you to include a static material evaluator in your tests? We all know that the static ones are inferior, but it would be nice to see by how much in the tests you have been running. At least it would be nice for me, as Marwin is using a static one. I can give you Marwin's values if you'd like to run it.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 12th, 2010, 1:27pm on 09/12/10 at 04:55:09, tize wrote:
Sure, I'm curious about this also. I'd be particularly interested in testing with Marwin's values. I'll also try out GEM at the same time. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by tize on Sep 12th, 2010, 2:32pm OK, then here are Marwin's material values:

const int MaterialRabbit1 = 1150;
const int MaterialRabbit2 = 1300;
const int MaterialRabbit3 = 1600;
const int MaterialRabbit4 = 2150;
const int MaterialRabbit5 = 2700;
const int MaterialRabbit6 = 3450;
const int MaterialRabbit7 = 4200;
const int MaterialRabbit8 = 7600;
const int MaterialCat = 2000;
const int MaterialDog = 2500;
const int MaterialHorse = 3500;
const int MaterialCamel = 5800;
const int MaterialElephant = 10000;

Everything is just summed up, once for gold and once for silver, and then the difference of those values is calculated...
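To make the "just summed up" part concrete, a minimal sketch of such a static evaluation. The Side struct, the function names, and especially the rabbit indexing are my guesses rather than Marwin's code: I read the later exchange in this thread as meaning that the last remaining rabbit is the one priced at 7600 and each additional rabbit uses the next cheaper constant, but the real code may count from the other end.

// Illustrative static material evaluation built from the constants above.
// RabbitValue[8] (7600) is assumed to price a side's last remaining rabbit.
const int RabbitValue[9] = {0, 1150, 1300, 1600, 2150, 2700, 3450, 4200, 7600};

struct Side { int rabbits, cats, dogs, horses, camels, elephants; };

int sideMaterial(const Side& s) {
    int v = 0;
    // With N rabbits, count the N most expensive rabbit constants (assumption, see above).
    for (int k = 9 - s.rabbits; k <= 8; ++k) v += RabbitValue[k];
    v += s.cats      * 2000;    // MaterialCat
    v += s.dogs      * 2500;    // MaterialDog
    v += s.horses    * 3500;    // MaterialHorse
    v += s.camels    * 5800;    // MaterialCamel
    v += s.elephants * 10000;   // MaterialElephant
    return v;
}

int staticMaterialEval(const Side& gold, const Side& silver) {
    return sideMaterial(gold) - sideMaterial(silver);   // positive = gold ahead
}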
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 12th, 2010, 3:41pm on 09/12/10 at 14:32:13, tize wrote:
Would it have any effect to make the value of the eighth rabbit higher or lower? I can imagine that it doesn't matter at all, given that loss by elimination is handled separately, but if it does matter in any way, why not make this value much higher? Is a moderate value a guard against playing under a rule set that allows draws? |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by 722caasi on Sep 12th, 2010, 9:33pm on 09/12/10 at 14:32:13, tize wrote:
So low? I would set the value of an elephant far higher. On the other hand, I suppose the only time it would matter is in the end game, and then it is a reasonable value. Does it cause any problems? |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 12th, 2010, 11:36pm Here are the latest results.

The "How often does the eventual winner have the advantage according to the evaluator" tests: (aka "Accuracy of the guess of who wins")

1700+ rating, bots excluded (1708 games) Counting all turns:
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 5957, 68.189%, 67.030%, 66.577%, 67.937%, 67.836%, 67.786%, 67.870%, 67.819%
Phase1, 35239, 75.740%, 74.962%, 75.090%, 75.431%, 76.793%, 75.487%, 76.784%, 75.731%
Phase2, 49145, 85.921%, 84.999%, 86.145%, 85.496%, 86.526%, 85.630%, 86.662%, 85.622%
Total, 90341, 80.781%, 79.899%, 80.543%, 80.412%, 81.497%, 80.497%, 81.570%, 80.590%

1700+ rating, bots excluded (1708 games) Only counting "quiet position" turns where 1) a capture occurs, and 2) the next turn is not a capture:
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 1014, 57.495%, 56.903%, 56.805%, 57.396%, 57.692%, 57.396%, 57.791%, 57.298%
Phase1, 3823, 71.410%, 71.305%, 71.122%, 71.384%, 72.142%, 71.384%, 72.142%, 71.436%
Phase2, 6476, 86.056%, 85.392%, 86.658%, 85.747%, 86.736%, 85.917%, 86.844%, 85.825%
Total, 11313, 78.547%, 78.078%, 78.732%, 78.352%, 79.201%, 78.450%, 79.272%, 78.405%

2000+ rating, bots excluded (590 games) Counting all turns:
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 2158, 70.575%, 68.443%, 67.933%, 70.065%, 69.648%, 70.204%, 69.648%, 69.741%
Phase1, 12526, 75.267%, 74.557%, 73.551%, 74.900%, 75.675%, 75.627%, 75.571%, 75.100%
Phase2, 17416, 85.450%, 84.101%, 86.001%, 85.083%, 86.007%, 84.899%, 86.386%, 84.773%
Total, 32100, 80.477%, 79.324%, 79.928%, 80.100%, 80.875%, 80.293%, 81.040%, 79.988%

2000+ rating, bots excluded (590 games) Only counting "quiet position" turns where 1) a capture occurs, and 2) the next turn is not a capture:
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 359, 58.496%, 57.382%, 56.546%, 58.217%, 57.939%, 58.774%, 57.939%, 57.939%
Phase1, 1356, 70.870%, 71.313%, 70.428%, 70.723%, 71.460%, 70.944%, 71.386%, 70.870%
Phase2, 2211, 85.346%, 84.080%, 85.889%, 85.391%, 85.798%, 85.075%, 86.251%, 85.301%
Total, 3926, 77.891%, 77.229%, 77.866%, 77.840%, 78.299%, 77.789%, 78.528%, 77.815%

The "How often the evaluator approves of the winning player's trades/captures" tests: (aka "How the winning player likes to play")

1700+ rating, bots excluded (1708 games)
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 1014, 57.002%, 58.087%, 57.298%, 57.890%, 57.988%, 57.890%, 57.988%, 57.692%
Phase1, 3823, 63.955%, 68.193%, 66.100%, 68.166%, 68.585%, 68.062%, 67.800%, 67.852%
Phase2, 6476, 73.116%, 75.664%, 73.564%, 76.004%, 76.683%, 76.081%, 75.293%, 75.540%
Total, 11313, 68.576%, 71.564%, 69.584%, 71.732%, 72.271%, 71.740%, 71.210%, 71.343%

2000+ rating, bots excluded (590 games)
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 359, 54.039%, 55.153%, 54.318%, 55.989%, 55.989%, 56.546%, 55.710%, 55.153%
Phase1, 1356, 62.611%, 67.109%, 64.381%, 66.888%, 67.330%, 66.962%, 66.593%, 66.740%
Phase2, 2211, 69.878%, 72.999%, 70.782%, 72.999%, 73.813%, 73.089%, 72.546%, 72.365%
Total, 3926, 65.920%, 69.333%, 67.066%, 69.333%, 69.944%, 69.460%, 68.951%, 68.849%

I find the following things interesting: 1) Marwin's static material evaluation scores well in predicting the eventual winner, yet scores as relatively dissimilar to the trades/captures winning
players get into. Curious... 2) Compared to GAME, GEM comes up much more similar to how players act, yet didn't show such an improvement in predicting the eventual winner. 3) For predicting the eventual winner, GEM improves upon GAME in the early game, but is worse in the late game. For a finer-grained analysis, I'm now thinking about doing best-fit plots of "material evaluator score" versus "win/loss". With those best-fit plots, calculating the y-axis error (NOT least-squares style) should give a good even-handed way to take the magnitude of the score into account, as opposed to the current boolean "predicted correctly or not". I'm also open to other suggestions of how to analyze the results if anyone has any ideas.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by tize on Sep 13th, 2010, 12:04am on 09/12/10 at 15:41:36, Fritzlein wrote:
If we just look at this as a material evaluator then no, the value of the last rabbit has no effect at all, because both sides always have the last rabbit left whenever the evaluation is called for a position. But the values are used for a lot of things in Marwin e.g. to evaluate capture threats, hostages, and frames. So changing that value will change how he plays when one side only has one rabbit left. I know that I had that value at least a little higher about a year ago. I don't remember exactly why I changed it but I can imagine that it was because he gave away pieces for rabbit threats. on 09/12/10 at 21:33:12, 722caasi wrote:
10000 is enough to make him careful of his own elephant and, if the opportunity arises, to take the opponent's elephant. It would be very rare for him to see that he can trap material that is more valuable than the elephant, but that those trappings are only available if he couldn't save his own elephant. Most likely he could then trap something and pass up on the complete trade.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by tize on Sep 13th, 2010, 12:17am on 09/12/10 at 23:36:11, Rednaxela wrote:
Predicting the winner is probably less important than telling the bot which trades to go after... But honestly I didn't think that the static evaluator would score this well in predicting the winner. This tells me that the static evaluation should eventually be replaced, but the reward for doing it isn't very high.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 13th, 2010, 12:35am on 09/13/10 at 00:17:18, tize wrote:
Agreed, but note that my second set of tests *doesn't* necessarily reflect the best trades to go after, because it essentially just gives a mirror into the current practices of human players, rather than directly measuring what works. Currently, I'm hoping the procedure I described before, using the y-axis error of a best-fit plot, will provide something more meaningful than boolean winner prediction, while avoiding the biases of what current human players do. It is still win-centric rather than trade-centric, but: 1) Unlike the boolean win prediction, it will care about the magnitude of the evaluator output, rather than just the sign. 2) I'm starting to think that the trade-centric approach will always inherently follow current habits of players. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by 99of9 on Sep 13th, 2010, 2:01am I'm glad to read all the analysis in this thread, thank you. I'm also glad that DAPE is still competitive. on 09/10/10 at 13:01:03, Fritzlein wrote:
This is very curious. If only the world championship was decided according to material evaluation formulas! ... then chessandgo wouldn't even get to participate. :-) |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 13th, 2010, 4:04am Thank you, Rednaxela, for these interesting results. Quote:
I am not so surprised by this. As I tried to explain in a previous reply, GEM measures a material balance as if the goal of Arimaa were to take the maximum quantity of enemy pieces (or maybe more precisely as if there were no goal in the game). That is good at the beginning, but at the end it is more important to win the game than to catch the enemy elephant.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 13th, 2010, 11:21am on 09/12/10 at 23:36:11, Rednaxela wrote:
I like this methodology much better than your previous one, because it counts each material state only once. Before, if a material state persisted twenty ply, each evaluator got to predict the winner twenty times, and each was considered right twenty times or wrong twenty times on a single throw of the dice. Intuitively that way of doing things adds noise and makes the results less reliable, whereas your new way of taking only quiet positions counts each imbalance once, removing that noise.
The winners now are:
Phase 0: DAPE, Marwin, FAME
Phase 1: FAMEeo, DAPEeo, GEM
Phase 2: DAPEeo, GAME, FAMEeo
In other words, it's a big mess, with no evaluator clearly best according to your metric. We somehow need a single formula that evaluates trades like strong humans do in the opening, but values smaller pieces and numerical superiority more later in the game. Of course one could arbitrarily draw a line for switching between evaluators, but it would be much more elegant to have a single formula. One thing that occurs to me is that you insisted your games end in goal. Doesn't that slightly bias things in favor of evaluators that like rabbits? In particular, someone who has an army consisting of lots of strong pieces and few rabbits might find it easier to win by immobilization than by goal. I don't see why you shouldn't include wins by immobilization and elimination in your methodology.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 13th, 2010, 11:32am on 09/13/10 at 00:04:13, tize wrote:
Ah, that makes sense. Thanks for explaining. on 09/13/10 at 00:17:18, tize wrote:
This puts you in agreement with David Fotland. It is telling that both you and he were able to code championship bots without dynamic material evaluation. Obviously it isn't as big a deal as we think. Fotland opined that even when a bot was getting the overall material balance wrong, it was still getting most of the trades right, i.e. it would still know that a horse is worth more than a dog. The static evaluation doesn't fail until it comes to trading one strong piece for two smaller ones, which isn't all that common, and anyway we humans are confused about it. The two of you have jointly convinced me that there are more important things to work on for bots, e.g. strategic understanding, and I expect the same holds true for humans as well. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by aaaa on Sep 13th, 2010, 1:02pm on 09/13/10 at 11:21:09, Fritzlein wrote:
It's not just about elegance (http://chessprogramming.wikispaces.com/Evaluation+Discontinuity). |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 13th, 2010, 8:14pm on 09/13/10 at 04:04:04, pago wrote:
Ahh, I see. As a quick note, I did some quick trials and found that unweighted "GEM + GAME" gives higher scores than either one overall, but worse than the better of the two in any given segment of the game. I tried some weighting based on the number of pieces and got slightly better performance still, but nothing that felt worth the inelegance of such melding to me. on 09/13/10 at 11:21:09, Fritzlein wrote:
Well, I didn't think much about elimination, but to me it seemed that immobilization wins are rare and are caused by rather different circumstances, and thus would be more of a noise source than anything. Prompted by your asking this, though, I did a test including the different game results:

2000+ score, no bots, goal ending only (590 games), "quiet position" turns only
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 359, 58.496%, 57.382%, 56.546%, 58.217%, 57.939%, 58.774%, 57.939%, 57.939%
Phase1, 1356, 70.870%, 71.313%, 70.428%, 70.723%, 71.460%, 70.944%, 71.386%, 70.870%
Phase2, 2211, 85.346%, 84.080%, 85.889%, 85.391%, 85.798%, 85.075%, 86.251%, 85.301%
Total, 3926, 77.891%, 77.229%, 77.866%, 77.840%, 78.299%, 77.789%, 78.528%, 77.815%

2000+ score, no bots, goal AND elimination endings only (595 games), "quiet position" turns only
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 366, 58.197%, 57.104%, 56.557%, 57.923%, 57.650%, 58.470%, 57.650%, 57.650%
Phase1, 1381, 70.891%, 71.615%, 70.746%, 70.818%, 71.687%, 71.035%, 71.687%, 71.108%
Phase2, 2242, 85.459%, 84.255%, 86.084%, 85.504%, 85.995%, 85.236%, 86.396%, 85.459%
Total, 3989, 77.914%, 77.388%, 78.065%, 77.889%, 78.441%, 77.864%, 78.666%, 77.939%

2000+ score, no bots, goal/elimination/immobilization endings (607 games), "quiet position" turns only
Game Phase, Count, Marwin, GEM, GAME, FAME, FAMEeo, DAPE, DAPEeo, HarLog
Phase0, 377, 58.355%, 57.294%, 56.764%, 58.090%, 57.825%, 58.621%, 57.825%, 57.825%
Phase1, 1407, 71.073%, 71.784%, 70.860%, 71.144%, 71.855%, 71.357%, 71.855%, 71.429%
Phase2, 2309, 85.881%, 84.712%, 86.488%, 85.925%, 86.401%, 85.665%, 86.791%, 85.881%
Total, 4093, 78.256%, 77.742%, 78.378%, 78.280%, 78.769%, 78.256%, 78.989%, 78.329%

The change to all evaluators and game segments seems essentially uniform, so at the very least, due to their rarity, these endings don't really change the overall picture.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 13th, 2010, 8:38pm on 09/13/10 at 20:14:31, Rednaxela wrote:
Immobilization is a source of wins, not noise! Or do you tell your opponent, when you lose by immobilization, that he didn't really beat you? ;) Quote:
It makes sense that the impact would be small due to the rarity of non-goal results, but I was curious nonetheless. Thanks for re-running the numbers. :) |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Sep 13th, 2010, 8:58pm on 09/13/10 at 20:38:14, Fritzlein wrote:
Hahaha, nah. What I mean by it being "noise" is that I felt that mixing it with goal wins would be too much of an "apples and oranges" comparison. I'm starting to change my mind though. ::)
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 14th, 2010, 2:10pm Quote:
Maybe we could try to introduce a "goal balance" into the equation to take into account the goal of the game and bias the evaluator a little in favour of rabbits. This goal balance could be something like: Balance(G;s;goal) = N6/(N6+n6). When we introduce it into the GEM equation, it simplifies: (N6+n6)*Balance(G;s;goal) = N6, so GEM = (sum(...)+N6)/(sum(Ni+ni)+N6+n6). I have not tried this idea yet and maybe it doesn't work at all.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 15th, 2010, 4:20am Quote:
Result: it doesn't work as it is... (rabbits are favoured too much). The idea seems to work with a small difference: instead of taking the number of rabbits as potential goals, I consider that there are only two goals (one for each side). That is even more consistent with the fact that the match ends when one side has reached its goal. The equations would become: Balance(G;s;goal) = N6/(N6+n6) and HEM = (sum(...)+2*Balance(G;s;goal))/(sum(Ni+ni)+2). I am performing my tests with Excel (!). I'll post the results when they are finished.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by aaaa on Sep 16th, 2010, 5:56pm I just did some tests myself, and I'm afraid I have to conclude that using game data to evaluate or derive material evaluation functions would (still) be of dubious merit, as it seems to lead to a lopsided preference for quantity of pieces over quality that is well outside mainstream opinion.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 17th, 2010, 7:50am Quote:
Quote:
I have rewritten my paper about the evaluator for the third (and probably the last) time. This time I have incorporated the idea I posted in a previous reply to take into account the goal of Arimaa. The pdf file is available under this link: http://sd-2.archive-host.com/membres/up/208912627824851423/HEM.pdf The Excel calculation file is available under this link: http://sd-2.archive-host.com/membres/up/208912627824851423/HEM.xls I called this evaluator HEM / Holistic Evaluator of Material (I am better at finding names than at building efficient evaluators!). The main modifications of the paper are:
- Incorporation of a goal balance
- A paragraph about first-trade comparison added
- A paragraph about the complete dog + cat tournament (72 combinations) added
- A paragraph about intransitivity added
- The paragraph about matrix calculation removed
- Correction of some typos.
I didn't copy all the tournament results into the appendix. They are available in the Excel file. Compared to GEM, the improvements are:
- Increase of the rabbits' relative value when there is an unbalanced number of rabbits (the relative values between major pieces have not been changed). It should keep the GEM advantage in the first third and the GAME advantages in the following thirds.
- The dog tournament results are more consistent with jdb's results.
- Switches of advantage after trades starting from EMHHDDCC4R vs emhhddcc8r occur after HDC trades (GEM needed HDCC).
The biggest potential defect that I haven't fixed is that HEM undervalues M compared with the community consensus. For HEM, DC < M < DD. HEM still foresees that intransitivity should occur. I am now almost convinced that this is not a defect of HEM but on the contrary an improvement compared to other evaluators, although the foreseen cycles are dubious because of the undervaluation of M. @Rednaxela: I would be very interested to see the behaviour of HEM in your result prediction tests (I would also understand if you have no time to test all my ideas!)
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Sep 17th, 2010, 12:25pm on 09/17/10 at 07:50:28, pago wrote:
Improvement? Just because material intransitivities exist in fact, doesn't mean that a system having intransitivities is an improvement over a system that doesn't have them. A system might claim the existence of intransitivities that don't correspond to reality while not detecting ones that do. The relevant question is whether HEM is right or wrong about its evaluations. Again, thanks for sharing your results. Thought experiments like yours keep advancing the state of the art. I wonder whether future Arimaa grandmasters will become convinced, at least partially due to the material formulas under discussion, that our current outlook overvalues material quality and undervalues material quantity in late-game situations. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Sep 18th, 2010, 1:36am Quote:
@Fritzlein : I agree with you. "The relevant question is whether HEM is right or wrong about its evaluations." ... and at this time HEM is not perfect ! (for example its evaluation of the relative value of M is probably wrong). What I tried to say, without subtlety (sorry for my bad English), is that it is not so easy to design a consistent evaluator that foresees intransitivity, and that is an interesting property of HEM (assuming that intransitivity does exist !). Once again I am aware that HEM must be improved, and I share the common opinion about the undervaluation of M. I hope that I am not boring you with a thread that was about one evaluator at the beginning and that I have turned into a thread about an evaluator design process (once again... an unexpected property).
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Oct 1st, 2010, 11:03am Quote:
I believe that I have succeeded in fixing the main weaknesses of the previous evaluators (GEM & HEM). I have called this updated evaluator HERD (Holistic Evaluator of Remaining Duel). I do hope that HERD will be competitive with the current best ones (FAME, DAPE, HarLog etc...) according to rednaxela's criteria (I have good reasons for this hope if I refer to my tests). Here are the links to the files : http://sd-2.archive-host.com/membres/up/208912627824851423/HERD.pdf http://sd-2.archive-host.com/membres/up/208912627824851423/HERD.zip

The main improvements of the HERD evaluator compared with the previous ones are :
1) Evaluation of the relative values of major pieces much closer to the community consensus (for example HD > M > HC)
2) Evaluation of endgames (Cat tournament, Dog tournament and DCR tournament) very close to jdb's results (according to RMSE, MAE & MAPE error estimations).
3) Estimated relative value of a cat compared with a rabbit consistent with the consensus.
4) Estimated relative value of a dog compared with two rabbits consistent with the consensus.
5) Estimated relative value of a horse compared with three rabbits close to the consensus.

The main remaining potential defects or differences with the current community consensus are :
a) Evaluation of MD versus HH at first trade. HERD doesn't comply with the consensus and evaluates that HH > MD at first trade (although it estimates that MD > HH after the trade of one dog). In the same way it evaluates that DD > HC.
b) Evaluation of DCC versus HH. HERD evaluates that DCC > HH at first trade.
c) Relative value of the camel compared with rabbits. HERD evaluates that 4R > M > 3R (I don't know what the consensus would be).

The HERD calculation is based on the same formula as the GEM and HEM formulas, with two modifications :
1) Generalization of the piece hierarchy
2) Introduction of a goal bias different from the HEM goal balance

I have added a few paragraphs to the paper (in particular to compare HERD's behaviour with the current consensus at first trade). As usual, I am very interested in your comments or criticism.
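A small test-harness sketch (my own, not from the HERD paper) showing how the consensus orderings listed above, such as HD > M > HC, can be checked against any candidate evaluator. Here evaluate() is a hypothetical function taking gold and silver piece-count tuples (E, M, H, D, C, R) and returning a score for gold.

E, M, H, D, C, R = range(6)
FULL = (1, 1, 2, 2, 2, 8)   # full starting material for one side

def after_losses(losses):
    # Return gold's counts after removing the pieces in `losses`
    # (a dict of piece index -> number lost) from the full set.
    counts = list(FULL)
    for idx, n in losses.items():
        counts[idx] -= n
    return tuple(counts)

def worse_to_lose(evaluate, loss_a, loss_b):
    # True if losing loss_a from the full setup scores worse for gold than
    # losing loss_b, i.e. the evaluator rates the loss_a pieces as more valuable.
    return evaluate(after_losses(loss_a), FULL) < evaluate(after_losses(loss_b), FULL)

# 'HD > M': losing a horse and a dog should be worse than losing the camel.
# worse_to_lose(my_eval, {H: 1, D: 1}, {M: 1})
# 'M > HC': losing the camel should be worse than losing a horse and a cat.
# worse_to_lose(my_eval, {M: 1}, {H: 1, C: 1})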
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Oct 11th, 2010, 6:52am I would have liked to get more reactions about HERD, even if only to show evidence of its bad behaviour (of course I would prefer the contrary !) :-X Shall I conclude that my HERD behaves like gnus and cannot survive in the Arimaa jungle ?
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Rednaxela on Oct 11th, 2010, 12:35pm Hey, sorry I haven't gotten around to responding myself, I've been a bit busy lately. To me it looks like HERD really is on the right track, at least to being competitive, though I haven't had a chance to do any tests on it or anything. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Oct 11th, 2010, 2:12pm Quote:
Hello Rednaxela, Thank you for your reply. I am feeling less alone :) I hope you will be less busy in the next few days, because I find that your test is an interesting measurement of evaluator behaviour before implementation in bots. Unfortunately I do not have the skills to perform the database queries myself.

For your information, I intend to propose a Positional Evaluator based on HERD in the next few weeks. I have already tested it on the following games and it seems to give quite good winning predictions even when the material is equal (my criteria are the winning prediction in the 1st, 2nd and 3rd parts of the game, the whole game, and the 5 moves before the first exchange) :

136191 : 2010 WC R8 / Tuks vs chessandgo
136706 : 2010 WC R9 / 99of9 vs Fritzlein
136807 : 2010 WC R9 / Adanac vs chessandgo
137490 : 2010 WC R10 / Adanac vs 99of9
137854 : 2010 WC R10 / chessandgo vs Fritzlein
138929 : 2010 WC R11 / Fritzlein vs chessandgo
140605 : 2010 AC R1 / Adanac vs bot_marwin
140750 : 2010 AC R1 / Arimabuff vs bot_marwin
141378 : 2010 AC R2 / Tuks vs bot_marwin

The originality of the evaluator is that it is totally blind: it doesn't search for blockades, hostages, trap threats, goal threats etc... but even so it seems to be quite a good predictor of the winning side. Of course, this evaluator would have to be completed by an efficient tree search and by a goal search to have a chance of being competitive in a bot. I intend to perform other tests before trying to share results on the forum.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Oct 17th, 2010, 9:17pm on 10/11/10 at 06:52:05, pago wrote:
In my undergraduate math department there was a Professor Mayer who was particularly good at refuting purported proofs, so that other professors came to him to have their ideas checked. They taught us various methods of proof, for example proof by induction and proof by contradiction, but the method that sticks out most in my mind was "proof by Mayer".

1. Submit a conjecture to Professor Mayer.
2. He will generate a counter-example showing your conjecture to be false.
3. Modify your conjecture to exclude Professor Mayer's counter-example.
4. Go to step 1.

If on any iteration Professor Mayer fails to produce a counter-example, you may publish your conjecture as having been proven true. ;)
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Oct 18th, 2010, 5:35am Quote:
Sometimes I have difficulty telling the difference in English between simple humor, subtle irony and "friendly" advice (I feel that your reply was a mix of all of them). Please be indulgent, Fritzlein. I do try not to pollute this forum (I wouldn't like to be classified in the troll category). Each time I have "published" one idea, I had tried at least 10 or 20 ridiculous ideas at home before. I also tried to perform all the tests that I was able to do with Excel (it is very frustrating not being able to program)... And remember that when you proposed the FAME evaluator in 2005, your Professors Mayer were 99of9 and jdb. Today my professors are Rednaxela, Fritzlein, aaaa, jdb... : I am flattered by the quality and the quantity :P Now, the trial/refutation process is not such a bad way to make progress... and I am waiting for your counter-example and/or Rednaxela's evaluation... before publishing a new idea...
||||||||||
Title: Re: Global Algebric Material Evaluator Post by Fritzlein on Oct 18th, 2010, 1:20pm on 10/18/10 at 05:35:26, pago wrote:
Yes, teasing out the difference between humor, irony, sarcasm, and simpler motives for sharing is difficult even among native speakers of the same language. When there is some language barrier it is even more difficult, as below: on 10/18/10 at 05:40:10, pago wrote:
Are you just being silly to talk about hiding and secrets? Or do you have a subtle agenda? If you aren't employing simple humor, what is it you want to say? Quote:
Goodness, I don't think you are trolling or polluting at all. I thanked you three times in this thread for sharing your ideas. I was quite sincere; your deliberations on this matter are a contribution to the Arimaa community, a gift of your thoughts and your writings. Quote:
Thus you have entirely understood the subtlety of my analogy. When the "professor" stops producing counterexamples, what does it say about the quality of the latest "proof"? Let me be explicit so that humor doesn't get in the way of communication: it only means that the "professor" has run out of energy. It doesn't mean that the last idea is good or bad, right or wrong, worthy or unworthy. on 10/11/10 at 06:52:05, pago wrote:
No. In all seriousness, without humor, irony, or sarcasm, that is not what you should conclude. |
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on Oct 19th, 2010, 4:15am Quote:
It was misplaced humor, but it expressed a real feeling : currently I am using your games, Chessandgo's games and some other strong players' games from WC 2010 to test a partial positional evaluator, and I find it really very pleasant to replay them, even if I probably understand only a tiny part of the reasons behind the moves. Quote:
Understood. In any case, thank you for your previous comments about the evaluator. They were valuable and helped me to (hopefully) improve it. In addition, when one looks at the old threads of the forum, one can find other very interesting comments from you and other players about the relative values of the pieces.
||||||||||
Title: Re: Global Algebric Material Evaluator Post by JimmSlimm on May 4th, 2011, 8:04am I tested GAME on games from 2011-03-01 to 2011-05-01 where both players are rated 1500+ and the win is by goal, time or elimination. 170353 evaluations were made in 3131 games:
GAME: 77.95%
Then I tested a very simple evaluator, which only counts the number of material units left. 152883 evaluations on the same 3131 games: 79.96%
This is odd; it seems like only considering the number of material units is better than GAME, which seems very unlikely imo... Could someone try the same thing?
edit: never mind, the second eval only seems better because it didn't evaluate as many moves, just the more obvious ones
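For anyone who wants to try the same thing, here is a rough sketch of the kind of accuracy count described above. It is only a guess at the methodology, not JimmSlimm's actual program, and evaluate() is a hypothetical material evaluator returning a score above 0.5 when gold is considered ahead.

def prediction_accuracy(games, evaluate):
    # games: iterable of (positions, winner) pairs, where positions is a list of
    # (gold_counts, silver_counts) material states seen in the game and winner is 'g' or 's'.
    hits = 0
    total = 0
    for positions, winner in games:
        for gold, silver in positions:
            score = evaluate(gold, silver)
            if score == 0.5:          # skip dead-even evaluations
                continue
            predicted = 'g' if score > 0.5 else 's'
            if predicted == winner:
                hits += 1
            total += 1
    return hits / total if total else 0.0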
||||||||||
Title: Re: Global Algebric Material Evaluator Post by JimmSlimm on May 8th, 2011, 6:31pm pago, first of all, thanks for GAME! I love it and I am using it in my bot as the base for evaluation. I would like to test HERD; the trouble is I don't understand how it works. I read the pdf and I got the Excel file, but in the Excel file the evaluation for HERD is always 50%; the other evaluators recognize the changes, but not HERD. Any ideas? I am using OpenOffice for the Excel file. edit: also, if it would help you in your research, I could make a program that reads a big text file with game data and writes a new .txt which only says how much of each kind of material is left after each turn or something. Maybe it would make your testing easier somehow?
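Something along those lines could look like the sketch below (my own guess, not JimmSlimm's actual program). It assumes the steps are in standard Arimaa notation, where a capture is recorded as a step ending in 'x' (for example Mc6x); the surrounding file format is left out.

from collections import Counter

START = Counter({'E': 1, 'M': 1, 'H': 2, 'D': 2, 'C': 2, 'R': 8,
                 'e': 1, 'm': 1, 'h': 2, 'd': 2, 'c': 2, 'r': 8})

def material_after_each_turn(turns):
    # turns: list of strings, each one whole turn such as 'Ed4n Ed5n Mc6x'.
    # Yields a Counter of the pieces still on the board after each turn.
    remaining = START.copy()
    for turn in turns:
        for step in turn.split():
            if step.endswith('x'):    # a trailing 'x' marks a captured piece
                remaining[step[0]] -= 1
        yield remaining.copy()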
||||||||||
Title: Re: Global Algebric Material Evaluator Post by pago on May 13th, 2011, 11:23am on 05/08/11 at 18:31:28, JimmSlimm wrote:
Thank you very much for your comments... And I would be happy if some of these ideas were implemented in a bot...

1) For the Excel file, I assume that you directly tried to change the number of pieces. Try changing the colored values instead (yellow for gold and blue for silver); the Excel formula is linked to these values. Note that you must "take R1 before R2, C1 before C2, D1 before D2, H1 before H2...".

2) The formula implemented in the Excel file is the one given in the .pdf at the end of page 7 (it is not easy to recognize it just by reading the Excel file, because I cut it into little parts). HERD is basically a kind of generalization of GEM (reading the GEM.pdf file should help in understanding HERD.pdf). The main difference is that HERD assumes that the loss of the 2nd horse is worse than the loss of the first one, but the basic idea of duels is identical. I also tried to bias the result according to the number of remaining rabbits, to take into account the fact that one loses when one has no more rabbits.

3) I have no real idea why GAME's results were disappointing compared to a simple evaluator. Maybe the defects of GAME that I tried to fix in the following evaluators (such as HERD) were too important.
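To illustrate the "R1 before R2" convention in point 1, here is a tiny helper of my own (not code from the Excel file, and assuming, purely for illustration, that each piece slot holds a 0/1 presence marker) that turns plain piece counts into per-slot values, always filling the low-numbered slots first.

SLOTS = {'E': 1, 'M': 1, 'H': 2, 'D': 2, 'C': 2, 'R': 8}

def slot_flags(counts):
    # counts: dict like {'E': 1, 'M': 0, 'H': 1, 'D': 2, 'C': 2, 'R': 6}.
    # Returns flags such as {'H1': 1, 'H2': 0}, so a single remaining horse
    # is always entered as H1, never H2.
    flags = {}
    for piece, n_slots in SLOTS.items():
        for i in range(1, n_slots + 1):
            flags['%s%d' % (piece, i)] = 1 if i <= counts.get(piece, 0) else 0
    return flags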
||||||||||
Title: Re: Global Algebric Material Evaluator Post by JimmSlimm on May 13th, 2011, 4:15pm pago, GAME is better than the simple evaluator I made; I just made a mistake/bug in my program when testing it the first time
||||||||||