||||
Title: Attempt for an Arimaa Positional Evaluator Post by pago on Jan 15th, 2011, 4:16am Hello, I would like to suggest a new kind of positional evaluator based on board control, although I am not very sure that it is valuable. Files are available under the following links: Explanations: http://sd-2.archive-host.com/membres/up/208912627824851423/APE.pdf Calculation sheet (Excel): http://sd-2.archive-host.com/membres/up/208912627824851423/APE.zip When I test it without tree search, it has some successes, such as the bot slayer games or the 2010 World Championship games. However, it also has some complete failures, such as the 2010 games between Chessandgo and Fritzlein. This evaluator uses the HERD material evaluator discussed in another thread, but the idea could probably be used with other material evaluators. http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=devTalk;action=display;num=1283354501;start=45 The principles of this Arimaa Positional Evaluator also have close relations with an old idea of Omar's. on 04/27/05 at 14:45:56, omar wrote:
It is also partially linked to discussions about mobility: on 11/19/06 at 07:40:04, Stanley wrote:
|
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by pago on Feb 24th, 2011, 2:01pm Today I have noticed a very strange behavior of the HERD material evaluator: according to the games I have used to evaluate the efficiency of my evaluators, it gets significantly better results when silver wins than when gold wins. This phenomenon seems to be amplified with the APE positional evaluator. This is very strange because the evaluators are symmetric for gold and silver.

HERD material evaluator (% of correct prevision):
         Opening   Middle    Endgame   Whole
Silver   50.70%    67.48%    88.15%    69.13%
Gold     50.50%    60.78%    80.92%    64.17%

APE positional evaluator (% of correct prevision):
         Opening   Middle    Endgame   Whole
Silver   59.24%    77.46%    91.79%    76.59%
Gold     53.20%    69.06%    77.74%    66.80%

I am currently trying to test these evaluators on the 2011 WC games, with the same results: APE is efficient when silver wins and has bad results when gold wins. I know that I haven't tested the evaluators on a huge quantity of games, but I believe (without being sure) that I have used a sufficient quantity of games to avoid a statistical bias. Does the same kind of phenomenon exist with current high-performing evaluators (FAME, DAPE, etc.)? Does anyone have an explanation for this phenomenon? Or do you think that the set of games I have used is not significant and that the phenomenon would disappear if I used a larger set of games? |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by The_Jeh on Feb 26th, 2011, 4:38pm Is there a typo somewhere in your evaluator code? What exactly does "% of correct prevision" mean? |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by pago on Feb 27th, 2011, 1:59pm on 02/26/11 at 16:38:03, The_Jeh wrote:
My notation may be confusing. If gold wins and the evaluator evaluates that gold has an advantage during the whole game (from 1b to the last move), the % of correct prevision would be 100%. If gold wins and the evaluator evaluates that silver has an advantage during the whole game, the % of correct prevision would be 0%. If the evaluator randomly evaluates a position, the % of correct prevision should be about 50%. So a % of correct prevision higher than 50% may be seen as a (slight) success. A % of correct prevision lower than 50% is a bad result (a random evaluator would be better). On the other hand, the evaluator evaluates the position with a percentage value. Maybe I should have chosen a more classical scale (-infinity, +infinity) to avoid the confusion. |
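(For concreteness, here is a minimal Python sketch of the "% of correct prevision" metric as described in this post; the function name, the 0..1 evaluation scale, and the example numbers are illustrative assumptions, not part of the original evaluator.)

def percent_correct_prevision(evals, gold_won):
    """Fraction of positions where the evaluator favoured the eventual winner.

    evals    -- one evaluation per position from move 1b to the end,
                expressed as gold's winning chances on a 0..1 scale (0.5 = even)
    gold_won -- True if gold won the game, False if silver won
    """
    correct = sum(1 for e in evals if (e > 0.5) == gold_won)
    return 100.0 * correct / len(evals)

# Example: gold wins, evaluator favoured gold in 3 of 4 positions -> 75.0
print(percent_correct_prevision([0.55, 0.60, 0.45, 0.70], gold_won=True))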
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by The_Jeh on Feb 27th, 2011, 2:21pm All right, but let's say gold was theoretically winning until move 40, makes a serious blunder, and then loses on move 50. Your whole-game percent of correct prevision may only be about 20%, even though the evaluator was doing its job properly the entire time. So your endgame measurements may be the only part worth looking at, though even they may suffer from this effect. And do you consider the magnitude of the evaluation? For example, if the evaluation is +.01 for gold the whole game and then silver wins, I would say your prevision would be better estimated at 49.9% rather than 0%. The fact that all moves were evaluated in gold's favor means little given that the magnitude of the advantage was so tiny. This could also affect your measurements. But I don't know why else silver would have a higher prevision, other than that your data set is too small. |
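(One possible reading of this suggestion, as a minimal Python sketch: credit each position with the probability the evaluator assigned to the eventual winner, instead of a binary right/wrong count. The function name, the 0..1 scale, and the example are illustrative assumptions.)

def magnitude_aware_prevision(evals, gold_won):
    """Average probability the evaluator assigned to the eventual winner.

    evals    -- per-position evaluations as gold's winning chances (0..1)
    gold_won -- True if gold won the game

    A constant 50.1%-for-gold evaluation in a game that silver wins scores
    about 49.9%, rather than the 0% given by the binary correct/incorrect count.
    """
    assigned = [e if gold_won else 1.0 - e for e in evals]
    return 100.0 * sum(assigned) / len(assigned)

print(magnitude_aware_prevision([0.501] * 50, gold_won=False))  # ~49.9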
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by pago on Feb 28th, 2011, 1:57pm on 02/27/11 at 14:21:51, The_Jeh wrote:
I agree with you. The only proper way to measure the efficiency of an evaluator is to implement it in a bot and look at its results (but I am not able to do that). In the meantime, I try to compare it against strong players' games (ratings higher than 2000), assuming that there will not be too many blunders and that, over a sufficient number of games, the winner had a winning position more than half of the time. It is probably not very satisfying, but I didn't find a better way. A good solution could be to use a game database assessed by Arimaa experts (i.e., experts would have evaluated which side had the advantage from move 1b to the last move). But I don't think that such a database exists. |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by Sconibulus on Feb 28th, 2011, 7:14pm Well, even experts can be wrong; we've seen violent swings due to 'strong' moves without any obvious blunders or 'weak' moves from the other side. |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by UruramTururam on Mar 1st, 2011, 12:29am on 02/28/11 at 13:57:12, pago wrote:
No, absolutely not! It's a common misunderstanding... Implementing an evaluator in a bot would make people play specifically against it, and this would alter the results and make them pretty useless. The proper way is to make a statistical analysis of every game played after the evaluator is developed and see how often and how early it would be able to determine the winner. This is in fact harder to do than making a bot, and doing it properly would require a good statistician. |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by pago on Mar 31st, 2011, 10:50am I have tried to apply the APE positional evaluator to the 2011 World Championship games: http://sd-2.archive-host.com/membres/up/208912627824851423/2011_WC.zip I chose to test APE on all the games between players with a rating higher than 2000 (including Hanzack due to his results), assuming that the number of blunders would be sufficiently low not to disturb the results. The global result is that the APE positional evaluator gets mixed results. Although it is better than random play, it shows neither an obvious improvement in prevision performance nor an obvious decrease compared with the HERD material evaluator. APE shows a worse result than HERD in the opening and the middle game and a better result than HERD at the end of the game. Over the whole game the results are about equal. Strangely, APE and HERD still show significantly better performance when silver wins than when gold wins. APE may have one interesting advantage on one point: from the last 5 moves before the first trade until the end of the game, it gets a better result than the material evaluator. Another small point in APE's favor is that it seems to choose reasonable moves when I test it on the first moves, even without tree search. My global impression is that APE would probably not be very competitive compared with high-performing positional evaluators, but it could play "reasonably" against weak players and could have a different play style than current bots. |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by JimmSlimm on May 12th, 2011, 12:06pm pago, since your evaluator gives the advantage as a percentage, maybe you would get better prevision results if you add certainty to the calculation. For example, a 1% advantage maybe shouldn't weigh much in the final prevision result. |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by Fritzlein on May 12th, 2011, 4:08pm on 05/12/11 at 12:06:57, JimmSlimm wrote:
Root mean square error works well in such cases. One might think it is bad to penalize a 51%-49% prediction with an "error" of 0.51^2 or 0.49^2 depending on the final outcome, especially if the position was "truly" a coin flip at the time of the prediction. However, a more ambitious prediction function which evaluated the same position 99%-1% would get hit with an error of 0.99^2 often enough to make it come out worse, despite sometimes scoring only 0.01^2 error. The least average error will accrue to the formula making the truest prediction. |
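(As a numerical illustration of this point, a small self-contained Python simulation; the setup is an assumption for illustration only: positions that are genuinely 50-50, one predictor that always says 51% for gold and one that always says 99%. The cautious predictor ends up with the lower root mean square error.)

import random

def rmse(predictions, outcomes):
    """Root mean square error of win-probability predictions.

    predictions -- predicted probability that gold wins, per position
    outcomes    -- 1 if gold actually won, else 0, per position
    """
    errs = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return (sum(errs) / len(errs)) ** 0.5

random.seed(0)
# True coin-flip positions: gold wins each with probability 0.5.
outcomes = [1 if random.random() < 0.5 else 0 for _ in range(100000)]

modest = [0.51] * len(outcomes)     # cautious 51%-49% prediction
ambitious = [0.99] * len(outcomes)  # overconfident 99%-1% prediction

print("51% predictor RMSE:", rmse(modest, outcomes))     # ~0.50
print("99% predictor RMSE:", rmse(ambitious, outcomes))  # ~0.70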
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by JimmSlimm on May 12th, 2011, 4:44pm on 05/12/11 at 16:08:35, Fritzlein wrote:
Hmm, OK. Maybe it doesn't matter if the prevision value is high; any evaluator that just gives a different percentage for each position should make a better bot (to get the best available move). Just as long as the sorted move list is in the correct order, I mean: 100% for the best move is still the best, and it doesn't matter if the second best is 99% or 51%, they are still in the same order. |
||||
Title: Re: Attempt for an Arimaa Positional Evaluator Post by pago on May 13th, 2011, 7:09am Following the previous discussions, I am also not sure that my indicator is the most pertinent one. Maybe we could use a mix of different metrics:
- The one I used (% of right prevision).
- Mean Absolute Error (it would be the average of APE's output when silver wins). An evaluation at 49.99% or 50.01% would have almost the same effect as a random prevision.
- RMSE, as suggested by Fritzlein.
- Weighted error, assuming that a right prevision at move one is much more important than a right prevision at the last move (even I am able to say who is winning at the end of the game!). In a game of 50 (gold + silver) moves, move one would get a weight of 50, move two a weight of 49... and the last move a weight of 1 (see the sketch below).

For information, I have found a big problem explaining why APE was not very effective in the opening and the middle game: I used HERD to evaluate the local situation. However, the HERD formula contains a bias that takes into account the number of remaining rabbits. From a global point of view it makes sense (a player without rabbits cannot win the game...), but from a local point of view it is totally silly: HERD evaluates that one rabbit is stronger than an elephant. On the whole board this is true, but locally it is the contrary. The problem was amplified by the fact that the local winner gets everything and the loser nothing (the rabbit locally gets 1 while the elephant gets 0). The result was that when a piece was close to rabbits, the evaluator tended to reverse the advantage.

I am rerunning the evaluation on the 2011 WC games with the bias removed from the HERD formula. I have also introduced another change: now I keep the HERD (without bias) result and take the average over the complete board. This avoids introducing bad errors when the local result is close to 50% but on the wrong side (49.9% instead of 50.1%). On the first 19 games, APE's performance at predicting the result before the first piece is trapped was 48.9% (worse than a random prevision!). With the new formula without the bias and the averaging, the performance is now about 55.9%, which begins to be more interesting... Maybe a more efficient material evaluator (FAME or another?) would get better results with the same kind of idea: locally apply the evaluator everywhere on the board and take a weighted average of the results. My new simulations take time because I use Excel, as usual... |
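(A minimal Python sketch of the weighted-error idea in the list above, under the assumptions that evaluations are gold's winning chances on a 0..1 scale and that the weights run from N for the first evaluated position down to 1 for the last; the function name and the example are illustrative only.)

def weighted_prevision(evals, gold_won):
    """Weighted % of correct prevision: early moves count more than late ones.

    evals    -- per-position evaluations as gold's winning chances (0..1),
                ordered from the first evaluated move to the last
    gold_won -- True if gold won the game

    With N positions, the first gets weight N, the second N-1, ..., the
    last weight 1, so correctly calling the opening matters most.
    """
    n = len(evals)
    weights = range(n, 0, -1)
    correct = sum(w for w, e in zip(weights, evals) if (e > 0.5) == gold_won)
    return 100.0 * correct / sum(weights)

# Example: the evaluator only gets the last half of a 10-position game right,
# so it earns weights 5+4+3+2+1 = 15 out of 55 -> ~27%.
print(weighted_prevision([0.4] * 5 + [0.6] * 5, gold_won=True))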
||||