Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> General Discussion >> Whole History Ratings
(Message started by: Janzert on Apr 8th, 2008, 7:03pm)

Title: Whole History Ratings
Post by Janzert on Apr 8th, 2008, 7:03pm
Since there have been a number of discussions here about various rating systems, I thought people would find this paper interesting.

Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength (http://remi.coulom.free.fr/WHR/)

Abstract: Whole-History Rating (WHR) is a new method to estimate the time-varying strengths of players involved in paired comparisons. Like many variations of the Elo rating system, the whole-history approach is based on the dynamic Bradley-Terry model. But, instead of using incremental approximations, WHR directly computes the exact maximum a posteriori over the whole rating history of all players. This additional accuracy comes at a higher computational cost than traditional methods, but computation is still fast enough to be easily applied in real time to large-scale game servers (a new game is added in less than 0.001 second). Experiments demonstrate that, in comparison to Elo, Glicko, TrueSkill, and decayed-history algorithms, WHR produces better predictions.
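For readers who haven't met the Bradley-Terry model the abstract refers to: it says the probability that player A beats player B depends only on the ratio of the players' strengths, which on the familiar Elo scale becomes a logistic formula. A minimal sketch (the 400-point scale is the standard Elo convention, not something specific to WHR):

```python
def bt_win_prob(r_a, r_b, scale=400.0):
    """Bradley-Terry win probability of A over B on the Elo scale.

    A `scale`-point rating advantage corresponds to 10:1 odds.
    WHR extends this static model by letting each player's rating
    vary over time under a prior on how fast skill can change.
    """
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))
```

For example, a 1900-rated player is predicted to beat a 1500-rated player about 91% of the time (10:1 odds).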

On a side note, the author of the paper, Remi Coulom, is also the author of CrazyStone, one of the current top Go-playing programs.

Janzert

Title: Re: Whole History Ratings
Post by Fritzlein on Apr 9th, 2008, 2:45pm
Thank you for this link, Janzert.  The WHR philosophy is the same as the FRIAR ratings I posted in this thread:
http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1148669732;start=0

except that WHR does it way, way better.  The model of time variance in rating is better, and the approximation algorithm is better.  This paper is what I wanted to do in a rating system, but didn't have the skills to do.

The more I read from Remi Coulom, the more impressed I am.  There are mathematicians who want to use the coolest tool they have, i.e. the heaviest machinery available, but Coulom is interested in what works best, not what is most complicated or most sophisticated.  On the other hand, there are engineers who only care about what works, and therefore innovate only in tiny incremental steps rather than in broad strokes like Coulom.  It's rare to find the best attributes of the theoretician and the best attributes of the practitioner combined in one person, but I think that's what we have here.

The_Jeh, if you are interested in taking your rating systems to the next level, this is a must-read.

Omar, if you want to replace the gameroom ratings with something better, this should be the starting point rather than p8 ratings.  I still believe that it is basically impossible to get accurate ratings with non-learning bots and self-selection of opponents messing everything up.  However, if we just took HvH rated games and fed them into the WHR framework, I think we would get the best ratings we are capable of getting with today's technology.

Title: Re: Whole History Ratings
Post by omar on Apr 9th, 2008, 10:47pm
Wow, you seem really excited about this rating system Karl. I'll have to check out this paper. If you get a chance can you tell us what you like about this system and how it works. It would be useful for anyone who has not read the article; and for me to make sure I understood it. Thanks.

Title: Re: Whole History Ratings
Post by Fritzlein on Apr 11th, 2008, 12:50pm
The idea of rating the whole game history simultaneously is appealing because inaccurate ratings are retroactively corrected.  One could construct many examples where this is beneficial, but I will just give two.

First, suppose that Player A learns to play Arimaa quite well off-line in his local game club, reaching a playing strength of 2100.  He joins the Arimaa server rated 1500 and beats you three in a row.  That costs you about 80 rating points in the current system.  Yes, Player A's gameroom rating is rapidly rising, and before long it will start hovering around 2100.  Nevertheless, you are permanently out those 80 rating points that you didn't deserve to lose.  You paid the price for making his rating more accurate.

In a whole-history rating system, in contrast, once the system figures out that Player A is rated 2100 now, it figures that he probably wasn't rated 1500 in those first few games either.  His skill might have been a little different then, but not much.  Therefore, after the fact, the penalty you suffer for losing to him is reduced.

The more common situation is that a player enters overrated at 1500, and the weak player who beats him (usually a low bot on the ladder) gets too much credit.  WHR fixes that too.

Second, suppose that two players (whom I will call seanmcl and Asubfive) join the server and play games only against each other.  After fifty games they have each won twenty-five, effectively proving that they are of equal skill.  Their ratings both remain at 1500, since they never play anyone else, but really their playing strength has risen to 1800.  WHR can't fix that; nothing can fix it unless they play a variety of opponents.

But now suppose that Asubfive starts playing other people.  His rating will rise to 1800, but only slowly because he is "established" at 1500.  Furthermore seanmcl, despite having proven that he's just as good, won't have a rating increase at all.  Contrast this to the WHR system, where Asubfive's ratings will not be firmly anchored by playing lots of games against a single opponent.  Furthermore when Asubfive starts to get a game record against other players, seanmcl's rating will change too since it is firmly correlated to Asubfive.

Now, Coulom is not the first to use the whole database of games to get the benefits I list above.  The_Jeh's ratings, for example, use all the games at once.  The trouble with The_Jeh's system and others like it is the assumption that the strength of the players remains constant across the rating period.  For example, even though chessandgo has caught or passed me in playing strength, his rating is weighed down by thirty consecutive losses to me from when he was first learning.  Unless The_Jeh leaves those games out entirely (which forfeits the benefits of remembering history), chessandgo will _never_ catch up to me in rating (unless he significantly passes me in skill), because he continues to be punished for the past.  A good rating system must allow the game data to be partially explained by changing skill.

The_Jeh's ratings apply retroactive corrections but don't allow for time-varying playing strength.  Most other systems (e.g. the game room ratings, p8 ratings, aaaa's ratings) allow for time-varying playing strength but don't allow retroactive correction of inaccuracies.  What Coulom does is effectively balance correction of past estimation error while also allowing for change over time.

Title: Re: Whole History Ratings
Post by mistre on Apr 11th, 2008, 1:18pm
"Furthermore when Asubfive starts to get a game record against other players, seanmcl's rating will change too since it is firmly correlated to Asubfive."

So, if I understand you correctly, your game rating will change even when you don't play, but when others that you played play games?

There is nothing inherently wrong with that, but it would take some getting used to.  I guess over time, as you play a large variety of players, the effect would be so small you would barely notice it, but I am just guessing here.

Title: Re: Whole History Ratings
Post by aaaa on Apr 11th, 2008, 1:36pm
The KGS Go Server also has ratings that change without playing.

Title: Re: Whole History Ratings
Post by Fritzlein on Apr 11th, 2008, 2:42pm

on 04/11/08 at 13:18:11, mistre wrote:
So, if I understand you correctly, your game rating will change even when you don't play, but when others that you played play games?

Yes, that's true, and yes, it does take some getting used to.  It's somewhat akin to the tiebreak points in the Swiss tournament: the performance of your opponents after you have played them affects your standing.

I'm curious how much movement there would be in practice, and whether it would disturb me as a player.  My hunch is that a bit of drift wouldn't bother me if in exchange I didn't see static ratings that appear wildly inaccurate.  But maybe I'm trying to strike a bargain that isn't available, since the inaccuracy mostly comes from bots.


on 04/11/08 at 13:36:54, aaaa wrote:
The KGS Go Server also has ratings that change without playing.

Is there a lot of griping about that on KGS, or are the folks there pretty happy with their rating system?

Title: Re: Whole History Ratings
Post by aaaa on Apr 11th, 2008, 3:22pm
Rating systems are of course a big issue in Go as they serve to determine appropriate handicap. One noticeable difference is that one doesn't even get a starting rating until having played a game or two.

Title: Re: Whole History Ratings
Post by lightvector on Apr 11th, 2008, 3:27pm
I play(ed) on KGS as a 1 kyu, although very little for the last six months; I'm quite out of practice.

Like every server, KGS has its complaints, but overall the rating system seems to perform reasonably well. KGS also seems good about detecting and readjusting ratings and rating uncertainties in response to large over- or under-estimations. The current system seems plenty accurate, at least from 10K and stronger: for instance, a 2K playing a 5K can reasonably expect a balanced game with a 3-stone handicap.

However, I get the impression there are complaints about the lack of transparency of the system, and ratings tend to drift rather strangely if you play very infrequently. Also, every year or two, I notice that the server has to rescale all the ratings to compensate for inflation and/or inaccurate scaling, which can occur because handicap games, which provide the cross-rating data, are relatively infrequent.

Also, every once in a while, you get problems with sandbaggers, etc, although there are a few mechanisms to minimize that.

Title: Re: Whole History Ratings
Post by Fritzlein on Apr 11th, 2008, 3:48pm
Thanks for the reviews, guys.  A good Go rating system is in any case much harder than a good chess rating system.  All chess games are played without handicap, but most Go games are played with handicap.  The chess rating scale is calibrated by winning percentage alone, whereas the Go rating scale is calibrated by winning percentage and handicap stones.

The Go rating systems try to solve the problem by saying that one stone of handicap equals X rating points.  But this is bound to be distorted.  A stone of handicap between 5-dan players is immensely important, whereas a stone of handicap between 15-kyu players is essentially irrelevant.  The EGF has at least acknowledged this problem by making a stone handicap worth different Elo points depending on the level of the players, but I think they don't go nearly far enough.

Note that Coulom, in order to validate the superiority of his rating system, tested it only on even Go games.  He didn't even try to accommodate handicap games, so he had only the parameter of winning percentage to worry about.

Title: Re: Whole History Ratings
Post by The_Jeh on Apr 11th, 2008, 4:53pm
Time-varying playing strength doesn't matter with the bots. Really, it is nonsensical for a bot's rating to change at all after it has played hundreds of games. (Well, the exception would be a bot that bases its play on the database or otherwise teaches itself.) There's got to be a way to deduce the bots' strengths relative to each other, and then set their RU to zero. Maybe that would solve some problems until the WHR can be implemented.

Title: Re: Whole History Ratings
Post by omar on Apr 12th, 2008, 2:11am

on 04/11/08 at 16:53:23, The_Jeh wrote:
Time-varying playing strength doesn't matter with the bots. Really, it is nonsensical for a bot's rating to change at all after it has played hundreds of games. (Well, the exception would be a bot that bases its play on the database or otherwise teaches itself.) There's got to be a way to deduce the bots' strengths relative to each other, and then set their RU to zero. Maybe that would solve some problems until the WHR can be implemented.


Yes, I also think that P1 and P2 type bot ratings should be fixed. Additionally, the rating system should be anchored so that random play has a rating of zero. This can be achieved using a random bot which uses a random setup and picks random moves (i.e. enumerates all the possible unique positions that can arise from the current position and randomly selects one).
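A sketch of the anchor bot omar describes. Here `legal_moves` and `result_of` are hypothetical stand-ins for a real Arimaa move generator; the one substantive point is that the choice is uniform over distinct successor positions rather than over moves, since several move sequences can lead to the same position:

```python
import random

def random_bot_move(position, legal_moves, result_of):
    """Play uniformly at random over unique resulting positions.

    legal_moves(position) -> list of legal moves (placeholder).
    result_of(position, move) -> resulting position (placeholder).
    Deduplicating by successor position matches omar's "unique
    positions" phrasing; picking among moves directly would bias
    toward positions reachable in many ways.
    """
    successors = {}
    for move in legal_moves(position):
        # Keep the first move found for each distinct resulting position.
        successors.setdefault(result_of(position, move), move)
    return random.choice(list(successors.values()))
```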

Title: Re: Whole History Ratings
Post by omar on Apr 12th, 2008, 3:40am
I've had a chance to read the paper now. Thanks Karl for writing your views on WHR. I think I understand it now, though not enough to start implementing it.

In the paper the performance of different rating systems is compared by seeing how well they predict the outcomes of games. I'm a little surprised that the numbers are around 55% (in table 1 of the paper); I would have thought they would be higher, at least 60%. I am also surprised that the difference in prediction performance between WHR and Elo is so small: just 0.672%.

In practice I think I'm less concerned about how accurately the rating system can predict outcomes and more concerned about how resistant the rating system is to abuse such as pumping and sandbagging.

Last December I was experimenting with rating systems based on the idea that not all games should be rated equally. The idea is based on your proposal that games against bots should be rated half as much as games against humans (i.e. K/2). If we can use the opponent type to weight the game, then we could also take other factors into consideration, like the opponent's rating uncertainty and your existing record against that opponent. I was able to get results that were similar to the P8 rating system, but using a simple Elo equation; a lot less computation than P8.

So in a situation where your opponent has a much higher rating uncertainty than you, the game does not count as much for you, and so you don't lose or gain many points from it; but for the player with the high uncertainty it counts as usual, and thus the system is able to converge the ratings of new players without causing much disruption to the established players. Also, by looking at the previous record, games can be weighted less and less as the record diverges from 50%; this helps prevent pumping/dumping one's rating by repeatedly winning/losing against the same opponent.
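One hedged reading of omar's weighted-Elo idea as code. The three weight factors below (halving K for bots, discounting games against higher-uncertainty opponents, and discounting lopsided head-to-head records) are illustrative guesses at the scheme he describes, not the formulas he actually used:

```python
def expected(r_self, r_opp, scale=400.0):
    """Standard Elo expected score."""
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_self) / scale))

def weighted_elo_update(r_self, r_opp, score, k=32.0,
                        opp_is_bot=False,
                        u_self=1.0, u_opp=1.0,
                        head_to_head=0.5):
    """One Elo update with per-game weighting (a sketch).

    score: 1 for a win, 0 for a loss.
    u_self/u_opp: rating uncertainties (higher = less established).
    head_to_head: your historical score vs this opponent, in [0, 1].
    """
    w = 1.0
    if opp_is_bot:
        w *= 0.5                                  # K/2 for bot games
    if u_opp > u_self:
        w *= u_self / u_opp                       # discount games vs uncertain opponents
    w *= 1.0 - 2.0 * abs(head_to_head - 0.5)      # discount lopsided records
    return r_self + w * k * (score - expected(r_self, r_opp))
```

With equal 1500 ratings and no discounts, a win moves the winner to 1516; against a bot, only to 1508; with a 100% head-to-head record, the game counts for nothing at all.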

Title: Re: Whole History Ratings
Post by Fritzlein on Apr 12th, 2008, 9:42am

on 04/12/08 at 03:40:59, omar wrote:
In the paper the performance of different rating systems is compared by seeing how well they predict the outcomes of games. I'm a little surprised that the numbers are around 55% (in table 1 of the paper); I would have thought they would be higher, at least 60%.

Probably the number of games predicted correctly is so low because he was only looking at even Go games.  If you were predicting only on games which were coin flips, you couldn't pick the winner more than 55% of the time yourself.  The games which would be easy to predict, i.e. mismatched games, are usually played at handicap on the Go server, and therefore excluded from the data.  For the Arimaa server, where lots of mismatched games take place, I'm sure all the different rating systems could pick the winner over 60% of the time.


Quote:
I am also surprised that the difference in prediction performance between WHR and Elo is so small: just 0.672%.

That small percentage difference between the systems could actually represent a substantial difference in predictive ability.  The difference in score comes entirely from the few games in which the different rating systems pick different players to win.  If one rating system thinks Player A is 60% favored to win, and the other thinks Player A is 70% favored to win, both rating systems will pick Player A to win, and there will be no discrimination between the two, regardless of whether the true winning probability for Player A is 45%, 53% or 89%.  This is the problem you had in the first year of the spectator contest: how do you tell who is the better predictor if everyone bets on the better player?

Not only will Coulom's performance metric not discriminate between rating systems on most of the games, and only show a difference when they think opposite players are favorites, but furthermore even when one system picks Player A to win and the other picks Player B to win, if the true odds are 51% to 49%, then almost half the time the better predictor will be punished and the worse predictor will be rewarded just by chance.  The average gain for being smarter is small even in the games where the metric measures a difference.  Under the circumstances and the metric that Coulom chose, WHR scored a crushing victory over the other rating systems.

I actually think Coulom's metric for showing the worth of his system sucks.  He should have used the formula we previously used in our 2005-2007 spectator contests, namely root-mean-square error of percentage prediction, so that good predictors are discriminated from bad ones by the confidence of their predictions as well as by whom they pick to win.  With this better metric, one can also measure how much error random chance imposes on even the theoretically best predictor, and thus have a scale for judging superior performance, so that, say, cutting the predictive error in half isn't dismissed as a negligible performance gain.
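The difference between the two metrics can be made concrete. Under pick-the-winner scoring, a predictor who says 60% and one who says 90% earn identical credit whenever the favorite wins; root-mean-square error separates them. A sketch (not necessarily the exact spectator-contest formula):

```python
import math

def pick_winner_score(preds, outcomes):
    """Fraction of games where the predicted favorite actually won
    (the style of metric Coulom uses). preds are win probabilities
    for player A; outcomes are 1 if A won, 0 if A lost."""
    hits = sum(1 for p, o in zip(preds, outcomes) if (p > 0.5) == (o == 1))
    return hits / len(preds)

def rmse(preds, outcomes):
    """Root-mean-square error of the predicted win probability,
    which also rewards well-calibrated confidence."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds))
```

If the favorite wins both of two games, predictions of [0.6, 0.6] and [0.9, 0.9] both score 100% under pick-the-winner, but their RMSE is 0.4 versus 0.1: the confident, correct predictor is four times better by the second metric and indistinguishable by the first.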


Quote:
In practice I think I'm less concerned about how accurately the rating system can predict outcomes and more concerned about how resistant the rating system is to abuse such as pumping and sandbagging.

Yes, my primary concern is also limiting the potential for abuse rather than maximizing predictive ability.  However, increasing predictive ability is by nature the way to limit one type of abuse, namely the self-selection of opponents.  Any inaccuracy of prediction is inherently an abuse opportunity.  If the ratings predict that Player A is 70% to win in a match where his actual winning chance is 90%, then that match will gain him rating points on average, and Player A can select that opponent repeatedly to pump his own rating.
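The expected points per game from such a miscalibration are easy to quantify under a standard Elo update (K=32 is just an illustrative value):

```python
def expected_pump_rate(true_p, predicted_p, k=32.0):
    """Average Elo points gained per game when the rating system
    expects a score of predicted_p but the player's true winning
    chance is true_p.

    Derivation: each update is k * (result - predicted_p), and the
    average result is true_p, so the mean gain per game is
    k * (true_p - predicted_p).
    """
    return k * (true_p - predicted_p)
```

In the example above (true chance 90%, predicted 70%), that is +6.4 points per game on average, for as long as the miscalibration persists.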


Quote:
The idea is based on your proposal that games against bots should be rated half as much as games against humans (i.e. K/2).

Weighting bot games half as much would only mean that pumping your rating up against a bot would take twice as many games.  It wouldn't stop you from eventually gaining just as many rating points in the long run.  The alternative that I mentioned to you before, namely reducing the scaling factor from 400 to 200 for games involving a bot, actually cuts in half the number of points a player can pump his rating against a bot, no matter how many times he plays it.  If that's what you are referring to, I'm all in favor of it.  I think we should implement it today within the framework of the current game room ratings.  That would have a huge benefit in minimizing the distortion caused by bots.

Then later on, when we also want to minimize the distortion caused by temporarily inaccurate ratings, we can implement WHR along with a reduced scaling factor for bot games.
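The halving claim follows from where the expected Elo update hits zero: a player keeps gaining points against an opponent he truly beats with probability p until his rating advantage grows to the point where the system itself predicts p, and that equilibrium gap is proportional to the scaling factor. A sketch:

```python
import math

def pump_ceiling(true_p, scale=400.0):
    """Rating advantage at which the expected Elo update reaches zero
    against an opponent you truly beat with probability true_p: the
    most a player can ever pump off that single opponent.

    Solves 1 / (1 + 10 ** (-d / scale)) = true_p for d.
    """
    return scale * math.log10(true_p / (1.0 - true_p))
```

With `true_p = 0.9`, the ceiling is about 382 points at scale 400 but only about 191 at scale 200: exactly half, no matter how many games are played.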


Quote:
If we can use the opponent type to weight the game then we could also take other factors into consideration [...] So in a situation where your opponent has a much higher rating uncertainty than you, the game does not count as much for you and so you don't lose or gain many points from it, but for the player with the high uncertainty it counts as usual and thus is able to converge the ratings of new players without causing much disruption to the established players.

In the current scheme it would definitely be an improvement for games of newcomers to affect the ratings of established players less.  But if you think it is worth tweaking the system so that newcomers mess up the ratings of established players less than they do now, wouldn't you really like to tweak the system so that newcomers mess up the ratings of established players not at all, plus have the benefit that the games are counted fully for both players, so both players have full incentive to try their best?  WHR does what you are trying to do by tweaking the weighting, except it does it better.


Quote:
Also, by looking at the previous record, games can be weighted less and less as the record diverges from 50%; this helps prevent pumping/dumping one's rating by repeatedly winning/losing against the same opponent.

That's an intriguing idea that I haven't fully considered.  We had previously discussed weighting games less when the two players have played each other a lot, regardless of whether one player has won most of them.  You seem to be saying that the games should count less if one player has won most, regardless of how many times they have played.  Intuitively I like the idea of nearly even matchups counting more, but I would have to think more about the possible consequences.  Would that mean that upset victories also don't count for much?


Quote:
I was able to get results that were similar to the P8 rating system, but using a simple Elo equation; a lot less computation than P8.

I agree that the P8 ratings take way too much computation.  So do the FRIAR ratings that I experimented with.  In my past enthusiasm for using the whole history of data, I wasn't sufficiently accounting for the practical problem of needing too much CPU time.  We can't dedicate a whole server just to constantly computing ratings.

But much of the problem was using an inefficient estimation technique.  FRIAR used a successive approximation method that took forever to converge, and from what I understand of your implementation, so does p8.  Coulom claims that WHR ratings converge extremely quickly using Newton's method.  After each game, the ratings of the two players involved can be updated with a quick computation; an approximation involving all players is more time-consuming but not terrible, and can be done at intervals to keep the system calibrated.  Coulom's claim, which I buy into, is that whether using the whole history is too expensive or not is all about the efficiency of the implementation.
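A toy version of the Newton update, for a single player with a fixed (time-invariant) rating and fixed opponent ratings. WHR itself applies the same idea jointly over every player's whole rating history with a time-varying prior, so this only conveys the flavor, not Coulom's actual algorithm:

```python
import math

def newton_rating(games, r0=0.0, iters=20):
    """Maximum-likelihood Bradley-Terry rating of one player via
    Newton's method, holding opponent ratings fixed.

    games: list of (opponent_rating, result), result 1=win, 0=loss.
    Ratings are in natural-log units (multiply by 400/ln(10), about
    173.7, for the Elo scale). The log-likelihood is concave, so
    Newton's method converges in a handful of iterations.
    """
    r = r0
    for _ in range(iters):
        grad = 0.0
        hess = 0.0
        for r_opp, w in games:
            p = 1.0 / (1.0 + math.exp(r_opp - r))  # P(win this game)
            grad += w - p                          # d(logL)/dr
            hess -= p * (1.0 - p)                  # d2(logL)/dr2 (negative)
        if hess == 0.0:
            break
        r -= grad / hess                           # Newton step
    return r
```

For example, three wins and one loss against a 0-rated opponent gives an ML win probability of 3/4, i.e. a rating of ln(3) ≈ 1.099 in natural units (about 191 Elo points).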

There's a difference between rejecting a system because it will require too much CPU after it is implemented, and rejecting it because it looks a little complicated to implement.  Not that the latter reason isn't a valid one; we should just be clear about the reasoning.

Title: Re: Whole History Ratings
Post by omar on Apr 13th, 2008, 4:41pm

Quote:
Would that mean that upset victories also don't count for much?


If you had played this player before and had a good record against him, then the first upset would not count for much; but if you kept losing, eventually your record would get closer to 50% and the games would start counting more. If this was the first time you played this player, and you had no previous record against him, then the upset would count as normal.

I would eventually like to start experimenting with WHR and run it on the rated games from the gameroom and see what kind of numbers it will produce, but it will probably be a while before I get to it. If anyone else is interested, there is a file which has just the rated games from the gameroom up to Dec 2006:
http://arimaa.com/arimaa/download/gameData/ratedgames.zip
http://arimaa.com/arimaa/download/gameData/ratedgames.tgz

I can update it if you need more recent games.

Title: Re: Whole History Ratings
Post by aaaa on Apr 27th, 2008, 1:52pm

on 04/12/08 at 02:11:55, omar wrote:
Yes, I also think that P1 and P2 type bot ratings should be fixed. Additionally the rating system should be anchored so that random play has a rating of zero. This can be achieved using a random bot which has a random setup and and picks random moves (i.e. enumerates all the possible unique positions that can arise from the current position and randomly selects one).

I think that's a very bad idea. My common sense tells me that that would result in an extreme runaway inflation of ratings.
I don't see why there should be any rating anchoring.

Title: Re: Whole History Ratings
Post by omar on May 2nd, 2008, 1:26pm
There was a lot of discussion about anchoring the rating system several years ago. Here is the link to that thread.

http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1065901453;start=0

I guess one could say that even our current rating system is anchored, based on a new player having a rating of 1500 and all the established players' ratings being relative to that. But there is nothing magical about the number 1500; I could have just as easily chosen a new player's rating to be 0, and the current 2000-rated players would just be rated 500. Likewise, if new player ratings were 10000, the current 2000-rated players would be rated 10500. So just by picking some value for the new player rating we are anchoring the rating system to it.
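Omar's point can be checked directly: Elo predictions depend only on rating differences, so shifting everyone by the same constant changes nothing. A two-line demonstration:

```python
def expected(r_a, r_b, scale=400.0):
    """Elo expected score: depends only on the difference r_a - r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

# A 2000-rated player facing a 1500 newcomer gets the same prediction
# whether newcomers start at 1500, at 0, or at 10000.
offsets_agree = expected(2000, 1500) == expected(500, 0) == expected(10500, 10000)
```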

I feel that a rating of 0 should mean random play, because then you are not trying to win or lose. And it's very easy to create such a player for Arimaa and most other games now that we have computers :-)

The other thing to consider is how good is the anchor. A good anchor should be stable and never change over time. Also it should be a good source and sink so that it can cushion disruptions in the system and prevent rating inflation and deflation. I think programs that play with a fixed strength (such as the P1, P2 bots) would be better anchors than new players with strength that can vary significantly.

Ideally it would be nice to have a rating system that is stable across time so that it is reasonable to compare ratings of players from different eras. Also if other games used a similarly scaled rating system with zero as random play then it even becomes possible to compare the complexity of different games. Maybe the ratings for chess would range from 0 to 5000, but the ratings for Go range from 0 to 15000.

The tricky part is how we get from the random bot having a rating of zero up to human players and CC-level bots. You would not want to just put out the random bot for anyone to play against with its rating fixed at zero; everyone would only gain points from it and hardly ever lose points, which I think is why you feel there would be runaway inflation of ratings. But what if the random bot was only used off-line to establish the ratings of some beginner-level bots like ArimaaScoreP1, and then these bots were made available online with fixed ratings? The other non-fixed-rating bots like ShallowBlue, LocP1, etc. would be set up to periodically play the fixed-rated bots. Also, the initial rating of new players would be set based on how new players have performed against the bots in the first level of the ladder.

Of course having a nice anchor for the rating system doesn't prevent players from abusing the rating system. That's a different issue which needs to be dealt with separately.


Title: Re: Whole History Ratings
Post by aaaa on May 20th, 2008, 1:55pm
You have to realistically expect that, in all likelihood, a random-playing bot would have an Elo rating several thousand points below zero by current measures. That would mean that, for a stable system, you would need an enormous number of bots to cover the range between it and the rest of the field, to say nothing of the number of games that would have to be played.

If inflation is really bothering you, one possible solution I can think of, would be to assume, for now, that every new player would have to be a beginner and thus to give him or her the average of the lowest rating ever of every established player. There would, of course, be a considerable risk of deflation then.

Title: Re: Whole History Ratings
Post by Fritzlein on May 21st, 2008, 8:36am

on 05/20/08 at 13:55:11, aaaa wrote:
You have to realistically expect that, by all likelihood, a random playing bot would have an Elo rating of at least several thousands below zero by current measures.

You are not the only one to have this intuition, but it seems not to be upheld by the experiments so far.  (I infer from your post that you did not review the old thread Omar linked in his post directly before yours.)  It seems that a random player would have a rating near zero on the current scale, or anyway not more than a few hundred points below zero.

Now that I re-read that thread myself, I think it would be a good idea to anchor ArimaaScoreP1 at a rating of 1000 (or something) immediately.  I have gradually become convinced that ratings are inflating on the server, including the ratings of fixed-performance bots, and that the culprit is beginners playing only the first part of the bot ladder.

Title: Re: Whole History Ratings
Post by mistre on May 21st, 2008, 10:43am
Here is an idea concerning WHR.  Why not experiment with a human vs human only rating using WHR and keep our original rating separate?

On the bot side, we anchor lower bots and eventually change our original rating to a bot-only rating.  

Does anyone see a problem having 2 different ratings?  We theoretically already have that with P8.  To me, this seems like the simplest solution so that you don't have to worry about mixing vs bot and vs human games together.

Title: Re: Whole History Ratings
Post by omar on Jun 3rd, 2008, 5:34am

on 05/21/08 at 08:36:50, Fritzlein wrote:
I have gradually become convinced that ratings are inflating on the server, including the ratings of fixed-performance bots, and that the culprit is beginners playing only the first part of the bot ladder.


Looking at what is happening to the ratings of fixed-performance bots sounds like a good way to keep tabs on the health of the rating system. So are the fixed-performance bots definitely increasing in rating?


Title: Re: Whole History Ratings
Post by omar on Jun 3rd, 2008, 5:41am

on 05/21/08 at 10:43:39, mistre wrote:
Here is an idea concerning WHR.  Why not experiment with a human vs human only rating using WHR and keep our original rating separate?


I think it is a good idea. Would someone like to do this and post the results? I will link to it alongside the P8 ratings.


Title: Re: Whole History Ratings
Post by Fritzlein on Jun 8th, 2008, 11:16am

on 06/03/08 at 05:34:31, omar wrote:
Looking at what is happening to the ratings of fixed-performance bots sounds like a good way to keep tabs on the health of the rating system. So are the fixed-performance bots definitely increasing in rating?

I queried the game database for the average rating of each bot each year, with a minimum of 30 rated games for the bot-year to count.

Player \ Year       2003  2004  2005  2006  2007  2008
----------------    ----  ----  ----  ----  ----  ----
Aamira2006Blitz        -     -     -  1646  1751  1771
Aamira2006CC           -     -     -  1657  1669  1666
Aamira2006Fast         -     -     -  1685  1693  1735
Aamira2006P1           -     -     -  1227  1314  1284
Aamira2006P2           -     -     -  1518  1575  1627
Arimaalon              -  1167  1240  1217  1364  1323
ArimaaScoreP1          -     -     -  1181  1309  1250
ArimaaScoreP2          -     -     -  1297  1368  1343
Arimaazilla         1516  1419  1449  1451  1502  1542
Bomb2005Blitz          -     -  1876  1856  1931  2048
Bomb2005CC             -     -  1774  1858  1916  1915
Bomb2005Fast           -     -  1827  1826  1930  1869
Bomb2005P1             -     -  1488  1632  1715  1736
Bomb2005P2             -     -  1752  1806  1887  1902
Clueless2005Blitz      -     -  1660  1793  1878  1875
Clueless2005CC         -     -  1621  1777  1807  1822
Clueless2005Fast       -     -  1662  1761  1784  1794
Clueless2005P1         -     -  1645  1662  1656  1636
Clueless2005P2         -     -     -  1750  1760  1762
Clueless2006Blitz      -     -     -  1423  1420  1543
Clueless2006Fast       -     -     -  1602  1683  1627
Clueless2006P1         -     -     -  1688  1716  1704
Clueless2006P2         -     -     -  1705  1652  1711
GnoBot2005Blitz        -     -  1652  1747  1841  1911
GnoBot2005CC           -     -  1535  1661  1600  1627
GnoBot2005Fast         -     -  1541  1724  1734  1772
GnoBot2005P1           -     -  1382  1262  1392  1378
GnoBot2005P2           -     -  1552  1608  1651  1660
Loc2005Blitz           -     -  1602  1571  1568  1704
Loc2005CC              -     -  1419  1539  1508  1518
Loc2005Fast            -     -  1438  1498  1557  1570
Loc2005P1              -     -  1404  1314  1412  1374
Loc2005P2              -     -  1425  1412  1498  1582
Loc2006Blitz           -     -     -  1579  1614  1724
Loc2006P1              -     -     -  1356  1448  1431
Loc2006P2              -     -     -  1585  1599  1644
ShallowBlue            -     -     -  1224  1326  1291


The average year-over-year rating gain for all bots was

2005 to 2006: 49
2006 to 2007: 52
2007 to 2008: 17

which works out to a gain of about 50 points per year, when you consider that we are only a third of the way through 2008 (so the 17 points gained so far project to roughly 50 over the full year).

The average year-over-year rating gain for fixed performance bots only was

2005 to 2006: 3
2006 to 2007: 66
2007 to 2008: 2

which isn't quite as bad, but still noticeable.

The average year-over-year rating gain for Fast and Blitz bots only was

2005 to 2006: 65
2006 to 2007: 52
2007 to 2008: 43

If we look at the total rating increase for speedy bots over two and a third years (160 points) and subtract the total rating increase for fixed-performance bots (71 points of inflation), that gives an increase of about 38 rating points per year due to faster hardware alone.  Taking the rating of chessandgo (2496), subtracting the rating of Bomb2005CC (1916), and dividing by 38, we can forecast that faster hardware alone will allow bots to win the Arimaa Challenge in 15 years, i.e. a couple of years after the Challenge prize expires.  Of course this doesn't take into account new programming techniques or new human strategic discoveries.
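The year-over-year averages above are straightforward to reproduce. A sketch in Python (only three illustrative rows of the table are included here; the real input would be the full table, built from the game database with the 30-game minimum already applied):

```python
# Average year-over-year rating gain across bots, as in the table above.
# Only bots rated in both years count toward a given year pair.
ratings = {
    "Bomb2005CC":  {2005: 1774, 2006: 1858, 2007: 1916, 2008: 1915},
    "Loc2005Fast": {2005: 1438, 2006: 1498, 2007: 1557, 2008: 1570},
    "ShallowBlue": {2006: 1224, 2007: 1326, 2008: 1291},
}

def avg_gain(ratings, y1, y2):
    """Mean rating change from year y1 to y2 over bots rated in both years."""
    gains = [r[y2] - r[y1] for r in ratings.values() if y1 in r and y2 in r]
    return sum(gains) / len(gains)
```

Running this over all rows (and over the fixed-performance or Fast/Blitz subsets) gives the averages quoted above.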

Title: Re: Whole History Ratings
Post by omar on Jun 11th, 2008, 2:57pm
Thanks for posting this data Karl. I had not seen this before my last post requesting some graphs of the bot rating histories. I like this better.

Title: Re: Whole History Ratings
Post by woh on Mar 12th, 2009, 4:24am
I managed to implement the whole history ratings. The first results (http://home.scarlet.be/~woh/whr/whr0901l.htm) are based on all rated games up to January 31st. All HvH, HvB, and BvB games are included.

The time resolution used is 1 second, which means the rating at the time of each game is calculated separately. The article uses a resolution of 1 day, but tests have shown that the difference is not noticeable. For the variability of the ratings over time, 60 Elo²/day was used. Finally, the prior used was 1 game won and 1 game lost against a player with a rating of 1220.319747. At first I used 1 win and 1 loss against a player with a rating of 0, as mentioned in the article, but that gave me ratings in the range of -600 to 1400, and I wanted ratings more comparable with the gameroom ratings. So what to use? 1 win and 1 loss against a player rated 1500, the rating new players used to start with, or 1300, the rating at which new players now start? Then I came up with the idea of choosing the anchor rating such that the final rating of bot_ArimaaScoreP1 is 1000, the rating to which this bot is now fixed in the gameroom. (The rating of bot_ArimaaScoreP1 is not fixed in these whole history ratings; it can vary over time just like the rating of any other player.)
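For readers curious about the machinery: WHR is built on the dynamic Bradley-Terry model, and on the Elo scale a rating gap maps to a win probability via 1/(1 + 10^((rB - rA)/400)). The prior described above can be treated as two extra virtual games against the fixed anchor player. A sketch of the static part of the model (function names are mine; the real WHR additionally links each player's ratings across time through the 60 Elo²/day variance term):

```python
import math

def win_prob(r_a, r_b):
    """Bradley-Terry win probability, on the Elo scale, that a player
    rated r_a beats a player rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# The anchor rating woh chose so that bot_ArimaaScoreP1 ends at 1000.
ANCHOR = 1220.319747

def log_posterior(r, results):
    """Log-posterior of rating r for one player, given results as
    (opponent_rating, won) pairs, with the prior folded in as one
    virtual win and one virtual loss against the anchor."""
    games = list(results) + [(ANCHOR, True), (ANCHOR, False)]
    return sum(math.log(win_prob(r, opp)) if won else math.log(win_prob(opp, r))
               for opp, won in games)
```

WHR then maximizes this (plus the time-variability term) jointly over all players and all points in time.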

For comparison I have plotted the gameroom ratings against the whole history ratings
http://home.scarlet.be/~woh/whr/WHR-GMR.png
as well as the difference between the 2 ratings against the gameroom ratings
http://home.scarlet.be/~woh/whr/GMR-WHR.png.
In both graphs the red square represents bot_ArimaaScoreP1.

For some players I have also plotted the history (http://home.scarlet.be/~woh/whr/history.htm) of their whole history rating together with their gameroom rating. The history of bot_ArimaaScoreP1 is among those included; you can see how its rating varies over time to end at 1000. (The time on the x-axis is in days since September 1st 2002.)

The whole history ratings predict the outcome of a game correctly in 77.2% of the games, an improvement over the gameroom ratings, which predict 72.6% of the games correctly. Fritzlein suggested using the root mean square of the difference between a game's actual result and the result predicted by the ratings to compare the performance of the two systems. By this criterion the whole history ratings, at 0.394, also do better than the gameroom ratings, which score 0.424.
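Both evaluation criteria are easy to state in code. A sketch (my own helper, not woh's tool): for each game, take the rating system's predicted win probability p for one side and that side's actual outcome as 1 or 0.

```python
import math

def score_ratings(games):
    """games: (p, outcome) pairs, where p is the predicted win
    probability for one side and outcome is 1 or 0 for that side's
    actual result.  Returns (prediction accuracy, RMS error)."""
    n = len(games)
    accuracy = sum((p > 0.5) == (o == 1) for p, o in games) / n
    rmse = math.sqrt(sum((o - p) ** 2 for p, o in games) / n)
    return accuracy, rmse
```

The accuracy figure counts a prediction as correct when the favourite wins; the RMS criterion also rewards well-calibrated probabilities, not just picking the right winner.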



on 06/03/08 at 05:41:29, omar wrote:
I think it is a good idea. Would someone like to do this and post the results? I will link to it as well, alongside the P8 ratings.



The second results (http://home.scarlet.be/~woh/whr/whr0901h.htm) are based only on the rated HvH games. Otherwise all parameters are the same. Again, games up to January 31st are included.

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 12th, 2009, 8:30am
This is a fantastic contribution, woh.  I recently commented (http://arimaa.com/arimaa/gameroom/comments.cgi?gid=99568) again on the need for "real" ratings, and the whole history ratings based on human vs. human games are the most real ratings I have seen.

We have seen some systems that improve on the game-room ratings, but no previous offering has been able to simultaneously solve the "isolated group" problem and the "time varying" problem.

Systems which process the game history sequentially are vulnerable to distortion from newcomers playing each other and getting established ratings from each other even though both players are overrated to start.  Later losses by those newcomers to established players don't have as much corrective impact as they should.

On the other hand, systems which process the game history simultaneously are vulnerable to distortion from real changes in skill.  Usually when someone processes the whole game history, I come out rated higher than chessandgo because I won about thirty straight games against him while he was learning.  Those games show that I was a better player then, but are scant evidence that I am a better player now.

Just scanning the results of the human-versus-human whole history ratings, I don't see any obvious distortions, except for the whole scale having been shifted down.  In particular, my early success against chessandgo didn't push my current rating above his.  Also I don't see many players with thin records floating way above the median.  (Although I want to look at how Rabbit got to #22, since I don't recall hearing of him before.)

I propose that the human-vs-human whole history ratings be used to seed tournaments from now on, starting with the Postal Mixer, but most especially that they be used to seed the next World Championship.  Or if WHR comes with an uncertainty interval, then we could seed based on the WHR rating minus the lower uncertainty bound.

A while ago I tried to implement something like the WHR, but my successive approximation took ages to converge, making it computationally infeasible as a substitute for game room ratings.  What are the CPU requirements of WHR?  How quickly can a new game result be added in?  I'm quite interested in WHR being added to the game room if the computation cost is low enough.  Certainly it seems like a better way to spend arimaa.com server CPU than the periodic recalculation of p8 ratings.

Thanks again for coding this up, woh!

Title: Re: Whole History Ratings
Post by omar on Mar 13th, 2009, 6:43am
Great job woh. This is fantastic.

I added a link to your web page on the 'Top Rated Players' page.

I agree with Karl that we should start using the WHR HH ratings for seeding of future tournaments.

Title: Re: Whole History Ratings
Post by Tuks on Mar 13th, 2009, 9:47am
You might want to revise it though; no matter how much convincing you do, you will not convince me that "Rabbit", who happened to win 3 out of 3 human games, has any chance against any of the top players in a non-postal match ;) Other than that, I like it. It presents an actual rating I want to be at the top of.

Title: Re: Whole History Ratings
Post by mistre on Mar 13th, 2009, 11:10am
Great contribution Woh!  Your personal graph looks like the current stock market!  LOL.

Perhaps there should be a minimum number of games needed to be ranked - which would fix the "Rabbit" situation.

Title: Re: Whole History Ratings
Post by omar on Mar 18th, 2009, 7:29pm
Ever since woh posted his results of using WHR on the HH games, I've been tempted to improve my position on that list :-) Which of course means I have to start playing more HH games. I plan to do that more once the events ease up.

Even though I don't care too much about my rating and play experimental games against bots as rated games, I think almost all the rated HH games I've played were taken seriously. Thus the data going into the WHR rating system was pretty good, and the ratings we get out will probably be more accurate. I feel very comfortable with using these ratings to seed human tournaments. Also, a key feature I like about WHR is the ability to retroactively unrate games; I'll explain in a bit why this is so good.

I've started to view our current rating system as an unofficial superficial rating system which simply serves to provide immediate feedback of ratings to new users. Also for people who like to boost their ratings against bots it gives them a longer term goal and initially will help them get better. Eventually they will learn which ratings really matter and find yet another goal to try for.

There is the possibility that people will eventually try to distort their WHR HH ratings also. It would not be that hard to do by creating multiple accounts and sacrificing the ratings of a few accounts to boost the rating of one account. I think we might eventually need to keep a flag with every game that indicates whether the game should be used in WHR rating calculations. That would allow us to retroactively exclude rated games that look suspicious. Eventually we could even mark legitimate rated HvB games for inclusion in WHR.

So woh, how can we get daily updates of our WHR ratings? If you want, I can set it up to run the calculations on the arimaa.com server once a day.

Title: Re: Whole History Ratings
Post by woh on Mar 21st, 2009, 2:04am

on 03/12/09 at 08:30:21, Fritzlein wrote:
Also I don't see many players with thin records floating way above the median.  (Although I want to look at how Rabbit got to #22, since I don't recall hearing of him before.)



on 03/13/09 at 09:47:32, Tuks wrote:
you might want to revise it though, no matter how much convincing you do, you will not convince me "Rabbit" who happened to win 3 out of 3 human games has any chance against any of the top players in a non-postal match


I am not trying to convince anyone here, but purely on Rabbit's HvH results one might arguably think he deserved 22nd place.  One of his wins was against arimaa_master, who is ranked 15th.  Their ratings now give Rabbit a 41% chance of a win over arimaa_master, which might be considered not overrated since he has proven he can do it.


on 03/13/09 at 11:10:47, mistre wrote:
Perhaps there should be a minimum number of games needed to be ranked - which would fix the "Rabbit" situation.


There is no need to exclude players. WHR can address the "Rabbit" issue itself. By changing the number of games won/lost against the fictitious player, it becomes easier or harder for players to move away from the average rating. (The average rating being the rating of the fictitious player.) This has the most impact on the ratings of players with fewer games played. With a prior of 2 games won/lost, Rabbit's rating drops to 1499.7, while the rating of arimaa_master sees only a minor change, to 1714.5. With a prior of 3 games won/lost, their ratings become 1423.8 and 1699.1. Because of the bigger impact on players with fewer games, Rabbit's rank drops from 22nd to 35th and then to 42nd, while arimaa_master stays 15th. The full results are available for 2 wins/losses (http://home.scarlet.be/~woh/whr/Prior2.htm) and 3 wins/losses (http://home.scarlet.be/~woh/whr/Prior3.htm).

I have done some more tests to see how fast the rating of a new player moves up when he wins a number of games against an average player, for different priors. Since it is just a number in the equations, the number of games won/lost in the prior doesn't have to be an integer. First I plotted how far a new player gets against the number of games in the prior, for different numbers of games won.
http://home.scarlet.be/~woh/whr/PriorWhr.png
If the number of games in the prior were zero, the rating would become infinite.  So with values less than 1 the rating increases quickly, while with values greater than 1 a new player stays closer to the average rating.

In a second graph I plotted how far a new player gets against the number of games won, for different numbers of games in the prior.
http://home.scarlet.be/~woh/whr/PriorGmr.png
Here I also added the results for the gameroom ratings. It turns out that the result for the gameroom ratings almost matches that of WHR with a prior of 2 games won/lost. I would suggest using WHR ratings with a prior of 2 games won/lost against a fictitious player. Any thoughts on this?
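Under a deliberately simplified static model (all of a new player's games, real and virtual, played against the anchor-rated average player), the effect in the first graph has a closed form: after n straight wins with a prior of k virtual wins and k virtual losses, the maximum-likelihood rating is anchor + 400·log10((n+k)/k), which diverges as k goes to 0, just as noted above. A sketch under that assumption (WHR itself fits the full history numerically):

```python
import math

def rating_after_wins(n_wins, prior_k, anchor=1500.0):
    """ML rating after n_wins straight wins over the anchor-rated
    average player, given a prior of prior_k virtual wins and prior_k
    virtual losses against that same player.  prior_k = 0 would send
    the rating to infinity."""
    return anchor + 400.0 * math.log10((n_wins + prior_k) / prior_k)
```

A larger prior_k pulls the estimate toward the anchor, which is exactly why a stronger prior makes it harder for a new player to shoot up the list.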


Title: Re: Whole History Ratings
Post by Hannoskaj on Mar 21st, 2009, 3:08am
When reading the long (and interesting) thread on rating inflation/deflation that Fritzlein had pointed me to in another discussion, I was planning to suggest reading Rémi's article, but I see you already have this kind of wholesome reading!

About the choice of the prior, two looks like a good idea from the graphs you have posted, woh. I just wonder what the behaviour would be if you plotted the 2-prior and GR against number of victories + 1 loss, number of victories + 2 losses, etc. Maybe even show the 2D graph, if you can draw it.

Title: Re: Whole History Ratings
Post by Hannoskaj on Mar 21st, 2009, 3:11am
Oh, by the way, I do not think there's a need for the prior to give an integer number of victories and defeats; but it's true the benefit we could get (slightly better fitting what we deem the behaviour should be) is probably not worth making things strange.

Title: Re: Whole History Ratings
Post by woh on Mar 21st, 2009, 4:14am

on 03/21/09 at 03:11:25, Hannoskaj wrote:
Oh, by the way, I do not think there's a need for the prior to give an integer number of victories and defeats; but it's true the benefit we could get (slightly better fitting what we deem the behaviour should be) is probably not worth making things strange.


We could use a non-integer number of games, like 1.5, if the general consensus is that a new player moves away from the average rating too fast with 1 and too slowly with 2.

In fact, the number of games won in the prior doesn't have to be the same as the number of games lost. If they differ, a new player would not start with an initial rating (that is, before he has played a single game) equal to the average rating. This could prove to be an interesting idea: if on average a new player has a 1 in 4 chance of winning against an average player, then maybe we should just use a prior of 1 game won and 3 games lost.

I think the total number of games in the prior dictates how fast a new player moves away from his initial rating, and the distribution of wins and losses determines the initial rating of a new player relative to the average rating. But I would need to do tests to be sure of that. I wish I could spend all my time on this :)
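Under the same static single-opponent simplification used earlier, the second part of this conjecture checks out: the ratio of virtual wins to virtual losses fixes where a brand-new player starts relative to the anchor. A hypothetical helper (my own simplification, not woh's tool):

```python
import math

def initial_rating(prior_wins, prior_losses, anchor=1500.0):
    """Rating implied by the prior alone, before any real games, when
    the prior's virtual games are all against the anchor-rated player.
    A symmetric prior starts a new player exactly at the anchor."""
    return anchor + 400.0 * math.log10(prior_wins / prior_losses)
```

For example, a 1-win/3-loss prior against a 1500 anchor starts new players around 1309, since 400·log10(1/3) is about -191.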

Title: Re: Whole History Ratings
Post by woh on Mar 21st, 2009, 6:47am

on 03/18/09 at 19:29:52, omar wrote:
So woh, how can we get daily updates of our WHR ratings? If you want, I can set it up to run the calculations on the arimaa.com server once a day.


Omar

At the moment the source for the WHR rating tool is the Arimaa game archive, which is only updated on a weekly basis. To make daily updates available I would need another source. What would you suggest?

The tool is a Windows executable. Can you run this on the arimaa.com server?

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 21st, 2009, 8:15am
Thanks for continuing to work on this woh.  I am very eager to see WHR ratings that are updated daily and integrated into the server.

I tend to lean in favor of weak prior distributions, such as just one win and one loss to an anchor player.  This does make it easier for a new player to shoot up the ranks quickly, but my intuition is that a stronger prior has a different disadvantage.  I believe that if you run WHR with a stronger prior the ratings of ArifSyed and Swynndla will unduly benefit from it.  Why?  Because both of them beat many newcomers in their attempts to win the Player of the Month contest.  A player whose entire game history consists of two losses to Swynndla will be rated as quite weak if the prior is weak, but not nearly so weak if the prior is strong.  Thus if the prior is strong, it will appear that Swynndla beat a host of not-terribly-weak players, when in fact they were all quite weak.  That's just a hunch though; I'd be curious whether the numbers bear me out.

I am no longer drawn to the notion of using one win and three losses for the prior, although I once was.  Why should we make it easier for a rating to move up than to move down?  Symmetry makes more sense.  If we believe that the prior is too kind to newcomers, then we can have it be one loss and one win to a lower-rated anchor rather than keeping the anchor rating the same and adding more losses.

On the other hand, I do like the idea of extra losses for the purpose of tournament seeding.  If we are alarmed that someone can get Rabbit's high rank (and thus a high seeding into tournaments) on the basis of only a few games, we can do seeding based on the rating each player would have given two additional losses to the anchor.  Note that this is quite different from giving everyone a one-win-three-loss prior.  To find a player's rating for tournament seeding, we give just him an extra two losses and see what his rating would be.  Then we remove those two losses and give them to another player, etc., until we have calculated an individual conservative rating for everyone entering the tournament.
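The per-player seeding procedure described above can be sketched in the same static single-opponent simplification used earlier (all games idealized as played against the anchor; the real procedure would rerun the full WHR fit once per player with that player's record padded):

```python
import math

def seeding_rating(n_wins, n_losses, prior_k=1, extra_losses=2, anchor=1500.0):
    """Conservative rating for tournament seeding: pad just this one
    player's record with extra virtual losses to the anchor, then read
    off the rating.  Static sketch; in the real system the extra losses
    would be added and the whole WHR fit recomputed per player."""
    wins = n_wins + prior_k
    losses = n_losses + prior_k + extra_losses
    return anchor + 400.0 * math.log10(wins / losses)
```

A player with a long record barely moves when two losses are added, while a thin record like Rabbit's drops sharply, which is exactly the intended effect for seeding.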

Apart from seeding tournaments, I vote we let Rabbit keep his high rating from a weak prior.  Although it is a tenuous guess, it is a reasonable guess.  Yes, I understand that people might not want Rabbit to be displayed with such a high rating in the list of best players.  One solution to that is to have the Top Rated Players list default to only active players.  Then no one will ever hear of Rabbit unless Rabbit comes back to play more games.  For the curious we could also keep a list of Top Rated Players including inactive players, but seriously deprecate that list (i.e. hide the link).

Title: Re: Whole History Ratings
Post by mistre on Mar 21st, 2009, 6:42pm

on 03/21/09 at 08:15:03, Fritzlein wrote:
One solution to that is to have the Top Rated Players list default to only active players.  Then no one will ever hear of Rabbit unless Rabbit comes back to play more games.  For the curious we could also keep a list of Top Rated Players including inactive players, but seriously deprecate that list (i.e. hide the link).


I like this idea.  What should we use to determine an active player?  We could be super lenient (i.e. 1 game in the last year) or super strict (i.e. 1 game within the last month) or somewhere in between.

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 21st, 2009, 7:15pm

on 03/21/09 at 18:42:56, mistre wrote:
I like this idea.  What should we use to determine an active player?  We could be super lenient (i.e. 1 game in the last year) or super strict (i.e. 1 game within the last month) or somewhere in between.

How about six games in the last year?  Then if you play the Postal Mixer only, or the World Championship plus one practice game only, you are still active.  It's like a one-event-per-year rule.

Title: Re: Whole History Ratings
Post by woh on Mar 22nd, 2009, 9:04am
I found a way to get the details of the games not yet included in the game archive. And I have updated the results including all the games till 3:45 PM today (GMT). That is, just before the final game of the WC started.

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 22nd, 2009, 2:22pm
Marvelous!  Thank you, woh.

The relative values of the HvH ratings look very reasonable, but I'm worried that they are so much lower than the game room ratings.  Do I understand correctly that your anchor rating of 1220 was chosen to give ArimaaScoreP1 a rating of 1000 when all games are rated?  That's a nice idea when bots are included, but for the human-only ratings it seems too low.  It looks like we would need to add about 200 points to the human-only anchor to make the scales comparable.

To put it another way, if we are rating two different sets of games, I would rather have the outputs be roughly comparable than have the anchors be identical.  Would you be able to take the ratings of the 100 most active HvH players and anchor the HvH ratings so that their average rating is the same as for the all-game ratings anchored at 1220?

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 23rd, 2009, 12:39pm

on 03/21/09 at 08:15:03, Fritzlein wrote:
I believe that if you run WHR with a stronger prior the ratings of ArifSyed and Swynndla will unduly benefit from it.

Oh, I didn't see at first that you had actually posted with different priors.  My suspicions were confirmed.  With a 1-of-2 prior, ArifSyed is ranked 46th.  With a 2-of-4 prior he is ranked 37th.  With a 3-of-6 prior he is ranked 33rd.

The point is that when we choose our prior we should not only look at how it affects newcomers, but also how it affects established players who play a lot of newcomers.  Apparently if the prior is stronger, then beating up newcomers is rewarded more.  A weak prior has the advantage of rewarding sandbagging less.

Title: Re: Whole History Ratings
Post by woh on Mar 24th, 2009, 1:06pm

on 03/22/09 at 14:22:38, Fritzlein wrote:
Would you be able to take the ratings of the 100 most active HvH players and anchor the HvH ratings so that their average rating is the same as for the all-game ratings anchored at 1220?


Fritzlein, is your concern the difference between the WHR all-games ratings and the WHR HvH-games ratings or the difference between the gameroom ratings and the WHR HvH-games ratings? I would expect the latter since the gameroom rating would still be used as the all-games rating.

on 03/18/09 at 19:29:52, omar wrote:
I've started to view our current rating system as an unofficial superficial rating system which simply serves to provide immediate feedback of ratings to new users.


Would it not be more logical to anchor the WHR HvH ratings so that the average rating of the 100 most active HvH players is the same as the average of their gameroom rating? Or was it that what you meant all along?


BTW: the 100th most active HvH player played 12 HvH games.

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 24th, 2009, 2:03pm
I would like the human-only WH rating of experienced humans who haven't pumped up their ratings with bot bashing to correspond roughly to their current game room ratings.  I don't care about the correspondence in ratings for bot bashers or newcomers.  In fact, I explicitly want the WH ratings to be different from game room ratings for bot bashers and newcomers.

Your graphs show that low-rated players have significantly higher game room ratings than WH ratings.  My hunch is that most of those discrepancies come from players with very scant game records.  There are lots of accounts where people joined at a 1500 rating, lost one game, and never played again.  Their game room rating will obviously be way above their WH rating.

My guess is that over the history of Arimaa, there was first rating deflation, then rating inflation, and most recently another bout of rating deflation.  Since the absolute meaning of game room ratings has probably fluctuated by a hundred points or more over time, I would be fine with any set of ratings that were scaled within about 100 points of the game room ratings.  By superficial examination, your all-games WH ratings (for non-bot-bashers, non-newcomers) fall within the tolerable range, but the human-only WH ratings fall so far below game-room ratings that it would be a major jolt.

I quite like your method of initializing the system with an anchor rating that sets ArimaaScoreP1's rating to 1000.  If we can achieve WHR on nearly the same scale as game room ratings using the notion that ArimaaScoreP1=1000, that's a bonus, because that means we are indirectly calibrating to RandomMover=0.

Unfortunately, it is almost a contradiction in terms to calibrate a human-only rating system to the rating of a bot.  That's why I came up with the odd idea of calibrating all-game WHR ratings first to ArimaaScoreP1=1000, and then calibrating the human-only WHR to the all-games WHR.

Maybe 12 games is a rather small number to make an HvH rating reliable.  Also, what I really want to achieve is to exclude the influence of bot-bashers.  If we did the calibration on all players with at least 30 rated HvH games, and with between 25% and 75% of their games against bots, how many players would that leave?  I think quality is more important than quantity for aligning the two scales, but if there are too few points of comparison, it would be easier for the alignment to be thrown off by pure randomness.  Do we have 30 players with at least 30 HvH games and a good mix of human and bot opponents?

Title: Re: Whole History Ratings
Post by woh on Mar 25th, 2009, 4:22am

on 03/24/09 at 14:03:57, Fritzlein wrote:
Do we have 30 players with at least 30 HvH games and good mix of human and bot opponents?

51 players have played at least 30 HvH games. 23 of them have played at most 75% of their games against bots; none of them played fewer than 25% of their games against bots. There are 30 players who have played at least 25 HvH games with between 20% and 80% of their games against bots.


on 03/24/09 at 14:03:57, Fritzlein wrote:
If we can achieve WHR on nearly the same scale as game room ratings using the notion that ArimaaScoreP1=1000, that's a bonus, because that means we are indirectly calibrating to RandomMover=0.

This was the case with the data of January, but far less so with the current data.
Comparing the average ratings of the above-mentioned pools of 23 and 30 players for January, we get:

Pool   WHR all-games   GMR       Difference
R23    1899.89         1943.65   43.76
R30    1860.40         1906.17   45.77


As you can see in the history graph for ArimaaScoreP1, its rating fluctuates. Fixing its final rating at 1000 makes the anchor change over the course of time. ArimaaScoreP1 has apparently been doing well lately; now an anchor of only 1133 is needed to fix its rating at 1000. Using this anchor pulls the whole scale down by about 90 points.
Pool   WHR all-games   GMR       Difference
R23    1808.99         1944.00   135.01
R30    1770.27         1906.20   135.93
This is no longer within about 100 points (or maybe just).

I then checked what anchor is required to synchronize the HvH WHR with the gameroom ratings.

           Pool   Anchor
January    R23    1476.12
January    R30    1476.04
Currently  R23    1474.54
Currently  R30    1472.67

I no longer like my idea of using an anchor that fixes the rating of a player. Looking at the history of ArimaaScoreP1, its rating fluctuates by about 150 points, and this causes the whole set of ratings to go up and down with it. I am now in favour of a fixed anchor, given that about the same anchor is required now as in January to synchronize the HvH WHR with the gameroom.  I will try to check how this evolves over a longer period.

Both pools of players give about the same result. So I think both are a good reference.
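The synchronization described here amounts to adding one constant offset to every WHR rating so that a reference pool's mean matches its gameroom mean (the R23/R30 comparison above). A sketch (the dict-based data structures are my assumption):

```python
def synchronize(whr, gmr, pool):
    """Shift every WHR rating by one constant so that the reference
    pool's mean WHR equals its mean gameroom rating.  whr and gmr map
    player name -> rating; pool is the list of reference players."""
    offset = (sum(gmr[p] for p in pool) - sum(whr[p] for p in pool)) / len(pool)
    return {p: r + offset for p, r in whr.items()}
```

Because only the single offset depends on the pool, a pool of stable, experienced players keeps the whole scale steady even as individual ratings move.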

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 25th, 2009, 6:29am

on 03/25/09 at 04:22:06, woh wrote:
Both pools of players give about the same result. So I think both are a good reference.

Excellent.


Quote:
I no longer like my idea of using an anchor that fixes the rating of a player. Looking at the history of ArimaaScoreP1, its rating fluctuates by about 150 points, and this causes the whole set of ratings to go up and down with it.

That's a very good point I had not considered.  Any single player will have a performance rating that fluctuates over time by chance.  Fixing the rating of any single player will cause the whole system to swing up and down in response to that player's performance.  We might think that the performance rating of a fixed-performance bot would be quite stable, but in the case of ArimaaScoreP1 there will be a huge amount of noise introduced via its opponents.  Since all newcomers play ArimaaScoreP1 first, and newcomers are necessarily the least accurately-rated people in the system, the inaccuracy of their ratings will show up as a ton of noise in the performance rating of ArimaaScoreP1.

Therefore I am totally in agreement with your change of position.  We should definitely not anchor the ratings on ArimaaScoreP1.  It would be much better to anchor the ratings on something else such as a fixed prior distribution.


Quote:
I am now in favour of a fixed anchor, given that about the same anchor is required now as in January to synchronize the HvH WHR with the gameroom.  I will try to check how this evolves over a longer period.

I am interested in how you will measure the effect of a fixed anchor over a longer period of time.  I had not expected that, in order to stabilize the ratings of a good reference group of players, we would need an anchor of approximately 1500, the rating that we formerly gave to all newcomers.  We're a bit under 1500 now, but I blame that on recent deflation, and I expect that a year ago the anchor rating needed to calibrate WHR to game room ratings would have been over 1500 due to the inflation underway then.  If you see this anchor value drifting up and down historically, but staying in the neighborhood of 1500, and in any case drifting less quickly and violently than the rating of ArimaaScoreP1, that would seem like a strong argument for anchoring WHR with a prior against a 1500-rated player.


Title: Re: Whole History Ratings
Post by omar on Mar 26th, 2009, 2:31pm

on 03/21/09 at 06:47:47, woh wrote:
Omar

At the moment the source for the WHR rating tool is the Arimaa game archive, which is only updated on a weekly basis. To make daily updates available I would need another source. What would you suggest?

The tool is a Windows executable. Can you run this on the arimaa.com server?


I've set it up to run every day now. The earliest time for you to pick up the new data would be 9:10 AM GMT.

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 31st, 2009, 3:03pm
Woh, your graphs of player ratings over time made me think of an excellent application of WHR.  In the Arimaa article (http://en.wikipedia.org/wiki/Arimaa) on Wikipedia, I have included the ranks of the human defenders of the Challenge.  However, it always felt odd that those ranks were based on the inaccurate game room ratings.  For known duplicate accounts I included only the account with the most human games played, but even so, many of the ratings I used were distorted by bot bashing.

One feature of WHR is that we don't just have to use them going forward; we can use information after a point in time to retroactively improve our guess of a player's skill level at that time.  We are now in a position to more accurately rank the players at the times they played the Challenge games of past years.  If you have time to extract the historical data, I would love to update the Wikipedia article with more realistic rankings.  (Incidentally, for this purpose the level of the anchor doesn't matter, since I only want relative positions of the players.)

The dates of interest would be the starting dates of each challenge, namely

February 2, 2004 (omar)
February 7, 2005 (Belbo)
February 5, 2006 (Fritzlein, Adanac, PMertens)
February 11, 2007 (Fritzlein, Brendan, omar, naveed)
April 6, 2008 (chessandgo, Adanac, mistre, omar)

Thanks in advance if you have time for this project!

Title: Re: Whole History Ratings
Post by woh on Apr 2nd, 2009, 9:21am

on 03/26/09 at 14:31:11, omar wrote:
I've set it up to run everyday now. The earliest time for you to pick up the new data would be 9:10 am GMT.


Thank you very much, Omar!
It is much faster to generate the new rankings when the game archive is up to date.
The rankings are now updated daily around 5 PM GMT.

Title: Re: Whole History Ratings
Post by woh on Apr 2nd, 2009, 9:29am

on 03/31/09 at 15:03:08, Fritzlein wrote:
If you have time to extract the historical data, I would love to update the Wikipedia article with more realistic rankings.


It was not much trouble to generate these rankings. So here are the results. Follow the links for the full rankings on those dates.


February 2, 2004 (http://home.scarlet.be/~woh/whr/whrh2004.htm)
omar #1
February 7, 2005 (http://home.scarlet.be/~woh/whr/whrh2005.htm)
Belbo #5
February 5, 2006 (http://home.scarlet.be/~woh/whr/whrh2006.htm)
Fritzlein #1
Adanac #2
PMertens #5
February 11, 2007 (http://home.scarlet.be/~woh/whr/whrh2007.htm)
Fritzlein #1
Brendan #12
omar #9
naveed #23
April 6, 2008 (http://home.scarlet.be/~woh/whr/whrh2008.htm)
chessandgo #2
Adanac #3
mistre #20
omar #24



Title: Re: Whole History Ratings
Post by Fritzlein on Apr 2nd, 2009, 10:49am
Thanks again for all your work on this, woh.  I have updated the Wikipedia page.  I knew that #1/#2 would be close between chessandgo and myself for the 2008 Challenge; by your results I was a whopping 7 rating points ahead.  Another interesting side note is that the early-2008 ratings have the two of us a little over 300 points ahead of Adanac at #3, whereas now chessandgo is only 160 points ahead of him and I'm only 90 ahead.  Sounds about right given the results then and now.

Title: Re: Whole History Ratings
Post by mistre on Apr 3rd, 2009, 6:06am
This might be asking too much, but would it be possible to further separate the WHR into postal vs. live games, even if just as a one-time analysis rather than an ongoing rating?  I have no doubt that I am a stronger player postally than I am live, but it would be interesting to see by how much.




Title: Re: Whole History Ratings
Post by woh on Oct 30th, 2009, 10:01am

on 04/03/09 at 06:06:38, mistre wrote:
I have no doubt that I am a stronger player postally than I am live, but it would be interesting to see by how much.


Your ranking based on the postal games only (http://home.scarlet.be/~woh/whr/whrhp.htm) is 5 positions higher than your ranking based on all games (18 vs 23). Other players with a better postal ranking include
Fritzlein 1 (2)
99of9 3 (7)
jdb 5 (15)
camelback 7 (13)
omar 11 (16)
ChrisB 13 (17)
Simon 14 (46) !
OLTI 15 (30)
Tuks 17 (20)

And me? I make a move in the opposite direction, dropping from 38 to 49.

Title: Re: Whole History Ratings
Post by aaaa on Oct 30th, 2009, 3:22pm
Would you be willing to apply your rating system to rated games involving only developer bots?

Title: Re: Whole History Ratings
Post by Simon on Oct 30th, 2009, 7:36pm
There would be a much larger discrepancy in my ratings if you compared my postal rating with my rating for live games only, as my record is (not counting my accidental resignation against ChrisB) 4-0 in H. v. H postal games and 1-5 in H. v. H live games.

One thing I am wondering about whole history ratings is how the prior works. I take it that there is an imaginary pair of games, one won and one lost, against an opponent with a standard rating. But when is this imaginary game supposed to have occurred? If it is taken to have occurred far in the past, say when a player joined or played their first game, then the effects of the prior would decay away over time, resulting in inflated ratings for players (such as myself) who have a long time gap between their most recent win and their most recent loss or first entry into the system. The way it ought to work is that the prior games are taken to have occurred at the present moment, i.e. at the moment the ratings are calculated. I am not sure if that is how it does work, however.

Edit: or maybe simultaneous with that player's most recent game. Otherwise the ratings of inactive players would tend to move towards the standard rating as they remain inactive.  And, actually, maybe even that version could be problematic, because a win against a weak player by a long-inactive player would likely result in a sudden rating drop for the winner. On the other hand, continued playing would smooth things out... though maybe the rating convergence of inactive players wouldn't be a bad thing; it would quickly be corrected by resumed play, and would encourage players (rated higher than the standard rating) to remain active.

Further edit:

Making the prior games occur at the moment the ratings are calculated would also tend to give players who play infrequently ratings closer to the standard rating than players with similar skill who play frequently (in contrast to the system of having prior games at the beginning, which results in exaggerated ratings for players who play infrequently, particularly if on a winning or losing streak). One possibility might be to have a fraction of a prior pair of games for every game a player plays. In order to avoid making players who play too many uneven games get a rating too close to the standard rating, the weight of the prior game pair could be adjusted based on the expected absolute value of rating change for the player from that game, or some similar metric.
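To make the "prior as imaginary games" idea concrete, here is a minimal static-rating sketch (time variation ignored, function names hypothetical): a fractional pair of virtual games against a 1500-rated anchor simply adds extra terms to the Bradley-Terry log-likelihood gradient, so fractional or weighted priors fall out naturally.

```python
import math

def win_prob(r, opp):
    """Elo-scale Bradley-Terry probability that a player rated r beats opp."""
    return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

def mle_rating(games, prior_games=1.0, anchor=1500.0):
    """Maximum-likelihood static rating from (opponent_rating, won) pairs,
    with `prior_games` virtual wins AND losses against `anchor`.
    The log-likelihood gradient is sum(won - p); it is monotone
    decreasing in r, so bisection finds its zero."""
    def grad(r):
        g = sum((1.0 if won else 0.0) - win_prob(r, opp) for opp, won in games)
        p = win_prob(r, anchor)
        g += prior_games * (1.0 - p)  # virtual wins against the anchor
        g += prior_games * (0.0 - p)  # virtual losses against the anchor
        return g
    lo, hi = 0.0, 3000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With no real games the estimate sits exactly at the anchor, and a 3-1 record against 1500-rated opponents lands around 1620; WHR itself places such terms at particular time points, which is exactly the question raised above.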


Title: Re: Whole History Ratings
Post by woh on Oct 31st, 2009, 5:51am

on 10/30/09 at 15:22:46, aaaa wrote:
Would you be willing to apply your rating system to rated games involving only developer bots?

I am not sure what you mean by 'developer bots'. Do you mean all BvB games or only games between some particular bots?

Title: Re: Whole History Ratings
Post by Fritzlein on Oct 31st, 2009, 6:24am
Like Simon, I am curious how the prior is applied relative to time-varying ratings.  I had assumed that the win and loss against a 1500 player would be coincident with each player's first real game, and the effect of the prior would damp out over time, but now I see that this could result in a player with a one-game history eventually having a weaker prior than a player with no game history, which would be odd.  If that is what happens, how long does it take for it to happen?

Title: Re: Whole History Ratings
Post by aaaa on Oct 31st, 2009, 8:55am

on 10/31/09 at 05:51:36, woh wrote:
I am not sure what you mean by 'developer bots'. Do you mean all BvB games or only games between some particular bots?

Games between bots not hosted on the server, i.e. those not listed here (http://arimaa.com/arimaa/bots/index.cgi).

Title: Re: Whole History Ratings
Post by woh on Nov 6th, 2009, 8:16am

on 10/31/09 at 08:55:39, aaaa wrote:
Games between bots not hosted on the server, i.e. those not listed here (http://arimaa.com/arimaa/bots/index.cgi).


I generated the rankings (http://home.scarlet.be/~woh/whr/whrbd.htm) based on those games.

Title: Re: Whole History Ratings
Post by Janzert on Nov 6th, 2009, 8:27am

on 11/06/09 at 08:16:04, woh wrote:
I generated the rankings (http://home.scarlet.be/~woh/whr/whrbd.htm) based on those games.


Hmm, very interesting. Quite different than I expected.

What games were included?

Thanks for doing this,
Janzert

Title: Re: Whole History Ratings
Post by woh on Nov 6th, 2009, 8:34am

on 11/06/09 at 08:27:06, Janzert wrote:
What games were included?


All rated games between two bots, neither of which is listed on the 'Arimaa Bots Available to Play' page.

Title: Re: Whole History Ratings
Post by Janzert on Nov 6th, 2009, 5:20pm
Ahh, ok. I wonder if the graph of opponents is simply not well connected enough for good ratings, or maybe there are just not enough games?

Also looking at it again bot_Bomb2004CC and all the bot_Gnobot2006* versions should have also been excluded. These are variants that used to be run on the arimaa.com server but have been removed for various reasons.

Of course removing those is only going to make the small number of games played even smaller.

Janzert

Title: Re: Whole History Ratings
Post by Tuks on Nov 15th, 2009, 9:03am
your rating is doing something fishy...i beat Adanac and i barely got anything for it even though Adanac is 300+ higher than me

i lost to Fritz too but that shouldn't make any difference because he is 500+ higher

maybe im wrong, i was expecting to jump up a couple ranks

Title: Re: Whole History Ratings
Post by Tuks on Nov 15th, 2009, 9:04am
it could be that the date changed but the ratings didnt, urgh, blast my memory i cant remember if i was in the 2040s before or not

Title: Re: Whole History Ratings
Post by Fritzlein on Nov 15th, 2009, 9:34am

on 11/15/09 at 09:04:29, Tuks wrote:
it could be that the date changed but the ratings didnt, urgh, blast my memory i cant remember if i was in the 2040s before or not

Tuks, you can tell by your game room rating of 1892 (listed next to your WHR) that the whole history ratings were calculated after your game with froody but before your game with Adanac.  But anyway, how can you enjoy seeing your rating go up if you can't remember what it was before?  ;-)

Title: Re: Whole History Ratings
Post by Fritzlein on Nov 15th, 2009, 9:39am
I have a feature request for you, woh.  Could you display a second ranking that omits anyone who hasn't played a rated game in the last year?  Or even (if it isn't too difficult) omit anyone who hasn't played at least five games in the past year?  The changed ranks could be listed in the same table as the ratings.  It is nice to see the all-time rankings, but it would also be nice to see a ranking of active players.

Title: Re: Whole History Ratings
Post by Tuks on Nov 15th, 2009, 10:11am
true that

Title: Re: Whole History Ratings
Post by camelback on Nov 15th, 2009, 6:50pm

on 11/15/09 at 09:39:38, Fritzlein wrote:
I have a feature request for you, woh.  Could you display a second ranking that omits anyone who hasn't played a rated game in the last year?  Or even (if it isn't too difficult) omit anyone who hasn't played at least five games in the past year?  The changed ranks could be listed in the same table as the ratings.  It is nice to see the all-time rankings, but it would also be nice to see a ranking of active players.


Can't wait to see active players' ratings. It will definitely be a good impetus for players to be active.

Title: Re: Whole History Ratings
Post by Fritzlein on Nov 16th, 2009, 5:52pm
I like the "# games" column that you added.  Thanks, woh!

Title: Re: Whole History Ratings
Post by woh on Nov 17th, 2009, 10:17am

on 11/15/09 at 09:39:38, Fritzlein wrote:
I have a feature request for you, woh.  Could you display a second ranking that omits anyone who hasn't played a rated game in the last year?  Or even (if it isn't too difficult) omit anyone who hasn't played at least five games in the past year?  The changed ranks could be listed in the same table as the ratings.  It is nice to see the all-time rankings, but it would also be nice to see a ranking of active players.


WHR now includes 2 extra columns: the active ranking and the number of games played in the past year. The active ranking is based on a minimum of 5 games.

Title: Re: Whole History Ratings
Post by woh on Nov 17th, 2009, 10:42am

on 11/06/09 at 17:20:23, Janzert wrote:
Ahh, ok. I wonder if the graph of opponents is simply not well connected enough for good ratings, or maybe there are just not enough games?

Also looking at it again bot_Bomb2004CC and all the bot_Gnobot2006* versions should have also been excluded. These are variants that used to be run on the arimaa.com server but have been removed for various reasons.

Of course removing those is only going to make the small number of games played even smaller.

Janzert


I updated this ranking (http://home.scarlet.be/~woh/whr/whrbd.htm) and excluded the bots you mentioned. The new ranking is based on a total of 1430 games, the previous one on 1470 games.

Title: Re: Whole History Ratings
Post by Fritzlein on Nov 17th, 2009, 12:50pm

on 11/17/09 at 10:17:46, woh wrote:
WHR now includes 2 extra columns: the active ranking and the number of games played in the past year. The active ranking is based on a minimum of 5 games.

Awesome!  Thank you so much.  I love these ratings.  I guess my only other request would be to have them integrated into the game room.  :-)

Title: Re: Whole History Ratings
Post by woh on Nov 17th, 2009, 2:01pm

on 11/15/09 at 09:03:04, Tuks wrote:
your rating is doing something fishy...i beat Adanac and i barely got anything for it even though Adanac is 300+ higher than me

i lost to Fritz too but that shouldn't make any difference because he is 500+ higher

maybe im wrong, i was expecting to jump up a couple ranks


Tuks, the WHR ratings are not updated live; I generate them normally once a day. The rankings of November 16th were the first to include your game against Adanac. Your rating went up from 2044.6 to 2088.2, thereby gaining 2 positions (20 -> 18). At the same time Adanac moved from 3rd to 4th position.

Title: Re: Whole History Ratings
Post by camelback on Nov 17th, 2009, 2:19pm
Very nice, Thank you woh.

Title: Re: Whole History Ratings
Post by Fritzlein on Nov 18th, 2009, 2:32pm
Woh, is the new column for games played in the last year blanked out if the user has played less than 5?  That is to say, if they don't get a ranking among active players, it doesn't show whether they have played 0, 1, 2, 3, or 4 games in the last year?

Title: Re: Whole History Ratings
Post by Janzert on Nov 18th, 2009, 2:44pm

on 11/17/09 at 10:42:35, woh wrote:
I updated this ranking (http://home.scarlet.be/~woh/whr/whrbd.htm) and excluded the bots you mentioned. The new ranking is based on a total of 1430 games, the previous one on 1470 games.


Thanks. I'm still pretty mystified by the ranking order.

Janzert

Title: Re: Whole History Ratings
Post by woh on Nov 19th, 2009, 4:10am

on 11/18/09 at 14:32:38, Fritzlein wrote:
Woh, is the new column for games played in the last year blanked out if the user has played less than 5?  That is to say, if they don't get a ranking among active players, it doesn't show whether they have played 0, 1, 2, 3, or 4 games in the last year?


That is correct, Fritzlein. It would be more informative if the number of games played in the last year were shown for all players, so I will change it to that.

Title: Re: Whole History Ratings
Post by Fritzlein on Nov 19th, 2009, 6:31am

on 11/19/09 at 04:10:42, woh wrote:
That is correct, Fritzlein. It would be more informative if the number of games played in the last year were shown for all players, so I will change it to that.

Thanks!

Title: Re: Whole History Ratings
Post by Tuks on Nov 19th, 2009, 12:18pm
hehe, Pmertens lost a game and everyone's ratings went down in the top 20

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 27th, 2010, 10:42am
Woh, I have another feature request if you are interested and have the time.  Could you include a column in the WHR display for peak rating?  Since you have the whole history at hand anyway, the number should already be calculated, so I think I'm only requesting an additional lookup and display.

I realize that subsequent games do cause old ratings to be recalculated, such that a player's peak rating could go down slightly, as counter-intuitive as that is.  But I don't think this small amount of drift is a problem for the purpose at hand, namely discouraging sandbagging in the World League.

Plus, it would be kind of cool to know one's own high-water mark in any case. :D

Title: Re: Whole History Ratings
Post by knarl on Mar 28th, 2010, 2:51pm

on 03/27/10 at 10:42:43, Fritzlein wrote:
Woh, I have another feature request if you are interested and have the time.  Could you include a column in the WHR display for peak rating?  Since you have the whole history at hand anyway, the number should already be calculated, so I think I'm only requesting an additional lookup and display.

I realize that subsequent games do cause old ratings to be recalculated, such that a player's peak rating could go down slightly, as counter-intuitive as that is.  But I don't think this small amount of drift is a problem for the purpose at hand, namely discouraging sandbagging in the World League.

Plus, it would be kind of cool to know one's own high-water mark in any case. :D


As I mentioned in the World League feedback: for players whose ratings descended from the 1500 starting point, their peak WHR could be at the first maximum. Because a rating isn't really true until it bottoms out at the beginning, right?

I don't know how WHR are calculated, so I might be totally wrong in the way I'm considering it.

Cheers,
knarl.

Title: Re: Whole History Ratings
Post by woh on Mar 29th, 2010, 1:50pm

on 03/27/10 at 10:42:43, Fritzlein wrote:
Woh, I have another feature request if you are interested and have the time.  Could you include a column in the WHR display for peak rating?  Since you have the whole history at hand anyway, the number should already be calculated, so I think I'm only requesting an additional lookup and display.



Hi Fritzlein. I will add a column with the peak rating.

Your observation is correct, all the data is available; just give me some time to add it.

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 29th, 2010, 2:01pm

on 03/29/10 at 13:50:43, woh wrote:
Hi Fritzlein. I will add a column with the peak rating.

Your observation is correct, all the data is available; just give me some time to add it.

Awesome, thanks!

Title: Re: Whole History Ratings
Post by woh on Mar 29th, 2010, 2:10pm

on 03/28/10 at 14:51:02, knarl wrote:
As I mentioned in the World League feedback: for players whose ratings descended from the 1500 starting point, their peak WHR could be at the first maximum. Because a rating isn't really true until it bottoms out at the beginning, right?


knarl

Every time the WHR ratings are calculated, all ratings are recalculated, including the rating a player had when playing his previous games. Previous ratings are influenced by the results of the player's new games (and indirectly also by the results of the new games played by his opponents).
It is true that a player's first rating is possibly his maximum, but this rating will not be equal to the 1500 starting point. The more a player loses, the more all his ratings will drop, including his maximum. So a player who has lost more games, and lost to weaker opponents, should end up with a lower maximum.


Title: Re: Whole History Ratings
Post by knarl on Mar 29th, 2010, 3:28pm

on 03/29/10 at 14:10:32, woh wrote:
knarl

Every time the WHR ratings are calculated, all ratings are recalculated, including the rating a player had when playing his previous games. Previous ratings are influenced by the results of the player's new games (and indirectly also by the results of the new games played by his opponents).
It is true that a player's first rating is possibly his maximum, but this rating will not be equal to the 1500 starting point. The more a player loses, the more all his ratings will drop, including his maximum. So a player who has lost more games, and lost to weaker opponents, should end up with a lower maximum.

Cool. That solves that problem. Sounds like a neat rating system. I should read that paper some time.

Cheers,
knarl.

Title: Re: Whole History Ratings
Post by Fritzlein on Apr 1st, 2010, 8:53am
Thanks for adding the "peak rating" column, woh!  It's quite interesting in its own right, and it is enabling for the World League.  For those World League players concerned that the minimum peak rating would be 1500, just look at the results and you will see that it doesn't work that way.

Title: Re: Whole History Ratings
Post by Tuks on Apr 12th, 2010, 12:23am
hey woh, could you update the #games played this year column, and how hard would it be to add the postal ratings? It's postal season, so they ought to be changing quite a lot in the next couple of months.

Title: Re: Whole History Ratings
Post by woh on Apr 13th, 2010, 8:37am
Hi Tuks


on 04/12/10 at 00:23:46, Tuks wrote:
hey woh, could you update the #games played this year column.


The column at the right is not #games played this year but #games played between today and the same date a year ago. It is updated with every new list.


on 04/12/10 at 00:23:46, Tuks wrote:
hey woh, could you update the #games played this year column and how hard would it be to add the postal ratings? just as its postal  season so they ought to be changing quite a lot in the next couple of months.


I posted a new postal ranking (http://home.scarlet.be/~woh/whr/whrp.htm) as well as a new events ranking (http://home.scarlet.be/~woh/whr/whre.htm). I will update these once a week.

Title: Re: Whole History Ratings
Post by omar on Apr 19th, 2010, 8:40am
Woh, I know you mentioned that there is a windows executable for calculating WHR, but I won't be able to run that on the arimaa.com server. If you could either post or email me the source, I would like to try some experiments with it. Thanks.

Title: Re: Whole History Ratings
Post by Weirdo87 on Apr 19th, 2010, 6:16pm

on 04/13/10 at 08:37:22, woh wrote:
I posted a new postal ranking (http://home.scarlet.be/~woh/whr/whrp.htm) as well as a new events ranking (http://home.scarlet.be/~woh/whr/whre.htm). I will update these once a week.


These rankings strike me as odd, particularly because my rating seems hugely inflated. In the events ranking, I'm ranked at #16, but there's no way in hell I'm that good. Can someone explain why I'm ranked at #16 while someone like Arimabuff (who I couldn't beat in a million years) is only ranked at #41?

(If this question has already been addressed, just ignore me.)

Title: Re: Whole History Ratings
Post by omar on Apr 21st, 2010, 8:26pm
Most likely due to the low number of games. Until a player has played about 50 games, the ratings are probably not very stable.

Woh has sent me the code he uses, so I'm going to do some tests with WHR to characterize it a bit.

Title: Re: Whole History Ratings
Post by woh on Apr 22nd, 2010, 3:14am
Weirdo87, you have a 6-1 win/loss record on event games, including a 1-1 record against the player ranked 22nd and wins against the players ranked 32nd, 53rd and 55th. With such a record I think it is quite acceptable to be in 16th position.

Title: Re: Whole History Ratings
Post by aaaa on Apr 22nd, 2010, 5:54pm
I've managed to implement this system myself and tried to discover what optimized values for the parameters might look like. I'm getting about 1.4 (+/- 0.1) wins/losses for the prior and roughly 235 (+/- 20) Elo^2/day for the variance of the Wiener process. Especially the latter value may seem like the result of overfitting (since a high flexibility of a player's rating would allow it to be separately tailored for each game result), but performance was measured through cross-validation.
If any "official" adoption of these values is contingent on me supplying more details, I'll gladly do so.
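The "variance of the Wiener process" parameter can be illustrated with a small sketch (hypothetical function name): under the dynamic Bradley-Terry model, the change in a player's rating between two time points is normally distributed with mean 0 and variance proportional to the elapsed time, and this log-density is the prior term that penalizes rating jumps.

```python
import math

def wiener_log_prior(r_then, r_now, days, var_per_day=235.0):
    """Log-density of a rating change under the Wiener-process prior:
    r_now - r_then over `days` days is N(0, var_per_day * days).
    235 Elo^2/day is the estimate quoted in the post above."""
    var = var_per_day * days
    return -0.5 * math.log(2.0 * math.pi * var) - (r_now - r_then) ** 2 / (2.0 * var)
```

The larger var_per_day, the weaker the penalty on rating changes, which is why a too-flexible value can look like overfitting; a 100-point jump is penalized much less over a month than over a single day.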

Title: Re: Whole History Ratings
Post by omar on Apr 24th, 2010, 6:00am

on 04/22/10 at 17:54:41, aaaa wrote:
I've managed to implement this system myself and tried to discover what optimized values for the parameters might look like. I'm getting about 1.4 (+/- 0.1) wins/losses for the prior and roughly 235 (+/- 20) Elo^2/day for the variance of the Wiener process. Especially the latter value may seem like the result of overfitting (since a high flexibility of a player's rating would allow it to be separately tailored for each game result), but performance was measured through cross-validation.
If any "official" adoption of these values is contingent on me supplying more details, I'll gladly do so.


Yes, I am interested to know more about the experiments that you have tried. I will probably have lots of questions. I'll contact you by email.

Title: Re: Whole History Ratings
Post by aaaa on Apr 25th, 2010, 10:44pm
It seems to me that it would be better to just discuss it here. Anyway, the following elaboration should hopefully give a complete picture; if not, just post questions here.

First of all, the games considered are, obviously, all the rated ones involving only humans.
At the start of a run, each game that is either the earliest or the latest for either player is set aside. The remaining ones are randomly divided into 10 subsamples.
A Nelder–Mead process then tries to home in on the best combination of the two aforementioned parameters (starting with a random triangle of initial guesses).
For a pair of values to be evaluated, each of the 10 subsamples is, in turn, omitted when the parameters and games are fed into the Whole-History Rating system. Performance is then measured by how well the system predicts the omitted games (constituting the validation set). These values are summed to give the total error for the pair of parameters in question. This seems to be just like 10-fold cross-validation, but games at an extreme time point for any player always have to be part of the training set in order that ratings at time points not occurring in the system can always be interpolated (as given in the last page of the paper).
Finally, the error of a game set is calculated by adding for each member the result of "(r-1)*log(1-e)-r*log(e)" where 'e' is the expected outcome (as calculated with the usual logistic formula) and 'r' the actual one. This seems to me to be the right formula to use, as it makes optimizing the parameters coincide with maximizing their likelihood with respect to the data.

Running the test various times led me to the figures I gave earlier.
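The per-game error and its sum over a validation set can be written directly in code (a sketch with hypothetical names; `e` and `r` as defined above). Note that for r = 1 the formula reduces to -log(e) and for r = 0 to -log(1-e), i.e. the negative log-likelihood of the actual outcome, which is why minimizing the total error maximizes the likelihood.

```python
import math

def game_error(e, r):
    """Per-game error: (r-1)*log(1-e) - r*log(e), the negative
    log-likelihood of actual outcome r (1 = win, 0 = loss)
    under predicted win probability e."""
    return (r - 1) * math.log(1 - e) - r * math.log(e)

def total_error(validation_set):
    """Total error over a validation set of (predicted, actual) pairs."""
    return sum(game_error(e, r) for e, r in validation_set)
```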

Title: Re: Whole History Ratings
Post by omar on Apr 26th, 2010, 3:42pm
Thanks aaaa. Is it possible you can send me the code you used to run these tests. I'd like to try it out with different data sets.

Title: Re: Whole History Ratings
Post by aaaa on Aug 20th, 2010, 5:25pm
woh, could you add a statistic that shows the geometric mean likelihood of a game outcome for the rating systems? You can calculate it efficiently by calculating the arithmetic mean log-likelihood and raising e to it.

I also notice that "Predict Percentage" is a misnomer for the shown values, as they are unaltered ratios.
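A minimal sketch of the suggested computation (hypothetical function name): taking exp of the mean log-likelihood avoids the numerical underflow you would get from multiplying thousands of small probabilities together before taking the n-th root.

```python
import math

def geo_mean_likelihood(likelihoods):
    """Geometric mean of per-game outcome likelihoods, computed
    stably as exp of the arithmetic mean of the log-likelihoods."""
    return math.exp(sum(math.log(p) for p in likelihoods) / len(likelihoods))
```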

Title: Re: Whole History Ratings
Post by woh on Aug 26th, 2010, 6:31am

on 08/20/10 at 17:25:37, aaaa wrote:
woh, could you add a statistic that shows the geometric mean likelihood of a game outcome for the rating systems?


Hi aaaa,
It might take a while before I can spend some time on this.


on 08/20/10 at 17:25:37, aaaa wrote:
I also notice that "Predict Percentage" is a misnomer for the shown values, as they are unaltered ratios.


What do you mean by 'unaltered ratios'?

Title: Re: Whole History Ratings
Post by aaaa on Aug 26th, 2010, 9:15am

on 08/26/10 at 06:31:24, woh wrote:
What do you mean by 'unaltered ratios'?

Well I assume that the respective systems are correctly predicting the winners of 83% and 75% of the games, not merely 0.83% and 0.75%.

In response to your hesitancy to accept any optimized parameters, due to the changing nature of Arimaa players, I've changed the evaluation such that predictions of game outcomes are weighted by how new they are. The weights decay exponentially in such a way that games at the median time point have half the weight of the newest game (which of course wouldn't itself be evaluated, due to the aforementioned interpolation restriction, but you get the idea).
Tell me whether this would satisfy you enough to the point of adopting figures that will come with this setup.
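A sketch of that weighting scheme as described (hypothetical function name; times in days or any consistent unit): the half-life is the span from the median time point to the newest game, and the weight halves again for each further half-life back.

```python
def decay_weight(t, t_newest, t_median):
    """Exponential decay weight for a game at time t, assuming the
    scheme described: weight 1.0 at the newest game's time and 0.5
    at the median time point."""
    half_life = t_newest - t_median
    return 0.5 ** ((t_newest - t) / half_life)
```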

Title: Re: Whole History Ratings
Post by woh on Aug 27th, 2010, 10:10am

on 08/26/10 at 09:15:08, aaaa wrote:
Well I assume that the respective systems are correctly predicting the winners of 83% and 75% of the games, not merely 0.83% and 0.75%.


OK, Thanks aaaa.
Now I've got it.
My bad, I should have given it some more thought.

I have changed the title.

Title: Re: Whole History Ratings
Post by woh on Aug 27th, 2010, 10:12am

on 08/20/10 at 17:25:37, aaaa wrote:
woh, could you add a statistic that shows the geometric mean likelihood of a game outcome for the rating systems?


Added.

Title: Re: Whole History Ratings
Post by Tuks on Aug 30th, 2010, 2:53pm
Is it possible to have the graphs that Janzert made use the WHR ratings instead of the gameroom rating? The graphs are a very cool feature, but with the gameroom rating some players have rather strange and inaccurate graphs, like omar for example, or other players who went through phases of bot bashing or testing.

That would actually show accurate progression for each player.

Title: Re: Whole History Ratings
Post by aaaa on Sep 3rd, 2010, 3:51pm
I'm going to recommend that the parameters are set to 1.3 and 200. Unlike with the prior, I cannot justify giving anything other than such a round value for the increase in variance in light of the lack of precision of the optimization method with regard to this parameter.
Perhaps the switch could be made immediately, while keeping a separate temporary page with the old parameter values for the duration of the competition season.

Also, I think it would be nice to have a column next to that of the peak ratings that shows the respective dates of reaching them.

Thanks.

Title: Re: Whole History Ratings
Post by omar on Sep 15th, 2010, 7:09pm

on 08/30/10 at 14:53:09, Tuks wrote:
Is it possible to have the graphs that Janzert made use the WHR ratings instead of the gameroom rating? The graphs are a very cool feature, but with the gameroom rating some players have rather strange and inaccurate graphs, like omar for example, or other players who went through phases of bot bashing or testing.

That would actually show accurate progression for each player.


The WHR system does not keep a static rating which is changed for the two players after a game; it computes the ratings of all the players at once. So woh would have to store the daily WHR snapshots in a database and make them available as a web service in order to do this.

Title: Re: Whole History Ratings
Post by aaaa on Dec 22nd, 2010, 5:02pm
In light of the upcoming use of WHR ratings for the seeding of the championship, and the fact that any change of parameters shouldn't be disruptive right now, I would like to reiterate my request to change them to values that have at least some empirical basis. The latest values I'm getting (with exponential decay) are 1.3 and 170.

Title: Re: Whole History Ratings
Post by Adanac on Jan 13th, 2011, 1:59pm
Fritzlein has beaten me 10 straight games but somehow I just passed him in WHR  :-[  I'll take a screenshot of this miraculous event in case it never happens again!

Title: Re: Whole History Ratings
Post by aaaa on Jan 13th, 2011, 2:51pm
Must be due to an intransitivity in performance. If you were Joe Frazier, Fritzlein would be George Foreman and chessandgo would be Muhammad Ali.

Title: Re: Whole History Ratings
Post by Eltripas on Jan 13th, 2011, 5:39pm

on 01/13/11 at 13:59:23, Adanac wrote:
Fritzlein has beaten me 10 straight games but somehow I just passed him in WHR  :-[  I'll take a screenshot of this miraculous event in case it never happens again!


I'm sure it has nothing to do with the fact that you defeated the world champion in 4 of your last 5 games against him, not to mention that you also defeated Tuks twice. All this happened in the last 2 days.

Title: Re: Whole History Ratings
Post by megajester on Jan 14th, 2011, 2:07am

on 01/13/11 at 17:39:08, Eltripas wrote:
I'm sure it has nothing to do with the fact that you defeated the world champion in 4 of your last 5 games against him, not to mention that you also defeated Tuks twice. All this happened in the last 2 days.


Does this mean that skill at Arimaa is multidimensional? Meaning that one is not "better" or "worse" at Arimaa per se, but that there is a spectrum of different styles that interact with each other differently?

Of course as Fritz always says, "tactics beats strategy". But once you get beyond tactics, is it possible that Arimaa strategies are like Rock Paper Scissors? (Rock beats scissors beats paper beats rock = Fritz beats Adanac beats chessandgo beats Fritz)

This would perhaps explain why bot bashers sometimes don't do well against humans and vice versa...

... megajester looks up at the edifice of his theory with pride ...

... of course if chessandgo turns up and says he was having a bad couple of days when he played Adanac then the whole thing comes crashing down. :)

Edit: Now I see what aaaa meant by "intransitivity". I don't know boxing, so I didn't understand that post when I first read it.

Perhaps I should define my theory better. Of course there are different strategies that different situations call for. I am talking more about playing styles, different philosophies, or approaches to the game. General principles of thought as opposed to knee-jerk reactionary strategies.

Let's say two players, A and B, play a game according to their own styles, A and B. Up until now I would have thought that the winner was determined by whoever made the least mistakes, meaning whoever executed their own style more perfectly. So if A beats B, it means that B must have made a mistake somewhere. I thought that the playing style itself was immaterial, that everything rests on the perfect execution of that style. Now I am proposing that even if a style is perfectly executed it can still be beaten by another, perfectly executed style.

Of course you would also assume that when quantum computers come along and "solve" Arimaa in the technical sense of the word, all this will be bunk. (In fact this has already happened to chess even before computers have solved it. Especially when compared with Arimaa, chess is a tactical slugfest with very little room for strategic variety.) However as Fritzlein points out in his book, branching factors are not everything. Perhaps Arimaa really does have richness all its own.

Title: Re: Whole History Ratings
Post by Adanac on Jan 14th, 2011, 5:41am

on 01/14/11 at 02:07:17, megajester wrote:
... of course if chessandgo turns up and says he was having a bad couple of days when he played Adanac then the whole thing comes crashing down. :)  


Yes, my high WHR rating is a temporary fluke and I don't expect to remain number 2 for long.  Or maybe I'm becoming the Boris Gulko of Arimaa (5 out of 8 points in 8 games versus Kasparov:  +3 =4 -1, but with no success against Karpov, Kramnik, etc.)  I don't know if Gulko is an example of intransitivity or just a small sample size, though.


Quote:
Let's say two players, A and B, play a game according to their own styles, A and B. Up until now I would have thought that the winner was determined by whoever made the least mistakes, meaning whoever executed their own style more perfectly. So if A beats B, it means that B must have made a mistake somewhere. I thought that the playing style itself was immaterial, that everything rests on the perfect execution of that style. Now I am proposing that even if a style is perfectly executed it can still be beaten by another, perfectly executed style.


If we categorize players by level of aggressive play:

High = always attacks and advances pieces into enemy territory
Medium = likes to attack, but nowhere near as enthusiastically as “High”
Low = prefers to defend home traps and hold hostages

From my observations, if I had to guess which style tends to beat others:
High > Medium > Low > High

Or, as a general rule:  more aggressive players tend to beat cautious players except in cases of over-exuberance against solid play.

Of course, as megajester pointed out, tactics and execution are the primary factors.  We saw an extreme case of this in round 1 of the World Championship: Hanzack is one of the most aggressive players out there, while Harren is known more for his defensive skill and technique.  In fact, the defensive player did get the much better position out of the opening, but tactics were the deciding factor in the end, not style.

Arimaa is such a new game that I think everyone would benefit from exploring new ideas and experimenting with different strategies & styles.  Nobody knows for sure what the future will bring, but I'm pretty certain that opening setups & moves will evolve tremendously as we learn more.  It's still amazing to me that, after centuries of analysis by countless thousands of brilliant masters, new ideas can still be found in chess openings.  This game here is an example of how computers are finding wild new tactics that even Capablanca and Kasparov never would have dreamed of:  sacrificing a central pawn and giving up the ability to castle with both queens on the board, just for a passed a-pawn!!?  It sounds absurd, but I'll trust the judgement of this 3200-strength computer :)

http://www.chessgames.com/perl/chessgame?gid=1546726

Title: Re: Whole History Ratings
Post by Hippo on Jan 14th, 2011, 12:58pm

on 01/14/11 at 05:41:47, Adanac wrote:
http://www.chessgames.com/perl/chessgame?gid=1546726


Wow, what a game :). I hardly understand all the hidden tactics ... Black preferred 0-0 to capturing the c5 pawn, but the pawn was captured later without problems ... couldn't Black have avoided pinning his bishop in the final "pin exchange"?

Title: Re: Whole History Ratings
Post by Fritzlein on Jan 14th, 2011, 7:39pm

on 01/13/11 at 13:59:23, Adanac wrote:
Fritzlein has beaten me 10 straight games but somehow I just passed him in WHR  :-[  I'll take a screenshot of this miraculous event in case it never happens again!

One possible explanation is that you played rather more than I did in Q4 of 2010, so you are entering the World Championship "in form", whereas I am rusty and lagging behind the latest advances.  Given how little I have prepared this year, I will be happy if I can just finish on the podium.

Title: Re: Whole History Ratings
Post by Boom on Jan 30th, 2011, 1:41pm
Dear woh,

Is there any chance that your implementation of whole history rankings could be published as open source software?

Thanks,

Boom

Title: Re: Whole History Ratings
Post by woh on Feb 1st, 2011, 10:48am
Hi Boom

I am afraid I have no plans to do so.

woh

Title: Re: Whole History Ratings
Post by omar on Feb 4th, 2011, 3:52pm
Boom, you may be interested to know that recently there was a rating system competition on kaggle.com.

http://www.kaggle.com/chess

If you look in the forum section of that site you will find some of the top finishers have posted their source and methods.

My entry, which was a simplified version of the Chessmetrics rating system, placed 44th out of 258 teams. The source for it is here:

http://arimaa.com/arimaa/rating/chessmetrics/cmrs.txt

Hope this helps.

Title: Re: Whole History Ratings
Post by aaaa on Feb 6th, 2011, 2:51pm
woh, could you increase the variability to 90 Elo^2/day? That's what my research is giving me right now.
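For a sense of what such a parameter means in practice: under the Wiener-process prior that WHR uses, rating variance grows linearly with time, so a variability of w Elo²/day implies a prior drift of sqrt(w · days) Elo over a given span. A minimal sketch (the function name is mine, not part of any WHR implementation):

```python
import math

def drift_sd(variance_per_day, days):
    """Under a Wiener-process prior, rating variance accumulates linearly
    in time, so the prior standard deviation of a player's rating drift
    over `days` days is sqrt(variance_per_day * days)."""
    return math.sqrt(variance_per_day * days)

drift_sd(90, 365)   # about 181 Elo of expected drift over a year
```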

Title: Re: Whole History Ratings
Post by aaaa on Jun 10th, 2011, 9:08am
Given how much the values I'm getting while searching for optimal parameters tend to fluctuate over time, I have finally concluded that there is a considerably wide range of reasonable values, and that this can actually be considered a virtue of the WHR system, as it demonstrates robustness. In light of that, I will stop making further calls to change the official parameters, unless they become really out of whack.

On a different note, people may have noticed that when I make use of game data for various purposes, I have been pretty insistent on maintaining a purported kind of purity by restricting the games to those that are not only rated but also involve only humans. However, when it comes to doing rating research, like finding out what the typical rating of a beginner is, this is actually pretty ill-conceived, as, after all, the overwhelming majority of games humans play on this site are against bots. In addition, interesting questions involving bots, like how much the strength difference between the top humans and bots has changed over time, become outright impossible to answer using only human games as data.

That brings me to the following idea: Instead of letting the WHR system loose on all rated games and simply accepting the already-mentioned distortions that come from humans playing bots, why not provide some compensation by having the system assume that server bots have static performance (which could be done by treating all their games as occurring in a single all-encompassing rating period)? Although this would technically not be entirely sound in light of hardware changes, future bots that are adaptive over time or even the fluctuating nature of the load on the server, I would still think that such an assumption would be a net winner in terms of informativeness. If this would still be too much or one would like to be able to keep tabs on how performance changes with hardware speed, the assumption could be restricted to the subset of server bots that are supposed to be of fixed strength.
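To make the suggestion concrete, here is a toy sketch of the key change (my own illustration, not woh's implementation): in dynamic Bradley-Terry each player normally gets one rating variable per day, and the proposed modification collapses a fixed-strength bot's variables into a single one. The Wiener-process prior linking a player's daily ratings is omitted for brevity; a weak Gaussian prior toward zero keeps under-determined ratings bounded. The bot name and game data are hypothetical.

```python
from collections import defaultdict
import math

FIXED_BOTS = {"bot_Bomb2005P2"}  # hypothetical set of fixed-strength bots

def node(player, day):
    """A fixed-strength bot gets one rating variable for its whole history
    (one perpetual rating period); everyone else gets one per day."""
    return (player, 0) if player in FIXED_BOTS else (player, day)

def fit(games, steps=3000, lr=0.1, prior=0.01):
    """games: list of (winner, loser, day).  Gradient ascent on the
    Bradley-Terry log-likelihood over rating nodes (natural scale),
    with a weak Gaussian prior toward 0 in place of WHR's Wiener prior."""
    r = defaultdict(float)
    for _ in range(steps):
        grad = defaultdict(float)
        for winner, loser, day in games:
            nw, nl = node(winner, day), node(loser, day)
            p = 1.0 / (1.0 + math.exp(r[nl] - r[nw]))  # P(winner beats loser)
            grad[nw] += 1.0 - p
            grad[nl] -= 1.0 - p
        for k in r:
            r[k] += lr * (grad[k] - prior * r[k])
    return dict(r)

games = [("alice", "bot_Bomb2005P2", 1),
         ("bot_Bomb2005P2", "alice", 2),
         ("alice", "bot_Bomb2005P2", 2)]
ratings = fit(games)
# The bot contributes a single node; alice has one node per day played.
```

Note that the bot's single node pools information from all of its games, which is exactly what makes it a useful fixed reference point across eras.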

One concrete use for this hybrid system could be to have the calculated ratings themselves be used to help find optimal parameters for other rating systems, in particular those classified as being "incremental" in the WHR paper; these are the comparably simple "ad-hoc" systems, like the one currently in use by the Arimaa gameroom.

Now, given the existence of the WHR list, Omar may have given any further work on the current gameroom rating system a very low priority, but it's clear that the gameroom ratings are still serving various purposes, including even server-technical ones. So even if a change to a more technically justified incremental rating system, like Glicko, is too much to ask for right now, at the very least, I don't think that, for optimization purposes, it would be too much to ask for less-involved changes, like different system parameters or starting ratings.

Title: Re: Whole History Ratings
Post by aaaa on Aug 20th, 2011, 8:16am
Taking the first game of a 2011 server bot (http://arimaa.com/arimaa/gameroom/comments.cgi?gid=179387) as the cutoff point, I applied the WHR system (prior: 2 wins/4, variance: ca. 311 Elo^2/day) to all 160,318 earlier rated games, with the modification that for each fixed-performance bot, its games, from its point of view, take place in one perpetual rating period. What follows are their performances relative to the median (retroactive) starting rating of a human player:
Fixed-performance bot       Elo above "beginner"  Standard deviation
bot_MarwinXP2Blitz                  868                  23
bot_Sharp2010P2                     773                  44
bot_Marwin2010P2                    661                  42
bot_Clueless2010P2                  660                  53
bot_Bomb2005P2                      652                   7
bot_GnoBot2010P2                    608                  56
bot_Clueless2008P2                  554                  46
bot_Clueless2009P2                  553                  40
bot_Clueless2007P2                  549                  13
bot_Clueless2005P2                  520                  12
bot_PragmaticTheory2010P2           465                  55
bot_Clueless2009P1                  455                  45
bot_Sharp2010P1                     444                  58
bot_Clueless2006P1                  435                  11
bot_Clueless2006P2                  434                  16
bot_GnoBot2006P2                    427                  22
bot_OpFor2008P2                     412                  10
bot_Clueless2010P1                  402                  61
bot_Clueless2007P1                  396                   9
bot_GnoBot2005P2                    388                   8
bot_Clueless2005P1                  374                   8
bot_Bomb2005P1                      358                   5
bot_Badger2010P2                    357                  71
bot_Loc2006P2                       321                  14
bot_Marwin2010P1                    320                  67
bot_Clueless2008P1                  312                  54
bot_Loc2007P2                       311                   9
bot_Aamira2006P2                    292                   7
bot_Sharp2008P2                     276                   8
bot_PragmaticTheory2010P1           250                  69
bot_Arimaazilla                     213                   4
bot_Loc2005P2                       184                  13
bot_OpFor2008P1                     181                   6
bot_Bomb2005P3                      147                 168
bot_GnoBot2010P1                     96                 103
bot_Loc2007P1                        79                   6
bot_Loc2006P1                        62                   8
bot_Badger2010P1                     37                  82
bot_Loc2005P1                        20                   8
bot_GnoBot2005P1                      7                   5
bot_GnoBot2006P1                      3                  13
bot_Rat2009P1                         0                 171
bot_Rat2009P2                         0                 171
bot_ArimaaScoreP3                   -13                 165
bot_Sharp2008P1                     -48                   7
bot_Arimaalon                       -66                   7
bot_ShallowBlue                     -68                   6
bot_ArimaaScoreP2                   -88                   5
bot_Aamira2006P1                    -90                   6
bot_ArimaaScoreP1                  -166                   4
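For a sense of scale, the Elo gaps in the table above can be converted to expected scores with the standard logistic Elo formula (which matches the Bradley-Terry model underlying WHR):

```python
def expected_score(elo_diff):
    """Probability that the higher-rated side wins, given an Elo
    rating difference (standard logistic Elo curve, base 10 / 400)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

expected_score(868)   # bot_MarwinXP2Blitz vs. a median beginner: ~0.99
expected_score(-166)  # bot_ArimaaScoreP1 vs. a median beginner: ~0.28
```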

Title: Re: Whole History Ratings
Post by aaaa on Aug 21st, 2012, 11:56pm
woh, would it be too much trouble to have the peak ratings be links to (the comment pages of) the games they correspond to? That would be a really great feature to have.

Title: Re: Whole History Ratings
Post by woh on Aug 22nd, 2012, 3:41pm
Interesting idea, aaaa. It might take me a while to implement.

Title: Re: Whole History Ratings
Post by clyring on Aug 27th, 2012, 4:46pm
I don't think cyborg_briareus should be in WHRH.

Title: Re: Whole History Ratings
Post by woh on Aug 28th, 2012, 11:01am
Yes, I agree.
I have excluded cyborg_briareus in today's WHRH ranking.

Title: Re: Whole History Ratings
Post by aaaa on Aug 28th, 2012, 11:10am
Game 11711 should also be disregarded.

Title: Re: Whole History Ratings
Post by rbarreira on Aug 28th, 2012, 2:12pm

on 08/28/12 at 11:10:29, aaaa wrote:
Game 11711 should also be disregarded.


But you already unrated it.

edit- ah, do you mean that due to incremental updating your "unratement" (lol is that a word?) would not be picked up automatically?

Title: Re: Whole History Ratings
Post by blackczajk on Sep 15th, 2012, 1:20pm

on 01/14/11 at 02:07:17, megajester wrote:
I thought that the playing style itself was immaterial, that everything rests on the perfect execution of that style. Now I am proposing that even if a style is perfectly executed it can still be beaten by another, perfectly executed style.


I don't know enough yet about Arimaa to really give an answer on it, but I will give a basketball analogy:

I'm coaching a team that's lacking in height, but very fast. I'm playing against Team X, who has great (GREAT) low-post play, but is substantially slow-footed and not very good at shooting from long range. If I decide to defend to my strength, I should play a zone defense, allowing me to double-team the low post and play against passing lanes. Any kind of rebound or turnover should allow me to run the floor and score before Team X can begin to defend me. All that being said, if X's low-post threats are THAT good, I'm probably still taking a beating, but if I can blunt their strength enough, I've got as good a chance as any with what I'm strong at. But if I play man-to-man, with my lack of height, X's height will certainly crush me. (I'm thinking of a 20-10 kind of center or power forward.)

I guess my point is, if you're facing a top five-percent player (or team), perfect execution of a counter style may only blunt what they can do, because they do it so well anyway. But on average, every "offensive" set should have a "defensive" counter that would give a player a big edge. Because of that, if you can extrapolate that thought to Arimaa (otherwise, I think it would have been broken already), yes, styles can counter other styles.

Title: Re: Whole History Ratings
Post by Fritzlein on Sep 15th, 2012, 7:26pm
I'll bet that the objectively best moves in Arimaa sometimes belong to one style and sometimes to another.  Thus to always play the best move, you have to be able to play in every style, or equivalently, with no particular style.

That said, I probably have better practical chances if I steer towards familiar positions that I'm most comfortable with.

Title: Re: Whole History Ratings
Post by aaaa on Oct 13th, 2012, 11:49am

on 08/28/12 at 11:10:29, aaaa wrote:
Game 11711 should also be disregarded.

It appears this game is still being included in the calculation of the WHRH ratings. Can you comment on this, woh?

Title: Re: Whole History Ratings
Post by woh on Oct 16th, 2012, 5:16pm
Game 11711 is still included in the WHRH because the game is still marked as rated in the game database.

Title: Re: Whole History Ratings
Post by aaaa on Oct 17th, 2012, 6:57am

on 10/16/12 at 17:16:36, woh wrote:
Game 11711 is still included in the WHRH because the game is still marked as rated in the game database.

Obviously, but surely you must agree that abandoned games shouldn't be used in its calculation?

Title: Re: Whole History Ratings
Post by aaaa on Dec 1st, 2012, 8:47am
I'm going to reiterate my request for abandoned games to be filtered from the rating lists.

Title: Re: Whole History Ratings
Post by woh on Dec 4th, 2012, 3:25pm
Abandoned games are now excluded.

Title: Re: Whole History Ratings
Post by woh on Dec 25th, 2012, 9:27am

on 08/21/12 at 23:56:23, aaaa wrote:
woh, would it be too much trouble to have the peak ratings be links to (the comment pages of) the games they correspond to? That would be a really great feature to have.


Hi aaaa, the feature you requested is now included.

Title: Re: Whole History Ratings
Post by Fritzlein on Dec 26th, 2012, 9:28am

on 12/25/12 at 09:27:10, woh wrote:
Hi aaaa, the feature you requested is now included.

Thanks!  It is interesting to see the dates at which various players achieved their respective peak ratings.  Chessandgo is approaching the 3rd anniversary of his peak...

Title: Re: Whole History Ratings
Post by supersamu on Dec 26th, 2012, 5:44pm
Woh, there seems to be something wrong with the recently added feature. According to the diagram, arimaa_master's peak WHRH rating occurred after a loss against fritzlein in 2007. I haven't found any other mistakes (yet).

Title: Re: Whole History Ratings
Post by aaaa on Dec 27th, 2012, 1:11pm
I have arimaa_master's peak rating coinciding with the end of his next game (http://arimaa.com/arimaa/gameroom/comments.cgi?gid=65993) instead.

Title: Re: Whole History Ratings
Post by woh on Dec 28th, 2012, 9:23am

on 12/26/12 at 17:44:03, supersamu wrote:
Woh, there seems to be something wrong with the recently added feature. According to the diagram, arimaa_master's peak WHRH rating occurred after a loss against fritzlein in 2007. I haven't found any other mistakes (yet).


Nice catch! It is fixed now. If you had checked all players, you would have found 2 others.


on 12/27/12 at 13:11:08, aaaa wrote:
I have arimaa_master's peak rating coinciding with the end of his next game (http://arimaa.com/arimaa/gameroom/comments.cgi?gid=65993) instead.


That is correct, the link was one game off. Thanks.

Title: Re: Whole History Ratings
Post by mistre on Feb 4th, 2014, 2:43pm
Is it possible to create graphs showing WHR over time like the current gameroom rating graphs?

Title: Re: Whole History Ratings
Post by supersamu on Feb 9th, 2014, 1:42pm
On the current WHRH list I see browni3141's peak rating coinciding with his win over Brendan_M:
http://home.scarlet.be/~woh/whr/whrh.htm

http://arimaa.com/arimaa/gameroom/comments.cgi?gid=291140

However, browni immediately thereafter won against 5HT three times, Nombril twice, bohmaster twice, and kzb52 once, without losing any games in between.
[EDIT: I have been informed that browni's games against 5HT and bohmaster were unrated; that somehow slipped past me. Sorry. But the question is still valid, I think.]

http://arimaa.com/arimaa/gameroom/pastgames.cgi?id=16866

After that he is defeated by Adanac. Shouldn't the peak rating link to his last game against kzb52? Or are there other factors involved?


Title: Re: Whole History Ratings
Post by Fritzlein on Feb 9th, 2014, 10:36pm
If the WHR ratings were updated like gameroom ratings, then yes, the peak rating would always come directly before a loss, at the end of a winning streak.  But WHR ratings are calculated retroactively.  After browni lost to Adanac, WHR went back and thought, "You probably weren't that good when you beat kzb after all; that was luckier than I thought."

When you go back after the fact and decide how good someone was when they played, you are most likely to think their peak came at the middle of a winning streak, gradually sloping up to and down from that peak.  Otherwise you have to claim that browni got worse all of a sudden after his last win of the streak and before his first loss, which is less likely to be true.

What's more, if browni now goes on a long losing streak, WHR will conclude that his peak rating was even earlier than it now thinks.  The system will discard the notion that he has been getting better and better in favor of the notion that he has been getting worse for some time now, and it didn't immediately show up in the results.
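Fritzlein's point can be reproduced with a toy version of WHR's whole-history optimization (my own sketch, not the real implementation): one player's daily ratings against opponents all fixed at rating 0, a Wiener-process prior linking consecutive days, and an anchor of one virtual win and one virtual loss on the first day. Appending a later loss lowers the retroactive estimate of an earlier day.

```python
import math

def map_history(results, days, w2=0.5, steps=4000, lr=0.05):
    """MAP rating history (natural scale) for one player whose opponents
    are all fixed at rating 0.  results: list of (day, won) pairs.
    Bradley-Terry likelihood + Wiener prior (variance w2 per day) fitted
    by gradient ascent; a virtual win and loss on day 0 anchor the scale."""
    anchored = list(results) + [(0, True), (0, False)]
    r = [0.0] * days
    for _ in range(steps):
        g = [0.0] * days
        for day, won in anchored:
            p = 1.0 / (1.0 + math.exp(-r[day]))      # P(player wins)
            g[day] += (1.0 if won else 0.0) - p
        for d in range(days - 1):                    # Wiener-prior gradient
            diff = r[d + 1] - r[d]
            g[d] += diff / w2
            g[d + 1] -= diff / w2
        r = [x + lr * gx for x, gx in zip(r, g)]
    return r

streak = [(0, True), (1, True), (2, True)]           # a 3-game winning streak
before = map_history(streak, days=4)
after = map_history(streak + [(3, False)], days=4)
# The later loss retroactively lowers the day-2 estimate: after[2] < before[2]
```

Without the loss, the optimizer leaves day 3 level with day 2 (the prior sees no reason for a change); with the loss, both day 3 and, through the prior, day 2 are pulled down, which is exactly the "peak in the middle of the streak" effect.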

Title: Re: Whole History Ratings
Post by Fritzlein on Mar 11th, 2014, 8:43pm
Are the games from the Screening counted in WHRE?  It seems to me they should be.  Especially given that the main problem with the WHRE is that humans don't have enough chances to play in events, this would be a perfect opportunity for folks to get a rating established.

Title: Re: Whole History Ratings
Post by woh on Mar 12th, 2014, 7:00am
The screening games are currently not included in WHRE. They are stored in the game database as casual games. If the event field were changed to something like 'ACS 2014', they would be included. The alternative would be to change my code to include HvB games against either bot_ziltoid or bot_sharp that end during the screening period. I believe it makes more sense to change the event field.

Title: Re: Whole History Ratings
Post by aaaa on Aug 31st, 2014, 6:58pm
The following games should be disregarded from any rating calculation:

211904 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=211904) and 211911 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=211911) (because they were resumed by 211922 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=211922))
42449 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=42449) (because it's a duplication of 42448 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=42448))
1922 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=1922) (because it was resumed by 1923 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=1923))
130221 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=130221) (because it was resumed by 130227 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=130227))

I'd unrate them already, but I had given up that power as a security precaution.

Title: Re: Whole History Ratings
Post by supersamu on Oct 11th, 2014, 8:00am

on 02/04/14 at 14:43:58, mistre wrote:
Is it possible to create graphs showing WHR over time like the current gameroom rating graphs?


I would really like to see this feature, possibly even with a graph of the position in the standings over time.

Title: Re: Whole History Ratings
Post by woh on Nov 18th, 2014, 5:09am
Starting with today's WHR ratings, these games are now excluded.


on 08/31/14 at 18:58:20, aaaa wrote:
The following games should be disregarded from any rating calculation:

211904 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=211904) and 211911 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=211911) (because they were resumed by 211922 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=211922))
42449 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=42449) (because it's a duplication of 42448 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=42448))
1922 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=1922) (because it was resumed by 1923 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=1923))
130221 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=130221) (because it was resumed by 130227 (http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=130227))

I'd unrate them already, but I had given up that power as a security precaution.


Title: Re: Whole History Ratings
Post by Janzert on Nov 18th, 2014, 11:42am
Does WHR not automatically drop games after they are unrated?

It's sorta hard to tell since the links don't go to the comments page, but all of those games should have been unrated back in September.

Janzert

Title: Re: Whole History Ratings
Post by aaaa on Nov 18th, 2014, 1:37pm

on 11/18/14 at 11:42:14, Janzert wrote:
Does WHR not automatically drop games after they are unrated?

A database file no longer gets automatically updated after the corresponding month is over. See here (http://arimaa.com/arimaa/download/gameData/).

Title: Re: Whole History Ratings
Post by clyring on Jan 31st, 2016, 8:09am
It looks like the Shadow-mattj256 WC Warmup game is still being counted in WHRE despite being flagged as a non-event game. If it's not convenient to modify the behavior of WHRE for these types of games, I'm also okay with resetting the event and/or eventid fields to their defaults and marking any future games of this type via the comments.

EDIT: Okay, it looks like the 'event game' flag I unset for this game isn't actually represented in the game archives and is hence inaccessible to the program calculating WHRE. I'll just set the event and eventid fields back to their defaults for this game and it should pick up the change next time the January archives are updated.



Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1!
YaBB © 2000-2003. All Rights Reserved.