Topic: Whole History Ratings (Read 69533 times)
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
Whole History Ratings
« on: Apr 8th, 2008, 7:03pm »
Since there have been a number of discussions here about various rating systems, I thought people would find this paper interesting: "Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength."

Abstract: Whole-History Rating (WHR) is a new method to estimate the time-varying strengths of players involved in paired comparisons. Like many variations of the Elo rating system, the whole-history approach is based on the dynamic Bradley-Terry model. But, instead of using incremental approximations, WHR directly computes the exact maximum a posteriori over the whole rating history of all players. This additional accuracy comes at a higher computational cost than traditional methods, but computation is still fast enough to be easily applied in real time to large-scale game servers (a new game is added in less than 0.001 second). Experiments demonstrate that, in comparison to Elo, Glicko, TrueSkill, and decayed-history algorithms, WHR produces better predictions.

On a side note, the author of the paper, Remi Coulom, is also the author of CrazyStone, one of the current top Go programs.

Janzert
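For readers who want the flavor of the model: the Bradley-Terry win probability, its Elo reparameterization, and WHR's time-varying prior can be sketched in a few lines. This is a minimal illustration, not Coulom's implementation; the variance parameter `w2` is an invented placeholder, not a value from the paper.

```python
import math

def win_prob(gamma_a, gamma_b):
    """Bradley-Terry model: P(A beats B) for positive strengths gamma."""
    return gamma_a / (gamma_a + gamma_b)

def elo_to_gamma(rating):
    """Elo is a log reparameterization of Bradley-Terry: gamma = 10^(R/400)."""
    return 10 ** (rating / 400)

def wiener_log_prior(r1, r2, dt, w2=60.0):
    """WHR's 'dynamic' part: a player's rating follows a Wiener process,
    so a change from r1 to r2 over time dt has Gaussian log-density with
    variance w2 * dt (w2 here is an arbitrary illustrative value)."""
    var = w2 * dt
    return -((r2 - r1) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

# A 200-point Elo gap gives roughly a 76% expected score.
p = win_prob(elo_to_gamma(1700), elo_to_gamma(1500))
```

The prior is what lets WHR explain a surprising result either as luck or as a genuine change in skill, trading the two off against the elapsed time.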
« Last Edit: Apr 8th, 2008, 7:03pm by Janzert »
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
Re: Whole History Ratings
« Reply #1 on: Apr 9th, 2008, 2:45pm »
Thank you for this link, Janzert. The WHR philosophy is the same as the FRIAR ratings I posted in this thread: http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1148669732;start=0 except that WHR does it way, way better. The model of time variance in rating is better, and the approximation algorithm is better. This paper is what I wanted to do in a rating system, but didn't have the skills to do.

The more I read from Remi Coulom, the more impressed I am. There are mathematicians who want to use the coolest tool they have, i.e. the heaviest machinery available, but Coulom is interested in what works best, not what is most complicated or most sophisticated. On the other hand, there are engineers who only care about what works, and therefore innovate only in tiny incremental steps rather than in broad strokes like Coulom. It's a rare combination to have the best attributes of the theoretician and the best attributes of a practitioner combined in one person, but I think that's what we have here.

The_Jeh, if you are interested in taking your rating systems to the next level, this is a must-read. Omar, if you want to replace the gameroom ratings with something better, this should be the starting point rather than p8 ratings. I still believe that it is basically impossible to get accurate ratings with non-learning bots and self-selection of opponents messing everything up. However, if we just took HvH rated games and fed them into the WHR framework, I think we would get the best ratings we are capable of getting with today's technology.
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
Re: Whole History Ratings
« Reply #2 on: Apr 9th, 2008, 10:47pm »
Wow, you seem really excited about this rating system, Karl. I'll have to check out this paper. If you get a chance, can you tell us what you like about this system and how it works? It would be useful for anyone who has not read the article, and for me to make sure I understood it. Thanks.
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
Re: Whole History Ratings
« Reply #3 on: Apr 11th, 2008, 12:50pm »
The idea of rating the whole game history simultaneously is appealing because inaccurate ratings are retroactively corrected. One could construct many examples where this is beneficial, but I will just give two.

First, suppose that Player A learns to play Arimaa quite well off-line in his local game club, reaching a playing strength of 2100. He joins the Arimaa server rated 1500 and beats you three in a row. That costs you about 80 rating points in the current system. Yes, Player A's gameroom rating is rapidly rising, and before long it will start hovering around 2100. Nevertheless, you are permanently out those 80 rating points that you didn't deserve to lose. You paid the price for making his rating more accurate. In a whole-history rating system, in contrast, once the system figures out that Player A is rated 2100 now, it figures that he probably wasn't rated 1500 in those first few games either. His skill might have been a little different then, but not much. Therefore, after the fact, the penalty you suffer for losing to him is reduced. The more common situation is that a player enters overrated at 1500, and the weak player who beats him (usually a low bot on the ladder) gets too much credit. WHR fixes that too.

Second, suppose that two players (whom I will call seanmcl and Asubfive) join the server and play games only against each other. After fifty games they have each won twenty-five, effectively proving that they are of equal skill. Their ratings both remain at 1500, since they never play anyone else, but really their playing strength has risen to 1800. WHR can't fix that; nothing can fix it unless they play a variety of opponents. But now suppose that Asubfive starts playing other people. His rating will rise to 1800, but only slowly, because he is "established" at 1500. Furthermore seanmcl, despite having proven that he's just as good, won't have a rating increase at all.

Contrast this to the WHR system, where Asubfive's rating will not be firmly anchored by playing lots of games against a single opponent. Furthermore, when Asubfive starts to get a game record against other players, seanmcl's rating will change too, since it is firmly correlated to Asubfive's.

Now, Coulom is not the first to use the whole database of games to get the benefits I list above. The_Jeh's ratings, for example, use all the games at once. The trouble with The_Jeh's system and others like it is the assumption that the strength of the players remains constant across the rating period. For example, even though chessandgo has caught or passed me in playing strength, his rating is weighed down by thirty consecutive losses to me from when he was first learning. Unless The_Jeh leaves those games out entirely (which forfeits the benefits of remembering history), chessandgo will _never_ catch up to me in rating (unless he significantly passes me in skill), because he continues to be punished for the past.

A good rating system must allow the game data to be partially explained by changing skill. The_Jeh's ratings apply retroactive corrections but don't allow for time-varying playing strength. Most other systems (e.g. the game room ratings, p8 ratings, aaaa's ratings) allow for time-varying playing strength but don't allow retroactive correction of inaccuracies. What Coulom does is effectively balance correction of past estimation error while also allowing for change over time.
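The first example can be made concrete with the standard incremental Elo update. K = 32 and the 400-point scale below are illustrative choices, not the game room's actual parameters, so the total loss comes out near 90 rather than exactly 80; the point is that under incremental Elo those points never come back.

```python
def elo_update(r_winner, r_loser, k=32, scale=400):
    """Standard incremental Elo: winner gains k*(1 - E), loser loses the same,
    where E is the winner's expected score before the game."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / scale))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# An established 2100 player loses three straight games to an
# underrated 1500 newcomer whose true strength is also about 2100.
veteran, newcomer = 2100.0, 1500.0
for _ in range(3):
    newcomer, veteran = elo_update(newcomer, veteran)

loss = 2100.0 - veteran   # roughly 90 points with these parameters
```

A whole-history system, refitting the newcomer's early games at his revised strength, would retroactively hand most of those points back.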
« Last Edit: Apr 11th, 2008, 12:51pm by Fritzlein »
mistre
Forum Guru
Gender:
Posts: 553
Re: Whole History Ratings
« Reply #4 on: Apr 11th, 2008, 1:18pm »
"Furthermore, when Asubfive starts to get a game record against other players, seanmcl's rating will change too, since it is firmly correlated to Asubfive's."

So, if I understand you correctly, your rating will change even when you don't play, whenever the people you have played play games? There is nothing inherently wrong with that, but it would take some getting used to. I guess over time, as you play a large variety of players, the effect would be so small you would barely notice it, but I am just guessing here.
aaaa
Forum Guru
Arimaa player #958
Posts: 768
Re: Whole History Ratings
« Reply #5 on: Apr 11th, 2008, 1:36pm »
The KGS Go Server also has ratings that change without playing.
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
Re: Whole History Ratings
« Reply #6 on: Apr 11th, 2008, 2:42pm »
on Apr 11th, 2008, 1:18pm, mistre wrote: "So, if I understand you correctly, your game rating will change even when you don't play, but when others that you played play games?"

Yes, that's true, and yes, it does take some getting used to. It's somewhat akin to the tiebreak points in a Swiss tournament: the performance of your opponents after you have played them affects your standing. I'm curious how much movement there would be in practice, and whether it would disturb me as a player. My hunch is that a bit of drift wouldn't bother me if in exchange I didn't see static ratings that appear wildly inaccurate. But maybe I'm trying to strike a bargain that isn't available, since the inaccuracy mostly comes from bots.

on Apr 11th, 2008, 1:36pm, aaaa wrote: "The KGS Go Server also has ratings that change without playing."

Is there a lot of griping about that on KGS, or are the folks there pretty happy with their rating system?
« Last Edit: Apr 11th, 2008, 2:43pm by Fritzlein »
aaaa
Forum Guru
Arimaa player #958
Posts: 768
Re: Whole History Ratings
« Reply #7 on: Apr 11th, 2008, 3:22pm »
Rating systems are of course a big issue in Go as they serve to determine appropriate handicap. One noticeable difference is that one doesn't even get a starting rating until having played a game or two.
lightvector
Forum Guru
Arimaa player #2543
Gender:
Posts: 197
Re: Whole History Ratings
« Reply #8 on: Apr 11th, 2008, 3:27pm »
I play(ed) on KGS as a 1 kyu, although very little for the last six months; I'm quite out of practice. Like every server, KGS has its complaints, but overall, the rating system seems to perform reasonably well. KGS also seems good about detecting and readjusting ratings and rating uncertainties in response to large over/under-estimations. The current system seems plenty accurate, at least from 10K and stronger: for instance, a 2K playing a 5K can reasonably expect a balanced game with a 3-stone handicap.

However, I get the impression there are complaints about the lack of transparency of the system, and ratings tend to drift rather strangely if you play very infrequently. Also, every year or two, I notice that the server has to rescale all the ratings to compensate for inflation and/or inaccurate scaling, which can occur because of the relatively low frequency of handicap games to provide cross-rating data. And every once in a while you get problems with sandbaggers, etc., although there are a few mechanisms to minimize that.
« Last Edit: Apr 11th, 2008, 3:33pm by lightvector »
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
Re: Whole History Ratings
« Reply #9 on: Apr 11th, 2008, 3:48pm »
Thanks for the reviews, guys. A good Go rating system is in any case much harder than a good chess rating system. All chess games are played without handicap, but most Go games are played with handicap. The chess rating scale is calibrated by winning percentage alone, whereas the Go rating scale is calibrated by winning percentage and handicap stones.

The Go rating systems try to solve the problem by saying that one stone of handicap equals X rating points. But this is bound to be distorted. A stone of handicap between 5-dan players is immensely important, whereas a stone of handicap between 15-kyu players is essentially irrelevant. The EGF has at least acknowledged this problem by making a stone of handicap worth different Elo points depending on the level of the players, but I think they don't go nearly far enough.

Note that Coulom, in order to validate the superiority of his rating system, tested it only on even Go games. He didn't even try to accommodate handicap games, so he had only the parameter of winning percentage to worry about.
« Last Edit: Apr 11th, 2008, 3:49pm by Fritzlein »
The_Jeh
Forum Guru
Arimaa player #634
Gender:
Posts: 460
Re: Whole History Ratings
« Reply #10 on: Apr 11th, 2008, 4:53pm »
Time-varying playing strength doesn't matter with the bots. Really, it is nonsensical for a bot's rating to change at all after it has played hundreds of games. (Well, the exception would be a bot that bases its play on the database or otherwise teaches itself.) There's got to be a way to deduce the bots' strengths relative to each other, and then set their RU (rating uncertainty) to zero. Maybe that would solve some problems until WHR can be implemented.
« Last Edit: Apr 11th, 2008, 4:54pm by The_Jeh »
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
Re: Whole History Ratings
« Reply #11 on: Apr 12th, 2008, 2:11am »
on Apr 11th, 2008, 4:53pm, The_Jeh wrote: "Time-varying playing strength doesn't matter with the bots. Really, it is nonsensical for a bot's rating to change at all after it has played hundreds of games. (Well, the exception would be a bot that bases its play on the database or otherwise teaches itself.) There's got to be a way to deduce the bots' strengths relative to each other, and then set their RU to zero. Maybe that would solve some problems until WHR can be implemented."

Yes, I also think that P1 and P2 type bot ratings should be fixed. Additionally, the rating system should be anchored so that random play has a rating of zero. This can be achieved using a random bot which has a random setup and picks random moves (i.e. enumerates all the possible unique positions that can arise from the current position and randomly selects one).
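A sketch of such an anchor bot, assuming a hypothetical `legal_next_positions` function supplied by a real Arimaa move generator (the stub names here are illustrative, not from any actual implementation):

```python
import random

def random_bot_move(position, legal_next_positions, rng=None):
    """Enumerate the unique positions reachable from the current position
    (collapsing moves that transpose to the same position) and pick one
    uniformly at random, as described above."""
    rng = rng or random
    unique = sorted(set(legal_next_positions(position)))  # dedupe transpositions
    return rng.choice(unique)
```

Fixing this bot's rating at zero would pin the whole scale, since every other rating is then measured against truly random play.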
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
Re: Whole History Ratings
« Reply #12 on: Apr 12th, 2008, 3:40am »
I've had a chance to read the paper now. Thanks, Karl, for writing your views on WHR. I think I understand it now, though not enough to start implementing it.

In the paper the performance of different rating systems is compared by seeing how well they predict the outcome of games. I'm a little surprised that the numbers are around 55% (in Table 1 of the paper); I would have thought they would be higher, at least 60%. I am also surprised that the difference in prediction performance between WHR and Elo is so small: just 0.672%.

In practice I think I'm less concerned about how accurately the rating system can predict outcomes and more concerned about how resistant the rating system is to abuse such as pumping and sandbagging.

Last December I was experimenting with rating systems based on the idea that not all games should be rated equally. The idea is based on your proposal that games against bots should be rated half as much as games against humans (i.e. K/2). If we can use the opponent type to weight the game, then we could also take other factors into consideration, like the opponent's rating uncertainty and your existing record against this opponent. I was able to get results that were similar to the P8 rating system, but using a simple Elo equation; a lot less computation than P8.

So in a situation where your opponent has a much higher rating uncertainty than you, the game does not count as much for you, and so you don't lose or gain many points from it; but for the player with the high uncertainty it counts as usual, and thus the system is able to converge the ratings of new players without causing much disruption to the established players. Also, by looking at the previous record, games can be weighted less and less as the record diverges from 50%; this helps prevent pumping/dumping one's rating by repeatedly winning/losing against the same opponent.
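The weighting scheme described above might look roughly like this sketch; every weight and formula here is an invented illustration for the sake of concreteness, not the actual experiment.

```python
def game_weight(opponent_is_bot, opponent_rd, own_rd, head_to_head_score):
    """Multiplier on K for one player's update.  head_to_head_score is
    that player's historical score against this opponent, in [0, 1]."""
    w = 1.0
    if opponent_is_bot:
        w *= 0.5                           # bot games rated half (K/2)
    if opponent_rd > own_rd:
        w *= own_rd / opponent_rd          # discount games vs uncertain newcomers
    w *= 1.0 - 2.0 * abs(head_to_head_score - 0.5)   # fade lopsided pairings
    return w

def weighted_elo_delta(own, opp, score, weight, k=32, scale=400):
    """Plain Elo update scaled by the game weight."""
    expected = 1 / (1 + 10 ** ((opp - own) / scale))
    return weight * k * (score - expected)
```

Because each player computes the weight from their own side, the established player's update is damped while the newcomer's is not, which is the asymmetry described above.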
Fritzlein
Forum Guru
Arimaa player #706
Gender:
Posts: 5928
Re: Whole History Ratings
« Reply #13 on: Apr 12th, 2008, 9:42am »
on Apr 12th, 2008, 3:40am, omar wrote: "In the paper the performance of different rating systems is compared by seeing how well they predict the outcome of games. I'm a little surprised that the numbers are around 55% (in Table 1 of the paper); I would have thought they would be higher, at least 60%."

Probably the number of games predicted correctly is so low because he was only looking at even Go games. If you were predicting only on games which were coin flips, you couldn't pick the winner more than 55% of the time yourself. The games which would be easy to predict, i.e. mismatched games, are usually played at handicap on the Go server, and therefore excluded from the data. For the Arimaa server, where lots of mismatched games take place, I'm sure all the different rating systems could pick the winner over 60% of the time.

Quote: "I am also surprised that the difference in prediction performance between WHR and Elo is so small: just 0.672%."

That small percentage difference between the systems could actually represent a substantial difference in predictive ability. The difference in score comes entirely from the few games in which the different rating systems pick different players to win. If one rating system thinks Player A is 60% favored to win, and the other thinks Player A is 70% favored to win, both rating systems will pick Player A to win, and there will be no discrimination between the two, regardless of whether the true winning probability for Player A is 45%, 53%, or 89%. This is the problem you had in the first year of the spectator contest: how do you tell who is the better predictor if everyone bets on the better player?

Not only will Coulom's performance metric fail to discriminate between rating systems on most of the games, showing a difference only when they think opposite players are favorites, but furthermore, even when one system picks Player A to win and the other picks Player B to win, if the true odds are 51% to 49%, then almost half the time the better predictor will be punished and the worse predictor rewarded just by chance. The average gain for being smarter is small even in the games where the metric measures a difference. Under the circumstances and the metric that Coulom chose, WHR scored a crushing victory over the other rating systems.

I actually think Coulom's metric for showing the worth of his system sucks. He should have used the formula we used in our 2005-2007 spectator contests, namely root-mean-square error of percentage prediction, so that good predictors are discriminated from bad ones by the confidence of the prediction as well as by whom they chose to win. For this better metric, one can also measure how much error random chance imposes on even the theoretically best predictor, and thus have a scale for understanding superior performance so that, say, cutting the predictive error in half isn't dismissed as a negligible performance gain.
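The contrast between the two metrics is easy to demonstrate on a toy example (the probabilities below are invented): two predictors that always favor the same player are indistinguishable under pick-the-winner scoring, but root-mean-square error rewards the better-calibrated one.

```python
import math

def pick_winner_accuracy(preds, outcomes):
    """Pick-the-winner metric: fraction of games where the favored player
    won (a 50% prediction scores half a point)."""
    total = 0.0
    for p, won in zip(preds, outcomes):
        if p == 0.5:
            total += 0.5
        elif (p > 0.5) == won:
            total += 1.0
    return total / len(preds)

def rmse(preds, outcomes):
    """Root-mean-square error of the predicted win probability."""
    return math.sqrt(sum((p - w) ** 2 for p, w in zip(preds, outcomes)) / len(preds))

# Two predictors that always favor player A, with different confidence.
outcomes  = [1, 1, 1, 0]              # player A wins 3 of 4
confident = [0.75, 0.75, 0.75, 0.75]  # well calibrated
timid     = [0.55, 0.55, 0.55, 0.55]  # underconfident
```

Pick-the-winner scores both at 0.75 and cannot tell them apart; RMSE correctly ranks the calibrated predictor first.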
Quote: "In practice I think I'm less concerned about how accurately the rating system can predict outcomes and more concerned about how resistant the rating system is to abuse such as pumping and sandbagging."

Yes, my primary concern is also limiting the potential for abuse rather than maximizing predictive ability. However, increasing predictive ability is by nature the way to limit one type of abuse, namely the self-selection of opponents. Any inaccuracy of prediction is inherently an abuse opportunity. If the ratings predict that Player A is 70% to win a match where his actual winning chance is 90%, then that match will gain him rating points on average, and Player A can select that opponent repeatedly to pump his own rating.

Quote: "The idea is based on your proposal that games against bots should be rated half as much as games against humans (i.e. K/2)."

Weighting bot games half as much would only mean that pumping your rating up against a bot would take twice as many games. It wouldn't stop you from eventually getting just as many rating points in the long run. The alternative that I mentioned to you before, namely reducing the scaling factor from 400 to 200 for games involving a bot, actually cuts in half the number of points a player can pump their rating against a bot, no matter how many times they play it. If that's what you are referring to, I'm all in favor of it. I think we should implement it today in the framework of the current game room ratings. That would have a huge benefit in minimizing the distortion caused by bots. Then later on, when we also want to minimize the distortion caused by temporarily inaccurate ratings, we can implement WHR along with a reduced scaling factor for bot games.
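The effect of the reduced scaling factor can be worked out in closed form. Pumping against a bot stalls at the rating gap where the system's predicted win probability matches the true one; that equilibrium gap is proportional to the scale, so halving the scale halves the pumpable lead. (The 90% true win rate below is just an illustration.)

```python
import math

def equilibrium_gap(true_win_prob, scale=400):
    """Rating lead over a fixed opponent at which Elo's prediction equals
    the true win probability, so further wins gain no points on average."""
    return scale * math.log10(true_win_prob / (1 - true_win_prob))

gap_400 = equilibrium_gap(0.9, scale=400)   # ~382 points above the bot
gap_200 = equilibrium_gap(0.9, scale=200)   # ~191: half the pumpable lead
```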
| In the current scheme it would definitely be an improvement for games of newcomers to affect the ratings of established players less. But if you think it is worth tweaking the system so that newcomers mess up the ratings of established players less than they do now, wouldn't you really like to tweak the system so that newcomers mess up the ratings of established players not at all, plus have the benefit that the games are counted fully for both players, so both players have full incentive to try their best? WHR does what you are trying to do by tweaking the weighting, except it does it better. Quote:Also by looking at the previous record, games can be weighted less and less as you diverge from 50% percent; this helps prevent pumping/dumping ones rating by repeatedly winning/losing against the same opponent. |
| That's an intriguing idea that I haven't fully considered. We had previous discussed weighting games less when the two players have played each other a lot, regardless of whether one player has won most. You seem to be saying that the games should count less if one player has won most, regardless of how many times they have played. Intuitively I like the idea of games that are more nearly equal counting more, but I would have to think more about the possible consequences. Would that mean that upset victories also don't count for much? Quote:I was able to get results that were similar to the P8 rating system, but using a simple Elo equation; lot less computation than P8. |
| I agree that the P8 ratings take way too much computation. So do the FRIAR ratings that I experimented with. In my past enthusiasm for using the whole history of data, I wasn't sufficiently accounting for the practical problem of needing too much CPU time. We can't dedicate a whole server just to constantly computing ratings. But much of the problem was using an inefficient estimation technique. FRIAR used a successive approximation method that took forever to converge, and from what I understand of your implementation, so does p8. Coulom claims that WHR ratings converge extremely quickly using Newton's method. After each game, the ratings of the two players involved can be updated with a quick computation; an approximation involving all players is more time-consuming but not terrible, and can be done at intervals to keep the system calibrated. Coulom's claim, which I buy into, is that whether using the whole history is too expensive or not is all about the efficiency of the implementation. There's a difference between rejecting a system because it will require too much CPU after it is implemented, and rejecting it because it looks a little complicated to implement. Not that the latter reason isn't a valid one; we should just be clear about the reasoning.
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
Re: Whole History Ratings
« Reply #14 on: Apr 13th, 2008, 4:41pm »
Quote: "Would that mean that upset victories also don't count for much?"

If you had played this player before and had a good record against them, then the first upset would not count that much; but if you kept losing, eventually your record would get closer to 50% and the games would start counting more. If this was the first time you played this player and you had no previous record against them, then the upset would count as normal.

I would eventually like to start experimenting with WHR and run it on the rated games from the gameroom to see what kind of numbers it produces, but it will probably be a while before I get to it. If anyone else is interested, there is a file which has just the rated games from the gameroom up to Dec 2006:

http://arimaa.com/arimaa/download/gameData/ratedgames.zip
http://arimaa.com/arimaa/download/gameData/ratedgames.tgz

I can update it if you need more recent games.