Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> General Discussion >> Evidence of inaccurate ratings
(Message started by: Fritzlein on May 25th, 2005, 7:47pm)

Title: Evidence of inaccurate ratings
Post by Fritzlein on May 25th, 2005, 7:47pm
In another thread, Omar asked me to do some statistical analysis on the accuracy (or rather inaccuracy) of Arimaa ratings.  The first thing I wanted to check was the predicted scores of underdogs compared to their actual scores.  This is like the doctor checking your blood pressure.  High blood pressure is a symptom of so many different problems that doctors always check your blood pressure first, and if it is high, they try to figure out what is causing it.  Most things that could be wrong with a rating system will result in the underdogs scoring more wins than the ratings would predict.

Sure enough, in 9444 decisive rated games, the ratings predicted the underdog to win 2565, but they actually won 2670.  For the underdogs to be winning 4.1% more often than predicted may not seem like much, but in truth it suggests substantial rating inaccuracies of some type.  The reason this small percentage could indicate a big problem is that the system has a built-in mechanism to self-correct when it is making wrong predictions, so persistent wrong predictions point to something intractably wrong that prevents the system from self-correcting properly.

When I restricted the query to human versus bot games, the problem seemed to become even worse.  In 7708 decisive rated hvb games, the ratings predicted the underdog to win 2067, but they actually won 2186, a surplus of 5.8%.

However, just as I was gearing up to estimate in terms of plus-or-minus rating points just how awful the problem of inaccurate ratings appears to be from these statistics, an obvious explanation occurred to me: It is all the fault of newcomers.  Almost everyone who ever entered the system played their first rated game against Arimaazilla or, more recently, Arimaalon.  The ratings inaccurately have these beginning players as favorites.  For those who persist in playing, there is often a reverse trend: after losing enough points to get below the weak bots, the new players are supposedly underdogs, but they quickly learn how to beat the bots, and again the underdogs do better than expected.

Clearly, to get useful results, I need to filter out games involving beginners.  Unfortunately, the game result database doesn't save the rating uncertainty (RU) of each player for each game, so I couldn't filter my query based on both players having a low RU.  What I did instead was to filter on both players having ratings over 1600, because most players over 1600 have a low RU.  The results shocked me.  In 2506 decisive rated over-1600 games, the ratings predicted the underdog to win 786, but they actually won 785.  In other words, apart from new people entering the system, the Arimaa ratings appear not to have high blood pressure.

One theory is that, because most games are human vs. bot games, the ratings are fairly accurate for human vs. bot games, but the ratings are not accurate for human vs. human games.  When I further restricted my query to human vs. human games, the same incredible result surfaced: In 319 decisive rated over-1600 hvh games, the ratings predicted the underdog to win 96, and they actually won exactly 96!

Now, this doesn't mean that the system is actually functioning perfectly well.  True, the most obvious symptom of inaccurate ratings does not afflict over-1600 players, but just because your blood pressure is normal doesn't mean you don't have some other disease.  I'll run some other tests and report if I turn up anything interesting.




Title: Re: Evidence of inaccurate ratings
Post by 99of9 on May 26th, 2005, 12:14am
Wow that's really interesting.  

I'm glad you remembered the new-player factor, because after the first couple of paragraphs I was about to reply and say "Can you cut out the first 30 games from each player?"

I'll continue to watch with interest, because although I don't have many ideas on how to test ratings systems, these tests interest me a great deal.

When you say "decisive" - do you mean you've just cut out draws, or games lost on time as well?

Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on May 26th, 2005, 8:20am

on 05/26/05 at 00:14:37, 99of9 wrote:
When you say "decisive" - do you mean you've just cut out draws, or games lost on time as well?


I mean the result was "b" or "w" not "d" or "a".  I didn't filter based on the method of victory.

Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on May 26th, 2005, 4:34pm
Well, I did another test.  Because I believe that botbashing generally makes ratings inaccurate, I decided to compare the performance ratings of various players divided into games against bots and games against humans.  I suspected that people who beat bots the same way over and over (like me) would have lower performance against humans whereas people who play experimentally against bots and don't care about their ratings (like PMertens) would have higher performance against humans.  In fact, the data more or less bear this out:

Player        Games vs Humans   Performance vs Humans   Games vs Bots   Performance vs Bots   Difference
PMertens             32                  2020                 74                 1880              +140
Belbo                21                  2000                 83                 1920               +80
99of9                61                  2200                 41                 2120               +80
Arimanator           21                  1420                 87                 1480               -60
Adanac               20                  1830                 83                 1890               -60
Naveed               22                  1670                 82                 1760               -90
Omar                 28                  1900                 67                 1990               -90
Fritzlein            25                  2290                 81                 2480              -190

Notes on methodology: My database is a month out of date, so it doesn't include the recent spate of blitz games or Arimanator's abrupt ratings rise.  (Also, if the blitz games were included, Omar wouldn't look so good versus bots! :-))  I took all players with over 200 rated games and looked at their most recent 100 rated games or so.  If they had fewer than 20 rated games against humans within those 100 (haizhi, bleitner, clauchau), I threw them out.  For the rest, I calculated to within 10 points the rating they would have had to have in order to be predicted to win as many games as they actually won.
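For anyone who wants to check my method, the search itself is simple; here is a rough sketch in Python (the opponent ratings and win count at the bottom are made-up examples, not anyone's real data):

Code:
def expected_score(rating, opp_rating):
    # Standard Elo-style expected score on a 400-point scale.
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def performance_rating(opp_ratings, actual_wins, step=10):
    # Find, to within `step` points, the rating that would be predicted
    # to win as many games against these opponents as were actually won.
    best_rating, best_gap = None, float("inf")
    for r in range(1000, 3001, step):
        predicted = sum(expected_score(r, opp) for opp in opp_ratings)
        if abs(predicted - actual_wins) < best_gap:
            best_rating, best_gap = r, abs(predicted - actual_wins)
    return best_rating

# Hypothetical example: 20 games against opponents rated around 1800, 15 wins.
opponents = [1800, 1750, 1900, 1850, 1700] * 4
print(performance_rating(opponents, 15))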

One result that surprised me was Naveed doing better against bots than humans, but on further reflection his recent win against me isn't included, and his poor result in the postal tourney (2 of 9 versus humans) explains most of his low anti-human rating.  Perhaps 99of9 is right when he says Naveed is just erratic, and perhaps time controls are a factor here too.  If there were enough good data I would impose restrictions on the time controls as well.

[EDIT: What I wrote next is bunkus; I correct it in the post just after.]

At any rate, we can now do some math on the accuracy of the ratings.  If the ratings were reliable, then we should be able to split players' games into groups however we like and get roughly the same performance rating in each group.  They won't be exactly the same, because a player's performance from game to game will vary with a standard deviation of about plus or minus 150 points according to the rating model in use.  However, measured over 20 games the standard deviation of performance should be only about 150/sqrt(20) = 33.5 and over 80 games should be only 150/sqrt(80) = 16.8.  If you measure performance over 20 games and compare it to performance over a different 80 games, the standard deviation of the _difference_ in measured performance should be sqrt(33.5^2 + 16.8^2) = 37.5, or let's say 40 points.

But what do we see from our actual data?  The average difference between performance versus humans and performance versus bots is 98.75, or let's say 100, which is two and a half standard deviations.  If you measure something and _average_ two and a half standard deviations of error, your model is just not right.  I'm not quite sure about the validity of my statistics, but I do believe that anti-human performance and anti-bot performance should not be that divergent.  In fact, it would be more reasonable to conclude that human ratings are on average 100 points wrong depending on what style of play that human adopts versus bots.

Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on May 27th, 2005, 12:03am
Oops.  In my previous post I totally biffed the calculation of how much error we can expect when dividing up data into two sets and comparing performance ratings between those two sets.  I acted as if we made twenty measurements of performance and averaged them, i.e. something like (1340 + 1520 + ... + 1450)/20 = 1420.  But that's not how performance rating is calculated at all.  We have twenty measurements which are each a win or a loss, i.e. each 1 or 0.  The sum of these measurements will have standard deviation sqrt(n*p*q), where n is the number of trials, p is the probability of winning, and q is the probability of losing.  The performance rating will therefore have a standard error of 400/(ln(10)*sqrt(n*p*q)), a formula which one obtains by differentiating the formula for predicted score.

If we assume all the games are evenly matched, i.e. p=0.5, then in 20 games the standard error in performance rating is 78 points.  Playing 80 games instead of 20 will cut the standard error in half, but increasing p increases the error in performance rating.  

Taking my case, for example, I won 79 of my 81 games against bots, for a p of about 0.975, and a standard error in performance rating of 124 rating points over those 81 games.  Against humans I won 23 of 25 for a p of 0.92 and a standard error of 128 rating points over those 25 games.  In total the standard error is sqrt(124^2 + 128^2) = 178.  The actual error of 190 rating points is only 1.06 standard deviations, i.e. a result that extreme will happen 27% of the time by chance.  And mine was the most extreme numerically of all the differences.  :-(
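For reference, here is a quick Python sketch of the arithmetic above, using my own win counts:

Code:
import math

def perf_std_error(n, p):
    # Standard error of a performance rating over n games when the
    # expected score per game is p: 400 / (ln(10) * sqrt(n*p*q)).
    q = 1.0 - p
    return 400.0 / (math.log(10) * math.sqrt(n * p * q))

print(perf_std_error(20, 0.5))            # ~78 points for 20 evenly matched games

se_bots   = perf_std_error(81, 79 / 81)   # ~124
se_humans = perf_std_error(25, 23 / 25)   # ~128
print(math.sqrt(se_bots**2 + se_humans**2))   # ~178, std error of the difference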

I could go back and calculate how extreme all the other numbers are, but I've got a hunch that they fall within a reasonable range of what could have happened randomly in a system that is otherwise functioning perfectly well.  So I guess it is back to the drawing board as I look for statistical evidence of flawed ratings.

Title: Re: Evidence of inaccurate ratings
Post by 99of9 on May 27th, 2005, 1:39am
I'd suggest at least trying Paul.  Yours are so high in error because your p is so far away from 0.5.

Title: Re: Evidence of inaccurate ratings
Post by omar on May 27th, 2005, 3:56pm

on 05/25/05 at 19:47:29, Fritzlein wrote:
Unfortunately, the game result database doesn't save the rating uncertainty (RU) of each player for each game, so I couldn't filter my query based on both players having a low RU.  


I figured we might need that someday. I did save that info in a log file. I just added two new fields to the games database 'wratingk' and 'bratingk' that contain the RU. There are some games for which that data was lost. In those cases the value of the field is set to -1. If you download the new data files it should have these fields.


Quote:
In 2506 decisive rated over-1600 games, the ratings predicted the underdog to win 786, but they actually won 785.


Just wondering how you do the calculations to come up with 786. Do you take the average rating of the underdog and the average rating of the opponent, then calculate the probability of winning for the underdog and multiply this by 2506?

Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on May 28th, 2005, 9:17am

on 05/27/05 at 15:56:05, omar wrote:
I figured we might need that someday. I did save that info in a log file. I just added two new fields to the games database 'wratingk' and 'bratingk' that contain the RU. There are some games for which that data was lost. In those cases the value of the field is set to -1. If you download the new data files it should have these fields.


Just wondering how you do the calculations to come up with 786. Do you take the average rating of the underdog and the average rating of the opponent, then calculate the probability of winning for the underdog and multiply this by 2506?


Thanks for adding the RU to the database.

To calculate expected wins for the underdog, I used the expected wins formula on a game by game basis and then summed them up.
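In other words, something like this (a rough Python sketch; the game list is just a made-up stand-in for the real database rows):

Code:
def expected_score(rating, opp_rating):
    # Standard Elo-style expected score on a 400-point scale.
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

# Each game: (underdog rating, favorite rating, did the underdog win?)
games = [
    (1650, 1720, True),
    (1610, 1805, False),
    (1700, 1745, False),
]

predicted = sum(expected_score(under, fav) for under, fav, _ in games)
actual    = sum(1 for _, _, won in games if won)
print(predicted, actual)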

Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on May 28th, 2005, 5:46pm
OK, 99of9, I decided to complete the computation and try for more accuracy.  I am doing much of it by hand, which introduced a sign error last time I calculated my performance rating versus bots: It's 2420 not 2480.  Here is the table of corrected results and how extreme they are:
Player        Games vs H   Perf vs H   Games vs B   Perf vs B   Diff   Std. Dev.   # of stddev
PMertens          32          2018         74          1875     +143       97          1.48
Belbo             21          2000         83          1916      +84       90          0.93
99of9             61          2197         41          2118      +79      106          0.75
Adanac            20          1833         83          1885      -52       90         -0.58
Arimanator        21          1416         87          1479      -63       97         -0.65
Omar              28          1904         67          1993      -89       86         -1.04
Naveed            22          1670         82          1763      -93       86         -1.08
Fritzlein         25          2289         81          2422     -133      179         -0.74

Now if I am not mistaken, the last column should be 8 independent normally-distributed random variables with mean 0 and standard deviation 1, so if I sum their squares I can perform a chi-square test.  The chi-square statistic is 7.17, which is not suspicious at all.  To be suspiciously low it would need to be 3.49, and to be suspiciously high it would need to be over 13.36, according to the table in the appendix of my statistics book.  So indeed these stats give us no reason to believe the system is functioning badly.  Oh, well.  :-)
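If anyone wants to double-check me, the test is only a few lines; here is a Python sketch using scipy (the critical values come out the same as the table in my book):

Code:
from scipy.stats import chi2

# Number-of-standard-deviations column from the table above.
z = [1.48, 0.93, 0.75, -0.58, -0.65, -1.04, -1.08, -0.74]

stat = sum(x * x for x in z)   # ~7.17
df = len(z)                    # 8 degrees of freedom

print(stat)
print(chi2.ppf(0.10, df))      # ~3.49: suspiciously low below this
print(chi2.ppf(0.90, df))      # ~13.36: suspiciously high above this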

Title: Re: Evidence of inaccurate ratings
Post by 99of9 on Jun 9th, 2005, 10:23pm

on 05/28/05 at 17:46:09, Fritzlein wrote:
OK, 99of9, I decided to complete the computation and try for more accuracy.  I am doing much of it by hand


Really??  That is what computers are for!

Anyway, it's definitely interesting to know that we should stop talking about bot-bashing as the source of our problems.

Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on Jun 13th, 2005, 7:01pm
Taking a tip from the thread discussing time controls, I repeated the chi-square test, but this time dividing between fast and slow games rather than games versus bots and games versus humans.  This time the results are striking.  There is less than 1 in 100 probability that the results could be so extreme by chance, arguably less than 1 in 1000 probability.  This appears to be the first solid statistical evidence that the current rating system is broken.  (Of course we already had lots of anecdotal evidence.)

For this analysis I called any game at 30 seconds per move or faster a "fast" game, and the other games (which were all 45 seconds per move or slower) I called "slow" games.  I considered the most recent 100 to 110 rated games (through June 11) of all 12 players who have played over 200 rated games on the server.  I threw out Bleitner for playing no fast games in his last 100 and Haizhi for playing only one slow game in his last 100.  Naveed, PMertens, and Arimanator have played relatively few slow games in their last 100, so I first did the analysis without them, and then included them in a second pass.
Player       Fast Games   Fast Perf   Fast Std Dev   Slow Games   Slow Perf   Slow Std Dev   Perf Diff   # of stddev
Clauchau         31          1462         127.0           75         1818          43.0         +356        +2.66
Omar             46          1678          60.0           57         1942          51.2         +264        +3.35
Belbo            72          1873          41.7           33         1991          67.9         +118        +1.48
Robinson         63          2082          50.3           45         2121          67.7          +39        +0.46
Fritzlein        49          2428         125.4           53         2466         175.4          +38        +0.18
Adanac           35          1867          58.9           71         1849          48.8          -18        -0.24
99of9            38          2174          69.1           70         2079          54.2          -95        -1.08

Summing the squares of the last column gives 21.96, where a number above 18.48 occurs with probability p<0.01.  Including the less reliable data with fewer than 30 games per category, the result becomes:

Player       Fast Games   Fast Perf   Fast Std Dev   Slow Games   Slow Perf   Slow Std Dev   Perf Diff   # of stddev
Clauchau         31          1462         127.0           75         1818          43.0         +356        +2.66
Omar             46          1678          60.0           57         1942          51.2         +264        +3.35
Naveed           87          1695          39.5           15         1867          91.6         +172        +1.72
Belbo            72          1873          41.7           33         1991          67.9         +118        +1.48
Robinson         63          2082          50.3           45         2121          67.7          +39        +0.46
Fritzlein        49          2428         125.4           53         2466         175.4          +38        +0.18
Adanac           35          1867          58.9           71         1849          48.8          -18        -0.24
PMertens         92          1933          36.4           11         1894         105.2          -39        -0.35
99of9            38          2174          69.1           70         2079          54.2          -95        -1.08
Arimanator       89          1914          38.6           11         1604         105.2         -310        -2.77

Summing the squares of the last column gives 32.71, where a number above 29.59 occurs with probability p<0.001.
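Again, the arithmetic is easy to verify; here is a quick Python/scipy check of both passes:

Code:
from scipy.stats import chi2

# Number-of-standard-deviations columns from the two tables above.
first_pass  = [2.66, 3.35, 1.48, 0.46, 0.18, -0.24, -1.08]
second_pass = first_pass + [1.72, -0.35, -2.77]

for z in (first_pass, second_pass):
    stat = sum(x * x for x in z)
    # Probability of a chi-square value at least this large arising by chance.
    print(len(z), round(stat, 2), chi2.sf(stat, len(z)))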

On a side note, the data show me that Naveed is better slow than fast (at least recently) which surprises me as much as learning that he is better against bots than humans.  Otherwise the data seems very intuitive, with Belbo and Omar doing much better slow, while 99of9 does better fast.  I didn't overlap enough with Clauchau to anticipate his differential, but apparently he did abysmally against bot_speedy.

My ridiculous performance ratings come from the fact that I've only lost three of my last 102 rated games.  Add in even one more loss, for example my blitz loss to 99of9 which became a win for me due to his Internet lag, and I lose 100 points off my fast rating.

As before, I did much of the calculation by hand, so there are probably errors.  

Title: Re: Evidence of inaccurate ratings
Post by 99of9 on Jun 13th, 2005, 7:43pm

on 06/13/05 at 19:01:44, Fritzlein wrote:
Player       Fast Games   Fast Perf   Fast Std Dev   Slow Games   Slow Perf   Slow Std Dev   Perf Diff   # of stddev
Clauchau         31          1462         127.0           75         1818          43.0         +356        +2.66
Omar             46          1678          60.0           57         1942          51.2         +264        +3.35
Naveed           87          1695          39.5           15         1867          91.6         +172        +1.72
Belbo            72          1873          41.7           33         1991          67.9         +118        +1.48
Robinson         63          2082          50.3           45         2121          67.7          +39        +0.46
Fritzlein        49          2428         125.4           53         2466         175.4          +38        +0.18
Adanac           35          1867          58.9           71         1849          48.8          -18        -0.24
PMertens         92          1933          36.4           11         1894         105.2          -39        -0.35
99of9            38          2174          69.1           70         2079          54.2          -95        -1.08
Arimanator       89          1914          38.6           11         1604         105.2         -310        -2.77


This also neatly divides people into smart and dumb:  those who play most games at the time controls they are best at, and those who don't :-):

Smart
Clauchau
Omar
Fritz
PMertens
Arimanator

Dumb
Naveed
Belbo
Robinson
Adanac
99of9 :P

[edit: I'm sure Fritz noticed this but didn't want to be rude]

Title: Re: Evidence of inaccurate ratings
Post by omar on Jun 15th, 2005, 5:36pm
I'm starting to think that maybe the "error" that we sense in the ratings may be due more to our behavior than, say, a mathematical flaw in the rating formula.  In fact, we started out saying that the error should be due to our selection of opponents and time controls.  I know for sure that my rating is off a bit right now since I was playing so many fast games recently.  If I hadn't done that, my std dev number might have been smaller.  Maybe we just need some way to measure how far the ratings are from what they should be.

What about some measure of how much the ratings fluctuate about a moving average; maybe that might give us some idea of how much "error" (or fluctuation) there is in the ratings.

For example, use the previous 20 rated games to compute the average rating of a player at the start of a game and see how that compares to the player's actual rating at the start of that game.  Averaging the distance between these pairs of numbers over all the rated games (except the first 20) played by the user might give us an idea of how much fluctuation there is in a particular player's rating.  Averaging all the players' values can give us an idea of the fluctuation for the whole system.  We can maybe then compare this fluctuation against that obtained from an empirical simulation and get an idea of how much our behavior is impacting the ratings.

I wouldn't suggest doing this by hand. If we feel that this could be useful then we can pursue writing some programs to compute it.


Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on Jun 15th, 2005, 6:41pm
I'll do something along the lines you suggest: I can write enough of a program for that.  :-)
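Here is roughly what I have in mind (a rough Python sketch; the rating history at the bottom is a made-up stand-in for what I would pull from the game database):

Code:
def fluctuation(ratings, window=20):
    # Average absolute distance between a player's pre-game rating and the
    # moving average of their previous `window` pre-game ratings.
    gaps = []
    for i in range(window, len(ratings)):
        moving_avg = sum(ratings[i - window:i]) / window
        gaps.append(abs(ratings[i] - moving_avg))
    return sum(gaps) / len(gaps) if gaps else 0.0

# Hypothetical pre-game ratings for one player, one entry per rated game.
history = [1500, 1516, 1507, 1531, 1522, 1540, 1529, 1553, 1561, 1548,
           1570, 1558, 1581, 1574, 1590, 1602, 1588, 1611, 1623, 1607,
           1630, 1618, 1642, 1635, 1651]

print(fluctuation(history))
# The system-wide figure would be this value averaged over all players.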

Title: Re: Evidence of inaccurate ratings
Post by omar on Jun 19th, 2005, 5:21am
I agree with you Pat. In chess I think the ratings are published quarterly or something like that. Although that might be a bit extreme.

Only thing is that people might want some instant feedback. I know that after playing a rated game I always like to check how much my rating was affected.



Title: Re: Evidence of inaccurate ratings
Post by Fritzlein on Jun 19th, 2005, 10:33am

on 06/16/05 at 12:11:30, Arimanator wrote:
If you don't mind my opinion I believe that the ratings would be a lot more accurate, whatever formula is applied, if instead of being changed after each game they were changed instead after several, 5 for example


If I recall correctly, the fellow who designed the American Go Association ratings arrived at the same conclusion.  And I definitely agree that one problem with Arimaa ratings is that they are too volatile.  It is interesting to note, however, that the overly-volatile ratings of the Internet Chess Club (ICC) were much inferior to the less-volatile ratings of the Free Internet Chess Server (FICS), but that the players on FICS didn't like having more accurate ratings.  Bowing to popular demand, the administration of FICS arbitrarily gimmicked their more-accurate rating formula to make the rating swings larger!

Anyway, having each game account for a swing of up to 30 points (at the most stable level) is somewhat wild compared to chess ratings where the most stable level is a swing of 16 points.  And I believe it can be mathematically proven that having greater volatility leads to greater rating inaccuracy.

That said, I believe selection of time controls and selection of opponents are the biggest issues causing rating problems right now.  If the ratings moved only after every five games, or if they moved every game by a smaller amount, either way Omar would be vastly under-rated due to all the blitz games he has lost recently.  And bot-bashing to gain an inflated rating would perhaps take a little longer with reduced volatility, but it would be just as feasible.  So I'd put damping the large swings in ratings on the list of things to fix with the system, but probably not at the top of the list.

Title: Re: Evidence of inaccurate ratings
Post by omar on Jun 19th, 2005, 4:54pm
I forgot to mention that the new performance based rating system will have variable rating swings and will be much more stable than the current system.

I plan to focus on that once we iron out the WC format.


