Arimaa Forum « Evidence of inaccurate ratings »
Fritzlein
Evidence of inaccurate ratings
« on: May 25th, 2005, 7:47pm »

In another thread, Omar asked me to do some statistical analysis on the accuracy (or rather inaccuracy) of Arimaa ratings.  The first thing I wanted to check was the predicted scores of underdogs compared to their actual scores.  This is like the doctor checking your blood pressure.  High blood pressure is the symptom of so many different problems that doctors always check your blood pressure first, and if it is high, they try to figure out what is causing it.  Most things that could be wrong with a rating system will result in the underdogs scoring more wins than the ratings would predict.
 
Sure enough, in 9444 decisive rated games, the ratings predicted the underdog to win 2565, but they actually won 2670.  For the underdogs to be winning 4.1% more often than predicted may not seem like much, but in truth it suggests substantial rating inaccuracies of some type.  The reason this small percentage could show a big problem is that the system has a built in mechanism to self-correct if it is making wrong predictions, so persistent wrong predictions point to something intractably wrong that prevents the system from self-correcting properly.
 
When I restricted the query to human versus bot games, the problem seemed to become even worse.  In 7708 decisive rated hvb games, the ratings predicted the underdog to win 2067, but they actually won 2186, a surplus of 5.7%.
 
However, just as I was gearing up to estimate in terms of plus-or-minus rating points just how awful the problem of inaccurate ratings appears to be from these statistics, an obvious explanation occurred to me: It is all the fault of newcomers.  Almost everyone who ever entered the system played their first rated game against Arimaazilla or, more recently, Arimaalon.  The ratings inaccurately have these beginning players as favorites.  For those who persist in playing, there is often a reverse trend: after losing enough points to get below the weak bots, the new players are supposedly underdogs, but they quickly learn how to beat the bots, and again the underdogs do better than expected.
 
Clearly, to get useful results, I need to filter out games involving beginners.  Unfortunately, the game result database doesn't save the rating uncertainty (RU) of each player for each game, so I couldn't filter my query based on both players having a low RU.  What I did instead was to filter on both players having ratings over 1600, because most players over 1600 have a low RU.  The results shocked me.  In 2506 decisive rated over-1600 games, the ratings predicted the underdog to win 786, but they actually won 785.  In other words, apart from new people entering the system, the Arimaa ratings appear not to have high blood pressure.
 
One theory is that, because most games are human vs. bot games, the ratings are fairly accurate for human vs. bot games, but the ratings are not accurate for human vs. human games.  When I further restricted my query to human vs. human games, the same incredible result surfaced: In 319 decisive rated over-1600 hvh games, the ratings predicted the underdog to win 96, and they actually won exactly 96!
 
Now, this doesn't mean that the system is actually functioning perfectly well.  True, the most obvious symptom of inaccurate ratings does not afflict over-1600 players, but just because your blood pressure is normal doesn't mean you don't have some other disease.  I'll run some other tests and report if I turn up anything interesting.
 
 
 

99of9
Re: Evidence of inaccurate ratings
« Reply #1 on: May 26th, 2005, 12:14am »

Wow that's really interesting.  
 
I'm glad you remembered the new-player factor, because after the first couple of paragraphs I was about to reply and say "Can you cut out the first 30 games from each player?"
 
I'll continue to watch with interest, because although I don't have many ideas on how to test ratings systems, these tests interest me a great deal.
 
When you say "decisive" - do you mean you've just cut out draws, or games lost on time as well?
Fritzlein
Re: Evidence of inaccurate ratings
« Reply #2 on: May 26th, 2005, 8:20am »

on May 26th, 2005, 12:14am, 99of9 wrote:
When you say "decisive" - do you mean you've just cut out draws, or games lost on time as well?

 
I mean the result was "b" or "w" not "d" or "a".  I didn't filter based on the method of victory.

Fritzlein
Re: Evidence of inaccurate ratings
« Reply #3 on: May 26th, 2005, 4:34pm »

Well, I did another test.  Because I believe that botbashing generally makes ratings inaccurate, I decided to compare the performance ratings of various players divided into games against bots and games against humans.  I suspected that people who beat bots the same way over and over (like me) would have lower performance against humans whereas people who play experimentally against bots and don't care about their ratings (like PMertens) would have higher performance against humans.  In fact, the data more or less bear this out:
 
 
Player     | Games vs Humans | Performance vs Humans | Games vs Bots | Performance vs Bots | Difference
PMertens   | 32              | 2020                  | 74            | 1880                | +140
Belbo      | 21              | 2000                  | 83            | 1920                | +80
99of9      | 61              | 2200                  | 41            | 2120                | +80
Arimanator | 21              | 1420                  | 87            | 1480                | -60
Adanac     | 20              | 1830                  | 83            | 1890                | -60
Naveed     | 22              | 1670                  | 82            | 1760                | -90
Omar       | 28              | 1900                  | 67            | 1990                | -90
Fritzlein  | 25              | 2290                  | 81            | 2480                | -190

 
Notes on methodology: My database is a month out of date, so it doesn't include the recent spate of blitz games or Arimanator's abrupt ratings rise.  (Also, if the blitz games were included, Omar wouldn't look so good versus bots! Smiley)  I took all players with over 200 rated games, and looked at their most recent 100 rated games or so.  If they had fewer than 20 rated games against humans within those 100 (haizhi, bleitner, clauchau), I threw them out.  For the rest, I calculated to within 10 points the rating they would have had to have in order to be predicted to win as many games as they actually won.
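(For anyone who wants to redo this without the hand arithmetic, here is a rough sketch of that search in Python.  It is only an illustration, not the script I actually used: the function names are mine, and I am assuming the usual 400-point logistic expectation formula.)

Code:
def expected_score(rating, opp_rating):
    # Predicted score from the standard logistic curve with a 400-point scale
    # (assumed; I have not checked this against the server code).
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def performance_rating(opp_ratings, wins, step=10):
    # Step a trial rating upward in `step`-point increments until the predicted
    # number of wins against these opponents catches up with the actual number.
    # Capped so that a perfect score does not loop forever.
    rating = min(opp_ratings) - 1000
    ceiling = max(opp_ratings) + 2000
    while rating < ceiling and sum(expected_score(rating, r) for r in opp_ratings) < wins:
        rating += step
    return rating

# Hypothetical example: 15 wins out of 20 games against 1800-rated opposition.
# print(performance_rating([1800] * 20, 15))    # about 2000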
 
One result that surprised me was Naveed doing better against bots than humans, but on further reflection his recent win against me isn't included, and his poor result in the postal tourney (2 of 9 versus humans) explains most of his low anti-human rating.  Perhaps 99of9 is right when he says Naveed is just erratic, and perhaps time controls are a factor here too.  If there were enough good data I would impose restrictions on the time controls as well.
 
[EDIT: What I wrote next is bunkus; I correct it in the post just after.]
 
At any rate, we can now do some math on the accuracy of the ratings.  If the ratings were reliable, then we should be able to split players' games into groups however we like and get roughly the same performance rating in each group.  They won't be exactly the same, because a player's performance from game to game will vary with a standard deviation of about plus or minus 150 points according to the rating model in use.  However, measured over 20 games the standard deviation of performance should be only about 150/sqrt(20) = 33.5 and over 80 games should be only 150/sqrt(80) = 16.8.  If you measure performance over 20 games and compare it to performance over a different 80 games, the standard deviation of the _difference_ in measured performance should be sqrt(33.5^2 + 16.8^2) = 37.5, or let's say 40 points.
 
But what do we see from our actual data?  The average difference between performance versus humans and performance versus bots is 98.75, or let's say 100, which is two and a half standard deviations.  If you measure something and _average_ two and a half standard deviations of error, your model is just not right.  I'm not quite sure about the validity of my statistics, but I do believe that anti-human performance and anti-bot performance should not be that divergent.  In fact, it would be more reasonable to conclude that human ratings are on average 100 points wrong depending on what style of play that human adopts versus bots.
« Last Edit: May 27th, 2005, 12:10am by Fritzlein »

Fritzlein
Re: Evidence of inaccurate ratings
« Reply #4 on: May 27th, 2005, 12:03am »

Oops.  In my previous post I totally biffed the calculation of how much error we can expect when dividing up data into two sets and comparing performance ratings between those two sets.  I acted as if we made twenty measurements of performance and averaged them, i.e. something like (1340 + 1520 + ... + 1450)/20 = 1420.  But that's not how performance rating is calculated at all.  We have twenty measurements which are each a win or a loss, i.e. each 1 or 0.  The sum of these measurements will have standard deviation sqrt(n*p*q) where n is the number of trials, p is the probability of winning, and q is the probability of losing.  The performance rating will therefore have a standard error of 400/(ln(10)*sqrt(n*p*q)), a formula which one obtains by differentiating the formula for predicted score.
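(Spelling out that differentiation in my own notation, for anyone following along: the predicted score per game is

$$E(R) = \frac{1}{1 + 10^{(R_{opp} - R)/400}}, \qquad \frac{dE}{dR} = \frac{\ln 10}{400}\, E\,(1 - E),$$

so over n roughly even games the total predicted score changes at a rate of about $n\,\frac{\ln 10}{400}\,p\,q$ per rating point, while the actual score has standard deviation $\sqrt{n p q}$, giving

$$\sigma_{perf} \approx \frac{\sqrt{n p q}}{n\,\frac{\ln 10}{400}\,p\,q} = \frac{400}{\ln 10\,\sqrt{n p q}}.)$$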
 
If we assume all the games are evenly matched, i.e. p=0.5, then in 20 games the standard error in performance rating is 78 points.  Playing 80 games instead of 20 will cut the standard error in half, but increasing p increases the error in performance rating.  
 
Taking my case, for example, I won 79 of my 81 games against bots, for a p of about 0.975, and a standard error in performance rating of 124 rating points over those 81 games.  Against humans I won 23 of 25 for a p of 0.92 and a standard error of 128 rating points over those 25 games.  In total the standard error is sqrt(124^2 + 128^2) = 178.  The actual error of 190 rating points is only 1.06 standard deviations, i.e. a result that extreme will happen 27% of the time by chance.  And mine was the most extreme numerically of all the differences.  Sad
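(A couple of lines of Python make these numbers easy to check; this is just a sketch with my own function name, but the 78, 124, 128, and 178 above all come out of the same formula.)

Code:
import math

def perf_std_error(n, p):
    # Standard error of a performance rating measured over n games,
    # each won with probability p: 400 / (ln(10) * sqrt(n*p*q)).
    return 400.0 / (math.log(10) * math.sqrt(n * p * (1.0 - p)))

print(round(perf_std_error(20, 0.5)))                    # about 78 for 20 even games
bots = perf_std_error(81, 79 / 81)                       # about 124 (79 wins of 81)
humans = perf_std_error(25, 23 / 25)                     # about 128 (23 wins of 25)
print(round(bots), round(humans), round(math.hypot(bots, humans)))   # combined about 178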
 
I could go back and calculate how extreme all the other numbers are, but I've got a hunch that they fall within a reasonable range of what could have happened randomly in a system that is otherwise functioning perfectly well.  So I guess it is back to the drawing board as I look for statistical evidence of flawed ratings.
« Last Edit: May 27th, 2005, 12:12am by Fritzlein »

99of9
Re: Evidence of inaccurate ratings
« Reply #5 on: May 27th, 2005, 1:39am »

I'd suggest at least trying Paul.  Yours are so high in error because your p is so far away from 0.5.
omar
Re: Evidence of inaccurate ratings
« Reply #6 on: May 27th, 2005, 3:56pm »

on May 25th, 2005, 7:47pm, Fritzlein wrote:

Unfortunately, the game result database doesn't save the rating uncertainty (RU) of each player for each game, so I couldn't filter my query based on both players having a low RU.  

 
I figured we might need that someday. I did save that info in a log file. I just added two new fields to the games database 'wratingk' and 'bratingk' that contain the RU. There are some games for which that data was lost. In those cases the value of the field is set to -1. If you download the new data files it should have these fields.
 
Quote:

In 2506 decisive rated over-1600 games, the ratings predicted the underdog to win 786, but they actually won 785.  

 
Just wondering how you do the calculations to come up with 786. Do you take the average rating of the underdog and the average rating of the opponent, then calculate the probability of winning for the underdog and multiply this by 2506?
Fritzlein
Re: Evidence of inaccurate ratings
« Reply #7 on: May 28th, 2005, 9:17am »

on May 27th, 2005, 3:56pm, omar wrote:

 
I figured we might need that someday. I did save that info in a log file. I just added two new fields to the games database 'wratingk' and 'bratingk' that contain the RU. There are some games for which that data was lost. In those cases the value of the field is set to -1. If you download the new data files it should have these fields.
 
 
Just wondering how you do the calculations to come up with 786. Do you take the average rating of the underdog and the average rating of the opponent, then calculate the probability of winning for the underdog and multiply this by 2506?

 
Thanks for adding the RU to the database.
 
To calculate expected wins for the underdog, I used the expected wins formula on a game by game basis and then summed them up.
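(Roughly like this, if anyone wants to reproduce it.  This is only a sketch: the variable names are made up, and I am assuming the standard 400-point logistic expectation formula.)

Code:
def expected_score(rating, opp_rating):
    # Predicted score from the standard logistic curve with a 400-point scale (assumed).
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def underdog_totals(games):
    # games: a list of (winner_rating, loser_rating) pairs from decisive rated games.
    # Sum the underdog's expected score game by game and count actual underdog wins.
    expected = 0.0
    actual = 0
    for winner_r, loser_r in games:
        underdog, favorite = min(winner_r, loser_r), max(winner_r, loser_r)
        expected += expected_score(underdog, favorite)
        if winner_r < loser_r:    # the lower-rated player won
            actual += 1
    return expected, actual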

Fritzlein
Re: Evidence of inaccurate ratings
« Reply #8 on: May 28th, 2005, 5:46pm »

OK, 99of9, I decided to complete the computation and try for more accuracy.  I am doing much of it by hand, which introduced a sign error last time I calculated my performance rating versus bots: It's 2420 not 2480.  Here is the table of corrected results and how extreme they are:
 
Player     | Games vs H | Perf vs H | Games vs B | Perf vs B | Diff | Std. Dev. | # of std dev
PMertens   | 32         | 2018      | 74         | 1875      | +143 | 97        | 1.48
Belbo      | 21         | 2000      | 83         | 1916      | +84  | 90        | 0.93
99of9      | 61         | 2197      | 41         | 2118      | +79  | 106       | 0.75
Adanac     | 20         | 1833      | 83         | 1885      | -52  | 90        | -0.58
Arimanator | 21         | 1416      | 87         | 1479      | -63  | 97        | -0.65
Omar       | 28         | 1904      | 67         | 1993      | -89  | 86        | -1.04
Naveed     | 22         | 1670      | 82         | 1763      | -93  | 86        | -1.08
Fritzlein  | 25         | 2289      | 81         | 2422      | -133 | 179       | -0.74

 
Now if I am not mistaken, the last column should be 8 independent normally-distributed random variables with mean 0 and standard deviation 1, so if I sum their squares I can perform a chi-square test.  The chi-square statistic is 7.17, which is not suspicious at all.  To be suspiciously low it would need to be 3.49, and to be suspiciously high it would need to be over 13.36, according to the table in the appendix of my statistics book.  So indeed these stats give us no reason to believe the system is functioning badly.  Oh, well.  Smiley
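(The same test in a few lines of Python, for anyone who wants to redo the arithmetic; I am pulling the cutoffs from scipy rather than the appendix table, using the same 10th and 90th percentiles quoted above.)

Code:
from scipy.stats import chi2

# The last column above: each player's performance difference in standard deviations.
z = [1.48, 0.93, 0.75, -0.58, -0.65, -1.04, -1.08, -0.74]

stat = sum(x * x for x in z)        # chi-square statistic, about 7.17
low = chi2.ppf(0.10, df=len(z))     # about 3.49: suspiciously low below this
high = chi2.ppf(0.90, df=len(z))    # about 13.36: suspiciously high above this
print(round(stat, 2), round(low, 2), round(high, 2))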

99of9
Re: Evidence of inaccurate ratings
« Reply #9 on: Jun 9th, 2005, 10:23pm »

on May 28th, 2005, 5:46pm, Fritzlein wrote:
OK, 99of9, I decided to complete the computation and try for more accuracy.  I am doing much of it by hand

 
Really??  That is what computers are for!
 
Anyway, it's definitely interesting to know that we should stop talking about bot-bashing as the source of our problems.
Fritzlein
Re: Evidence of inaccurate ratings
« Reply #10 on: Jun 13th, 2005, 7:01pm »

Taking a tip from the thread discussing time controls, I repeated the chi-square test, but this time dividing between fast and slow games rather than games versus bots and games versus humans.  This time the results are striking.  There is less than 1 in 100 probability that the results could be so extreme by chance, arguably less than 1 in 1000 probability.  This appears to be the first solid statistical evidence that the current rating system is broken.  (Of course we already had lots of anecdotal evidence.)
 
For this analysis I called any game at 30 seconds per move or faster a "fast" game, and the other games (which were all 45 seconds per move or slower) I called "slow" games.  I considered the most recent 100 to 110 rated games (through June 11) of all 12 players who have played over 200 rated games on the server.  I threw out Bleitner for playing no fast games in his last 100 and Haizhi for playing only one slow game in his last 100.  Naveed, PMertens, and Arimanator have played relatively few slow games in their last 100, so I first did the analysis without them, and then included them in a second pass.
 
Player    | Fast Games | Fast Perf | Fast Std Dev | Slow Games | Slow Perf | Slow Std Dev | Perf Diff | # of std dev
Clauchau  | 31         | 1462      | 127.0        | 75         | 1818      | 43.0         | +356      | +2.66
Omar      | 46         | 1678      | 60.0         | 57         | 1942      | 51.2         | +264      | +3.35
Belbo     | 72         | 1873      | 41.7         | 33         | 1991      | 67.9         | +118      | +1.48
Robinson  | 63         | 2082      | 50.3         | 45         | 2121      | 67.7         | +39       | +0.46
Fritzlein | 49         | 2428      | 125.4        | 53         | 2466      | 175.4        | +38       | +0.18
Adanac    | 35         | 1867      | 58.9         | 71         | 1849      | 48.8         | -18       | -0.24
99of9     | 38         | 2174      | 69.1         | 70         | 2079      | 54.2         | -95       | -1.08

 
Summing the squares of the last column gives 21.96, where a number above 18.48 occurs with probability p<0.01.  Including the less reliable data with fewer than 30 games per category, the result becomes:
 
 
Player     | Fast Games | Fast Perf | Fast Std Dev | Slow Games | Slow Perf | Slow Std Dev | Perf Diff | # of std dev
Clauchau   | 31         | 1462      | 127.0        | 75         | 1818      | 43.0         | +356      | +2.66
Omar       | 46         | 1678      | 60.0         | 57         | 1942      | 51.2         | +264      | +3.35
Naveed     | 87         | 1695      | 39.5         | 15         | 1867      | 91.6         | +172      | +1.72
Belbo      | 72         | 1873      | 41.7         | 33         | 1991      | 67.9         | +118      | +1.48
Robinson   | 63         | 2082      | 50.3         | 45         | 2121      | 67.7         | +39       | +0.46
Fritzlein  | 49         | 2428      | 125.4        | 53         | 2466      | 175.4        | +38       | +0.18
Adanac     | 35         | 1867      | 58.9         | 71         | 1849      | 48.8         | -18       | -0.24
PMertens   | 92         | 1933      | 36.4         | 11         | 1894      | 105.2        | -39       | -0.35
99of9      | 38         | 2174      | 69.1         | 70         | 2079      | 54.2         | -95       | -1.08
Arimanator | 89         | 1914      | 38.6         | 11         | 1604      | 105.2        | -310      | -2.77

 
Summing the squares of the last column gives 32.71, where a number above 29.59 occurs with probability p<0.001.
 
On a side note, the data show me that Naveed is better slow than fast (at least recently) which surprises me as much as learning that he is better against bots than humans.  Otherwise the data seems very intuitive, with Belbo and Omar doing much better slow, while 99of9 does better fast.  I didn't overlap enough with Clauchau to anticipate his differential, but apparently he did abysmally against bot_speedy.
 
My ridiculous performance ratings come from the fact that I've only lost three of my last 102 rated games.  Add in even one more loss, for example my blitz loss to 99of9 which became a win for me due to his Internet lag, and I lose 100 points off my fast rating.
 
As before, I did much of the calculation by hand, so there are probably errors.
« Last Edit: Jun 13th, 2005, 7:11pm by Fritzlein »

99of9
Re: Evidence of inaccurate ratings
« Reply #11 on: Jun 13th, 2005, 7:43pm »

on Jun 13th, 2005, 7:01pm, Fritzlein wrote:

[fast vs. slow performance table quoted from the previous post]


 
This also neatly divides people into smart and dumb:  those who play most games at the time controls they are best at, and those who don't Smiley:
 
Smart
Clauchau
Omar
Fritz
PMertens
Arimanator
 
Dumb
Naveed
Belbo
Robinson
Adanac
99of9 Tongue
 
[edit: I'm sure Fritz noticed this but didn't want to be rude]
« Last Edit: Jun 13th, 2005, 7:47pm by 99of9 »
omar
Re: Evidence of inaccurate ratings
« Reply #12 on: Jun 15th, 2005, 5:36pm »

I'm starting to think that maybe the "error" that we sense in the ratings may be due more to our behavior than, say, a mathematical flaw with the rating formula. In fact we started out saying that the error should be due to our selection of opponents and time controls. I know for sure that my rating is off a bit right now since I was playing so many fast games recently. If I hadn't done that my std dev number might have been smaller. Maybe we just need some way to measure how far the ratings are from what they should be.
 
What about some measure of how much the ratings fluctuate about a moving average; maybe that might give us some idea of how much "error" (or fluctuation) there is in the ratings.
 
For example, use the previous 20 rated games to compute the average rating of a player at the start of a game and see how that compares to the player's actual rating at the start of that game. An average of the distance between pairs of these numbers over all the rated games (except the first 20) played by the user might give us an idea of how much fluctuation there is in a particular player's rating. Averaging all the players' values can give us an idea of the fluctuation for the whole system. We can maybe then compare this fluctuation against that obtained from an empirical simulation and get an idea of how much our behavior is impacting the ratings.
 
I wouldn't suggest doing this by hand. If we feel that this could be useful then we can pursue writing some programs to compute it.
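Roughly what I have in mind, as a sketch (the data layout is made up, and the 20-game window is just the example above):

Code:
def rating_fluctuation(pre_game_ratings, window=20):
    # pre_game_ratings: one player's rating at the start of each of their rated
    # games, in order.  For every game after the first `window`, compare the actual
    # rating to the average over the previous `window` games, then average the gaps.
    gaps = []
    for i in range(window, len(pre_game_ratings)):
        moving_avg = sum(pre_game_ratings[i - window:i]) / window
        gaps.append(abs(pre_game_ratings[i] - moving_avg))
    return sum(gaps) / len(gaps) if gaps else 0.0

# Averaging rating_fluctuation() over all players gives a system-wide number,
# which could then be compared against the same number from a simulation.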
 
« Last Edit: Jun 15th, 2005, 5:50pm by omar »
Fritzlein
Re: Evidence of inaccurate ratings
« Reply #13 on: Jun 15th, 2005, 6:41pm »

I'll do something along the lines you suggest: I can write enough of a program for that.  Smiley

omar
Re: Evidence of inaccurate ratings
« Reply #14 on: Jun 19th, 2005, 5:21am »

I agree with you, Pat. In chess I think the ratings are published quarterly or something like that. Although that might be a bit extreme.
 
Only thing is that people might want some instant feedback. I know that after playing a rated game I always like to check how much my rating was affected.  
 
 
« Last Edit: Jun 19th, 2005, 5:21am by omar »