Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
(Message started by: aaaa on May 26th, 2006, 1:55pm)

Title: Experimental new rating system
Post by aaaa on May 26th, 2006, 1:55pm
Mark Glickman, known for his Glicko rating system, which extends Elo to take time into account, has designed a successor system that also accounts for the volatility of a player's strength (meaning that one can forgo the fudge of imposing a minimum deviation). It is described here (http://math.bu.edu/people/mg/glicko/glicko2.doc/example.html).

I've adapted Glicko-2 for use in a "real-time" context and took a stab at optimizing the various parameters based on all the rated games in the database. Unfortunately, it was not originally designed to handle the ultra-short "rating periods" of one second I gave it, which makes it prone to hanging in the iteration phase. Nevertheless, I didn't want people here to miss out on it, so below is a list of the 50 top-rated players according to a particularly customized version of it (apologies for the bad layout, as I couldn't get the data working with the forum table markup). I'm particularly interested in getting queries of a statistical nature about the system, as well as hearing exactly what properties are desired of it here.

player  rating (old style)  rating deviation (old style)  volatility
Fritzlein 5.6630497 (2483.772535) 0.8068123002 (140.1576578) 0.0006376507628
99of9 4.910623754 (2353.062755) 0.832506 (144.6211108) 0.0006516585594
robinson 4.62855297 (2304.062039) 0.7831770235 (136.0517895) 0.0006228492358
Adanac 4.610491197 (2300.924388) 0.7379585114 (128.1965291) 0.00056411572
PMertens 4.360161883 (2257.43773) 0.832506 (144.6211108) 0.0008407439315
Ryan_Cable 4.326954398 (2251.668999) 0.832506 (144.6211108) 0.0004918684095
Belbo 4.27533983 (2242.702629) 0.7576357981 (131.6148241) 0.0006193388697
mouse 3.962690082 (2188.389803) 0.832506 (144.6211108) 0.0005526332836
Arimanator 3.78416112 (2157.376145) 0.832506 (144.6211108) 0.0008069304669
RonWeasley 3.697059358 (2142.245018) 0.832506 (144.6211108) 0.0005989948555
chessandgo 3.671875516 (2137.870137) 0.5040354009 (87.55992097) 0.0007090357629
omar 3.546825706 (2116.146759) 0.832506 (144.6211108) 0.0006258232938
naveed 3.451697291 (2099.62126) 0.832506 (144.6211108) 0.0007640425682
blue22 3.382977003 (2087.683322) 0.832506 (144.6211108) 0.0005354651612
bot_Bomb2005CC 3.170315207 (2050.740183) 0.7609260811 (132.1864048) 0.0005321464476
bot_Bomb2005Fast 3.064107257 (2032.289972) 0.8262308837 (143.5310114) 0.000611501966
bot_Bomb2005Blitz 3.053605699 (2030.465664) 0.6707376748 (116.5190732) 0.0009073338997
OLTI 3.03204056 (2026.719416) 0.832506 (144.6211108) 0.0005484800951
bot_Bomb2005P2 2.822907443 (1990.389271) 0.4867876375 (84.56367745) 0.0004840308518
thorin 2.767536776 (1980.7704) 0.832506 (144.6211108) 0.0005788629022
omarFast 2.726652212 (1973.668024) 0.832506 (144.6211108) 0.0006681596461
bot_speedy 2.682962807 (1966.078396) 0.832506 (144.6211108) 0.0007288659695
bleitner 2.610592744 (1953.506428) 0.832506 (144.6211108) 0.0005072821952
jdb 2.610499995 (1953.490316) 0.832506 (144.6211108) 0.0006070601943
bot_Clueless2005Fast 2.58310922 (1948.732051) 0.6668649056 (115.8463043) 0.0006654729265
megamau 2.541955565 (1941.582928) 0.832506 (144.6211108) 0.0006978433977
bot_lightning 2.473854617 (1929.752582) 0.832506 (144.6211108) 0.0006763171832
Swynndla 2.422299879 (1920.796606) 0.7948650666 (138.0822107) 0.0006315119355
frostlad 2.419039538 (1920.230227) 0.8071741155 (140.2205116) 0.0006132318648
BlackKnight 2.347565986 (1907.813998) 0.832506 (144.6211108) 0.0006623329595
bot_GnoBot2005Fast 2.303539877 (1900.16588) 0.7849751855 (136.3641623) 0.000675043096
bot_Clueless2005Blitz 2.2345132 (1888.174717) 0.7551311049 (131.1797143) 0.0007258793672
bot_Clueless2005P2 2.212829489 (1884.407871) 0.6895930611 (119.7945895) 0.0006311812553
bot_Clueless2005CC 2.168773741 (1876.754603) 0.8308199736 (144.328218) 0.0006241990032
bot_Arimaanator 2.090386741 (1863.137386) 0.8148788129 (141.5589546) 0.0004081767749
bot_Clueless2006P2 2.075735855 (1860.592266) 0.7943007282 (137.984175) 0.0007195330871
kamikazeking 2.011747429 (1849.476337) 0.7556039343 (131.2618531) 0.0005792767201
ytri 1.972060074 (1842.581938) 0.832506 (144.6211108) 0.0005580832465
filerank 1.970310098 (1842.277936) 0.832506 (144.6211108) 0.000569107061
haizhi 1.894086955 (1829.036619) 0.832506 (144.6211108) 0.0008102594244
Aamir 1.856510045 (1822.508841) 0.832506 (144.6211108) 0.0006060001632
bot_haizhi 1.702288469 (1795.717808) 0.832506 (144.6211108) 0.0006619388723
bot_Bomb2004CC 1.674421387 (1790.8768) 0.832506 (144.6211108) 0.000606956265
clauchau 1.633644449 (1783.79312) 0.832506 (144.6211108) 0.0005567039486
grey_0x2A 1.593474756 (1776.814929) 0.832506 (144.6211108) 0.000601434371
deselby 1.591926598 (1776.545986) 0.832506 (144.6211108) 0.0006369085893
CeeJay 1.578303073 (1774.179338) 0.832506 (144.6211108) 0.0007813484887
bot_Aamira2006Fast 1.574885954 (1773.585723) 0.7628650488 (132.523238) 0.0006310808374
bot_Clueless2006Fast 1.562548955 (1771.442567) 0.832506 (144.6211108) 0.0006612313097
bot_Loc2005Blitz 1.538297761 (1767.229703) 0.7531418141 (130.834139) 0.0006051832325
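
The "old style" figures in parentheses are consistent with the usual Glicko-2 scale conversion from Glickman's description (multiply by 173.7178, and add 1500 for the rating). A minimal sketch, assuming that is the conversion used here; the function names are made up for illustration:

```python
# Scale constant from Glickman's Glicko-2 description (assumed to be what the table uses).
GLICKO2_SCALE = 173.7178


def to_old_style_rating(mu, base=1500.0):
    """Convert a Glicko-2 scale rating to the familiar Elo-like scale."""
    return GLICKO2_SCALE * mu + base


def to_old_style_deviation(phi):
    """Convert a Glicko-2 scale rating deviation to the Elo-like scale."""
    return GLICKO2_SCALE * phi


# First row of the table above (Fritzlein).
print(round(to_old_style_rating(5.6630497), 2))   # 2483.77
print(round(to_old_style_deviation(0.832506), 2))  # 144.62
```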

Title: Re: Experimental new rating system
Post by Fritzlein on May 26th, 2006, 3:45pm
Hey, this rocks.  I have the highest respect for Mark Glickman, and it's cool to see what numbers are produced by an implementation of the Glicko system.  (Actually, I confess I liked Glicko but never read up on Glicko-2; an omission I shall shortly remedy.)

I'm curious why blue22 is ranked so much lower and Belbo so much higher in your ratings than the official ratings.  Maybe it's because Glicko doesn't like to move the ratings around as much?  Belbo established a very high rating with tons of games, and has since dropped off his peak in the official ratings, but maybe Glicko considered him extremely firmly established and didn't let his rating move down as much.

The rating deviation of 84 for Bomb2005P2 seems suspicious to me.  That bot has played 812 games, but only 447 of those were rated.  Why should it have a deviation so much lower than mine, when I've played 773 rated games?

Anyway, the issue I am most concerned about is not an issue that Glickman has addressed at all, to the best of my knowledge.  What troubles me is the non-transitivity of the ratings.  You can see the non-transitivity in action all the time in Arimaa.  Sometimes a newcomer will get stuck on BombP1 on the ladder, and lose thirty times in a row, driving their rating down to, say, 1200.  Meanwhile a newcomer who figures out a technique for beating BombP1 might win thirty in a row and pump their rating to 1800.  But the gap between the two humans is not 600 points.  They are each properly rated relative to BombP1, but improperly rated relative to each other.  That is to say, the ratings are not transitive.

I would love there to be some mechanism whereby a ton of games against a single opponent would have a reduced impact on one's rating, in order to mitigate the effects of non-transitivity.  In my mind it seems roughly correct to weight games against a single opponent by the square root of the number of games so that, for example, 25 games against one opponent would have the same impact as one game each against five different opponents.
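
One way to read the square-root idea is that each of n games against the same opponent gets weight 1/sqrt(n), so the n games together count as sqrt(n) games. A minimal sketch under that reading; the function names are hypothetical, and it uses the final game counts, which a real-time system would not know in advance:

```python
import math
from collections import Counter


def per_game_weights(opponents):
    """Weight each game by 1/sqrt(n), where n is the number of games vs that opponent."""
    counts = Counter(opponents)
    return [1.0 / math.sqrt(counts[opp]) for opp in opponents]


def effective_games(opponents):
    """Total weight of a list of games, given the opponent faced in each one."""
    return sum(per_game_weights(opponents))


# 25 games against one opponent carry about the same total weight
# as one game each against five different opponents.
print(round(effective_games(["BombP1"] * 25), 6))
print(round(effective_games(["a", "b", "c", "d", "e"]), 6))
```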

But I recognize that reducing the weight of certain games is a kludge, and I wish that I could think of a more elegant way to deal with non-transitivity.  I'd love to hear alternative suggestions.  Non-transitivity is such a huge problem, though, that I don't think it can be ignored.

Title: Re: Experimental new rating system
Post by aaaa on May 26th, 2006, 4:50pm

on 05/26/06 at 15:45:46, Fritzlein wrote:
Hey, this rocks.  I have the highest respect for Mark Glickman, and it's cool to see what numbers are produced by an implementation of the Glicko system.  (Actually, I confess I liked Glicko but never read up on Glicko-2; an omission I shall shortly remedy.)

Once again I would like to point out that Glicko-2 was not originally intended to be applied on a game-by-game basis. Glicko was modified to do so for the Free Internet Chess Server with Glickman's knowledge and I was curious enough to find out if the same was possible for Glicko-2.


on 05/26/06 at 15:45:46, Fritzlein wrote:
I'm curious why blue22 is ranked so much lower and Belbo so much higher in your ratings than the official ratings.  Maybe it's because Glicko doesn't like to move the ratings around as much?  Belbo established a very high rating with tons of games, and has since dropped off his peak in the official ratings, but maybe Glicko considered him extremely firmly established and didn't let his rating move down as much.

If you look at the fifth column, you can see that Belbo has been given a higher volatility than blue22. For some reason, the system thinks Belbo's performance is less consistent than blue22's (0.0006193388697 vs 0.0005354651612).


on 05/26/06 at 15:45:46, Fritzlein wrote:
The rating deviation of 84 for Bomb2005P2 seems suspicious to me.  That bot has played 812 games, but only 447 of those were rated.  Why should it have a deviation so much lower than mine, when I've played 773 rated games?

Probably due to the large number of bot-bot matches taken into account, maximizing the prediction power of the system has resulted in the rating deviation growing very fast if a player doesn't play in a while. Depending on one's volatility, it will take only about 20 days before one's rating deviation becomes the maximum again. I've already been experimenting with excluding bot-bot matches from consideration.
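
As a rough check on the 20-day figure, here is a sketch that assumes the deviation grows by the standard Glicko-2 inactivity step (new deviation = sqrt(deviation^2 + volatility^2)) once per one-second rating period, with the volatility held constant. The customized version may well differ, and the function name and defaults are made up for illustration:

```python
def days_until_max_deviation(volatility, max_deviation,
                             start_deviation=0.0, period_seconds=1.0):
    """Days of inactivity before the deviation climbs back to the maximum.

    Assumes deviation^2 grows by volatility^2 every rating period, so the
    number of periods needed is (max^2 - start^2) / volatility^2.
    Starting from 0 gives an upper bound.
    """
    periods = (max_deviation ** 2 - start_deviation ** 2) / volatility ** 2
    return periods * period_seconds / 86400.0


# Belbo's volatility and the maximum deviation from the first table.
print(days_until_max_deviation(0.0006193388697, 0.832506))
```

Under these assumptions the result comes out at about three weeks, which is in line with the figure quoted above.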

Title: Re: Experimental new rating system
Post by Ryan_Cable on May 26th, 2006, 10:20pm
This does support my belief that our ratings are currently too compressed.  Other than that, I'm not clear on what the advantage of this system is over our current system.

Our current system is fairly easy to understand, and anyone can calculate the possible rating changes that would result from playing a given opponent.  I would not want to give that up unless there is a substantial improvement in rating accuracy.

Title: Re: Experimental new rating system
Post by aaaa on May 27th, 2006, 9:57am
Here's the result again after the choice of parameters has been optimized for rated games including at least one human. Tell me if this one is more sane.

player rating (old style) rating deviation (old style) volatility
Fritzlein 5.482469551 (2452.402549) 0.8047089398 (139.7922667) 0.000538855377
99of9 4.73692475 (2322.888146) 0.8288227 (143.981256) 0.0005558081298
robinson 4.435887799 (2270.59267) 0.7746629858 (134.5727496) 0.0005238959024
Adanac 4.401616739 (2264.639176) 0.7049974879 (122.4706126) 0.0004748660898
PMertens 4.212564795 (2231.797489) 0.8288227 (143.981256) 0.0006714964567
Ryan_Cable 4.147041625 (2220.414948) 0.8288227 (143.981256) 0.0004148334762
Belbo 4.058155186 (2204.973791) 0.7336598637 (127.4497775) 0.0005050846755
mouse 3.839142456 (2166.927381) 0.8288227 (143.981256) 0.0004747277447
RonWeasley 3.555657933 (2117.681074) 0.8288227 (143.981256) 0.0005129970293
Arimanator 3.55381375 (2117.360706) 0.8288227 (143.981256) 0.0006905204892
omar 3.399174901 (2090.497186) 0.8288227 (143.981256) 0.0005327536732
chessandgo 3.381264438 (2087.385819) 0.4764257499 (82.76363314) 0.0006149328405
naveed 3.282873511 (2070.293564) 0.8288227 (143.981256) 0.0006009688416
blue22 3.212482055 (2058.065315) 0.7934868976 (137.8427982) 0.0004544315336
bot_Bomb2005CC 2.970377488 (2016.007442) 0.6806065302 (118.2334691) 0.000455604352
OLTI 2.94866222 (2012.235114) 0.8288227 (143.981256) 0.0004670586044
bot_Bomb2005Blitz 2.838656284 (1993.125125) 0.6355597776 (110.4080463) 0.0007068470581
bot_Bomb2005Fast 2.82836776 (1991.337825) 0.8188877809 (142.2553838) 0.0005046538155
bot_Bomb2005P2 2.677105717 (1965.060915) 0.4418521777 (76.75758823) 0.0004078946749
omarFast 2.648254447 (1960.048936) 0.8288227 (143.981256) 0.0005759785643
thorin 2.530981686 (1939.67657) 0.8288227 (143.981256) 0.0004996524011
bleitner 2.474353922 (1929.83932) 0.8288227 (143.981256) 0.0004316956614
bot_speedy 2.434224123 (1922.868059) 0.8288227 (143.981256) 0.0005830126789
jdb 2.427799948 (1921.752066) 0.8288227 (143.981256) 0.0005177591978
megamau 2.420580392 (1920.4979) 0.8288227 (143.981256) 0.0006022504608
bot_Clueless2005Fast 2.388905996 (1914.995494) 0.6231364636 (108.2498956) 0.0005694958969
bot_lightning 2.342006325 (1906.848186) 0.8288227 (143.981256) 0.0005799969339
frostlad 2.256872674 (1892.058956) 0.7260031127 (126.1196635) 0.0005255536504
Swynndla 2.210853008 (1884.064521) 0.7130838884 (123.8753643) 0.0005439431955
BlackKnight 2.1852398 (1879.615051) 0.8288227 (143.981256) 0.0005633310056
bot_GnoBot2005Fast 2.095862689 (1864.088655) 0.712771455 (123.8210891) 0.0005831434752
bot_Clueless2005P2 2.072370715 (1860.007681) 0.6243452409 (108.4598817) 0.0005399026625
bot_Clueless2005Blitz 2.06877531 (1859.383096) 0.7454407812 (129.4963325) 0.0006259613599
bot_Clueless2005CC 2.025541811 (1851.872667) 0.7638517253 (132.6946412) 0.0005358564737
bot_Arimaanator 1.960662587 (1840.601991) 0.7346765571 (127.6263952) 0.0003402609408
kamikazeking 1.882753526 (1827.0678) 0.7424657372 (128.9795144) 0.0004977506049
bot_Clueless2006P2 1.863160001 (1823.664056) 0.7153645161 (124.2715499) 0.0006227940527
ytri 1.849213454 (1821.241293) 0.8288227 (143.981256) 0.0004792753914
filerank 1.821360335 (1816.40271) 0.8288227 (143.981256) 0.0004887645442
Aamir 1.796505038 (1812.084903) 0.8288227 (143.981256) 0.0005229366749
haizhi 1.7666057 (1806.890856) 0.8288227 (143.981256) 0.0007029900686
bot_haizhi 1.581152235 (1774.674288) 0.8288227 (143.981256) 0.000576510305
bot_Bomb2004CC 1.521184847 (1764.256885) 0.8288227 (143.981256) 0.0005199753632
grey_0x2A 1.497306546 (1760.108799) 0.8288227 (143.981256) 0.0005169386559
clauchau 1.483899672 (1757.779786) 0.8288227 (143.981256) 0.0004723108198
deselby 1.456556559 (1753.029801) 0.8288227 (143.981256) 0.0005515469025
CeeJay 1.450091942 (1751.906782) 0.8288227 (143.981256) 0.0006684186778
6sense 1.445184387 (1751.054252) 0.8288227 (143.981256) 0.0005290605304
bot_Clueless2006Fast 1.440374412 (1750.218674) 0.7802854833 (135.5494775) 0.0005738607317
Paul 1.431647882 (1748.70272) 0.8288227 (143.981256) 0.0005009657358

Title: Re: Experimental new rating system
Post by Fritzlein on May 29th, 2006, 10:49am
I decided to produce ratings based purely on rated human vs. human games (1553 of them so far), due to discussion in another thread.  However, when I started to implement an idea we had discussed earlier to deal with non-transitivity, I got distracted by a different, older idea I had of making the ratings retrospective.

All rating systems I know of use an "update and forget" method.  After each game (or each rating period) players get new ratings, but how they reached their ratings is thrown away.  They carry only their rating forward, and no history.  Forgetting history might have some disadvantages, in my estimation.

Suppose, for example, that chessandgo joins the server and for his first ten human games loses to me ten times.  The rating system gives me hardly any credit for my wins.  If then chessandgo beats a lot of other people to push his rating up, we know after the fact that he was pretty good all along, but that doesn't help me any.  I only get points for beating a sub-1500 player even if he was increasing towards 1900 strength by the end of our first ten games.

So I created a new historical system (or one might say retrospective system) to counteract this trend.  It remembers all the old game results, and if someone does better (or worse) in the future, it retrospectively adjusts their ratings up (or down) in the past, as well as retrospectively adjusting the awards and penalties to their opponents.

There's actually just one formula in the FRIAR system (Fritz's Retrospectively Iterated Arimaa Ratings):  Your rating as of any game is the average of your rating from the game before and your rating from the game after, plus the award/penalty for the game itself.  The game award/penalty is calculated from the same formula as standard Elo ratings with a k-factor of 15, i.e.

15 * (score - 1/(1+10^((Ropp - Rmine)/400)))

If it is a player's first game, his "rating from the game before" is 1500.  If it is a player's last game, then he just gets the game award tacked on to the previous game.
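
The rule for a single game can be sketched as follows. This is only an illustration: the function names are made up, and it assumes that "Rmine" in the award formula is the averaged rating, as in the worked example for game 32276 given later in the thread.

```python
def game_award(score, r_mine, r_opp, k=15.0):
    """Elo-style award: k * (score - expected score)."""
    expected = 1.0 / (1.0 + 10.0 ** ((r_opp - r_mine) / 400.0))
    return k * (score - expected)


def friar_update(prev_rating, next_rating, score, opp_rating, k=15.0, start=1500.0):
    """Rating for one game from the neighbouring games' ratings.

    prev_rating is None for a player's first game (1500 is used instead);
    next_rating is None for a player's last game (the award is simply
    tacked on to the previous game's rating).
    """
    if prev_rating is None:
        prev_rating = start
    if next_rating is None:
        base = prev_rating
    else:
        base = (prev_rating + next_rating) / 2.0
    return base + game_award(score, base, opp_rating, k)


# Numbers from the game 32276 example: rated 2302 before, 2310 after,
# and a win over an opponent rated 2052.
print(round(friar_update(2302, 2310, 1.0, 2052), 2))  # 2308.82
```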

To calculate the ratings to match this formula, I just iterated a bunch of times.  The first interesting point is that the ratings are much more volatile than standard Elo ratings with a k-factor of 32.

The second interesting point is that the ratings converge glacially slowly.  I did 200 iterations overnight, but I suspect that the extreme ratings would push out an additional hundred points if only I could do 2000 iterations.  Unfortunately, my code is dog-slow because all parameters are stored (and looked up) in MS Access tables.  If someone did this properly with a C array and some pointers, it would probably take a second per iteration instead of the minute per iteration it took me.

So here are the not-really-converged ratings according to FRIAR, based only on 1553 hvh rated games, and compared to the current server ratings:

Name  FRIAR  Server
Fritzlein 2320 2309
Adanac 2245 2177
robinson 2230 2148
99of9 2212 2169
Belbo 2172 2002
PMertens 2115 2086
Ryan_Cable 2085 2130
chessandgo 2052 2015
omar 2050 1947
blue22 1989 2005
Swynndla 1989 1790
RonWeasley 1979 1941
BlackKnight 1918 1833
naveed 1876 1956
jdb 1875 1796
OLTI 1850 1958
Spunk 1750 1472
mouse 1728 2051
KT2006 1715 1657
frostlad 1715 1807
seanick 1702 1537
grey_0x2A 1692 1709
Arimanator 1689 2035
kamikazeking 1668 1751
thorin 1654 1895
megamau 1649 1788

Belbo has a significantly higher rating under FRIAR.  This makes complete sense because he had a stellar result in last year's postal tourney, and has hardly played humans since then, except for the four games he has already won in this year's postal.  His reduced server rating is due to losing a few to BombFast while training for the WC, and FRIAR ignores such games.

Swynndla also gets a huge boost in FRIAR from beating tons of different human players, even though many were newcomers.  He may therefore be somewhat overrated in FRIAR, but I don't mind seeing that the same strategy that works in Player of the Month also boosts the FRIAR rating.

I'm pleased that FRIAR rates jdb and naveed about the same, despite their divergent server ratings.

I had never heard of Spunk before, but he had a good record in the very early days of the server against omar, who later turned out to be very good.  Then when the early bots came on-line, Spunk lost all his points to those bots, then left.  The FRIAR rating for Spunk actually nearly matches his server rating from before the time he started to play bots.

I'm sure seanick will be happy to note that FRIAR respects his record against human opponents and ignores his string of losses to tough bots.

FRIAR gives a huge rating penalty to mouse relative to mouse's server rating.  This reflects the fact that mouse has only played 12 rated games against humans ever.  He has a 6-6 record against fairly tough opposition, but it simply isn't enough games to pull away from 1500 very far.

Arimanator, in contrast, has played enough games against humans to establish a rating, but his 22-46 record doesn't put him very high in the FRIAR rankings.  His high server rating is attributable largely to bot-bashing.

I was surprised clauchau didn't make the list of top players, but after peaking at 1899, he dropped back to 1626.  That goes to show what happens if you don't keep up with advances in Arimaa theory.

Haizhi, filerank, ytri, and some other players with a decent server ranking are invisible to FRIAR because they have played no games or hardly any games against humans.  Thorin will show up in the rankings much more clearly once the current postal tournament is over, I guarantee.

On the whole, I don't think FRIAR ratings are any more accurate than the server ratings in terms of predicting future game outcomes.  Nevertheless, I think FRIAR admirably meets the goals of a pure-human rating to go alongside the standard server rating.

Title: Re: Experimental new rating system
Post by Ryan_Cable on May 29th, 2006, 2:09pm
I don’t understand how the retrospective iteration works.  Are you assuming that everyone has constant skill over time?  That seems like a particularly bad idea.

I am pleasantly surprised to see how high my HvH rating is.  I thought I was significantly more overrated than that.

Title: Re: Experimental new rating system
Post by Fritzlein on May 29th, 2006, 2:37pm

on 05/29/06 at 14:09:15, Ryan_Cable wrote:
I don’t understand how the retrospective iteration works.  Are you assuming that everyone has constant skill over time?

No, no, I'm not holding skill constant over time.  Your rating at a given time is influenced most by the games very near it, and less and less by games far before it or far after it.  So your rating at the time of your second game is hardly influenced at all by whether your hundredth game was a win or a loss.

To each player of each game, I assign a rating that is supposed to represent his skill at the time of that game.  The assumption is that his skill at that time will be approximately the average of his skill the game before and the game after.

Take my last three games, for example:

32240 Ryan_Cable vs. Fritzlein
32276 Fritzlein vs. chessandgo
32282 Fritzlein vs. Swynndla

As part of my iterative pass through the ratings, I want to re-calculate how strong I was when I played game 32276.  I look ahead and see I was rated 2310 in game 32282, but only rated 2302 in game 32240.  My rating should be near the average of 2306.  I beat chessandgo, who was rated 2052, so I recalculate my rating in game 32276 as

2306 + 15*(1 - 1/(1+10^((2052-2306)/400))) =

2308.8221

When the ratings stabilize after many many iterations, each player's rating in each game will be exactly equal to the average of his ratings before and after, plus the bonus (penalty) for winning (losing) the game in question.  

This list I gave was only the ratings of each player at the end of the line; I apparently peaked about 150 points higher than my final rating.  Long winning streaks or losing streaks will cause your rating to whip around even more in the FRIAR system than in the current server system.

There is probably a much cleverer way to reach convergence than by making pass after pass of setting each rating in each game to what it would have been given the other ratings of the previous iteration.  My coding ability was only adequate for a simplistic solution that doesn't run fast enough to converge in a reasonable amount of time.  :-(  In C on a fast computer, however, the simplistic iteration might be adequate.
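
The pass-after-pass scheme described above might look roughly like this in Python. It is a sketch under stated assumptions, not the original Access implementation: the data layout (a chronological list of `(player_a, player_b, score_a)` tuples) and the function name are made up, and each pass computes every rating from the ratings of the previous pass.

```python
from collections import defaultdict


def friar_iterate(games, k=15.0, start=1500.0, passes=200):
    """Run a fixed number of FRIAR passes and return each player's final rating.

    games: chronological list of (player_a, player_b, score_a), with score_a
    being 1.0 for a win by player_a and 0.0 for a loss.
    """
    history = defaultdict(list)  # player -> indices of that player's games, in order
    for idx, (a, b, _) in enumerate(games):
        history[a].append(idx)
        history[b].append(idx)

    # One rating per (player, game); everyone starts at `start` everywhere.
    ratings = {(p, idx): start for p, idxs in history.items() for idx in idxs}

    for _ in range(passes):
        updated = {}
        for player, idxs in history.items():
            for pos, idx in enumerate(idxs):
                a, b, score_a = games[idx]
                opp, score = (b, score_a) if player == a else (a, 1.0 - score_a)
                prev = ratings[(player, idxs[pos - 1])] if pos > 0 else start
                if pos + 1 < len(idxs):
                    base = (prev + ratings[(player, idxs[pos + 1])]) / 2.0
                else:
                    base = prev  # last game: award tacked on to the previous rating
                expected = 1.0 / (1.0 + 10.0 ** ((ratings[(opp, idx)] - base) / 400.0))
                updated[(player, idx)] = base + k * (score - expected)
        ratings = updated  # only use previous-pass values within a pass

    return {p: ratings[(p, idxs[-1])] for p, idxs in history.items()}


# Tiny made-up example: x beats y three times in a row.
print(friar_iterate([("x", "y", 1.0)] * 3, passes=50))
```

Keeping everything in in-memory dictionaries avoids the per-lookup database cost, though convergence itself may still take many passes.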

Title: Re: Experimental new rating system
Post by chessandgo on May 29th, 2006, 5:50pm

on 05/29/06 at 10:49:10, Fritzlein wrote:
Suppose, for example, that chessandgo joins the server and ...


I'm fortunate not to have you as a math teacher :
let chessandgo and BlackKnight be real numbers, then chessandgo^2 + Blacknight = ...
it would be really harder to write down equations :)

Title: Re: Experimental new rating system
Post by seanick on May 31st, 2006, 1:21am
Yeah, I am all for this new rating system, heh heh...

What about something that kept track of time taken? Would the best players' games take longer per move, relative to the time scale, than those of less highly rated players? Does the line go up or down in terms of % of available time per move when playing someone of equal rating? Are those numbers easily mineable, or are they somewhat obscured within various sources?

I am not a linux user but have begun to study some things analytically with code on win32. so such things would interest me except for the problem of having to use linux. I wouldn't mind, but ... my employer would have a few reservations about the idea.

Title: Re: Experimental new rating system
Post by Fritzlein on May 31st, 2006, 9:18am
One problem with the server ratings (which FRIAR doesn't address in the slightest) is that different humans seem to benefit differently from extra thinking time.  Some players, notably Belbo and Omar, are tigers at a slow time control or postally, but tend to fall apart in fast games.  Other players, most notably kamikazeking and PMertens, can play great moves even at blitz speeds, but don't seem to get very much better given more time.  (Actually, PMertens doesn't even use all of his time given more time.)

In my opinion it isn't a good idea to say the players who can move faster are the better players.  There are different kinds of skill.  I'd rather say that some players are good at blitz and other players are good at postal games.  In another thread we discussed having ratings reflect time control.

http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1103741634;start=0#0

Note that back then the fastest time control available was 30 seconds per move, and it was already an issue!


Title: Re: Experimental new rating system
Post by aaaa on May 31st, 2006, 12:33pm
You might be interested in this article (http://www.chessbase.com/newsdetail.asp?newsid=562), where it is proposed that games at different time controls are to be given different weights.

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 6th, 2006, 12:11am
I decided that a k-factor of 15 was making the FRIAR ratings way too volatile; I lowered it to 10.  I ran the numbers again, this time letting them converge a bit longer.  Also I added in last week's games, 23 more.  (Sorry, chessandgo, your four big wins from Sunday and Monday aren't there yet; you would surely be over 2100 with them included.)  The FRIAR top 25, with number of games played:

rate      games      username
2417      215      Fritzlein
2236      201      99of9
2228      100      Adanac
2194      265      PMertens
2182      201      robinson
2149      121      Belbo
2086      111      Ryan_Cable
2034      116      omar
2031      126      jdb
2015      67      chessandgo
2013      103      Swynndla
1963      73      blue22
1961      19      RonWeasley
1912      223      naveed
1897      79      OLTI
1894      18      BlackKnight
1765      66      kamikazeking
1742      22      frostlad
1714      13      Spunk
1701      68      Arimanator
1691      12      mouse
1680      23      grey_0x2A
1645      16      KT2006
1644      49      megamau
1639      43      clauchau

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 6th, 2006, 1:14am
And now the fun part: a graph of the historical FRIAR ratings.  (Only the top 7 by number of hvh games get personalized colors; sorry!)

http://www.math.umn.edu/~juhn0008/HistoricalFRIAR.png

Note how volatile the ratings are even with the k-factor reduced to 10.  On the official server ratings I retained the top ranking even when I tied for fourth in the 2006 World Championship, but the FRIAR ratings have me dipping below robinson, Adanac, and PMertens, i.e. all three of the WC medalists.

At the same time that FRIAR ratings are volatile, note that people have to play a significant number of games to move far from 1500.  In this sense the volatility of FRIAR is opposite to that of the server.  On the server your rating changes a lot at first, and slowly later.  With FRIAR your rating changes slowly until you have played fifteen games or so, but later on winning streaks (or losing streaks) have a bigger effect than they do on the server.

I note that in August 2004, around the time I joined the server, FRIAR considered 99of9 to be the most dominant player of any time period.  My current ratings lead of 180 points looks wimpy compared to the 350-point lead 99of9 had back then.

Title: Re: Experimental new rating system
Post by chessandgo on Jun 6th, 2006, 9:12am
Great !!! I had the feeling that this forum had not been used for ages  >:( ... thanks for putting once more some life in it Fritz !

I see nothing but a big yellow line in there ;)

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 6th, 2006, 12:54pm

on 05/31/06 at 12:33:56, aaaa wrote:
You might be interested in this article (http://www.chessbase.com/newsdetail.asp?newsid=562), where it is proposed that games at different time controls are to be given different weights.

It is interesting that blitz chess games can be used to make classical chess ratings more predictive of classical chess results.  I wonder, however, whether Sonas did any study of whether people have statistically distinguishable playing strengths at different time controls.  For Arimaa there is clearly a difference in playing strength.  See the eleventh post in this thread (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1117068449)

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 6th, 2006, 1:18pm

on 05/26/06 at 16:50:03, aaaa wrote:
maximizing the prediction power of the system has resulted in the rating deviation growing very fast if a player doesn't play in a while. Depending on one's volatility, it will take only about 20 days before one's rating deviation becomes the maximum again.

It's very interesting that maximizing predictive accuracy means that our uncertainty should stay high (or at least rapidly increase back to the maximum) and thus the ratings should remain "loose" i.e. subject to quick change.

I wonder whether people really do change rapidly in playing strength from three weeks of inactivity.  Certainly the server bots do not.  I suspect that the reason "loose" ratings are more predictive is actually more due to intransitivity than due to rapidly-changing skill level.

If I suddenly decide to pump up my rating by beating one bot over and over again, the rating system will be more predictive if it quickly forgets my history and adjusts to my new performance.  Likewise if someone dominates the first dozen bots on the ladder before getting stuck on one they lose to twenty times in a row, the rating system will be more predictive if it quickly forgets the prior wins and lowers that player's rating to predict future losses.

In other words, the worse the problem of intransitivity is, the more stupid slow-changing ratings appear to be compared to fast-changing ratings.  Yet we can be sure that fast-changing ratings for bots (especially fixed-performance bots) are not reflecting underlying changes in skill.  The changes in a bot's rating mostly reflect changes in who is dominating (or being dominated by) the bot at any given time, i.e. it is mostly reflecting the intransitivity of ratings.

Of course these are just my speculations.  Statistically verifying my thesis would be quite another matter.  ;-)

Title: Re: Experimental new rating system
Post by mouse on Jun 6th, 2006, 3:29pm
I think these rating calculations are interesting. And I believe they point out the need to have separate HvH, HvB and BvB ratings, especially since the highlighted players seem to be examples of players who have very different ratings against bots and humans.


on 05/29/06 at 10:49:10, Fritzlein wrote:
Name  FRIAR  Server
Fritzlein 2320 2309
Adanac 2245 2177
robinson 2230 2148
99of9 2212 2169
Belbo 2172 2002
PMertens 2115 2086
Ryan_Cable 2085 2130
chessandgo 2052 2015
omar 2050 1947
blue22 1989 2005
Swynndla 1989 1790
RonWeasley 1979 1941
BlackKnight 1918 1833
naveed 1876 1956
jdb 1875 1796
OLTI 1850 1958
Spunk 1750 1472
mouse 1728 2051
KT2006 1715 1657
frostlad 1715 1807
seanick 1702 1537
grey_0x2A 1692 1709
Arimanator 1689 2035
kamikazeking 1668 1751
thorin 1654 1895
megamau 1649 1788


Title: Re: Experimental new rating system
Post by Fritzlein on Jun 19th, 2006, 1:36am
I updated my database with 78 more rated hvh games and ran the FRIAR ratings again.  Seeing the results made me realize that FRIAR ratings were way more volatile than I wanted.  Winning streaks and losing streaks created huge swings.   Chessandgo's rating, for example, went up and then down more than 100 points in a two-week span.

To counteract this volatility, I reduced the k-factor further from 10 to 6.  That would make a 100-point swing into only a 60-point swing.  The scale remains the same (e.g. a rating difference of 200 points still should mean a 76% winning chance for the favorite), but each player's early ratings become all the more tied to 1500.  It now takes about 25 games against humans for your FRIAR rating to really start floating freely.  I guess that could be considered a feature: you stay near 1500 until there's quite a bit of proof you should be higher or lower.
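
As a rough check on the figures above, here is a minimal sketch. It assumes FRIAR uses the standard Elo logistic curve on a 400-point scale and an update that is linear in the k-factor; the thread does not spell out FRIAR's exact formulas, so the function names and the update rule are illustrative assumptions only.

```python
def expected_score(rating_diff):
    """Win chance for the higher-rated player under the standard Elo curve (assumed)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def rating_change(k, actual, expected):
    """Elo-style update, linear in k (assumed form, not confirmed for FRIAR)."""
    return k * (actual - expected)


# A 200-point favorite should win roughly 76% of the time.
p = expected_score(200)
print(f"{p:.3f}")  # 0.760

# The same result moves the rating 6/10 as far with k=6 as with k=10,
# which is why a 100-point swing shrinks to about 60 points.
ratio = rating_change(6, 1.0, p) / rating_change(10, 1.0, p)
print(ratio)
```

Under these assumptions the scale (what a 200-point gap means) depends only on the curve, while the k-factor only changes how far each result moves a rating.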

Here are the new FRIAR top 20, with number of games played:

rate      games      username
2387      218      Fritzlein
2240      203      99of9
2116      122      Belbo
2033      204      robinson
2026      110      Adanac
2024      115      Ryan_Cable
2020      279      PMertens
2007      95      chessandgo
2005      84      OLTI
1932      119      omar
1927      105      Swynndla
1914      130      jdb
1884      75      blue22
1882      223      naveed
1867      20      RonWeasley
1819      18      BlackKnight
1753      72      kamikazeking
1705      23      frostlad
1693      68      Arimanator
1673      13      thorin

I note Belbo jumped from 6th place to 3rd.  He posted only a win over OLTI in the postal tournament, extending his winning streak to six games, and FRIAR responds to streaks.  Also the people he jumped, namely Adanac, robinson, and PMertens, posted results of 4-5, 0-3, and 4-6 respectively against other top players, losing records which weren't compensated by a win or two against lower-ranked folks.

99of9 had only two results, postal wins over blue22 and robinson, which extends 99of9's winning streak to seven and widens his lead over any contenders for second place.  Similarly I posted only three wins, extending my winning streak to nine.  Even with the k-factor reduced to 6, FRIAR is very concerned to know, "What have you done for me lately?"

Chessandgo leapfrogged jdb and omar, and helped drag OLTI up too by losing to OLTI twice while compiling a 9-4 combined record against PMertens, Ryan_Cable, and Adanac.

RonWeasley's only result was beating seanick, but Ron dropped in the rankings anyway because I reduced the k-factor.   That binds him more closely to 1500, and he has only played 20 rated hvh games so far.  I couldn't believe Ron has played only twenty games given the impact he has had around here, but the on-line game record agrees.  The fact that Ron could play more if the client were more stable is by itself a powerful argument for improving the client.

Title: Re: Experimental new rating system
Post by RonWeasley on Jun 19th, 2006, 11:44am

Quote:
I couldn't believe Ron has played only twenty games given the impact he has had around here,


Much like the impact a bludger has on a beater.  I'm just batting practice most of the time, but if you don't look out, I might knock you off your broom!

Title: Re: Experimental new rating system
Post by chessandgo on Jun 19th, 2006, 8:11pm
Karl, don't you feel like the more you improve your rating system, the closer to Omar's actual ratings the output gets ?  :P

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 19th, 2006, 10:40pm
Well, the main thing I like about FRIAR right now compared to Omar's ratings is that FRIAR uses only hvh results.  To get a good comparison, I would have to run Omar's ratings side-by-side with only hvh games as input.  Then I could compare directly.

Hmmm, actually that's not a bad idea...

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 26th, 2006, 5:38pm
I updated FRIAR with 27 more games.  

rate      games      username
2485      223      Fritzlein
2242      203      99of9
2223      106      chessandgo
2155      123      Belbo
2064      206      robinson
2061      86      OLTI
2060      115      Adanac
2044      282      PMertens
2003      119      Ryan_Cable
1943      105      Swynndla
1924      120      omar
1887      75      blue22
1884      223      naveed
1868      22      RonWeasley
1847      133      jdb
1819      18      BlackKnight
1732      73      kamikazeking
1708      23      frostlad
1693      68      Arimanator
1675      13      thorin

The big story is chessandgo jumping 5 more places.  Chessandgo won against Adanac, seanick, RonWeasley, Ryan_Cable, jdb, robinson, PMertens, and seanick, while losing only three games, all to me.  That incidentally also pushes my rating to a ridiculous high.

OLTI gets a retrospective bounce from his previous victories over chessandgo, and from having not lost in a while.

This week shows once again that FRIAR ratings are more volatile than server ratings.  I gained 100 points for three wins!  I might have to reduce the k-factor still further.

Title: Re: Experimental new rating system
Post by OLTI on Jun 28th, 2006, 4:24am
I'm starting to like FRIAR  ;D

Title: Re: Experimental new rating system
Post by DorianGaray on Jun 28th, 2006, 6:03am

on 06/26/06 at 17:38:36, Fritzlein wrote:
...That incidentally also pushes my rating to a ridiculous high...

Shouldn't that sole detail give you pause about the validity of your system? You've played for nearly two years, yet your rating is completely upset, by what? Another player's latest 5 games!

Ratings are only valid if they can accurately predict the likelihood that one player has of winning against another at any given time; otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that.

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 28th, 2006, 8:10am

on 06/28/06 at 06:03:05, DorianGaray wrote:
Ratings are only valid if they can accurately predict the likelihood that one player has of winning against another at any given time; otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that.

Sure, I don't think the FRIAR ratings are very predictive.  My idea for a new rating system is working rather poorly so far.  I think I would have to do something to reduce the volatility to make the results more reasonable.   Also I'm starting to not like how strongly everyone is tied to 1500 for their first few games.

The main point of generating ratings based only on human games is that the standard server ratings can be distorted a great deal by bot-bashing, or alternatively by losing repeatedly to a bot that one can't beat.  Both of these distortions seem reasonably common.

Take your case for example.  You have played 196 rated games against bots and 10 rated games against humans.  I submit that your rating of 2228 doesn't tell us very much about your chances of winning against, say, 99of9, who is the player closest in rating to you.  You might beat him 90% of the time, or you might lose to him 90% of the time; we just can't tell from your server rating which is based mostly on your games versus bots.

Your ten results against humans (and their ratings) were

L 1949
W 1442
L 2317
W 1483
W 1516
W 1384
L 1863
W 1408
W 1517
W 1518

The rating that predicts a result of 7-3 against this opposition is 1823.  With a true strength of 2228, you would be expected to be 9-1 against this opposition, on average.  This appears to be a situation where the server ratings are not very predictive, although we have no reliable idea how you would do against human opposition until you start playing more games against humans.  Maybe if you played more human opponents you would justify a rating of 2200, or 2500, or anything.  We simply can't predict very well on the basis of only ten games.
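
The two figures in the paragraph above can be reproduced with a short sketch. It assumes the standard Elo expected-score curve on a 400-point scale and finds the performance rating by bisection; the function names and the bisection bounds are illustrative choices, not something taken from the server's code.

```python
def expected_total(rating, opponents):
    """Sum of expected scores against each opponent (standard Elo curve, assumed)."""
    return sum(1.0 / (1.0 + 10.0 ** ((opp - rating) / 400.0)) for opp in opponents)


def performance_rating(score, opponents, lo=0.0, hi=4000.0):
    """Rating whose expected total equals the actual score.

    expected_total rises steadily with rating, so bisection finds the root.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_total(mid, opponents) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Ratings of the ten human opponents listed above (7 wins, 3 losses).
opponents = [1949, 1442, 2317, 1483, 1516, 1384, 1863, 1408, 1517, 1518]

perf = performance_rating(7, opponents)
exp_2228 = expected_total(2228, opponents)
print(round(perf), round(exp_2228, 1))
```

With these inputs the performance rating comes out close to 1823, and the expected score for a 2228-rated player comes out close to 9 out of 10, in line with the numbers quoted in the post.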

I'd be interested in human-only ratings lists even if they were generated by a totally different methodology than FRIAR ratings.   In my opinion it is ultimately going to be very hard to generate valid (or accurate, or predictive) ratings if games against bots are included at all, no matter what methodology is used.

I agree, however, that in the case of FRIAR ratings, the cure is worse than the disease.  There's no way I deserve to be rated 200 points ahead of everyone else based on my results against humans, and that's not the only visible distortion in FRIAR ratings.

Title: Re: Experimental new rating system
Post by Fritzlein on Jun 30th, 2006, 9:47pm
I changed FRIAR again.  To cut out the still-too-high volatility, I reduced the k-factor all the way to 2.  Also, to reduce the too-strong tie to 1500, I no longer assume that everyone has a strength of 1500 at the start; in fact I don't make any assumptions about initial strength.  Instead, the tie to 1500 now consists of 3 draws against a 1500-rated player at the beginning of everyone's game record.  The new list, produced from exactly the same games as the old list, is

rate      games      username
2443      223      Fritzlein
2216      203      99of9
2182      106      chessandgo
2102      206      robinson
2096      123      Belbo
2095      115      Adanac
2064      282      PMertens
2020      119      Ryan_Cable
1988      86      OLTI
1933      22      RonWeasley
1911      18      BlackKnight
1904      120      omar
1904      223      naveed
1898      105      Swynndla
1869      75      blue22
1854      13      Spunk
1851      133      jdb
1810      73      kamikazeking
1776      13      thorin
1772      12      mouse
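
To illustrate the "3 draws against a 1500-rated player" anchor described above, here is a minimal sketch. FRIAR itself weights games over time and is not a simple performance rating, so this only shows how three virtual draws behave in a static fit. It assumes the standard Elo curve, and the function names and sample results are hypothetical.

```python
def expected_total(rating, opponents):
    """Sum of expected scores against each opponent (standard Elo curve, assumed)."""
    return sum(1.0 / (1.0 + 10.0 ** ((opp - rating) / 400.0)) for opp in opponents)


def anchored_rating(results, anchor=1500.0, n_draws=3):
    """Fit a rating to results plus n_draws virtual draws against the anchor.

    results: list of (opponent_rating, score) pairs, score being 1, 0.5 or 0.
    """
    opponents = [opp for opp, _ in results] + [anchor] * n_draws
    score = sum(s for _, s in results) + 0.5 * n_draws
    lo, hi = 0.0, 4000.0
    for _ in range(60):  # bisection on a function that rises with rating
        mid = (lo + hi) / 2.0
        if expected_total(mid, opponents) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


no_games = anchored_rating([])           # a newcomer sits at the anchor
one_win = anchored_rating([(1500, 1)])   # hypothetical: one win over a 1500 player
print(round(no_games), round(one_win))
```

Without the virtual draws, an undefeated record has no finite best-fit rating; with them, a player with no games sits at 1500 and the pull toward 1500 fades as real results accumulate.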

One improvement is that people on a recent winning streak (me, chessandgo, Belbo, OLTI) aren't overly rewarded, and the people performing a bit below par recently (robinson, Adanac) aren't overly punished.  The more stable ratings mean the present ratings are affected by games further back.

A second improvement is that RonWeasley jumps to a more reasonable level from not being so closely tied to 1500.  A third, more subtle improvement is that Swynndla dips.  He beat lots of newcomers, but since the newcomers themselves are less tied to 1500, they end up lower, and Swynndla therefore gets less boost from those victories.

One possible criticism is that I am still rated too far ahead of 99of9.  To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans.  It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans.  99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans.  Since the FRIAR ratings are not far from vs. human performance ratings calculated by server ratings, I don't feel as bad about my inflated rating.

On the whole I'm rather pleased with my latest version.  Graph to follow.

Title: Re: Experimental new rating system
Post by chessandgo on Jun 30th, 2006, 11:05pm

on 06/30/06 at 21:47:17, Fritzlein wrote:
One possible criticism is that I am still rated too far ahead of 99of9.  To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans.  It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans.  99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans.  Since the FRIAR ratings are not far from vs. human performance ratings calculated by server ratings, I don't feel as bad about my inflated rating.


I don't think you'll manage to design a rating system that doesn't put you on top with a huge lead :)
why don't you come and play to score a 50-0 instead of taking care of everyone's rating  ? I'd be glad to be take 3 last losses ;)

Title: Re: Experimental new rating system
Post by chessandgo on Jun 30th, 2006, 11:06pm
*to take the 3 last losses*

Title: Re: Experimental new rating system
Post by Fritzlein on Jul 1st, 2006, 9:08am
Here's a graph of my latest FRIAR ratings over time:
http://www.math.umn.edu/~juhn0008/HistoricalFRIAR2.png
Compared to the more-volatile system before:
http://www.math.umn.edu/~juhn0008/HistoricalFRIAR.png
I think the new one is a more reasonable guess at how fast playing skill actually changes, don't you?

Title: Re: Experimental new rating system
Post by aaaa on Apr 1st, 2007, 9:58pm
Real-time Glicko-2 was perhaps not such a hot idea after all. Since the original Glicko HAS found real-time applications, e.g. at the Free Internet Chess Server, it should have been obvious to experiment with that one first. The current rating system of Arimaa, although seemingly based on the same principles, strikes me as too much of a kludge. In that light it would be interesting to see how it compares to a more faithfully implemented Glicko system. Again, the system here is optimized based on every human-involved rated game (currently numbering 34577). Any comments or questions on these ratings?

player rating RD
Fritzlein 2536.589308 151.3183257
chessandgo 2444.57311 141.2896649
syed 2413.391102 96.44818116
DorianGaray 2345.602406 152.2
Belbo 2331.532722 152.2
PMertens 2324.733678 152.2
99of9 2314.738731 152.2
robinson 2246.385158 152.2
RonWeasley 2219.323314 152.2
Arimanator 2118.80046 152.2
jdb 2104.273778 152.2
blue22 2094.621401 136.7078874
bot_Bomb2005Fast 2092.449876 90.26640411
obiwan 2091.575049 121.5170209
Ryan_Cable 2083.344584 152.2
thorin 2081.512715 152.2
mouse 2076.716541 152.2
arimaa_master 2062.639947 115.1808582
omar 2049.651718 152.2
Brendan 2046.479086 139.1713486
bot_Bomb2005P2 2041.870531 106.0700486
OLTI 2038.889491 142.6923251
clauchau 2035.542834 152.2
kerdamdam 2019.697991 152.2
jawdirk 2017.579664 152.2
bot_Bomb2004CC 2009.561431 152.2
bot_GnoBot2005Blitz 2008.050303 96.53058166
Adanac 2007.836721 152.2
The_Jeh 1985.357159 152.2
UltraWeak 1979.211538 152.2
bot_Clueless2005P2 1963.922121 147.516796
omarFast 1950.756119 152.2
bot_Zombie 1947.281312 152.2
woh 1936.687461 152.2
bleitner 1927.015185 152.2
kamikazeking 1920.066142 152.2
bot_Bomb2005Blitz 1903.92491 151.2594177
bot_GnoBot2006Blitz 1893.79771 152.2
bot_speedy 1874.594987 152.2
BlackKnight 1870.604424 152.2
bot_lightning 1859.550822 152.2
Soter 1858.647498 93.07385538
bot_Bomb2005CC 1851.349992 152.2
Chegorimaa 1839.335515 98.60569559
bot_Clueless2005Fast 1839.173029 152.2
fritzlforpresident 1837.236962 152.2
petitprince 1825.132353 113.2571471
megamau 1824.555125 152.2
bot_Clueless2005Blitz 1818.806912 152.2
filerank 1817.020033 152.2
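
For readers wondering how the RD column affects predictions, here is a minimal sketch of the expected-outcome formula from Glickman's description of Glicko, applied to the top two entries in the list above. Whether the implementation behind this list predicts results in exactly this way is not stated in the post, so treat it as an illustration only; the function names are my own.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def g(rd):
    """Glicko attenuation factor: shrinks toward 0 as the deviation grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def glicko_expected(r1, rd1, r2, rd2):
    """Expected score of player 1, accounting for both rating deviations."""
    combined_rd = math.sqrt(rd1**2 + rd2**2)
    return 1.0 / (1.0 + 10.0 ** (-g(combined_rd) * (r1 - r2) / 400.0))


def plain_expected(r1, r2):
    """Same prediction with the uncertainty ignored (plain Elo curve)."""
    return 1.0 / (1.0 + 10.0 ** (-(r1 - r2) / 400.0))


# Fritzlein vs. chessandgo, using the ratings and RDs listed above.
with_rd = glicko_expected(2536.589308, 151.3183257, 2444.57311, 141.2896649)
without_rd = plain_expected(2536.589308, 2444.57311)
print(round(with_rd, 3), round(without_rd, 3))
```

The large deviations pull the prediction toward 50%: the 92-point gap is worth roughly a 61% expectation with the RDs taken into account, against roughly 63% on the plain curve.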


