|
||||||
Title: Experimental new rating system Post by aaaa on May 26th, 2006, 1:55pm Mark Glickman, known for his rating system Glicko, which extended Elo such that it takes time into account, has designed a successor system which also takes into account the volatility of the strength of a player (meaning that one can forgo the fudge of imposing a minimum deviation). It is described here (http://math.bu.edu/people/mg/glicko/glicko2.doc/example.html). I've adapted Glicko-2 for use in a "real-time" context and took a stab at optimizing the various parameters based on all the rated games in the database. Unfortunately, it was not originally designed to handle the ultra-short "rating periods" of one second I gave it, making it prone to hanging in the iteration phase. Nevertheless, I didn't want people here to miss out on it, so below is a list of the 50 top-rated players according to a particularly customized version of it (apologies for the bad layout, as I couldn't get the data working with the forum table markup). I'm particularly interested in getting queries of a statistical nature about the system, as well as hearing what properties exactly are desired of it here. 
player rating (old style) rating deviation (old style) volatility
Fritzlein 5.6630497 (2483.772535) 0.8068123002 (140.1576578) 0.0006376507628
99of9 4.910623754 (2353.062755) 0.832506 (144.6211108) 0.0006516585594
robinson 4.62855297 (2304.062039) 0.7831770235 (136.0517895) 0.0006228492358
Adanac 4.610491197 (2300.924388) 0.7379585114 (128.1965291) 0.00056411572
PMertens 4.360161883 (2257.43773) 0.832506 (144.6211108) 0.0008407439315
Ryan_Cable 4.326954398 (2251.668999) 0.832506 (144.6211108) 0.0004918684095
Belbo 4.27533983 (2242.702629) 0.7576357981 (131.6148241) 0.0006193388697
mouse 3.962690082 (2188.389803) 0.832506 (144.6211108) 0.0005526332836
Arimanator 3.78416112 (2157.376145) 0.832506 (144.6211108) 0.0008069304669
RonWeasley 3.697059358 (2142.245018) 0.832506 (144.6211108) 0.0005989948555
chessandgo 3.671875516 (2137.870137) 0.5040354009 (87.55992097) 0.0007090357629
omar 3.546825706 (2116.146759) 0.832506 (144.6211108) 0.0006258232938
naveed 3.451697291 (2099.62126) 0.832506 (144.6211108) 0.0007640425682
blue22 3.382977003 (2087.683322) 0.832506 (144.6211108) 0.0005354651612
bot_Bomb2005CC 3.170315207 (2050.740183) 0.7609260811 (132.1864048) 0.0005321464476
bot_Bomb2005Fast 3.064107257 (2032.289972) 0.8262308837 (143.5310114) 0.000611501966
bot_Bomb2005Blitz 3.053605699 (2030.465664) 0.6707376748 (116.5190732) 0.0009073338997
OLTI 3.03204056 (2026.719416) 0.832506 (144.6211108) 0.0005484800951
bot_Bomb2005P2 2.822907443 (1990.389271) 0.4867876375 (84.56367745) 0.0004840308518
thorin 2.767536776 (1980.7704) 0.832506 (144.6211108) 0.0005788629022
omarFast 2.726652212 (1973.668024) 0.832506 (144.6211108) 0.0006681596461
bot_speedy 2.682962807 (1966.078396) 0.832506 (144.6211108) 0.0007288659695
bleitner 2.610592744 (1953.506428) 0.832506 (144.6211108) 0.0005072821952
jdb 2.610499995 (1953.490316) 0.832506 (144.6211108) 0.0006070601943
bot_Clueless2005Fast 2.58310922 (1948.732051) 0.6668649056 (115.8463043) 0.0006654729265
megamau 2.541955565 (1941.582928) 0.832506 (144.6211108) 0.0006978433977
bot_lightning 2.473854617 (1929.752582) 0.832506 (144.6211108) 0.0006763171832
Swynndla 2.422299879 (1920.796606) 0.7948650666 (138.0822107) 0.0006315119355
frostlad 2.419039538 (1920.230227) 0.8071741155 (140.2205116) 0.0006132318648
BlackKnight 2.347565986 (1907.813998) 0.832506 (144.6211108) 0.0006623329595
bot_GnoBot2005Fast 2.303539877 (1900.16588) 0.7849751855 (136.3641623) 0.000675043096
bot_Clueless2005Blitz 2.2345132 (1888.174717) 0.7551311049 (131.1797143) 0.0007258793672
bot_Clueless2005P2 2.212829489 (1884.407871) 0.6895930611 (119.7945895) 0.0006311812553
bot_Clueless2005CC 2.168773741 (1876.754603) 0.8308199736 (144.328218) 0.0006241990032
bot_Arimaanator 2.090386741 (1863.137386) 0.8148788129 (141.5589546) 0.0004081767749
bot_Clueless2006P2 2.075735855 (1860.592266) 0.7943007282 (137.984175) 0.0007195330871
kamikazeking 2.011747429 (1849.476337) 0.7556039343 (131.2618531) 0.0005792767201
ytri 1.972060074 (1842.581938) 0.832506 (144.6211108) 0.0005580832465
filerank 1.970310098 (1842.277936) 0.832506 (144.6211108) 0.000569107061
haizhi 1.894086955 (1829.036619) 0.832506 (144.6211108) 0.0008102594244
Aamir 1.856510045 (1822.508841) 0.832506 (144.6211108) 0.0006060001632
bot_haizhi 1.702288469 (1795.717808) 0.832506 (144.6211108) 0.0006619388723
bot_Bomb2004CC 1.674421387 (1790.8768) 0.832506 (144.6211108) 0.000606956265
clauchau 1.633644449 (1783.79312) 0.832506 (144.6211108) 0.0005567039486
grey_0x2A 1.593474756 (1776.814929) 0.832506 (144.6211108) 0.000601434371
deselby 1.591926598 (1776.545986) 0.832506 (144.6211108) 0.0006369085893
CeeJay 1.578303073 (1774.179338) 0.832506 (144.6211108) 0.0007813484887
bot_Aamira2006Fast 1.574885954 (1773.585723) 0.7628650488 (132.523238) 0.0006310808374
bot_Clueless2006Fast 1.562548955 (1771.442567) 0.832506 (144.6211108) 0.0006612313097
bot_Loc2005Blitz 1.538297761 (1767.229703) 0.7531418141 (130.834139) 0.0006051832325 |
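For readers comparing the two columns: they appear to be related by the standard Glicko-2 change of scale (173.7178 old-style points per internal unit), which the numbers above bear out. A minimal sketch, assuming that conversion:

```python
# Convert between the Glicko-2 internal scale (mu, phi) and the
# familiar Elo-like "old style" scale, per Glickman's Glicko-2 paper.
GLICKO2_SCALE = 173.7178

def to_old_style(mu, phi):
    """Map internal-scale rating and deviation to the old-style scale."""
    return 1500 + GLICKO2_SCALE * mu, GLICKO2_SCALE * phi

def to_glicko2(rating, rd):
    """Inverse mapping, old-style -> internal scale."""
    return (rating - 1500) / GLICKO2_SCALE, rd / GLICKO2_SCALE

# Check against the first row of the table above:
r, d = to_old_style(5.6630497, 0.8068123002)
# r is approximately 2483.77 and d approximately 140.16
```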
||||||
Title: Re: Experimental new rating system Post by Fritzlein on May 26th, 2006, 3:45pm Hey, this rocks. I have the highest respect for Mark Glickman, and it's cool to see what numbers are produced by an implementation of the Glicko system. (Actually, I confess I liked Glicko but never read up on Glicko-2; an omission I shall shortly remedy.) I'm curious why blue22 is ranked so much lower and Belbo so much higher in your ratings than in the official ratings. Maybe it's because Glicko doesn't like to move the ratings around as much? Belbo established a very high rating with tons of games, and has since dropped off his peak in the official ratings, but maybe Glicko considered him extremely firmly established and didn't let his rating move down as much. The rating deviation of 84 for Bomb2005P2 seems suspicious to me. That bot has played 812 games, but only 447 of those were rated. Why should it have a deviation so much lower than mine, when I've played 773 rated games? Anyway, the issue I am most concerned about is not an issue that Glickman has addressed at all, to the best of my knowledge. What troubles me is the non-transitivity of the ratings. You can see the non-transitivity in action all the time in Arimaa. Sometimes a newcomer will get stuck on BombP1 on the ladder, and lose thirty times in a row, driving their rating down to, say, 1200. Meanwhile a newcomer who figures out a technique for beating BombP1 might win thirty in a row and pump their rating to 1800. But the gap between the two humans is not 600 points. They are each properly rated relative to BombP1, but improperly rated relative to each other. That is to say, the ratings are not transitive. I would love there to be some mechanism whereby a ton of games against a single opponent would have a reduced impact on one's rating, in order to mitigate the effects of non-transitivity. 
In my mind it seems roughly correct to weight games against a single opponent by the square root of the number of games so that, for example, 25 games against one opponent would have the same impact as one game each against five different opponents. But I recognize that reducing the weight of certain games is a kludge, and I wish that I could think of a more elegant way to deal with non-transitivity. I'd love to hear alternative suggestions. Non-transitivity is such a huge problem, though, that I don't think it can be ignored. |
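Fritzlein's square-root idea could be realized by giving the k-th rated game against the same opponent a weight of sqrt(k) - sqrt(k-1), so that the weights telescope to sqrt(n) for n games. This is a hypothetical sketch of the scheme, not anything the server implements:

```python
import math

def game_weight(k):
    """Weight of the k-th rated game against the same opponent
    (k = 1, 2, ...).  The weights telescope, so n games against one
    opponent carry a combined weight of sqrt(n)."""
    return math.sqrt(k) - math.sqrt(k - 1)

# 25 games against one opponent ~ 5 games' worth of evidence,
# matching the example in the post (sqrt(25) = 5).
total = sum(game_weight(k) for k in range(1, 26))
```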
||||||
Title: Re: Experimental new rating system Post by aaaa on May 26th, 2006, 4:50pm on 05/26/06 at 15:45:46, Fritzlein wrote:
Once again I would like to point out that Glicko-2 was not originally intended to be applied on a game-by-game basis. Glicko was modified to do so for the Free Internet Chess Server with Glickman's knowledge and I was curious enough to find out if the same was possible for Glicko-2. on 05/26/06 at 15:45:46, Fritzlein wrote:
If you look at the fifth column, you can see that Belbo has been given a higher volatility than blue22. For some reason, the system thinks Belbo's performance is less consistent than blue22's (0.0006193388697 vs 0.0005354651612). on 05/26/06 at 15:45:46, Fritzlein wrote:
Probably due to the large number of bot-bot matches taken into account, maximizing the prediction power of the system has resulted in the rating deviation growing very fast if a player doesn't play in a while. Depending on one's volatility, it will take only about 20 days before one's rating deviation becomes the maximum again. I've already been experimenting with excluding bot-bot matches from consideration. |
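For a rough sanity check of the 20-day figure: Glicko-2 increases the deviation between rating periods via φ'² = φ² + σ², which with one-second rating periods means the variance grows by about σ² per second (my assumption from the post's description). The starting deviation below is chosen purely for illustration:

```python
# Time for the rating deviation phi to climb back to the cap, assuming
# Glicko-2's per-period variance increase phi'^2 = phi^2 + sigma^2
# applied once per one-second rating period.
PHI_MAX = 0.832506      # maximum deviation seen in the tables above
sigma = 0.0006          # a typical volatility from the tables

phi = 0.3               # an illustrative well-established deviation
seconds = (PHI_MAX**2 - phi**2) / sigma**2
days = seconds / 86400  # roughly 19 days for these numbers
```

For these inputs the answer lands near the quoted 20 days; a higher starting deviation or volatility shortens it.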
||||||
Title: Re: Experimental new rating system Post by Ryan_Cable on May 26th, 2006, 10:20pm This does support my belief that our ratings are currently too compressed. Other than that, I'm not clear on what the advantage of this system is over our current system. Our current system is fairly easy to understand, and anyone can calculate the possible rating changes that would result from playing a given opponent. I would not want to give that up unless there is a substantial improvement in rating accuracy. |
||||||
Title: Re: Experimental new rating system Post by aaaa on May 27th, 2006, 9:57am Here's the result again after the choice of parameters has been optimized for rated games including at least one human. Tell me if this one is more sane.
player rating (old style) rating deviation (old style) volatility
Fritzlein 5.482469551 (2452.402549) 0.8047089398 (139.7922667) 0.000538855377
99of9 4.73692475 (2322.888146) 0.8288227 (143.981256) 0.0005558081298
robinson 4.435887799 (2270.59267) 0.7746629858 (134.5727496) 0.0005238959024
Adanac 4.401616739 (2264.639176) 0.7049974879 (122.4706126) 0.0004748660898
PMertens 4.212564795 (2231.797489) 0.8288227 (143.981256) 0.0006714964567
Ryan_Cable 4.147041625 (2220.414948) 0.8288227 (143.981256) 0.0004148334762
Belbo 4.058155186 (2204.973791) 0.7336598637 (127.4497775) 0.0005050846755
mouse 3.839142456 (2166.927381) 0.8288227 (143.981256) 0.0004747277447
RonWeasley 3.555657933 (2117.681074) 0.8288227 (143.981256) 0.0005129970293
Arimanator 3.55381375 (2117.360706) 0.8288227 (143.981256) 0.0006905204892
omar 3.399174901 (2090.497186) 0.8288227 (143.981256) 0.0005327536732
chessandgo 3.381264438 (2087.385819) 0.4764257499 (82.76363314) 0.0006149328405
naveed 3.282873511 (2070.293564) 0.8288227 (143.981256) 0.0006009688416
blue22 3.212482055 (2058.065315) 0.7934868976 (137.8427982) 0.0004544315336
bot_Bomb2005CC 2.970377488 (2016.007442) 0.6806065302 (118.2334691) 0.000455604352
OLTI 2.94866222 (2012.235114) 0.8288227 (143.981256) 0.0004670586044
bot_Bomb2005Blitz 2.838656284 (1993.125125) 0.6355597776 (110.4080463) 0.0007068470581
bot_Bomb2005Fast 2.82836776 (1991.337825) 0.8188877809 (142.2553838) 0.0005046538155
bot_Bomb2005P2 2.677105717 (1965.060915) 0.4418521777 (76.75758823) 0.0004078946749
omarFast 2.648254447 (1960.048936) 0.8288227 (143.981256) 0.0005759785643
thorin 2.530981686 (1939.67657) 0.8288227 (143.981256) 0.0004996524011
bleitner 2.474353922 (1929.83932) 0.8288227 (143.981256) 0.0004316956614
bot_speedy 2.434224123 (1922.868059) 0.8288227 (143.981256) 0.0005830126789
jdb 2.427799948 (1921.752066) 0.8288227 (143.981256) 0.0005177591978
megamau 2.420580392 (1920.4979) 0.8288227 (143.981256) 0.0006022504608
bot_Clueless2005Fast 2.388905996 (1914.995494) 0.6231364636 (108.2498956) 0.0005694958969
bot_lightning 2.342006325 (1906.848186) 0.8288227 (143.981256) 0.0005799969339
frostlad 2.256872674 (1892.058956) 0.7260031127 (126.1196635) 0.0005255536504
Swynndla 2.210853008 (1884.064521) 0.7130838884 (123.8753643) 0.0005439431955
BlackKnight 2.1852398 (1879.615051) 0.8288227 (143.981256) 0.0005633310056
bot_GnoBot2005Fast 2.095862689 (1864.088655) 0.712771455 (123.8210891) 0.0005831434752
bot_Clueless2005P2 2.072370715 (1860.007681) 0.6243452409 (108.4598817) 0.0005399026625
bot_Clueless2005Blitz 2.06877531 (1859.383096) 0.7454407812 (129.4963325) 0.0006259613599
bot_Clueless2005CC 2.025541811 (1851.872667) 0.7638517253 (132.6946412) 0.0005358564737
bot_Arimaanator 1.960662587 (1840.601991) 0.7346765571 (127.6263952) 0.0003402609408
kamikazeking 1.882753526 (1827.0678) 0.7424657372 (128.9795144) 0.0004977506049
bot_Clueless2006P2 1.863160001 (1823.664056) 0.7153645161 (124.2715499) 0.0006227940527
ytri 1.849213454 (1821.241293) 0.8288227 (143.981256) 0.0004792753914
filerank 1.821360335 (1816.40271) 0.8288227 (143.981256) 0.0004887645442
Aamir 1.796505038 (1812.084903) 0.8288227 (143.981256) 0.0005229366749
haizhi 1.7666057 (1806.890856) 0.8288227 (143.981256) 0.0007029900686
bot_haizhi 1.581152235 (1774.674288) 0.8288227 (143.981256) 0.000576510305
bot_Bomb2004CC 1.521184847 (1764.256885) 0.8288227 (143.981256) 0.0005199753632
grey_0x2A 1.497306546 (1760.108799) 0.8288227 (143.981256) 0.0005169386559
clauchau 1.483899672 (1757.779786) 0.8288227 (143.981256) 0.0004723108198
deselby 1.456556559 (1753.029801) 0.8288227 (143.981256) 0.0005515469025
CeeJay 1.450091942 (1751.906782) 0.8288227 (143.981256) 0.0006684186778
6sense 1.445184387 (1751.054252) 0.8288227 (143.981256) 0.0005290605304
bot_Clueless2006Fast 1.440374412 (1750.218674) 0.7802854833 (135.5494775) 0.0005738607317
Paul 1.431647882 (1748.70272) 0.8288227 (143.981256) 0.0005009657358 |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on May 29th, 2006, 10:49am I decided to produce ratings based purely on rated human vs. human games (1553 of them so far), due to discussion in another thread. However, when I started to implement an idea we had discussed earlier to deal with non-transitivity, I got distracted by a different, older idea I had of making the ratings retrospective. All rating systems I know of use an "update and forget" method. After each game (or each rating period) players get new ratings, but how they reached their ratings is thrown away. They carry only their rating forward, and no history. Forgetting history might have some disadvantages, in my estimation. Suppose, for example, that chessandgo joins the server and for his first ten human games loses to me ten times. The rating system gives me hardly any credit for my wins. If chessandgo then beats a lot of other people to push his rating up, we know after the fact that he was pretty good all along, but that doesn't help me any. I only get points for beating a sub-1500 player even if he was increasing towards 1900 strength by the end of our first ten games. So I created a new historical system (or one might say retrospective system) to counteract this trend. It remembers all the old game results, and if someone does better (or worse) in the future, it retrospectively adjusts their ratings up (or down) in the past, as well as retrospectively adjusting the awards and penalties to their opponents. There's actually just one formula in the FRIAR system (Fritz's Retrospectively Iterated Arimaa Ratings): Your rating as of any game is the average of your rating from the game before and your rating from the game after, plus the award/penalty for the game itself. The game award/penalty is calculated from the same formula as standard Elo ratings with a k-factor of 15, i.e. 
15 * (score - 1/(1+10^((Ropp - Rmine)/400)))
If it is a player's first game, his "rating from the game before" is 1500. If it is a player's last game, then he just gets the game award tacked on to the previous game's rating. To calculate the ratings to match this formula, I just iterated a bunch of times. The first interesting point is that the ratings are much more volatile than standard Elo ratings with a k-factor of 32. The second interesting point is that the ratings converge glacially slowly. I did 200 iterations overnight, but I suspect that the extreme ratings would push out an additional hundred points if only I could do 2000 iterations. Unfortunately, my code is dog-slow because all parameters are stored (and looked up) in MS Access tables. If someone did this properly with a C array and some pointers, it would probably take a second per iteration instead of the minute per iteration it took me. So here are the not-really-converged ratings according to FRIAR, based only on 1553 hvh rated games, and compared to the current server ratings:
Name FRIAR Server
Fritzlein 2320 2309
Adanac 2245 2177
robinson 2230 2148
99of9 2212 2169
Belbo 2172 2002
PMertens 2115 2086
Ryan_Cable 2085 2130
chessandgo 2052 2015
omar 2050 1947
blue22 1989 2005
Swynndla 1989 1790
RonWeasley 1979 1941
BlackKnight 1918 1833
naveed 1876 1956
jdb 1875 1796
OLTI 1850 1958
Spunk 1750 1472
mouse 1728 2051
KT2006 1715 1657
frostlad 1715 1807
seanick 1702 1537
grey_0x2A 1692 1709
Arimanator 1689 2035
kamikazeking 1668 1751
thorin 1654 1895
megamau 1649 1788
Belbo has a significantly higher rating under FRIAR. This makes complete sense because he had a stellar result in last year's postal tourney, and has hardly played humans since then, except for the four games he has already won in this year's postal. His reduced server rating is due to losing a few to BombFast while training for the WC, and FRIAR ignores such games. 
Swynndla also gets a huge boost in FRIAR from beating tons of different human players, even though many were newcomers. He may therefore be somewhat overrated in FRIAR, but I don't mind seeing that the same strategy that works in Player of the Month also boosts the FRIAR rating. I'm pleased that FRIAR rates jdb and naveed about the same, despite their divergent server ratings. I had never heard of Spunk before, but he had a good record in the very early days of the server against omar, who later turned out to be very good. Then when the early bots came on-line, Spunk lost all his points to those bots, then left. The FRIAR rating for Spunk actually nearly matches his server rating from before the time he started to play bots. I'm sure seanick will be happy to note that FRIAR respects his record against human opponents and ignores his string of losses to tough bots. FRIAR gives a huge rating penalty to mouse relative to mouse's server rating. This reflects the fact that mouse has only played 12 rated games against humans ever. He has a 6-6 record against fairly tough opposition, but it simply isn't enough games to pull away from 1500 very far. Arimanator, in contrast, has played enough games against humans to establish a rating, but his 22-46 record doesn't put him very high in the FRIAR rankings. His high server rating is attributable largely to bot-bashing. I was surprised clauchau didn't make the list of top players, but after peaking at 1899, he dropped back to 1626. That goes to show what happens if you don't keep up with advances in Arimaa theory. Haizhi, filerank, ytri, and some other players with a decent server ranking are invisible to FRIAR because they have played no games or hardly any games against humans. Thorin will show up in the rankings much more clearly once the current postal tournament is over, I guarantee. On the whole, I don't think FRIAR ratings are any more accurate than the server ratings in terms of predicting future game outcomes. 
Nevertheless, I think FRIAR admirably meets the goals of a pure-human rating to go alongside the standard server rating. |
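One relaxation pass of the scheme described above could be sketched as follows (a sketch built from the formula in the post, not Fritzlein's actual MS Access implementation; the history and ratings data structures are my own hypothetical choices):

```python
def expected(r_me, r_opp):
    """Standard Elo expected score for r_me against r_opp."""
    return 1 / (1 + 10 ** ((r_opp - r_me) / 400))

def friar_pass(hist, ratings, k=15):
    """One relaxation pass over every player's game history.

    hist[p] is player p's chronological game list; each entry is
    (opp, j, score): the opponent's name, the index of this game in
    the opponent's own history, and p's score (1 win, 0 loss).
    ratings[p][i] is p's rating as of his i-th game, updated in place.
    """
    for p, games in hist.items():
        for i, (opp, j, score) in enumerate(games):
            before = ratings[p][i - 1] if i > 0 else 1500.0
            if i + 1 < len(games):
                # Interior game: anchor to the average of the
                # neighboring ratings, as in the FRIAR formula.
                base = (before + ratings[p][i + 1]) / 2
            else:
                # Last game: award tacked onto the previous rating.
                base = before
            ratings[p][i] = base + k * (score - expected(base, ratings[opp][j]))
```

Iterating `friar_pass` until the ratings stop moving approaches the fixed point described in the post. This version updates in place rather than strictly from the previous iteration's values, which changes the intermediate passes but not the fixed point being sought.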
||||||
Title: Re: Experimental new rating system Post by Ryan_Cable on May 29th, 2006, 2:09pm I don't understand how the retrospective iteration works. Are you assuming that everyone has constant skill over time? That seems like a particularly bad idea. I am pleasantly surprised to see how high my HvH rating is. I thought I was significantly more overrated than that. |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on May 29th, 2006, 2:37pm on 05/29/06 at 14:09:15, Ryan_Cable wrote:
No, no, I'm not holding skill constant over time. Your rating at a given time is influenced most by the games very near it, and less and less by games far before it or far after it. So your rating at the time of your second game is hardly influenced at all by whether your hundredth game was a win or a loss. To each player of each game, I assign a rating that is supposed to represent his skill at the time of that game. The assumption is that his skill at that time will be approximately the average of his skill the game before and the game after. Take my last three games, for example: 32240 Ryan_Cable vs. Fritzlein 32276 Fritzlein vs. chessandgo 32282 Fritzlein vs. Swynndla As part of my iterative pass through the ratings, I want to re-calculate how strong I was when I played game 32276. I look ahead and see I was rated 2310 in game 32282, but only rated 2302 in game 32240. My rating should be near the average of 2306. I beat chessandgo, who was rated 2052, so I recalculate my rating in game 32276 as 2306 + 15*(1 - 1/(1+10^((2052-2306)/400))) = 2308.8221 When the ratings stabilize after many, many iterations, each player's rating in each game will be exactly equal to the average of his ratings before and after, plus the bonus (penalty) for winning (losing) the game in question. The list I gave showed only the ratings of each player at the end of the line; I apparently peaked about 150 points higher than my final rating. Long winning streaks or losing streaks will cause your rating to whip around even more in the FRIAR system than in the current server system. There is probably a much cleverer way to reach convergence than by making pass after pass of setting each rating in each game to what it would have been given the other ratings of the previous iteration. My coding ability was only adequate for a simplistic solution that doesn't run fast enough to converge in a reasonable amount of time. 
:-( In C on a fast computer, however, the simplistic iteration might be adequate. |
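The worked example above can be checked directly with a few lines of Python:

```python
# Reproduce the worked example: Fritzlein's rating in game 32276,
# with neighboring ratings 2302 and 2310, opponent rated 2052, k = 15.
K = 15
base = (2302 + 2310) / 2                       # average of before/after
exp_score = 1 / (1 + 10 ** ((2052 - base) / 400))
new_rating = base + K * (1 - exp_score)
# new_rating is approximately 2308.82, matching the 2308.8221 in the post
```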
||||||
Title: Re: Experimental new rating system Post by chessandgo on May 29th, 2006, 5:50pm on 05/29/06 at 10:49:10, Fritzlein wrote:
I'm fortunate not to have you as a math teacher: let chessandgo and BlackKnight be real numbers, then chessandgo^2 + BlackKnight = ... it would make equations really hard to write down :) |
||||||
Title: Re: Experimental new rating system Post by seanick on May 31st, 2006, 1:21am Yeah, I am all for this new rating system, heh heh... What about something that kept track of time taken? Would the best players' games take longer per move, relative to the time scale, than less highly rated players' games? Does the line go up or down in terms of % of available time per move when playing someone of equal rating? Are those numbers easily mineable, or are they somewhat obscured within various sources? I am not a linux user, but I have begun to study some things analytically with code on win32, so such things would interest me except for the problem of having to use linux. I wouldn't mind, but ... my employer would have a few reservations about the idea. |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on May 31st, 2006, 9:18am One problem with the server ratings (which FRIAR doesn't address in the slightest) is that different humans seem to benefit differently from extra thinking time. Some players, notably Belbo and Omar, are tigers at a slow time control or postally, but tend to fall apart in fast games. Other players, most notably kamikazeking and PMertens, can play great moves even at blitz speeds, but don't seem to get very much better given more time. (Actually, PMertens doesn't even use all of his time given more time.) In my opinion it isn't a good idea to say the players who can move faster are the better players. There are different kinds of skill. I'd rather say that some players are good at blitz and other players are good at postal games. In another thread we discussed having ratings reflect time control. http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1103741634;start=0#0 Note that back then the fastest time control available was 30 seconds per move, and it was already an issue! |
||||||
Title: Re: Experimental new rating system Post by aaaa on May 31st, 2006, 12:33pm You might be interested in this article (http://www.chessbase.com/newsdetail.asp?newsid=562), in which it is proposed that games at different time controls be given different weights. |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 6th, 2006, 12:11am I decided that a k-factor of 15 was making the FRIAR ratings way too volatile; I lowered it to 10. I ran the numbers again, this time letting them converge a bit longer. Also I added in last week's games, 23 more. (Sorry, chessandgo, your four big wins from Sunday and Monday aren't there yet; you would surely be over 2100 with them included.) The FRIAR top 25, with number of games played:
rate games username
2417 215 Fritzlein
2236 201 99of9
2228 100 Adanac
2194 265 PMertens
2182 201 robinson
2149 121 Belbo
2086 111 Ryan_Cable
2034 116 omar
2031 126 jdb
2015 67 chessandgo
2013 103 Swynndla
1963 73 blue22
1961 19 RonWeasley
1912 223 naveed
1897 79 OLTI
1894 18 BlackKnight
1765 66 kamikazeking
1742 22 frostlad
1714 13 Spunk
1701 68 Arimanator
1691 12 mouse
1680 23 grey_0x2A
1645 16 KT2006
1644 49 megamau
1639 43 clauchau |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 6th, 2006, 1:14am And now the fun part: a graph of the historical FRIAR ratings. (Only the top 7 by number of hvh games get personalized colors; sorry!) http://www.math.umn.edu/~juhn0008/HistoricalFRIAR.png Note how volatile the ratings are even with the k-factor reduced to 10. On the official server ratings I retained the top ranking even when I tied for fourth in the 2006 World Championship, but the FRIAR ratings have me dipping below robinson, Adanac, and PMertens, i.e. all three of the WC medalists. At the same time that FRIAR ratings are volatile, note that people have to play a significant number of games to move far from 1500. In this sense the volatility of FRIAR is opposite to that of the server. On the server your rating changes a lot at first, and slowly later. With FRIAR your rating changes slowly until you have played fifteen games or so, but later on winning streaks (or losing streaks) have a bigger effect than they do on the server. I note that in August 2004, around the time I joined the server, FRIAR considered 99of9 to be the most dominant player of any time period. My current ratings lead of 180 points looks wimpy compared to the 350-point lead 99of9 had back then. |
||||||
Title: Re: Experimental new rating system Post by chessandgo on Jun 6th, 2006, 9:12am Great!!! I had the feeling that this forum had not been used for ages >:( ... thanks for putting some life back into it, Fritz! I see nothing but a big yellow line in there ;) |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 6th, 2006, 12:54pm on 05/31/06 at 12:33:56, aaaa wrote:
It is interesting that blitz chess games can be used to make classical chess ratings more predictive of classical chess results. I wonder, however, whether Sonas did any study of whether people have statistically distinguishable playing strengths at different time controls. For Arimaa there is clearly a difference in playing strength. See the eleventh post in this thread (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1117068449) |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 6th, 2006, 1:18pm on 05/26/06 at 16:50:03, aaaa wrote:
It's very interesting that maximizing predictive accuracy means that our uncertainty should stay high (or at least rapidly increase back to the maximum) and thus the ratings should remain "loose", i.e. subject to quick change. I wonder whether people really do change rapidly in playing strength from three weeks of inactivity. Certainly the server bots do not. I suspect that the reason "loose" ratings are more predictive is actually more due to intransitivity than due to rapidly-changing skill level. If I suddenly decide to pump up my rating by beating one bot over and over again, the rating system will be more predictive if it quickly forgets my history and adjusts to my new performance. Likewise if someone dominates the first dozen bots on the ladder before getting stuck on one they lose to twenty times in a row, the rating system will be more predictive if it quickly forgets the prior wins and lowers that player's rating to predict future losses. In other words, the worse the problem of intransitivity is, the more stupid slow-changing ratings appear to be compared to fast-changing ratings. Yet we can be sure that fast-changing ratings for bots (especially fixed-performance bots) are not reflecting underlying changes in skill. The changes in a bot's rating mostly reflect changes in who is dominating (or being dominated by) the bot at any given time, i.e. it is mostly reflecting the intransitivity of ratings. Of course these are just my speculations. Statistically verifying my thesis would be quite another matter. ;-) |
||||||
Title: Re: Experimental new rating system Post by mouse on Jun 6th, 2006, 3:29pm I think these rating calculations are interesting. And I believe they point out the need to have separate HvH, HvB and BvB ratings, especially since the players highlighted seem to be examples of players who have very different ratings against bots and humans. on 05/29/06 at 10:49:10, Fritzlein wrote:
|
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 19th, 2006, 1:36am I updated my database with 78 more rated hvh games and ran the FRIAR ratings again. Seeing the results made me realize that FRIAR ratings were way more volatile than I wanted. Winning streaks and losing streaks created huge swings. Chessandgo's rating, for example, went up and then down more than 100 points in a two-week span. To counteract this volatility, I reduced the k-factor further from 10 to 6. That would make a 100-point swing into only a 60-point swing. The scale remains the same (e.g. a rating difference of 200 points still should mean a 76% winning chance for the favorite), but each player's early ratings become all the more tied to 1500. It now takes about 25 games against humans for your FRIAR rating to really start floating freely. I guess that could be considered a feature: you stay near 1500 until there's quite a bit of proof you should be higher or lower. Here are the new FRIAR top 20, with number of games played:
rate games username
2387 218 Fritzlein
2240 203 99of9
2116 122 Belbo
2033 204 robinson
2026 110 Adanac
2024 115 Ryan_Cable
2020 279 PMertens
2007 95 chessandgo
2005 84 OLTI
1932 119 omar
1927 105 Swynndla
1914 130 jdb
1884 75 blue22
1882 223 naveed
1867 20 RonWeasley
1819 18 BlackKnight
1753 72 kamikazeking
1705 23 frostlad
1693 68 Arimanator
1673 13 thorin
I note Belbo jumped from 6th place to 3rd. He posted only a win over OLTI in the postal tournament, extending his winning streak to six games, and FRIAR responds to streaks. Also the people he jumped, namely Adanac, robinson, and PMertens, posted results of 4-5, 0-3, and 4-6 respectively against other top players, losing records which weren't compensated by a win or two against lower-ranked folks. 99of9 had only two results, postal wins over blue22 and robinson, which extends 99of9's winning streak to seven and widens his lead over any contenders for second place. 
Similarly I posted only three wins, extending my winning streak to nine. Even with the k-factor reduced to 6, FRIAR is very concerned to know, "What have you done for me lately?" Chessandgo leapfrogged jdb and omar, and helped drag OLTI up too by losing to OLTI twice while compiling a 9-4 combined record against PMertens, RyanCable, and Adanac. RonWeasley's only result was beating seanick, but Ron dropped in the rankings anyway because I reduced the k-factor. That binds him more closely to 1500, and he has only played 20 rated hvh games so far. I couldn't believe Ron has played only twenty games given the impact he has had around here, but the on-line game record agrees. The fact that Ron could play more if the client were more stable is by itself a powerful argument for improving the client. |
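The scale claim above (a 200-point gap still means about a 76% expected score for the favorite, whatever the k-factor) follows directly from the Elo logistic formula:

```python
# Expected score for a player rated `diff` points above his opponent.
# The k-factor scales how fast ratings move, but not this mapping
# from rating difference to expected score.
def expected_score(diff):
    return 1 / (1 + 10 ** (-diff / 400))

p = expected_score(200)   # approximately 0.76
```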
||||||
Title: Re: Experimental new rating system Post by RonWeasley on Jun 19th, 2006, 11:44am Quote:
Much like the impact a bludger has on a beater. I'm just batting practice most of the time, but if you don't look out, I might knock you off your broom! |
||||||
Title: Re: Experimental new rating system Post by chessandgo on Jun 19th, 2006, 8:11pm Karl, don't you feel that the more you improve your rating system, the closer its output gets to Omar's actual ratings? :P |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 19th, 2006, 10:40pm Well, the main thing I like about FRIAR right now compared to Omar's ratings is that FRIAR uses only hvh results. To get a good comparison, I would have to run Omar's ratings side-by-side with only hvh games as input. Then I could compare directly. Hmmm, actually that's not a bad idea... |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 26th, 2006, 5:38pm I updated FRIAR with 27 more games.
rate games username
2485 223 Fritzlein
2242 203 99of9
2223 106 chessandgo
2155 123 Belbo
2064 206 robinson
2061 86 OLTI
2060 115 Adanac
2044 282 PMertens
2003 119 Ryan_Cable
1943 105 Swynndla
1924 120 omar
1887 75 blue22
1884 223 naveed
1868 22 RonWeasley
1847 133 jdb
1819 18 BlackKnight
1732 73 kamikazeking
1708 23 frostlad
1693 68 Arimanator
1675 13 thorin
The big story is chessandgo jumping 5 more places. Chessandgo won against Adanac, seanick, RonWeasley, Ryan_Cable, jdb, robinson, PMertens, and seanick, while losing only three games, all to me. That incidentally also pushes my rating to a ridiculous high. OLTI gets a retrospective bounce from his previous victories over chessandgo, and from having not lost in a while. This week shows once again that FRIAR ratings are more volatile than server ratings. I gained 100 points for three wins! I might have to reduce the k-factor still further. |
||||||
Title: Re: Experimental new rating system Post by OLTI on Jun 28th, 2006, 4:24am I'm starting to like FRIAR ;D |
||||||
Title: Re: Experimental new rating system Post by DorianGaray on Jun 28th, 2006, 6:03am on 06/26/06 at 17:38:36, Fritzlein wrote:
Shouldn't that sole detail give you pause about the validity of your system? You've played for nearly two years, yet your rating is completely upset, and by what? Another player's latest 5 games! Ratings are only valid if they can accurately predict the likelihood of one player winning against another at any given time; otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that. |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 28th, 2006, 8:10am on 06/28/06 at 06:03:05, DorianGaray wrote:
Sure, I don't think the FRIAR ratings are very predictive. My idea for a new rating system is working rather poorly so far. I think I would have to do something to reduce the volatility to make the results more reasonable. Also I'm starting to not like how strongly everyone is tied to 1500 for their first few games. The main point of generating ratings based only on human games is that the standard server ratings can be distorted a great deal by bot-bashing, or alternatively by losing repeatedly to a bot that one can't beat. Both of these distortions seem reasonably common. Take your case for example. You have played 196 rated games against bots and 10 rated games against humans. I submit that your rating of 2228 doesn't tell us very much about your chances of winning against, say, 99of9, who is the player closest in rating to you. You might beat him 90% of the time, or you might lose to him 90% of the time; we just can't tell from your server rating which is based mostly on your games versus bots. Your ten results against humans (and their ratings) were L 1949 W 1442 L 2317 W 1483 W 1516 W 1384 L 1863 W 1408 W 1517 W 1518 The rating that predicts a result of 7-3 against this opposition is 1823. With a true strength of 2228, you would be expected to be 9-1 against this opposition, on average. This appears to be a situation where the server ratings are not very predictive, although we have no reliable idea how you would do against human opposition until you start playing more games against humans. Maybe if you played more human opponents you would justify a rating of 2200, or 2500, or anything. We simply can't predict very well on the basis of only ten games. I'd be interested in human-only ratings lists even if they were generated by a totally different methodology than FRIAR ratings. 
In my opinion it is ultimately going to be very hard to generate valid (or accurate, or predictive) ratings if games against bots are included at all, no matter what methodology is used. I agree, however, that in the case of FRIAR ratings, the cure is worse than the disease. There's no way I deserve to be rated 200 points ahead of everyone else based on my results against humans, and that's not the only visible distortion in FRIAR ratings. |
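The 1823 performance-rating figure above can be checked numerically: it is the rating whose total expected score against the listed opponents equals the actual 7-3 result. Below is a sketch under the assumption of the standard logistic Elo expectation (the exact method used in the post isn't stated), solving by bisection:

```python
# Sketch: find the performance rating that predicts an observed total score
# against a list of opponents. Opponent ratings and the 7-3 score are taken
# from the post above; the logistic expectation curve is an assumption.

def expected_score(rating: float, opp_rating: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def performance_rating(opponents, total_score, lo=0.0, hi=4000.0):
    """Bisect for the rating r whose summed expected score over the
    opponents equals total_score (the sum is monotone increasing in r)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, o) for o in opponents) < total_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

opponents = [1949, 1442, 2317, 1483, 1516, 1384, 1863, 1408, 1517, 1518]
print(performance_rating(opponents, 7))  # solves to roughly 1823, as quoted
```

Running the same solver with a total score of 9 shows why a 2228-rated player would be expected to go about 9-1 against this opposition.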
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jun 30th, 2006, 9:47pm I changed FRIAR again. To cut out the still-too-high volatility, I reduced the k-factor all the way to 2. I also reduced the too-strong tie to 1500: instead of assuming that everyone has a strength of 1500 at the start, I don't make any assumptions about initial strength. The tie to 1500 is instead 3 draws against a 1500-rated player at the beginning of everyone's game record. The new list, produced from exactly the same games as the old list, is
rate games username
2443 223 Fritzlein
2216 203 99of9
2182 106 chessandgo
2102 206 robinson
2096 123 Belbo
2095 115 Adanac
2064 282 PMertens
2020 119 Ryan_Cable
1988 86 OLTI
1933 22 RonWeasley
1911 18 BlackKnight
1904 120 omar
1904 223 naveed
1898 105 Swynndla
1869 75 blue22
1854 13 Spunk
1851 133 jdb
1810 73 kamikazeking
1776 13 thorin
1772 12 mouse
One improvement is that people on a recent winning streak (me, chessandgo, Belbo, OLTI) aren't overly rewarded, and the people performing a bit below par recently (robinson, Adanac) aren't overly punished. The more stable ratings mean the present ratings are affected by games further back. A second improvement is that RonWeasley jumps to a more reasonable level from not being so closely tied to 1500. A third, more subtle improvement is that Swynndla dips. He beat lots of newcomers, but since the newcomers themselves are less tied to 1500, they end up lower, and Swynndla therefore gets less boost from those victories. One possible criticism is that I am still rated too far ahead of 99of9. To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans. It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans. 99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans. Since the FRIAR ratings are not far from vs.-human performance ratings calculated from server ratings, I don't feel as bad about my inflated rating. On the whole I'm rather pleased with my latest version. Graph to follow. |
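The "three draws against a 1500-rated player" anchor can be illustrated by prepending virtual games before solving for a rating. This sketch only illustrates the anchoring idea, not FRIAR's actual fitting procedure (which isn't fully specified in the thread); it solves one player's rating against fixed opponent ratings:

```python
# Sketch of anchoring via virtual games: three draws vs. a 1500 player are
# prepended, and the rating is then whatever best fits the whole record.
# Assumption: standard logistic Elo expectation; solved here by bisection.

def expected_score(rating: float, opp_rating: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))

def rate_player(games, lo=0.0, hi=4000.0):
    """games: list of (opponent_rating, score) pairs. Returns the rating
    whose summed expected score matches the actual score over the record,
    including the three virtual draws that provide the only tie to 1500."""
    anchored = [(1500.0, 0.5)] * 3 + list(games)
    target = sum(score for _, score in anchored)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, opp) for opp, _ in anchored) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# With no real games the anchor alone pins the rating at 1500:
print(round(rate_player([])))  # -> 1500
```

The appeal of this scheme is that the pull toward 1500 is fixed at three games' worth of evidence, so it fades in relative weight as a player's real record grows, rather than being an assumption about initial strength.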
||||||
Title: Re: Experimental new rating system Post by chessandgo on Jun 30th, 2006, 11:05pm on 06/30/06 at 21:47:17, Fritzlein wrote:
I don't think you'll manage to design a rating system that doesn't put you on top with a huge lead :) Why don't you come and play to score a 50-0 instead of taking care of everyone's ratings? I'd be glad to be take 3 last losses ;) |
||||||
Title: Re: Experimental new rating system Post by chessandgo on Jun 30th, 2006, 11:06pm *to take the 3 last losses* |
||||||
Title: Re: Experimental new rating system Post by Fritzlein on Jul 1st, 2006, 9:08am Here's a graph of my latest FRIAR ratings over time: http://www.math.umn.edu/~juhn0008/HistoricalFRIAR2.png Compared to the more-volatile system before: http://www.math.umn.edu/~juhn0008/HistoricalFRIAR.png I think the new one is a more reasonable guess at how fast playing skill actually changes, don't you? |
||||||
Title: Re: Experimental new rating system Post by aaaa on Apr 1st, 2007, 9:58pm Real-time Glicko-2 was perhaps not such a hot idea after all. Since the original Glicko HAS found real-time applications, e.g. at the Free Internet Chess Server, it should have been obvious to experiment with that one first. The current rating system of Arimaa, although seemingly based on the same principles, strikes me as too much of a kludge. In that light it would be interesting to see how it compares to a more faithfully implemented Glicko system. Again, the system here is optimized based on every human-involved rated game (currently numbering 34577). Any comments or questions on these ratings?
player rating RD
Fritzlein 2536.589308 151.3183257
chessandgo 2444.57311 141.2896649
syed 2413.391102 96.44818116
DorianGaray 2345.602406 152.2
Belbo 2331.532722 152.2
PMertens 2324.733678 152.2
99of9 2314.738731 152.2
robinson 2246.385158 152.2
RonWeasley 2219.323314 152.2
Arimanator 2118.80046 152.2
jdb 2104.273778 152.2
blue22 2094.621401 136.7078874
bot_Bomb2005Fast 2092.449876 90.26640411
obiwan 2091.575049 121.5170209
Ryan_Cable 2083.344584 152.2
thorin 2081.512715 152.2
mouse 2076.716541 152.2
arimaa_master 2062.639947 115.1808582
omar 2049.651718 152.2
Brendan 2046.479086 139.1713486
bot_Bomb2005P2 2041.870531 106.0700486
OLTI 2038.889491 142.6923251
clauchau 2035.542834 152.2
kerdamdam 2019.697991 152.2
jawdirk 2017.579664 152.2
bot_Bomb2004CC 2009.561431 152.2
bot_GnoBot2005Blitz 2008.050303 96.53058166
Adanac 2007.836721 152.2
The_Jeh 1985.357159 152.2
UltraWeak 1979.211538 152.2
bot_Clueless2005P2 1963.922121 147.516796
omarFast 1950.756119 152.2
bot_Zombie 1947.281312 152.2
woh 1936.687461 152.2
bleitner 1927.015185 152.2
kamikazeking 1920.066142 152.2
bot_Bomb2005Blitz 1903.92491 151.2594177
bot_GnoBot2006Blitz 1893.79771 152.2
bot_speedy 1874.594987 152.2
BlackKnight 1870.604424 152.2
bot_lightning 1859.550822 152.2
Soter 1858.647498 93.07385538
bot_Bomb2005CC 1851.349992 152.2
Chegorimaa 1839.335515 98.60569559
bot_Clueless2005Fast 1839.173029 152.2
fritzlforpresident 1837.236962 152.2
petitprince 1825.132353 113.2571471
megamau 1824.555125 152.2
bot_Clueless2005Blitz 1818.806912 152.2
filerank 1817.020033 152.2 |
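For readers wanting to see what a "more faithfully implemented Glicko system" computes with the rating/RD pairs above, the core expectation step of Glickman's original Glicko weights the rating difference by the opponent's rating deviation, so results against uncertain opponents move a rating less. A sketch of just that step (the full system also updates each player's rating and RD per rating period):

```python
# Sketch of the Glicko expectation step: g(RD) attenuates the rating
# difference according to the opponent's rating deviation (RD).
import math

Q = math.log(10) / 400.0  # Glicko's scale constant

def g(rd: float) -> float:
    """Attenuation factor: 1 for a perfectly known opponent (RD = 0),
    shrinking toward 0 as the opponent's RD grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def expected_score(r: float, r_j: float, rd_j: float) -> float:
    """Expected score for a player rated r against opponent (r_j, RD_j)."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd_j) * (r - r_j) / 400.0))

# Equal ratings give 0.5 regardless of RD, and a large opponent RD pulls
# any expectation back toward 0.5:
print(expected_score(1500, 1500, 152.2))  # -> 0.5
```

This is why a beaten opponent with a large RD (like the many 152.2 entries above) contributes less rating evidence than the same result against an opponent whose RD has shrunk through frequent play.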
||||||
Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1! YaBB © 2000-2003. All Rights Reserved. |