Author |
Topic: Experimental new rating system (Read 9239 times) |
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #15 on: Jun 6th, 2006, 12:54pm » |
|
on May 31st, 2006, 12:33pm, aaaa wrote: You might be interested in this article, where it is proposed that games at different time controls are to be given different weights. |
| It is interesting that blitz chess games can be used to make classical chess ratings more predictive of classical chess results. I wonder, however, whether Sonas did any study of whether people have statistically distinguishable playing strengths at different time controls. For Arimaa there is clearly a difference in playing strength. See the eleventh post in this thread.
|
|
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #16 on: Jun 6th, 2006, 1:18pm » |
|
on May 26th, 2006, 4:50pm, aaaa wrote: maximizing the prediction power of the system has resulted in the rating deviation growing very fast if a player doesn't play in a while. Depending on one's volatility, it will take only about 20 days before one's rating deviation becomes the maximum again. |
| It's very interesting that maximizing predictive accuracy means that our uncertainty should stay high (or at least rapidly increase back to the maximum) and thus the ratings should remain "loose", i.e. subject to quick change. I wonder whether people really do change rapidly in playing strength from three weeks of inactivity. Certainly the server bots do not. I suspect that the reason "loose" ratings are more predictive is actually more due to intransitivity than due to rapidly-changing skill level. If I suddenly decide to pump up my rating by beating one bot over and over again, the rating system will be more predictive if it quickly forgets my history and adjusts to my new performance. Likewise if someone dominates the first dozen bots on the ladder before getting stuck on one they lose to twenty times in a row, the rating system will be more predictive if it quickly forgets the prior wins and lowers that player's rating to predict future losses. In other words, the worse the problem of intransitivity is, the more stupid slow-changing ratings appear to be compared to fast-changing ratings. Yet we can be sure that fast-changing ratings for bots (especially fixed-performance bots) are not reflecting underlying changes in skill. The changes in a bot's rating mostly reflect changes in who is dominating (or being dominated by) the bot at any given time, i.e. they mostly reflect the intransitivity of ratings. Of course these are just my speculations. Statistically verifying my thesis would be quite another matter.
|
|
|
|
|
mouse
Forum Senior Member
Arimaa player #784
Posts: 45
|
|
Re: Experimental new rating system
« Reply #17 on: Jun 6th, 2006, 3:29pm » |
|
I think these rating calculations are interesting. And I believe they point out the need to have separate HvH, HvB and BvB ratings, especially since the highlighted players seem to be examples of players who have very different ratings against bots and humans.
on May 29th, 2006, 10:49am, Fritzlein wrote:
Name          FRIAR  Server
Fritzlein     2320   2309
Adanac        2245   2177
robinson      2230   2148
99of9         2212   2169
Belbo         2172   2002
PMertens      2115   2086
Ryan_Cable    2085   2130
chessandgo    2052   2015
omar          2050   1947
blue22        1989   2005
Swynndla      1989   1790
RonWeasley    1979   1941
BlackKnight   1918   1833
naveed        1876   1956
jdb           1875   1796
OLTI          1850   1958
Spunk         1750   1472
mouse         1728   2051
KT2006        1715   1657
frostlad      1715   1807
seanick       1702   1537
grey_0x2A     1692   1709
Arimanator    1689   2035
kamikazeking  1668   1751
thorin        1654   1895
megamau       1649   1788
|
|
|
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #18 on: Jun 19th, 2006, 1:36am » |
|
I updated my database with 78 more rated hvh games and ran the FRIAR ratings again. Seeing the results made me realize that FRIAR ratings were way more volatile than I wanted. Winning streaks and losing streaks created huge swings. Chessandgo's rating, for example, went up and then down more than 100 points in a two-week span.

To counteract this volatility, I reduced the k-factor further from 10 to 6. That would make a 100-point swing into only a 60-point swing. The scale remains the same (e.g. a rating difference of 200 points still should mean a 76% winning chance for the favorite), but each player's early ratings become all the more tied to 1500. It now takes about 25 games against humans for your FRIAR rating to really start floating freely. I guess that could be considered a feature: you stay near 1500 until there's quite a bit of proof you should be higher or lower.

Here are the new FRIAR top 20, with number of games played:

rate  games  username
2387  218    Fritzlein
2240  203    99of9
2116  122    Belbo
2033  204    robinson
2026  110    Adanac
2024  115    Ryan_Cable
2020  279    PMertens
2007  95     chessandgo
2005  84     OLTI
1932  119    omar
1927  105    Swynndla
1914  130    jdb
1884  75     blue22
1882  223    naveed
1867  20     RonWeasley
1819  18     BlackKnight
1753  72     kamikazeking
1705  23     frostlad
1693  68     Arimanator
1673  13     thorin

I note Belbo jumped from 6th place to 3rd. He posted only a win over OLTI in the postal tournament, extending his winning streak to six games, and FRIAR responds to streaks. Also the people he jumped, namely Adanac, robinson, and PMertens, posted results of 4-5, 0-3, and 4-6 respectively against other top players, losing records which weren't compensated by a win or two against lower-ranked folks.

99of9 had only two results, postal wins over blue22 and robinson, which extends 99of9's winning streak to seven and widens his lead over any contenders for second place. Similarly I posted only three wins, extending my winning streak to nine. Even with the k-factor reduced to 6, FRIAR is very concerned to know, "What have you done for me lately?"

Chessandgo leapfrogged jdb and omar, and helped drag OLTI up too by losing to OLTI twice while compiling a 9-4 combined record against PMertens, Ryan_Cable, and Adanac.

RonWeasley's only result was beating seanick, but Ron dropped in the rankings anyway because I reduced the k-factor. That binds him more closely to 1500, and he has only played 20 rated hvh games so far. I couldn't believe Ron has played only twenty games given the impact he has had around here, but the on-line game record agrees. The fact that Ron could play more if the client were more stable is by itself a powerful argument for improving the client.
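For readers who want to check the scale quoted in this post, here is a minimal sketch. It assumes the standard Elo logistic curve (which is consistent with the 76% figure for a 200-point gap) and a plain update proportional to the k-factor; the thread does not spell out FRIAR's actual update rule, so the function names and the update form below are illustrative assumptions only.

```python
def expected_score(rating_diff: float) -> float:
    """Standard Elo expected score for a player rated rating_diff points higher."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def rating_change(k: float, actual: float, expected: float) -> float:
    """Hypothetical linear update: the change is proportional to the k-factor.

    This is the textbook Elo form, assumed here for illustration; it is not
    confirmed to be what FRIAR does internally.
    """
    return k * (actual - expected)


# A 200-point favorite is expected to score about 76%.
print(round(expected_score(200), 2))  # 0.76

# Under a linear update, cutting k from 10 to 6 scales every swing by 0.6,
# which is how a 100-point swing would become a 60-point swing.
swing_ratio = rating_change(6, 1.0, 0.5) / rating_change(10, 1.0, 0.5)
print(swing_ratio)
```

Under these assumptions the size of every rating swing scales linearly with the k-factor, while the meaning of a given rating difference stays fixed.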
|
|
|
|
|
RonWeasley
Forum Guru
Harry's friend (Arimaa player #441)
Posts: 882
|
|
Re: Experimental new rating system
« Reply #19 on: Jun 19th, 2006, 11:44am » |
|
Quote: I couldn't believe Ron has played only twenty games given the impact he has had around here, |
| Much like the impact a bludger has on a beater. I'm just batting practice most of the time, but if you don't look out, I might knock you off your broom!
|
|
|
|
|
chessandgo
Forum Guru
Arimaa player #1889
Posts: 1244
|
|
Re: Experimental new rating system
« Reply #20 on: Jun 19th, 2006, 8:11pm » |
|
Karl, don't you feel like the more you improve your rating system, the closer the output gets to Omar's actual ratings?
|
|
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #21 on: Jun 19th, 2006, 10:40pm » |
|
Well, the main thing I like about FRIAR right now compared to Omar's ratings is that FRIAR uses only hvh results. To get a good comparison, I would have to run Omar's ratings side-by-side with only hvh games as input. Then I could compare directly. Hmmm, actually that's not a bad idea...
|
|
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #22 on: Jun 26th, 2006, 5:38pm » |
|
I updated FRIAR with 27 more games.

rate  games  username
2485  223    Fritzlein
2242  203    99of9
2223  106    chessandgo
2155  123    Belbo
2064  206    robinson
2061  86     OLTI
2060  115    Adanac
2044  282    PMertens
2003  119    Ryan_Cable
1943  105    Swynndla
1924  120    omar
1887  75     blue22
1884  223    naveed
1868  22     RonWeasley
1847  133    jdb
1819  18     BlackKnight
1732  73     kamikazeking
1708  23     frostlad
1693  68     Arimanator
1675  13     thorin

The big story is chessandgo jumping 5 more places. Chessandgo won against Adanac, seanick, RonWeasley, Ryan_Cable, jdb, robinson, PMertens, and seanick, while losing only three games, all to me. That incidentally also pushes my rating to a ridiculous high. OLTI gets a retrospective bounce from his previous victories over chessandgo, and from having not lost in a while.

This week shows once again that FRIAR ratings are more volatile than server ratings. I gained 100 points for three wins! I might have to reduce the k-factor still further.
|
« Last Edit: Jun 26th, 2006, 5:39pm by Fritzlein » |
|
|
|
OLTI
Forum Full Member
Arimaa player #1034
Posts: 25
|
|
Re: Experimental new rating system
« Reply #23 on: Jun 28th, 2006, 4:24am » |
|
I'm starting to like FRIAR
|
|
|
|
|
DorianGaray
Forum Guru
Arimaa player #1210
Posts: 55
|
|
Re: Experimental new rating system
« Reply #24 on: Jun 28th, 2006, 6:03am » |
|
on Jun 26th, 2006, 5:38pm, Fritzlein wrote: ...That incidentally also pushes my rating to a ridiculous high... |
| Shouldn't that sole detail give you pause about the validity of your system? You've played for nearly two years, yet your rating is completely upset, by what? Another player's latest 5 games! Ratings are only valid if they can accurately predict the likelihood that one player has of winning against another at any given time; otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that.
|
« Last Edit: Jun 28th, 2006, 6:04am by DorianGaray » |
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #25 on: Jun 28th, 2006, 8:10am » |
|
on Jun 28th, 2006, 6:03am, DorianGaray wrote: Ratings are only valid if they can predict accurately the likelihood that one player has to win against another at any given time, otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that. |
| Sure, I don't think the FRIAR ratings are very predictive. My idea for a new rating system is working rather poorly so far. I think I would have to do something to reduce the volatility to make the results more reasonable. Also I'm starting to not like how strongly everyone is tied to 1500 for their first few games.

The main point of generating ratings based only on human games is that the standard server ratings can be distorted a great deal by bot-bashing, or alternatively by losing repeatedly to a bot that one can't beat. Both of these distortions seem reasonably common.

Take your case for example. You have played 196 rated games against bots and 10 rated games against humans. I submit that your rating of 2228 doesn't tell us very much about your chances of winning against, say, 99of9, who is the player closest in rating to you. You might beat him 90% of the time, or you might lose to him 90% of the time; we just can't tell from your server rating, which is based mostly on your games versus bots.

Your ten results against humans (and their ratings) were

L 1949, W 1442, L 2317, W 1483, W 1516, W 1384, L 1863, W 1408, W 1517, W 1518

The rating that predicts a result of 7-3 against this opposition is 1823. With a true strength of 2228, you would be expected to be 9-1 against this opposition, on average. This appears to be a situation where the server ratings are not very predictive, although we have no reliable idea how you would do against human opposition until you start playing more games against humans. Maybe if you played more human opponents you would justify a rating of 2200, or 2500, or anything. We simply can't predict very well on the basis of only ten games.

I'd be interested in human-only ratings lists even if they were generated by a totally different methodology than FRIAR ratings. In my opinion it is ultimately going to be very hard to generate valid (or accurate, or predictive) ratings if games against bots are included at all, no matter what methodology is used.

I agree, however, that in the case of FRIAR ratings, the cure is worse than the disease. There's no way I deserve to be rated 200 points ahead of everyone else based on my results against humans, and that's not the only visible distortion in FRIAR ratings.
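The 7-3 and 9-1 figures in this post can be reproduced with a short sketch. It assumes the standard Elo logistic expectation and finds the performance rating by bisection; the function names and the search bounds are illustrative choices, not anything taken from the server or from FRIAR.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def expected_total(rating: float, opponents: list[float]) -> float:
    """Expected number of points scored against the whole list of opponents."""
    return sum(expected_score(rating, opp) for opp in opponents)


def performance_rating(opponents: list[float], score: float,
                       lo: float = 0.0, hi: float = 4000.0) -> float:
    """Rating whose expected total equals `score`, found by bisection.

    expected_total is strictly increasing in the rating, so bisection works.
    The bounds 0 and 4000 are arbitrary assumptions wide enough for this data.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_total(mid, opponents) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Opponent ratings from the ten human games listed in the post (7 wins, 3 losses).
OPPONENTS = [1949, 1442, 2317, 1483, 1516, 1384, 1863, 1408, 1517, 1518]

print(performance_rating(OPPONENTS, 7))   # rating that predicts 7-3
print(expected_total(2228, OPPONENTS))    # expected wins at a strength of 2228
```

Under these assumptions the first value should land within a few points of the 1823 quoted above, and the second should be very close to 9 expected wins out of 10.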
|
|
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #26 on: Jun 30th, 2006, 9:47pm » |
|
I changed FRIAR again. To cut out the still-too-high volatility, I reduced the k-factor all the way to 2. Also, to reduce the too-strong tie to 1500, instead of assuming that everyone has a strength of 1500 at the start, I don't make any assumptions about initial strength. The tie to 1500 is rather 3 draws against a 1500-rated player at the beginning of everyone's game record.

The new list, produced from exactly the same games as the old list, is

rate  games  username
2443  223    Fritzlein
2216  203    99of9
2182  106    chessandgo
2102  206    robinson
2096  123    Belbo
2095  115    Adanac
2064  282    PMertens
2020  119    Ryan_Cable
1988  86     OLTI
1933  22     RonWeasley
1911  18     BlackKnight
1904  120    omar
1904  223    naveed
1898  105    Swynndla
1869  75     blue22
1854  13     Spunk
1851  133    jdb
1810  73     kamikazeking
1776  13     thorin
1772  12     mouse

One improvement is that people on a recent winning streak (me, chessandgo, Belbo, OLTI) aren't overly rewarded, and the people performing a bit below par recently (robinson, Adanac) aren't overly punished. The more stable ratings mean the present ratings are affected by games further back.

A second improvement is that RonWeasley jumps to a more reasonable level from not being so closely tied to 1500.

A third, more subtle improvement is that Swynndla dips. He beat lots of newcomers, but since the newcomers themselves are less tied to 1500, they end up lower, and Swynndla therefore gets less boost from those victories.

One possible criticism is that I am still rated too far ahead of 99of9. To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans. It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans. 99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans. Since the FRIAR ratings are not far from the vs. human performance ratings calculated from server ratings, I don't feel as bad about my inflated rating.

On the whole I'm rather pleased with my latest version. Graph to follow.
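To see why three draws against a 1500-rated player act as a weak tie to 1500, here is a rough static sketch. It is not FRIAR itself (whose time-dependent update is not described in the thread); it only assumes the standard Elo expectation and pads a player's record with virtual draws before solving for a performance rating. All names and defaults are hypothetical.

```python
import math


def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def performance_rating(opponents: list[float], score: float) -> float:
    """Rating whose expected total equals `score` (bisection on 0..4000)."""
    lo, hi = 0.0, 4000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        total = sum(expected_score(mid, opp) for opp in opponents)
        if total < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def anchored_rating(opponents: list[float], score: float,
                    anchor: float = 1500.0, virtual_draws: int = 3) -> float:
    """Performance rating after prepending virtual draws against `anchor`.

    Each virtual draw adds one game against the anchor rating and half a point.
    """
    padded_opponents = [anchor] * virtual_draws + list(opponents)
    padded_score = score + 0.5 * virtual_draws
    return performance_rating(padded_opponents, padded_score)


# Two wins over 1500-rated opponents: the three draws keep the estimate modest.
print(anchored_rating([1500, 1500], 2))
# Twenty straight wins over 1500-rated opponents: the tie to 1500 has mostly faded.
print(anchored_rating([1500] * 20, 20))
```

In this toy version the virtual draws also keep the estimate finite for an unbeaten newcomer, and their pull shrinks as real games accumulate, which matches the behaviour described in the post for RonWeasley and the newcomers Swynndla beat.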
|
|
|
|
|
chessandgo
Forum Guru
Arimaa player #1889
Posts: 1244
|
|
Re: Experimental new rating system
« Reply #27 on: Jun 30th, 2006, 11:05pm » |
|
on Jun 30th, 2006, 9:47pm, Fritzlein wrote: One possible criticism is that I am still rated too far ahead of 99of9. To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans. It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans. 99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans. Since the FRIAR ratings are not far from vs. human performance ratings calculated by server ratings, I don't feel as bad about my inflated rating. |
| I don't think you'll manage to design a rating system that doesn't put you on top with a huge lead. Why don't you come and play to score a 50-0 instead of taking care of everyone's rating? I'd be glad to be take 3 last losses
|
|
|
|
|
chessandgo
Forum Guru
Arimaa player #1889
Posts: 1244
|
|
Re: Experimental new rating system
« Reply #28 on: Jun 30th, 2006, 11:06pm » |
|
*to take the 3 last losses*
|
|
|
|
|
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
|
|
Re: Experimental new rating system
« Reply #29 on: Jul 1st, 2006, 9:08am » |
|
Here's a graph of my latest FRIAR ratings over time:
[graph not preserved]
Compared to the more-volatile system before:
[graph not preserved]
I think the new one is a more reasonable guess at how fast playing skill actually changes, don't you?
|
|
|
|
|
|