Arimaa Forum - Experimental new rating system


Topic: Experimental new rating system (Read 9239 times)

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #15 on: Jun 6th, 2006, 12:54pm »

on May 31st, 2006, 12:33pm, aaaa wrote:

You might be interested in this article, where it is proposed that games at different time controls are to be given different weights.

It is interesting that blitz chess games can be used to make classical chess ratings more predictive of classical chess results. I wonder, however, whether Sonas did any study of whether people have statistically distinguishable playing strengths at different time controls. For Arimaa there is clearly a difference in playing strength. See the eleventh post in this thread.

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #16 on: Jun 6th, 2006, 1:18pm »

on May 26th, 2006, 4:50pm, aaaa wrote:

maximizing the prediction power of the system has resulted in the rating deviation growing very fast if a player doesn't play in a while. Depending on one's volatility, it will take only about 20 days before one's rating deviation becomes the maximum again.

It's very interesting that maximizing predictive accuracy means that our uncertainty should stay high (or at least rapidly increase back to the maximum) and thus the ratings should remain "loose" i.e. subject to quick change.

I wonder whether people really do change rapidly in playing strength from three weeks of inactivity. Certainly the server bots do not. I suspect that the reason "loose" ratings are more predictive is actually more due to intransitivity than due to rapidly-changing skill level.

If I suddenly decide to pump up my rating by beating one bot over and over again, the rating system will be more predictive if it quickly forgets my history and adjusts to my new performance. Likewise if someone dominates the first dozen bots on the ladder before getting stuck on one they lose to twenty times in a row, the rating system will be more predictive if it quickly forgets the prior wins and lowers that player's rating to predict future losses.

In other words, the worse the problem of intransitivity is, the more stupid slow-changing ratings appear to be compared to fast-changing ratings. Yet we can be sure that fast-changing ratings for bots (especially fixed-performance bots) are not reflecting underlying changes in skill. The changes in a bot's rating mostly reflect changes in who is dominating (or being dominated by) the bot at any given time, i.e. they mostly reflect the intransitivity of ratings.

Of course these are just my speculations. Statistically verifying my thesis would be quite another matter. ;)

mouse
Forum Senior Member

Arimaa player #784

Gender: male

Posts: 45

Re: Experimental new rating system
« Reply #17 on: Jun 6th, 2006, 3:29pm »

I think these rating calculations are interesting. And I believe they point out the need for separate HvH, HvB and BvB ratings, especially since the highlighted players seem to be examples of players who have very different ratings against bots and humans.

on May 29th, 2006, 10:49am, Fritzlein wrote:

Name FRIAR Server
Fritzlein 2320 2309
Adanac 2245 2177
robinson 2230 2148
99of9 2212 2169
Belbo 2172 2002
PMertens 2115 2086
Ryan_Cable 2085 2130
chessandgo 2052 2015
omar 2050 1947
blue22 1989 2005
Swynndla 1989 1790
RonWeasley 1979 1941
BlackKnight 1918 1833
naveed 1876 1956
jdb 1875 1796
OLTI 1850 1958
Spunk 1750 1472
mouse 1728 2051
KT2006 1715 1657
frostlad 1715 1807
seanick 1702 1537
grey_0x2A 1692 1709
Arimanator 1689 2035
kamikazeking 1668 1751
thorin 1654 1895
megamau 1649 1788


Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #18 on: Jun 19th, 2006, 1:36am »

I updated my database with 78 more rated hvh games and ran the FRIAR ratings again. Seeing the results made me realize that FRIAR ratings were way more volatile than I wanted. Winning streaks and losing streaks created huge swings. Chessandgo's rating, for example, went up and then down more than 100 points in a two-week span.

To counteract this volatility, I reduced the k-factor further from 10 to 6. That would make a 100-point swing into only a 60-point swing. The scale remains the same (e.g. a rating difference of 200 points still should mean a 76% winning chance for the favorite), but each player's early ratings become all the more tied to 1500. It now takes about 25 games against humans for your FRIAR rating to really start floating freely. I guess that could be considered a feature: you stay near 1500 until there's quite a bit of proof you should be higher or lower.
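As a quick sanity check of the scale described above, here is a minimal sketch. It assumes the usual Elo logistic curve on a 400-point scale, which matches the stated "200 points means 76%" but is not confirmed as FRIAR's exact formula, and it assumes swings are simply proportional to the k-factor. The function names are illustrative only.

```python
def win_probability(rating_diff: float) -> float:
    """Favorite's winning chance for a given rating lead.

    Assumption: standard Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def scaled_swing(swing: float, old_k: float, new_k: float) -> float:
    """Rescale a rating swing, assuming swings are proportional to the k-factor."""
    return swing * new_k / old_k


if __name__ == "__main__":
    # A 200-point favorite should win roughly three games out of four.
    print(f"200-point favorite: {win_probability(200):.3f}")
    # Cutting k from 10 to 6 shrinks a 100-point swing proportionally.
    print(f"100-point swing at k=6: {scaled_swing(100, 10, 6):.0f}")
```

Under these assumptions the scale of the ratings is set by the curve alone, so changing the k-factor changes how fast ratings move without changing what a given rating gap means.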

Here are the new FRIAR top 20, with number of games played:

rate games username
2387 218 Fritzlein
2240 203 99of9
2116 122 Belbo
2033 204 robinson
2026 110 Adanac
2024 115 Ryan_Cable
2020 279 PMertens
2007 95 chessandgo
2005 84 OLTI
1932 119 omar
1927 105 Swynndla
1914 130 jdb
1884 75 blue22
1882 223 naveed
1867 20 RonWeasley
1819 18 BlackKnight
1753 72 kamikazeking
1705 23 frostlad
1693 68 Arimanator
1673 13 thorin

I note Belbo jumped from 6th place to 3rd. He posted only a win over OLTI in the postal tournament, extending his winning streak to six games, and FRIAR responds to streaks. Also the people he jumped, namely Adanac, robinson, and PMertens, posted results of 4-5, 0-3, and 4-6 respectively against other top players, losing records which weren't compensated by a win or two against lower-ranked folks.

99of9 had only two results, postal wins over blue22 and robinson, which extends 99of9's winning streak to seven and widens his lead over any contenders for second place. Similarly I posted only three wins, extending my winning streak to nine. Even with the k-factor reduced to 6, FRIAR is very concerned to know, "What have you done for me lately?"

Chessandgo leapfrogged jdb and omar, and helped drag OLTI up too by losing to OLTI twice while compiling a 9-4 combined record against PMertens, Ryan_Cable, and Adanac.

RonWeasley's only result was beating seanick, but Ron dropped in the rankings anyway because I reduced the k-factor. That binds him more closely to 1500, and he has only played 20 rated hvh games so far. I couldn't believe Ron has played only twenty games given the impact he has had around here, but the on-line game record agrees. The fact that Ron could play more if the client were more stable is by itself a powerful argument for improving the client.


RonWeasley
Forum Guru

Harry's friend (Arimaa player #441)

Gender: male

Posts: 882

Re: Experimental new rating system
« Reply #19 on: Jun 19th, 2006, 11:44am »

Quote:

I couldn't believe Ron has played only twenty games given the impact he has had around here,

Much like the impact a bludger has on a beater. I'm just batting practice most of the time, but if you don't look out, I might knock you off your broom!


chessandgo
Forum Guru

Arimaa player #1889

Gender: male

Posts: 1244

Re: Experimental new rating system
« Reply #20 on: Jun 19th, 2006, 8:11pm »

Karl, don't you feel like the more you improve your rating system, the closer to Omar's actual ratings the output gets? :P

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #21 on: Jun 19th, 2006, 10:40pm »

Well, the main thing I like about FRIAR right now compared to Omar's ratings is that FRIAR uses only hvh results. To get a good comparison, I would have to run Omar's ratings side-by-side with only hvh games as input. Then I could compare directly.

Hmmm, actually that's not a bad idea...


Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #22 on: Jun 26th, 2006, 5:38pm »

I updated FRIAR with 27 more games.

rate games username
2485 223 Fritzlein
2242 203 99of9
2223 106 chessandgo
2155 123 Belbo
2064 206 robinson
2061 86 OLTI
2060 115 Adanac
2044 282 PMertens
2003 119 Ryan_Cable
1943 105 Swynndla
1924 120 omar
1887 75 blue22
1884 223 naveed
1868 22 RonWeasley
1847 133 jdb
1819 18 BlackKnight
1732 73 kamikazeking
1708 23 frostlad
1693 68 Arimanator
1675 13 thorin

The big story is chessandgo jumping 5 more places. Chessandgo won against Adanac, seanick, RonWeasley, Ryan_Cable, jdb, robinson, PMertens, and seanick, while losing only three games, all to me. That incidentally also pushes my rating to a ridiculous high.

OLTI gets a retrospective bounce from his previous victories over chessandgo, and from having not lost in a while.

This week shows once again that FRIAR ratings are more volatile than server ratings. I gained 100 points for three wins! I might have to reduce the k-factor still further.

« Last Edit: Jun 26th, 2006, 5:39pm by Fritzlein »

OLTI
Forum Full Member

Arimaa player #1034

Gender: male

Posts: 25

Re: Experimental new rating system
« Reply #23 on: Jun 28th, 2006, 4:24am »

I'm starting to like FRIAR :D

DorianGaray
Forum Guru

Arimaa player #1210

Gender: male

Posts: 55

Re: Experimental new rating system
« Reply #24 on: Jun 28th, 2006, 6:03am »

on Jun 26th, 2006, 5:38pm, Fritzlein wrote:

...That incidentally also pushes my rating to a ridiculous high...

Shouldn't that sole detail give you pause about the validity of your system? You've played for nearly two years, yet your rating is completely upset by what? Another player's latest 5 games!

Ratings are only valid if they can accurately predict the likelihood that one player has to win against another at any given time; otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that.

« Last Edit: Jun 28th, 2006, 6:04am by DorianGaray »

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #25 on: Jun 28th, 2006, 8:10am »

on Jun 28th, 2006, 6:03am, DorianGaray wrote:

Ratings are only valid if they can accurately predict the likelihood that one player has to win against another at any given time; otherwise they are just useless numbers. I don't see anything here that shows that your system is doing that.

Sure, I don't think the FRIAR ratings are very predictive. My idea for a new rating system is working rather poorly so far. I think I would have to do something to reduce the volatility to make the results more reasonable. Also I'm starting to not like how strongly everyone is tied to 1500 for their first few games.

The main point of generating ratings based only on human games is that the standard server ratings can be distorted a great deal by bot-bashing, or alternatively by losing repeatedly to a bot that one can't beat. Both of these distortions seem reasonably common.

Take your case for example. You have played 196 rated games against bots and 10 rated games against humans. I submit that your rating of 2228 doesn't tell us very much about your chances of winning against, say, 99of9, who is the player closest in rating to you. You might beat him 90% of the time, or you might lose to him 90% of the time; we just can't tell from your server rating which is based mostly on your games versus bots.

Your ten results against humans (and their ratings) were

L 1949
W 1442
L 2317
W 1483
W 1516
W 1384
L 1863
W 1408
W 1517
W 1518

The rating that predicts a result of 7-3 against this opposition is 1823. With a true strength of 2228, you would be expected to be 9-1 against this opposition, on average. This appears to be a situation where the server ratings are not very predictive, although we have no reliable idea how you would do against human opposition until you start playing more games against humans. Maybe if you played more human opponents you would justify a rating of 2200, or 2500, or anything. We simply can't predict very well on the basis of only ten games.
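For anyone who wants to check the arithmetic above, here is a minimal sketch of a performance-rating calculation. It assumes the standard Elo logistic curve on a 400-point scale and finds, by bisection, the rating whose expected score against the listed opponents equals the actual score. The function names are illustrative and not part of the server or of FRIAR.

```python
from typing import Sequence

# Ratings of the ten human opponents listed above.
OPPONENTS = [1949, 1442, 2317, 1483, 1516, 1384, 1863, 1408, 1517, 1518]


def expected_score(rating: float, opponents: Sequence[float]) -> float:
    """Expected number of wins against the given opponents (Elo curve, assumed)."""
    return sum(1.0 / (1.0 + 10.0 ** ((opp - rating) / 400.0)) for opp in opponents)


def performance_rating(score: float, opponents: Sequence[float],
                       lo: float = 0.0, hi: float = 4000.0) -> float:
    """Rating whose expected score equals `score`, found by bisection.

    Only meaningful when 0 < score < len(opponents); a perfect or zero
    score has no finite solution.
    """
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, opponents) < score:
            lo = mid  # rating too low to explain the score
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(f"Performance rating for 7-3: {performance_rating(7, OPPONENTS):.0f}")
    print(f"Expected wins at 2228: {expected_score(2228, OPPONENTS):.1f}")
```

Bisection works here because the expected score rises monotonically with the rating, so there is exactly one rating that matches any score strictly between 0 and 10.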

I'd be interested in human-only ratings lists even if they were generated by a totally different methodology than FRIAR ratings. In my opinion it is ultimately going to be very hard to generate valid (or accurate, or predictive) ratings if games against bots are included at all, no matter what methodology is used.

I agree, however, that in the case of FRIAR ratings, the cure is worse than the disease. There's no way I deserve to be rated 200 points ahead of everyone else based on my results against humans, and that's not the only visible distortion in FRIAR ratings.


Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #26 on: Jun 30th, 2006, 9:47pm »

I changed FRIAR again. To cut out the still-too-high volatility, I reduced the k-factor all the way to 2. Also, to reduce the too-strong tie to 1500, I no longer assume that everyone has a strength of 1500 at the start; I don't make any assumptions about initial strength. The tie to 1500 is instead 3 draws against a 1500-rated player at the beginning of everyone's game record. The new list, produced from exactly the same games as the old list, is

rate games username
2443 223 Fritzlein
2216 203 99of9
2182 106 chessandgo
2102 206 robinson
2096 123 Belbo
2095 115 Adanac
2064 282 PMertens
2020 119 Ryan_Cable
1988 86 OLTI
1933 22 RonWeasley
1911 18 BlackKnight
1904 120 omar
1904 223 naveed
1898 105 Swynndla
1869 75 blue22
1854 13 Spunk
1851 133 jdb
1810 73 kamikazeking
1776 13 thorin
1772 12 mouse
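The "3 draws against a 1500-rated player" anchor described above can be illustrated with a simplified sketch. This is not FRIAR itself: it assumes the standard Elo logistic curve, holds opponents' ratings fixed, and ignores the k-factor and any weighting of games over time. It only shows how a few virtual draws keep a short game record near 1500 without pinning it there. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

ANCHOR_RATING = 1500.0  # rating of the virtual opponent
ANCHOR_DRAWS = 3        # number of virtual draws prepended to every record


def expected_score(rating: float, opponents: Sequence[float]) -> float:
    """Expected points against the given opponents (Elo curve, assumed)."""
    return sum(1.0 / (1.0 + 10.0 ** ((opp - rating) / 400.0)) for opp in opponents)


def anchored_rating(results: List[Tuple[float, float]]) -> float:
    """Rating that explains the real results plus the virtual draws.

    `results` holds (points, opponent_rating) pairs, with points 1 for a
    win, 0.5 for a draw, and 0 for a loss.
    """
    opponents = [opp for _, opp in results] + [ANCHOR_RATING] * ANCHOR_DRAWS
    total = sum(points for points, _ in results) + 0.5 * ANCHOR_DRAWS
    lo, hi = 0.0, 4000.0
    for _ in range(100):  # bisection on the monotone expected-score curve
        mid = (lo + hi) / 2.0
        if expected_score(mid, opponents) < total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(f"No games yet: {anchored_rating([]):.0f}")
    two_wins = [(1.0, 1500.0), (1.0, 1500.0)]
    print(f"Two wins over 1500-rated players: {anchored_rating(two_wins):.0f}")
```

In this sketch the draws guarantee a finite rating even for an unbeaten newcomer, and their pull fades as real games accumulate, which is consistent with the looser tie to 1500 described above.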

One improvement is that people on a recent winning streak (me, chessandgo, Belbo, OLTI) aren't overly rewarded, and the people performing a bit below par recently (robinson, Adanac) aren't overly punished. The more stable ratings mean the present ratings are affected by games further back.

A second improvement is that RonWeasley jumps to a more reasonable level from not being so closely tied to 1500. A third, more subtle improvement is that Swynndla dips. He beat lots of newcomers, but since the newcomers themselves are less tied to 1500, they end up lower, and Swynndla therefore gets less boost from those victories.

One possible criticism is that I am still rated too far ahead of 99of9. To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans. It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans. 99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans. Since the FRIAR ratings are not far from vs. human performance ratings calculated by server ratings, I don't feel as bad about my inflated rating.

On the whole I'm rather pleased with my latest version. Graph to follow.


chessandgo
Forum Guru

Arimaa player #1889

Gender: male

Posts: 1244

Re: Experimental new rating system
« Reply #27 on: Jun 30th, 2006, 11:05pm »

on Jun 30th, 2006, 9:47pm, Fritzlein wrote:

One possible criticism is that I am still rated too far ahead of 99of9. To check this I used server ratings to calculate my performance rating over my last 50 rated games against humans. It turns out I am 47-3 (I didn't realize!), which gives me a 2429 performance rating versus humans. 99of9 is 38-12 against somewhat tougher opposition, for a performance rating of 2210 versus humans. Since the FRIAR ratings are not far from vs. human performance ratings calculated by server ratings, I don't feel as bad about my inflated rating.

I don't think you'll manage to design a rating system that doesn't put you on top with a huge lead.

Why don't you come and play to score a 50-0 instead of taking care of everyone's rating? I'd be glad to be take 3 last losses ;)

chessandgo
Forum Guru

Arimaa player #1889

Gender: male

Posts: 1244

Re: Experimental new rating system
« Reply #28 on: Jun 30th, 2006, 11:06pm »

*to take the 3 last losses*


Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Experimental new rating system
« Reply #29 on: Jul 1st, 2006, 9:08am »

Here's a graph of my latest FRIAR ratings over time:

Compared to the more-volatile system before:

I think the new one is a more reasonable guess at how fast playing skill actually changes, don't you?
