Arimaa Forum « Whole History Ratings »


(Moderator: supersamu)
   Whole History Ratings
   Author  Topic: Whole History Ratings  (Read 67405 times)
aaaa
Forum Guru
*****



Arimaa player #958

   


Posts: 768
Re: Whole History Ratings
« Reply #15 on: Apr 27th, 2008, 1:52pm »

on Apr 12th, 2008, 2:11am, omar wrote:

 
Yes, I also think that P1 and P2 type bot ratings should be fixed. Additionally the rating system should be anchored so that random play has a rating of zero. This can be achieved using a random bot which has a random setup and picks random moves (i.e. enumerates all the possible unique positions that can arise from the current position and randomly selects one).

I think that's a very bad idea. My common sense tells me that that would result in an extreme runaway inflation of ratings.
I don't see why there should be any rating anchoring.
IP Logged
omar
Forum Guru
*****



Arimaa player #2

   


Gender: male
Posts: 1003
Re: Whole History Ratings
« Reply #16 on: May 2nd, 2008, 1:26pm »

There was a lot of discussion about anchoring the rating system several years ago. Here is the link to that thread.
 
http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1065901453;start=0
 
I guess one could say that even our current rating system is anchored, based on a new player having a rating of 1500, and all the established players' ratings are relative to that. But there is nothing magical about the number 1500. I could have just as easily chosen a new player's rating to be 0, and the rating of the current 2000 rated players would just be 500. Likewise, if new player ratings were 10000, the current 2000 rated players would be rated 10500. So just by picking some value for the new player rating we are anchoring the rating system to it.
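The point that the starting value is arbitrary can be checked numerically. The sketch below assumes the standard logistic Elo expectation on a 400-point scale (an assumption; the gameroom's exact formula is not given in this thread) and shows that shifting every rating by the same offset leaves the predicted result unchanged. The function names are illustrative only.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def shift_all(ratings, offset):
    """Move every rating by the same offset, i.e. re-anchor the scale."""
    return {name: rating + offset for name, rating in ratings.items()}


# New players at 1500, an established player at 2000 (figures from the post).
ratings = {"newcomer": 1500.0, "veteran": 2000.0}

# Offsets that put the newcomer at 1500, 0, and 10000 respectively.
for offset in (0.0, -1500.0, 8500.0):
    shifted = shift_all(ratings, offset)
    p = expected_score(shifted["veteran"], shifted["newcomer"])
    print(shifted, "veteran expected score:", round(p, 4))
```

Only rating differences enter the formula, so the choice of anchor value affects how the numbers read, not what they predict.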
 
I feel that a rating of 0 should mean random play because you are not trying to win or lose. And it's very easy to create such a player for Arimaa and most other games now that we have computers :)
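A minimal sketch of such a random player's move choice, assuming some move generator already exists. The `successors` argument is hypothetical: a list of (move, resulting position) pairs with hashable positions. Because different step sequences in Arimaa can reach the same position, the sketch first collapses duplicates so that each unique position is equally likely.

```python
import random


def pick_random_move(successors, rng=random):
    """Pick uniformly among the unique positions reachable this turn.

    `successors` is a hypothetical list of (move, position) pairs from a
    move generator that is not shown here; positions must be hashable.
    """
    unique = {}
    for move, position in successors:
        # Keep the first move found for each distinct resulting position.
        unique.setdefault(position, move)
    position = rng.choice(list(unique))
    return unique[position], position


# Toy example: moves "a" and "b" lead to the same position "P1".
toy = [("a", "P1"), ("b", "P1"), ("c", "P2")]
print(pick_random_move(toy))
```

Choosing among unique positions rather than among raw move sequences matters: without the deduplication step, positions reachable by many step orders would be over-represented.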
 
The other thing to consider is how good is the anchor. A good anchor should be stable and never change over time. Also it should be a good source and sink so that it can cushion disruptions in the system and prevent rating inflation and deflation. I think programs that play with a fixed strength (such as the P1, P2 bots) would be better anchors than new players with strength that can vary significantly.
 
Ideally it would be nice to have a rating system that is stable across time so that it is reasonable to compare ratings of players from different eras. Also if other games used a similarly scaled rating system with zero as random play then it even becomes possible to compare the complexity of different games. Maybe the ratings for chess would range from 0 to 5000, but the ratings for Go range from 0 to 15000.  
 
The tricky part is how we go from the random bot having a rating of zero up to human players and CC level bots. You would not want to just put out the random bot for anyone to play against and fix its rating to zero. Everyone would only get points from it and hardly ever lose points, which I think is why you feel there would be runaway inflation of ratings. But what if the random bot was only used off-line to establish the rating of some beginner level bots like ArimaaScoreP1, and then these bots were made available online with fixed ratings? The other non-fixed rating bots like ShallowBlue, LocP1, etc. would be set up to periodically play the fixed rated bots. Also the initial rating of new players would be set based on looking at how new players have performed against the bots in the first level of the ladder.
 
Of course having a nice anchor for the rating system doesn't prevent players from abusing the rating system. That's a different issue which needs to be dealt with separately.
 
IP Logged
aaaa
Forum Guru
*****



Arimaa player #958

   


Posts: 768
Re: Whole History Ratings
« Reply #17 on: May 20th, 2008, 1:55pm »

You have to realistically expect that, in all likelihood, a random playing bot would have an Elo rating at least several thousand points below zero by current measures. That would mean that for a stable system, you would need an enormous number of bots to cover the range between it and the rest of the field, to say nothing of the number of games that would have to be played.
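To make the size of that concern concrete, here is a rough sketch under stated assumptions: the standard 400-point Elo expectation, a purely illustrative random-bot rating of -3000 (the post only says "several thousand below zero"), and a rule of thumb that neighbouring rungs of a ladder should be no more than 400 points apart so that the weaker side still wins about one game in eleven. None of these numbers are measurements.

```python
import math


def expected_score(diff):
    """Expected score for a player rated `diff` points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def rungs_needed(gap, max_step=400.0):
    """Intermediate fixed-strength bots needed so adjacent rungs differ by
    at most `max_step` points (a rule-of-thumb threshold, not a standard)."""
    return max(0, math.ceil(gap / max_step) - 1)


random_bot = -3000.0   # illustrative guess only
weakest_bot = 1000.0   # illustrative rating for the first ladder bot
gap = weakest_bot - random_bot

print("direct win chance for the random bot:", expected_score(-gap))
print("win chance across one 400-point rung:", expected_score(-400.0))
print("intermediate bots needed:", rungs_needed(gap))
```

If the true gap is only a few hundred points, as the experiments mentioned later in the thread suggest, the same calculation needs no intermediate bots at all.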
 
If inflation is really bothering you, one possible solution I can think of would be to assume, for now, that every new player is a beginner, and thus to give him or her the average of the lowest rating ever held by each established player. There would, of course, be a considerable risk of deflation then.
IP Logged
Fritzlein
Forum Guru
*****



Arimaa player #706

   
Email

Gender: male
Posts: 5928
Re: Whole History Ratings
« Reply #18 on: May 21st, 2008, 8:36am »

on May 20th, 2008, 1:55pm, aaaa wrote:
You have to realistically expect that, in all likelihood, a random playing bot would have an Elo rating at least several thousand points below zero by current measures.

You are not the only one to have this intuition, but it seems not to be upheld by experiments so far. (I infer from your post that you did not review the old thread Omar linked from his post directly before yours.) It seems that a random player would have a rating near zero on the current rating scale, or anyway not more than a few hundred below zero.
 
Now that I re-read that thread myself, I think it would be a good idea to anchor ArimaaScoreP1 at a rating of 1000 (or something) immediately.  I have gradually become convinced that ratings are inflating on the server, including the ratings of fixed-performance bots, and that the culprit is beginners playing only the first part of the bot ladder.
« Last Edit: May 21st, 2008, 8:53am by Fritzlein » IP Logged

mistre
Forum Guru
*****





   


Gender: male
Posts: 553
Re: Whole History Ratings
« Reply #19 on: May 21st, 2008, 10:43am »

Here is an idea concerning WHR.  Why not experiment with a human vs human only rating using WHR and keep our original rating separate?
 
On the bot side, we anchor lower bots and eventually change our original rating to a bot-only rating.  
 
Does anyone see a problem having 2 different ratings?  We theoretically already have that with P8.  To me, this seems like the simplest solution so that you don't have to worry about mixing vs bot and vs human games together.
« Last Edit: May 21st, 2008, 10:49am by mistre » IP Logged

omar
Forum Guru
*****



Arimaa player #2

   


Gender: male
Posts: 1003
Re: Whole History Ratings
« Reply #20 on: Jun 3rd, 2008, 5:34am »

on May 21st, 2008, 8:36am, Fritzlein wrote:

I have gradually become convinced that ratings are inflating on the server, including the ratings of fixed-performance bots, and that the culprit is beginners playing only the first part of the bot ladder.

 
Looking at what is happening to the rating of fixed-performance bots sounds like a good way to keep tabs on the health of the rating system. So are the fixed rating bots definitely increasing in rating?
 
IP Logged
omar
Forum Guru
*****



Arimaa player #2

   


Gender: male
Posts: 1003
Re: Whole History Ratings
« Reply #21 on: Jun 3rd, 2008, 5:41am »

on May 21st, 2008, 10:43am, mistre wrote:
Here is an idea concerning WHR.  Why not experiment with a human vs human only rating using WHR and keep our original rating separate?

 
I think it is a good idea. Would someone like to do this and post the results? I will link to this also, besides the P8 ratings.
 
IP Logged
Fritzlein
Forum Guru
*****



Arimaa player #706

   
Email

Gender: male
Posts: 5928
Re: Whole History Ratings
« Reply #22 on: Jun 8th, 2008, 11:16am »

on Jun 3rd, 2008, 5:34am, omar wrote:
Looking at what is happening to the rating of fixed-performance bots sounds like a good way to keep tabs on the health of the rating system. So are the fixed rating bots definitely increasing in rating?

I queried the game database for the average rating of each bot each year, with a minimum of 30 rated games for the bot-year to count.

Player \ Year       2003  2004  2005  2006  2007  2008
-----------------   ----  ----  ----  ----  ----  ----
Aamira2006Blitz        .     .     .  1646  1751  1771
Aamira2006CC           .     .     .  1657  1669  1666
Aamira2006Fast         .     .     .  1685  1693  1735
Aamira2006P1           .     .     .  1227  1314  1284
Aamira2006P2           .     .     .  1518  1575  1627
Arimaalon              .  1167  1240  1217  1364  1323
ArimaaScoreP1          .     .     .  1181  1309  1250
ArimaaScoreP2          .     .     .  1297  1368  1343
Arimaazilla         1516  1419  1449  1451  1502  1542
Bomb2005Blitz          .     .  1876  1856  1931  2048
Bomb2005CC             .     .  1774  1858  1916  1915
Bomb2005Fast           .     .  1827  1826  1930  1869
Bomb2005P1             .     .  1488  1632  1715  1736
Bomb2005P2             .     .  1752  1806  1887  1902
Clueless2005Blitz      .     .  1660  1793  1878  1875
Clueless2005CC         .     .  1621  1777  1807  1822
Clueless2005Fast       .     .  1662  1761  1784  1794
Clueless2005P1         .     .  1645  1662  1656  1636
Clueless2005P2         .     .     .  1750  1760  1762
Clueless2006Blitz      .     .     .  1423  1420  1543
Clueless2006Fast       .     .     .  1602  1683  1627
Clueless2006P1         .     .     .  1688  1716  1704
Clueless2006P2         .     .     .  1705  1652  1711
GnoBot2005Blitz        .     .  1652  1747  1841  1911
GnoBot2005CC           .     .  1535  1661  1600  1627
GnoBot2005Fast         .     .  1541  1724  1734  1772
GnoBot2005P1           .     .  1382  1262  1392  1378
GnoBot2005P2           .     .  1552  1608  1651  1660
Loc2005Blitz           .     .  1602  1571  1568  1704
Loc2005CC              .     .  1419  1539  1508  1518
Loc2005Fast            .     .  1438  1498  1557  1570
Loc2005P1              .     .  1404  1314  1412  1374
Loc2005P2              .     .  1425  1412  1498  1582
Loc2006Blitz           .     .     .  1579  1614  1724
Loc2006P1              .     .     .  1356  1448  1431
Loc2006P2              .     .     .  1585  1599  1644
ShallowBlue            .     .     .  1224  1326  1291

 
The average year-over-year rating gain for all bots was
 
2005 to 2006: 49
2006 to 2007: 52
2007 to 2008: 17
 
which works out to a gain of 50 points per year when you consider that we are only a third of the way through 2008.
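A sketch of how such a year-over-year average could be computed, assuming it is the unweighted mean of the rating change over every bot that has an entry in both years (the post does not say exactly how the averaging was done). Only three rows of the table are included, so the output will not match the figures above.

```python
# Average yearly ratings for three of the bots in the table above.
ratings = {
    "ArimaaScoreP1": {2006: 1181, 2007: 1309, 2008: 1250},
    "ShallowBlue": {2006: 1224, 2007: 1326, 2008: 1291},
    "Arimaazilla": {2003: 1516, 2004: 1419, 2005: 1449,
                    2006: 1451, 2007: 1502, 2008: 1542},
}


def mean_gain(table, year):
    """Unweighted mean rating change from `year` to `year + 1`, over bots
    that have an entry in both years. Returns None if no bot qualifies."""
    gains = [
        by_year[year + 1] - by_year[year]
        for by_year in table.values()
        if year in by_year and year + 1 in by_year
    ]
    return sum(gains) / len(gains) if gains else None


for year in range(2003, 2008):
    print(f"{year} to {year + 1}: {mean_gain(ratings, year)}")
```

A weighted mean (for example by number of rated games per bot-year) would be an equally defensible choice and could give noticeably different numbers.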
 
The average year-over-year rating gain for fixed performance bots only was
 
2005 to 2006: 3
2006 to 2007: 66
2007 to 2008: 2
 
which isn't quite as bad, but still noticeable.
 
The average year-over-year rating gain for Fast and Blitz bots only was
 
2005 to 2006: 65
2006 to 2007: 52
2007 to 2008: 43
 
If we look at the total rating increase for speedy bots in two and a third years (160 points) and subtract out the total rating increase in fixed performance bots (71 points of inflation), that gives us an increase of 38 rating points per year due to faster hardware alone. Taking the rating of chessandgo (2496), subtracting the rating of Bomb2005CC (1916), and dividing by 38, we can forecast that faster hardware will allow bots to win the Arimaa Challenge in 15 years, i.e. a couple of years after the Challenge prize expires. Of course this doesn't take into account new programming techniques or new human strategic discoveries.
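The arithmetic behind that forecast, spelled out step by step using only the figures quoted in this post. The variable names are illustrative.

```python
years_elapsed = 2 + 1 / 3          # 2005 through a third of 2008

speedy_gain = 65 + 52 + 43         # Fast and Blitz bots, total points
fixed_gain = 3 + 66 + 2            # fixed-performance bots (inflation)

# Gain attributed to faster hardware, per year.
hardware_per_year = (speedy_gain - fixed_gain) / years_elapsed

# Gap between the top human and the top fixed-performance bot in the post.
gap = 2496 - 1916

# The post rounds the yearly gain to 38 before dividing.
years_to_close = gap / round(hardware_per_year)

print("hardware gain per year:", round(hardware_per_year, 1))
print("years to close the gap:", round(years_to_close, 1))
```

This is a straight-line extrapolation, so it inherits every caveat in the post: it assumes the hardware trend continues unchanged and that neither programs nor humans improve by other means.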
IP Logged

omar
Forum Guru
*****



Arimaa player #2

   


Gender: male
Posts: 1003
Re: Whole History Ratings
« Reply #23 on: Jun 11th, 2008, 2:57pm »

Thanks for posting this data Karl. I had not seen this before my last post requesting some graphs of the bot rating histories. I like this better.
IP Logged
woh
Forum Guru
*****



Arimaa player #2128

   


Gender: male
Posts: 254
Re: Whole History Ratings
« Reply #24 on: Mar 12th, 2009, 4:24am »

I managed to implement the whole history ratings. The first results are based on all rated games up to January 31st. All HvH as well as all HvB and all BvB games are included.
 
The time resolution used is 1 sec. This means that the rating at the time of each game is calculated separately. In the article a resolution of 1 day is used, but tests have shown that the difference is not noticeable. For the variability of the ratings in time, 60 Elo²/day was used. Finally, the prior used was 1 game won and 1 game lost against a player with a rating of 1220.319747. At first I used 1 win and 1 loss against a player with a rating of 0, as mentioned in the article. But that gave me ratings in the range of -600 to 1400, and I wanted ratings that are more comparable with the gameroom ratings. So what to use? 1 win and 1 loss against a player of 1500, the rating new players used to start with, or 1300, the rating at which new players now start? Then I came up with the idea to choose the rating in such a way that the final rating of bot_ArimaaScoreP1 is 1000, the rating to which this bot is now fixed in the gameroom. (The rating of bot_ArimaaScoreP1 is not fixed in these whole history ratings; it can vary over time just like the rating of any other player.)
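Two of these parameters can be illustrated with a short sketch. It assumes the variability figure is the variance rate of a random walk, so the prior standard deviation of a rating change grows with the square root of elapsed time, and it assumes the model depends only on rating differences, so moving the virtual prior opponent by some offset moves every rating by the same offset. The value -220.319747 below is back-computed from the 1220.319747 in the post purely for illustration; it is not a reported result, and this is not necessarily how the anchor was actually found.

```python
import math

W2 = 60.0  # rating variability in Elo^2 per day (figure from the post)


def drift_std(days, w2=W2):
    """Prior standard deviation of a player's rating change over `days`,
    treating the rating as a random walk with variance rate `w2`."""
    return math.sqrt(w2 * days)


def prior_anchor(target, rating_with_zero_prior):
    """Rating to give the virtual prior opponent so that a chosen player
    lands on `target`, given that player's rating when the prior opponent
    is rated 0. Relies on the assumed shift invariance of the model."""
    return target - rating_with_zero_prior


print("std after 1 day:  ", round(drift_std(1), 1))
print("std after 30 days:", round(drift_std(30), 1))
print("std after 1 year: ", round(drift_std(365), 1))

# Hypothetical: bot_ArimaaScoreP1 ends at -220.319747 with a zero prior.
print("prior opponent rating:", prior_anchor(1000.0, -220.319747))
```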
 
For comparison I have plotted the gameroom ratings against the whole history ratings, as well as the difference between the 2 ratings against the gameroom ratings. In both graphs the red square represents bot_ArimaaScoreP1.
 
For some players I have also plotted the history of their whole history rating together with their gameroom rating. The history of bot_ArimaaScoreP1 is one of the players' histories included. You can see how his rating varies over time to end at a rating of 1000. (The time on the x-axis is in days since September 1st 2002.)
 
The whole history ratings predict the outcome of a game correctly in 77.2% of the games, an improvement over the gameroom ratings, which predict 72.6% of the games correctly. Fritzlein suggested using the root mean square of the difference between the actual result of a game and the result predicted by the ratings to compare the performance of the ratings. By this criterion the whole history ratings, at 0.394, also do better than the gameroom ratings, which score 0.424.
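Both comparison measures can be sketched in a few lines. This assumes predictions come from the standard logistic formula on a 400-point scale, that each game is recorded as 1.0 for a win and 0.0 for a loss from the first player's side, and that a game between equally rated players counts as half a correct prediction. That last convention is a guess; the post does not say how such games were scored. The three games in the example are made up.

```python
import math


def expected_score(rating_a, rating_b):
    """Predicted score for A against B (assumed 400-point logistic scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def evaluate(games):
    """Return (fraction predicted correctly, RMS prediction error).

    `games` is an iterable of (rating_a, rating_b, result_a) tuples, where
    result_a is 1.0 if A won and 0.0 if A lost.
    """
    correct = 0.0
    squared_error = 0.0
    count = 0
    for rating_a, rating_b, result_a in games:
        p = expected_score(rating_a, rating_b)
        if p == 0.5:
            correct += 0.5          # no favourite: count as half right
        elif (p > 0.5) == (result_a == 1.0):
            correct += 1.0          # the favourite won
        squared_error += (result_a - p) ** 2
        count += 1
    return correct / count, math.sqrt(squared_error / count)


# Made-up games: favourite wins, favourite loses, even match.
toy_games = [(1600, 1400, 1.0), (1600, 1400, 0.0), (1500, 1500, 1.0)]
accuracy, rmse = evaluate(toy_games)
print("accuracy:", accuracy, "rms error:", round(rmse, 3))
```

The RMS measure rewards well-calibrated confidence as well as picking the right winner, which is why it can separate two systems more sharply than the percentage alone. A system that always predicted 50% would score exactly 0.5 on it.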
 
 
on Jun 3rd, 2008, 5:41am, omar wrote:

 
I think it is a good idea. Would someone like to do this and post the results? I will link to this also, besides the P8 ratings.
 

 
 
The second results are based only on the rated HvH games. Otherwise all parameters are the same. Again the games up till January 31st are included.
IP Logged

Fritzlein
Forum Guru
*****



Arimaa player #706

   
Email

Gender: male
Posts: 5928
Re: Whole History Ratings
« Reply #25 on: Mar 12th, 2009, 8:30am »

This is a fantastic contribution, woh.  I recently commented again on the need for "real" ratings, and the whole history ratings based on human vs. human games are the most real ratings I have seen.
 
We have seen some systems that improve on the game-room ratings, but no previous offering has been able to simultaneously solve the "isolated group" problem and the "time varying" problem.
 
Systems which process the game history sequentially are vulnerable to distortion from newcomers playing each other and getting established ratings from each other even though both players are overrated to start.  Later losses by those newcomers to established players don't have as much corrective impact as they should.
 
On the other hand, systems which process the game history simultaneously are vulnerable to distortion from real changes in skill.  Usually when someone processes the whole game history, I come out rated higher than chessandgo because I won about thirty straight games against him while he was learning.  Those games show that I was a better player then, but are scant evidence that I am a better player now.
 
Just scanning the results of the human-versus-human whole history ratings, I don't see any obvious distortions, except for the whole scale having been shifted down.  In particular, my early success against chessandgo didn't push my current rating above his.  Also I don't see many players with thin records floating way above the median.  (Although I want to look at how Rabbit got to #22, since I don't recall hearing of him before.)
 
I propose that the human-vs-human whole history ratings be used to seed tournaments from now on, starting with the Postal Mixer, but most especially that they be used to seed the next World Championship. Or if WHR comes with an uncertainty interval, then we could seed based on the lower bound of that interval, i.e. the WHR rating minus its uncertainty.
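Seeding by a conservative rating can be sketched as a simple sort. The player names, ratings, and uncertainty values below are invented for illustration, and the sketch assumes the rating system reports one symmetric uncertainty per player.

```python
def seed_order(players):
    """Order players by rating minus uncertainty, highest first.

    `players` maps a name to a (rating, uncertainty) pair; both the data
    layout and the one-uncertainty-per-player assumption are hypothetical.
    """
    def lower_bound(name):
        rating, uncertainty = players[name]
        return rating - uncertainty

    return sorted(players, key=lower_bound, reverse=True)


# Invented example: B has the highest rating but a very thin record.
players = {
    "A": (2000.0, 50.0),
    "B": (2100.0, 300.0),
    "C": (1900.0, 20.0),
}
print(seed_order(players))
```

Seeding on the lower bound pushes players with thin records down the list until they have played enough games to narrow their interval.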
 
A while ago I tried to implement something like the WHR, but my successive approximation took ages to converge, making it computationally infeasible as a substitute for game room ratings.  What are the CPU requirements of WHR?  How quickly can a new game result be added in?  I'm quite interested in WHR being added to the game room if the computation cost is low enough.  Certainly it seems like a better way to spend arimaa.com server CPU than the periodic recalculation of p8 ratings.
 
Thanks again for coding this up, woh!
« Last Edit: Mar 12th, 2009, 8:31am by Fritzlein » IP Logged

omar
Forum Guru
*****



Arimaa player #2

   


Gender: male
Posts: 1003
Re: Whole History Ratings
« Reply #26 on: Mar 13th, 2009, 6:43am »

Great job woh. This is fantastic.
 
I added a link to your web page on the 'Top Rated Players' page.
 
I agree with Karl that we should start using the WHR HH ratings for seeding of future tournaments.
IP Logged
Tuks
Forum Guru
*****



Arimaa player #2626

   


Gender: male
Posts: 203
Re: Whole History Ratings
« Reply #27 on: Mar 13th, 2009, 9:47am »

You might want to revise it though. No matter how much convincing you do, you will not convince me that "Rabbit", who happened to win 3 out of 3 human games, has any chance against any of the top players in a non-postal match ;) Other than that, I like it; it presents an actual rating I want to be at the top of.
IP Logged
mistre
Forum Guru
*****





   


Gender: male
Posts: 553
Re: Whole History Ratings
« Reply #28 on: Mar 13th, 2009, 11:10am »

Great contribution Woh!  Your personal graph looks like the current stock market!  LOL.
 
Perhaps there should be a minimum number of games needed to be ranked - which would fix the "Rabbit" situation.
« Last Edit: Mar 13th, 2009, 11:11am by mistre » IP Logged

omar
Forum Guru
*****



Arimaa player #2

   


Gender: male
Posts: 1003
Re: Whole History Ratings
« Reply #29 on: Mar 18th, 2009, 7:29pm »

Ever since woh posted his results of using WHR on the HH games, I've been tempted to improve my position on that list :) Which of course means I have to start playing more HH games. I plan to do that more once the events ease up.
 
Even though I don't care too much about my rating and play experimental games against bots as rated games, I think almost all the rated HH games I've played were taken seriously. Thus, I think the data going into the WHR rating system was pretty good, and so the ratings we get out of it will probably be more accurate. I feel very comfortable with using these ratings to seed human tournaments. Also, a key feature I like about WHR is the ability to retroactively unrate games; I'll explain in a bit why this is so good.
 
I've started to view our current rating system as an unofficial superficial rating system which simply serves to provide immediate feedback of ratings to new users. Also for people who like to boost their ratings against bots it gives them a longer term goal and initially will help them get better. Eventually they will learn which ratings really matter and find yet another goal to try for.
 
There is the possibility that people will eventually try to distort their WHR HH ratings also. It would not be that hard to do by creating multiple accounts and sacrificing the ratings of a few accounts to boost the rating of one account. I think we might eventually need to keep a flag with every game that indicates if the game should be used in WHR rating calculations or not. This will allow us to retroactively not include rated games that look suspicious. Eventually we could even mark legitimate rated HB games for inclusion in WHR.
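A minimal sketch of the per-game flag idea. The record layout and field names are hypothetical, not the actual arimaa.com game database schema. Because WHR recomputes ratings from the full game list, excluding a game retroactively only requires flipping its flag and rerunning the calculation.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """Hypothetical game record; field names are illustrative only."""
    game_id: int
    winner: str
    loser: str
    rated: bool = True
    use_in_whr: bool = True   # can be cleared later for suspicious games


def whr_input(games):
    """Games that should feed the next WHR recalculation."""
    return [g for g in games if g.rated and g.use_in_whr]


games = [
    Game(1, "alice", "bob"),
    Game(2, "alice", "sock_puppet"),
    Game(3, "bob", "carol", rated=False),
]

# Retroactively exclude a game that looks like rating manipulation.
games[1].use_in_whr = False

print([g.game_id for g in whr_input(games)])
```

Keeping the flag separate from `rated` leaves the gameroom rating history untouched while still letting the WHR list be cleaned up.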
 
So woh, how can we get daily updates of our WHR ratings? If you want, I can set it up to run the calculations on the arimaa.com server once a day.
IP Logged