Topic: Whole History Ratings (Read 67405 times)

aaaa
Forum Guru
Arimaa player #958
Posts: 768
Re: Whole History Ratings
« Reply #15 on: Apr 27th, 2008, 1:52pm »
on Apr 12th, 2008, 2:11am, omar wrote: Yes, I also think that P1 and P2 type bot ratings should be fixed. Additionally, the rating system should be anchored so that random play has a rating of zero. This can be achieved using a random bot which makes a random setup and picks random moves (i.e. enumerates all the possible unique positions that can arise from the current position and randomly selects one).

I think that's a very bad idea. My common sense tells me that it would result in extreme, runaway rating inflation. I don't see why there should be any rating anchoring.
omar
Forum Guru
Arimaa player #2
Posts: 1003
Re: Whole History Ratings
« Reply #16 on: May 2nd, 2008, 1:26pm »
There was a lot of discussion about anchoring the rating system several years ago; here is the link to that thread: http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1065901453;start=0

I guess one could say that even our current rating system is anchored, based on a new player having a rating of 1500, with all the established players' ratings relative to that. But there is nothing magical about the number 1500. I could just as easily have chosen a new player's rating to be 0, and the rating of the current 2000-rated players would just be 500. Likewise, if new player ratings were 10000, the current 2000-rated players would be rated 10500. So just by picking some value for the new player rating, we are anchoring the rating system to it.

I feel that a rating of 0 should mean random play, because then you are not trying to win or lose. And it's very easy to create such a player for Arimaa and most other games now that we have computers.

The other thing to consider is how good the anchor is. A good anchor should be stable and never change over time. It should also be a good source and sink, so that it can cushion disruptions in the system and prevent rating inflation and deflation. I think programs that play with a fixed strength (such as the P1 and P2 bots) would be better anchors than new players, whose strength can vary significantly.

Ideally it would be nice to have a rating system that is stable across time, so that it is reasonable to compare ratings of players from different eras. Also, if other games used a similarly scaled rating system with zero as random play, it would even become possible to compare the complexity of different games. Maybe the ratings for chess would range from 0 to 5000, while the ratings for Go range from 0 to 15000.

The tricky part is how to get from the random bot having a rating of zero up to human players and CC-level bots. You would not want to just put out the random bot for anyone to play against and fix its rating to zero. Everyone would only gain points from it and hardly ever lose points, which I think is why you feel there would be runaway rating inflation. But what if the random bot were only used off-line to establish the ratings of some beginner-level bots like ArimaaScoreP1, and then these bots were made available online with fixed ratings? The other non-fixed-rating bots like ShallowBlue, LocP1, etc. would be set up to periodically play the fixed-rated bots. Also, the initial rating of new players would be set by looking at how new players have performed against the bots in the first level of the ladder.

Of course, having a nice anchor for the rating system doesn't prevent players from abusing the rating system. That's a different issue which needs to be dealt with separately.
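The anchoring scheme described above can be illustrated with a small sketch. This is not the server's actual code; it assumes the standard Elo formulas (400-point scale, K-factor 32) and illustrative ratings, with the anchor bot's rating simply never updated, so it acts as the "source and sink" Omar describes.

```python
# Sketch of anchoring via a fixed-rated bot: the bot takes part in normal
# Elo updates, but its own rating never moves. Formulas are standard Elo;
# the specific numbers below are illustrative, not the server's.

def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_player, r_opponent, score, k=32, opponent_fixed=True):
    """Return new ratings after one game. A fixed opponent absorbs points
    without its own rating changing (a rating 'source and sink')."""
    e = expected_score(r_player, r_opponent)
    new_player = r_player + k * (score - e)
    new_opponent = r_opponent if opponent_fixed else r_opponent - k * (score - e)
    return new_player, new_opponent

# A 1500-rated player beats a bot anchored at 1000: the winner gains only
# a little, and the anchor stays exactly at 1000.
r, anchor = update(1500, 1000, 1.0)
print(round(r, 1), anchor)
```

Because the anchor's rating never drifts, points flow into and out of the pool through it, which is what dampens systemic inflation or deflation.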
aaaa
Forum Guru
Arimaa player #958
Posts: 768
Re: Whole History Ratings
« Reply #17 on: May 20th, 2008, 1:55pm »
Realistically, you have to expect that a random-playing bot would, in all likelihood, have an Elo rating at least several thousand points below zero by current measures. That would mean that, for a stable system, you would need an enormous number of bots to cover the range between it and the rest of the field, to say nothing of the number of games that would have to be played. If inflation is really bothering you, one possible solution I can think of would be to assume, for now, that every new player is a beginner, and thus to give him or her the average of the lowest rating ever held by each established player. There would, of course, then be a considerable risk of deflation.
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Whole History Ratings
« Reply #18 on: May 21st, 2008, 8:36am »
on May 20th, 2008, 1:55pm, aaaa wrote: You have to realistically expect that, in all likelihood, a random-playing bot would have an Elo rating at least several thousand points below zero by current measures.

You are not the only one to have this intuition, but it seems not to be borne out by the experiments so far. (I infer from your post that you did not review the old thread Omar linked from his post directly before yours.) It seems that a random player would have a rating near zero on the current scale, or in any case not more than a few hundred points below zero.

Now that I have re-read that thread myself, I think it would be a good idea to anchor ArimaaScoreP1 at a rating of 1000 (or thereabouts) immediately. I have gradually become convinced that ratings are inflating on the server, including the ratings of fixed-performance bots, and that the culprit is beginners playing only the first part of the bot ladder.
« Last Edit: May 21st, 2008, 8:53am by Fritzlein »

mistre
Forum Guru
Posts: 553
Re: Whole History Ratings
« Reply #19 on: May 21st, 2008, 10:43am »
Here is an idea concerning WHR. Why not experiment with a human-vs-human-only rating using WHR and keep our original rating separate? On the bot side, we anchor the lower bots and eventually change our original rating to a bot-only rating. Does anyone see a problem with having two different ratings? We theoretically already have that with P8. To me, this seems like the simplest solution, so that you don't have to worry about mixing vs-bot and vs-human games together.
« Last Edit: May 21st, 2008, 10:49am by mistre »

omar
Forum Guru
Arimaa player #2
Posts: 1003
Re: Whole History Ratings
« Reply #20 on: Jun 3rd, 2008, 5:34am »
on May 21st, 2008, 8:36am, Fritzlein wrote: I have gradually become convinced that ratings are inflating on the server, including the ratings of fixed-performance bots, and that the culprit is beginners playing only the first part of the bot ladder.

Looking at what is happening to the ratings of the fixed-performance bots sounds like a good way to keep tabs on the health of the rating system. So are the fixed-rating bots definitely increasing in rating?
omar
Forum Guru
Arimaa player #2
Posts: 1003
Re: Whole History Ratings
« Reply #21 on: Jun 3rd, 2008, 5:41am »
on May 21st, 2008, 10:43am, mistre wrote: Here is an idea concerning WHR. Why not experiment with a human-vs-human-only rating using WHR and keep our original rating separate?

I think it is a good idea. Would someone like to do this and post the results? I will link to it as well, alongside the P8 ratings.
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Whole History Ratings
« Reply #22 on: Jun 8th, 2008, 11:16am »
on Jun 3rd, 2008, 5:34am, omar wrote: Looking at what is happening to the ratings of the fixed-performance bots sounds like a good way to keep tabs on the health of the rating system. So are the fixed-rating bots definitely increasing in rating?

I queried the game database for the average rating of each bot in each year, counting a bot-year only if the bot played at least 30 rated games that year.

Player \ Year       2003  2004  2005  2006  2007  2008
-----------------   ----  ----  ----  ----  ----  ----
Aamira2006Blitz                       1646  1751  1771
Aamira2006CC                          1657  1669  1666
Aamira2006Fast                        1685  1693  1735
Aamira2006P1                          1227  1314  1284
Aamira2006P2                          1518  1575  1627
Arimaalon                 1167  1240  1217  1364  1323
ArimaaScoreP1                         1181  1309  1250
ArimaaScoreP2                         1297  1368  1343
Arimaazilla         1516  1419  1449  1451  1502  1542
Bomb2005Blitz                   1876  1856  1931  2048
Bomb2005CC                      1774  1858  1916  1915
Bomb2005Fast                    1827  1826  1930  1869
Bomb2005P1                      1488  1632  1715  1736
Bomb2005P2                      1752  1806  1887  1902
Clueless2005Blitz               1660  1793  1878  1875
Clueless2005CC                  1621  1777  1807  1822
Clueless2005Fast                1662  1761  1784  1794
Clueless2005P1                  1645  1662  1656  1636
Clueless2005P2                        1750  1760  1762
Clueless2006Blitz                     1423  1420  1543
Clueless2006Fast                      1602  1683  1627
Clueless2006P1                        1688  1716  1704
Clueless2006P2                        1705  1652  1711
GnoBot2005Blitz                 1652  1747  1841  1911
GnoBot2005CC                    1535  1661  1600  1627
GnoBot2005Fast                  1541  1724  1734  1772
GnoBot2005P1                    1382  1262  1392  1378
GnoBot2005P2                    1552  1608  1651  1660
Loc2005Blitz                    1602  1571  1568  1704
Loc2005CC                       1419  1539  1508  1518
Loc2005Fast                     1438  1498  1557  1570
Loc2005P1                       1404  1314  1412  1374
Loc2005P2                       1425  1412  1498  1582
Loc2006Blitz                          1579  1614  1724
Loc2006P1                             1356  1448  1431
Loc2006P2                             1585  1599  1644
ShallowBlue                           1224  1326  1291

The average year-over-year rating gain for all bots was

2005 to 2006: 49
2006 to 2007: 52
2007 to 2008: 17

which works out to a gain of about 50 points per year, considering that we are only a third of the way through 2008.

The average year-over-year rating gain for fixed-performance bots only was

2005 to 2006: 3
2006 to 2007: 66
2007 to 2008: 2

which isn't quite as bad, but is still noticeable. The average year-over-year rating gain for Fast and Blitz bots only was

2005 to 2006: 65
2006 to 2007: 52
2007 to 2008: 43

If we take the total rating increase for speedy bots over two and a third years (160 points) and subtract the total rating increase for fixed-performance bots (71 points of pure inflation), we get an increase of about 38 rating points per year due to faster hardware alone. Taking the rating of chessandgo (2496), subtracting the rating of Bomb2005CC (1916), and dividing by 38, we can forecast that faster hardware alone will allow bots to win the Arimaa Challenge in about 15 years, i.e. a couple of years after the Challenge prize expires. Of course this doesn't take into account new programming techniques or new human strategic discoveries.
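The back-of-the-envelope forecast above can be checked directly. All numbers below are taken from the post itself; the script only reproduces the arithmetic.

```python
# Reproducing the forecast arithmetic from the post above.
# Numbers come from the post; nothing here is new data.

speedy_gain_total = 160      # Fast/Blitz bots, total gain over 2 1/3 years
fixed_gain_total = 71        # fixed-performance bots (pure inflation)
years = 2 + 1 / 3

# Gain attributable to faster hardware = speedy gain minus inflation.
hardware_gain_per_year = (speedy_gain_total - fixed_gain_total) / years
print(round(hardware_gain_per_year, 1))   # ~38 points/year

chessandgo = 2496            # top human rating cited in the post
bomb_cc = 1916               # Bomb2005CC rating from the table
years_to_challenge = (chessandgo - bomb_cc) / hardware_gain_per_year
print(round(years_to_challenge))          # ~15 years
```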
omar
Forum Guru
Arimaa player #2
Posts: 1003
Re: Whole History Ratings
« Reply #23 on: Jun 11th, 2008, 2:57pm »
Thanks for posting this data, Karl. I had not seen it before my last post requesting graphs of the bot rating histories. I like this better.
woh
Forum Guru
Arimaa player #2128
Posts: 254
Re: Whole History Ratings
« Reply #24 on: Mar 12th, 2009, 4:24am »
I managed to implement whole history ratings. The first results are based on all rated games up to January 31st; all HvH as well as all HvB and BvB games are included.

The time resolution used is 1 second, which means the rating at the time of each game is calculated separately. In the article a resolution of 1 day is used, but tests have shown that the difference is not noticeable. For the variability of the ratings over time, 60 Elo²/day was used. Finally, the prior used was 1 game won and 1 game lost against a player with a rating of 1220.319747.

At first I used 1 win and 1 loss against a player with a rating of 0, as mentioned in the article. But that gave me ratings in the range of -600 to 1400, and I wanted ratings more directly comparable with the gameroom ratings. So what to use? 1 win and 1 loss against a player rated 1500, the rating new players used to start with, or 1300, the rating at which new players now start? Then I came up with the idea of choosing the prior rating in such a way that the final rating of bot_ArimaaScoreP1 comes out to 1000, the rating to which this bot is now fixed in the gameroom. (The rating of bot_ArimaaScoreP1 is not fixed in these whole history ratings; it can vary over time just like the rating of any other player.)

For comparison I have plotted the gameroom ratings against the whole history ratings, as well as the difference between the two ratings against the gameroom ratings. In both graphs the red square represents bot_ArimaaScoreP1. For some players I have also plotted the history of their whole history rating together with their gameroom rating. The history of bot_ArimaaScoreP1 is among those included; you can see how its rating varies over time to end at 1000. (The time on the x-axis is in days since September 1st, 2002.)

The whole history ratings predict the outcome of a game correctly in 77.2% of the games, an improvement over the gameroom ratings, which predict 72.6% of the games correctly. Fritzlein suggested using the root mean square of the difference between the actual result of a game and the result predicted by the ratings to compare the performance of the two systems. By this criterion the whole history ratings, at 0.394, also do better than the gameroom ratings, which score 0.424.

on Jun 3rd, 2008, 5:41am, omar wrote: I think it is a good idea. Would someone like to do this and post the results? I will link to it as well, alongside the P8 ratings.

The second results are based only on the rated HvH games; otherwise all parameters are the same. Again, games up to January 31st are included.
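The two comparison metrics woh reports (the fraction of games predicted correctly, and the root-mean-square difference between actual and predicted results suggested by Fritzlein) can be sketched as follows. The standard Elo win-probability formula and the sample games are assumptions for illustration, not woh's actual implementation or data.

```python
import math

def win_prob(r_a, r_b):
    """Standard Elo/Bradley-Terry win probability on the 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def evaluate(games):
    """games: list of (rating_a, rating_b, result) with result 1 if A won,
    0 if A lost. Returns (fraction predicted correctly, RMS error)."""
    correct = 0
    sq_err = 0.0
    for ra, rb, result in games:
        p = win_prob(ra, rb)
        # A game counts as predicted correctly if the favorite won.
        if (p > 0.5) == (result == 1):
            correct += 1
        sq_err += (result - p) ** 2
    n = len(games)
    return correct / n, math.sqrt(sq_err / n)

# Hypothetical games, not real server data:
games = [(1800, 1600, 1), (1500, 1700, 0), (1650, 1640, 0)]
acc, rmse = evaluate(games)
print(acc, round(rmse, 3))
```

A rating system with better-calibrated ratings will show both a higher accuracy and a lower RMS error on the same game set, which is exactly the comparison made between WHR and the gameroom ratings above.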
Fritzlein
Forum Guru
Arimaa player #706
Posts: 5928
Re: Whole History Ratings
« Reply #25 on: Mar 12th, 2009, 8:30am »
This is a fantastic contribution, woh. I recently commented again on the need for "real" ratings, and the whole history ratings based on human-vs-human games are the most real ratings I have seen.

We have seen some systems that improve on the gameroom ratings, but no previous offering has been able to simultaneously solve the "isolated group" problem and the "time varying" problem. Systems which process the game history sequentially are vulnerable to distortion from newcomers playing each other and getting established ratings from each other even though both players are overrated to start; later losses by those newcomers to established players don't have as much corrective impact as they should. On the other hand, systems which process the whole game history simultaneously are vulnerable to distortion from real changes in skill. Usually when someone processes the whole game history, I come out rated higher than chessandgo, because I won about thirty straight games against him while he was learning. Those games show that I was a better player then, but are scant evidence that I am a better player now.

Just scanning the results of the human-versus-human whole history ratings, I don't see any obvious distortions, except for the whole scale having been shifted down. In particular, my early success against chessandgo didn't push my current rating above his. Also, I don't see many players with thin records floating way above the median. (Although I do want to look at how Rabbit got to #22, since I don't recall hearing of him before.)

I propose that the human-vs-human whole history ratings be used to seed tournaments from now on, starting with the Postal Mixer, but most especially that they be used to seed the next World Championship. Or, if WHR comes with an uncertainty interval, we could seed based on the lower bound of that interval.

A while ago I tried to implement something like WHR, but my successive approximation took ages to converge, making it computationally infeasible as a substitute for gameroom ratings. What are the CPU requirements of WHR? How quickly can a new game result be added in? I'm quite interested in WHR being added to the gameroom if the computation cost is low enough. Certainly it seems like a better way to spend arimaa.com server CPU than the periodic recalculation of P8 ratings.

Thanks again for coding this up, woh!
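On the incremental-update question: the core numerical step in WHR is Newton's method on a Bradley-Terry log-likelihood, which is what makes adding a new game cheap compared with recomputing everything from scratch. The sketch below is deliberately simplified; it updates a single player's rating with all opponents held fixed, and it ignores WHR's time-varying (Wiener-process) prior entirely, so it conveys only the flavor of the computation. The function names and game data are hypothetical.

```python
import math

Q = math.log(10) / 400.0   # conversion between Elo points and natural units

def win_prob(r, r_opp):
    """Bradley-Terry win probability for a player rated r vs r_opp."""
    return 1.0 / (1.0 + math.exp(Q * (r_opp - r)))

def newton_step(r, games):
    """One Newton-Raphson step on the log-likelihood of a single player's
    rating, holding opponents fixed. games: list of (r_opp, score) with
    score 1 for a win and 0 for a loss."""
    grad = Q * sum(s - win_prob(r, ro) for ro, s in games)
    hess = -Q * Q * sum(win_prob(r, ro) * (1 - win_prob(r, ro))
                        for ro, s in games)
    return r - grad / hess

# Refine an initial 1500 guess against a few results (hypothetical data):
games = [(1400, 1), (1550, 1), (1600, 0)]
r = 1500.0
for _ in range(20):
    r = newton_step(r, games)
print(round(r, 1))
```

Because the log-likelihood is concave, each step moves the rating toward the maximum-likelihood value, and convergence typically takes only a handful of iterations per player; full WHR interleaves such steps across all players and time points.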
« Last Edit: Mar 12th, 2009, 8:31am by Fritzlein »

omar
Forum Guru
Arimaa player #2
Posts: 1003
Re: Whole History Ratings
« Reply #26 on: Mar 13th, 2009, 6:43am »
Great job, woh. This is fantastic. I added a link to your web page on the 'Top Rated Players' page. I agree with Karl that we should start using the WHR HvH ratings for seeding future tournaments.
Tuks
Forum Guru
Arimaa player #2626
Posts: 203
Re: Whole History Ratings
« Reply #27 on: Mar 13th, 2009, 9:47am »
You might want to revise it, though: no matter how much convincing you do, you will not convince me that "Rabbit", who happened to win 3 out of 3 human games, has any chance against any of the top players in a non-postal match. Other than that, I like it; it presents an actual rating list I want to be at the top of.
mistre
Forum Guru
Posts: 553
Re: Whole History Ratings
« Reply #28 on: Mar 13th, 2009, 11:10am »
Great contribution, Woh! Your personal graph looks like the current stock market! LOL. Perhaps there should be a minimum number of games required to be ranked, which would fix the "Rabbit" situation.
« Last Edit: Mar 13th, 2009, 11:11am by mistre »

omar
Forum Guru
Arimaa player #2
Posts: 1003
Re: Whole History Ratings
« Reply #29 on: Mar 18th, 2009, 7:29pm »
Ever since woh posted his results of using WHR on the HvH games, I've been tempted to improve my position on that list. Which of course means I have to start playing more HvH games; I plan to do that once the events ease up. Even though I don't care too much about my rating and play experimental games against bots as rated games, I think almost all the rated HvH games I've played were taken seriously. Thus the data going into the WHR rating system was pretty good, and the ratings that come out of it will probably be more accurate. I feel very comfortable with using these ratings to seed human tournaments.

A key feature I like about WHR is the ability to retroactively unrate games; I'll explain in a bit why this is so good. I've started to view our current rating system as an unofficial, superficial rating system which simply serves to provide immediate feedback of ratings to new users. Also, for people who like to boost their ratings against bots, it gives them a longer-term goal and initially will help them get better. Eventually they will learn which ratings really matter and find yet another goal to try for.

There is the possibility that people will eventually try to distort their WHR HvH ratings too. It would not be that hard to do, by creating multiple accounts and sacrificing the ratings of a few accounts to boost the rating of one account. I think we might eventually need to keep a flag with every game that indicates whether the game should be used in WHR rating calculations or not. This would allow us to retroactively exclude rated games that look suspicious. Eventually we could even mark legitimate rated HvB games for inclusion in WHR.

So woh, how can we get daily updates of our WHR ratings? If you want, I can set up to run the calculations on the arimaa.com server once a day.