Arimaa Forum - Ratings distortion due to selection of opponents


Topic: Ratings distortion due to selection of opponents (Read 4383 times)

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Ratings distortion due to selection of opponents
« on: Sep 25^th, 2004, 9:00am »

I'm starting a new thread in order to emphasize that there is a more serious problem with the current rating system than ratings deflation. The discussion in the ratings deflation thread has produced some interesting ideas about how to anchor the ratings system with a pool of bots, but there is a more fundamental issue that needs separate treatment. In that thread, Omar says:

on Sep 21^st, 2004, 3:52pm, omar wrote:

* The ELO rating system would work fine as long as the players are not allowed to pick their opponents and the opponents are picked for them (as it happens in tournaments).

* When the players are allowed to pick their own opponents, the rating system can be abused by repeatedly playing the same or small group of opponents.

* When computer opponents are also added to the mix, it makes the problem even worse, because once a player learns how to defeat one they can do it again and again, since the computer opponent will never figure out why it is losing and adapt itself (at least with the current computer opponents).

* The meaning of a player's rating should be how they have performed against the field. The more different opponents that a player has played the more meaningful and reliable the rating is. If only a few opponents have been played the rating is not very meaningful.

I think most of us can, from our own experience, feel the truth of what Omar says. If there is a bot you can consistently beat, you can push your rating up and up by playing it over and over. The inverse is also true: if there is a bot that beats you consistently, you will push your rating down and down by playing it. This process distorts the rating of the player who can beat the bot relative to the one who can't. Their ratings will predict a huge difference in skill, but when the humans play each other, they will be closer than the ratings indicate.

So Omar has made a new system, and, like a good computer scientist, has given you the code so you can test it for yourself and see that it is a better system: http://arimaa.com/arimaa/rating/testRatings.tgz

I, on the other hand, am a mathematician. I want you to look at the formulas so you can convince yourself theoretically that Omar's new system must be a better system, regardless of how it performs in practice.

I have trouble prying things out of the perl, but I may have the gist of it.

The central idea is to not have every game affect your rating equally. Games you play against a frequent opponent should count for less. To implement this idea means that a new formula can't just look at the current rating of each of the two players in a game and adjust based on those, as is now the case. Instead we have to look back in the game history.

In the proposed system, in order to calculate your new rating after each game, we look back at all the games you have ever played, assign a relative weight to each, and then use the whole body of results to compute a new rating. Once you digest this basic procedure, the interesting question is how to weight each game.

Clearly historical games should be weighted less than more recent ones. It seems natural to multiply old games by something like (0.999)^d, where d is the number of days old a game is.

More subtly, it might make sense for games that were earlier in sequence to be weighted less. For example, suppose you play twenty games all on the same day. Maybe the last of the twenty games should count for significantly more than the first, even if they were not separated very much in time. Perhaps a second factor, multiplied on top of the time factor, would be (0.999)^s, where s is the number of games ago in the sequence.

Most importantly, however, N games against a single opponent should be weighted less than a single game against each of N different opponents. I believe Omar is using the formula that a single game against each of N opponents has a weight equal to N^2 games against a single opponent. So if I play one game each against 99of9, clauchau, and naveed, those games collectively will have a weight equal to nine games against speedy.

This last weighting is the crucial feature to differentiate it from the current system. To play over and over against one opponent has diminishing returns. Indeed, Omar has actually proposed a hard cap on the amount one opponent can affect one's rating.
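The rule above can be read as "n games against one opponent count for sqrt(n) games". A minimal Python sketch of that reading follows; the function name and the dictionary layout are illustrative assumptions and are not taken from Omar's scripts.

```python
import math

def total_weight(games_per_opponent):
    """Total weight of a record if n games against one opponent
    count for sqrt(n) games (an assumed reading of the N^2 rule)."""
    return sum(math.sqrt(n) for n in games_per_opponent.values())

# One game each against three different opponents.
spread = {"99of9": 1, "clauchau": 1, "naveed": 1}
# Nine games against a single opponent.
repeat = {"speedy": 9}

print(total_weight(spread))  # 3.0
print(total_weight(repeat))  # 3.0
```

Both records carry the same total weight, which is the sense in which three games against three opponents equal nine games against one.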

My next post (maybe not today) will be about "finding cracks" in the proposed implementation, but I wanted to say more about the basic idea first, so that folks have a chance to think about it. Supposing that we were going to calculate the rating based on a weighted average of all past games, how would you assign those relative weights?

« Last Edit: Sep 25^th, 2004, 3:41pm by Fritzlein »

clauchau
Forum Guru

bot Quantum Leapfrog's father

Gender:

Posts: 145

Re: Ratings distortion due to selection of opponen
« Reply #1 on: Sep 25^th, 2004, 12:58pm »

I had (0.5)^s in mind among games involving the same pair of players. In other words, every time I play a game against Omar, my previous games against him are made worth half what they were worth.

Weights of (0.999)^s wouldn't be reactive and short-term rewarding enough to me. At least for the weights relative to pairs of players.

I think it would solve the problem of over weighting the games involving the same pair of players. My single game against Fritzlein would be worth 1 and my 93 games against bot_Speedy would be worth almost 2 (before we possibly further apply age weights).
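As a quick check of those two figures, here is a small Python sketch of the halving rule. The function name is hypothetical; it just sums (0.5)^s over the games played against one opponent, with s = 0 for the most recent game.

```python
def pair_weight(n_games, decay=0.5):
    """Total weight of n_games against one opponent when the game
    played s games ago (s = 0 is the latest) has weight decay**s."""
    return sum(decay ** s for s in range(n_games))

print(pair_weight(1))   # 1.0
print(pair_weight(2))   # 1.5
print(pair_weight(93))  # 2.0 to within floating-point rounding
```

However many games are played against the same opponent, the total can never exceed 2 with a decay of 0.5.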

« Last Edit: Sep 25^th, 2004, 1:00pm by clauchau »

maker
Forum Full Member

bot_tod's maker

Gender: male

Posts: 21

Re: Ratings distortion due to selection of opponen
« Reply #2 on: Sep 25^th, 2004, 3:03pm »

This is an excellent idea.

I love the thought that multiple games against a single opponent should create a lower rating increase or decrease. This allows small differences in ratings to actually mean something for different players. Also, it would stabilize the bots' ratings fairly well.

However, I don't believe that the current ideas about the time ratings are very good. I believe they're an improvement over the current system, but it shouldn't be so quadratic. Perhaps we could use a more linearized version of (.999)^quantity. Maybe this would be better: (.999)^quantity + (.999) * quantity. This allows for a nice curve, but just not quite so abrupt a one. It allows more recent games to have a greater impact on the score, but also allows games not-quite-so-recent to count significantly.

Overall, I believe that this is an excellent idea.

maker

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Ratings distortion due to selection of opponen
« Reply #3 on: Sep 25^th, 2004, 3:40pm »

I'm with you clauchau, although not to such an extreme. The way you suggest it, if you beat me eighteen times and then lose to me twice, your effective record against me would be 0.5 - 1.5. It looks like I am dominating you despite winning only 10% of our games. At most I could see a factor of (0.9)^s, and even that makes the most recent game count for at least a tenth of the total. It could still be a bit too reactive.

The part I like, though, is having the sequential decay apply mostly (or entirely) within games against the same opponent. In fact, now that I think about it, I might like it even as a replacement for the square root idea. On the other hand, the advantage of the square root idea is that you don't approach the maximum weight so quickly. If I play nine games with a decay of (0.9)^s, that already makes the total weight of them 6.1 games, whereas with a square root it would only be 3 games. Hmmm...
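The numbers in the two paragraphs above can be reproduced with a short Python sketch. The helper name and the list-of-results layout are assumptions made for illustration only.

```python
import math

def decayed_record(results, decay):
    """results: 1 for a win, 0 for a loss, oldest game first.
    Returns (weighted wins, weighted losses); the latest game has weight 1."""
    wins = losses = 0.0
    for s, result in enumerate(reversed(results)):
        weight = decay ** s
        if result:
            wins += weight
        else:
            losses += weight
    return wins, losses

# Eighteen wins followed by two losses, halving the weight each game.
wins, losses = decayed_record([1] * 18 + [0] * 2, 0.5)
print(round(wins, 2), round(losses, 2))                   # 0.5 1.5

# Nine games against one opponent: geometric decay vs. square root.
print(round(sum(0.9 ** s for s in range(9)), 1))          # 6.1
print(math.sqrt(9))                                       # 3.0
```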

Certainly I prefer either of these to the way Omar capped the influence of any given opponent by a formula that makes the total weight of all your games against a given opponent decrease after a certain number of games. I think the total weight should taper off, but never decline from playing more. A win should always be a plus, however slight, but with Omar's latest, a win against a frequent opponent can actually hurt you slightly. (This is crack #1 in his proposal, IMHO)

I think I might lobby for the application of four weighting factors. First apply all three multiplicative weights (in any order, since multiplication is commutative):
(0.999)^d where d is the number of days old the game is
(0.998)^g where g is the number of games old the game is in your own history
(0.95)^s where s is the number of games old the game is in your history against that opponent

Then for each opponent sum the (already lowered) weights against that opponent and divide all weights of games against that opponent by the square root.

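Doing it this way ensures that no matter how many games you play against a given opponent, all of those games together account for a weight of at most about 4.39 times the weight of playing one game against a new opponent. (Ignoring the day factor, consecutive games against one opponent decay by 0.95 × 0.998 = 0.9481 each, so their raw weights sum to at most 1/0.0519 ≈ 19.3, and the square root of that is about 4.39.) So playing new opponents will always affect your rating more.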

At the same time, a single game against a frequent opponent counts for at least 1/20 of your total against that opponent no matter how many times you've played them. (If you've played them a bunch, its weight will be about 0.2 times the weight of a game against a new opponent.) Not as volatile as clauchau suggests, but there is a balance between the excitement of fast-moving ratings, and the meaningfulness of more stable ratings.

Incidentally, the stability of a rating would be proportional to the sum of the weights of all the games after these factors have been applied. So the sum of the weights should intuitively match our idea of accuracy.
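A rough Python sketch of the four factors described in this post, under stated assumptions: games are listed newest first as (opponent, days_old) pairs, and the function and parameter names are hypothetical. It is not Omar's implementation, only a way to check the 4.39 and 0.2 figures.

```python
import math
from collections import defaultdict

def game_weights(games, day=0.999, seq=0.998, pair=0.95):
    """games: list of (opponent, days_old), newest first.
    Returns one weight per game, in the same order."""
    seen = defaultdict(int)          # games already counted per opponent
    raw = []
    for g, (opp, days_old) in enumerate(games):
        s = seen[opp]                # games ago against this opponent
        seen[opp] += 1
        raw.append(day ** days_old * seq ** g * pair ** s)
    totals = defaultdict(float)      # summed raw weight per opponent
    for (opp, _), w in zip(games, raw):
        totals[opp] += w
    # Dividing by the square root makes each opponent's total sqrt(sum).
    return [w / math.sqrt(totals[opp]) for (opp, _), w in zip(games, raw)]

# 1000 games against a single opponent, all played today.
weights = game_weights([("speedy", 0)] * 1000)
print(round(sum(weights), 2))   # 4.39
print(round(weights[0], 2))     # 0.23

# A single game against a new opponent keeps its full weight.
print(game_weights([("naveed", 0)]))  # [1.0]
```

The sum of the returned weights is also the quantity the post proposes as a measure of how stable the rating is.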

« Last Edit: Sep 25^th, 2004, 3:44pm by Fritzlein »

clauchau
Forum Guru

bot Quantum Leapfrog's father

Gender:

Posts: 145

Re: Ratings distortion due to selection of opponen
« Reply #4 on: Sep 25^th, 2004, 5:25pm »

Oh yes this is satisfactory.

I still wonder what to do with those weighted results. Does it lead to the equation which Omar's scripts solve?

99of9
Forum Guru

Gnobby's creator (player #314)

Gender:

Posts: 1413

Re: Ratings distortion due to selection of opponen
« Reply #5 on: Sep 25^th, 2004, 6:41pm »

Can you prove that with this system a person's rating will NEVER go down due to a win?

What if a while ago you had a very low rating, and won against a high-rated player? Now you have a rating higher than his, and you win again against him. The original game, with the high ratings difference, suddenly receives less weight. The new win has hardly any ratings difference, so doesn't contribute much. Maybe your rating would go down?

I'm not that against this system. But it is untested and unproven, so I feel there may be loopholes or inconsistencies that make it perform worse than the current system.

By the way, I think your exponent based on the number of days is too high. After an entire year a game will still be worth more than 1/2. I'm sure most of us lost to shallowblue less than a year ago!! [but then again... I guess the games played exponent would dampen that out ... so maybe it's ok]
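For reference, the day factor on its own can be checked with a few lines of Python. The function names are illustrative; only the 0.999 base comes from the proposal above.

```python
import math

def day_weight(days, base=0.999):
    """Weight left on a game that is `days` days old."""
    return base ** days

def half_life(base=0.999):
    """Number of days after which a game's weight has halved."""
    return math.log(0.5) / math.log(base)

print(round(day_weight(365), 3))  # 0.694
print(round(half_life()))         # 693
```

So with the day factor alone, a game would need to be almost two years old before it counted for half as much as a game played today.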

As an alternative we could use the current system with a different Rating Uncertainty against each different opponent... gradually it would tail off to 1 so you wouldn't be able to exploit a bot anymore. Then at least it would still be ELO and comparable with other systems.

« Last Edit: Sep 25^th, 2004, 6:43pm by 99of9 »

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Ratings distortion due to selection of opponen
« Reply #6 on: Sep 25^th, 2004, 9:58pm »

Thanks for starting this discussion Karl. I think it will be very interesting to discuss this issue and try out possible solutions.

I would strongly suggest that everyone taking part in this discussion download this
http://arimaa.com/arimaa/rating/testRatings.tgz
and try out the scripts in it. The README file explains how to run the scripts. If you know a little perl you can easily try out different ideas.

Quote:

A win should always be a plus, however slight, but with Omar's latest, a win against a frequent opponent can actually hurt you slightly.

I think in my latest system (p7) I made sure that a rating never decreases no matter how many games you play against the same player. It levels off, but never decreases. I think that an earlier system (p4) had that problem.

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Ratings distortion due to selection of opponen
« Reply #7 on: Sep 25^th, 2004, 10:30pm »

I just ran this command to double check:
testit p7

and here is what it shows.

N Rating
1 1612
2 1667
5 1728
10 1763
20 1785
50 1796
100 1798
1000 1798

N = Number of consecutive wins against a single 1000 rated player (the player's rating is fixed at 1000 and does not change).

Now if each of those wins is against a different player here is what happens.

N' Rating
1 1612
2 1730
5 1878
10 1978
20 2060
50 2123
100 2136
200 2137
1000 2137

So against the same player it levels off at 798 points above the opponent's rating. Against different players it levels off at 1137 above the opponent's rating. But it never drops. Another important thing to notice is the effort required to get 798 points above an opponent. If it is the same opponent it takes about 50 games to get to that level, but if it is different opponents it takes only 5 games (and you actually get 878 points above). Cool.

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Ratings distortion due to selection of opponen
« Reply #8 on: Sep 25^th, 2004, 11:22pm »

Quote:

I still wonder what to do with those weighted results. Does it lead to the equation which Omar's scripts solve?

I think the equation you are referring to is:

k1*(w1-W(r1-RP)) + k2*(w2-W(r2-RP)) + ... + kN*(wN-W(rN-RP)) = 0

where:
r1 to rN are the ratings of the opponents
w1 to wN are the results of the games (0, 0.5, 1 for loss, draw, win)
k1 to kN are weights assigned to each game
W() is the usual Elo winning expectancy formula.

You have to accept this equation as a given. Each term in the equation represents a game. We know the ratings of the opponents and the results of the games, and so we are trying to find what rating (RP) should be assigned to this player so that it best matches the player's performance in these games. If we accept this equation then we get a bunch of weights that need to be assigned to the games. That's what we are discussing now: how should we assign the weights to these games?

BTW, I didn't come up with this formula. It is referred to as a performance rating formula and is commonly used to determine how a player performed in a tournament. In such calculations the weights are all set to 1.

However I have never seen this equation used as the main equation for the rating system with the weights set to different values based on things like how old the game is, how many games were played after this game, how many games were played with this same player, etc.

The main equation of most rating systems is a simple formula that computes the new rating based on the old rating, opponents rating and the game result. This is basically how the current Arimaa rating system is also.

Having a simple main equation for the rating system would have made it feasible to compute the ratings in Elo's days when computers were not around. It would have been next to impossible to use the above equation as the main equation of a rating system if computers had not been available (especially when the weights are different). I will venture to guess that Elo would have preferred to use the above equation if computers had been around in his day.
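To make the equation concrete, here is a minimal Python sketch that solves it for RP by bisection. It assumes W(r - RP) is the standard Elo expectancy 1 / (1 + 10^((r - RP)/400)) for the player being rated; the function names, the search bounds, and the tuple layout are illustrative choices, and this is not the code in testRatings.tgz.

```python
def expected(rp, r):
    """Elo expectancy of a player rated rp against an opponent rated r."""
    return 1.0 / (1.0 + 10 ** ((r - rp) / 400.0))

def performance_rating(games, lo=-4000.0, hi=6000.0, tol=1e-6):
    """games: list of (opponent_rating r, result w, weight k).
    Finds RP such that the sum of k * (w - W(r - RP)) is zero."""
    def surplus(rp):
        return sum(k * (w - expected(rp, r)) for r, w, k in games)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if surplus(mid) > 0:   # scored more than predicted: RP is too low
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Equal weights: one win and one loss against 1500 gives 1500.
print(round(performance_rating([(1500, 1, 1.0), (1500, 0, 1.0)])))  # 1500
# Doubling the weight of the win pulls the rating up.
print(round(performance_rating([(1500, 1, 2.0), (1500, 0, 1.0)])))  # 1620
```

Bisection works here because the predicted score only increases as RP increases, so the weighted surplus crosses zero at most once.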

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Ratings distortion due to selection of opponen
« Reply #9 on: Sep 27^th, 2004, 8:59am »

I share 99of9's concern that any untested system may have loopholes and inconsistencies that we don't anticipate. He asks, for example, whether it is certain that a rating really will never go down after a win. This is a good question because it would be certain to never go down if the only weighting factors were the exponential decay by days, and by sequence of games. Since your entire history of games is losing its weight in the same proportion, the relative importance of all past results stays the same. Then, yes, a win will always boost your rating, however slightly.

When I proposed having games against an individual opponent lose weight faster than other games, I didn't realize that I was destroying this property. Now it could happen that you have excellent results against a frequent opponent (perfect or near-perfect), but poor results against the rest of the field. When you play that frequent opponent again, your good results against him are knocked down by a greater factor than your poor results against the rest of the field, which can hurt you slightly. Since you are entering a new good result against him, that will more than compensate for the loss, unless his rating has suddenly drastically declined from the times you beat him before. But, yes, in that one case it is just barely theoretically possible for a win to hurt your rating.

In practice this seems extremely unlikely. One can probably assume that players won't gain or lose 1000 points from where they were before. It would never be like the silly situation with the world tennis rankings where someone can win the French Open and still drop from first to second in the rankings. I doubt that a win would cost anyone rating points once in a thousand games. Still, perhaps the concern that it could happen is enough to make us want to scrap the idea of game weights decaying at different rates.

In any case, I've discovered what I think is a bigger potential loophole, which I will write about in my next posting since I have to run off to class now.

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Ratings distortion due to selection of opponen
« Reply #10 on: Sep 30^th, 2004, 11:43am »

OK, time for potential problem #2 with Omar's proposal. After the games have all been relatively weighted, the computation of the rating answers the question, "If my rating had been X for all of those games, what value of X would have predicted that I would win as many games as I actually won?"

For example (unweighted just for ease of computation) suppose I have a win against a 1414 player, a win against a 1636 player, and a loss against a player rated 1775. My record is 2-1. If my rating had been 1751, you would have predicted I would win
0.874 against the 1414 player
0.660 against the 1636 player
0.466 against the 1775 player
for a predicted total of 2 wins. Thus 1751 is a reasonable guess at my rating.

This is a nifty and conceptually simple calculation (although computationally tricky), but it has one problem. If a player has won (or lost) all their games, the only rating that would predict a perfect score is infinity.
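Both points can be checked with a few lines of Python, assuming the standard Elo expectancy formula with a 400-point scale (the function name is illustrative).

```python
def expected(rp, r):
    """Elo expectancy for a player rated rp against an opponent rated r."""
    return 1.0 / (1.0 + 10 ** ((r - rp) / 400.0))

opponents = [1414, 1636, 1775]

# Mixed record (2 wins, 1 loss): a rating of 1751 predicts the actual score.
predicted = [expected(1751, r) for r in opponents]
print([round(p, 3) for p in predicted])   # [0.874, 0.66, 0.466]
print(round(sum(predicted), 2))           # 2.0

# Perfect record (3 wins): every finite rating predicts fewer than 3 wins,
# so there is always a shortfall and no finite rating solves the equation.
for rp in (2000, 3000, 5000):
    shortfall = 3 - sum(expected(rp, r) for r in opponents)
    print(rp, shortfall > 0)              # True each time
```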

(oops, gotta run, more later on the disadvantages of omar's solution to the problem.)

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Ratings distortion due to selection of opponen
« Reply #11 on: Oct 5^th, 2004, 6:40am »

on Sep 30^th, 2004, 11:43am, Fritzlein wrote:

If a player has won (or lost) all their games, the only rating that would predict a perfect score is infinity.

But there is a fictitious draw game that is always added, which eliminates this problem.

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Ratings distortion due to selection of opponen
« Reply #12 on: Oct 5^th, 2004, 6:46am »

I think problem #1 was supposed to be that a win against a frequent opponent can hurt your rating. But as I noted earlier p7 does not have this problem.

So just to keep the record straight Karl, both the problems you mentioned are not present in p7.

Omar

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Ratings distortion due to selection of opponen
« Reply #13 on: Oct 5^th, 2004, 10:42am »

Sorry I didn't finish my thought. Yes, problem #1 was supposed to be that a win can reduce a player's rating. It's good that that is not present in your most refined approach -- I hadn't checked the equations myself.

Problem #2 is not that infinite ratings are possible in your proposal; the problem is with the fictitious draw against a zero-rated player, which is added to prevent an infinite rating. If the draw is against a zero-rated player, it will have the tendency to deflate the entire rating system to an average rating of zero.

Example: A new player joins and loses twice to Arimaazilla. I'm not sure what you intended to go in the record of Arimaazilla for the first game, but for the second game Arimaazilla gets credit for beating a player with a negative rating, i.e. almost no credit at all.

Then suppose the new player wins the next two. This will result in the new player having a rating almost equal to Arimaazilla, which is fine, but what happens to Arimaazilla? The bot gets penalized for a loss to a negative-rated player and to a still-rather-low-rated player. The net is a substantial penalty to Arimaazilla.

The situation becomes even worse if two new players each lose to Arimaazilla and then play each other for a while. They each get results in their records of losing to negative-rated opponents, solidifying their ratings at that low level.

Furthermore, an unfortunate side effect might be that established players, understanding that new players are generally underrated, avoid playing against new players because they don't want their own ratings to take a hit on average. If I have a rating of 2000, and I play against someone with a rating of, say, -200, I need to have true winning odds of about 300000 to 1 to make it a fair proposition in terms of hurting or helping my rating. If my true winning odds are only 1000 to 1, then I will, on average, lose rating points by playing that opponent.
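The odds quoted above follow directly from the Elo expectancy formula. A small Python sketch, with hypothetical function names and the usual 400-point scale assumed:

```python
def fair_odds(my_rating, opp_rating):
    """Win-to-loss odds at which Elo expects no net rating change."""
    return 10 ** ((my_rating - opp_rating) / 400.0)

def win_probability(odds):
    """Convert odds of `odds` to 1 into a winning probability."""
    return odds / (odds + 1.0)

odds = fair_odds(2000, -200)
print(round(odds))   # 316228

# If the true odds are only 1000 to 1, the actual score falls short of
# the score the ratings predict, so rating points are lost on average.
print(win_probability(1000) < win_probability(odds))   # True
```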

Because of the tendency of all ratings to gravitate toward the rating of the fictitious draw, I would strongly suggest having that draw be against a 1500-rated player, or a rating that we want to be the average rating.
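The pull of the fictitious draw can be illustrated with a rough Python sketch. It assumes the draw has weight 1, that all real games are unweighted, and that the opponent is rated 1506; the actual weights in p7 are different, so the numbers below are only indicative and will not match Omar's script output.

```python
def expected(rp, r):
    """Elo expectancy for a player rated rp against an opponent rated r."""
    return 1.0 / (1.0 + 10 ** ((r - rp) / 400.0))

def rating_with_anchor(games, anchor, lo=-4000.0, hi=6000.0):
    """games: list of (opponent_rating, result). One fictitious draw
    against `anchor` (weight 1, an assumption) is added before solving."""
    record = games + [(anchor, 0.5)]
    for _ in range(100):               # bisection on the surplus score
        mid = (lo + hi) / 2
        surplus = sum(w - expected(mid, r) for r, w in record)
        if surplus > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

two_losses = [(1506, 0.0), (1506, 0.0)]
print(round(rating_with_anchor(two_losses, 0)))      # close to 0
print(round(rating_with_anchor(two_losses, 1500)))   # about 1224
```

Under these assumptions the same two losses leave the newcomer near 0 with a zero-rated draw, but near 1200 with a 1500-rated draw, which is the gravitation described above.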

Omar, I think perhaps you were focusing on how much someone has to work to get a high rating, rather than focusing on the more typical case of fairly weak players entering and trying to work their way up the ladder. If it is a major concern that it is too easy to get a high rating when the only "ballast" is a fictitious draw against a 1500-rated player, then let's add two or three fictitious draws.

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Ratings distortion due to selection of opponen
« Reply #14 on: Oct 9^th, 2004, 1:40pm »

Quote:

Example: A new player joins and loses twice to Arimaazilla. I'm not sure what you intended to go in the record of Arimaazilla for the first game, but for the second game Arimaazilla gets credit for beating a player with a negative rating, i.e. almost no credit at all.

Then suppose the new player wins the next two. This will result in the new player having a rating almost equal to Arimaazilla, which is fine, but what happens to Arimaazilla? The bot gets penalized for a loss to a negative-rated player and to a still-rather-low-rated player. The net is a substantial penalty to Arimaazilla.

Actually this is not as much of a problem as it may seem. I actually tried it out:

gr bot_Arimaazilla | p7

shows that Arimaazilla's current rating is 1506 using the p7 rating system (at least at the time of this writing). When it wins two games against a 0 rated player its rating does not go up at all. Likewise the 0 rated player's rating also does not go down much.

Now here is where this rating system is so different from the rating system that we are used to. When a player has not played many games, this rating system lets the ratings move very fast. So in the next two games, when the new player wins, his rating goes up very fast but Arimaazilla's rating does not go down as much. Let's see what happens to Arimaazilla's rating after losing the first game.

gr bot_Arimaazilla | rep '-0 newPlayer' 1 - | p7

shows that it would go down to 1472. Now let's see what happens to the new player's rating after he wins the first game.

rep '+1506 bot_Arimaazilla' 1 '-1506 bot_Arimazilla' 2 | p7

The new player's rating pops up to 1453 from -3. After the second game the ratings would be: 1498 for the new player and 1471 for Arimaazilla. So it is not really that bad.

I don't mind using 1500 as the fictitious draw rating rather than using 0. The actual number does not matter too much. I think it is more important to keep that number fixed and not change it.

In your example you are assuming a situation where we don't have a bunch of fixed rated 'dummy bots' for new players to play against. Once we have such bots we can make it so that only after playing about 10 rated games with such bots do the other rated games begin to count. Otherwise the player can still play against whoever they want, but the games won't be rated until the player completes the provisional 10 games. I'm assuming we will have fixed rated dummy bots up to the level of shallowBlue or maybe even something between shallowBlue and Arimaazilla. Thus the new players will quickly bring up their rating based on their performance. Since this rating system moves the ratings so fast in the beginning it works best if we have the new players play some provisional games before counting their other rated games.

So in this situation we will not have the problem of established players avoiding new players or new players establishing low ratings by only playing other new players. And it will not make much difference what we choose the fictitious draw rating to be. Which is good because we don't want that to be a significant factor in the rating system. So we don't have to bias it at all.

« Last Edit: Oct 9^th, 2004, 1:44pm by omar »
