Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> General Discussion >> Arimaa rating deflation
(Message started by: fotland on Oct 11th, 2003, 2:44pm)

Title: Arimaa rating deflation
Post by fotland on Oct 11th, 2003, 2:44pm
It seems like the established players have gotten a lot stronger lately.  Certainly the ratings of the bots have been pushed way down, and it's not easy for me to keep Speedy improving as fast as the top players.

This has made the rating scale much tougher.  A 1500 player is much stronger today than 6 months ago.

I'm concerned that since new players come in at 1500, they might get discouraged when they end up with much lower ratings after a few games.

My suggestion is to either have new players come in with a lower provisional rating, like 1400 or 1300, or add 100 or 200 rating points to everyone.

-David

Title: Re: Arimaa rating deflation
Post by omar on Nov 19th, 2003, 12:43am
The choice of using 1500 for the initial rating was kind of arbitrary. But even if I set the initial ratings to 1300 the beginners will still lose to the easy bots until they get used to basic ideas of the game (and still get discouraged).

I really think that beginners would be able to consistently beat the easy bots if they went through an introduction to the game, learned the basic principles, and did some practice puzzles.  I'm working on such pages as time permits.

The one thing that bothers me about this whole rating system is that it's not anchored to anything. What does a rating of 1500 mean anyways? Only rating differences have a meaning which can be translated to probability of outcomes. But the absolute rating values don't really have any meaning; so the whole scale could be shifted to whatever we want.

I've been thinking of anchoring the scale so that a program which plays a completely random game (i.e. generates the list of all possible moves and randomly selects one) is defined as having a rating of 0.  This program would be allowed to play rated games, but its rating would never change; only the ratings of its opponents would change.  Layers of progressively better (but still quite naive) programs would establish their ratings based on this program and each other.  The beginners could then be given ratings comparable to the better of these programs.  The intermediate players would establish their ratings based on the beginners and each other and so on up to the advanced players.

This would provide a floor so that ratings can't get deflated. Also it would keep the ratings scale from drifting over time.
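A minimal sketch of how such a frozen anchor could behave, assuming the standard Elo logistic curve and a K-factor of 32 (the gameroom's actual formula may differ):

```python
# Sketch of a rating update with a frozen anchor: games against the
# anchor change only the opponent's rating.  Standard Elo expectation
# with an assumed K-factor of 32; the gameroom's formula may differ.

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def play_vs_anchor(player_rating, anchor_rating, player_score, k=32):
    """Return the player's new rating after a game against the anchor.

    player_score is 1.0 for a win, 0.0 for a loss; the anchor's own
    rating is never updated."""
    e = expected_score(player_rating, anchor_rating)
    return player_rating + k * (player_score - e)
```

A 1500 player beating the rating-0 anchor gains almost nothing (the model already expects the win), while a loss costs nearly the full 32 points, which is what pins the bottom of the scale in place.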

Omar

Title: Re: Arimaa rating deflation
Post by fotland on Nov 19th, 2003, 10:17pm
I think your random bot would be worse than you think :)  It would never win a single game, so it would cause the rest of the ratings to inflate forever.

Is it really a problem that the ratings are arbitrary and not anchored?

You have an anchor anyway, since shallow blue has a rating that never changes.

Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 20th, 2003, 4:51am
You're right, random play is seriously bad!  But I think that's why we'd need other naive bots.   And actually, other ratings wouldn't inflate forever anyway, as long as the system was similar to the one now - even a 100% win rate reaches an equilibrium.

Shallowblue doesn't really have a proper rating, it's only played a few rated games.  And I don't think it's really fair to say that something not even participating in the ratings is somehow "anchored".

But we do kind of have an anchor for our whole distribution.  Namely the average rating should always stay at 1500 (as long as newbies always come in with 1500).  Now, fair enough, this means a new player has to drop before (s)he can gain, but changing the incoming rating to 1200 wouldn't help because then, over time, the average rating would become 1200 (after enough new users had been added).  Then the same problem could occur with newbies dropping rapidly below 1200 at first.  This is always going to be the case in a conservative ratings scheme.
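That conservation property is easy to check numerically: with a symmetric K-factor update, whatever one player gains the other loses, so if everyone enters at the same value the population mean never moves.  A toy simulation (assumed K=32, standard Elo expectation):

```python
# Toy check that a symmetric K-factor update conserves rating points:
# the winner gains exactly what the loser drops, so if everyone enters
# at 1500 the population mean stays at 1500 forever.
import random

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def play(ratings, i, j, score_i, k=32):
    """Apply one game result between players i and j in place."""
    delta = k * (score_i - expected_score(ratings[i], ratings[j]))
    ratings[i] += delta
    ratings[j] -= delta

random.seed(0)
ratings = [1500.0] * 10              # everyone enters at the same value
for _ in range(1000):
    i, j = random.sample(range(10), 2)
    play(ratings, i, j, random.choice([0.0, 1.0]))

mean = sum(ratings) / len(ratings)   # still 1500 (up to rounding)
```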

So I think Omar's suggestion has merit.  The random bot at 0 would be nice for theoretical prettiness, but it might take a long time for the ratings to equilibrate, so maybe it's better just to set a bot we all know - say Arimaazilla - to 1500.


Title: Re: Arimaa rating deflation
Post by omar on Nov 28th, 2003, 10:57pm
Well I've played a random bot against itself and the games usually end in about 30 to 60 moves, which is not that different from normal games.

I have not done any experiments to see how a random bot compares to a 1 ply bot. But even if a 1 ply bot won 100% of the games we could always make a weaker bot from the 1 ply bot by having it look 1 ply with a fixed probability and select random moves otherwise. By using different values for this probability we can get a range of bots between the random bot and the 1 ply bot. Similarly we could produce a range of bots between a 1 and 2 ply bot. I have not had the time to do any such experiments yet. If anyone is interested in doing them and reporting the results, it would give us a good idea of how many layers of naive bots we might need before we get to bots that play like human beginners. Keep in mind that shallowBlue is a 1 ply bot and Arimaazilla and Arimaanator are 2 ply bots (differing only in the evaluation function). The 2 ply bots definitely pose a good challenge for beginners.

It could turn out that in an anchored rating system the ratings of average players may be much higher than what they are now. But they would not increase indefinitely.

An anchored rating system which does not drift over time and is independent of the current population of players allows somewhat reasonable comparisons of player ratings from different time periods. In chess for example people often speculate about how Fischer would do against Kasparov. These kinds of comparisons could be made if the chess rating system had been anchored.

Also if other games adopted an anchored rating system it may allow us to make comparisons between games of their levels of complexity. For example the ratings of the best Go players may be much higher than the ratings of the best Chess players. Or maybe they would be the same; who knows. But still it would be an interesting comparison.

Omar

Title: Re: Arimaa rating deflation
Post by fotland on Nov 29th, 2003, 1:55am
I don't think it's possible to make a stable rating system unless there is an unchanging player who plays a lot to be the anchor.  NNGS uses a group of older players who aren't improving as anchors.  But Arimaa doesn't have that yet.   I wouldn't recommend anchoring from the bottom, since a random player is so much weaker.  It would push up all the ratings, and cause a lot of instability until ratings stabilized at the new levels.

The go rating system is anchored at the top, at 9 dan.  This works because it is an old game, and the very strongest players are very close.


Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 29th, 2003, 7:16am
But that Go system cannot really be called anchored, just upper bounded.  If a new amazing (computer?) player that never lost came along, he would be 9 dan, and everyone else would deflate.

Title: Re: Arimaa rating deflation
Post by clauchau on Nov 29th, 2003, 12:53pm

Quote:
we can get a range of bots between the random bot and the 1 ply bot.

One problem is - we can get an ordered chain of millions of bots between any two fixed bots such that any bot in the chain always loses against the next bot in the chain...

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 29th, 2003, 2:24pm
To have a "fixed" rating scale, one needs only to have one robot whose style never changes.  A completely random bot seems to me to be the best standard that we could implement.  The bot should always play rated games, but its rating should be fixed and never change.

But what does "completely random" mean?  There are two possibilities:
1.  Each possible move order that leads to a legal position (a change in position from the previous move) would be chosen with equal probability.
2.  Each possible legal position that could result after any move order would be chosen with equal probability.  Then any move order that achieves this position could be chosen.

For the first move of the game (placement of pieces), this distinction is moot, as either #1 or #2 would lead to the same starting positions with the same probability.  However, for subsequent moves, this distinction does make a difference.  For example, consider a move where four different animals take one step forward, compared with a move where the elephant takes four steps forward.  Using #1, the four-animal move would be 24 times as likely (4!) as the four-space elephant move, since there are 24 different move orders that achieve the end position of the four-animal move, but only one move order that achieves the 4-space elephant move end position.  With #2, both end positions would be equally likely.
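The 24:1 bias is easy to demonstrate with an abstract toy model (the "positions" and "step orders" below are stand-ins, not real Arimaa moves):

```python
# Abstract illustration of method #1 vs #2: position "A" is reachable
# by 24 step orders, position "B" by only 1.  Sampling uniformly over
# orders (#1) makes A about 24 times as likely; sampling uniformly over
# distinct positions (#2) makes them equally likely.
from collections import Counter
import random

orders = [("A", k) for k in range(24)] + [("B", 0)]   # (position, order id)
positions = sorted({pos for pos, _ in orders})

random.seed(1)
n = 100_000
method1 = Counter(random.choice(orders)[0] for _ in range(n))
method2 = Counter(random.choice(positions) for _ in range(n))

ratio1 = method1["A"] / method1["B"]   # close to 24
ratio2 = method2["A"] / method2["B"]   # close to 1
```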

So I would recommend making a bot "bot_random" that follows strictly #1 or #2, and setting its rating to a low value (1000 seems too high to me, maybe 600 is better).  Then, allow it to play rated games, but have its rating never change.

By the way, which of #1 or #2 was implemented before?  Were either?  (It seems to me that it would be easy to implement something that approximates either #1 or #2, but more difficult to implement #1 or #2 strictly.)

Title: Re: Arimaa rating deflation
Post by fotland on Nov 29th, 2003, 3:00pm
I don't think you realize how very weak a random bot would be. Someone did some experiments with random go bots a few years ago, thinking that they would be 30 or 35 kyu, but they are much weaker.

If you set a random bot to 600, you might push up the entire rating system by several thousand points, and it would take forever to stabilize at the new level.

David

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 29th, 2003, 3:04pm
Well someone recommended 1000, and I thought that seemed too high.  That's why I said lower is better.  Perhaps a rating of 0 would aesthetically make the most sense, because any bot that does worse than that is playing worse than random, and therefore deserves a negative rating!

Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 29th, 2003, 5:59pm
Random bot would indeed be very bad.  I'd estimate, compared to the current ratings system, that it plays at a rating of approx -2000.  I tried making a random bot when I first made Gnobby, and it regularly walked pieces directly into traps.

As David points out, this makes any direct comparison with humans unreasonable.  Therefore to make it workable, we'd need intermediate bots.  Claude, I'd suggest that all intermediate bots were somewhat stochastic to prevent the problem you suggest.

If the ratings were to be reset, I agree that random should be set at 0.  Of Mr Brain's methods, I'd choose #2.

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 29th, 2003, 7:04pm
I'm not sure you understand how low a rating of -2000 is.  Even a random bot wouldn't be that low, as opponents would sometimes resign, lose on time, etc.  A rating of -2000 would mean that it would lose virtually every time to a player with rating -1200, and that player loses every time to a player with rating -400, and that player loses every time to... etc.  This is not realistic.  I think a rating of 0 would be a good place to start.  And you wouldn't really have to reset anything.  The ratings of existing players would drift to match the fixed system.  What might be needed, however, are some bots that are somewhere between random and poor.  Right now, shallowblue would beat a random bot just about every time, meaning that its rating would never go down to the level that you'd expect from random=0.
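For reference, the arithmetic behind "loses virtually every time" under the usual Elo logistic model:

```python
# The chain argument in numbers: under the Elo logistic model a
# 400-point underdog is expected to score about 9%, and an 800-point
# underdog about 1%.
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

p_400 = expected_score(-1200, -800)   # one 400-point step: 1/11, about 9%
p_800 = expected_score(-2000, -1200)  # two steps at once: 1/101, about 1%
```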

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 29th, 2003, 7:08pm
Actually, if you had shallowblue play rated games both against humans and against the random bot, I believe that you'd get a stable rating for shallowblue.  Right now, shallowblue's rating is too high, compared with other players.  I think a rating of about 800 to 1000 for shallowblue seems about right.  And that seems about the right number of points above random (0) for its strength to me.

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 29th, 2003, 7:29pm
Thinking about this some more.  Yes, I may be underestimating the poorness of random.  However, I think I prefer option #2 as well, which to me seems like it would be slightly stronger than #1, since on average it will tend to make moves that achieve more than with method #1.  The question is, how many "levels" of bots would be between random and shallowblue?  I believe that at worst, there would be a bot that could beat random 99% of the time, while it could beat shallowblue 1% of the time.  That would give shallowblue a rating of about 1600.  That would then be somewhat of a ratings inflation.

Another question then becomes, what rating do you give to new players?  Perhaps the current average of all players would make sense.

Another concern is the provisional ratings formula for new players.  If they were to play a bot more than 400 points below their current rating, they would lose points even after a win.  (USCF recently changed their formulas to avoid such weirdnesses.)  A solution might be to disallow rated games in such situations.  Or make such games automatically unrated.  Just a thought.
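A sketch of that quirk, assuming a performance-style provisional formula roughly like the old USCF one (an assumption, not the gameroom's documented formula): the provisional rating is the average over games of (opponent's rating plus 400 for a win, minus 400 for a loss), so beating someone more than 400 points below you drags the average down.

```python
# The "win but lose points" quirk under an assumed performance-style
# provisional formula: provisional rating = average over games of
# (opponent rating + 400 per win, - 400 per loss).

def provisional(games):
    """games: list of (opponent_rating, won) pairs."""
    return sum(opp + (400 if won else -400) for opp, won in games) / len(games)

games = [(1700, True), (1650, True)]
before = provisional(games)           # 2075
games.append((1200, True))            # beat a bot rated far below...
after = provisional(games)            # ...and the provisional rating drops
```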

Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 29th, 2003, 8:01pm

on 11/29/03 at 19:29:18, MrBrain wrote:
The question is, how many "levels" of bots would be between random and shallowblue?  I believe that at worst, there would be a bot that could beat random 99% of the time, while it could beat shallowblue 1% of the time.  


That's what I'm saying when I say that random would be approximately -2000...  I think there are 4 levels between random and shallowblue where each can beat the ones below.  Clauchau pointed out that there are actually an infinite number, but I'm talking about stochastic ones.  But anyway, it's only a guess based on seeing how random played a couple of times.

examples of the levels in between them are:

0) Random

1) Random, no suicide (NS)

2) Random, NS, Rabbits Forward (RF)

3) Greedy Killer, Otherwise Random NS

4) Greedy Killer, RF, Otherwise Random NS

5) Shallowblue

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 29th, 2003, 8:20pm
"An infinite number" would indicate that we are saying that a bot will beat another bot just a simple majority of the time.  But 51% victories does not a 400-point difference make.  I think if you defined a "whole new level" to mean winning 99% of the time, then I don't believe there would be more than 2 levels between random and shallowblue.

By the way, I created a bot that is worse than random, "bot_brain".  It plays by moving its elephant to the middle, and then in subsequent moves, moving its elephant one space clockwise, or if that's blocked, counterclockwise, and if that's blocked too, it resigns.  Yet even though this bot should, according to what you're saying, be rated less than -2000, it beat bot_occam, because bot_occam lost on time (don't know why).

Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 30th, 2003, 5:35am
Yes, it is possible to have a very large number of bots in between that beat all below them 100% of the time.  I think Clauchau was thinking of deterministic bots, where it's either 100%, 50% or 0%.

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 30th, 2003, 8:18am
I don't think there can be as many levels in arimaa as there are in go.  Even with a random bot, there's a good chance that it will do something good (move a rabbit forward).  In go, it seems that a random player would have almost no chance of doing anything constructive.  And from looking at the levels you have there, it's not clear to me that 1 would always beat 0. It seems that sometimes, 0 would play as well as 1 just by luck (having a friendly piece next to a trap before it walks into one).  Also, I believe that 3 would play better against 2 than it does against 1 since the rabbits moving forward would quickly be eaten.  Also, it's not clear to me that 3 would be any worse than 4, and could actually be better.

Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 30th, 2003, 10:09am

on 11/30/03 at 08:18:12, MrBrain wrote:
And from looking at the levels you have there, it's not clear to me that 1 would always beat 0. It seems that sometimes, 0 would play as well as 1 just by luck (having a friendly piece next to a trap before it walks into one).  Also, I believe that 3 would play better against 2 than it does against 1 since the rabbits moving forward would quickly be eaten.  Also, it's not clear to me that 3 would be any worse than 4, and could actually be better.


I'm convinced that 1 would beat 0 in 99% of cases.  Games between these 2 would be quite long, and the occasional material advantage would grow and grow.  Defending your own trap against suicide is not sufficient.  To get to the other side you have to go past the opponent traps...

I agree with you that the results between 3 and 4 are not clear - it was hard to make up those examples, because I couldn't really visualise the game - however I do know that when Gnobot was a primitive 1-ply, it still easily lost to Shallowblue.

Title: Re: Arimaa rating deflation
Post by MrBrain on Nov 30th, 2003, 10:15am
Yes, I think you have convinced me (although the examples you gave are not conclusive) that random would be significantly lower than 0 on the current scale.

On the other hand, 0 for completely random (meaning something different than what Omar suggested, i.e. random position selection as opposed to random move selection) is aesthetically very pleasing.  So despite the fact that it would significantly impact the rating scale, I think it should be done at some point.

A couple of things to note, however.  Sometimes bots lose on account of interface issues, on time, etc.  This may account for some unexpected results.  Also, it is also possible to have a "cycle" of bots (1 beats 2 a majority of the time, 2 beats 3, and 3 beats 1).  Such "cycles" would have a tendency to keep the ratings closer together than with a "chain" of bots who always beat the bot next in line.

Title: Re: Arimaa rating deflation
Post by omar on Nov 30th, 2003, 10:16pm
By random I was thinking of generating all valid board positions and then selecting one at random. Mostly because this is the first step before applying an evaluation function to each of the positions to get a 1 ply bot.

Also for the random initial setup. I think it would be OK for bot_random to use a semi-random setup like most of the good bots do. For example have it select one from a set of predefined initial positions or fix the position of the rabbits to the back row and randomly place the stronger pieces. This is just so that the initial setup does not become a big negative factor in its performance.

Fixing the rating of bot_random bot to 0 I think makes the most sense.

The next step would be to make a 1 ply bot that has a pretty good evaluation function and see what percent of the games it wins.

I think the games between such bots should be played offline using a




Title: Re: Arimaa rating deflation
Post by omar on Dec 1st, 2003, 12:29am
My keyboard at home stopped working. I rebooted the computer and it didn't help. Problem is I'm in the middle of a game with bot_Bomb. So I rushed over to the office to use the computer here. Good thing I had a lot of reserve time left.

I think the games between such bots should be played offline using a referee script in between which passes the move from one bot to the other. This avoids problems with games being lost due to network or server problems. Also it is going to require a lot of games to get good statistical numbers.

For the bots in between 0 and 1 ply I think we should use the 1 ply bot and pass it some parameter that specifies what percent of the moves it should look 1 ply deep. I prefer this over methods that use different evaluation functions because it is not specific to Arimaa and could be used with other games as well.

How the bot decides at each move if it should look 1 or 0 ply could be done stochastically or deterministically. The bot could do this stochastically by randomly picking a number before each move and comparing it to the given percentage to decide if it should look 1 ply or 0 ply deep. One way of doing it deterministically is to keep a running total of the number of moves it looked 1 ply and the number of moves it looked 0 ply and choose the next one to bring the ratio closer to the given percentage. I'm a bit inclined to use the deterministic method.
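The deterministic method can be sketched in a few lines (illustrative Python, not gameroom code):

```python
# Sketch of the deterministic mixing rule: keep a running count of
# 1-ply moves and pick whichever depth brings the overall ratio closer
# to the target fraction p.

def make_depth_chooser(p):
    deep = 0    # moves searched 1 ply so far
    total = 0   # moves played so far
    def choose():
        nonlocal deep, total
        total += 1
        if abs((deep + 1) / total - p) <= abs(deep / total - p):
            deep += 1
            return 1   # search 1 ply this move
        return 0       # play a random (0 ply) move
    return choose

choose = make_depth_chooser(0.10)
depths = [choose() for _ in range(40)]   # a 40-move game
# exactly 4 of the 40 moves come out as 1-ply searches
```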

I'm guessing that a bot that looks at 1 ply on 10% of the moves would only be slightly stronger than the random bot. In a 40 move game it would have looked 1 ply only on about 4 moves. Then a 20%, 40% and 70% bot would gradually get us up to a full 1 ply bot. Then a similar set of bots could get us from 1 ply to 2 ply. So we would only need about 10 bots to get to 2 ply bots (which we know is a good match for beginners). But this number is just a guess.

Once we have established a rating for the 2 ply bot we could then adjust the ratings in the gameroom and also have a good estimate of what rating to bring in new players. For example if the 2 ply bots rating in the gameroom was 1400 but among the simple bots it has a rating of 2400, then we just add 1000 to all the ratings in the gameroom and bring in new players at around 2000.

I've been thinking about this for awhile, but just have not had the time to create such bots and run the games. Maybe if we all work together on this we can get it done. Any takers?

Omar


Title: Re: Arimaa rating deflation
Post by fotland on Dec 1st, 2003, 2:15am
You can do this experiment with ariminator.  In the startup dialog you can see a box labeled "chance of blunder".  You can set the depth to 4 steps and the chance of blunder to 90%, and it will only consider 10% of the legal moves at the first step.

David

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 1st, 2003, 4:27am
This:

on 12/01/03 at 02:15:23, fotland wrote:
it will only consider 10% of the legal moves at the first step.


is a bit different to this:

Quote:
a bot that looks at 1 ply on 10% of the moves


Omar is talking about each time the bot steps up for its turn, it has a 10% chance of being a 1 ply bot, and a 90% chance of being a 0 ply bot.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 1st, 2003, 8:17am

on 11/30/03 at 22:16:04, omar wrote:
Also for the random initial setup. I think it would be OK for bot_random to use a semi-random setup like most of the good bots do. For example have it select one from a set of predefined initial positions or fix the position of the rabbits to the back row and randomly place the stronger pieces. This is just so that the initial setup does not become a big negative factor in its performance.

No, I strongly disagree.  If our standard is to be a truly random robot, then make it truly random.  There should be no predetermined strategy considerations or intelligence in the play of this bot whatsoever, either during the game, or for initial setup.

If you like, you could make an intermediate bot, slightly stronger than the truly random bot, which plays random, but has predetermined initial setups.  I think the difference in strength between these two bots would be negligible, however.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 1st, 2003, 8:20am

on 11/30/03 at 22:16:04, omar wrote:
By random I was thinking of generating all valid board positions and then selecting one at random. Mostly because this is the first step before applying an evaluation function to each of the positions to get a 1 ply bot.

Yes, this is my method #2 which we all seem to agree on.  However, great care must be taken when implementing this method so that each unique position different from the current position is generated once and only once.

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 1st, 2003, 8:45am
I'll contribute to the experiments.

Toby's, Omar's and David's ideas of partially random bots seem all interesting to me. Toby's is my favorite to explore Arimaa. I'll add essential variants that are as many improvements as human beginners are likely to go through:
    • the Stepper thinks at the step level only. He randomly picks a step from among all the valid steps, or from all the valid steps that maximize his good feeling about his position, and he repeats this four times to complete his move;

    • while the Mover is aware of the whole set of positions she can reach with a valid move (Brian, most bots here already do that);

    • the Infiltrator is happy with one rabbit going as far as possible, so she views the board as an integer from 1 to 8 - the row of her farthest rabbit;

    • while the Flooder is happy with getting many rabbits as far as possible, so he views the board as some 8-tuple of integers - how many rabbits he has got on the last row, how many on the row before, and so on;

    • the Attacker focuses on the advantages of his position only and maximizes the resulting value, for example his rabbits' rows or his material. An Attacking Infiltrator might be called a Pro-Infiltrator. In general an Attacker tends to play short games;

    • while the Defender strives to counter her opponent's advantages only and does not really try to get closer to winning. A Defending Infiltrator might be called an Anti-Infiltrator. In general a Defender tends to play long games;

    • and the Symmetrical is equally happy with maximizing her good feeling or minimizing her opponent's good feeling (take the difference "her value minus her opponent's value").

Unlike Omar, I'd rather call C++ playing functions in turn in the same program to share the data and not pass the whole board around between every move. That should be fast. Thousands of games would be as easy to collect as drinking a cup of tea meanwhile.

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 1st, 2003, 10:28am

on 12/01/03 at 00:29:10, omar wrote:
For the bots in between 0 and 1 ply I think we should use the 1 ply bot and pass it some parameter that specifies what percent of the moves it should look 1 ply deep. I perfer this over methods that use different evaluation functions because it is not specific to Arimaa and could be used with other games as well.


No matter what, it's not comparable with other games, because of the specific evaluation function you implement in your 1 ply eval.  If that is a good one, then your 100% 1 ply bot will be good, if that is a bad one, then your 100% 1 ply bot will be bad.

I admit that what you are suggesting is aesthetically nice, but I don't think comparison between games will be that easy.

Instead I would go for Eval functions which are as simple and easily described as possible, so that there are as few arbitrary parameters as possible.  (eg the concept of "no suicide" needs no parameters, the concept of "greedy for material" doesn't have many parameters).

Perhaps just go with Omar's approach using a purely materialistic evaluator (with win conditions also in eval)?  At sufficient depth (approx 50% 2 ply) this would eventually surpass shallowblue.  But don't expect to be able to compare these intermediate bots with their equivalents in other games.  Maybe it's still fair to ask how far above random are humans though.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 1st, 2003, 1:21pm
How about using the inherent scoring function in the game itself?  This takes into account material and rabbit position.  The score of a move could be s1-s2 where s1 is the score of our position and s2 is the score of the opponent's position.  (This approach simply utilizes something that's in the rules of the game already.  There is no arbitrary decision that has to be made about any kind of evaluation function.)

Now as far as the bots between 0 and 1 ply go, 0 ply is equivalent to choosing one position randomly out of all positions, evaluating, and then choosing the best one.  Since there's only one position that we've chosen, the best one is obviously that one no matter what the evaluation function says about it.

Now, a significantly stronger bot would be one that chooses just 2 positions randomly.  This bot would already have the ability to avoid suicides somewhat, and to advance rabbits.  It's unlikely that of two random positions, that the one with the higher score is one where a suicide occurred, or one in which an opponent's rabbit would be pulled further towards the goal.  (Omar, I noticed that your sample code did a lot more suicides and "helping" of opponent's rabbits than would be expected from the description (choosing 10 positions and evaluating to find the best one).  My guess is that some sign was switched somewhere in the code or something of that nature.)
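This "sample k positions, keep the best" family can be captured with a single knob.  The toy sketch below uses numbers as stand-in positions and the identity function as a stand-in evaluation, just to show that k=2 already beats k=1 on average:

```python
# The "sample k random positions, keep the best" family with one knob.
# Toy stand-ins: candidate positions are numbers 0..99 and the
# evaluation is the identity.  k=1 is the pure random mover; larger k
# climbs toward a full 1-ply search.
import random

def k_sample_best(candidates, evaluate, k, rng):
    sample = [rng.choice(candidates) for _ in range(k)]
    return max(sample, key=evaluate)

rng = random.Random(42)
candidates = list(range(100))
evaluate = lambda pos: pos

mean_k1 = sum(k_sample_best(candidates, evaluate, 1, rng) for _ in range(2000)) / 2000
mean_k2 = sum(k_sample_best(candidates, evaluate, 2, rng) for _ in range(2000)) / 2000
# mean_k1 is near 49.5; mean_k2 is near 66 (max of two uniform draws)
```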

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 1st, 2003, 4:10pm

on 12/01/03 at 13:21:45, MrBrain wrote:
How about using the inherent scoring function in the game itself?  This takes into account material and rabbit position.  The score of a function could be s1-s2 where s1 is the score of our position and s2 is the score of the opponent's position.  (This approach is simply utilizing something that's in the rules of the game already.  There is no abitrary decision that has to be made about any kind of evaluation function.)


Yes, I think that's a good idea.

Personally I think this scoring function overvalues rabbits heaps, but hey, if it's in the game already, we might as well use it.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 2nd, 2003, 8:22am
Actually, thinking about it some more, it's really not that important to me what kind of evaluation function we use for intermediate bots.  The only really important thing is that we have a very firm definition for a bot of rating 0, and I think we've done that.  "Chooses randomly from all possible new positions with equal probability, including the setup move."  As long as that is satisfied, intermediate bots can do whatever they want, really.

Title: Re: Arimaa rating deflation
Post by omar on Dec 2nd, 2003, 1:44pm
Can someone send me a random bot written in C or C++. I'd like to run some initial experiments by playing it against shallowBlue offline. Thanks.

Omar

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 2nd, 2003, 3:58pm
I can when I get back, but that's a couple of weeks.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 2nd, 2003, 4:27pm
Make sure it has a random initial setup to start.  Perhaps then the very next intermediate bot could be one with a predetermined setup.  Although, I would predict that there would be very little discernible difference in strength between the two bots.  Aesthetically, though, the totally random setup makes the most sense for a rating of 0.

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 2nd, 2003, 5:31pm
Yes, good point... random's not in at the moment.

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 5th, 2003, 11:44pm
Here is yet another way to weaken and randomize a bot: add a weighted random value to every board value. For example have
value := 0.95*value + 0.05*rand(1)*(max_value - min_value);
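The same blend as runnable Python (the names below are assumed; the line above is pseudocode):

```python
# clauchau's noise blend as a small wrapper: 95% of the real evaluation
# plus 5% uniform noise scaled to the evaluation's range.
import random

def noisy_eval(evaluate, min_value, max_value, noise=0.05, rng=random):
    def blended(position):
        value = evaluate(position)
        return (1 - noise) * value + noise * rng.random() * (max_value - min_value)
    return blended

# With a constant evaluation of 10 and a range of [0, 100], every
# blended value lies in [9.5, 14.5).
blended = noisy_eval(lambda pos: 10.0, 0.0, 100.0)
samples = [blended(None) for _ in range(100)]
```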



Title: Re: Arimaa rating deflation
Post by clauchau on Dec 14th, 2003, 7:43am
Here are the outcomes of a million games I ran between bots playing elementary strategies.

The most elementary strategy consists in randomly picking up a valid step among the immediately available steps, making the selected step on the board, and repeating this four times for every move.

random stepper / random stepper

Gold won 50.3%,  Silver won 49.7%

The winner reached the goal 62.5%
The loser moved an opposing rabbit to the goal 26.0%
The loser was unable to move 11.5%
Loss by 3-times repetition: 0 (no such loss).

Shortest game: 11 half-moves
Mean length: 76.9 half-moves (standard deviation = 21.7)
Longest game: 194 half-moves

(figures based on 100,000 games run in 4 minutes)

Ok, I need time to sum it all and provide Omar with bots. See you later.

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 14th, 2003, 11:50am
As we know, a more elaborate player is able to look ahead over its next steps and consider its move as a whole. Unfortunately it takes about 600 times longer to make a move, so large experiments and accurate data are harder to get.

random mover / random mover

Gold won 46%, Silver won 54%

The winner reached the goal 79%
The loser moved an opposing rabbit to the goal 19%
The loser was unable to move 2%
Loss by 3-times repetition: 0 (no such loss)

shortest game: 33 half-moves
mean length: 71.5 half-moves (standard deviation = 18.9)
longest game: 131 half-moves

mean number of different reachable positions = 16380 (st. dev. = 10950)
max number of different reachable positions = 90023

figures based on 500 games (12 minutes).

random mover / random stepper

the mover won 54%, the stepper won 46% (no connection with the previous 46%/54% figures)

figures based on 1500 games (19 minutes)

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 15th, 2003, 9:05am
Very interesting!  As I predicted, the random mover is stronger than the random stepper.  Again, I believe this is due to the fact that random moves more often accomplish something meaningful than moves composed of random steps; e.g. a rabbit moving 4 steps forward is more likely with the random mover than with the random stepper.

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 15th, 2003, 10:08am
Can I just check... are these for initial random positions?

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 15th, 2003, 10:53am
I hope so.  If not, I would recommend that these same cases be re-run for random start positions.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 15th, 2003, 10:56am
Claude, how do you handle the case in the random stepper if the fourth step leads to the same position as before the move started?  Do you wipe out this move and do four new random steps?  Or do you just force the fourth step to be something different?

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 16th, 2003, 7:19am
Yes, the initial position is uniformly random in those games. We agree this seems the right thing to do, both for the Stepper and the Mover. It amounts to the same thing whether we see it as randomly placing one piece after another or as picking a position from among the 64,864,800 legal initial Gold or Silver positions.
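The 64,864,800 figure can be checked with a multinomial coefficient: 16 pieces (8 rabbits, 2 cats, 2 dogs, 2 horses, 1 camel, 1 elephant) arranged on one side's 16 home squares.

```python
from math import factorial

# 16 home squares filled with 8 rabbits, 2 cats, 2 dogs,
# 2 horses, 1 camel, 1 elephant: 16! / (8! * 2! * 2! * 2!)
counts = [8, 2, 2, 2, 1, 1]
positions = factorial(16)
for c in counts:
    positions //= factorial(c)

print(positions)  # 64864800
```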

To me the first move is much like other moves. So for the bots that like having a rabbit as far as possible, I force the initial position to have at least one rabbit on the second row. For the bots that like having as many rabbits as far as possible, I force the eight rabbits to fill in the second row while setting the other pieces randomly on the first row.

Aha, Brian, you are my kind, spotting every possible flaw. I considered any fourth step that gets us back to the position before the move started as illegal, and I excluded such a step. It cannot be picked when the bot looks for a fourth step. If no other fourth step was available, the move is made of three or fewer steps.

Now, for moves consisting of fewer than 4 steps -- which I allow the Stepper to make only when no further legal step is available -- I counted any move that gets us back to the position before the move started as a loss (for being unable to move) instead of asking for other steps. I slightly prefer it that way, but it deserves a separate figure showing how often that happens, and I'll get that right.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 16th, 2003, 8:46am
That all sounds very good.  Any chance you could get the random mover, possibly called "bot_random", on line and start playing against other bots?  (I suppose Omar would first have to implement a change to the rating system so that bot_random's rating stays at 0 no matter what, but the opponent's rating moves accordingly.)

Title: Re: Arimaa rating deflation
Post by fotland on Dec 16th, 2003, 9:35am
bot_random shouldn't have a fixed rating, or it will distort the rating system.  The problem is that bot_random will lose every game and end up 700 points lower than the lowest bot, but that won't be its true rating.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 16th, 2003, 10:31am
But that's the whole purpose of bot_random.  To have a fixed reference point for a rating of 0.  Why else would we even be making a random bot?  If you go back to the start of this topic (actually, the first post after your initial post), you'll see that the whole reason we started talking about a random bot was that we wanted some anchor for the rating system.  What better way to do this than random=0?

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 16th, 2003, 1:43pm
Yep, and that's why we'll also have intermediate bots.

Here are some first results about elementary bots that now acknowledge the goal winning condition.

The Stepping Ultimate Lookout ;) makes random steps, except that it steps a rabbit onto the goal whenever some step achieves that immediately (without otherwise caring to get rabbits closer), and it never pulls or pushes one of the opponent's rabbits onto the opposing goal.

Stepping Ultimate Lookout / Random Stepper

SUL won 62.7% and RS won 37.3% of 100,000 games

That's not much of an improvement but I was curious about it.

The Stepping +Infiltrator -Infiltrator makes random steps among the steps maximizing

16*(advancement of the most advanced rabbit) - (advancement of the opponent's most advanced rabbit)

where advancement = 8 on the goal.

Stepping +Infiltrator -Infiltrator / Random Stepper

S+I-I won 97.7% and RS won 2.3% of 100,000 games

Now the Stepping +Flooder -Flooder focusses on getting as many rabbits as possible onto the goal, then onto the row before, etc., then on the first row, then on getting as few of his opponent's rabbits as possible on the opposing goal, then as few on the row before, etc.

Stepping +Flooder -Flooder / Stepping +Infiltrator -Infiltrator

S+F-F won 91.5% and S+I-I won 8.5% of 100,000 games.

Wins by Goal reached: 99.75%
Loss by pulling or pushing on the opposing goal: 0 (none)
The loser was unable to move: 0.25%
Loss by 3-times repetition: 2 games

shortest game = 6 half-moves
mean length = 33.1 half-moves (sd = 14.3)
longest game = 188 half-moves

There is more, but those are the most important results. I didn't get any elementary Stepping bot stronger than that Flooder. In particular, the official scoring function makes a weaker stepping bot (and a weaker moving bot as well).

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 16th, 2003, 2:35pm
Nice results so far!  I'd be also interested in seeing your first non-random bot play against the random mover, since this is the agreed 0-rating floor bot.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 16th, 2003, 2:37pm
From my knowledge of how chess ratings relate to win probabilities, I'd estimate (very preliminary back-of-the-envelope calculation) that your Stepping +Flooder -Flooder bot would be about 1200 rating points better than the random stepper.

Actually, if we can do this kind of analysis before we anchor the rating system at 0 for bot_random, we should be able to estimate a one-time adjustment to all current ratings.  For example, if we find that a person rated around 1500 would instead be 2300 with a random=0 anchor, then we can simply add 800 points to everyone's ratings.  This will prevent a long and inaccurate period where people's ratings drift at different rates depending on how much they play.

Actually, if we find that the adjustment would be really large (like more than 2000 points), it may be aesthetically pleasing to both scale AND shift the entire rating system.  For example, instead of having mean ratings be 3800, we could change the scaling factor in the ratings formulas from 800 to 400 so that a difference of 100 points then would be what a difference of 200 points is now.  But again, that's just a preference, not a necessity.  Some sort of shift will probably be necessary though if we don't want a long period of inaccurate ratings.

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 16th, 2003, 2:41pm
Keep it coming!  This is all very interesting.

I'd also be interested in how a complete materialist would do in this scheme of things... e.g. implementing the 99system at each step, without any focus on pushing rabbits forward.  I expect this would lose to flooder, but I'd be interested nonetheless.

It'd be worth playing some of those bots you've made against the random_mover, since that's what most people think should be the one fixed to 0.  Then we can start arguing about ratings for the intermediate bots.

In fact a full crosstable of percentage wins for all pairs of bots you make is probably the best thing to calculate ratings from.

Title: Re: Arimaa rating deflation
Post by fotland on Dec 18th, 2003, 12:22am

on 12/16/03 at 10:31:24, MrBrain wrote:
But that's the whole purpose of bot_random.  To have a fixed reference point for a rating of 0.  Why else would we even be making a random bot?  If you go back to the start of this topic (actually, the first post after your initial post), you'll see that the whole reason we started talking about a random bot was that we wanted some anchor for the rating system.  What better way to do this than random=0?


I understand the desire to have a fixed reference point, but I think that a random player is 5000 or 10000 points weaker than the strong players.  I don't think we want to radically change the ratings of the current players, and wait for them to restabilize.  My suggestion is that initially the random player should float, to find out what its natural rating is, then make it the anchor at that rating.

But you know that I think the whole idea is silly :)  Because there will be so many stages of intermediate players between the random player and the worst human, that the system will never stabilize.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 18th, 2003, 10:30am
I think you are severely overestimating the number of levels between random and regular players.  As the preliminary analysis has shown, there's about a 1200-point difference between random and a bot that accomplishes a concrete strategic goal.  I would estimate (without the benefit of seeing its play) that this bot is at most about 1200 points worse than shallow_blue.  Allowing another 600 points for an average player puts us at about 3000.  So at worst, we may need to, as I suggested before, scale the rating system so that a 100-point difference means about what a 200-point difference does now.

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 18th, 2003, 10:35am

on 12/18/03 at 00:22:03, fotland wrote:
My suggestion is that initially the random player should float, to find out what its natural rating is, then make it the anchor at that rating.

What's the difference between what you're saying, and figuring out what the natural rating would be through experimentation (what Claude is doing) followed by a one time rating adjustment?  There is none, except with the second approach, you end up with random=0, which makes sense.


on 12/18/03 at 00:22:03, fotland wrote:
there will be so many stages of intermediate players between the random player and the worst human, that the system will never stabilize.

That's the purpose of the one-time rating adjustment.  We go right to what we think is the best difference and start from there.  There won't be long-term drifting.

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 18th, 2003, 10:48am

on 12/16/03 at 13:43:45, clauchau wrote:
SUL won 62.7% and RS won 37.3% of 100,000 games

That's not much of an improvement but I was curious about it.

S+I-I won 97.7% and RS won 2.3% of 100,000 games
S+F-F won 91.5% and S+I-I won 8.5% of 100,000 games.


I quite like the idea of bots with a fair degree of overlap, where the win ratio is near 70%.  (whether by randomisation or by very small increments in bot algorithm).  Otherwise if the win ratio is up near 100%, it's difficult to be sure of the relative ratings.

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 18th, 2003, 10:54am

on 12/18/03 at 10:30:40, MrBrain wrote:
I think you are severely overestimating the number of levels between random and regular players.


Actually David's estimate of [(Strong Human - Random)=~5000 (to 10000)], is not that far off my estimate of [Random Rating on Current Scale = -2000], since strong humans can have a rating over +2000.

But anyway, a more precise answer will eventually be established by Clauchau's bots.

99

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 18th, 2003, 11:05am

on 12/14/03 at 11:50:32, clauchau wrote:
the mover won 54%, the stepper won 46%


If we define Random Mover as our 0, Random Stepper therefore has a rating of approximately -28.
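99of9's figure (and the others below) can be reproduced assuming the rating scale behaves like the standard logistic form with a 400-point divisor, d = 400*log10(p/(1-p)); that assumption matches every rating difference quoted in this thread.

```python
from math import log10


def rating_diff(p):
    """Rating difference implied by a win probability p,
    assuming the standard form d = 400 * log10(p / (1 - p))."""
    return 400.0 * log10(p / (1.0 - p))


print(round(rating_diff(0.54)))   # 28 (mover over stepper)
print(round(rating_diff(0.627)))  # 90 (SUL over RS)
```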

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 18th, 2003, 11:17am

on 12/16/03 at 13:43:45, clauchau wrote:
SUL won 62.7% and RS won 37.3% of 100,000 games
S+I-I won 97.7% and RS won 2.3% of 100,000 games
S+F-F won 91.5% and S+I-I won 8.5% of 100,000 games.


That gives SUL a rating of about 62 (90 higher than RS).

S+I-I is approximately at 623 (651 higher than RS)

S+F-F is approximately at 1036 (413 higher than S+I-I).


To be honest I think we're still quite a way from Shallowblue, because at the moment, in games of Shallowblue vs S+F-F, shallowblue will simply eat every rabbit that S+F-F sends forward.  This flooding mechanism may be good against bots that don't try to trap it, but as soon as you put any trapping plan into action, flooder is dead.

By the way:   S+F-F is only different to S+I-I when the lead rabbit cannot make progress.  In that case S+F-F sends another rabbit forward, whereas S+I-I simply makes a random move.  Notice that this small difference in strategies resulted in a few hundred ratings points!!

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 18th, 2003, 12:24pm
Well, perhaps the rating difference is more than I expect (but I am almost positive it is much less than 10000).  But yes, we will definitely find out from the experiments.  I am very excited to see the results!

Title: Re: Arimaa rating deflation
Post by fotland on Dec 19th, 2003, 1:09am
Does anyone have an estimate of shallow blue's actual rating, since its currently frozen?  I'm confident that ariminator will win very close to 100% against it, so
perhaps shallow blue's rating is actually about 500 on the current scale.  Maybe Omar could let it float and we could see where it ends up.

A bigger ratings issue with using bots is that they don't learn and people do.  People will discover their weaknesses, and exploit the same weakness over and over.  This causes distortion in the relative human ratings.  Of course we already have this problem, but I don't think fixing the bot ratings will help it.

Finally, many people are familiar with chess ratings, yahoo ratings, etc.  If we shift the whole rating system up thousands of points and populate the familiar rating range with many bots, it will look a little odd :)

Still, I'm very interested in the results of the bot experiments.  I bet I could write 3 bots where bot1 beats bot2 close to 100%, bot 2 beats bot3, and bot3 beats bot1.  Would that be enough to demonstrate the futility of using bots to make a more stable rating system? :)


Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 19th, 2003, 3:03am

on 12/19/03 at 01:09:19, fotland wrote:
Still, I'm very interested in the results of the bot experiments.  I bet I could write 3 bots where bot1 beats bot2 close to 100%, bot 2 beats bot3, and bot3 beats bot1.  Would that be enough to demonstrate the futility of using bots to make a more stable rating system? :)


People have alluded to this; it will be interesting to see if any of Clauchau's bots do it.  I don't think a set of bots specifically designed for this is interesting, unless they use fairly generic strategies.  Nor do I think this breaks the system; we should include these cycles in our analysis.  In the end we should probably put all bot-bot results in a big matrix and do some diagonalization or something.  I agree the rankings for the higher bots will not be watertight, but they will be approximately correct.

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 20th, 2003, 5:15am
K = every friendly piece on the board is worth one point
S = Arimaa score R + P*(C+1)

+X -Y means X as viewed by the playing player is maximized above all. In case of steps or moves with equal value, Y as viewed by the opposing player is minimized.

Below are percentages of wins for the row player against the column player. Every figure is based on 100,000 games when only Stepping strategies are involved; when one or both players use a Moving strategy, only 1,000 games were sampled.

Columns, in order: S, M, S+K-K, S+I, S+I-I, M+I-I, S+F-F, S+S-S, M+F-F, M+S-S

S     :
M     : 54
S+K-K : 82.6   81
S+I   : 93.0   59.5
S+I-I : 97.7   98.5   67.1   71.7
M+I-I : 99.9   99.9   84
S+F-F : 99.97  99.90  91.5   66
S+S-S : 99.98  95.6   72     64.4
M+F-F : 97.0   74     59     52
M+S-S : 72     66     53.5

Title: Re: Arimaa rating deflation
Post by omar on Dec 20th, 2003, 9:40am
Wow Claude, you've done a really impressive job of collecting some good stats on the random bots. Thanks so much for doing this.

Is it possible that you could send me a copy of your program so I can also try out some experiments. Actually I was thinking that we really should keep a repository of the programs we use for the random bots and the other simple bots we use for anchoring the rating system. I can make it available under the download section of the Arimaa site so that others can also look at the code and experiment with it. It would be great if you could contribute this.

Omar

Title: Re: Arimaa rating deflation
Post by omar on Dec 20th, 2003, 9:45am

on 12/16/03 at 08:46:24, MrBrain wrote:
That all sounds very good.  Any chance you could get the random mover, possibly called "bot_random", on line and start playing against other bots?  (I suppose Omar would first have to implement a change to the rating system so that bot_random's rating stays at 0 no matter what, but the opponent's rating moves accordingly.)


Actually I would not need to make any changes. If I just set the RU to zero (meaning there is no uncertainty about the player's rating) then the rating and RU of that player will not change. Pretty nice how the equations just work out that way :-)
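The principle can be sketched with a toy update rule. This is NOT the actual Arimaa rating equations; it is an illustrative simplification in which the step size is proportional to the player's own RU, so an RU of zero freezes the rating exactly as Omar describes.

```python
from math import log10


def update(rating, ru, opp_rating, score, c=400.0):
    """Toy rating update: step size scales with the player's own
    rating uncertainty (RU).  A player with RU = 0 never moves.
    Illustrative only -- not the real Arimaa formulas."""
    expected = 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / c))
    k = ru / 10.0  # assumed: step size proportional to RU
    return rating + k * (score - expected)


# An anchored bot with RU = 0 keeps its rating after any result:
print(update(0.0, 0.0, 1500.0, 1.0))  # 0.0
```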

Omar

Title: Re: Arimaa rating deflation
Post by omar on Dec 20th, 2003, 10:52am
I have a simple question. What is the rating of the perfect tic-tac-toe program if the rating of the random program is set to zero and we use the Arimaa rating equations to establish the rating scale:
 http://arimaa.com/airmaa/rating/

Would we be able to independently come up with the same value? Of course we would each come up with a slightly different value due to sampling differences, but how different would they be? Would they be, say, within 100 rating points of one another, or would they be way off?

I think this is worth investigating to learn more about anchored rating systems before we use it in more complex games.

Omar


Title: Re: Arimaa rating deflation
Post by omar on Dec 20th, 2003, 10:53am
Had a typo in the link; it should be:
 http://arimaa.com/arimaa/rating/

Omar

Title: Re: Arimaa rating deflation
Post by 99of9 on Dec 20th, 2003, 11:51am
Does the perfect tic-tac-toe program know that its opponent is playing randomly?  If so it might play differently.

Title: Re: Arimaa rating deflation
Post by omar on Dec 20th, 2003, 1:23pm

on 12/20/03 at 11:51:03, 99of9 wrote:
Does the perfect tic-tac-toe program know that its opponent is playing randomly?  If so it might play differently.


No, it should not make any assumptions about the opponent other than the opponent will try to make the best possible move.

Omar

Title: Re: Arimaa rating deflation
Post by omar on Dec 20th, 2003, 1:49pm
Claude was nice enough to contribute his random bot program. I have put it up on the Arimaa site:
 http://arimaa.com/arimaa/download/randomBot/claude/r.cpp

I haven't had a chance to try it out yet.

Here's the notes that Claude sent me with the program:

Here is the c++ program I use to get statistics
among bots. Unfortunately I get 3 errors when compiled
with Gnu gcc, although it does look fine and get compiled
with Borland c++ compiler. Tell me if you figure out my
mistake. In any case, feel free to include it on your
download page and allow people to use and alter it.


Title: Re: Arimaa rating deflation
Post by clauchau on Dec 21st, 2003, 1:24pm

on 12/20/03 at 10:52:26, omar wrote:
Would we be able to independently come up with the same value. Of course we would all come up with a different value due to sampling difference, but how different would it be. Would they be say within 100 rating points of one another, or would they be way off.


Calculations involving confidence intervals in sampling statistics show that

if you sample 100 games between bots X and Y and get a certain proportion x% of wins for X, then you are 99.9% sure the real proportion is between (x-17)% and (x+17)%;

if you sample 1,000 games between bots X and Y and get a certain proportion x% of wins for X, then you are 99.9% sure the real proportion is between (x-5.2)% and (x+5.2)%;

if you sample 10,000 games between bots X and Y and get a certain proportion x% of wins for X, then you are 99.9% sure the real proportion is between (x-1.7)% and (x+1.7)%;

if you sample 100,000 games between bots X and Y and get a certain proportion x% of wins for X, then you are 99.9% sure the real proportion is between (x-0.52)% and (x+0.52)%;

if you sample 1,000,000 games between bots X and Y and get a certain proportion x% of wins for X, then you are 99.9% sure the real proportion is between (x-0.17)% and (x+0.17)%.

Being 99.9% sure means the real proportion might be outside that interval, but only one sample in 1000 could then have yielded an estimate x% that far from the real proportion. In other words, if you trust the intervals given above, you are going to match the truth 999 times out of 1000 and be misled once every 1000 times or so.

When the sampled proportion x% is close to 100% you can trust it more closely. For x% = 85% you can replace any 17 above by 12 and 52 by 37 but it gets much better when closer to 100%.

As a result, the difference of ratings between bots X and Y lies -- with a 99.9% level of confidence -- in an interval [A,B] whose length B-A is about

240 points if x%=50% (430 points if x%=85%) when you sample 100 games;

73 points if x%=50% (103 points if x%=85%) when you sample 1,000 games;

23 points if x%=50% (32 points if x%=85%) when you sample 10,000 games;

7.2 points if x%=50% (10.1 points if x%=85%) when you sample 100,000 games;

2.3 points if x%=50% (3.2 points if x%=85%) when you sample 1,000,000 games.

(hmm, I hope this was clear).
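Claude's half-widths can be reproduced with the normal approximation to the binomial, using z ≈ 3.29 for two-sided 99.9% confidence (he appears to round the looser figures up slightly). The function name is just for illustration.

```python
from math import sqrt

Z_999 = 3.29  # two-sided 99.9% normal quantile


def ci_halfwidth(n, p=0.5):
    """Half-width of the 99.9% normal-approximation confidence
    interval for a win proportion p estimated from n games."""
    return Z_999 * sqrt(p * (1.0 - p) / n)


for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(n, round(100 * ci_halfwidth(n), 2))  # half-width in percent
```

For n = 100,000 games this gives 0.52 percentage points, matching the figure in the post.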

Title: Re: Arimaa rating deflation
Post by fotland on Dec 21st, 2003, 1:35pm
Thanks for the confidence data, Claude.  Doesn't this mean that there is no hope for a truly stable rating system?  Even with 100 games, the high confidence interval is still 240 points.  It's unlikely that two people will play that many games between them.

From the results of the world championship it seems that the top 4 players are very similar in strength, but from month to month their ratings have a spread of over 100 points.  It looks now that this kind of spread is inherent in the sampling process, due to a small number of games played, and that no rating system can do better.

David

Title: Re: Arimaa rating deflation
Post by clauchau on Dec 21st, 2003, 2:06pm
If we lower the confidence level to 90% instead of 99.9% -- so we accept being misled once in ten times -- then 240 turns into 116, which is still pretty large. But most serious people will play 100 games, some even against fixed bots, and get to know their rating within that range. Not too bad.

For 1,000 games, 73 turns into 36.  Looks like the indeterminacy level around Chess master ratings?

Title: Re: Arimaa rating deflation
Post by MrBrain on Dec 21st, 2003, 9:08pm

on 12/20/03 at 10:52:26, omar wrote:
I have a simple question. What is the rating of the perfect tic-tac-toe program if the rating of the random program is set to zero and we use the Arimaa rating equations to establish the rating scale:

Actually, this is not as simple a question as it sounds.  The reason is that tic-tac-toe is a theoretical draw, and there are in many cases several drawing moves.  For example, in the first move, any square draws.  If the first player picks center, the second must pick a corner; if the first player picks corner, the second must pick center.  However, if the first player picks a side box, both center and the opposite side box draw.

So the question of whether the perfect player knows it's playing against a random bot is actually relevant.  In most cases, there will be a "better" move that increases the chances of winning against a random player.  However, if the "perfect player" believes it's playing against a typical human, for example, it may be best to play a particular opening that humans will fall for most frequently.

For example, if playing first, then corner, center, opposite corner is a good opening trick.  If the second player then picks one of the two remaining corners, the first player will win.  Another example is to pick a side box as first player since this is a somewhat unusual opening.

Anyway, in tic-tac-toe, it may make the most sense to define a "perfect player" as one that maximizes its winning chances against a random opponent, since tic-tac-toe is a simple game.  However, it is easy to see that extending this definition to a complex game such as Arimaa would be ridiculous.

We could instead define the perfect tic-tac-toe player as one that chooses randomly from all equally optimal moves.  However, this may not give the best chances of winning against either a random opponent or a human opponent.  In Arimaa, such a definition would lead to an extremely strong player.  But in a game such as Arimaa, you also might have to consider that a move winning the game quicker is "superior" to one that forces a win in a larger number of moves.

Title: Re: Arimaa rating deflation
Post by omar on Jan 20th, 2004, 9:03pm
Things have been busy lately and I have not been able to think about this much. But in the long run I still want to eventually move to an anchored rating system once we get more familiar with them.

Omar

Title: Re: Arimaa rating deflation
Post by Fritzlein on Sep 15th, 2004, 9:37pm
This thread had many interesting posts, including fascinating statistics produced by clauchau.  I would like to add my voice, however, to those who are skeptical of the whole project of providing an "absolute" scale to the ratings.

Assigning a rating of 0 to a random mover (or stepper) makes sense to me.  That's absolute, and reasonable.  Now suppose I create a second algorithm VeryDumb which beats the random bot 3/4 of the time.  By the rating formula, VeryDumb should be rated 191.  This also can be considered absolute.

Then suppose I create the bot TotallyNuts, which beats VeryDumb 3/4 of the time and the random bot 4/5 of the time.  By the rating formula TotallyNuts should be rated 241 points above the random bot, and 191 points above VeryDumb.

What should the rating of TotallyNuts be, then, 241 or 382?  If you say 382, then it isn't absolute.  If you say 241, it is still absolute, but what good is it?  You can't tell from the ratings of VeryDumb and TotallyNuts how they will do against each other, only how well each does against the random bot.  If we anchored ratings that way, they would be essentially meaningless.
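Fritzlein's numbers follow from the usual logistic formula, assuming the 400-point scale used elsewhere in this thread (VeryDumb and TotallyNuts are his hypothetical bots):

```python
from math import log10


def diff(p):
    """Rating difference implied by win probability p."""
    return 400.0 * log10(p / (1.0 - p))


very_dumb = diff(3 / 4)                          # vs the random bot
totally_nuts_direct = diff(4 / 5)                # vs the random bot
totally_nuts_chained = very_dumb + diff(3 / 4)   # via VeryDumb

print(round(very_dumb), round(totally_nuts_direct),
      round(totally_nuts_chained))  # 191 241 382
```

The direct and chained estimates (241 vs 382) disagree by about 140 points, which is exactly the non-transitivity the post is describing.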

The problem is that ratings aren't truly transitive.  You can't infer from A's results against B and from B's results against C, exactly what A's results against C will be.  With humans this is a slight problem, but with bots it can become a huge problem.  Indeed, at least two people have pointed out that with deterministic bots, the ratings formula doesn't work at all.

I could give many examples of ways in which ratings aren't transitive, but I will spare you unless someone asks.  The important point is that if you don't have transitivity, then the notion of putting the ratings on an absolute scale becomes meaningless.  The rating of TotallyNuts depends on whether VeryDumb is in the playing pool or not.  And if the ratings depend on who is in the playing pool, then they are by definition only on a relative scale.

A far more pressing concern than an absolute scale for ratings is making sure that, while they are necessarily relative to the field, the ratings are accurate against the field as a whole.  One shouldn't be able to find a favorable matchup and exploit it.  I, for example, have gotten a rating of 1950 without beating a single human opponent.  I know how to beat bots, that's all, so I have a ridiculously inflated rating.  The meaningfulness of the ratings would be enhanced far more by forcing people to play a variety of opponents than it would by trying to anchor it "absolutely".

Title: Re: Arimaa rating deflation
Post by omar on Sep 16th, 2004, 6:55pm
Actually, if transitivity is not preserved then no scale makes sense, regardless of whether it is absolute or floating. A scale can only be defined if some kind of transitivity exists.

But we do have a rating scale for our floating rating system, and we also know that transitivity is not strictly preserved; so what do the ratings on our floating scale mean?  They reflect only how we've performed against the players we've played. If someone played all their games against the same person, then their rating (regardless of the type of scale) would be way off from what it should be had they played "against the field". So even on a floating scale we have the same problem: we cannot look at the ratings of any two players and expect them to accurately tell us how those players will perform against each other. We used the probabilities to go forward and compute the ratings, but we can't rely on those ratings to go in the reverse direction and accurately give us the probabilities.

I know you have to agree to all this because everything I've said so far is stuff that I learned from you (Fritzlein) in our email conversations :-)

So what is this notion of an absolute rating scale. Maybe using the word 'absolute' is what is causing the problem. It might be better to call it an 'anchored' rating scale.

Now to see what I mean by an 'anchored' rating scale, imagine this experiment. Suppose we introduce a random bot and several other non-random (but really dumb) bots into our current floating rating system in the Arimaa gameroom. All the bots come in with an initial rating of 1500 and start losing lots of games (even against shallowBlue). So the ratings of these bots sink pretty low. But they win some games against each other, and some of them win some rated games against shallowBlue, so eventually their ratings become ordered and stabilize. Now take whatever rating the random bot has, reset it to zero, and shift the ratings of all the other players (bots and humans) by that difference. Now we've got an anchored rating scale with the random bot having a rating of zero.
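The one-time shift Omar describes is just a constant offset applied to everyone. A minimal sketch, with made-up rating values for illustration:

```python
def anchor_to_zero(ratings, anchor="bot_random"):
    """Shift every rating so the anchor player lands exactly at 0.
    Rating differences (and hence win probabilities) are unchanged."""
    shift = ratings[anchor]
    return {name: r - shift for name, r in ratings.items()}


# Hypothetical settled ratings before anchoring:
print(anchor_to_zero({"bot_random": -1900, "shallowBlue": 1400}))
# {'bot_random': 0, 'shallowBlue': 3300}
```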

But whenever the random bot's rating drifts from zero we have to readjust everyone's ratings; not good. So to avoid doing this we can just fix the random bot's rating at zero and let the other players' ratings change. But this causes the rating scale to be adjusted much more slowly. To speed things up we can have the random bot and all the other low-rated bots play lots and lots of games against each other and then fix all their ratings so that they do not change. Then there is a much better chance of players with non-fixed ratings playing against the fixed (or anchored) players, and the rating scale gets adjusted faster.

In a way we can think of our current rating system as being anchored around an average player having a rating of 1500, because that's the rating new players come in with. If we had chosen to let new players come in with a rating of 10,000, our ratings would now be scattered around 10,000. So we just want a system where the ratings are anchored based on a random bot having a rating of zero. I don't think there is anything wrong or impossible about doing that. But we can't strictly take any player's rating difference against the random bot as an accurate measure of the probability with which that player would beat the random bot. It didn't really mean that in our current scale anchored with the average player at 1500, and it won't really mean that in a scale anchored with the random player at zero.

So there are no new problems introduced by anchoring at zero using a random bot instead of at 1500 as the average player. But it does eliminate the problem of drift, because we are anchoring based on a player that will not get any better or worse as it plays and will also never retire from playing.

Now the issue about ratings reflecting a player's ability "against the field" and not just against a few selected players is a real problem, but it is independent of the type of rating scale. This needs to be dealt with separately, and trying to fix it will require changing the rating formulas altogether. But I think in our email conversation we may be making some progress on this issue.

Omar

Title: Re: Arimaa rating deflation
Post by 99of9 on Sep 17th, 2004, 5:53am
Here's some raw data to throw in the mix.

I revved up Clauchau's program, and played out the rest of the duels to get winning percentages for all the pairs.  Then I fiddled with the "ratings" of each of these bots (with random_mover anchored at 0) until the predicted winning percentages based on those "ratings" (using the formula for the Arimaa rating) gave a good fit to the measured ones.

Here's how good the fit was.  [Fitting could be done even better with a program, I just did it by hand in a spreadsheet]
http://users.ox.ac.uk/~hudson/predict.png

The X axis is the actual proportion of games won by a bot in a particular duel, the Y axis is the predicted proportion of games won by that bot in that duel if both bots are rated as shown below.

Remember these are a group of bots with quite diverse (but very dumb) styles.  They are all somewhat stochastic.

The ratings came out as:

  0      M
-30      S
308      S+K-K
343      S+I
478      S+I-I
794      M+I-I
890      S+F-F
964      S+S-S
985      M+F-F
1044      M+S-S


So I think that, all up, this bodes reasonably well for an anchored ratings system.  Clauchau has managed to span 1000 rating points with fairly "predictably" performing bots (i.e. where a single-number rating represents their chances of success fairly well).  There are a couple of gaps people may want to plug, or we can keep extending it with better bots and hopefully get to shallowBlue one day :-).

One thing to note is that my fitting was done based on percentage wins, not on relative rating.  This is because I did not want to make too much distinction between a 99.9% and a 99.8% win ... even though by the ratings formula these would have quite different rating differences.  Basically my fitting system weights games between bots of similar standard more heavily than games between bots of widely different standard.  I think that's a good thing.

Title: Re: Arimaa rating deflation
Post by 99of9 on Sep 17th, 2004, 10:04am

on 09/17/04 at 05:53:30, 99of9 wrote:
[Fitting could be done even better with a program, I just did it by hand in a spreadsheet]


Well now I've written a program to do some simulated annealing to determine the best fit ratings.  Here's the output for that same dataset.

# Rating:     0.0 M
# Rating:   -23.6 S
# Rating:   311.3 S+K-K
# Rating:   362.9 S+I
# Rating:   491.4 S+I-I
# Rating:   806.9 M+I-I
# Rating:   913.0 S+F-F
# Rating:   986.1 S+S-S
# Rating:  1007.1 M+F-F
# Rating:  1073.9 M+S-S


So I wasn't that far off by hand...

Title: Re: Arimaa rating deflation
Post by 99of9 on Sep 17th, 2004, 10:04am
Here's the program if anyone wants to do this for themselves.  All you have to do to include a new bot in the ratings is to add another line to the results crosstable (in the input file results.txt), and set it to run.


/*
  *********************************************************************************
  RateArimaaBots.c

  Program to rate bots by the Arimaa rating scheme.
  The first bot in your list will automatically take the rating value of 0.

  ---------------------------------------------------------------------------------
  Input file should be a results crosstable of the format:
  %d\n                 <number of bots>
  %s %s %s %s ...\n    <contents of this line do not matter>
  %s\n                 <name of bot 0>
  %s %f\n              <name of bot 1 and performance (up to 100) against bot 0>
  %s %f %f\n           <name 2 and performance against bot 0 and bot 1>
  %s %f %f %f\n        <etc>
  ---------------------------------------------------------------------------------
  A filled crosstable is also handled correctly if that input format is easier,
  but there is no error checking to ensure that A vs B adds up to 100.

  Version 0.0
  Toby Hudson (toby<AT>hudsonclan.net)
  This program is available without warranty for anyone to use for any good purpose.
  Please credit the author in any publications or derivative works.
  **********************************************************************************
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAXBOTS 20
#define MAXNAMELENGTH 40
#define NGEN 1000001
#define MAXDELTA 1.0

double kT;

double predict (double ratingB, double ratingA) {
 return 100*(1.0/(1.0+pow(10.0,((ratingA-ratingB)/400.0))));
}

double rmserror (int n,
            double rate[MAXBOTS],
            double res[MAXBOTS][MAXBOTS],
            double pred[MAXBOTS][MAXBOTS]) {
 double err = 0.0;
 int i, j;
 for (i=0; i<n; i++) {
   for (j=0; j<i; j++) {
     pred[i][j] = predict(rate[i], rate[j]);      
     err += (res[i][j]-pred[i][j])*(res[i][j]-pred[i][j]);
   }
 }
 err /= ((double)n*(n-1)/2);
 err = sqrt(err);
 return err;
}

void ReadIn (int *n,
          double res[MAXBOTS][MAXBOTS],
          char nam[MAXBOTS][MAXNAMELENGTH]) {
 FILE *fp;
 char ch='x';
 int i, j;
 
 fp = fopen("results.txt", "r");
 
 fscanf(fp, "%d\n", n);
 printf("# Rating %d bots\n", *n);
 
 while (ch!='\n') ch = getc(fp);
 
 for (i=0; i<*n; i++) {
   fscanf(fp, "%s", nam[i]);
   for (j=0; j<i; j++) {
     fscanf(fp, "%lf", &res[i][j]);
   }
   ch = 'x';
   while (ch!='\n') ch = getc(fp);
 }
 fclose(fp);
}

double RateBots(int n,
           double rate[MAXBOTS],
           double res[MAXBOTS][MAXBOTS],
           double pred[MAXBOTS][MAXBOTS]) {
 int gen;
 int i;
 double delta;
 double initerr;
 double finerr;
 double minerr;
 double bestrate[MAXBOTS];
 int bot;

 for (i=0; i<MAXBOTS; i++) {
   rate[i] = 0.0;
   bestrate[i] = 0.0;  // ensure bestrate is defined even if no move ever improves
 }

 initerr = rmserror(n,rate,res,pred);
 minerr = initerr;

 for (gen = 0; gen<NGEN; gen++) {
   kT = 1000.0 / (double)(gen+1);  // cooling schedule; gen+1 avoids division by zero at gen==0

   // choose a bot and how much to perturb by
   delta = MAXDELTA*(((1.0*rand())/RAND_MAX)-0.5);
   bot = 0;
   while (bot==0) bot = rand()%n;

   rate[bot] += delta;
   
   finerr = rmserror(n,rate,res,pred);
   
   if ((1.0*rand()/RAND_MAX)<exp(-(finerr-initerr)/kT)) {
     // accept move
     initerr = finerr;
     if (initerr<minerr) {
     minerr=initerr;
     for (i=0; i<n; i++) {
       bestrate[i] = rate[i];
     }
     }

   } else {
     rate[bot] -= delta;
   }

   //if (gen%10000==0) printf("%10d %10.5f %10.5f\n", gen, initerr, minerr);
 }

 for (i=0; i<n; i++) {
   rate[i] = bestrate[i];
 }
 return minerr;

}

int main () {
 int NumBots;
 char Names[MAXBOTS][MAXNAMELENGTH];
 double Result[MAXBOTS][MAXBOTS];
 double Predict[MAXBOTS][MAXBOTS];
 double Rating[MAXBOTS];

 int i;

 ReadIn(&NumBots, Result, Names);
 RateBots(NumBots, Rating, Result, Predict);

 for (i=0; i<NumBots; i++) printf("# Rating: %7.1f %s\n", Rating[i], Names[i]);
 return 0;
}


Here's the current input file:

10
Actual  M       S       S+K-K   S+I     S+I-I   M+I-I   S+F-F   S+S-S   M+F-F   M+S-S
M
S            46.00
S+K-K   81.00   82.60
S+I     94.34   93.00   59.50
S+I-I   98.50   97.70   67.10   71.70
M+I-I   99.90   99.90   94.20   97.24   84.00
S+F-F   99.95   99.97   89.02   96.53   91.50   66.00
S+S-S   99.98   99.98   95.07   97.93   95.60   72.00   64.4
M+F-F   100.00  99.99   96.09   99.08   97.00   74.00   59.00   52.00
M+S-S   100.00  100.00  99.03   99.80   98.30   85.20   72.00   66.00   53.50

Title: Re: Arimaa rating deflation
Post by Fritzlein on Sep 17th, 2004, 8:21pm
This is a fun conversation!  I wish I had gotten involved earlier.

Omar, you make a clear and persuasive argument.  I accept the vocabulary change of "anchoring" the system as opposed to having an "absolute" scale for the ratings.  It makes sense to anchor the system rather than letting it drift.  If you let it drift it will probably (under the current system) deflate over time.  Moreover, as long as the system is going to be anchored, fixing the rating of a random mover at zero is at least as sensible as fixing the average rating at 1500, or any other anchoring idea I know of.

Still, the ratings scale will depend on the pool of players, and especially on the pool of bots used to anchor it.  A different pool of bots would anchor it in a different way, even if that pool of bots also had the random mover fixed at a zero rating.  The lack of transitivity ensures that all ratings are meaningful only relative to the playing population.

The more I think about the lack of transitivity, the more I question the fundamental meaning of ratings.  I am beginning to think that the most basic formula, namely "WP = 1/(1+10^(RD/400))", needs to be deprecated from its current central role.  Follow me through an example, and see if you don't arrive at the same conclusion I do.

Suppose that 99of9 can beat Belbo about 75% of the time and speedy about 75% of the time.  Suppose that I make a special study of speedy, and learn to beat it 99% of the time, but this specialty knowledge isn't transitive, so I can only beat Belbo 40% of the time.  (I know both the 99% and the 40% are far too high, but bear with me for the sake of the example.)

The question is, based on these percentages, who deserves a higher rating, me or 99of9?  If you start from the formula of "WP = 1/(1+10^(RD/400))", you would say that 99of9 deserves to be 191 points higher than Belbo and 191 points higher than speedy, while I deserve to be 70 points lower than Belbo but 798 points higher than speedy.  To combine these numbers somehow, we could take the average rating of Belbo and speedy, put 99of9 above it by (191+191)/2, and put me above it by (-70+798)/2.  I end up 173 points higher than 99of9, which is absurd.

Intuitively, it makes no sense whatsoever that 99of9 would be rated lower than me, given that he wins 1.5 out of every two games against the same opposition against which I win only 1.39 out of two.  This issue really comes to a crisis when there are winning percentages of 99.98, 99.99, and 100.00.  My intuition (apparently in agreement with 99of9) is that the difference between 99.98 and 99.99 should count for exactly as much as the difference between 50.00 and 50.01.  Yet the current rating system disproportionately rewards lopsided results.  The key to getting a high rating at present is to find a bot you can beat, and beat it again and again and again.

If we can free ourselves for a minute from the shackles of "WP = 1/(1+10^(RD/400))", let's consider an alternate definition of playing strength.  Let's say that the better player is, by definition, the one who wins more games (on average) against the same opposition.  Everyone can agree that, in a round-robin tournament, the player with the highest total score wins, regardless of who the wins and losses were against.  Adding up total wins in a round robin is universally considered to be a fair method of scoring.

In the round robin of bots from clauchau/99of9, the expected winning percentages of each bot are (calculated by hand so please forgive any errors):

M       8.93
S         8.10
S+K-K   29.29
S+I     31.62
S+I-I   40.96
M+I-I   64.23
S+F-F   71.95
S+S-S   78.55
M+F-F   80.41
M+S-S   87.09

I would claim that the above list of winning percentages in a round robin DEFINES the relative playing strength of each bot in this field.  Anything else we say about ratings should be derivative from this concept.

Now, I admit that this leaves open the knotty question of how to estimate ratings when a round robin isn't possible.  I need to chew on that one some more before I offer up any suggestions.  But I have a strong intuition that "total score versus the field", i.e. the round-robin philosophy, is a solid starting point.

Title: Re: Arimaa rating deflation
Post by clauchau on Sep 18th, 2004, 7:34am
Oh yes, I love all that was recently said.

It looks like round robin scores are an additive logarithmic version of ratings with a smaller focus on the last games. They have the advantage of weighting every opponent played equally.

In theory I still could invite a hundred similar bots into the gameroom and play them all, amounting to a weight of a hundred against a single bot. So the scale depends on who plays, but that's all right I think, and my hundred bots aren't a real issue.

Now recent games should however be weighted more heavily. And maybe that only means we'll get back to the current formula. It's fun too. To me it looks a bit like a currency, or some traded goods, or several currencies with national local biases.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Sep 18th, 2004, 4:01pm
Yes, clauchau, if there happens to be a player that I do well against and others do poorly against, I want as many copies of that player as possible in the field.  This can distort the ratings.  But it isn't likely to be as much of an issue as the current problem that I can find one opponent I do relatively well against and play hundreds of games against that one opponent.

If nothing else, we should have a reasonable pool of bots with fixed ratings.  It seems that the current pool needs to be extended further up to overlap with ShallowBlue and human beginners.  Are there natural, easy-to-define bots that play better without looking deeper?  I notice that M+K-K isn't in the pool, where K is just a count of the number of pieces each player controls.  Furthermore, that bot could perhaps be further strengthened by tiebreaking pure materialism with a count of "favorable adjacencies", i.e. among moves which result in the same number of pieces being captured, break ties in order of (# of enemy pieces next to a stronger friendly piece) - (# of friendly pieces next to a stronger enemy piece).  We could call it M+K-K+A-A, where A is for adjacency.  We could also have M+K-K+F-F, where now F stands for the number of frozen pieces on each side.

Are there other natural algorithms?

How difficult is it to add new bots to the "anchor group"?

Title: Re: Arimaa rating deflation
Post by 99of9 on Sep 18th, 2004, 6:10pm

on 09/18/04 at 16:01:59, Fritzlein wrote:
How difficult is it to add new bots to the "anchor group"?


Very easy if they're similarly structured.  If you can define it, then you can probably code it quickly by copying Clauchau's current bots.  Then you just have to run it against all previous ones, and feed the results into my program.

Actually there are quite a few bots that Clauchau wrote that haven't been included yet because they haven't been run against each other.  I guess they're intermediate ones though, but I'm slowly working through them.

Title: Re: Arimaa rating deflation
Post by clauchau on Sep 19th, 2004, 9:13am
I remember having tried a mix of Freezer and K, weighting frozen pieces at less than 1. The best weight was 0.86 or so, but it didn't make that much stronger a bot.

Maybe Freezer or A distinct from K would do some good as you suggest, fritzlein. In any case I bet we need more to make a really better bot - open access to goal ranks, something about traps and territory and strength density.

Here is a thought about how to make it natural: The first bots are about advancing rabbits. It's naturally derived from the winning condition. The only human thought we put into it is that in order to get a rabbit on the 8th rank, we better have
some on the 7th, the 6th, ...

We might try to naturally derive a bot from the condition "being one step from winning" = "having a rabbit on the 7th rank and no piece above and (a friend beside or no stronger foe beside) and having a step to play".

Title: Re: Arimaa rating deflation
Post by omar on Sep 21st, 2004, 3:52pm
Karl (Fritzlein) and I had some long email discussions about the problems with the Elo rating system. Some of the things that I learned from them were:

* The Elo rating system works fine as long as the players are not allowed to pick their opponents and the opponents are picked for them (as happens in tournaments).

* When the players are allowed to pick their own opponents, the rating system can be abused by repeatedly playing the same or small group of opponents.

* When computer opponents are also added to the mix, it makes the problem even worse, because once a player learns how to defeat one they can do it again and again, since the computer opponent will never figure out why it is losing and adapt itself (at least with the current computer opponents :-) ).

* A player's rating is very dependent on the pool of players that are available to play against. For example, a player with a true rating of 3000 will never show that rating if the ratings of the other players in the pool are around 1000. Even if he consistently defeats all of them, the rating formula will only let him increase to about 2200. So there needs to be a good continuum of players at all levels to support a healthy rating system.

* Even though we use winning percentages to compute the ratings we should not expect the ratings difference between any two players to accurately tell us what the winning percentages will be when they play each other. We can go forward, but we can't reliably go back.

* The meaning of a player's rating should be how they have performed against the field. The more different opponents a player has played, the more meaningful and reliable the rating is. If only a few opponents have been played, the rating is not very meaningful.

Keep in mind that anchoring a rating system does not prevent players from abusing the rating system by playing very limited sets of selected opponents. So this problem is completely independent of that.

After having these discussions with Karl I tried out some different formulas for computing the players' ratings that might be more resistant to abuse. The current best formula has these properties:

* Winning many games against the same player will not increase your rating as much as winning the same number of games against different players.

* A rating obtained by playing the same player will fall more after a loss than the same rating obtained by playing many different opponents.

* A rating obtained by playing much weaker players will fall more after a loss than the same rating obtained by playing opponents of similar ratings.

* The recent games count more than the older games, but a player cannot wash out their history by playing a lot of games (like 200) with the same player.

* The rating uncertainty goes down faster if you play different opponents, and it can go back up if you start playing many games with the same opponent or few opponents. In effect the rating uncertainty can reflect how meaningful your rating is.

I passed it on to Karl to look at and am waiting to hear back from him about it. He is pretty good at finding cracks in a system :-)

If anyone else wants to check it out, you can download it from:
 http://arimaa.com/arimaa/rating/testRatings.tgz


Title: Re: Arimaa rating deflation
Post by clauchau on Sep 22nd, 2004, 2:28pm
Ah, I see you too Omar became a forum God :)

I guess it's a good idea to have the ratings to be the solution of an equation, but I don't understand the equation.

And how to justify the addition of a draw against a zero rated opponent, in theory? I can only understand adding a draw against an equally rated opponent.

Title: Re: Arimaa rating deflation
Post by omar on Sep 23rd, 2004, 10:22am
I recently lowered the values for the forum seniority levels; that's why we all gained seniority :-)

The fictitious draw is needed so that the equation does not blow up if a player has not lost or won a single game.

The choice of what rating that fictitious draw is against does not matter too much in the long run, but choosing the average opponent rating (a number which is not fixed) can cause a player to gain rating points after losing a game against a highly rated player (which is counterintuitive to how ratings should work). So that is why I didn't choose that. I thought it was more important to choose a number that is fixed and does not change. Zero seemed like the obvious number :-)


Title: Re: Arimaa rating deflation
Post by clauchau on Sep 24th, 2004, 6:53am
The simple bots we've got so far on the bottom of the scale all happen to see no more than the horizontal projection of the board. Much less even, because the ranks of the noble pieces aren't taken into consideration. So, I'm wondering, how about first making the best of that projection?

I want to know how high it is best to put every piece relative to each others. Like of course it is better to wait for a lower density of foes on the goal rank before advancing your rabbits, except the more you accompany them with friendly powerful pieces the higher the foe density you can tolerate.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Nov 2nd, 2004, 2:48pm
In another thread I said:


on 11/02/04 at 12:02:38, Fritzlein wrote:
It isn't possible to deduce a priori whether the way players enter and leave the playing pool will have a net inflationary or net deflationary effect.  From my short time of observation, however, I rather expect that we are suffering from mild deflation at present.  And even if we are experiencing a small amount of inflation in the sense that the average rating of active players is going up (which I don't think we are), I suspect that there is significant deflation in the sense that Fotland meant, i.e. that a 1700-rated player today is significantly stronger than a 1700-rated player was a year ago.


There is some evidence that there is both inflation in the one sense and deflation in the other.

I tested for deflation by looking at Arimaazilla's rating, since as I understand it, that bot plays the same as a year ago.  So I checked my database.  In September and October 2003, Arimaazilla was averaging a rating of 1506.  In September and October (up to the 24th) of 2004, Arimaazilla was averaging a rating of 1429.  That's not conclusive, but it sure is suggestive.

On the other hand, in November of 2003, the average rating of 23 active humans was 1558, whereas in November of 2004, the average rating of 19 active humans was 1663.

So at a first approximation, it is possible that we have equal and opposite trends.  Maybe the average rating of active players has gone up by 105 points in a year, but the rating of a player who hasn't improved has gone down by 80 points in the same time frame.

An alternative conclusion is that humans are just getting better at beating bots.  That is to say, perhaps there is neither inflation nor deflation, but human players are pulling away from the state of the art computer players.  This is intuitively very plausible to me.

Title: Re: Arimaa rating deflation
Post by 99of9 on Nov 2nd, 2004, 4:53pm
If you include bots in your definition of an "active player" what do you find?

This is what I expect to have inflated due to wandering players.  Find all the players that have played at least a few rated games in a month, average their ratings.  If someone plays 10 times the number of games their rating shouldn't be included in the calculation 10 times.

You are no doubt right that active humans have improved relative to the bots, but that is a different issue, that is the deflation that fotland was identifying when he started this thread.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Nov 2nd, 2004, 6:34pm

on 11/02/04 at 16:53:50, 99of9 wrote:
If you include bots in your definition of an "active player" what do you find?


If we include active bots the average rating has still gone up, albeit by only 80 points.  So there is definitely inflation in the sense that the average rating of all active players has gone up, not just in the sense that humans are pulling ahead of the bots.

On the other hand, it is more difficult to separate the deflation of the rating of a consistent player from humans getting better at bashing bots.  Since humans generally improve over time, we can't really test what happens over time to the rating of a human who doesn't get better or worse.  But if we had such a human, I'll bet his rating would be lower now than a year ago.

Title: Re: Arimaa rating deflation
Post by RonWeasley on Nov 3rd, 2004, 11:55am
Here's something related but different.

If I don't play for a long time, I play worse in my first games after the layoff.  I thought I noticed this when naveed and clauchau started playing again after periods of inactivity.

This makes me think a rating should decay as a function of inactive time.  Maybe decay to 1500 with a time constant of 90 days.  I shy away from recommending we actually do this since I worry about bad side effects.

Another penalty might be applied if a player always has the same opponent.  This problem has been discussed before but not with a penalty in mind.  Seems like a great opportunity for side effects here.  This is why I would only duel with Crabbe.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Nov 3rd, 2004, 4:23pm
In terms of accuracy of rating, it is reasonable to penalize people who have had a long layoff with no games.  Probably the relative ratings at any given time would be slightly more accurate if we did that.  However, rating systems usually shy away from such penalties on the grounds that they tend to deflate the system as a whole over time.

An alternative would be to increase the RU of anyone who hasn't played for a while.  Typically you would expect players to get worse during a layoff, but they might get better.  Perhaps I took time off to play a series against Bomb on my home computer, for example.  Increasing the RU says, in effect, we don't know what happened during the layoff, but the longer the layoff, the less certain we are about the accuracy of the old rating.

I'd be game for Omar to implement this change to the current system independent of all the other changes we have discussed.  Every week (or every day!) he could automatically add 1 to the RU of every account.  Then players who are inactive for a long time get to start off at their old rating, but it will move fast when they rejoin, and in the meantime the high RU will reflect that their rating isn't currently accurate.

Title: Re: Arimaa rating deflation
Post by omar on Nov 7th, 2004, 10:24pm
Sounds like a good idea. I set up a cron job to increment the RU by 1 point (if it is less than 120) each week.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Nov 16th, 2004, 10:52am
As it turns out, using Arimaazilla as a benchmark for inflation might not be a totally brilliant idea.  Arimaazilla's rating apparently fluctuates a great deal depending on who its recent opponents have been.

http://www.math.umn.edu/~juhn0008/Rating_graph.png

There may be a general downward trend in Arimaazilla's rating, but there is enough noise that it isn't very conclusive.

[Edit] Updated to January 4, 2005.  Arimaazilla and Arimaanator are down, Bomb is up.

[Edit] Updated through January 25, 2005.  Arimaanator has recovered somewhat to about 1650, but Arimaazilla is still under 1400, low by historical standards.

Title: Re: Arimaa rating deflation
Post by fotland on Nov 19th, 2004, 12:23am
What a discouraging graph :)  I got Bomb to 1800 in about 6 weeks, and its rating hasn't improved since, despite all the extra work I've put in.

Reality is that the rating system has deflated several hundred points as people have gotten better at the game.  That early 1800 bomb would only be about 1600 now.

Title: Re: Arimaa rating deflation
Post by omar on Nov 22nd, 2004, 11:21pm
Wow nice. Looks like I'm also taking a nose dive. I'd like to see Toby's graph :-)

Title: Re: Arimaa rating deflation
Post by Fritzlein on Mar 23rd, 2005, 9:24pm
I updated the graph through mid-March, but since I changed the style I thought I would leave the old one up.  Rather than having a dot per game, I have a dot per player per week, the weekly average of that player.  Also I replaced Omar with Naveed, since Naveed has been the most active player over the life of the server.

http://www.math.umn.edu/~juhn0008/BotRatingDeflation.png

We can see that Arimaazilla is slightly low by historical standards, while Arimaanator is quite low.  There are too many active Arimaanator-beaters out there nowadays to let Arimaanator get any traction from new players.

Bomb is not low by historical standards, but is still stuck hovering just over 1800, which must be frustrating given that it has only played when under active development.  The gap between Bomb and the challenge prize does not seem to be narrowing even though Bomb is getting better.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Mar 24th, 2005, 2:34pm

on 03/24/05 at 12:00:40, Arimanator wrote:
I rate a measly 1100 and I just beat bot arimaazon twice in a row...


You are an excellent example of how new players can contribute to either inflation or deflation.  At first, as your rating dropped from 1500 to 1100, you contributed about 100 points each to the ratings of Arimaalon and Arimaazilla.  If you leave now and never play again, you will have injected points into the rating system, contributing to inflation.

On the other hand, you are now underrated, and your RU is down to 30, so the system considers that you are an established player.  Now if you beat up on the bots until your rating is back up to 1500, you will take points from them exactly equal to the number of points you gain, i.e. you will draw 400 rating points back out when you only put 200 in, even if you start and end your career at the same rating of 1500.

One reason the bots are low at the moment is that Ryan Cable came through before you, lost many points to the bots at a high RU, then won them back at a low RU.  The net effect was to push down the bots' ratings.  In fact he has now pushed himself up to 1700, so he drew even more points out of the system beyond what he got for getting back up to 1500.

On the other hand, this is somewhat balanced by people who only lose a few games and don't come back, and therefore each donate a few rating points into the pool.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Mar 24th, 2005, 3:24pm
While I was back on the subject, I couldn't resist another picture.  This graph shows the week-by-week average rating of all Arimaa players who played at least one rated game during that week.  Also there is a line for human players only.

The average active human player seems to be a bit stronger than the average active bot most of the time, but not always, and it wasn't that way at first.  Also, although everyone starts at 1500, the average active player is substantially higher.

http://www.math.umn.edu/~juhn0008/ActivePlayerRating.png

This graph is too noisy for me to tell whether it is rising or not.  At least we aren't obviously experiencing an increase in average rating at the moment (which would be a type of inflation), whereas I think we are quite probably experiencing the type of deflation whereby 1800 designates a stronger player than it used to.


Title: Re: Arimaa rating deflation
Post by Fritzlein on Jul 24th, 2005, 6:57pm
Now that newcomers have more bots to play, more rungs in the ladder, so to speak, it seemed appropriate to update my graph of bot ratings to include them.

http://www.math.umn.edu/~juhn0008/StandardBotRatings.png

I thought that having a greater spectrum of bots might mean that the rating of each would swing less wildly, but it appears not to have happened.  Arimaalon, which is now the first rated bot for everyone, has perhaps taken on some of the uncertainty that used to go to Arimaazilla, but the latter still has regular ups and downs.  Arimaazon should by rights be consistently higher than Arimaazilla, but there isn't much separation, which may be a result of some people preferring to play fast.

The ratings of the bots still seem hugely dependent on who is playing them, and how persistent they are.  Blue22 has beaten Arimaanator down under 1600 for a while, which is historically very low, and which has allowed Arimaazon to cross over and temporarily have a higher rating, even though there is little question in my mind that Arimaanator is the stronger bot.

Maybe once the Arimaa playing population grows to the point that the influx of new players is more steady, the bot ratings will be more predictable.  For now, however, the ratings of the standard bots in the lobby seem to have more to do with whether the latest newbie likes to move up to new challenges as soon as possible, or likes to stay with one bot until gaining complete mastery before moving on.

I wonder whether a better series of steps would be to have all lobby bots playing at 2 minutes per move, but moving quickly to internal limits so the folks can blitz along if they so choose.  Perhaps the championship P1 and P2 bots would be good for this purpose.  Another purpose would be served by mixing things up: at present the first four bots all use the same evaluation function.  Newcomers might learn more from, say, ShallowBlue, LocP1, Arimaazilla, CluelessP1, GnobotP2, BombP2 than from the current setup.  Just a thought.

Title: Re: Arimaa rating deflation
Post by omar on Jul 27th, 2005, 6:58am
I like the idea of using the fixed performance CC bots in the lobby since they have such different styles of play.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Sep 27th, 2005, 11:34pm
My previous graph of the average rating of players by week was too choppy to convey much.  I have generated a new graph of the monthly average rating of all active human players.  That is to say, I divided the game database into 30-day chunks; within each chunk I calculated the average rating of each human who played at least one rated game; then I averaged the ratings of all humans active in that chunk.

It is clear that we are experiencing a gradual rating inflation.  The average human rating, which started around 1500, has climbed to around 1600.

I'm guessing that if the system were anchored with fixed-performance bots of fixed rating, we would have seen rating inflation of more than 300 points in the same time span.  As it is, the system seems to be steering a reasonable middle course between fixing the average rating of the playing pool (which is going up) and fixing the rating of an unchanging skill level (which is going down).

http://www.math.umn.edu/~juhn0008/RatingInflation.png

Title: Re: Arimaa rating deflation
Post by Ryan_Cable on Jan 29th, 2006, 8:46pm
I just noticed that it is possible to sort the New Players list by ratings.  (It seems to list humans and bots that joined within the past year.)  There are 159 new players rated <1500, with a mean rating of 1381, collectively losing 18892 points.  There are 69 new players rated >1500, with a mean rating of 1598, collectively gaining 8244 points.

Combined, the new players lost 10648 points.  Most of these players played a few games at RU~=120 against opponents with RU~=30 and then left, so they added ~2662 points to the pool of active players.  There are 42 players with RU<=50, and 70 players with RU<=80, so in the last year, we added between ~38 and ~63 points per active player.

Only 2 new players are rated >=2000, blue22 and the inactive Arimanator.  Only 1 other new human is rated >=1700, the infrequently active OLTI.  Only 8 other new humans are rated >=1600; 5 of whom have been seen within 31 days, and 2 of whom have RU<=80.  I can only hope this doesn’t mean there is a drop off in the rate of potential 2000+ players joining.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jan 31st, 2006, 10:23am
Nice analysis as usual, Ryan.  You inspired me to make another graph.  I decided to base this one on game number rather than on date.  My definition of "Average Rating" of the Arimaa population was to take the last rating of each of the last 100 players to have played a rated game.  I included both bots and humans.  At present these 100 players include everyone who has played a rated game within approximately the past two months, but earlier (when the playing pool was smaller) it probably covered a longer time span.

http://www.math.umn.edu/~juhn0008/RatingInflation2.png

The point I wanted to make was that newbies are a source of rating deflation, at least temporarily.  When someone joins and drops 240 points to Arimaalon who gains 60 points, the system loses a net 180 points.  It isn't until that newbie stops playing and drops off the end of the list of active players that we can say the system has gained a net 60 points.

Thus we see that at the beginning, when the ratings still included all active players, there was a clear deflationary trend.  It wasn't until people started cycling out of the playing pool that inflation kicked in, inflation based on the fact that most players who become inactive have ratings below 1500.

Despite approximately 70 points of rating inflation, I'm guessing a player rated 2000 today is 100 to 200 points stronger than a player rated 2000 two years ago.  I'm curious how it would affect the rating inflation trend if the average active player weren't getting significantly stronger.  Perhaps if the deflationary influence of everyone learning together weren't so strong, then the inflation would really take off.  Or maybe the inflation isn't so much affected by that.  I dunno.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Apr 3rd, 2006, 3:40pm
We all know how lots of noobs join the server, lose a few games to low bots, and never come back.  They tend to donate 30 to 100 rating points each.  Those points stay among permanent players and contribute to rating inflation.

However, I wanted to study the effect of someone like Swynndla, who joins the server and loses a ton of points, but later wins them back with interest.  Between server games 25572 and 26260, Swynndla reduced his RU from 120 to 30 by playing 71 rated games.  He started at a rating of 1500 and ended at 1565, but that doesn't mean he robbed a net of 65 points from his opponents, because a variable RU means his rating was changing at a different speed than the ratings of his opponents.  In particular, on his way down to a rating of 1024, his rating was moving faster than when he clawed his way back up.

I posted earlier my conviction that folks like Swynndla have a deflationary effect beyond the traditional one of leaving at a higher rating than they join.  (Heaven forbid Swynndla should leave soon, but when he does, he'll have a higher rating than the 1500 where he started, so net points will leave the system with him.)  When I actually ran the numbers on these 71 games, it showed that my theory doesn't work.  In fact, Swynndla actually only robbed 63 points from his opponents, i.e. less than the 65 he gained!

Where did my theory go wrong?  Hmmm...

Actually, it partially worked.  After 31 games, the point at which Swynndla first resurfaced above a 1500 rating, he had robbed his opponents of 66 points while only going up to 1512.  That's a net deflation effect of 54 points.  However, at that point his RU was still high at 70, and in subsequent games Swynndla peaked at a rating of 1669 before falling back to 1565.  He reversed the effect I predicted by gaining points at higher RU and losing back points at lower RU, for a net inflation.

Furthermore, anyone who gains rating points above 1500 while they have a high RU, even if they only hold steady thereafter until their RU reaches 30, will gain more points in their own rating than they robbed from others to get there.

I guess I need to revise my deflation theory, and say instead that noobs almost universally contribute to inflation.  Either they quickly cash out at a lower rating than 1500, which donates points to the system, or they stick around and start rising above 1500 before their RU bottoms out at 30, which means their own rating represents extra points in the system.  The (much rarer) deflationary cases are noobs who make it down to RU 30 before pumping their rating back over 1500.  Also there is traditional deflation from strong, established players leaving and taking their high rating along, but that doesn't happen much, because Arimaa is just too addictive.  I note that our two highest-rated departures, mouse and Arimanator, have each recently come back.  :-)  That leaves only OmarFast and bleitner as high-rated inactive players.

To add even more fuel to the inflationary fire, the continual bot vs. bot games are taking inflationary points away from the bottom few bots and spreading them much more quickly across the entire system.  This drives those bottom bots back down in rating, so that they can rob noobs of points all the more quickly.  Throw in the fact that we had a flood of noobs in February and early March, and I'd say we're probably presently experiencing rating inflation like never before.

Title: Re: Arimaa rating deflation
Post by frostlad on Apr 3rd, 2006, 11:21pm
What does a player like filerank do to the point pool? just curious.

Title: Re: Arimaa rating deflation
Post by Swynndla on Apr 4th, 2006, 6:22am

on 04/03/06 at 15:40:29, Fritzlein wrote:
Heaven forbid Swynndla should leave soon
...
...
but that doesn't happen much, because Arimaa is just too addictive


Indeed, I'm addicted and therefore I won't be leaving :D

Title: Re: Arimaa rating deflation
Post by Fritzlein on Apr 4th, 2006, 1:33pm

on 04/03/06 at 23:21:03, frostlad wrote:
What does a player like filerank do to the point pool? just curious.

While filerank was winning his first 42 rated games in a row, he boosted his rating 424 points to 1924 while robbing 207 points from his opponents, for a net inflation effect of 217 points.  His RU was still 60 at that point.  Of course, if he had quit and gone home at that point it would have been a 207 point contributor to deflation, but he didn't quit.

Over filerank's first 166 rated games total, he gained 276 net rating points to land at 1776.  His opponents in that span gained a net 3 rating points.  Therefore, even if filerank now takes his high rating and goes home, he has still contributed three points to rating inflation!  If he stays around, however, he has contributed 279 points, mostly within his own rating.

As long as I'm at it, Frostlad, I might as well do your record  too.  You are a bit different from either Swynndla or filerank, because you had neither a huge drop nor a huge rise at the beginning.  Instead you've sort of worked your way up slowly and steadily.  In your first 121 rated games you gained 123 rating points while robbing 162 from your opponents.  So unlike the other two, you've actually contributed to deflation to the tune of 39 points.  Strange, eh?

I note that you have a better rate of gain (one point per game) than I do.  I've played 798 games and only gained 778 rating points.  ;-)

Title: Re: Arimaa rating deflation
Post by Fritzlein on Apr 4th, 2006, 2:15pm
Still in the same vein, I have 2997 rated HvB games in my database for 2006.  (I am suspicious of the "rated" flag's accuracy, as I posted in another thread, but for now pretend it is correct.)  In those 2997 games, the bots collectively gained 1448 points while the humans collectively lost 9687 points.  It fits with the bigger picture.

[EDIT] I withdraw my innuendos against the rated flag.  It was an error in one of my queries (not the one in this post) that made me unjustly suspicious.

Title: Re: Arimaa rating deflation
Post by frostlad on Apr 5th, 2006, 10:30am
Do you have a program, or what is the formula you use to calculate the net points gained vs. how many I took, Fritz?

Title: Re: Arimaa rating deflation
Post by Fritzlein on Apr 5th, 2006, 1:04pm
(1) I download the zipped game files as text from http://arimaa.com/arimaa/download/gameData/

(2) I import the data into Microsoft Access

(3) I query all rated games of one player, to output columns PlayerRating, OppRating, PlayerRU, OppRU, PlayerScore.

(4) I cut and paste the query results into a Microsoft Excel spreadsheet with formulas to calculate, for each game, how many rating points Player and Opp gained and lost.  I also have columns for cumulative loss/gain.

If I were a programmer, I would automate this process somehow, but given my limited abilities I just kludge along the best I can.  :-)

Title: Re: Arimaa rating deflation
Post by omar on Apr 5th, 2006, 10:50pm

on 04/05/06 at 13:04:17, Fritzlein wrote:
If I were a programmer, I would automate this process somehow, but given my limited abilities I just kludge along the best I can.  :-)


In my book, mathematicians are a superset of programmers (it's just that they don't know it). :-)

Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 18th, 2006, 3:22pm

on 09/17/04 at 10:04:12, 99of9 wrote:
Well now I've written a program to do some simulated annealing to determine the best fit ratings.  Here's the output for that same dataset.

# Rating:     0.0 M
# Rating:   -23.6 S
# Rating:   311.3 S+K-K
# Rating:   362.9 S+I
# Rating:   491.4 S+I-I
# Rating:   806.9 M+I-I
# Rating:   913.0 S+F-F
# Rating:   986.1 S+S-S
# Rating:  1007.1 M+F-F
# Rating:  1073.9 M+S-S

So I wasn't that far off by hand...


I just re-read this post from two years ago, and realized two things:

1) I doubt simulated annealing is necessary for the estimation of ratings, because I suspect there are no local minima other than the global minimum.

2) M+S-S is another name for ArimaaScoreP1.  This is quite amusing, because after all the fuss about anchoring the system with a random mover rated zero, the un-anchored system is giving ArimaaScoreP1 a rating of 1178, i.e. roughly the rating of 1074 that 99of9 calculated it would have on an anchored scale!

One caveat is that ArimaaScoreP1 plays almost exclusively newcomers who haven't yet beaten ArimaaScoreP1, and if those opponents are on average overrated by about X points, then ArimaaScoreP1 is probably also overrated by about X points.  I would guess that if the scale were not distorted by having all newcomers play the same bot first, then ArimaaScoreP1 would be rated below 800 on the current scale.  In other words, our whole system would probably need to nudge up three hundred points to be consistent with a random player having a zero rating.  It only looks like a match at the moment because ArimaaScoreP1 is overrated.

Even so, it is hilarious that the arbitrary choice of having newcomers enter at 1500 has anchored the ratings not far from the most intuitive non-arbitrary point of random=0.  All that Omar has to do now to claim that our system is anchored, is to fix ArimaaScoreP1 to a rating of 1074!

(Of course, I still feel it is bad to use bots to anchor a rating system for reasons outlined earlier, but if anchoring the system merely means fixing ArimaaScoreP1 to a rating of 1074, it will make essentially no practical difference from the current system.)
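Regarding the earlier hunch that simulated annealing is overkill for fitting ratings: under the standard Elo/Bradley-Terry model the log-likelihood of a set of game results is concave in the ratings (once one rating is pinned to fix the scale), so a simple hill-climb finds the unique best fit.  A toy sketch with invented game results, not 99of9's actual dataset:

```cpp
#include <cmath>
#include <vector>

// Elo-style expected score for a player rated `ra` against `rb`.
double expected(double ra, double rb) {
    return 1.0 / (1.0 + std::pow(10.0, (rb - ra) / 400.0));
}

// One game record: index of the winner and of the loser.
struct Game { int winner, loser; };

// Fit ratings by repeatedly nudging each rating in proportion to
// (actual wins - expected wins); the fixed point is the maximum-likelihood
// fit, and concavity means there is no other local optimum to get stuck in.
// Player 0 is pinned to 0, since only rating differences are determined.
std::vector<double> fitRatings(int nPlayers, const std::vector<Game>& games,
                               int iters = 5000, double step = 20.0) {
    std::vector<double> r(nPlayers, 0.0);
    for (int it = 0; it < iters; ++it) {
        std::vector<double> grad(nPlayers, 0.0);
        for (const Game& g : games) {
            double e = expected(r[g.winner], r[g.loser]);
            grad[g.winner] += 1.0 - e;  // won more than predicted: push up
            grad[g.loser]  -= 1.0 - e;  // symmetric push down for the loser
        }
        for (int i = 1; i < nPlayers; ++i) r[i] += step * grad[i];
    }
    return r;
}
```

With a 3:1 win ratio between adjacent players in a chain, the fitted gaps settle at 400·log10(3) ≈ 191 points each, matching the closed-form answer for this model.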

Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 19th, 2006, 10:38am
I tried a new method of measuring rating inflation.  I divided games into blocks of 1000, and then averaged the ratings of both players of all games played in each block.  Thus if a high-rated player happened to play a bunch of games, that would pull the average up, and if a low-rated player played a bunch of games, that would pull the average down.  Bots and humans both included.

http://www.math.umn.edu/~juhn0008/RatingInflation3.png

As you can see from the graph above, that method is too volatile to get a good reading on inflation.  So I revisited an earlier method, namely taking the average over players rather than over games.  In the graph below, I averaged over the 100 most recent players to play a rated game.  Bots and humans both included.
 
http://www.math.umn.edu/~juhn0008/RatingInflation2.png

I attribute the dip in average rating shortly after 30,000 to an influx of new players at that time, and the beginning of the bot ladder, which made some of the previously-inactive low-rated bots into active players.  That raises the question why the surge of new players from MetaFilter, around game 18500, didn't have a similar impact, but at that time the first bot for newcomers was ShallowBlue, playing unrated.  Thus most of the MetaFilter newcomers never played any rated games.

My guess is that every surge of newcomers provides only a temporary drag on the average rating of active players, because if they leave soon their low rating drops out of the average while their donated points stay, but if they stay for a while, they usually get a higher rating.

From both of these graphs, we can see that the average player rating has risen from about 1500 to about 1600 over the lifetime of the server.

Title: Re: Arimaa rating deflation
Post by OLTI on Feb 9th, 2007, 3:24pm
Check out the "p8 rating" on Top Rated Players

Title: Re: Arimaa rating deflation
Post by Fritzlein on Feb 9th, 2007, 4:39pm
I didn't post an updated graph, but I recently ran a query, and the average rating of the last 100 players to play a game has hovered just over 1600 for the last 10,000 games.  I thought perhaps inflation was rampant, i.e. the trend might keep going up and up and up, but apparently it has leveled off a bit.  For now the "average" player is rated somewhere around 1620.

By another measure, we could look at only the established players page, where you have to have RU 60 or less to be listed.  The 61 established humans have an average rating of 1707, while the 48 established bots have an average rating of 1598.  Combined, the average established player is rated 1659.  This excludes newcomers, so it is a bit higher than my other figure of 1620.

Title: Re: Arimaa rating deflation
Post by omar on Jan 24th, 2008, 7:28pm
Karl mentioned to me that he and some other players have been noticing continued inflation in the rating system. For a long time now we've discussed anchoring the rating system so that it does not drift. Additionally we wanted to anchor it so that the random bot has a rating of zero. Karl mentioned earlier in this thread (Aug 18th, 2006) that perhaps the easiest way to do it would be to fix the rating of ArimaaScoreP1 to 1074, based on the work that Claude and Toby did to determine the rating of such a bot relative to the random bot. I guess my biggest holdup has been that I wanted to experiment and learn more about anchoring a rating system before taking the leap. Also I wanted to do it when we switched to a new rating system that was more resistant to abuse. But since I'm never able to get around to working on that, I think I'll just do the easy thing now to reduce the drift.

I'll start by fixing the rating of ArimaaScoreP1 to 1074. After that we can perhaps also fix the rating of ArimaaScoreP2 if we are able to calculate its theoretical rating relative to P1.

Another thing we can do easily is change the value of the initial ratings that players start out at. Karl did some analysis and posted in another thread that the probability of a new player winning against ArimaaScoreP1 in their first and second game is 66.7% and 71.7%. Based on this we can probably guess a better value for a new player relative to what ArimaaScoreP1's fixed rating will be.

http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1169996698


Title: Re: Arimaa rating deflation
Post by clauchau on Jan 25th, 2008, 7:09am
Is ArimaaScoreP1 maximizing (the player's Arimaa score minus the opponent's Arimaa score)?  It would then not be exactly the same as the bot M+S-S used in Toby's and my calculations, which amounted to maximizing (the player's Arimaa score * 1000000 minus the opponent's Arimaa score).

Title: Re: Arimaa rating deflation
Post by omar on Jan 25th, 2008, 7:48am
Yes, it computes the total score as my score minus the opponent's score.

http://arimaa.com/arimaa/bots/bot_ArimaaScore/src/eval.c

Title: Re: Arimaa rating deflation
Post by omar on Jan 25th, 2008, 5:54pm
I didn't quite get your scoring formula at first, but Karl pointed out that it basically amounts to maximizing only the player's score and using the opponent's score just to break ties.

In that case there is a difference, and we can't assume that the rating of ArimaaScoreP1 is 1074.  Since you are computing both players' scores anyway, would it be easy enough to modify the program to include a bot that uses the same eval as ArimaaScoreP1?  I haven't been able to get the program to compile on Linux.

Here is the link to the original program:
http://arimaa.com/arimaa/download/randomBot/claude/r.cpp

If someone could get this to compile with gcc that would be great.


Title: Re: Arimaa rating deflation
Post by clauchau on Jan 28th, 2008, 4:29am

on 01/25/08 at 17:54:25, omar wrote:
http://arimaa.com/arimaa/download/randomBot/claude/r.cpp


Thanks for the link, I had lost it all.  I was able to get it compiled with gcc by adding:

Code:
#include <iostream>

and explicitly casting negated enums (which become ints) back to enums:

Code:
const Direction Right= (Direction)-Left;


I've coded ArimaaScoreP1 and run 1000 games against M+S-S.  It only takes 4 minutes here, so I'll run more later once I've fixed the handling of overly long games.

The way it is now, ArimaaScoreP1 won 250 games, M+S-S won 723 games, and 27 games were counted as draws because they lasted more than 512 plies. I have to fix these draws and use the Arimaa score to count more winning games, don't I?

Also I wonder what ArimaaScoreP1 does about repetitions.  Over my 1000 games, 49 ended because a position was repeated a 3rd time, since none of the bots in my experiment care about repetition.  Should I leave my ArimaaScoreP1 like that, not caring?

Title: Re: Arimaa rating deflation
Post by omar on Jan 28th, 2008, 9:07am
Thanks Claude. Looks like ArimaaScore is checking for repetition.

http://arimaa.com/arimaa/bots/bot_ArimaaScore/src/getMove.c

around line 1009

For games that are getting longer than 512 moves, would it be possible to send or post the move list for one or two games, so we can see what is going on and whether extending the limit might help?

And here are the setups that it uses:

http://arimaa.com/arimaa/bots/bot_ArimaaScore/src/setups

and here is the README file that describes the program options:

http://arimaa.com/arimaa/bots/bot_ArimaaScore/src/README

This is the way the ArimaaScoreP1 program is being invoked: getMove -d 4 -1 src/setups

Title: Re: Arimaa rating deflation
Post by clauchau on Jan 28th, 2008, 10:19am
Thanks Omar.

Good, I'll check the source and prevent ArimaaScore from repeating, according to what I find.  It seems bot_ArimaaScore does avoid first-time repetitions, but I'll check more thoroughly.

I'll also get its initial setup move right tomorrow, I have it wrong so far.

Before getting it right, I ran 10,000 random games: 316 games reached my 512-ply limit (256 Gold moves and 256 Silver moves).  They really were draws, and the ArimaaScore didn't matter for any of them, because they all consisted of roaming lone Gold and Silver elephants.

Title: Re: Arimaa rating deflation
Post by omar on Jan 29th, 2008, 8:05am

on 01/28/08 at 10:19:08, clauchau wrote:
Before getting it right, I ran 10,000 random games: 316 games reached my 512-ply limit (256 Gold moves and 256 Silver moves).  They really were draws, and the ArimaaScore didn't matter for any of them, because they all consisted of roaming lone Gold and Silver elephants.


Since we have been playing all our tournament games using the extermination rule already, and I plan to make that the default for rated games in the gameroom, let's apply it for these games as well. The first player to lose all the rabbits loses the game by extermination. It will eliminate the draws that you are seeing. Draws could still be possible if both players have lost their last rabbit after the completion of the move, but this would be super super rare.

Title: Re: Arimaa rating deflation
Post by jdb on Jan 29th, 2008, 8:59am

on 01/29/08 at 08:05:26, omar wrote:
Draws could still be possible if both players have lost their last rabbit after the completion of the move, but this would be super super rare.


I thought that was a win for the player making the move, or am I mistaken?

Title: Re: Arimaa rating deflation
Post by omar on Jan 29th, 2008, 10:01pm
You are right. It would not be a draw, but rather a win for the player who made the move.

We discussed this rare situation a couple years ago.

http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1134228374;start=4#4



Title: Re: Arimaa rating deflation
Post by clauchau on Jan 30th, 2008, 9:00am
As Don Dailey admitted, bot_ArimaaScore and the other parent bots pick a move among equivalent best moves in a biased random way.  You can see the bias demonstrated when shuffling a few cards on "Coding Horror: The Danger of Naivete": http://www.codinghorror.com/blog/archives/001015.html

The more cards, the bigger the bias.  It may result in some moves being picked an order of magnitude more often.  The bias is difficult to foresee because it depends on how many equivalent moves there are and on the incidental order in which the bot initially builds up the move list.

I'm totally happy with it as far as playing goes, and I don't really know whether it makes the bot any stronger or not, but I'm not comfortable with it when building up reproducible basic knowledge about Arimaa and simple reference bots.  So I plan to gather statistics about an idealized bot_ArimaaScoreP1 where that bias is corrected.  I'll test whether my own version of a biased bot_ArimaaScoreP1 is any stronger than the unbiased one.

In case it matters, the correction is easy to make in the source code.  We have to change line 927 or so in the function scramble of getMove:

Code:
   int  r = rand() % lc; // swaps the current card with any card in the deck

should become

Code:
   int  r = i + rand() % (lc-i); // swaps the current card with itself or any following card
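For a 3-card deck the bias can be shown exhaustively rather than statistically: the naive loop draws three random indices (27 equally likely sequences) but there are only 6 permutations, so the counts cannot come out even, while the corrected Fisher-Yates loop produces exactly one draw sequence per permutation.  A small self-contained check, assuming nothing about the real bot's code beyond the two swap formulas quoted above:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Count how often each permutation of {0,1,2} is produced, by enumerating
// every possible sequence of "random" draws instead of sampling.
std::map<std::vector<int>, int> naiveCounts() {
    std::map<std::vector<int>, int> counts;
    for (int r0 = 0; r0 < 3; ++r0)
        for (int r1 = 0; r1 < 3; ++r1)
            for (int r2 = 0; r2 < 3; ++r2) {
                std::vector<int> a = {0, 1, 2};
                int draws[3] = {r0, r1, r2};
                for (int i = 0; i < 3; ++i)
                    std::swap(a[i], a[draws[i]]);  // rand() % lc : biased
                ++counts[a];
            }
    return counts;  // 27 outcomes over 6 permutations: necessarily uneven
}

std::map<std::vector<int>, int> fisherYatesCounts() {
    std::map<std::vector<int>, int> counts;
    for (int r0 = 0; r0 < 3; ++r0)        // i = 0: swap with slot 0..2
        for (int r1 = 0; r1 < 2; ++r1) {  // i = 1: swap with slot 1..2
            std::vector<int> a = {0, 1, 2};
            std::swap(a[0], a[0 + r0]);
            std::swap(a[1], a[1 + r1]);   // i + rand() % (lc-i) : unbiased
            ++counts[a];
        }
    return counts;  // 6 outcomes, one per permutation
}
```

Running both confirms the point: the naive version hits some permutations more often than others, the corrected one hits each exactly once per enumeration.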

Title: Re: Arimaa rating deflation
Post by clauchau on Jan 30th, 2008, 10:19am

on 01/29/08 at 08:05:26, omar wrote:
The first player to lose all the rabbits loses the game by extermination.


Cool, this makes a big difference with basic bots and gives more victories to the bots that deserve them.  In my randomly sampled matches, I no longer get any game drawn by being too long, or any game lost by repetition, because the rabbits go extinct first.

Title: Re: Arimaa rating deflation
Post by mistre on Jan 30th, 2008, 10:33am
Omar,

With the new extermination rule, it is impossible to have a draw, correct?

Once implemented, you can change the reason code "n = no rabbits left; draw" to "e = extermination, player has no rabbits left".


Title: Re: Arimaa rating deflation
Post by clauchau on Jan 30th, 2008, 10:44am

on 01/28/08 at 09:07:23, omar wrote:
http://arimaa.com/arimaa/bots/bot_ArimaaScore/src/setups


Oh well, now that I've argued for a simple idealized bot_ArimaaScoreP1, especially because I would have a hard time reproducing the current bias in scrambling its move list, I also feel like questioning the initial setup the real bot uses on the site.  It has nothing to do with maximizing the Arimaa Score.  It's using some other human knowledge and common setups of other bots.  I'd rather keep having it maximize the Arimaa Score and put all the rabbits on the second row.

However I'll do my best to fight my reluctance to study the real bots if I feel it is wanted.  What do you think: is it better to anchor the rankings on a real, actual bot with tons of games already played, or on a new, simple, idealized version that could be made available as well but with an empty playing record in the gameroom, with the advantage that it is more easily described and reproducible?

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jan 30th, 2008, 10:47am
Even with the rabbit extermination rule, there is a possibility the players will shuffle pieces endlessly without accomplishing anything.  Technically the repetition rule will end such games eventually, so they aren't drawn by rule, but in practice the sun might burn out before the game ends.  The practical solution for live games, at least at present, is to have a time cutoff after which the ArimaaScore formula determines the winner.  But what if the scores of the two players are tied?  Omar has foreseen even this infinitesimal possibility, and ruled that Silver wins if the score is tied after time cutoff.

At one time we feared that optimal play would produce indefinite piece shuffling, but now it seems that optimal play will either pull opposing rabbits or advance friendly rabbits voluntarily, so there is little chance of drawn out human games.  Bots are another matter.  They are still dumb enough to potentially get caught playing aimlessly forever.  Thus Claude needs to have some move cutoff built into his testing program.

I think what Claude's latest post is saying is that the bots he is testing now are so dumb and aggressive (not smart and defensive) that some bot will lose all its rabbits before the cutoff comes into play.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jan 30th, 2008, 11:06am

on 01/30/08 at 10:44:11, clauchau wrote:
What do you think: is it better to anchor the rankings on a real, actual bot with tons of games already played, or on a new, simple, idealized version that could be made available as well but with an empty playing record in the gameroom, with the advantage that it is more easily described and reproducible?

I like the idea of anchoring on an idealized bot that is easier to describe.  ArimaaScoreP1 plays stronger if you give it a fixed setup with all rabbits back, but so what?  It would also play stronger with a fixed first move of elephant forward four steps.  The objective isn't to have a strong bot, the objective is to give beginners a punching bag so they can familiarize with the game.

The idealized bot is one that chooses the move that maximizes ArimaaScore, and if there is more than one such move, chooses each with equal probability.  For setup this means all rabbits forward and the other pieces randomly shuffled on the back row.  The only reason to make an exception for the setup would be to make the bot play stronger.

However, if we have a choice of two idealized anchors, both equally easy to describe, I would prefer to use the stronger anchor.  Claude, did you say that the bot which played to maximize its own ArimaaScore was stronger than the one which played to maximize its own ArimaaScore minus the opposing ArimaaScore?

I'm not even wedded to the idea that the anchor must be ArimaaScoreP1.  There may be a stronger one-ply bot with a simpler description, in which case I would vote to add that bot to the ladder as the anchor instead, rather than anchoring on an existing bot.  But in the strength vs. simplicity tradeoff, I would definitely not want to use anything searching deeper than one ply.
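The uniform tie-breaking rule for the idealized bot can be implemented in a single pass over the move list, with no need to collect all the tied moves first: keep a running count of moves that tie the best score so far, and replace the current pick with probability 1/count.  A sketch with placeholder types (plain integer scores stand in for the real move evaluation):

```cpp
#include <random>
#include <vector>

// Return the index of a maximal-scoring move, chosen uniformly among ties
// (reservoir-style selection over the set of maxima).
int pickBestUniform(const std::vector<int>& scores, std::mt19937& rng) {
    int best = -1, ties = 0;
    for (int i = 0; i < (int)scores.size(); ++i) {
        if (best < 0 || scores[i] > scores[best]) {
            best = i;        // strictly better: reset the tie pool
            ties = 1;
        } else if (scores[i] == scores[best]) {
            ++ties;          // tied: adopt it with probability 1/ties
            if (std::uniform_int_distribution<int>(0, ties - 1)(rng) == 0)
                best = i;
        }
    }
    return best;
}
```

Over many calls each tied maximal index is chosen about equally often, and a lower-scoring move is never chosen, which is exactly the "each with equal probability" behavior described above.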

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jan 30th, 2008, 11:29am
Incidentally, if you wonder why I formerly opposed using anchor bots but now support it, my stance actually has little to do with anchoring per se.  I used to believe that anchor bots would actually contribute to ratings inflation, because people would find an anchor they could beat and beat it a zillion times.  Since they gain but the anchor doesn't lose, that's a net injection of points into the system.

However, other folks have pointed out that we seem to be suffering from inflation in the current system anyway, and I am starting to believe it.  What is the root cause of this inflation?  I think Ryan Cable was the first to point out it is coming from the bottom of the bot ladder.  New users join, play a few games, and quit with a lower rating than they started.  That means they left behind some points in the system.

Now, if the lowest bot in the ladder is anchored rather than floating, a new user who loses to that bot will not leave any points behind.  That bot's rating will not budge.  So the single biggest injection of points has been eliminated at a stroke.
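
A sketch of the mechanics, using the standard Elo expectation on a 400-point scale (the K-factor of 32 is illustrative, not necessarily the gameroom's value): the anchor's rating is simply never written back, so the anchor itself never budges no matter what happens.

```python
def elo_expected(r_a, r_b):
    """Expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def play_vs_anchor(player_rating, anchor_rating, player_won, k=32):
    """Update after a rated game against an anchored bot: only the
    player's rating changes; the anchor's is returned untouched."""
    score = 1.0 if player_won else 0.0
    new_rating = player_rating + k * (score - elo_expected(player_rating, anchor_rating))
    return new_rating, anchor_rating
```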

So you see I still don't care about anchoring the system per se.  I just want an anti-inflationary measure.  If we anchored the rating of BombP2, I would hate it, because I would expect it to be inflationary, making our problems worse rather than better.  Anchoring the lowest bot is good only because of the way our ladder works to make that anti-inflationary.

If anchoring the lowest bot happens to have a fringe benefit that we can say our rating scale is somehow related to a random mover having a rating of zero, that's an added bonus, but it isn't my reason for supporting the change. :-)

Title: Re: Arimaa rating deflation
Post by clauchau on Jan 31st, 2008, 7:31am

on 01/30/08 at 11:06:31, Fritzlein wrote:
Claude, did you say that the bot which played to maximize its own ArimaaScore was stronger than the one which played to maximize its own ArimaaScore minus the opposing ArimaaScore?


Almost. More exactly, with every elementary one-sided score I've tried, including ArimaaScore, the bot which played to maximize its own score first and in case of equality minimized the opposing score was stronger than the one which played to maximize its own score minus the opposing score. With (idealized) ArimaaScore, the 1st bot wins about 65% of its games against the 2nd bot.

It seems to show that attacking is more rewarding than defending in Arimaa.

Title: Re: Arimaa rating deflation
Post by Fritzlein on May 21st, 2008, 8:42am
Omar, why not just anchor ArimaaScoreP1 to a rating of 1000?  That will counter the immediate problem of a gradual upward drift of ratings.  (I'm coming to believe that we are experiencing rating increases that are greater than our collective improvement in skill.)  Our system would then be dual-anchored by starting beginners at 1500 and fixing ArimaaScoreP1 to 1000.

The theoretical problem of how ArimaaScoreP1 should be rated relative to a random mover is less important, partly because exact accuracy is meaningless.  Because the ratings are not transitive, there is no absolute scale, so an approximate answer is good enough.  It doesn't make sense to wait to anchor until we have an accurate answer to an unanswerable question.  We should rather take steps to fix an observed, practical problem.

Anchoring one bot should have an impact in the right direction.  Is it enough?  We could wait and see what the effect is.  I tend to believe that anchoring the whole continuum of fixed performance bots would be a medicine worse than the disease.  I expect the best practical outcome from anchoring some small number of bots at the bottom of the ladder.  But we won't figure out what is best by theorizing.  Let's start trying stuff!

Title: Re: Arimaa rating deflation
Post by omar on Jun 3rd, 2008, 5:18am
I've been meaning to do this for some time now.

http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1065901453;start=117#117

I think I am just waiting to decide what to do about the initial rating of new players. If we want to change that as well then I would like to make both changes at once.

Karl posted the records of new players against the bots they encounter first.
http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1169996698

So based on this, if we set ArimaaScoreP1 to 1000, then how about setting the initial rating of new players to 1100?

Title: Re: Arimaa rating deflation
Post by Adanac on Jun 3rd, 2008, 7:19am

on 06/03/08 at 05:18:11, omar wrote:
I've been meaning to do this for some time now.

http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1065901453;start=117#117

I think I am just waiting to decide what to do about the initial rating of new players. If we want to change that as well then I would like to make both changes at once.

Karl posted the records of new players against the bots they encounter first.
http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1169996698

So based on this, if we set ArimaaScoreP1 to 1000, then how about setting the initial rating of new players to 1100?


Omar, thanks for addressing this issue - this should put a lid on the inflation problem.  I've always liked the idea of decreasing the initial rating for new players.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jun 8th, 2008, 9:20am

on 06/03/08 at 05:18:11, omar wrote:
I think I am just waiting to decide what to do about the initial rating of new players. If we want to change that as well then I would like to make both changes at once.

Making both changes at once sounds fine.  It may be an over-correction that brings about deflation, but it's better to try something than not try anything.


Quote:
Karl posted the records of new players against the bots they encounter first.
http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1169996698

So based on this, if we set ArimaaScoreP1 to 1000, then how about setting the initial rating of new players to 1100?

Those stats show that newcomers win about 2/3 of their very first games against ArimaaScoreP1, which would suggest that they are 200 rating points better than ArimaaScoreP1 even with no experience.  Therefore it would make more sense to fix ArimaaScoreP1 to 1000 and set the initial rating of new players to 1200.  The stats also show, however, that players improve approximately 100 points between their first game and their second.  Therefore we should either give a 100-point bonus for completing a game, or set the initial rating to 1300, which amounts to basically the same thing: since the first bot everyone plays will have a fixed rating, we aren't worried about folks who play only once injecting points into the system.

Title: Re: Arimaa rating deflation
Post by omar on Jun 11th, 2008, 2:20pm
Sounds good I will fix the rating of ArimaaScoreP1 to 1000 and start new players at 1300.

Karl, can you make a graph of the rating history of ArimaaScoreP1 and maybe some of the other fixed-performance bots? This will show clearly that there is inflation. I was looking at the numbers on the screen and there definitely is a slow and gradual rise in the ratings of the fixed-performance bots. Having some graphs to easily see this and compare against in the future would be nice.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jun 11th, 2008, 7:00pm
I guess I should have posted the stats here instead of in the WHR thread, but I can at least link it:
http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1207699394;start=15#15


on 06/11/08 at 14:20:26, omar wrote:
Sounds good I will fix the rating of ArimaaScoreP1 to 1000 and start new players at 1300.

Thank you!  I will watch the results with interest.  I expect this will not only stabilize ratings, but noticeably deflate them from current levels, which isn't all bad if it is correcting for past inflation.

Title: Re: Arimaa rating deflation
Post by aaaa on Jun 15th, 2008, 2:55pm
Wouldn't it make more sense to anchor the aggregate of the ratings of all the bots on the ladder?

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jun 15th, 2008, 4:49pm

on 06/15/08 at 14:55:35, aaaa wrote:
Wouldn't it make more sense to anchor the aggregate of the ratings of all the bots on the ladder?

What does it mean to anchor the aggregate?  Do you mean that the ratings of the bots can move relative to each other, but the average rating is held constant?  For example, if a newcomer enters and gives ArimaaScoreP1 30 points, then the other 15 bots we are anchoring on as an aggregate each lose 2 points to compensate?

That's an interesting idea to prevent points from entering or leaving the anchor pool.  The anchor bots could still get out of synch with the non-anchor players, if most of the games didn't involve an anchor player, but the larger anchor would be better on that score than an anchor of one (ArimaaScoreP1).  The disadvantages I see right off would be:
1. Slight implementation hassle: ratings couldn't be stored as integers any more or rounding error would defeat the mechanism.
2. Initialization: How would one decide on the total rating points of the anchor pool?
3. Helping bot-bashers: Someone sucking points from a particular bot reaches a limit until someone else gives that bot points, but now points could be given to the bashed bot indirectly.
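
For concreteness, the pooled-anchor rebalancing could look something like this sketch (names hypothetical; an even shift across the whole pool is one of several equivalent ways to hold the pool total constant):

```python
def renormalize_pool(ratings, anchor_ids, target_mean):
    """After the normal per-game updates, shift every anchor bot by the
    same amount so the anchor pool's average returns to target_mean.
    Ratings must be floats, or rounding error defeats the mechanism."""
    pool_mean = sum(ratings[b] for b in anchor_ids) / len(anchor_ids)
    shift = target_mean - pool_mean
    for b in anchor_ids:
        ratings[b] += shift
    return ratings
```

In the 16-bot example, a newcomer handing ArimaaScoreP1 30 points would be followed by a shift of -30/16, about -1.9, applied to every anchor, leaving the pool total exactly where it started.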

My own attitude is that I have long since given up on having accurate ratings in any system that includes (A) bots and (B) self-selection of opponents.  It's impossible.  However, if we are forgetting accuracy and just thinking about what the best inflation anchor would be, I like an anchor pool with ratings sloshing around inside it better than I like anchoring the rating of each fixed-performance bot.  Whether I like an anchor pool better than tinkering with the ratings of newcomers depends on how well the latter works.  If we can get a non-inflationary and non-deflationary system by just changing one number, then it's a much easier way to address the whole issue.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jul 20th, 2008, 8:44am

on 06/11/08 at 14:20:26, omar wrote:
Sounds good I will fix the rating of ArimaaScoreP1 to 1000 and start new players at 1300.

It has been almost six weeks since the anti-inflation changes went into effect, so I thought I would check in on what effect they were having.  I took all the players who had accounts prior to the new rules, and calculated the net change in their ratings due to playing newcomers and ArimaaScoreP1.  As a group the old-timers (bots and humans both) lost 758 rating points.  So if deflation is what we wanted, deflation is what we got.  Presumably the system will continue to hemorrhage points until it finds a new equilibrium at a much lower point.

However, if we scratch beneath the surface a bit, that deflationary number seems less like a trend and more like a quirk.  ArifSyed purposely deflated his own rating by 455 points over 16 losses to ArimaaScoreP1.  We worried a bit that fixing the rating of a bot would allow people to arbitrarily inflate their ratings by playing it over and over, but we didn't consider that it also makes the perfect sinkhole for sandbaggers.  In the same time period, ArifSyed has increased his rating by 122 points by beating newcomers, many more points than he would have taken if his rating had been accurate.

In a sense, ArifSyed is acting like a deflation pump himself, by taking rating points from newcomers and making them disappear into thin air.  333 points of the net deflation were his direct doing.

Still, that leaves 425 points of deflation that is a "natural" result of having ArimaaScoreP1 fixed at 1000, and having newcomers enter at 1300.  It's still too early to tell, but I think we may have over-corrected for the inflation we were previously experiencing.

Title: Re: Arimaa rating deflation
Post by aaaa on Jul 20th, 2008, 9:09am
I still say that there should be two ratings: one for human-bot games and one for homogeneous (bot-bot/human-human) ones.

Title: Re: Arimaa rating deflation
Post by mistre on Jul 20th, 2008, 9:49am

on 07/20/08 at 08:44:29, Fritzlein wrote:
In a sense, ArifSyed is acting like a deflation pump himself, by taking rating points from newcomers and making them disappear into thin air.  333 points of the net deflation were his direct doing.


Hopefully he is only making the rating points disappear into thin air and not the newcomers themselves.  ;D

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jul 20th, 2008, 10:52am

on 07/20/08 at 09:49:07, mistre wrote:
Hopefully he is only making the rating points disappear into thin air and not the newcomers themselves.  ;D

Heh.  I hope my editor catches all the unclear antecedents and suchlike before my book hits the presses.  :P

Title: Re: Arimaa rating deflation
Post by Arimabuff on Jul 20th, 2008, 12:02pm

on 07/20/08 at 10:52:41, Fritzlein wrote:
Heh.  I hope my editor catches all the unclear antecedents and suchlike before my book hits the presses.  :P

Don't worry, Honoré de Balzac, one of our geniuses of literature, was known for his many bloopers.

He once wrote: "Oh oh, he said in Portuguese."


How do you say "oh, oh" in Portuguese?  ;)

Title: Re: Arimaa rating deflation
Post by aaaa on Aug 4th, 2008, 8:29pm
Interesting how David Fotland's suggestion concerning the ratings ended up being followed more than four and a half years later.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 5th, 2008, 8:50am
Wow, that is ironic.  Years later we implemented what he suggested in the first post.  But the weirdest thing is that Fotland wanted to lower the ratings of new players because of deflation, and we finally decided to do it because of inflation.  Go figure.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 3rd, 2009, 7:40pm
Well, it turns out that our efforts to combat rating inflation have worked very well, probably too well, to the point that we will now overshoot into severe rating deflation unless we take corrective action.

To detect whether rating inflation/deflation is occurring, I calculated the average rating of each bot that Omar runs (no developer bots) for each calendar year, considering a bot only if it played at least thirty rated games for the year.  A partial table of the results is below:

Bot \ Year  .     .  2003  2004  2005  2006  2007  2008  2009
-------------------  ----  ----  ----  ----  ----  ----  ----
bot_Bomb2005Blitz .     .     .  1876  1856  1931  2038  2012
bot_Bomb2005CC    .     .     .  1774  1858  1916  1903  1876
bot_Bomb2005Fast  .     .     .  1827  1826  1930  1877  1901
bot_GnoBot2005Blitz     .     .  1652  1747  1841  1857  1728
bot_GnoBot2005Fast.     .     .  1541  1724  1734  1734  1664
bot_Arimaazilla   .  1516  1419  1449  1451  1502  1505  1419
bot_Bomb2005P1    .     .     .  1488  1632  1715  1649  1517
bot_Bomb2005P2    .     .     .  1752  1806  1887  1864  1824
bot_GnoBot2005P1  .     .     .  1382  1262  1392  1311  1244
bot_GnoBot2005P2  .     .     .  1552  1608  1651  1636  1545


Taking all the bots Omar runs into consideration, not just the above bots, and dividing them between fixed-performance and variable-performance bots, I get the following average year-over-year rating changes:

fixed-performance
Year    Change
----    ------
2005-6  + 5
2006-7  +59
2007-8  -27
2008-9  -46


variable-performance
Year    Change
----    ------
2005-6  +81
2006-7  +50
2007-8  +32
2008-9  -21


Now, it is no problem if variable-performance bots have gained an average of 142 rating points in the past four years.  That sounds perfectly consistent with increased strength based purely on better hardware.  In fact, an increase of about 36 points per year is consistent with other estimates of the value of faster hardware.
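
Those figures come straight from the variable-performance table above:

```python
# Year-over-year average rating changes for variable-performance bots,
# copied from the table above (2005-6 through 2008-9).
variable_changes = [81, 50, 32, -21]

total_gain = sum(variable_changes)              # 142 points from 2005 to 2009
per_year = total_gain / len(variable_changes)   # 35.5, i.e. about 36 per year
```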

Also, it is no problem that fixed-performance bots are now rated nine points lower, on average, than they were in 2005.  We want the ratings of fixed-performance bots to remain basically constant.  We inflated throughout 2006, 2007, and into 2008, but that was wiped out by deflation in the latter part of 2008 and the first half of 2009.  We are back to normal, in a manner of speaking.

The difficulty is that we are still rapidly deflating.  The changes we made (anchoring ArimaaScoreP1's rating to 1000 and dropping newcomers to 1300) have not yet run their course.  We are not at equilibrium, and unless we make changes now, I predict we will far overshoot on the deflationary side.

Since ratings are near a historically reasonable level now, I recommend that we immediately increase the ratings of newcomers to 1400.  Probably even that will leave us with some deflation, but maybe not, and it seems reasonable to try.  We can check in again at the end of the year.

The alternative, I believe, is to wait until we are sure that the system has overly deflated, and then have to take corrective action to pump rating points back into it.  That's silly.  Rather than having swings up and down, I'd prefer to have some kind of stabilization, so that a 2000 rating in any year means about the same thing as a 2000 rating in any other year.

Just my $0.02

Title: Re: Arimaa rating deflation
Post by Arimabuff on Aug 4th, 2009, 4:41am

on 08/03/09 at 19:40:49, Fritzlein wrote:
...Just my $0.02

Not counting inflation.  ;D

Title: Re: Arimaa rating deflation
Post by mistre on Aug 4th, 2009, 9:05am

on 08/03/09 at 19:40:49, Fritzlein wrote:
Since ratings are near a historically reasonable level now, I recommend that we immediately increase the ratings of newcomers to 1400.  Probably even that will leave us with some deflation, but maybe not, and it seems reasonable to try.  We can check in again at the end of the year.


I agree with Karl.  I have noticed the deflation and if allowed to continue it will only increase.  Starting newcomers at 1400 seems sensible.


Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 4th, 2009, 9:39am

on 08/04/09 at 04:41:23, Arimabuff wrote:
Not counting inflation.  ;D

Hehe, since yesterday it has become my $0.01999

Title: Re: Arimaa rating deflation
Post by omar on Aug 6th, 2009, 1:47pm

on 08/03/09 at 19:40:49, Fritzlein wrote:
Well, it turns out that our efforts to combat rating inflation have worked very well, probably too well, to the point that we will now overshoot into severe rating deflation unless we take corrective action.

No wonder my ratings have been going down :-)

OK I'll change the initial ratings of new players to 1400. Is there any way to know if that new value will be right or will we have to change it again?


Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 6th, 2009, 4:09pm

on 08/06/09 at 13:47:50, omar wrote:
OK I'll change the initial ratings of new players to 1400. Is there any way to know if that new value will be right or will we have to change it again?

I think we might have to change it again.  The problem is that when we decided to combat rating inflation, we instituted two deflationary measures at the same time.  Lowering newcomer ratings from 1500 to 1300 was deflationary for obvious reasons, but the second change of fixing ArimaaScoreP1's rating to 1000 was also deflationary for less obvious reasons.

Lots of new players lose their first game because they are unclear on the concept.  It used to be that whenever a new player came in and lost a game, he gave points to ArimaaScoreP1 that stayed in the system, but now those lost points disappear into thin air because ArimaaScoreP1's rating is fixed.  Yes, some people also gain points from thin air by beating ArimaaScoreP1, but since they gain 18 for winning and lose 102 for losing, the net effect is negative.
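
Those ±18/±102 numbers match the standard Elo curve for the 300-point gap between a 1300 newcomer and the fixed 1000 bot, assuming a K-factor near 120 (that K is an assumption, chosen because it reproduces the quoted values):

```python
def elo_expected(r_player, r_opponent):
    """Win probability the standard Elo logistic curve assigns to r_player."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400.0))

K = 120  # assumed K-factor; picked so the quoted +18/-102 come out
p_win = elo_expected(1300, 1000)   # ~0.85 for the 300-point favorite
gain_on_win = K * (1 - p_win)      # ~18 points gained for a win
loss_on_loss = K * p_win           # ~102 points lost for a loss
```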

Another way of looking at it is that we have put ArimaaScoreP1 outside of the system by fixing its rating.  People don't actually "enter the system" until after they have beaten ArimaaScoreP1.  Because of all the losses to ArimaaScoreP1, the average rating of people entering the system is actually even lower than 1300.

Our two changes at once were obviously an over-correction, but I'm not sure what the ideal middle ground is.  Even after increasing the starting rating to 1400, we will still have inflationary and deflationary forces competing to make a balance.  Inflation will still be caused by newcomers losing a few games and leaving with a rating lower than the rating with which they entered the system.  Deflation will still be caused by people entering the system and working up the ladder until they have a higher rating than they had when they entered the system.

Which will weigh heavier, lots of small sources of points, or a few large drains of rating points?  I don't know, and it depends on user behavior.  Even if we modeled all the past data to determine the starting rating which gave a perfect balance, user behavior might change.  For example, when the boxed set comes out, we might get a higher ratio of small points-contributors who soon leave, or maybe we will get a higher ratio of dedicated players who hang around and deflate the system.  Even if we are perfectly calibrated now, the balance could shift in the future.

I suggest we re-evaluate in another year to see whether the ratings of fixed-performance bots have leveled off, or are still declining, or have bounced back up.  If the ratings have basically leveled, then we can stand pat.  My hunch, however, is that the ratings will still be declining.  If that is true, we may want to pop the newcomer ratings up to 1450, or even all the way to 1500.  In other words, it may have been that fixing ArimaaScoreP1's rating to 1000 was all the anti-inflationary medicine we needed, and lowering newcomer ratings in addition was pure overkill.

If you aren't satisfied with approximate stabilization adjustments every year or two, you could take my measure of inflation, make the automatic measurement once a day, and accordingly adjust the ratings of newcomers for the following day.  I would recommend against it, though, because the daily measure of inflation could fluctuate wildly, causing our countermeasures to similarly fluctuate wildly and constantly over-correct even if the balance is approximately correct.

Title: Re: Arimaa rating deflation
Post by omar on Aug 7th, 2009, 7:27am
Thanks for that explanation Karl. I changed the new player initial ratings to 1400.

Title: Re: Arimaa rating deflation
Post by mistre on Aug 8th, 2009, 7:00am
I don't want to reopen a closed discussion, but how was it decided that ArimaaScoreP1 should be fixed to 1000?  Why not 1100?  Was there a methodology behind this number, or was it arbitrarily decided like the initial ratings of new members?

Title: Re: Arimaa rating deflation
Post by Fritzlein on Aug 8th, 2009, 8:59am
The rating of 1000 for ArimaaScoreP1 was somewhat, but not entirely, arbitrary.  The logic for choosing the fixed point was that if we are going to try to keep ratings stable, why keep them stable around an arbitrary point?  Why not instead keep them stable relative to a point with absolute meaning?

The most natural, meaningful fixed point seemed to be for a random mover to have a rating of zero.  It makes sense that an entity with complete knowledge of the rules, but trying neither to win nor to lose, would have a rating that is neither positive nor negative.  A negative rating would then correspond to actively trying to lose, and a positive rating would correspond to actively trying to win.

The difficulty in anchoring the rating system on a random mover is that its skill level is so far from our own that we have difficulty estimating how far away it is.  The rating model is roughly accurate for closely matched opponents, but the greater the gap in skill, the worse the approximation becomes.  Some intuitive guesses were around a 2000 to 3000 point gap between a random mover and ShallowBlue.  It turns out, however, that random play is not so horrible.  Choosing a move at random is likely to advance a rabbit, which is a useful thing to do.

In order to estimate how bad a random mover is, clauchau created a ladder of bots with well-defined bits of knowledge, e.g. try to capture a piece, try to advance rabbits, etc.  Clauchau and 99of9 let this bot ladder play against itself, and the results are on page 6 of this thread.  The top of the ladder was the bot clauchau called M+S-S, which earned a rating of 1074 relative to the random mover having a rating of zero, according to 99of9's calculations.

'M' stands for generating all possible moves and selecting the best.  '+S' stands for maximizing your own score according to the formerly-official Arimaa score function.  '-S' stands for minimizing your opponent's score.  At first I interpreted this to mean that 'M+S-S' was another name for ArimaaScoreP1, and therefore that, relative to the fixed point of the random mover having a zero rating, ArimaaScoreP1 should have a rating of 1074.

It turned out, however, that M+S-S is actually stronger than ArimaaScoreP1, because ArimaaScoreP1 is equally concerned with maximizing its own score and minimizing the opponent's score, whereas M+S-S first maximizes its own score, and only minimizes the opponent's score as an afterthought to break ties between the set of moves which maximize the mover's score.  M+S-S beat ArimaaScoreP1 about 65% of the time according to clauchau at the bottom of page 9 of this thread.  According to the rating formula, that would mean that ArimaaScoreP1 is 108 points worse than M+S-S, i.e. approximately a rating of 966 relative to a random mover having a rating of zero.
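
The 108-point figure is just the rating formula inverted at a 65% score; under the standard Elo model the implied gap is 400 * log10(p / (1 - p)):

```python
import math

def elo_gap(win_rate):
    """Rating difference implied by a long-run score under the standard
    Elo model: gap = 400 * log10(p / (1 - p))."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

gap = elo_gap(0.65)   # ~107.5, i.e. about 108 points
```

Subtracting that gap from M+S-S's 1074 gives the 966 quoted above.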

So, according to the best information we have, and insofar as the decision is not completely arbitrary, ArimaaScoreP1 should be fixed at 966 rather than at 1000.  However, this round-off error is overwhelmed by so many other considerations that it is totally insignificant.

First (and least important), the playouts that set the scale were random.  If we ran the experiment again, we would get a different value for the rating of M+S-S.

Second, and critically important, we could get any answer we wanted by choosing a different ladder of bots between random mover and M+S-S.  Arimaa ratings are not transitive.  They are only meaningful against the exact pool of players you have competed against.  No matter how accurately you measure the relative playing strengths in a given pool of players, say with millions of plays, those relative ratings would change every time a player is added to or subtracted from that pool.  I am convinced that if we wanted to skew the results, we could devise a different ladder to prove that M+S-S should have a rating over 2000.

Third, and most important of all, even if we managed to anchor ArimaaScoreP1's rating at the "perfect" distance above the rating of a random mover, that would not ensure that anyone else's rating would drift toward a perfect distance above the random mover.  Again, ratings are not transitive, so fixing a bot rating is either inflationary or deflationary according to human behavior.  If we all banded together to incessantly defeat ArimaaScoreP1 by rote, we could inflate our own ratings without bound.  The reality is actually the opposite; since most people who can beat ArimaaScoreP1 stop playing it, the fixed rating has a deflationary effect.  But the mere fact that human behavior determines whether ArimaaScoreP1 pumps points into the system or draws points out proves that it isn't calibrating the rest of the ratings relative to the random mover.

My personal opinion is that anchoring the rating system relative to a random mover is so futile, it should play no part in our rating system decisions.  A vastly more useful rule of thumb would be to keep our scale comparable to the chess scale, where an average club player is rated 1500 and an average tournament player is rated 2000.  It actually benefits Arimaa to have ratings similar to chess ratings so that the scale is familiar to outsiders.  The "anchored at zero" concept is a mathematical invention of ours that doesn't correspond to any outsider intuition.

An alternative rule of thumb would be that whatever system we happen to have chosen, let's keep things approximately constant.  It would be annoying to have a discontinuity in the history of ratings at any point, making past ratings not comparable to future ratings.

Luckily for all, it seems that all three objectives are essentially commensurate.  The scale we happened to have chosen is roughly comparable to the chess scale, so by keeping things stable as they are, we are also keeping things in line with outsider intuition.  In an even greater stroke of luck, this scale happens to be approximately in line with a rating of zero for a random mover.  No, the correspondence isn't exact, but our ability to measure is so clouded by non-transitivity that we are within the limits of any meaningful comparison anyway.

So it turns out that, more or less, we live in the best of all possible worlds.  :)

Title: Re: Arimaa rating deflation
Post by Fritzlein on Feb 9th, 2010, 1:19am

on 08/03/09 at 19:40:49, Fritzlein wrote:
fixed-performance
Year    Change
----    ------
2005-6  + 5
2006-7  +59
2007-8  -27
2008-9  -46


variable-performance
Year    Change
----    ------
2005-6  +81
2006-7  +50
2007-8  +32
2008-9  -21


on 08/06/09 at 16:09:42, Fritzlein wrote:
My hunch, however, is that the ratings will still be declining.  If that is true, we may want to pop the newcomer ratings up to 1450, or even all the way to 1500.  In other words, it may have been that fixing ArimaaScoreP1's rating to 1000 was all the anti-inflationary medicine we needed, and lowering newcomer ratings in addition was pure overkill.

The above-quoted statistics were based on a partial year 2009.  When I redo it for all of 2009, I get

fixed-performance
Year    Change
----    ------
2005-6  + 5
2006-7  +59
2007-8  -27
2008-9  -61


variable-performance
Year    Change
----    ------
2005-6  +81
2006-7  +50
2007-8  +32
2008-9  -48


In other words, the deflation had not yet fully run its course when we made the mid-2009 correction.  Even after we bumped starting players from 1300 up to 1400, there was a bit more deflation working its way through the system.

However, I don't recommend any more corrective action at present.  My gut feeling is that the deflation has now fully worked its way into the gameroom ratings and we have more or less stabilized.  If we do nothing but take the same measurements at the end of 2010, I predict the averages will have drifted only slightly down.  If we make any change, though, it should probably be to the upside, for example by starting new players at 1450.

Totaling the differences from 2005 to 2009 shows that fixed-performance bots have drifted down 24 points total, so perhaps we have over-corrected for several years of steady inflation.  However, each year-on-year change had a different set of bots for comparison.  Taking only the five fixed-performance bots which have been continuously present with floating ratings, namely GnoBot2005P1, GnoBot2005P2, Bomb2005P1, Bomb2005P2, and Arimaazilla, their average rating was actually 2 points higher in 2009 than in 2005.  Therefore, I think we're back to approximately a normal level, and if we stabilize near here, life is fine.

We'll see again at the end of 2010 whether my intuitions have worked out.  :)

Title: Re: Arimaa rating deflation
Post by omar on Feb 10th, 2010, 2:39pm
Thanks for posting this Karl. It's good to know that the gameroom ratings are not deflating as much now. Although now that we have WHR ratings to use for seeding tournaments, I am less concerned about the integrity of the gameroom ratings.

Title: Re: Arimaa rating deflation
Post by zhanrnl on Feb 10th, 2010, 8:20pm
Wow, just read through the entire topic: very interesting! It did strike me as odd that ArimaaScoreP1 was pinned at 1000, but now I see there was a very good reason behind it.

Title: Re: Arimaa rating deflation
Post by Fritzlein on Jun 15th, 2010, 1:50pm

on 02/09/10 at 01:19:12, Fritzlein wrote:
However, I don't recommend any more corrective action at present.  My gut feeling is that the deflation has now fully worked its way into the gameroom ratings and we have more or less stabilized.

Bot ratings through the first five months of 2010 suggest that the system has indeed stabilized, and no further deflation is occurring.

fixed-performance
Year    Change
----    ------
2005-6  + 5
2006-7  +59
2007-8  -27
2008-9  -61
2009-10 +24


variable-performance
Year    Change
----    ------
2005-6  +81
2006-7  +50
2007-8  +32
2008-9  -48
2009-10  +2


One might ask how the fixed-performance bots gained 22 rating points on the variable-performance bots as a group.  One could suppose that an overloaded server drags down the performance of the variable bots, but my hunch is that the data is merely a statistical fluke.  Note that between 2008 and 2009, the fixed-performance bots as a group lost 13 rating points relative to the variable-performance bots, which should not have happened, because the server hardware did not improve between those years.  Most likely that earlier anomaly is merely reversing at present, a sign of measurement noise.
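
For concreteness, the relative movements described above can be recomputed directly from the two change tables.  This is just a quick sketch (not part of the original analysis); the dictionaries simply transcribe the tables:

```python
# Year-on-year average rating changes, transcribed from the tables above.
fixed    = {"2005-6":  5, "2006-7": 59, "2007-8": -27, "2008-9": -61, "2009-10": 24}
variable = {"2005-6": 81, "2006-7": 50, "2007-8": 32, "2008-9": -48, "2009-10":  2}

# Movement of the fixed-performance group relative to the variable-performance group.
relative = {year: fixed[year] - variable[year] for year in fixed}

print(relative["2008-9"])   # -13: fixed bots lost 13 points relative to variable bots
print(relative["2009-10"])  # +22: and regained 22 points the following year
```

The two anomalies are nearly equal and opposite, which is what one would expect if they are noise rather than a real change in server performance.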

There are lots of reasons not to trust individual gameroom ratings, but at least at a macro level there is rough stability.  A rating of 1700 now means approximately what a rating of 1700 meant five years ago.  On the whole, we have neither inflation nor deflation.

This, in turn, means that the increase in top ratings very probably reflects an underlying reality of improved skill.  Anyone who is rated over 2000 today, if they could take a time machine back, would be a contender for the 2005 World Championship.  The fact that chessandgo is rated over 2600 today merely shows how far we have advanced, that is to say, how high the skill pyramid has been built up.

Title: Re: Arimaa rating deflation
Post by omar on Jun 16th, 2010, 9:50am
Thanks for looking at this Karl. Wow, looks like we've finally stabilized the rating system.

It's good to have some fixed-performance bots that play at the level of beginner human players.


Title: Re: Arimaa rating deflation
Post by Fritzlein on Feb 3rd, 2011, 11:15pm

on 02/09/10 at 01:19:12, Fritzlein wrote:
My gut feeling is that the deflation has now fully worked its way into the gameroom ratings and we have more or less stabilized.  If we do nothing but take the same measurements at the end of 2010, I predict the averages will have drifted only slightly down.  If we make any change, though, it should probably be to the upside, for example by starting new players at 1450.
[...]
We'll see again at the end of 2010 whether my intuitions have worked out.  :)

Well, now we have data for another year.

Bot \ Year  .     .  2003  2004  2005  2006  2007  2008  2009  2010
-------------------  ----  ----  ----  ----  ----  ----  ----  ----
bot_Bomb2005Blitz .     .     .  1876  1856  1931  2038  1950  1900  
bot_Bomb2005CC    .     .     .  1774  1858  1916  1903  1876  1886
bot_Bomb2005Fast  .     .     .  1827  1826  1930  1877  1871  1878
bot_GnoBot2005Blitz     .     .  1652  1747  1841  1857  1734  1732
bot_GnoBot2005Fast.     .     .  1541  1724  1734  1734  1676  1695
bot_Arimaazilla   .  1516  1419  1449  1451  1502  1505  1433  1488
bot_Bomb2005P1    .     .     .  1488  1632  1715  1649  1542  1558
bot_Bomb2005P2    .     .     .  1752  1806  1887  1864  1787  1822
bot_GnoBot2005P1  .     .     .  1382  1262  1392  1311  1274  1316
bot_GnoBot2005P2  .     .     .  1552  1608  1651  1636  1577  1593


From this small sample, it looks like bot ratings have bounced back a little in 2010 from the lows of 2009.  I speculated that we might have to bump up the starting rating from 1400 to 1450 to combat lingering deflation, but I was wrong.  Instead there might have been slight re-inflation.  However, although 2010 was a bit above the reference year of 2005, it was still well below the inflationary peak of 2007.  One could perhaps argue for lowering the rating of newcomers to 1350, but my new gut feeling is that the ratings are close enough to stable as makes no odds.  I recommend we let it ride and measure again at the end of 2011 to see if much of anything has changed.
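
The year-by-year trend described above can be checked by averaging the table's columns for 2005 through 2010, the years in which all ten bots have data.  A quick Python sketch (the lists transcribe the table, bots in the same row order):

```python
# Gameroom ratings of the ten fixed-performance bots, one list per year,
# transcribed from the table above (Bomb2005Blitz first, GnoBot2005P2 last).
ratings = {
    2005: [1876, 1774, 1827, 1652, 1541, 1449, 1488, 1752, 1382, 1552],
    2006: [1856, 1858, 1826, 1747, 1724, 1451, 1632, 1806, 1262, 1608],
    2007: [1931, 1916, 1930, 1841, 1734, 1502, 1715, 1887, 1392, 1651],
    2008: [2038, 1903, 1877, 1857, 1734, 1505, 1649, 1864, 1311, 1636],
    2009: [1950, 1876, 1871, 1734, 1676, 1433, 1542, 1787, 1274, 1577],
    2010: [1900, 1886, 1878, 1732, 1695, 1488, 1558, 1822, 1316, 1593],
}

# Average rating of the group in each year.
averages = {year: sum(r) / len(r) for year, r in ratings.items()}
for year, avg in sorted(averages.items()):
    print(year, round(avg, 1))
```

The averages peak in 2007 (about 1750), bottom out in 2009 (1672), and end 2010 (about 1687) somewhat above the 2005 reference level (about 1629), consistent with the inflationary peak, the over-correction, and the slight bounce-back described above.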

Title: Re: Arimaa rating deflation
Post by ddyer on Feb 4th, 2011, 12:22pm

Online rating systems can only be a very rough guide to playing ability.   Learn to live with it.



Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1!
YaBB © 2000-2003. All Rights Reserved.