Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> General Discussion >> Rating points per 2x thinking time?
(Message started by: Fritzlein on Aug 23rd, 2005, 5:24pm)

Title: Rating points per 2x thinking time?
Post by Fritzlein on Aug 23rd, 2005, 5:24pm
In the "BombP3/P4" thread, in the "Arimaa Challenge Format" thread, and in the "Fast Rating / Slow Rating" thread, the same issue has indirectly come up: exactly how much better will the same software play on faster hardware?

A reasonable guess is that playing strength will be logarithmic in the thinking time.  If a bot were doing a full-width search, the depth of the search would be logarithmic in thinking time, and (given a constant evaluation function) one can suppose rating points are linear in search depth.  Some evidence from chess computing supports this hypothesis fairly strongly.

So let's take it as a hypothesis that for every doubling of thinking time, constant software will gain X rating points.  My question is how to make a reasonable measurement of X.

At first I thought of comparing Arimaazilla and Arimaazon, two bots with different thinking times and the same evaluation function.  Since Arimaazon came online, it has averaged a 1502 rating, while Arimaazilla has averaged a 1432 rating, for a difference of 70 points.  Arimaazon always thinks 60 seconds per move, while I recalled that Arimaazilla thinks 6 seconds per move, for a time factor of 10.  I calculated then that each doubling of think time is worth 70/lg(10) = 21 rating points.
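That arithmetic can be sketched in a couple of lines (the function name is mine; the ratings and the 10x time factor are the ones quoted above):

```python
import math

def points_per_doubling(rating_diff, time_factor):
    """Spread an observed rating difference evenly over the
    number of time doublings, lg(time_factor)."""
    return rating_diff / math.log2(time_factor)

# Arimaazon (60s/move) vs. Arimaazilla (recalled as 6s/move):
print(round(points_per_doubling(1502 - 1432, 10)))  # -> 21
```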

Unfortunately, this failed to account for the slowing down of the server.  Apparently nowadays Arimaazilla thinks about 30 seconds per move to get to its standard depth.  (Also perhaps the slow server rather than rating deflation explains Arimaazon's slight decline?)  If Arimaazilla thought for a fixed time as well, then they would be more comparable, but as it stands I don't see how to get useful information out of the variable data.

Alternatively, one could compare, for example, BombBlitz to BombFast, where the latter always has exactly double the thinking time.  Unfortunately that comparison will also tell us little, because the difference in time control has even more of an effect on the human than on the bot.

Another experiment I can think of would be for a bot developer to play a bot against itself many times at different speeds.  I am afraid, however, that the bots aren't sufficiently randomized for this to give an accurate reading.  In a deterministic situation, a bot that is only a little better will win every game.

Can anyone think of any reasonable way to measure the rating increase per doubled time for any bot?  (Always assuming such a thing exists...)

Title: Re: Rating points per 2x thinking time?
Post by Janzert on Aug 23rd, 2005, 5:58pm
I happened to just read a semi-related paper last night about chess program strength increase with search depth increase.

http://supertech.lcs.mit.edu/~heinz/dt/node46.html

The conclusion seems to be that strength increases linearly with search depth, which of course implies an exponential amount of work for a linear increase in strength.

Don't know how much would be directly applicable to Arimaa, but thought you might like taking a look at it. Also, they and all the studies they reference seem to use self-play.

Janzert

Title: Re: Rating points per 2x thinking time?
Post by jdb on Aug 23rd, 2005, 6:20pm
Interesting question.

I would submit that at the current stage of development in Arimaa bots, the change in strength would be very hard to separate from other, knowledge based factors.

In chess, a reasonably modern program's evaluation function 'knows' about all the major strategic themes in chess. There are no gaping 'holes' to exploit. So, a program that searches deeper will play better.

In Arimaa, as the ever more impressive bot bashing feats show, there are numerous holes in the bots' knowledge. These holes can never be fixed with more search time.

Adding search time will not make a bot any stronger against someone who knows how to exploit its strategic weaknesses. So a test to measure the change in playing strength due to search time will need to address this.

Title: Re: Rating points per 2x thinking time?
Post by Arimanator on Aug 24th, 2005, 2:00am
I believe at any rate that if we could have enough computer power to have P4 bots with a reasonable thinking time (less than 3 min. a move) we would have to greatly improve our tactics to beat them on a botbashing level.

(For the record, given that there are close to 10000 min. in a week, that would mean about 10000x what we have today.  How long will we have to wait for that, what do you think Fritzlein?)

As far as I can tell most of our tactical schemes don't hold out if you submit them to a 4 move analysis. So computer power would certainly equal better players in the long run.

Like the old space race between the Russians and the Americans: when one made a step forward, the other had to improve to keep up.

It's a shame they stopped it or we would ALL be playing golf on the moon. (kidding  ;) )

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 24th, 2005, 10:03am

on 08/23/05 at 17:58:07, Janzert wrote:
I happened to just read a semi-related paper last night about chess program strength increase with search depth increase.

http://supertech.lcs.mit.edu/~heinz/dt/node46.html


Thanks for the link to this article.  I'm pretty sure the third page of it was already linked from a previous thread, but it is interesting to see it in the context of the whole article.  I'm surprised that the authors considered it a more useful metric to measure how much a computer changes its mind as opposed to how deep it searches, but they seem to have good reasons for this.

I'm also surprised that self-play is the most widely used metric for increasing strength with increasing search depth, but now that I think about it more carefully, it makes a good deal of sense.  Different levels of a bot playing against the same human is likely to give a much less useful measure, because the human will probably win in more or less the same way until the search gets deep enough to foil that method.  So against a human it might look like the strength spikes abruptly when in fact it was increasing slowly all along.

I guess I'm implicitly clinging to two theories here:  First, a human who is a little bit better than a computer will appear to be a lot better, while a human who is a little worse than a computer will appear to be a lot worse.  There is a very abrupt transition between winning almost never and almost always.  Second, I believe flaws in strategic judgement of bots can always be covered by sufficient search depth, although it isn't clear how much depth will be necessary in advance.  For example, 5-ply might be enough for Bomb to stop mutual massacre from working, whereas for Bomb to stop the E-H smother attack 15 ply would suffice.  In practice improved evaluation is sometimes the only option, but that's just because we don't have enough computing power.

On a side note, one thing I inferred from the article is that each additional ply of search in chess takes 3 to 4 times as long as the previous.  How can this be true if there are 30 to 40 possible moves in each position?  My naive understanding is that pruning somehow cuts it down to an average of 4 moves per position that are worth seriously considering and the rest are demonstrably inferior.  Is that a reasonable way to explain why each ply doesn't take 40 times as long as the previous?

If I'm shooting near the mark, then I want to know what the corresponding number is for Arimaa.  The interesting number is not how many thousands of moves each position has, but how many of them must be taken seriously.  To put it another way, if I want to know how many times longer than the previous each ply will take in Arimaa, I can't guess based on the number of possible moves per position.  I need to know how effective pruning will be, and that in turn depends on both the nature of the game and the subtlety of the evaluation function.

JDB, would you mind testing for Clueless how many times longer it takes to search 3 ply than it takes to search 2 ply?  Or perhaps you know off the top of your head.

These thoughts are making me think that I didn't describe very accurately in the Wikipedia article why computers struggle at Arimaa.  If I figure out how to say it more precisely, I'll go back and rewrite it.

Title: Re: Rating points per 2x thinking time?
Post by 99of9 on Aug 24th, 2005, 6:25pm

on 08/24/05 at 10:03:57, Fritzlein wrote:
On a side note, one thing I inferred from the article is that each additional ply of search in chess takes 3 to 4 times as long as the previous.  How can this be true if there are 30 to 40 possibile moves in each position?  My naive understanding is that pruning somehow cuts it down to an average of 4 moves per position that are worth seriously considering and the rest are demonstrably inferior.  Is that a reasonable way to explain why each ply doesn't take 40 times as long as the previous?

If doing full "minimax" then you are right that the algorithm scales approximately as W^D where W is the average number of moves per position, and D is the depth you search to.  However, raw "alpha-beta" scales better than this: roughly W^(D/2).  So for every extra ply you can expect sqrt(W) extra time (in your chess example that would be a time factor of 5-7).

However, even what I say is oversimplified.  When you take into account transposition tables (which test for positions which have been examined before), or move ordering (choosing which move to search first from a given position greatly affects the speed of alpha-beta), this scaling changes again.  Finally, as you say, pruning is extremely important to the scaling.  Gnobot doesn't prune at all, so is the most likely to be characterised by these simple scaling rules.
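99of9's simplified scaling rules can be put in concrete numbers (a rough sketch of my own; as he says, real engines with move ordering, transposition tables and pruning behave differently):

```python
# Rough node-count scaling for a game with W legal moves per position,
# searched to depth D plies.
def minimax_nodes(W, D):
    """Full minimax visits roughly W^D nodes."""
    return W ** D

def alphabeta_nodes(W, D):
    """Best-case alpha-beta visits roughly W^(D/2) nodes."""
    return W ** (D / 2)

W = 36  # mid-range of the 30-40 estimate for chess
# Cost factor of searching one extra ply:
print(minimax_nodes(W, 5) / minimax_nodes(W, 4))              # -> 36.0
print(round(alphabeta_nodes(W, 5) / alphabeta_nodes(W, 4)))   # -> 6, i.e. sqrt(36)
```

The sqrt(36) = 6 factor falls in the 5-7 range quoted above for chess.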

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Sep 11th, 2005, 2:37am
JDB has thoughtfully provided me with a test version of Clueless so that I can get some experimental data on the value of deeper thinking via self play.

At first I intended to play 2s/move vs 4s/move, and 4s/move vs 8s/move, etc.  The way Clueless is engineered, however, it must examine each move to a depth of 4 steps before evaluating it.  If there isn't enough time to look at every move to depth 4, some moves are simply not considered at all.  Given that my computer is slow, and given that in complex situations Clueless does some automatic extra processing of even those 4-step moves, at times it takes 20 seconds for the bot to finish the initial 4-step search.

Therefore, at faster time controls, the advantage of greater time is not due to deeper search vs. shallower search so much as it is due to looking at all the moves versus looking at only some of the moves.  Since this is not at all what I wanted to investigate, I decided to set up my match between the 30s/move and 60s/move versions.

The final score is 86 to 58 in favor of the deeper-searching bot, for an advantage of 68 rating points.  This is much more than I expected.

To put the result in perspective, we need to consider the luck factor.  The statistical 95% confidence interval is plus or minus 11.53 games.  So what I should really say is that the deeper searcher is somewhere between 129 and 12 points stronger.  In other words, there is enough evidence to be statistically satisfied that the deeper searcher is in fact stronger, but there is a wide range within which the true strength difference might lie.
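The numbers in the two paragraphs above can be checked with a short script (the function name is mine; this is just the standard logistic Elo formula plus a normal approximation to the binomial confidence interval):

```python
import math

def elo_diff(wins, losses):
    """Elo difference implied by a win/loss record."""
    return 400 * math.log10(wins / losses)

games, wins = 144, 86
print(round(elo_diff(wins, games - wins)))   # -> 68

# 95% confidence: wins are binomial, so the margin is 1.96 * sqrt(n*p*(1-p)) games
p = wins / games
margin = 1.96 * math.sqrt(games * p * (1 - p))
print(round(margin, 1))                      # -> 11.5 games

print(round(elo_diff(wins + margin, games - wins - margin)))  # -> 129
print(round(elo_diff(wins - margin, games - wins + margin)))  # -> 12
```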

By the way, as a procedural footnote, the offline match script (generously provided by Omar for testing) doesn't pass the complete game state to the bots; it only passes the current position.  Therefore, even though Clueless is smart enough to avoid repetition of position, some games end that way anyway because the match script does not pass enough information.  I have decided not to count such games at all, because the player that loses on repetition isn't necessarily the player that was losing over the board.  That delays the compilation of a statistically significant amount of data still further, but I think it means the data I do collect will be more meaningful.

Title: Re: Rating points per 2x thinking time?
Post by 99of9 on Sep 11th, 2005, 7:01am
How are you avoiding always repeating the same game?   Are the setups different for each game?

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Sep 11th, 2005, 7:31am

on 09/11/05 at 07:01:58, 99of9 wrote:
How are you avoiding always repeating the same game?   Are the setups different for each game?


There is a random seed for each game; I am passing consecutive integers.  The piece setup is somewhat randomized, and I think something else is randomized too, but you would have to ask JDB what exactly that is.

Title: Re: Rating points per 2x thinking time?
Post by jdb on Sep 11th, 2005, 10:32am
The random seed you are passing in is not used.
It was too much of a pain to get that info to where it was needed.

The opening setups have some randomness to them. I also added some random adjustments to the piece square tables in the evaluation. The seed for both of these things is taken from the current system time.




Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Oct 3rd, 2005, 10:21am
OK, I think I'm ready to close the book on this experiment.  After throwing out 31 games which ended in repetition, I had 144 games I counted, 72 with the 30-second bot moving first, and 72 with the 60-second bot moving first.  The score was 58 to 86, which means the doubled thinking time is worth 68 rating points.  The 95% confidence interval is that the doubled thinking time is worth between 12 and 129 rating points, so the advantage to the deeper searcher is statistically significant, but the range is still over 100 points.  I would try to narrow it down further, but this experiment has already dragged on a while on my slow computer.

I confess that 68 points per doubling is way more than I expected.  However, we should definitely take this result with a grain of salt.  Of those 144 games I also note that the score was 82 to 62 in favor of Gold, for an advantage of 49 rating points.  Who here believes that Gold has that much of an advantage in human vs. human games?  (Come to think of it, 99of9 might: he never believed the server statistics showing an advantage of 2 or 3 rating points.)  Admittedly, the Gold advantage in this study is not as significant statistically as the 2x thinking time advantage, but it still points out a danger in extrapolating from a single study.

Despite the numerical evidence to the contrary, I still think 2x thinking time is worth less than 68 points and moving first is worth less than 49 points.  :-)

Title: Re: Rating points per 2x thinking time?
Post by jdb on Oct 3rd, 2005, 11:56am
Interesting results. One point to consider when evaluating the results is that the 30sec and the 60sec bot had identical evaluation functions.

If for example, they were both playing against Fritzlein, their records would most likely be identical, thus showing no advantage to extra thinking time.

Testing with identical evals magnifies any difference in playing strength.

Title: Re: Rating points per 2x thinking time?
Post by omar on Oct 6th, 2005, 6:38am
Thanks for these results Karl. It is very difficult to really know for sure how much improvement there is with doubling the thinking time. At these low time values, the effect of crossing over into the next ply may be having a significant effect. Comparing 1 hour of thinking vs 2 hours might show very little improvement. But of course it is not easy to collect data for such high time values; especially to have it be statistically significant.

I came across this page which discusses the results of various such experiments in chess:

http://supertech.lcs.mit.edu/~heinz/dt/node49.html


Title: Re: Rating points per 2x thinking time?
Post by Janzert on Oct 6th, 2005, 6:56am
Hehe, seems to be a popular page on the subject anyway. ;)

Janzert

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Apr 26th, 2010, 8:41pm
Tize, thanks for running your experiment on the effect of CPU doubling.  I thought I would quote it in this thread since it is a little off-topic in the Challenge Screening thread.


on 04/26/10 at 13:06:51, tize wrote:
I've let marwin play itself with different time to think to get a rough idea of the rating difference of a cpu doubling when two players have the same strategic strength.

And here's what I got:                                              
            Games  Won  Winning %  Rating per doubling
1s vs 2s       80   49         61                   79
2s vs 10s      56   44         78                   96
10s vs 20s     72   45         62                   88
15s vs 30s     88   59         67                  123
15s vs 60s     30   24         80                  120

Which means that a doubling of cpu power gives you about 100 rating points in the best case. When facing humans or other bots I assume that the rating difference is smaller.
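As a cross-check, the last column of the table can be reproduced from the game counts, taking the number of doublings to be lg of the time ratio (a sketch of my own; tize's exact rounding isn't known, so a few values come out a point off the table's):

```python
import math

def rating_per_doubling(games, wins, time_factor):
    """Elo difference of the record, spread over lg(time_factor) doublings."""
    elo = 400 * math.log10(wins / (games - wins))
    return elo / math.log2(time_factor)

# (games, wins by the longer thinker, time factor) rows from the table
rows = [(80, 49, 2), (56, 44, 5), (72, 45, 2), (88, 59, 2), (30, 24, 4)]
print([round(rating_per_doubling(*r)) for r in rows])
# -> [80, 97, 89, 123, 120], within a point of the table's 79/96/88/123/120
```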

I'm very surprised by these results, because the doubling seems to be worth more than I expected.  How much randomness was there between games?

For your row of 2s vs. 10s, I get that a winning ratio of 44/56 is worth 226 Elo points.  Divided by 3.32 doublings gives 68 points per doubling, rather than 96 given in the table.  This makes the results look even more jumpy between experiments.  Still I'm curious what's up with the high improvement per doubling overall.

Was the time fixed per move (i.e. no time management allowed over the course of the game)?  I wonder if 15s is a particularly bad cutoff point for some reason.  Or maybe I have just been underestimating the value of CPU.  :P

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Apr 26th, 2010, 8:47pm

on 04/26/10 at 15:36:05, omar wrote:
Thanks for posting this. For the last one did you mean 30s vs 60s?

I think tize meant what he said there.   An 80% win ratio is worth 240 Elo points, so at two doublings in thinking time, it comes out to 120 points per doubling.  But just in case 15s is a bad cutoff for whatever reason, I'd like to see the experiment repeated with 30s vs. 60s compared directly.  That is the time control I used in my old experiment, although I can't remember why, and it was on such a slow computer I imagine it would be something like 3s vs 6s on tize's present computer.  :)


Quote:
I am surprised it is gaining 100 points per doubling. In chess they say it is about 50 to 70 elo points per doubling.

Really?  That is less than I had heard.  What is the source of that figure?

Title: Re: Rating points per 2x thinking time?
Post by tize on Apr 26th, 2010, 10:18pm

on 04/26/10 at 20:41:22, Fritzlein wrote:
I'm very surprised by these results, because the doubling seems to be worth more than I expected.  How much randomness was there between games?

For your row of 2s vs. 10s, I get that a winning ratio of 44/56 is worth 226 Elo points.  Divided by 3.32 doublings gives 68 points per doubling, rather than 96 given in the table.  This makes the results look even more jumpy between experiments.  Still I'm curious what's up with the high improvement per doubling overall.

Was the time fixed per move (i.e. no time management allowed over the course of the game)?  I wonder if 15s is a particularly bad cutoff point for some reason.  Or maybe I have just been underestimating the value of CPU.  :P


I'm also very surprised by the high numbers, but I still think that when facing humans it's a lot less. I have no data to back that up though.

Marwin has four different setups which are chosen randomly (but not uniformly), and then in the game it can pick any move that scores higher than highest_score - 0.12. The highest scoring move has the highest probability and the one at highest - 0.12 has the lowest probability of being picked.
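The selection rule tize describes might look something like this (a hypothetical sketch: only the 0.12 window and the "higher score, higher probability" behaviour come from the post; the linear weighting and all names are my guesses):

```python
import random

def pick_move(scored_moves, window=0.12):
    """Pick among all moves scoring within `window` of the best move,
    giving higher-scoring moves a higher chance of being chosen."""
    best = max(score for _, score in scored_moves)
    candidates = [(m, s) for m, s in scored_moves if s >= best - window]
    # Weight grows from ~0 at (best - window) up to `window` at the best move,
    # so the move at the bottom of the window is picked least often.
    weights = [s - (best - window) + 1e-9 for _, s in candidates]
    return random.choices([m for m, _ in candidates], weights=weights)[0]

# Toy position: the third move falls outside the 0.12 window,
# so only the first two can ever be chosen.
move = pick_move([("Ed4n Ed5n Ed6n Ed7n", 1.00),
                  ("Ed4n Ed5n Md2n Md3n", 0.95),
                  ("hc5s hc4s Rb2n Ra2n", 0.80)])
```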

I haven't checked most of these games, but the ones that I have checked have been slowly drifting away from each other.

I agree that 44/56 is worth 226 points, but going from 2s to 10s is only a 5 times increase which is 2.32 doublings.

Yes, the games were played without any time management, just a fixed time for every move. And they played each colour equally many times.

I will try out 30s vs 60s just to see what that gives us, but it will take some time.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Apr 27th, 2010, 12:01am
In chess a doubling of CPU power is really significant, since the effective branching factor  (i.e. the time taken for n+1 plies divided by the time for n plies) is around 2 from what I've read.

Regarding this particular experiment, as tize said making a bot play against itself will exaggerate the strength increase since it's definite that the bot with more thinking time will see everything the other one did, and more.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Apr 27th, 2010, 9:20am

on 04/26/10 at 22:18:43, tize wrote:
I agree that 44/56 is worth 226 points, but going from 2s to 10s is only a 5 times increase which is 2.32 doublings.

Oops, you are right.  I double-checked my math before I posted, but I was using the wrong equation!  :-[

Quote:
Marwin has four different setups which are chosen randomly (but not uniformly), and then in the game it can pick any move that scores higher than highest_score - 0.12. The highest scoring move has the highest probability and the one at highest - 0.12 has the lowest probability of being picked.

I haven't checked most of these games, but the ones that I have checked have been slowly drifting away from each other.

OK, that intuitively seems like a fair bit of randomness, but I witnessed a heated debate in a computer chess forum caused by correlation in game results from the same starting position between bots that everyone was swearing were random.  There was divergence in both moves and outcomes, so the games looked random to the naked eye, but when someone actually set out to measure correlation, it was there.  For starters, I would be more comfortable with uniformly choosing among aaaa's 192 random setups rather than non-uniformly among just four.


Quote:
I'm also very surprised by the high numbers, but I still think that when facing humans it's a lot less. I have no data to back that up though.

One thing that jdb mentioned earlier is that it may exaggerate the differences in playing strength when both bots use the same evaluation function.  So perhaps the ideal test is to have a pool of bots, each with a slower and faster version, play a round-robin against each other, and see how the faster bots as a group do against the slower bots as a group.

Probably setting up a pool of opponents is more important than fine-tuning the self-play experiment.  I'm questioning your results from self-play, but it could well be that very long and careful experiments would merely validate the result of about 100 points per doubling.  Perhaps it is more important to question the usefulness of self-play at all, as rbarreira has done.


Quote:
I will try out 30s vs 60s just to see what that gives us, but it will take some time.

Yes, I remember that experiment taking quite some time when I ran it.  Looking back in this thread, I see why I had to choose such a slow time control, i.e. the bot was not even finishing one ply plus extensions at 15s.  I didn't want to compare the strength of examining X% of root moves to the strength of examining 2X% of root moves; I wanted to compare the strength of shallower search to deeper search.  In marwin's case, however, I don't see why there would be a qualitative difference between 15s and 30s, only the quantitative difference which is what we are trying to measure.  So feel free to abandon the 30s vs. 60s experiment if it seems to be taking too long.

Thanks again for running this experiment and sharing your results!

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Apr 27th, 2010, 9:22am

on 04/27/10 at 00:01:58, rbarreira wrote:
In chess a doubling of CPU power is really significant, since the effective branching factor  (i.e. the time taken for n+1 plies divided by the time for n plies) is around 2 from what I've read.

Ah, this jogs my memory a little.  Perhaps I am recalling from chess that each ply is worth about 100 Elo points, but a doubling of CPU speed actually results in somewhat less than a ply of search depth, thus Omar's figure of 50 to 70 Elo per doubling.

Title: Re: Rating points per 2x thinking time?
Post by jdb on Apr 27th, 2010, 11:02am
The increase in thinking time results in an increase in tactical strength. The log files for clueless from the postal tournament show this effect pretty clearly. With more time, it makes tactically better moves. The extra time does not result in any emergent strategic knowledge.


Title: Re: Rating points per 2x thinking time?
Post by omar on Apr 27th, 2010, 5:54pm

Quote:
Really?  That is less than I had heard.  What is the source of that figure?


Found it on Wikipedia:
 http://en.wikipedia.org/wiki/Computer_chess#Playing_strength_versus_computer_speed

It is based on the findings of Levy and Newborn.

Title: Re: Rating points per 2x thinking time?
Post by tize on Apr 28th, 2010, 9:25pm

Quote:
Regarding this particular experiment, as tize said making a bot play against itself will exaggerate the strength increase since it's definite that the bot with more thinking time will see everything the other one did, and more.

Both the one thinking long and the one thinking short will see things the other one didn't. A doubling of cpu power gives slightly less than 1 step deeper search, and the next player starts searching 4 steps further into the game. That means the one thinking more sees about 5 steps deeper, and the one thinking less about 3 steps deeper, than what the opponent did. So both players are up for surprises.


Quote:
OK, that intuitively seems like a fair bit of randomness

Maybe it sounds more random than it is. This is just the normal way for marwin to play; I haven't really changed or added anything random for this test.


Quote:
I'm questioning your results from self-play

Good :), I do that as well, but it was an easy experiment to do and it seemed like a fun thing to try while letting marwin run test games for other things.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on May 27th, 2010, 7:04am
I wonder if it would be more convincing if we used two different bots instead of self play.

By that I mean comparing bot 1 vs bot 2, with bot 1 or bot 2 having half the time.

Even more interesting would be to see the difference against humans, but that gets quite a bit messier with psychological influence, and perhaps different people playing against each version...

edit - however we could eliminate the psychological influence by not letting people know which bot is thinking half the time (and waiting the other half out in order to appear the same)

Title: Re: Rating points per 2x thinking time?
Post by tize on Jul 31st, 2010, 6:00am

on 04/26/10 at 22:18:43, tize wrote:
I will try out 30s vs 60s just to see what that gives us, but it will take some time.


It really took a long time...

The 60s version won 68 out of 108 games against the 30s version, which is a 63% winning rate and gives a 92 rating difference.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Jul 31st, 2010, 9:31am

on 07/31/10 at 06:00:09, tize wrote:
The 60s version won 68 out of 108 games against the 30s version, which is a 63% winning rate and gives a 92 rating difference.

That is really quite remarkable.  It is much higher than my expectations.  On the other hand, my expectations are based on such scanty data that they are probably wrong.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Jul 31st, 2010, 9:56am
It doesn't seem outrageous considering that it's self play and that the normal expectation would be around 70 rating points.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Jul 31st, 2010, 10:04am

on 08/23/05 at 18:20:53, jdb wrote:
In Arimaa, as the ever more impressive bot bashing feats show, there are numerous holes in the bot's knowledge. These holes can never be fixed with more search time.

Adding search time will not make a bot any stronger against someone who knows how to expoit its strategic weaknesses. So a test to measure the change in playing strength due to search time will need to address this.


I don't think this is necessarily correct, at least it may be a bit misleading. Chess programmers have found that even a very stupid evaluation function (a random number generator) gives better results with more search.

It seems impossible on the surface, but there's actually a very good reason for it, which the author of Crafty explains in this post:

http://www.talkchess.com/forum/viewtopic.php?p=360985#360985

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Jul 31st, 2010, 12:35pm
I'd wager the random evaluation works less well in Arimaa than in chess, because mobility is so much less important in Arimaa.

I agree with the general point that extra search depth helps even a weak evaluation function.  Eventually every strategic weakness can be compensated by searching deeply enough.  The question is just how long "eventually" is.  Or, to put it in more concrete terms, how many rating points do you gain per doubling in thinking time.

My expectation is that the weaker an evaluation function, the less each doubling of search gains you.  Thus the fact that marwin is gaining so much per doubling might suggest that marwin's strategic grasp is relatively strong.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Jul 31st, 2010, 1:28pm
I will try the same experiment with my bot (which certainly doesn't have a good grasp of strategy, or even tactics for that matter), maybe that will help answer some questions.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Jul 31st, 2010, 4:10pm

on 07/31/10 at 13:28:15, rbarreira wrote:
I will try the same experiment with my bot (which certainly doesn't have a good grasp of strategy, or even tactics for that matter), maybe that will help answer some questions.

Great, I'm curious about the result.

Title: Re: Rating points per 2x thinking time?
Post by tize on Aug 1st, 2010, 2:26am

on 07/31/10 at 12:35:00, Fritzlein wrote:
My expectation is that the weaker an evaluation function, the less each doubling of search gains you.  Thus the fact that marwin is gaining so much per doubling might suggest that marwin's strategic grasp is relatively strong.


If your expectation is correct then one could let the bot play a few hundred selfplay games after big strategic changes in the evaluation to see if the new bot has a better strategic understanding of the game.  :)

Of course it might be better just to let the new version play the old version...

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 1st, 2010, 3:35am
Soon after I wrote that post yesterday I started the test with bot_briareus.

The time control was 5s for one instance and 10s for the other one (no time management whatsoever).

When I woke up this morning, 114 games had finished and the 10s version had won 80 of them. That's a 70% winning rate, or a 149 elo advantage!

I will continue to let it run until the 200 games I had scheduled. And I can try 30s vs 60s too, though that will take a few days...

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 1st, 2010, 6:37am
The winning percentage is holding pretty constant at 70%. However I have looked at some of the games and I don't think there's enough randomness, with only one setup and many of the openings common between different games.

I was already thinking of introducing some randomness, and this is a good excuse to do it.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 1st, 2010, 12:00pm
Wow, so briareus gains even more elo points per doubling than marwin!  But yes, randomness is very important.  I was involved in a chess experiment in which the similarity between games was not even apparent to the naked eye, yet the correlation was nonetheless present and significant.  If you can actually see that the games are similar by inspection, then surely randomness is insufficient.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 2nd, 2010, 1:43pm
After introducing some randomness to the moves (not the setups), here are the results:

- the 10s version won 113 games, the 5s version won 87 games. Bayeselo gives the following output:

Rank Name        Elo    +    - games score oppo. draws
  1 Briareus1    28   26   22   200   57%   -28    0%
  2 Briareus2   -28   24   23   200   44%    28    0%

The actual games can be seen here, if anyone wants to take a quick peek at how different the games are (I didn't look at it yet):

http://mmfactor.net/arimaa/result_10s_vs_5s.pgn

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 2nd, 2010, 5:50pm
Ah, so down to just 56 points for the doubling with more randomness.  Would you expect a similar result with completely random setups, but otherwise deterministic play?  (Doubling the nodes instead of the time so that clock fluctuations don't break determinism.)

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 3rd, 2010, 2:22am
Right now there are three sources of randomness:

- Shuffling the root move list
- Multithreading
- Clock fluctuations as you said

All of these sources of randomness can be removed, but I'm not sure I get the point. If each game uses completely random setups, what's the objective of removing the rest of the randomness?
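
The first source of randomness in the list above, shuffling the root move list, can be sketched like this (the function name and move representation are hypothetical):

```python
import random

def shuffled_root_moves(moves, seed=None):
    """Return a copy of the root move list in random order, so that
    equally-scored moves are searched (and thus selected) in a
    different order from game to game."""
    rng = random.Random(seed)
    moves = list(moves)
    rng.shuffle(moves)
    return moves
```

With alpha-beta, ties at the root are broken by search order, so shuffling the root moves is enough to diversify otherwise deterministic games.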

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 12:45pm

on 07/31/10 at 12:35:00, Fritzlein wrote:
I'd wager the random evaluation works less well in Arimaa than in chess, because mobility is so much less important in Arimaa.


I'm curious why you'd say that.  Mobility is a good proxy for material advantage and for not having frozen pieces, so I'm not sure why it'd be less important in Arimaa.

Also, I strongly suspect the value of evaluating a ply decreases as you go deeper, and probably decreases exponentially (i.e., less and less).  The jump from 0 ply to 1 ply would surely result in a 100% winning record for the 1-ply bot.  The jump from 1 to 2 is probably pretty close to the same.  But as you go from ply 30 to 31, I would suspect the difference becomes fairly small and would be more a matter of the eval function.

And so, testing 30s vs 60s wouldn't reveal the same difference as testing 30m vs 60m.  I think Omar was getting at this idea earlier in this conversation.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 3rd, 2010, 1:12pm

on 08/03/10 at 12:45:55, speek wrote:
Also, I strongly suspect the value of evaluating a ply decreases as you go deeper, and probably decreases exponentially (i.e., less and less).  The jump from 0 ply to 1 ply would surely result in a 100% winning record for the 1-ply bot.  The jump from 1 to 2 is probably pretty close to the same.  But as you go from ply 30 to 31, I would suspect the difference becomes fairly small and would be more a matter of the eval function.


That depends on who the bot is playing against.  If it's playing against a bot which also searches very deep, going one ply deeper must make a huge difference.  If it's against a much weaker opponent, it probably wouldn't make a difference, simply because N and N+1 plies are both enough to clearly dominate.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 2:00pm
I can only see it being a huge difference if it can find a 31-move forcing combo that is unavoidable.  Or, in other words, how often does searching the 31st ply cause the bot to change its mind vs how often searching the 2nd ply causes it to change its mind?

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 2:02pm

on 08/03/10 at 12:45:55, speek wrote:
I'm curious why you'd say that.  Mobility is a good proxy for material advantage and for not having frozen pieces, so I'm not sure why it'd be less important in Arimaa.

Janzert did an investigation into mobility.  You can get mobility graphs of old games here: http://arimaa.janzert.com/gamegraphs

It turns out that much of "mobility" comes from not having pieces bunched together.  Something like 85% of potential moves consist of moving four different pieces one step each.  Therefore a good first approximation of the mobility of each player is (available single steps)^4/4!.  The divisor accounts for different step orderings that result in the same move.

To increase your mobility, you should get your pieces off the back ranks and edges to maximize the number of potential single steps that each piece can take.  For chess, getting pieces off the back ranks and edges is a generally useful thing to do; not so for Arimaa.
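
Fritzlein's first approximation above can be written out as a small sketch (the formula is from the post; the function name is mine):

```python
from math import factorial

def approx_mobility(single_steps):
    """Approximate a player's count of distinct four-step moves as
    combinations of four independent single steps, divided by 4! to
    collapse the step orderings that produce the same move."""
    return single_steps ** 4 / factorial(4)

# e.g. a player with 20 available single steps has on the order of
# 20**4 / 24, roughly 6700 distinct four-step moves by this estimate
```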

When we used Janzert's tool to study mobility, we discovered that the winning side was usually ahead in mobility in the latter half of the game, reflecting having more pieces left, but often trailed in mobility in the early part of the game, even if they were winning wire to wire.

Arimaazilla is a good example of a bot that naturally maximizes its mobility because it likes to charge forward.  Have a look at my first-ever game against Arimaazilla:
http://arimaa.com/arimaa/games/jsShowGame.cgi?gid=7600&s=w

You can see how Arimaazilla is getting itself into trouble by charging rabbits forward, which I can hostage, frame, and eventually capture.  Now, when do you think mobility as an evaluation function will catch on that I am winning?  How about on move 18, by which time I have captured three rabbits for free, framed a fourth, and taken a fifth hostage?  Nope.  As you can see here:
http://arimaa.janzert.com/gamegraphs/?gameid=7600

Arimaazilla still has more possible moves than I do at that point, because its remaining (non-captured, non-frozen) pieces are spread out over the board, whereas most of my full set of pieces are still huddled on my home ranks.

When Janzert first made the mobility graphs available, I was excited to use it as a tool for gaining insight into what separates a good position from a bad one.  Alas, before long I gave up the tool as essentially useless.  It seemed to mostly measure dispersion of pieces, which doesn't map to anything strategically significant.


Quote:
Also, I strongly suspect the value of evaluating a ply decreases as you go deeper, and probably decreases exponentially (i.e., less and less).  The jump from 0 ply to 1 ply would surely result in a 100% winning record for the 1-ply bot.  The jump from 1 to 2 is probably pretty close to the same.  But as you go from ply 30 to 31, I would suspect the difference becomes fairly small and would be more a matter of the eval function.

Here again, your intuition is a common one, and one that I shared before encountering the evidence.  For chess as well, many skeptics asserted that increasing computer power would not be sufficient to unseat the World Champion, because the gain in Elo points would not be linear in search depth.  They speculated that each additional ply of depth would provide less increase in strength than the previous.  That sounded reasonable to me, but in the case of chess, it was wrong.  All the evidence points to a linear relationship.

Given society's experience with computer chess, there is a greater burden of proof on Arimaa skeptics.  If we want to say that the fourth ply of depth in Arimaa is worth fewer rating points than the third ply, we need to both make a plausibility argument and explain why Arimaa is different from chess.

I personally see no reason that Arimaa won't be just like chess insofar as log(search time) plotted against rating will be linear.  We certainly haven't proved this, but that's my strong expectation.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 2:11pm

on 08/03/10 at 14:00:00, speek wrote:
Or, in other words, how often does searching the 31st ply cause the bot to change its mind vs how often searching the 2nd ply causes it to change its mind?

You have hit on the right question.  Interestingly, for chess engines, the answer appears to be that a given engine is equally likely to change its mind on every ply.  I know it is non-intuitive, but an investigation that reached this conclusion is linked earlier in this thread:
http://supertech.lcs.mit.edu/~heinz/dt/node58.html
http://supertech.lcs.mit.edu/~heinz/dt/

'Surprisingly enough, the experiments do not provide any conclusive empirical evidence for the intuitive notion that the "Best Change" rates taper off continuously with increasing search depths.  They rather remained range-bound within 15%-17% for most of the deep searches as investigated in the experiments.'

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 3rd, 2010, 2:15pm

on 08/03/10 at 14:00:00, speek wrote:
I can only see it being a huge difference if it can find a 31-move forcing combo that is unavoidable.  Or, in other words, how often does searching the 31st ply cause the bot to change its mind vs how often searching the 2nd ply causes it to change its mind?


Endgame tablebases both in chess and in Arimaa tell us that there are plenty of long forced combinations.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 2:18pm

on 08/03/10 at 14:02:27, Fritzlein wrote:
Janzert did an investigation into mobility.  You can get mobility graphs of old games here: http://arimaa.janzert.com/gamegraphs

...

 Alas, before long I gave up the tool as essentially useless.  It seemed to mostly measure dispersion of pieces, which doesn't map to anything strategically significant.


Interesting (and thanks!).  I wonder if it would be possible to distinguish "useful" mobility from "un-useful" mobility.  For instance, having an elephant that can't move is surely a problem, as compared to having a rabbit that can't move, which isn't a big deal.  As a basic approach, only count non-rabbit moves, or only count push/pull moves.  I'm not ready to give up on some kind of mobility measure.


on 08/03/10 at 14:02:27, Fritzlein wrote:
For chess as well, many skeptics asserted that increasing computer power would not be sufficient to unseat the World Champion, because the gain in Elo points would not be linear in search depth.  They speculated that each additional ply of depth would provide less increase in strength than the previous.  That sounded reasonable to me, but in the case of chess, it was wrong.  All the evidence points to a linear relationship.


But then we're comparing chess engines which were already 7-8 ply deep, and the appearance of a linear relationship could hold for a few ply.  Over a short enough range, an exponential function can look perfectly linear, especially with noisy data.  But with Arimaa, we're not even getting to 3-ply on 30 or 60s tests, right?  I feel my intuition has yet to be disproven.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 2:32pm

on 08/03/10 at 14:18:20, speek wrote:
But then we're comparing chess engines which were already 7-8 ply deep, and the appearance of a linear relationship could hold for a few ply.  Over a short enough range, an exponential function can look perfectly linear, especially with noisy data.  But with Arimaa, we're not even getting to 3-ply on 30 or 60s tests, right?  I feel my intuition has yet to be disproven.

OK, that's a good point about shallow search versus deep search.  And I did specifically say that we haven't proven the relationship for Arimaa.  I just think that your intuition has a greater burden of proof now than you would have had prior to the example of chess.

Is there some reason to expect Arimaa to be different from chess?  For the past fifty years we have seen log(computing power) increase linearly, and we have also seen the rating of the best chess computer increase linearly.  Coincidence?

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 3rd, 2010, 2:35pm
Thinking a bit more about my point on endgame tablebases, is there a histogram somewhere of how many forced combinations of length X there are? (for chess at least)

It would be interesting to see how the number grows. If it grows exponentially with X that's a strong explanation as to why an extra ply of search keeps being important in chess.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 2:40pm
http://people.csail.mit.edu/heinz/dt/node53.html

I see exponential decrease over 2-14 ply.  I bet it gets more pronounced if 0 and 1 ply are included.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 2:48pm

on 08/03/10 at 14:40:19, speek wrote:
http://people.csail.mit.edu/heinz/dt/node53.html

I see exponential decrease over 2-14 ply.  I bet it gets more pronounced if 0 and 1 ply are included.

Oh, my mistake.  I thought that table was showing only new best moves, i.e. not simply the rate at which the computer changed its mind about the best move.  Of course, if you only count new moves, there will have to be a tapering off, because most of the plausible moves will already have been considered by the time you reach depth 14.  But now I see that this table includes every time the computer changes its mind.  Given that, I'm now surprised that the researchers reached a conclusion that there was no tapering off!   :o

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 2:56pm

on 08/03/10 at 14:40:19, speek wrote:
more pronounced if 0 and 1 ply are included.

By the way, there is no such thing as a 0-ply search.  If you have so much as looked at each move, you have done a 1-ply search.  The table also can't include 1-ply searches, because it is measuring how often the bot changes its mind.  Since 1-ply is the shallowest search that can be completed, there is nothing that a 1-ply search could be changing its mind from.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 3:03pm

on 08/03/10 at 14:56:45, Fritzlein wrote:
By the way, there is no such thing as a 0-ply search.  [...] there is nothing that a 1-ply search could be changing its mind from.


Another way to look at it is that it is guaranteed to change its mind :-)  This way of seeing it preserves the exponential decrease that I expect.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 3:04pm
Given that I apparently read the table incorrectly the first time, I have to re-think what it means that a computer changes its mind less and less for each ply of greater search depth.  At first blush, it doesn't destroy the linear relationship between log(search time) and Elo rating.  All it does is destroy an explanation for that observed relationship, correct?

If so, there is still an observed chess phenomenon that runs counter to your intuitions.  Would you agree?

Title: Re: Rating points per 2x thinking time?
Post by jdb on Aug 3rd, 2010, 4:01pm
If the computer can get enough search depth that an extra ply of search makes no difference, then the game is not worth playing anymore.

I think the Heinz paper is saying that the "Best Change" rate decays initially, but levels out around 16% once the search gets deep enough.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 3rd, 2010, 4:06pm

on 08/03/10 at 16:01:56, jdb wrote:
If the computer can get enough search depth that an extra ply of search makes no difference, then the game is not worth playing anymore.


I think that's a good way to put it, and it's where I wanted to get with the tablebases example... One extra ply of depth means you can either know what your opponent will do for one more ply, or plan what you yourself will do one ply further ahead.

Why would either of those two things not be a great advantage?

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 4:16pm

on 08/03/10 at 16:01:56, jdb wrote:
I think the Heinz paper is saying that the "Best Change" rate decays initially, but levels out around 16% once the search gets deep enough.


They would have no basis for such an argument, as they have only 3 ply that are suggestive of a leveling out.  If it's a tapering, you might have to go out to 20 ply or more to see it as a curve rather than as a flat line.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 4:17pm

on 08/03/10 at 16:01:56, jdb wrote:
If the computer can get enough search depth that an extra ply of search makes no difference, then the game is not worth playing anymore.


The question isn't between no difference/difference, it's between linear/exponential.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 4:32pm

on 08/03/10 at 15:04:37, Fritzlein wrote:
If so, there is still an observed chess phenomenon that runs counter to your intuitions.  Would you agree?


Improvement of chess engines over the last 10-20 years hasn't really reached all the way back to reflecting only 1, 2, or 3 ply performance, so it's hard to tell, I think.  So, this observation you refer to doesn't seem very rigorous.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 4:52pm

on 08/03/10 at 16:32:49, speek wrote:
Improvement of chess engines over the last 10-20 years hasn't really reached all the way back to reflecting only 1, 2, or 3 ply performance, so it's hard to tell, I think.  So, this observation you refer to doesn't seem very rigorous.

The straight line I refer to doesn't go back ten or twenty years, but rather at least forty.  The last forty years, 1970-2010 have seen approximately twenty doublings of computer power.  Are you saying that even if the graph is a straight line across twenty doublings, that might be explained by looking at too small a part of the graph?  That if we were able to look at the very beginnings and also look at the future, we would see it is really a curved line shooting up at first and tapering off in the future?  The evidence might not be "very rigorous", but the counter that any small bit of a curve appears to be a straight line is a weak argument applied across forty years of history.

I suppose that one could alternatively argue that the usefulness of increasing computer power has been declining for a long time, but chess programmers have been improving their software at an accelerating rate.  Thus increasing rating by 40 points per year might have been 35 points due to faster hardware and 5 points due to better software in 1970, but by now it is 25 points due to faster hardware and 15 points due to better software, or something like that.


Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 5:02pm

on 08/03/10 at 16:52:37, Fritzlein wrote:
The straight line I refer to doesn't go back ten or twenty years, but rather at least forty.  


Do you have data on computer chess ratings on a standard base of equipment (ie, $1000 worth of computer equipment at the time) for those 40 years?  I don't know this data, but would like to see it.

Also, if it is inclusive of both software changes (ie comparing Rybka today with Sargon of 1980) and hardware changes, then isn't the evidence that software + hardware advancements = linear improvement?  And then we'd have a hard time trying to disentangle the two.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 3rd, 2010, 5:44pm
http://www.transhumanist.com/volume1/chess_plot_075.jpg

It's not charted by dollar value, but supposedly it's charted by MIPS/positions per second.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 5:44pm

on 08/03/10 at 17:02:23, speek wrote:
Do you have data on computer chess ratings on a standard base of equipment (ie, $1000 worth of computer equipment at the time) for those 40 years?  I don't know this data, but would like to see it.

I'm thinking of a graph that Omar presented at a talk of his that I attended.  I'll try to hit him up for the graph and the methodology behind it.  Clearly I have talked enough from memory and need to inject some hard source data into the debate.

Let me point out, however, that the other side of the debate (that doubled thinking time has a diminishing improvement in playing skill as thinking time lengthens) doesn't have any hard data either.  The observed tapering off from the MIT research was a tapering off in changing the preferred move rather than a tapering off in playing improvement.


Quote:
Also, if it is inclusive of both software changes (ie comparing Rybka today with Sargon of 1980) and hardware changes, then isn't the evidence that software + hardware advancements = linear improvement?  And then we'd have a hard time trying to disentangle the two.

Yes, the graph I'm thinking of includes both software and hardware advances, so it would indeed be difficult to disentangle the two.  My personal expectation would be that software advances are more likely to cause sharp jumps (i.e. hundreds of Elo points) whereas hardware advances have been steadier, and thus should have made a steadier contribution.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 6:27pm
Well, eyeballing isn't a nice way to do this, but by that chart, it looks like a curve to me.  Also, it only goes to 1997, with Deep Blue achieving a rating of 2750.  Rybka now, 13 years later, appears to achieve a rating around 3200.  An Intel i7 has a rating of around 150,000 MIPS.  Maybe someone else can do the math, heh.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 7:09pm

on 08/03/10 at 18:27:56, speek wrote:
Well, eyeballing isn't a nice way to do this, but by that chart, it looks like a curve to me.

Yep, looks like a curve to my eyeballs too.  Also I like the fact that it isn't rating versus year, like the graph I remember, because the real issue is the value of computing power, not the value of time.  On the other hand, there is something screwy about the horizontal axis.  Is it search depth, positions per second, or MIPS?  Those three are not entirely convertible.  Computers can, for example, use heavier or lighter evaluation, or more or less pruning, which affects the conversion between these variables.


Quote:
Also, it only goes to 1997, with Deep Blue achieving a rating of 2750.  Rybka now, 13 years later, appears to achieve a rating around 3200.  An Intel I7 has a MIPS rating of around 150,000 MIPS.  Maybe someone else can do the math, heh.

Yeah, a chart that stops at Deep Blue is going to look more like tailing off than a chart that includes Rybka.  If you compare, not just performance, but performance per unit computational power, Rybka makes Deep Blue look like a waste of silicon.  But that once again brings the value of software improvements into the picture, which again clouds the value of extra hardware.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 3rd, 2010, 7:22pm
Rybka definitely represents some software improvements, as it  has been tops in the world for quite some time.  It also brought some new ideas to the evaluation table, so to speak.

*sigh*.  Perhaps I can back up a bit and argue that the jump from 1-ply to 2-ply is a doozy, in terms of the playing performance of a minimax algorithm.  The rest may well be pretty equal.

Title: Re: Rating points per 2x thinking time?
Post by jdb on Aug 3rd, 2010, 7:30pm

on 08/03/10 at 14:15:53, rbarreira wrote:
Endgame tablebases both in chess and in Arimaa tell us that there are plenty of long forced combinations.


I am confused. Someone please feel free to correct my ponderings.

The concept of "Best Change Rate" and "amount of improvement with an extra ply of search" are highly correlated. As long as "Best Change Rate" is greater than zero, an extra ply of search will help.  In order to show the "Best Change Rate" is greater than zero at a certain depth, we need to find some tactic that requires that depth. As already pointed out both chess and arimaa have many deep combinations. So it will take a very deep search before "Best Change Rate" reaches zero.

It might be more effective to analyze some chess endgame tablebases and see what percentage of positions at a certain distance to mate have a unique winning move.  (To my mind, that is a halfway decent definition of a tactic.)  There should be a way to massage that into an estimate of "Best Change Rate".

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 7:45pm
Another comment on the curved graph of performance versus computing power: the point apart from Deep Blue that makes it look most like a curve is Bernstein's 1957 program.  I can understand the allure of having the first chess-playing computer be the first data point, but how was the rating of 800 determined?  In 1957 the Elo system had not yet been invented.  The USCF used Harkness ratings, and if I am not mistaken, there were few if any club players with such a low rating.  USCF ratings below 1000 didn't become common until after the explosion of scholastic chess.  So who did that computer play to give us an estimate of its skill?  Is it just a wild guess?

Perhaps the algorithm was published and then re-implemented later as part of a historical investigation into how strong it would have been.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 7:57pm

on 08/03/10 at 19:22:11, speek wrote:
*sigh*.  Perhaps I can back up a bit and argue that the jump from 1-ply to 2-ply is a doozy, in terms of the playing performance of a minimax algorithm.  The rest may well be pretty equal.

Heh.  I was just going to back up and say that the "proven" linear relationship was not so proven as I thought.  Rybka certainly argues for the importance of software improvements.  Also there was a huge software advance between 1957 and 1958, namely alpha-beta pruning.  What reason do I have to believe that software advances have been constant?  So even if a linear relationship holds between calendar year and chess computer strength, or between MIPS and chess computer strength for improving algorithms, it is still far from showing a linear relationship between MIPS and chess computer strength for a constant algorithm.

Besides which, rbarreira's graph of Elo versus MIPS looks curved.

Besides which, Arimaa might be different from chess.  :)

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 3rd, 2010, 8:10pm

on 08/03/10 at 19:30:05, jdb wrote:
The concept of "Best Change Rate" and "amount of improvement with an extra ply of search" are highly correlated.

That, indeed, was Newborn's hypothesis as explained in the linked paper.  But now that speek has induced me to re-examine the paper with a more critical eye, I am at a loss to understand how the research presented there supports Newborn's hypothesis.  Nowhere do the researchers directly measure the strength of Dark Thought when searching to various fixed ply.  With no direct measurement of playing strength, how can they conclude that it correlates to a bot changing its mind, or discovering new best moves, or anything at all?


Quote:
As long as "Best Change Rate" is greater than zero, an extra ply of search will help.

Yes, I don't think anyone believes that extra ply are no help, unless the computer is already playing perfectly.  The issue is whether each ply of additional depth is equally helpful.


Quote:
It might be more effective to analyze some chess endgame tablebases and see what percentage of positions at a certain distance to mate have  a unique winning move. ( To my mind, that is a halfways decent definition of a tactic). There should be a way to massage that into an estimate of "Best Change Rate".

From the Wikipedia article on endgame tablebases: "White to move has a forced win, starting with 1. Rg1+ (the only winning move). White wins a rook on move 290. Sixty-eight of the moves are the only move which preserves the win."  So 68/290 in that particular winning sequence were unique best moves, some of them at huge depth.
http://en.wikipedia.org/wiki/Chess_endgame#Longest_forced_win

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 4th, 2010, 2:15am

on 08/03/10 at 19:09:11, Fritzlein wrote:
Yeah, a chart that stops at Deep Blue is going to look more like tailing off than a chart that includes Rybka.  If you compare, not just performance, but performance per unit computational power, Rybka makes Deep Blue look like a waste of silicon.  But that once again brings the value of software improvements into the picture, which again clouds the value of extra hardware.


Deep Blue wasted a lot of silicon not because its software was bad, but because it was massively parallelized (lots of slow processors loosely connected together), which meant the different processors were doing a lot of repeated or wasted work.  So I am doubtful of calling Rybka vs Deep Blue a software improvement; it's a hardware improvement, in that nowadays we have powerful shared-memory architectures like the ones Rybka runs on (i.e. multicore CPUs): parallel but still able to communicate at high speed, which means sharing transposition tables and in all ways working closely together in order to use the silicon to the max.

Not to say the software didn't improve at all... It's just that according to what I've read, the last big software improvements were null move pruning (early 90s) and late-move-reductions (2000s?) which both together give a bit more than 100 elo.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 4th, 2010, 2:30am

on 08/03/10 at 20:10:32, Fritzlein wrote:
 With no direct measurement of playing strength, how can they conclude that it correlates to the a bot changing its mind, or discovering new best moves, or anything at all?


To me it seems rather obvious that changing its mind at a certain ply should correlate directly with the value of searching that ply.  For the simple reason that if you don't change your mind, you can't possibly gain anything from doing the search.

I may be engaging in some fallacy though...

Title: Re: Rating points per 2x thinking time?
Post by Janzert on Aug 4th, 2010, 6:57am

on 08/04/10 at 02:30:30, rbarreira wrote:
To me it seems rather obvious that changing its mind at a certain ply should correlate directly with the value of searching that ply.  For the simple reason that if you don't change your mind, you can't possibly gain anything from doing the search.

I may be engaging in some fallacy though...


If a bot never changes its mind after searching another ply it can't of course be playing any better.  But just because it changes its mind is not sufficient to show that it is playing better.

Janzert

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 4th, 2010, 7:08am

on 08/04/10 at 06:57:12, Janzert wrote:
If a bot never changes its mind after searching another ply it can't of course be playing any better.  But just because it changes its mind is not sufficient to show that it is playing better.

Janzert


By definition if it changes its mind at ply N+1, it means the move it had found at ply N would give a worse result by the time ply N+1 comes.

Obviously it's not possible to prove that the new move is necessarily better, but it must be better according to the evaluation function. If we don't trust the evaluation function, we might as well throw our arms in the air and not search at all...

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 4th, 2010, 7:28am

on 08/04/10 at 02:30:30, rbarreira wrote:
To me it seems rather obvious that a bot changing its mind at a certain ply should correlate directly with the value of searching that ply, for the simple reason that if you don't change your mind, you can't possibly gain anything from doing the search.

I may be engaging in some fallacy though...

I don't think this is a fallacy, but a question remains as to the magnitude of the benefit.  Is the bot changing its mind between plies one and two of equal benefit to the bot changing its mind between plies two and three?  If changing your mind is worth a constant amount, then to know the benefit of an additional ply, all we need to do is measure how often mind-changing happens at that ply.

On the other hand, it could be a lot trickier than that.  The value of a rabbit certainly changes across the Elo scale.  If we calibrated by rabbit handicaps instead of by winning percentages, our scale would look very different.  Two beginners who are 50-50 with a rabbit handicap might be 53-47 at an even game, whereas two experts who are 50-50 with a rabbit handicap might be 63-37 at an even game.

I know that rabbits aren't like insights, but the point is that X at a low level might not have the same effect on winning percentages as X at a high level.  I'd like to read more about chess experiments that have actually measured winning percentages as a function of time, and whether those have behaved log-linearly, quite apart from mind-changing as a function of time.
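
The hypothetical rabbit-handicap numbers above can be translated into Elo terms with the standard logistic model. A minimal sketch in Python (the 53-47 and 63-37 splits are the made-up figures from the example, not measured data):

```python
import math

def elo_gap(win_prob):
    """Rating difference implied by a winning percentage under the Elo logistic model."""
    return 400 * math.log10(win_prob / (1 - win_prob))

# Hypothetical beginners: 53-47 at an even game means a rabbit is worth ~21 Elo to them.
print(round(elo_gap(0.53)))  # -> 21
# Hypothetical experts: 63-37 at an even game means the same rabbit is worth ~92 Elo.
print(round(elo_gap(0.63)))  # -> 92
```

So the same material handicap maps to very different rating gaps at different levels, which is the nonlinearity being described.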

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 4th, 2010, 7:39am
Suppose your evaluation function is just material.

If you change your mind between plies 1 and 2 you're saying "oops, the move I had chosen would lose material after 2 plies".

If you change your mind between plies 99 and 100 you're saying "oops, the move I had chosen would lose material after 100 plies".

If you're playing against an opponent nearly equal to you that can see your mistakes to the depth you're searching, in both cases you will lose material by not searching the extra ply. The net result is the same.

How frequent it is to change your mind at a given ply, I don't know... But I don't see any difference between changing your mind at ply X or X+100.

Title: Re: Rating points per 2x thinking time?
Post by speek on Aug 4th, 2010, 9:01am
The difference between missing the 2nd ply vs missing the 100th ply is the number of choices you have of how you lose material and exactly which material you lose.  If you don't look at the 2nd ply, you have zero choice of what material to lose.  If you look 2 plies ahead and I look 3, you will not see that you are going to lose a dog until your next move, at which time you can make some choices about how you lose the dog, or whether you want to lose a horse or two rabbits instead.

When you're looking at the 100th ply and I'm only looking 99, you're talking about a lot of choices.

Also, the fact is that playing the identical algorithm and identical evaluation function against itself fully deterministically, one side looking x plies ahead and the other x+1, is not a good way to measure Elo difference, since it will essentially yield a 100% win result (or at least an undefeated result in chess).

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 4th, 2010, 9:13am

on 08/04/10 at 09:01:18, speek wrote:
The difference between missing the 2nd ply vs missing the 100th ply is the number of choices you have of how you lose material and exactly which material you lose.  If you don't look at the 2nd ply, you have zero choice of what material to lose.  If you look 2 plies ahead and I look 3, you will not see that you are going to lose a dog until your next move, at which time you can make some choices about how you lose the dog, or whether you want to lose a horse or two rabbits instead.

When you're looking at the 100th ply and I'm only looking 99, you're talking about a lot of choices.


If your evaluation is just material, you are going to choose randomly between different ways of losing or winning the same material. I know it was a simplistic example, but the main point still stands and works no matter what the evaluation function is.


Quote:
Also, the fact is that playing the identical algorithm and identical evaluation function against itself fully deterministically, one side looking x plies ahead and the other x+1, is not a good way to measure Elo difference, since it will essentially yield a 100% win result (or at least an undefeated result in chess).


Not necessarily. If you have either multithreading or just a simple shuffle of the move ordering before starting to search, that will make them play different games all the time, because different moves with the same score will get chosen. This happened in my earlier test even though I just had multithreading and no explicit randomness at all.

Just because you are looking one ply further, that doesn't mean you will always steer the game to an advantageous position.

Title: Re: Rating points per 2x thinking time?
Post by Manuel on Aug 4th, 2010, 9:53am

on 08/04/10 at 09:13:55, rbarreira wrote:
Just because you are looking one ply further, that doesn't mean you will always steer the game to an advantageous position.


That is exactly why looking one ply further and changing your mind does not need to correlate linearly with strength.
On average, looking further will improve your strength, but the effect does not have to be equally strong (i.e. linear) at different plies.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Aug 4th, 2010, 10:00am

on 08/04/10 at 09:53:42, Manuel wrote:
That is exactly why looking one ply further and changing your mind does not need to correlate linearly with strength.
On average, looking further will improve your strength, but the effect does not have to be equally strong (i.e. linear) at different plies.


I have already admitted that I don't know how often a change of mind will occur at different plies.

I strongly believe that it takes many plies to get that rate significantly down though... I don't know how else to explain the fact that chess programs keep increasing their search depth and keep deriving strength from that.

For Arimaa, we'll keep learning in the foreseeable future.

Title: Re: Rating points per 2x thinking time?
Post by Janzert on Aug 5th, 2010, 7:48am

on 08/04/10 at 07:08:35, rbarreira wrote:
By definition if it changes its mind at ply N+1, it means the move it had found at ply N would give a worse result by the time ply N+1 comes.


Where the result in this case is measured by the current evaluation function.


Quote:
Obviously it's not possible to prove that the new move is necessarily better, but it must be better according to the evaluation function. If we don't trust the evaluation function, we might as well throw our arms in the air and not search at all...


But if we trust the evaluation function we might as well throw our arms in the air and not search at all...

Sorry, couldn't resist. Obviously evals are neither perfectly correct nor perfectly wrong. There is also a distinction to be made between an eval that is wrong because of a lack of knowledge and one that is wrong because of wrong knowledge. I don't think the former will ever hurt a deeper search, but the latter certainly can.

I do think the rate of changing move choice is probably an OK rough measure. I don't think it's proof, though, and I would certainly rather see studies with actual play.

Janzert

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Mar 7th, 2011, 6:54am
Just an FYI: I'm conducting briareus vs Opfor tests on this to see the effect of doubling thinking speed without using self-play.

I'm running 200 Fast time control games of briareus (32 bit) with 1 thread against Opfor 2009 (32 bit), then I will set briareus to use 2 threads and see how the Elo changes. This will take a few days; only 58 games of the first run are done so far.

So it's not quite double thinking speed, but should be pretty close and should be a lower bound on the value of doubled thinking time. It's probably also more relevant to the way hardware is improving these days, more by increasing the number of cores than by improving the single-core performance.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Mar 7th, 2011, 8:51am

on 03/07/11 at 06:54:41, rbarreira wrote:
Just an FYI: I'm conducting briareus vs Opfor tests on this to see the effect of doubling thinking speed without using self-play.

That's a great idea.  Maybe self-play doesn't exaggerate gains, but intuitively one suspects it would.


Quote:
I'm running 200 Fast time control games of briareus (32 bit) with 1 thread against Opfor 2009 (32 bit), then I will set briareus to use 2 threads and see how the Elo changes. This will take a few days; only 58 games of the first run are done so far.

As jdb and I were chatting about last night, the 95% confidence interval for 200 games is about +/- 50 Elo.  So if doubled thinking time is worth less than 50 Elo, you might not get a statistically significant result.
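
That confidence interval can be roughly reproduced by treating each game as an independent coin flip at p = 0.5 and converting the edge of the score interval to Elo. A sketch (not necessarily the exact method jdb used; the function name is just illustrative):

```python
import math

def elo_ci_halfwidth(n_games, p=0.5, z=1.96):
    """Approximate 95% half-width, in Elo, of a performance rating from n_games."""
    se = math.sqrt(p * (1 - p) / n_games)    # standard error of the score fraction
    hi = p + z * se                          # upper edge of the score interval
    return 400 * math.log10(hi / (1 - hi))   # convert score fraction to Elo

print(round(elo_ci_halfwidth(200)))  # about +/- 48 Elo, consistent with the ~50 quoted
```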


Quote:
So it's not quite double thinking speed, but should be pretty close and should be a lower bound on the value of doubled thinking time. It's probably also more relevant to the way hardware is improving these days, more by increasing the number of cores than by improving the single-core performance.

One wonders whether, years from now, the shift to multiple cores will be viewed as the beginning of the end of Moore's Law, i.e. nominally continuing to double transistor density, but without actually continuing to double performance.

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Aug 20th, 2011, 11:40am
I have taken aaaa's ratings of fixed-performance bots from this post (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1207699394;start=105#116) to look at the rating differences between P1 and P2 versions of the same bot.  I excluded ArimaaScore because of the distortion of first-time players being unclear on the rules, interface, etc., and excluded Rat2009 due to insufficient data.  Maybe there are other exclusions that should be made for reasons I am unaware of, but if so we'll just have to live with the extra noise in the data.

I was left with 21 pairs of P1/P2 bots.  On average the P2 bot was rated 266 points higher.  A measure of rating points per ply isn't a direct measure of rating points per doubled CPU, but if the time cost of an extra ply is ~ sqrt(branching) ~ sqrt(2^14) = 2^7, i.e. seven doublings per ply, then we are talking about 266/7 ~ 38 rating points per doubling.
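
The back-of-envelope arithmetic above, sketched out (the 2^14 branching factor and the 266-point gap are the estimates from this post):

```python
import math

branching = 2**14                           # rough Arimaa branching factor per ply
time_per_ply = math.sqrt(branching)         # with alpha-beta, one more ply costs ~sqrt(b) in time
doublings_per_ply = math.log2(time_per_ply) # sqrt(2^14) = 2^7, so 7 doublings per ply

elo_per_ply = 266                           # measured average P1 vs P2 rating gap
print(elo_per_ply / doublings_per_ply)      # -> 38.0
```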

Of course, the human players might be a hidden variable here.  The time control is a standard two minutes per move for the humans regardless of how long the bot thinks.  It seems likely to me that when the bot slows down, the human tends to slow down too, so the P2 bots are being rated against tougher competition than the P1 bots.  That would mean this measure of rating points per doubled CPU would be understated.

Also Clueless seems to be bringing the average down, and my recollection is that some versions of CluelessP2 don't always complete two plies due to heavy extensions, and thus are not really a whole ply stronger than CluelessP1.  Excluding Clueless raises the average to 312 points per ply, 45 points per doubling.

Despite the potential errors in the estimation, I thought I'd throw this data point into the mix.

Title: Re: Rating points per 2x thinking time?
Post by rbarreira on Oct 29th, 2012, 4:07am
One more data point:

I had the chance to borrow a 32-core machine for a few hours which gave me the opportunity to run a quick test of bot_ziltoid playing against itself, 1 core vs 32 cores.

Timecontrol: 5s/50s/100/0/25m/50s
Result: 32-core wins 59-11 (I only had time for 70 games)

This works out to 292 / lg(32) = 58.4 Elo per doubling of the number of threads (which is of course less valuable than doubling thinking time).
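
The Elo arithmetic behind that figure, as a quick sketch (draws ignored, per the 59-11 score):

```python
import math

wins, losses = 59, 11
p = wins / (wins + losses)                      # 32-core score fraction: 59/70
elo_diff = 400 * math.log10(p / (1 - p))        # performance gap: ~292 Elo
per_doubling = elo_diff / math.log2(32)         # 5 doublings from 1 to 32 threads
print(round(elo_diff), round(per_doubling, 1))  # -> 292 58.4
```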

Title: Re: Rating points per 2x thinking time?
Post by Fritzlein on Oct 29th, 2012, 11:24am
Nice data point; thanks.  It seems that Moore's Law is now only doubling the number of CPUs, so that may be more relevant than doubling thinking time.



Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1!
YaBB © 2000-2003. All Rights Reserved.