Arimaa Forum - Rating points per 2x thinking time?

Welcome, Guest. Please Login or Register.
Jul 3^rd, 2025, 8:12pm

Home

Help

Members

Arimaa Forum « Rating points per 2x thinking time? »

   Arimaa Forum
   Arimaa
   General Discussion (Moderator: supersamu)
   Rating points per 2x thinking time?

« Previous topic | Next topic »

Pages: 1 2 3 ... 6

Notify of replies

Send Topic

Author

Topic: Rating points per 2x thinking time? (Read 11457 times)

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Rating points per 2x thinking time?
« on: Aug 23^rd, 2005, 5:24pm »

Quote

Modify

In the "BombP3/P4" thread, in the "Arimaa Challenge Format" thread, and in the "Fast Rating / Slow Rating" thread, the same issue has indirectly come up: exactly how much better will the same software play on faster hardware?

A reasonable guess is that playing strength will be logarithmic in the thinking time. If a bot were doing a full-width search, the depth of the search would be logarithmic in thinking time, and (given a constant evaluation function) one can suppose rating points are linear in search depth. Some evidence from chess computing supports this hypothesis fairly strongly.

So let's take it as a hypothesis that for every doubling of thinking time, constant software will gain X rating points. My question is how to make a reasonable measurement of X.

At first I thought of comparing Arimaazilla and Arimaazon, two bots with different thinking times and the same evaluation function. Since Arimaazon came online, it has averaged a 1502 rating, while Arimaazilla has averaged a 1432 rating, for a difference of 70 points. Arimaazon always thinks 60 seconds per move, while I recalled that Arimaazilla thinks 6 seconds for move for a time factor of 10. I calculated then that each doubling of think time is worth 70/lg(10) = 21 rating points per doubling.

Unfortunately, this failed to account for the slowing down of the server. Apparently nowadays Arimaazilla thinks about 30 seconds per move to get to its standard depth. (Also perhaps the slow server rather than rating deflation explains Arimaazon's slight decline?) If Arimaazilla thought for a fixed time as well, then they would be more comparable, but as it stands I don't see how to get useful information out of the variable data.

Alternatively, one could compare, for example, BombBlitz to BombFast, where the latter always has exactly double the thinking time. Unfortunately that comparison will also tell us little, because the difference in time control has even more of an effect on the human than on the bot.

Another experiment I can think of would be for a bot developer to play a bot against itself many times at different speeds. I am afraid, however, that the bots aren't sufficiently randomized for this to give an accurate reading. In a deterministic situation, a bot that is only a little better will win every game.

Can anyone think of any reasonable way to measure the rating increase per doubled time for any bot? (Always assuming such a thing exists...)

IP Logged

Janzert
Forum Guru

Arimaa player #247

Gender: male

Posts: 1016

Re: Rating points per 2x thinking time?
« Reply #1 on: Aug 23^rd, 2005, 5:58pm »

Quote

Modify

I happened to just read a semi-related paper last night about chess program strength increase with search depth increase.

http://supertech.lcs.mit.edu/~heinz/dt/node46.html

Seems that the conclusion is that strength seems to increase linearly with search depth. Which of course would imply an exponential amount of work for a linear increase in strength.

Don't know how much would be directly applicable for Arimaa, but thought you might like taking a look at it. Also they and all the studies they reference seem to all use self play.

Janzert

IP Logged

jdb
Forum Guru

Arimaa player #214

Gender: male

Posts: 682

Re: Rating points per 2x thinking time?
« Reply #2 on: Aug 23^rd, 2005, 6:20pm »

Quote

Modify

Interesting question.

I would submit that at the current stage of development in Arimaa bots, the change in strength would be very hard to separate from other, knowledge based factors.

In chess, a reasonably modern program's evaluation function 'knows' about all the major strategic themes in chess. There are no gaping 'holes' to exploit. So, a program that searches deeper will play better.

In Arimaa, as the ever more impressive bot bashing feats show, there are numerous holes in the bot's knowledge. These holes can never be fixed with more search time.

Adding search time will not make a bot any stronger against someone who knows how to expoit its strategic weaknesses. So a test to measure the change in playing strength due to search time will need to address this.

IP Logged

Arimanator
Forum Guru

Arimaa player #1064

Gender: male

Posts: 158

Re: Rating points per 2x thinking time?
« Reply #3 on: Aug 24^th, 2005, 2:00am »

Quote

Modify

I believe at any rate that if we could have enough computer power to have P4 bots with a reasonable thinking time (less than 3 min. a move) we would have to greatly improve our tactics to beat them on a botbashing level.

( for the record given that there are close to 10000 min. in a week that would mean about 10000 X what we have today, how long will we have to wait for that, what do you think Fritzlein?)

As far as I can tell most of our tactical schemes don't hold out if you submit them to a 4 move analysis. So computer power would certainly equal better players in the long run.

Like the old battle for space between the Russians and the Americans, when one made a step forward the other one had to improve to keep up.

It's a shame they stopped it or we would ALL be playing golf on the moon. (kidding Wink

)

« Last Edit: Aug 24^th, 2005, 2:58am by Arimanator »

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Rating points per 2x thinking time?
« Reply #4 on: Aug 24^th, 2005, 10:03am »

Quote

Modify

on Aug 23^rd, 2005, 5:58pm, Janzert wrote:

I happened to just read a semi-related paper last night about chess program strength increase with search depth increase.

http://supertech.lcs.mit.edu/~heinz/dt/node46.html

Thanks for the link to this article. I'm pretty sure the third page of it was already linked from a previous thread, but it is interesting to see it in the context of the whole article. I'm surprised that the authors considered it a more useful metric to measure how much a computer changes its mind as opposed to how deep it searches, but they seem to have good reasons for this.

I'm also surprised that self-play is the most widely used metric for increasing strength with increasing search depth, but now that I think about it more carefully, it makes a good deal of sense. Different levels of a bot playing against the same human is likely to give a much less useful measure, because the human will probably win in more or less the same way until the search gets deep enough to foil that method. So against a human it might look the strength spikes abruptly when in fact it was increasing slowly all along.

I guess I'm implicitly clinging to two theories here: First, a human who is a little bit better than a computer will appear to be a lot better, while a human who is a little worse than a computer will appear to be a lot worse. There is a very abrupt transition between winning almost never and almost always. Second, I believe flaws in strategic judgement of bots can always be covered by sufficent search depth, although it isn't clear how much depth will be necessary in advance. For example, 5-ply might be enough for Bomb to stop mutual massacre from working, whereas for Bomb to stop the E-H smother attack 15 ply would suffice. In practice improved evaluation is sometimes the only option, but that's just because we don't have enough computing power.

On a side note, one thing I inferred from the article is that each additional ply of search in chess takes 3 to 4 times as long as the previous. How can this be true if there are 30 to 40 possibile moves in each position? My naive understanding is that pruning somehow cuts it down to an average of 4 moves per position that are worth seriously considering and the rest are demonstrably inferior. Is that a reasonable way to explain why each ply doesn't take 40 times as long as the previous?

If I'm shooting near the mark, then I want to know what the corresponding number is for Arimaa. The interesting number is not how many thousands of moves each position has, but how many of them must be taken seriously. To put it another way, if I want to know how many times longer than the previous each ply will take in Arimaa, I can't guess based on the number of possible moves per position. I need to know how effective pruning will be, and that in turn depends on both the nature of the game and the subtlety of the evaluation function.

JDB, would you mind testing for Clueless how many times longer it takes to search 3 ply than it takes to seach 2 ply? Or perhaps you know off the top of your head.

These thoughts are making me think that I didn't describe very accurately in the Wikipedia article why computers struggle at Arimaa. If I figure out how to say it more precisely, I'll go back and rewrite it.

IP Logged

99of9
Forum Guru

Gnobby's creator (player #314)

Gender:

Posts: 1413

Re: Rating points per 2x thinking time?
« Reply #5 on: Aug 24^th, 2005, 6:25pm »

Quote

Modify

on Aug 24^th, 2005, 10:03am, Fritzlein wrote:

On a side note, one thing I inferred from the article is that each additional ply of search in chess takes 3 to 4 times as long as the previous. How can this be true if there are 30 to 40 possibile moves in each position? My naive understanding is that pruning somehow cuts it down to an average of 4 moves per position that are worth seriously considering and the rest are demonstrably inferior. Is that a reasonable way to explain why each ply doesn't take 40 times as long as the previous?

If doing full "minimax" then you are right that the algorithm scales approximately as W^D where W is the average number of moves per position, and D is the depth you search to. However, raw "alpha-beta" scales better than this: roughly W^(D/2). So for every extra ply you can expect sqrt(W) extra time (in your chess example that would be a time factor of 5-7).

However, even what I say is oversimplified. When you take into account transposition tables (which test for positions which have been examined before), or move ordering (choosing which move to search first from a given position greatly affects the speed of alpha-beta), this scaling changes again. Finally, as you say, pruning is extremely important to the scaling. Gnobot doesn't prune at all, so is the most likely to be characterised by these simple scaling rules.

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Rating points per 2x thinking time?
« Reply #6 on: Sep 11^th, 2005, 2:37am »

Quote

Modify

JDB has thoughtfully provided me with a test version of Clueless so that I can get some experimental data on the value of deeper thinking via self play.

At first I intended to play 2s/move vs 4s/move, and 4s/move vs 8s/move, etc. The way Clueless is engineered, however, it must examine each move to a depth of 4 steps before evaluating it. If there isn't enough time to look at every move to depth 4, some moves are simply not considered at all. Given that my computer is slow, and given that in complex situations Clueless does some automatic extra processing of even those 4-step moves, at times it takes 20 seconds for the bot to finish the initial 4-step search.

Therefore, at faster time controls, the advantage of greater time is not due to deeper search vs. shallower search so much as it is due to looking at all the moves versus looking at only some of the moves. Since this is not at all what I wanted to investigate, I decided to set up my match between the 30s/move and 60s/move versions.

The final score is 86 to 58 in favor of the deeper-searching bot, for an advantage of 68 rating points. This is much more than I expected.

To put the result in perspective, we need to consider the luck factor. The statistical 95% confidence interval is plus or minus 11.53 games. So what I should really say is that the deeper searcher is somewhere between 129 and 12 points stronger. In other words, there is enough evidence to be statistically satisfied that the deeper searcher is in fact stronger, but there is a wide range within which the true strength difference might lie.

By the way, as a procedural footnote, the offline match script (generously provided by Omar for testing) doesn't pass the complete game state to the bots; it only passes the current position. Therefore, even though Clueless is smart enough to avoid repetition of position, some games end that way anyway because the match script does pass enough information. I have decided not to count such games at all, because the player that loses on repetition isn't necessarily the player that was losing over the board. That delays the compilation of a statistically significant amount of data still further, but I think it means the data I do collect will be more meaningful.

« Last Edit: Nov 22^nd, 2005, 4:12pm by Fritzlein »

IP Logged

99of9
Forum Guru

Gnobby's creator (player #314)

Gender:

Posts: 1413

Re: Rating points per 2x thinking time?
« Reply #7 on: Sep 11^th, 2005, 7:01am »

Quote

Modify

How are you avoiding always repeating the same game? Are the setups different for each game?

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Rating points per 2x thinking time?
« Reply #8 on: Sep 11^th, 2005, 7:31am »

Quote

Modify

on Sep 11^th, 2005, 7:01am, 99of9 wrote:

How are you avoiding always repeating the same game? Are the setups different for each game?

There is a random seed for each game; I am passing consecutive integers. The piece setup is somewhat randomized, and I think something else is randomized too, but you would have to ask JDB what exactly that is.

IP Logged

jdb
Forum Guru

Arimaa player #214

Gender: male

Posts: 682

Re: Rating points per 2x thinking time?
« Reply #9 on: Sep 11^th, 2005, 10:32am »

Quote

Modify

The random seed you are passing in is not used.
It was too much of a pain to get that info to where it was needed.

The opening setups have some randomness to them. I also added some random adjustments to the piece square tables in the evaluation. The seed for both of these things is taken from the current system time.

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Rating points per 2x thinking time?
« Reply #10 on: Oct 3^rd, 2005, 10:21am »

Quote

Modify

OK, I think I'm ready to close the book on this experiment. After throwing out 31 games which ended in repetition, I had 144 games I counted, 72 with the 30-second bot moving first, and 72 with the 60-second bot moving first. The score was 58 to 86, which means the doubled thinking time is worth 68 rating points. The 95% confidence interval is that the doubled thinking time is worth between 12 and 129 rating points, so the advantage to the deeper searcher is statistically significant, but the range is still over 100 points. I would try to narrow it down further, but this experiment has already dragged on a while on my slow computer.

I confess that 68 points per doubling is way more than I expected. However, we should definitely take this result with a grain of salt. Of those 144 games I also note that the score was 82 to 62 in favor of Gold, for an advantage of 49 rating points. Who here believes that Gold has that much of an advantage in human vs. human games? (Come to think of it, 99of9 might: he never believed the server statistics showing an advantage of 2 or 3 rating points.) Admittedly, the Gold advantage in this study is not as significant statistically as the 2x thinking time advantage, but it still points out a danger in extrapolating from a single study.

Despite the numerical evidence to the contrary, I still think 2x thinking time is worth less than 68 points and moving first is worth less than 49 points.

IP Logged

jdb
Forum Guru

Arimaa player #214

Gender: male

Posts: 682

Re: Rating points per 2x thinking time?
« Reply #11 on: Oct 3^rd, 2005, 11:56am »

Quote

Modify

Interesting results. One point to consider when evaluating the results, is the 30sec and the 60sec bot had identical evaluations functions.

If for example, they were both playing against Fritzlein, their records would most likely be identical, thus showing no advantage to extra thinking time.

Testing with identical eval's magnifies any difference in playing strength.

IP Logged

omar
Forum Guru

Arimaa player #2

Gender: male

Posts: 1003

Re: Rating points per 2x thinking time?
« Reply #12 on: Oct 6^th, 2005, 6:38am »

Quote

Modify

Thanks for these results Karl. It is very difficult to really know for sure how much improvement there is with doubling the thinking time. At these low time values, the effect of crossing over into the next ply may be having a significant effect. Comparing 1 hour of thinking vs 2 hours might show very little improvement. But of course it is not easy to collect data for such high time values; especially to have it be statistically significant.

I came across this page which discusses the results of various such experiments in chess:

http://supertech.lcs.mit.edu/~heinz/dt/node49.html

IP Logged

Janzert
Forum Guru

Arimaa player #247

Gender: male

Posts: 1016

Re: Rating points per 2x thinking time?
« Reply #13 on: Oct 6^th, 2005, 6:56am »

Quote

Modify

Hehe, seems to be a popular page on the subject anyway. Wink

Janzert

IP Logged

Fritzlein
Forum Guru

Arimaa player #706

Gender:

Posts: 5928

Re: Rating points per 2x thinking time?
« Reply #14 on: Apr 26^th, 2010, 8:41pm »

Quote

Modify

Tize, thanks for running your experiment on the effect of CPU doubling. I thought I would quote it in this thread since it is a little off-topic in the Challenge Screening thread.

on Apr 26^th, 2010, 1:06pm, tize wrote:

I've let marwin play itself with different time to think to get a rough idea of the rating difference of a cpu doubling when two players have the same strategic strength.

And here's what I got:

	Games	Won	Winning %	Rating per doubling
1s vs 2s	80	49	61	79
2s vs 10s	56	44	78	96
10s vs 20s	72	45	62	88
15s vs 30s	88	59	67	123
15s vs 60s	30	24	80	120

Which means that a doubling of cpu power gives you about 100 rating points in the best case. When facing humans or other bots I assume that the rating difference is smaller.

I'm very surprised by these results, because the doubling seems to be worth more than I expected. How much randomness was there between games?

For your row of 2s vs. 10s, I get that a winning ratio of 44/56 is worth 226 Elo points. Divided by 3.32 doublings gives 68 points per doubling, rather than 96 given in the table. This makes the results look even more jumpy between experiments. Still I'm curious what's up with the high improvement per doubling overall.

Was the time fixed per move (i.e. no time management allowed over the course of the game)? I wonder if 15s is a particularly bad cutoff point for some reason. Or maybe I have just been underestimating the value of CPU. Tongue

« Last Edit: Apr 26^th, 2010, 8:50pm by Fritzlein »

IP Logged

Pages: 1 2 3 ... 6

Notify of replies

Send Topic


« Previous topic \| Next topic »