Arimaa Forum
(http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)

(Message started by: Fritzlein on Mar 13^th, 2010, 9:55am)

Title: 2010 Challenge Screening
Post by Fritzlein on Mar 13^th, 2010, 9:55am

I'm going to start a thread specific to the screening, partly so we can dissect the game results and discuss any technical issues that arise, and partly so that I will be able to extend the following table and find it next year:

Year  Pairs  Decisive  Winner / Score  Loser / Score
----  -----  --------  --------------  -------------
2007     12         2  Bomb / 2        Zombie / 0
2008     16         7  Bomb / 6        Sharp / 1
2009     23         7  Clueless / 5    GnoBot / 2

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 13^th, 2010, 3:13pm

I just now tried to play a screening game, and I got the message:

Quote:

Sorry the server is currently busy.

bot

However, bot_clueless vs. Hippo is the only screening game at present. The other server should be available. What's up with that?

Title: Re: 2010 Challenge Screening
Post by aaaa on Mar 13^th, 2010, 3:29pm

I got the same message and after I pointed this out to Omar, he fixed it for me.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 14^th, 2010, 9:10pm

In honor of marwin's winning the first decisive pair of games of the 2010 screening, I have updated the table to add a row for this year. I have also included performance ratings for each bot for each year, computed over all screening games played (i.e. not just over the paired games).

Year  Pairs  Decisive  Winner / Score / Perf  Loser / Score / Perf
----  -----  --------  ---------------------  --------------------
2007     12         2  bomb / 2 / 2087        Zombie / 0 / 1876
2008     16         7  bomb / 6 / 1918        sharp / 1 / 1576
2009     23         7  clueless / 5 / 1910    GnoBot / 2 / 1792
2010     25        11  marwin / 6 / 2065      clueless / 5 / 1960

The performance rating is perhaps not very reliable. Bomb in 2008 and clueless in 2009 had similar performance ratings from the screening, but bomb went 0-9 in the Challenge whereas clueless went 2-7. We shouldn't read too much into marwin's impressive 4-3 start in the screening.

[EDIT]
updated through game 139910 (including joe's game 138003; not including hanzack's games or quad's)

Title: Re: 2010 Challenge Screening
Post by RonWeasley on Mar 15^th, 2010, 4:45am

Without going into a lot of detail, I'm going to use a gut reaction and allow joe to replay the game against bot_marwin that he resigned by mistake on move 3. I'm not sure about the precedent this sets, but it seems like human confusion was the root cause. If this were a WC game I would rule that the result must stand, but the screening games are all about getting information about bots' playing ability against humans. Forcing this result to stand would be counter to that purpose.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 23^rd, 2010, 8:38pm

In my week-long absence the race tightened up. Marwin now leads only 3-2 in decisive pairs, and has an unpaired loss to novacat. The screen, like the Computer Championship, could go down to the wire.

I have updated the table in my previous post. The screening performance ratings of marwin and clueless are now 2036 and 1977 respectively, rather in line with my gut feeling about their true abilities under Challenge conditions.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 25^th, 2010, 6:47am

Wow, clueless's win over novacat brings the match all square, 3-3 in the six decisive pairs. This is as exciting as the Computer Championship itself! I hope we have several more pairs completed before the month is over, starting with onigawara, 722caasi, camelback, Simon, The_Jeh, joe, and clauchau finishing the pairs they started. Marwin's performance rating in my table is still slightly higher than clueless's because of incomplete pairs slightly favoring marwin overall.

Title: Re: 2010 Challenge Screening
Post by camelback on Mar 25^th, 2010, 12:43pm

I'm not going to play at the 2-minute time control anymore. It was tempting to play the new bots, but annoying to get dragged beyond 4 hours. In the end I had to leave while in a winning position, and I can't unrate the game.

Maybe there should be different time control options for bot screening in the future.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 25^th, 2010, 2:37pm

At one point I was in favor of different time controls, but I changed my mind. The screening time controls should be the same as the challenge time controls, and those should not change.

I like that Omar hasn't changed the rules of the Arimaa Challenge for several years now, including the time controls. People who try to meet AI challenges always have to face the "moving goalposts" syndrome, i.e. whenever they get close to humans, the match conditions change to become more favorable to humans. I was upset when Omar changed the Arimaa Challenge to require beating three out of three defenders, since that is harder than just beating one defender. By the same token, I don't want to have faster time controls now, when humans don't mind, and then change to slower time controls later when we need them. It's not fair to base the match conditions on what is most convenient for humans at the moment.

Besides, it turns out that we are having an awesome screening at two minutes per move. Since I last wrote, The_Jeh and onigawara completed pairs, moving the score to 4-4, still tied! I somehow feel we won't know until the very last day which bot will advance.

Title: Re: 2010 Challenge Screening
Post by 99of9 on Mar 26^th, 2010, 1:37am

I can't see the link to play the screening bots anymore. Can someone help me?

Title: Re: 2010 Challenge Screening
Post by Nombril on Mar 26^th, 2010, 3:15am

http://arimaa.com/arimaa/challenge/2010/playBestBots.cgi

For me, on some browsers/computers it seems a few of the announcements disappear after I have viewed them once. Don't know if this is happening to you.

Title: Re: 2010 Challenge Screening
Post by 99of9 on Mar 26^th, 2010, 4:42am

Thanks Nombril. I hope I get some more free nights to take revenge on marwin.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 29^th, 2010, 5:52am

How exciting! Rabbits' punchout of marwin gave clueless the lead for all of 83 minutes before Eltripas downed clueless to even the score at 5-5 again. Just two and a half days left in this thriller...

Title: Re: 2010 Challenge Screening
Post by knarl on Mar 29^th, 2010, 4:34pm

I have plans to try a reckless race game against marwin again, after almost winning (is there such a thing in arimaa? :)) last time. I just need to find time so I can play both bots before the deadline.

I watched marwin's defeat to rabbits yesterday. Talk about putting up a death struggle at the end!

Cheers,
knarl.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 29^th, 2010, 4:48pm

on 03/29/10 at 16:34:15, knarl wrote:

I have plans to try a reckless race game against marwin again, after almost winning (is there such a thing in arimaa? :)) last time.

Heheh. Your game against marwin reminds me of something I once heard a chess player say: "He creamed me, but just barely!" :)

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Mar 31^st, 2010, 5:04pm

Congratulations to marwin on a one point victory in the screening!

Year  Pairs  Decisive  Winner / Score / Perf  Loser / Score / Perf
----  -----  --------  ---------------------  --------------------
2007     12         2  bomb / 2 / 2087        Zombie / 0 / 1876
2008     16         7  bomb / 6 / 1918        sharp / 1 / 1576
2009     23         7  clueless / 5 / 1910    GnoBot / 2 / 1792
2010     25        11  marwin / 6 / 2065      clueless / 5 / 1960

I would have guessed clueless would perform in the 1950-2000 range and marwin in the 2000-2050 range. Marwin's actual performance rating of 2065 in the screening is scary good. However, both camelback and robinson had winning positions which they had to abandon due to time constraints. Removing these two games drops marwin's performance rating to 2029, and scoring them as wins for the humans would take marwin all the way down to 1994.

So I don't think marwin is better than expected, just a significant step forward and not a huge one. If we estimate that Bomb played at 1850 in 2004 and marwin now plays at 2050 in 2010, that's a rate of progress of 33 rating points per year. We'll have to see whether that long-term rate projects linearly into the future, or continues its more recent spike.

Title: Re: 2010 Challenge Screening
Post by aaaa on Mar 31^st, 2010, 7:57pm

Congratulations again, tize. I noticed I'm the only one who gave either bot 2 net points. Fortunately, although my games had more total influence than anyone else's, they didn't make the difference, if only barely. Maybe next time, for fairness' sake, the number of different opponents with unequal results should be considered first, with the current scoring system as the first tiebreaker and the championship result as the last. Thoughts, anyone?

Title: Re: 2010 Challenge Screening
Post by tize on Apr 2^nd, 2010, 2:26am

Thank you guys.

I never would have guessed that the screening could be this even, with marwin and clueless taking turns being ahead with just a few days left before the finish.

Quote:

If we estimate that Bomb played at 1850 in 2004 and marwin now plays at 2050 in 2010, that's a rate of progress of 33 rating points per year.

If we say that this year's hardware was 8 times faster than 2004's hardware and that a doubling of speed gives 100 points, then the software improvements have a negative progress rate of about 16 rating points per year. :o

I better stop "improving" marwin... :-/

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Apr 2^nd, 2010, 4:47am

on 04/02/10 at 02:26:50, tize wrote:

If we say that this year's hardware was 8 times faster than 2004's hardware and that a doubling of speed gives 100 points

The amount of rating improvement from doubling CPU speed is a figure I am very interested in. It appears to be a bit less than 100 for chess. I doubt it would be more for Arimaa than for chess; I waver between thinking the benefit of CPU doubling will be less for Arimaa and thinking it will be about the same for both games.

As for the software improvement represented by marwin, one could also make a case that Bomb played at 1850 strength in 2008, so marwin playing at 2050 in 2010 represents a rate of progress of 100 rating points per year. Even assuming 40 points of hardware progress per year (particularly generous since a quad core doesn't search 4x the nodes), that still leaves 60 points per year due to better software. :)
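The two back-of-the-envelope estimates in this exchange can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and every input (the 1850 and 2050 ratings, the 8x speedup, 100 points per doubling, 40 hardware points per year) is a guess taken from the posts above, not a measurement.

```python
import math


def yearly_progress(start_rating, end_rating, start_year, end_year):
    """Average rating gain per year between two estimated ratings."""
    return (end_rating - start_rating) / (end_year - start_year)


def hardware_points_per_year(speedup, points_per_doubling, years):
    """Rating points per year attributed to faster hardware alone.

    Assumes every doubling of speed is worth the same number of points.
    """
    return math.log2(speedup) * points_per_doubling / years


# tize's version: Bomb at 1850 in 2004, marwin at 2050 in 2010,
# hardware 8x faster, 100 points per doubling (all assumed figures).
total = yearly_progress(1850, 2050, 2004, 2010)
hardware = hardware_points_per_year(8, 100, 2010 - 2004)
software = total - hardware
print(f"2004 baseline: total {total:.1f}, hardware {hardware:.1f}, "
      f"software {software:.1f} points/year")

# Fritzlein's version: Bomb still at 1850 in 2008, and an assumed
# 40 points per year from hardware.
alt_total = yearly_progress(1850, 2050, 2008, 2010)
alt_software = alt_total - 40
print(f"2008 baseline: total {alt_total:.1f}, software {alt_software:.1f} "
      f"points/year")
```

The sign of the software term depends entirely on which year is taken as the 1850 baseline and on how many points a hardware doubling is assumed to be worth, which is why the two estimates disagree so sharply.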

Title: Re: 2010 Challenge Screening
Post by chubb on Apr 5^th, 2010, 12:56am

Hi,

Could you tell me what screening means exactly? Why let the bots play against each other again after the computer championship? To find their rating?

Apart from that, I am curious about the challenge matches. They should have started yesterday, but I can't find anything about them, and the first match for bot_Marwin is scheduled for Friday. It would be cool if the match schedule were easier to find in advance and to follow.

Thank you,
chubb

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Apr 5^th, 2010, 5:07am

on 04/05/10 at 00:56:16, chubb wrote:

Could you tell me what screening means exactly? Why let the bots play against each other again after the computer championship? To find their rating?

When the Arimaa Challenge was first established, the computer challenger was simply the winner of the Computer Championship. Starting in 2007, the rules were changed so that the top two computers from the Computer Championship have a playoff (screening) for the right to be the challenger. But the bots do not play off against each other, they play against humans.

It seems that it would be possible to develop a bot that plays well against other bots and poorly against humans; this is not the kind of bot we want in the Arimaa Challenge. The Computer Championship only tells us which bot plays well against other bots, whereas the screening shows how well the top two bots play against humans. So far, the winner of the Computer Championship has also won the screening in every year, but I don't expect the trend to continue indefinitely. The primary purpose of the screening is to keep developers focused on winning the Arimaa Challenge, instead of just trying to win the Computer Championship and giving up on beating humans.

A secondary reason for the screening is to prevent a bot from winning the Arimaa Challenge with glaring weaknesses that humans can exploit, but didn't have time to figure out. Secrecy is the friend of software that can't learn and adapt. The screening gives humans time to test out various strategies against the computer challenger and see which ones are most effective. This makes it far less likely for a computer to win the Arimaa Challenge only to be busted a month later.

Quote:

Apart from that, I am curious about the challenge matches. They should have started yesterday, but I can't find anything about them, and the first match for bot_Marwin is scheduled for Friday. It would be cool if the match schedule were easier to find in advance and to follow.

I'll let Omar field that question. Apparently one of our three Arimaa Challenge defenders has gone incommunicado.

Title: Re: 2010 Challenge Screening
Post by omar on Apr 5^th, 2010, 8:54am

on 04/05/10 at 00:56:16, chubb wrote:

The first round of games will be played this week. The challenge defenders get to select the time for their games. I have scheduled the games for the first round. In the gameroom look in the 'Scheduled Games' section.

Title: Re: 2010 Challenge Screening
Post by tize on Apr 26^th, 2010, 1:06pm

Since we talked about how much of a rating increase a doubling of CPU power would give, I ran a little experiment with marwin.

I let marwin play itself with different thinking times to get a rough idea of the rating difference from a CPU doubling when the two players have the same strategic strength.

And here's what I got:

Matchup      Games  Won  Winning %  Rating per doubling
-----------  -----  ---  ---------  -------------------
1s vs 2s        80   49         61                   79
2s vs 10s       56   44         78                   96
10s vs 20s      72   45         62                   88
15s vs 30s      88   59         67                  123
15s vs 60s      30   24         80                  120

This means that a doubling of CPU power gives you about 100 rating points in the best case. When facing humans or other bots, I assume the rating difference is smaller.
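For anyone who wants to check the last column, here is a minimal sketch of one way it could have been derived. It assumes the standard Elo logistic formula and that "Won" counts the wins of the side with more thinking time; neither is stated in the post, and the function names are hypothetical.

```python
import math


def elo_difference(win_fraction):
    """Rating gap implied by a win fraction under the standard Elo model."""
    return 400 * math.log10(win_fraction / (1 - win_fraction))


def rating_per_doubling(games, won, short_time, long_time):
    """Spread the implied rating gap evenly over the doublings of time."""
    doublings = math.log2(long_time / short_time)
    return elo_difference(won / games) / doublings


# (label, games, won, shorter time in s, longer time in s, posted value)
results = [
    ("1s vs 2s", 80, 49, 1, 2, 79),
    ("2s vs 10s", 56, 44, 2, 10, 96),
    ("10s vs 20s", 72, 45, 10, 20, 88),
    ("15s vs 30s", 88, 59, 15, 30, 123),
    ("15s vs 60s", 30, 24, 15, 60, 120),
]

computed = {}
for label, games, won, t_short, t_long, posted in results:
    computed[label] = rating_per_doubling(games, won, t_short, t_long)
    print(f"{label:>10}: computed {computed[label]:6.1f}, posted {posted}")
```

Under these assumptions the computed values land within a point or two of the posted column. With only 30 to 88 games per row, the uncertainty on each figure is large, so the differences between rows should not be over-interpreted.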

Title: Re: 2010 Challenge Screening
Post by omar on Apr 26^th, 2010, 3:36pm

Thanks for posting this. For the last one did you mean 30s vs 60s?

I am surprised it is gaining 100 points per doubling. In chess they say it is about 50 to 70 Elo points per doubling.

Title: Re: 2010 Challenge Screening
Post by Fritzlein on Apr 26^th, 2010, 8:57pm

Thanks for running that experiment, tize. I'm very interested in the results. Since we were starting to drift a little off topic, I replied in this thread (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi?board=talk;action=display;num=1124835841).