Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> Events >> Challenge Screening Rules
(Message started by: Fritzlein on Mar 15th, 2015, 2:26pm)

Title: Challenge Screening Rules
Post by Fritzlein on Mar 15th, 2015, 2:26pm
Several informal discussions of the rules for the Challenge Screening phase have popped up recently.  It has been a long time since the current format was chosen; since then we have gained a lot of experience with it, and also circumstances have changed.  It seems time to revisit the reasons behind the rules thoroughly and formally.

Note that this discussion will not be purely theoretical: Omar has expressed a willingness to overhaul the rules and an eagerness to hear what the community thinks.

The current rules seek to meet two objectives simultaneously.  One is to discourage developers from trying to win the Computer Championship instead of targeting the Arimaa Challenge.  It might be possible to create an anti-bot bot that would do well in the Computer Championship but be easily exploitable by humans.  Taking the top two finishers from the Computer Championship and selecting the one that scores better against humans in the Screening is intended to give the opportunity to win the Challenge to the bot that is most geared towards doing so.

The second objective is for bots to have a "fair" amount of exposure before the Challenge.  Man vs. machine matches are held under very different circumstances, which can significantly tilt the odds towards one side or the other.  For example, in Deep Blue vs. Kasparov, the computer had no public record whatsoever.  One can argue that Deep Blue eked out a narrow victory in part due to the surprise factor.  Kasparov tried to win with some anti-computer play that backfired.  If he had known what to expect, he might have been able to win with regular chess.  On the other extreme, in the Fritz vs. Kramnik match and the most recent shogi man vs. machine match, the humans got an exact copy of the software to train against.  They could have won merely by discovering a single strategic weakness and exploiting that specifically.

The Screening tries to find a middle ground.  Forcing the bots to be exposed to the community means that humans can't be taken completely by surprise.  On the other hand, limiting each player to one game with each color against each bot (and barring the defenders themselves from playing) protects the bots somewhat from being beaten by bot-bashers playing obsessively until they work out some narrow winning formula.

Criticisms of the current system include (1) the difficulty of enforcing the rules when duplicate accounts are so easy to create, (2) the perception that the results can be skewed by players not taking the Screening seriously, (3) the perception that the results can be skewed by someone who has a motive for one bot to win rather than the other, (4) the perception that the best anti-human bot is not necessarily selected anyway due to pure randomness, and (5) a desire to play the bots more than the Screening permits.

The primary argument for changing nothing is that most suggested changes tilt the playing field towards a successful defense of the Challenge.  The current rules have already been criticized in some quarters as being too favorable to the humans, and making it even more so would rightly be criticized as "moving the goalposts", fostering a perception that we are changing the rules because we are scared of losing now.

That said, any suggested changes will be considered.  Let the ideas flow!

Title: Re: Challenge Screening Rules
Post by kzb52 on Mar 15th, 2015, 2:57pm
I'll start by mentioning a suggestion you have put forward a few times, and that I quite like as well.  

The idea is to move from a totally open screening process (where players with questionable backgrounds and motives can participate) to one where the screening participants are a list of players, approved by the challenge TD, who have all agreed to play 2 or 4 games.

Since the screening games are quite a commitment (up to 8 hours of fun!), I think we should make an effort to be as inclusive and easygoing on players as possible.  Let players volunteer (or even un-volunteer) on short notice and at any point during the screening period.  Ideally, the TD would also reach out to some players who would be excellent screening participants but may not be following arimaa.com events too closely.  Very few players who ask to participate should be denied.  We do want participants who represent a wide variety of styles and skill levels.  This also has the benefit of (probably) reducing the number of uncompleted game pairs that we just have to ignore under the current format.

There are several ways to implement this; I'll suggest a few here:
-the TD maintains a list of players allowed to start screening games, and you have to contact him and be approved on the front end before starting
-anyone can start screening games as under the current system, but players the TD doesn't approve of are removed from the official screening list after the game(s) are completed
-there is a registration process, but to avoid involving the TD every single time a player signs up, it is automated and really straightforward.  The TD can then remove people who shouldn't be playing at any time, either before or after screening games are played

My two biggest concerns about this change are that a) it's going to be more work for the director and b) it may limit the number of games played in the screening, which could make the results less reliable (concern #4 above).  I have two other ideas I'll share that might work better.

1.  Allow anyone who wants to play the bots to do so, as many times as they want.  Then let the TD pull out pairs of games (there would have to be some rules about how this process goes) which actually count toward the screening.  More games means a (theoretically) more reliable result, and if the TD can remove games played under questionable circumstances, then foul play is much less likely to skew the verdict.  However, this may be perceived as "moving the goalposts", as it does allow humans more tries to work out the bots' weaknesses.

2.  Raise the minimum number of rated games required to participate in the screening (I think 50 is about right).  Right now, I believe the cutoff is 10 games, and it exists to prevent players from creating another account to play more screening games.  Raising the limit would make this harder to do.  Virtually all (or possibly all) of the current participants have played far more than 50 games, and virtually everybody who has played fewer won't have the experience necessary to provide decisive wins in the screening (in my opinion).  This is very easy to implement, and I think it will help make the screening at least a little more secure.

So my first alternative idea is to make things more inclusive, and the second is to make it less so.  Both directions have their pros and cons.

Nice post Fritz, I'm curious about what others think :)

Title: Re: Challenge Screening Rules
Post by Belteshazzar on Mar 15th, 2015, 4:08pm

on 03/15/15 at 14:26:17, Fritzlein wrote:
The current rules seek to meet two objectives simultaneously.  One is to discourage developers from trying to win the Computer Championship instead of targeting the Arimaa Challenge.  It might be possible to create an anti-bot bot that would do well in the Computer Championship but be easily exploitable by humans.  Taking the top two finishers from the Computer Championship and selecting the one that scores better against humans in the Screening is intended to give the opportunity to win the Challenge to the bot that is most geared towards doing so.


I remarked earlier that if sharp lost the screening, the screening process would have to be reconsidered.  Although not very likely, it is still perhaps possible that Z could win the screening, even though it's clearly weaker than sharp.  What about simply putting the Computer Championship winner up for screening by itself (especially if it decisively won the CC)?  If it does poorly, then the second place finisher gets a chance.

Title: Re: Challenge Screening Rules
Post by browni3141 on Mar 15th, 2015, 10:57pm
I definitely think that the screening games should have a delay. I don't like that active spectating is discouraged because the games are not delayed. Even though I'm not discouraged, some people are and I miss their comments.

Title: Re: Challenge Screening Rules
Post by deep_blue on Mar 16th, 2015, 12:04am
I agree with browni. Then I could also be in the chatroom during my games without anyone accidentally commenting on the game in progress. ;)

Title: Re: Challenge Screening Rules
Post by Boo on Mar 16th, 2015, 1:13pm

Quote:
I definitely think that the screening games should have a delay.


+1

Personally I would like an option to play screening games with 1min per move. Devoting 8 hours is problematic.

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 16th, 2015, 2:07pm
Regarding the screening delay: I can't think of any disadvantages, and there seem to be a few advantages to having screening games delayed.

For the overall screening process: I would love it if someone knows how to calculate a likelihood of superiority (https://chessprogramming.wikispaces.com/Match+Statistics#Likelihood%20of%20superiority) for the two bots from each of the past screenings. My feeling is that the screening is only marginally better than a coin toss at choosing the better "anti-human" bot. Further, given the constraints of time, community size, involvement, etc., I suspect it is simply impossible to design a screening that consistently discriminates well between the two bots. Also, I think winning the Computer Championship should count for much more toward a shot at the challenge than just a tiebreak. In summary, I think the current screening method is quite unfair to the CC winner, and any change that further limits the number of participants is, in general, most likely going to make it worse. Given the constraints we have to work with, it seems much better to simply let the CC winner have a shot at the challenge directly.

I do think it's a good idea to expose the challenger to human play before the challenge takes place. It also seems there would be general interest in playing all the CC bots on the championship hardware and settings. So I would propose replacing the screening with a period where all the CC bots are available to play. It would be nice if some controls were put in place so that only one bot runs on a given server at a time. But if something went wrong and multiple bots ran at once, or a game ended due to a server error, there wouldn't be a need for any corrective action on the game (except a possible unrating). In order to keep from moving the goalposts too much, the challenge defenders should probably be restricted from playing the CC winner. The possibility of extra games played against the challenger is somewhat counterbalanced by the assurance of being the challenger if you win the CC, and I expect the number of games will probably remain within the same order of magnitude anyway.

Janzert

Title: Re: Challenge Screening Rules
Post by browni3141 on Mar 16th, 2015, 2:26pm

on 03/16/15 at 14:07:41, Janzert wrote:
Regarding the screening delay: I can't think of any disadvantages, and there seem to be a few advantages to having screening games delayed.

For the overall screening process: I would love it if someone knows how to calculate a likelihood of superiority (https://chessprogramming.wikispaces.com/Match+Statistics#Likelihood%20of%20superiority) for the two bots from each of the past screenings. My feeling is that the screening is only marginally better than a coin toss at choosing the better "anti-human" bot. Further, given the constraints of time, community size, involvement, etc., I suspect it is simply impossible to design a screening that consistently discriminates well between the two bots. Also, I think winning the Computer Championship should count for much more toward a shot at the challenge than just a tiebreak. In summary, I think the current screening method is quite unfair to the CC winner, and any change that further limits the number of participants is, in general, most likely going to make it worse. Given the constraints we have to work with, it seems much better to simply let the CC winner have a shot at the challenge directly.

Janzert


There are also a few things we can do to encourage more screening games. Lowering the time control to 1m/move, like Boo suggested, or to 45s/move would certainly help. I know there are objections to this because it's not the same as the Challenge time control, but right now we just don't get enough games for a fair contest, so I think it is a good sacrifice.

Also, we might be able to increase the number of pairs allowed per player if we shorten the TC. Players who don't want to play even one game now might be willing to play six at 45s/move; six short games can be less of a commitment than one long one. It's a bit of a gambit, given the possibility that only a few people will play all of their pairs and the rest will not change behavior, but again I think it's worth it to try to get more games.

Another thing that could be done is offering an incentive for players who complete all four games, such as a free digital copy of chessandgo's book (given his approval, of course), a discount on the next year's WC entry, a free postal mixer entry, etc.

If we feel that the screening isn't fair to the WCC winner, we could either consider the WCC a qualifier for the main event, or increase the advantage of the WCC winner in the screening.

Title: Re: Challenge Screening Rules
Post by odin73 on Mar 16th, 2015, 3:18pm

Quote:
Several informal discussions of the rules for the Challenge Screening phase have popped up recently.  It has been a long time since the current format was chosen; since then we have gained a lot of experience with it, and also circumstances have changed.  It seems time to revisit the reasons behind the rules thoroughly and formally. [...]
That said, any suggested changes will be considered... Let the ideas flow!

Let me state some short comments:

1. I think the screening with two bots is obsolete. It's a heritage from ancient times, when bot bashing gave human players easy and obvious opportunities to win. That time is over. Let's just go with the strongest CC bot.

2. When only one bot is played, a further issue comes into focus: time. For players with a real life, it's rather unlikely they'll find the time for 4 long games. Just go with one bot, i.e. 2 games per set. If anybody wants to play a bot several times, let him/her do so. The more games the defenders can use to check the bot's behavior, the better.

3. Don't restrict the screening to a few players (as proposed by kzb52). In past years even weaker players contributed games to the process of working out how to win against a bot. Remember aurelian: he managed to beat Ziltoid despite a rating difference of 900 points.

4. I don't think we need a delay for a screening game. It should be handled like any other normal HvB game, where comments in the chat are common and welcome (and absolutely no issue when only a single bot is being screened). Screening is for bot bashing, for nothing else.

5. Time issue #2: think twice about the time control. 1min/move or 90s per move may be more attractive for some players. Let the player choose (as stated by Boo recently), at least between 1min and 2min/move. 2min/move for the Challenge is ok.
Fritzlein, you may find it fun to play 7-8 hours non-stop; most players and I rather don't, imo.

6. Collateral issue: this year the old 2014 bots were exposed to open play rather late. That left little time for everybody (including the defenders) to adapt to the 2014 bots' playing strength, which may have made it more difficult to prepare for the new 2015 bots.

Enjoy!
8)

Title: Re: Challenge Screening Rules
Post by Belteshazzar on Mar 16th, 2015, 8:12pm

on 03/16/15 at 15:18:59, odin73 wrote:
1. I think the screening with two bots is obsolete. It's a heritage from ancient times, when bot bashing gave human players easy and obvious opportunities to win. That time is over. Let's just go with the strongest CC bot.

2. When only one bot is played, a further issue comes into focus: time. For players with a real life, it's rather unlikely they'll find the time for 4 long games. Just go with one bot, i.e. 2 games per set. If anybody wants to play a bot several times, let him/her do so. The more games the defenders can use to check the bot's behavior, the better.


Good points.  However, Fritz expressed concern above that if the CC winner were automatically the challenger, that might cause bot developers to focus too much on performance against other bots.  So we should still have some kind of safeguard in place.  I would suggest that the CC winner be screened alone, and if it turns out to be susceptible to bot-bashing, the second place finisher could then be screened.

Title: Re: Challenge Screening Rules
Post by browni3141 on Mar 16th, 2015, 9:41pm

on 03/16/15 at 20:12:13, Belteshazzar wrote:
Good points.  However, Fritz expressed concern above that if the CC winner were automatically the challenger, that might cause bot developers to focus too much on performance against other bots.  So we should still have some kind of safeguard in place.  I would suggest that the CC winner be screened alone, and if it turns out to be susceptible to bot-bashing, the second place finisher could then be screened.


How could this be done in a fair way? Who gets to decide whether the CC winner is more bashable than the runner up might be without even playing them both?

Title: Re: Challenge Screening Rules
Post by supersamu on Mar 17th, 2015, 5:42pm
I have read the previous forum posts but want to make a summary of what is important to me.

I myself am very much opposed to anything that can be argued to disadvantage the challenger bot. I would really dislike having to hear that we are moving the goalposts.

How can we get a better likelihood of choosing the strongest bot for the challenge and simultaneously not disadvantage the bots compared to the old screening rules?

- Encourage players to play more games, for example by giving a monetary incentive

- Let players play games at shorter time controls; possibly have games at shorter time controls count as less than one full pair for purposes of determining which bot is better, but definitely don't allow players to play more than 4 games in the screening (moving goalposts)

- Not finishing a screening pair could cost Arimaa points

- I don't like letting players play games at shorter time controls without them counting towards the screening, because then players could try weird bot-bashing techniques where, even if the bot manages to win, it gains nothing towards the screening. In that light, even letting shorter games count for less could be seen as encouraging bot-bashing in games with shorter time controls.

Title: Re: Challenge Screening Rules
Post by clyring on Mar 17th, 2015, 6:21pm

on 03/17/15 at 17:42:15, supersamu wrote:
- Not finishing a screening pair could cost Arimaa points

I could also see this scaring me away from starting a pair if I'm not totally sure I will be able to finish it. In that light, maybe not the best way to boost participation.

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 18th, 2015, 8:22pm

on 03/16/15 at 14:07:41, Janzert wrote:
I would love it if someone knows how to calculate a likelihood of superiority (https://chessprogramming.wikispaces.com/Match+Statistics#Likelihood%20of%20superiority) for the two bots from each of the past screenings.


I realized I could simply use bayesElo, although I'm not completely sure its calculation is correct for this situation. But here's what it comes up with for all the past screenings:


Code:
2009: 79% bot_clueless > bot_Gnobot
2010: 65% bot_marwin > bot_clueless
2011: 59% bot_marwin > bot_sharp
2012: 75% bot_briareus > bot_marwin
2013: 65% bot_marwin > bot_ziltoid
2014: 60% bot_ziltoid > bot_sharp


Those numbers certainly reinforce my dislike for the screening.

Janzert
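
For anyone who wants to sanity-check numbers like these without bayesElo: for plain head-to-head results, the chessprogramming page linked above gives a closed-form approximation, LOS = 1/2 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))). Below is a minimal Python sketch of that formula; the 36-35 split in the example is hypothetical, and note that bayesElo derives the screening numbers above indirectly, from each bot's games against a shared pool of human opponents, rather than from head-to-head games (Arimaa has no draws, so no draw term is needed):

Code:
from math import erf, sqrt

def likelihood_of_superiority(wins, losses):
    """Probability that the side ahead in this sample is truly stronger."""
    n = wins + losses
    if n == 0:
        return 0.5  # no evidence either way
    return 0.5 * (1 + erf((wins - losses) / sqrt(2 * n)))

# A hypothetical 36-35 split is barely better than a coin toss:
print("%.0f%%" % (100 * likelihood_of_superiority(36, 35)))  # -> 55%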

Title: Re: Challenge Screening Rules
Post by deep_blue on Mar 19th, 2015, 9:56am
Here are my thoughts:
1. I would agree to a screening where one could play as many games as one wants. Yes, one could just play long enough to find a single weakness, BUT a good artificial intelligence should play strongly enough not to have such easily exploited weaknesses.
Allowing this would solve a bunch of problems: first, no one (except maybe bot programmers) would want to create a duplicate account. Also there would be many more games, which would clearly give a more precise picture of the bots' playing strength, etc.
2. I would agree with shorter time controls being possible.
3. I disagree with odin that only one bot should be allowed in the Screening. Firstly, the WCC is a short event with much luck involved when it's close. Generally I see nothing wrong with playing two bots, and if one proves to do better against humans than against bots, why not?
4. I find odin's idea of not delaying, with moves still being proposed in the chat, interesting. Humanity wants to find the bots' weaknesses, so I could imagine working together on this.
5. It would definitely be wrong to penalise not finishing pairs. Giving a free WC entry for finishing all playable pairs sounds interesting, though.

6. New idea: in the case of unlimited games, one could give players who have played fewer screening games so far higher priority to start games when many players want to play at once.

Title: Re: Challenge Screening Rules
Post by Fritzlein on Mar 19th, 2015, 10:37am
Janzert, what does bayesElo say about the likelihood of superiority of the computer champion based just on the Computer Championship games in 2009-2014?

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 20th, 2015, 8:05pm
Ok, here's the WCC LoS between first and second each year:


Code:
2009: 83% bot_clueless > bot_Gnobot
2010: 67% bot_marwin > bot_clueless
2011: 58% bot_sharp > bot_marwin
2012: 60% bot_marwin > bot_briareus
2013: 69% bot_ziltoid > bot_marwin
2014: 77% bot_sharp > bot_ziltoid
2015: 97% bot_sharp > bot_Z


So the WCC is barely any better at confidently separating the top 2 spots. The difference, to me, is that the WCC is straightforward to change to get better differentiation if desired.

Also for grins I ran a combined LoS using both the Screening and WCC games.


Code:
2009: 90% bot_clueless > bot_Gnobot
2010: 73% bot_marwin > bot_clueless
2011: 53% bot_marwin > bot_sharp
2012: 66% bot_briareus > bot_marwin
2013: 50% bot_ziltoid > bot_marwin
2014: 57% bot_sharp > bot_ziltoid


Janzert

Edit: And since I now have the scripts to do it fairly easily for any event in the database here's the same for past WCs as well:


Code:
2009: 66% chessandgo > Fritzlein
2010: 67% chessandgo > Fritzlein
2011: 83% chessandgo > Adanac
2012: 73% hanzack > chessandgo
2013: 79% chessandgo > Boo
2014: 94% chessandgo > browni3141

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 21st, 2015, 2:42pm
Here's probably the situation that makes me think I wouldn't really like the screening almost no matter how it is set up. Consider if bot_a wins the WCC, bot_a and bot_b go to the screening, and bot_b wins the screening and then wins the challenge.

As either bot's author I wouldn't be terribly happy (less so as bot_a's though ;)). And I think it would open up quite a bit of dissatisfaction all around.

Janzert

Title: Re: Challenge Screening Rules
Post by rbarreira on Mar 21st, 2015, 7:11pm

on 03/20/15 at 20:05:20, Janzert wrote:
Ok, here's the WCC LoS between first and second each year:


Code:
2009: 83% bot_clueless > bot_Gnobot
2010: 67% bot_marwin > bot_clueless
2011: 58% bot_sharp > bot_marwin
2012: 60% bot_marwin > bot_briareus
2013: 69% bot_ziltoid > bot_marwin
2014: 77% bot_sharp > bot_ziltoid
2015: 97% bot_sharp > bot_Z


So the WCC is barely any better at confidently separating the top 2 spots. The difference, to me, is that the WCC is straightforward to change to get better differentiation if desired.

Also for grins I ran a combined LoS using both the Screening and WCC games.


Code:
2009: 90% bot_clueless > bot_Gnobot
2010: 73% bot_marwin > bot_clueless
2011: 53% bot_marwin > bot_sharp
2012: 66% bot_briareus > bot_marwin
2013: 50% bot_ziltoid > bot_marwin
2014: 57% bot_sharp > bot_ziltoid


Janzert

Edit: And since I now have the scripts to do it fairly easily for any event in the database here's the same for past WCs as well:


Code:
2009: 66% chessandgo > Fritzlein
2010: 67% chessandgo > Fritzlein
2011: 83% chessandgo > Adanac
2012: 73% hanzack > chessandgo
2013: 79% chessandgo > Boo
2014: 94% chessandgo > browni3141


This is why I wish the WCC could have many more games in it. 2015 was the only recent year when we can say with any certainty that the best bot won the WCC. The screening seems like a great idea in theory, in practice not so much. Frankly this is part of the reason why I find it hard to justify pouring a lot of time into bot development. It's disheartening to work a lot only to see the bot play just a few important games per year.

To put it more harshly, the current format of the tournaments does not respect the hard work developers put into Arimaa.

If the servers / infrastructure were reliable enough, I would be asking for a WCC with many more lives per bot, but unfortunately this does not seem possible since there are way too many games with server / lag / zombie process problems. While there's a reasonably easy way to solve this last problem (a script to kill processes / reboot the servers after each game), the first two seem harder to tackle without significant work from Omar.

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 21st, 2015, 11:20pm
If I could wish for anything, I would love it if the CC were 1-2 weeks of solid, 24-hour-a-day, back-to-back games. Maybe with increasing TCs, like the WC has, to get even more games. Then 2-3 weeks of human exposure for all the CC bots on the CC hardware, with the challenge played against the CC winner following that.

The CC winner could still have the defenders banned from playing it. I was also initially going to propose a maximum game cap. But given 3 weeks' time, a maximum of 2 concurrent games, and an average of 2-hour games, it's actually impossible to get more than 504 games. That's about 12 times the number of games in the most popular screening year so far, 2011, which had 85 games total: 45 for sharp and 40 for marwin. So in practice the increase in human exposure would almost certainly be well within an order of magnitude.

Janzert

Title: Re: Challenge Screening Rules
Post by rbarreira on Mar 22nd, 2015, 4:47am

on 03/21/15 at 23:20:17, Janzert wrote:
If I could wish for anything, I would love it if the CC were 1-2 weeks of solid, 24-hour-a-day, back-to-back games. Maybe with increasing TCs, like the WC has, to get even more games. Then 2-3 weeks of human exposure for all the CC bots on the CC hardware, with the challenge played against the CC winner following that.

The CC winner could still have the defenders banned from playing it. I was also initially going to propose a maximum game cap. But given 3 weeks' time, a maximum of 2 concurrent games, and an average of 2-hour games, it's actually impossible to get more than 504 games. That's about 12 times the number of games in the most popular screening year so far, 2011, which had 85 games total: 45 for sharp and 40 for marwin. So in practice the increase in human exposure would almost certainly be well within an order of magnitude.

Janzert


I agree 100% with this.

Title: Re: Challenge Screening Rules
Post by deep_blue on Mar 22nd, 2015, 8:39am
I completely agree with rbarreira that the WCC should have many more games. I can also understand very well that a programmer would be more motivated by a longer WCC.
But then I don't see why we should stop the Screening. What if instead the WCC were, say, a decem-elimination (or whatever it would be called; 10 losses means you are out) and the winner's remaining lives counted as a bonus toward his Screening points (so you are rewarded for a clear WCC win)?

Title: Re: Challenge Screening Rules
Post by lightvector on Mar 22nd, 2015, 9:20am

on 03/21/15 at 23:20:17, Janzert wrote:
If I could wish for anything, I would love it if the CC were 1-2 weeks of solid, 24-hour-a-day, back-to-back games. Maybe with increasing TCs, like the WC has, to get even more games. Then 2-3 weeks of human exposure for all the CC bots on the CC hardware, with the challenge played against the CC winner following that.

The CC winner could still have the defenders banned from playing it. I was also initially going to propose a maximum game cap. But given 3 weeks' time, a maximum of 2 concurrent games, and an average of 2-hour games, it's actually impossible to get more than 504 games. That's about 12 times the number of games in the most popular screening year so far, 2011, which had 85 games total: 45 for sharp and 40 for marwin. So in practice the increase in human exposure would almost certainly be well within an order of magnitude.

Janzert


on 03/22/15 at 04:47:35, rbarreira wrote:
I agree 100% with this.


I support this for future years as well.

If people think that the screening should still be used to discriminate between and select one of the bots, I think it's important to choose a format that encourages many, many more games than currently. Which I think means, among other things, offering faster time controls for the screening games (e.g., the user has a choice between 30s/move, 60s/move, and 2m/move for each game pair, and the limit on pairs is more than 2).



Title: Re: Challenge Screening Rules
Post by deep_blue on Mar 22nd, 2015, 10:05am
Well, if even the bot programmer with the best chances of winning the Challenge agrees with my suggestion to play more games, then I think that should be done. Also, many Screening games could be a good opportunity to fix a bot's weaknesses?!
And it would be an increase in serious games, like rbarreira wanted. The only question is whether there really would be that many more games, but that's of course no reason not to try it out. In the case of shorter time controls, though, I think those games should count less (which would also be some incentive to play the slower time control when one has the time, since that is the Challenge time control...).

Title: Re: Challenge Screening Rules
Post by supersamu on Mar 22nd, 2015, 11:00am
It would also be possible to let either the final two developers or the WCC winner decide which screening format to use.
If the two top developers can't agree on one of the formats, the default format (the one we use right now) is used.
Then there is no danger of Omar being accused of moving the goalposts.
It could also be an additional bonus for the WCC winner to be able to choose the screening format.
I also would like to see a WCC with more games.

Title: Re: Challenge Screening Rules
Post by browni3141 on Mar 22nd, 2015, 7:45pm

on 03/22/15 at 11:00:05, supersamu wrote:
It would also be possible to let either the final two developers or the WCC winner decide which screening format to use.
If the two top developers can't agree on one of the formats, the default format (the one we use right now) is used.
Then there is no danger of Omar being accused of moving the goalposts.
It could also be an additional bonus for the WCC winner to be able to choose the screening format.
I also would like to see a WCC with more games.


I think this creates a conflict of interest.

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 23rd, 2015, 2:04am
I finally had the (probably obvious) idea of looking at how the various bots have fared in the gameroom since being added to the server each year. Using only rated HvB games, here is what it looks like (number of games played in parentheses):

Code:
Rated HvB games against CC server bot:
2009: ? bot_clueless (125) ? bot_gnobot (2)
2010: ? bot_marwin (303) ? bot_clueless (0)
2011: 56% bot_marwin (75) > bot_sharp (78)
2012: 58% bot_marwin (35) > bot_briareus (60)
2013: 98% bot_ziltoid (51) > bot_marwin (20)
2014: 52% bot_ziltoid (11) > bot_sharp (14)


Not surprisingly there isn't much appetite to play the bots at the CC time control. This might also suggest that simply increasing the screening period would do little good?

Next I decided to take all the games played by all the various time control versions available on the server. This is of course the dirtiest data yet, but it does provide a large number of games. To expand on that thought a bit: the WCC of course provides the best structure for determining strength, in that it's a fully structured tournament including head-to-head games. The screening has a self-selecting population of opponents that is allowed to stop early. The gameroom has self-selecting opponents who also choose entirely how many games to play, over a long period of time during which their strength is likely to vary significantly, which bayeselo does nothing to account for.

Anyway here is a combined table, also listing the number of games played by each bot.


Code:
Year  Screening                         | WCC                              | Gameroom
2009: 79% clueless (27) > gnobot (26)   | 83% clueless (8) > gnobot (8)    | 99% gnobot (864) > clueless (1642)
2010: 65% marwin (30) > clueless (25)   | 67% marwin (10) > clueless (9)   | 100% marwin (3545) > clueless (338)
2011: 59% marwin (40) > sharp (45)      | 58% sharp (9) > marwin (8)       | 99% sharp (3007) > marwin (1291)
2012: 75% briareus (41) > marwin (39)   | 60% marwin (10) > briareus (11)  | 93% briareus (1040) > marwin (901)
2013: 65% marwin (25) > ziltoid (28)    | 69% ziltoid (10) > marwin (9)    | 95% ziltoid (957) > marwin (789)
2014: 60% ziltoid (36) > sharp (35)     | 77% sharp (9) > ziltoid (10)     | 99% sharp (710) > ziltoid (280)


Janzert

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 23rd, 2015, 2:47am
Now to back up to a higher level. First let me note that in this and most of my posts, when talking about the screening, I mean primarily the process of taking the top two WCC finishers and playing them against humans to determine which of those two plays in the Challenge. I am not arguing against, or really even talking about, exposing the eventual challenger to human opponents before the actual challenge.

Why do we have a screening? It seems the purpose is to prevent an 'anti-bot' bot from winning the WCC and thus depriving a more deserving challenger of the chance. So it should relieve the bot developers from worrying about that scenario and allow them to simply concentrate on making the bot play as well as possible, to give the best chance of winning the challenge. The screening, if working as intended, is supposed to make the challenge harder for humanity to defend. In that case the bot developers should be the primary proponents of the screening. Yet that seems pretty clearly not to be the case. Although it's a fairly small number that have chimed in here, we seem to be the most interested in getting rid of it completely. If the group the screening is supposed to benefit is opposed to it, what are the benefits of having it?

Putting aside whether it's even possible to get a screening in place that does a good job of picking the stronger anti-human bot, to me the screening feels like double jeopardy for the WCC winner. It seems that if the bot was able to win the WCC, it should automatically get the privilege of facing the Challenge.

Janzert

Title: Re: Challenge Screening Rules
Post by rbarreira on Mar 23rd, 2015, 5:36pm

on 03/23/15 at 02:47:34, Janzert wrote:
Why do we have a screening? It seems the purpose is to prevent an 'anti-bot' bot from winning the WCC and thus depriving a more deserving challenger of the chance.


Besides all the good points you made, allow me to pile on here. Even this supposed benefit of the screening is very tenuous, when you consider that it doesn't work anymore as soon as there are two or more "anti-bot bots" participating in the WCC.

But the more important point is that "perfect is the enemy of good" - even if we accept that finding the best "anti-human" bot is an important thing to focus on, as you said there's little prospect of the screening actually doing that (due to lacking enough games from strong enough humans). Furthermore, the idea of allowing shorter time controls to increase the number of screening games might solve this, but then again it might make it worse as well -- humans have a harder time beating top bots at shorter TCs, making the games less informative if there are few human wins.

To sum up... given all the data we have I agree that the best practical chance of finding the best bot for the challenge would come from having more WCC games.

Title: Re: Challenge Screening Rules
Post by Fritzlein on Mar 24th, 2015, 1:28am

on 03/21/15 at 23:20:17, Janzert wrote:
The CC winner could still have the defenders banned from playing it. I was also initially going to propose a maximum game cap.

If individuals are not limited in the number of Screening games they are allowed to play, should we worry that a small number of individuals will monopolize the two servers, such that many people won't ever find the bot available at a convenient time?


Quote:
But given 3 weeks' time, a maximum of 2 concurrent games, and an average of 2-hour games, it's actually impossible to get more than 504 games. That's about 12 times the number of games in the most popular screening year so far, 2011, which had 85 games total: 45 for sharp and 40 for marwin. So in practice the increase in human exposure would almost certainly be well within an order of magnitude.

Wait, are you saying that as long as we don't let humans play ten times as many games as before, you aren't worried about moving the goalposts?  Say, triple the number of Screening games against the eventual Challenger would not be a problem?  (Note that having a single bot available for play on both servers instead of two bots in itself doubles the exposure of that bot.)

I have always wanted as many different humans as possible to participate in the Screening, but somehow to me that feels less hard on the bots than allowing a few individuals to play over and over and over.  Not only does the former make the Screening more of a community event than the latter, allowing the latter would, I think, significantly increase the ability of humans to defend the Challenge due to the efforts of dedicated bot-bashers.

One can make a case that the current exposure of the bots is too little, but it seems clear-cut that increasing it now would create bad publicity.  Surely there is a way to change the format to address your concerns about it without also making it harder for the Challenge to be won?

Title: Re: Challenge Screening Rules
Post by deep_blue on Mar 24th, 2015, 9:50am
I still think the Screening is a good thing, and that a much larger WCC should be used, with the winner's remaining lives counting as points toward the Screening.
Fritzlein, I seem to have forgotten to mention it: my idea was that players with fewer games so far get higher priority to play. That means one server can only be played on by players who have fewer games than some arbitrary number (e.g. 1). The other server(s) can be used by everyone, and this way it's very unlikely that someone with few games can't get any more games (since it's very rare anyway for more than one game to be running at the same time).

Title: Re: Challenge Screening Rules
Post by Janzert on Mar 25th, 2015, 1:40pm
To distinguish it from a screening period used to determine the challenger, I'm going to start calling the scenario where the bots are available for play but the challenger is already determined an "exhibition period".


on 03/24/15 at 01:28:45, Fritzlein wrote:
If individuals are not limited in the number of Screening games they are allowed to play, should we worry that a small number of individuals will monopolize the two servers, such that many people won't ever find the bot available at a convenient time?


Maybe, maybe not. If it is a problem, or even just enough of a concern that it could become one, there are a number of straightforward and relatively easy to implement ways to mitigate it. To name a few: cool-down periods (static or increasing), a queue, or a threshold after which a player can only start a game if both servers are currently free.

One important point to note is that, should a player circumvent whatever control is in place, it doesn't in any way threaten the integrity of the exhibition period.
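
To make the first of those mechanisms concrete, here is a minimal Python sketch of an increasing cool-down; the 30-minute base and the doubling schedule are invented for illustration, not proposed rules:

Code:
import time

BASE_COOLDOWN = 30 * 60  # seconds; wait after a player's first game (illustrative)

class CooldownTracker:
    """Increasing cool-down: each finished game doubles the wait before the next."""

    def __init__(self):
        self.games_played = {}  # player name -> games finished so far
        self.last_finish = {}   # player name -> when that player's last game ended

    def record_finish(self, player):
        self.games_played[player] = self.games_played.get(player, 0) + 1
        self.last_finish[player] = time.time()

    def may_start(self, player):
        games = self.games_played.get(player, 0)
        if games == 0:
            return True  # a player's first game is always allowed
        wait = BASE_COOLDOWN * 2 ** (games - 1)  # 30m, 1h, 2h, 4h, ...
        return time.time() - self.last_finish[player] >= wait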

Regarding my earlier point that the maximum number of games a bot could play would be ~500, or ~10 times the largest screening to date:

Quote:
Wait, are you saying that as long as we don't let humans play ten times as many games as before, you aren't worried about moving the goalposts?  Say, triple the number of Screening games against the eventual Challenger would not be a problem?  (Note that having a single bot available for play on both servers instead of two bots in itself doubles the exposure of that bot.)


(The 2 server setup is accounted for in the above number; 3 weeks * 7 days * 24 hours * 2 servers / 2 hours per game = 504 games)

I'm saying that, as a bot developer, going from a screening period that has historically had a maximum of ~50 games played against the bot to an exhibition period that could theoretically have up to ~500 games is a tradeoff I don't even have to think about. It's an instant yes, please make the change.

One other thing to note is that this proposal only doubles the theoretical maximum number of games that could be played against the challenger. Of course, under the current setup it would be a bit harder to approach that theoretical limit, since it would require the coordination of more than 120 players. :)

Three times the number of games in practice would still be an instant yes. Somewhere around five times the number of games actually played I might start needing to actually think about the proposition. ;)


Quote:
I have always wanted as many different humans as possible to participate in the Screening, but somehow to me that feels less hard on the bots than allowing a few individuals to play over and over and over.  Not only does the former make the Screening more of a community event than the latter, allowing the latter would, I think, significantly increase the ability of humans to defend the Challenge due to the efforts of dedicated bot-bashers.


One scenario that has always worried me about the challenge is to have a bot win it, and then shortly afterward a bot basher finds a simple trick that always beats the bot, where the trick could easily be taught to a beginner without needing to understand good Arimaa play in general. The screening was supposed to help mitigate this, but I don't think 2 games per human really accomplishes much. The general development of the bots and of bot bashing has done more to reassure me. But allowing the bot bashers to really bot bash for a few weeks would be even better.


Quote:
One can make a case that the current exposure of the bots is too little, but it seems clear-cut that increasing it now would create bad publicity.  Surely there is a way to change the format to address your concerns about it without also making it harder for the Challenge to be won?


Where does that bad publicity come from? If the group the change is supposed to hurt are its primary proponents, how can someone say the change was made maliciously against that group?

In reality I doubt any change to the screening would even be noticed outside of the Arimaa community proper. The most detailed accounting of Arimaa happenings I know of, outside of an Arimaa-specific site, is the Wikipedia page (http://en.wikipedia.org/wiki/Arimaa). Currently the Wikipedia page, while mentioning the change to 3 challenge defenders, doesn't have a single mention of the screening. Which leads to the rather weird situation of having no explanation for why the WCC winner is not always the challenger. I expect someone will probably fix this now that I've mentioned it though. :)

Janzert

Title: Re: Challenge Screening Rules
Post by Boo on Mar 26th, 2015, 2:27am
For me it seems strange that not all games are taken into account when determining the bots' strength. I am talking about uncompleted pairs. The sample of games is small already, and the current system makes it even smaller.
Maybe awarding 0.5 for being ahead in an uncompleted pair, or counting total win%, would be better solutions.
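
As a concrete toy version of that 0.5-for-leading rule (the data layout and the exact interpretation of "ahead" here are invented purely for illustration):

Code:
def screening_score(pairs):
    """pairs: one (bot_wins, human_wins) tuple per started pair, at most 2 games each."""
    score = 0.0
    for bot_wins, human_wins in pairs:
        games = bot_wins + human_wins
        if games == 2:    # completed pair: bot scores 0, 0.5 or 1
            score += bot_wins / 2.0
        elif games == 1:  # uncompleted pair: half credit if the bot is ahead
            score += 0.5 if bot_wins == 1 else 0.0
    return score

# Two completed pairs (a split and a sweep) plus one uncompleted bot win:
print(screening_score([(1, 1), (2, 0), (1, 0)]))  # -> 2.0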

Title: Re: Challenge Screening Rules
Post by Janzert on Apr 1st, 2015, 8:41pm
Pulling a related comment out of another thread to reply here.


on 04/01/15 at 01:44:09, Fritzlein wrote:
Incidentally, my expectation that a win-on-score formula [against sharp2015] exists even though we haven't quite discovered it yet...


I feel very much the same.


Quote:
...reinforces my belief that it would be meaningfully unfair to bots to allow individuals unlimited Screening games in which to work out formulaic wins.


Yet have the opposite reaction from it. ;)

For me one of the big reasons to expose the challenge bot to human players pre-challenge is to try to avoid the situation where a bot wins the challenge and then a 'formulaic' method to beat the bot is found shortly thereafter. As a developer, having to guard against humans developing bot-bashing tricks doesn't bother me, because that should be a much easier goal, and really a sub-task of the primary goal of developing a bot that plays strategically and tactically better than humans anyway.

Janzert

Title: Re: Challenge Screening Rules
Post by Belteshazzar on Apr 17th, 2015, 8:03am
What about a screening for potential human defenders?  Chessandgo seemed like a safe choice, but his losing streak against Sharp2014_Fast would have been a huge red flag had it occurred before he was selected.  In the perhaps unlikely event that the Challenge survives past this year, anyone who wants to defend should have to undergo some kind of screening against the previous year's bots (perhaps the Fast versions, to save time) running on the dedicated server.  If more than three people pass, the tiebreaker could be how they finish in the WC.  If fewer than three people pass, the other top WC finishers could get the remaining spots.

Title: Re: Challenge Screening Rules
Post by mattj256 on Apr 17th, 2015, 2:53pm

on 04/17/15 at 08:03:15, Belteshazzar wrote:
What about a screening for potential human defenders?

I thought that Omar and Aamir currently choose who the defenders will be each year?  I don't have a problem with that, because they are putting up most of the prize fund.

Regardless of the outcome of the tournament, I can't fault them for selecting a six-time world champion as a defender.  And if lightvector wins, I will tip my cap to him and say that he absolutely deserves the prize.  I hope harvestsnow and browni3141 have more success than chessandgo in defending the challenge!

I think it's premature to talk about next year's defenders until we know how Sharp does this year.  If browni3141 and harvestsnow both defend the challenge it will be a very different conversation than if Sharp runs the table.

Title: Re: Challenge Screening Rules
Post by Belteshazzar on Apr 17th, 2015, 5:28pm

on 04/17/15 at 14:53:08, mattj256 wrote:
I think it's premature to talk about next year's defenders until we know how Sharp does this year.  If browni3141 and harvestsnow both defend the challenge it will be a very different conversation than if Sharp runs the table.


If both defend the challenge, we might forget how close it came to falling.  I'd rather not see the challenge fall until a bot is truly unbeatable per the current understanding of Arimaa.  If the challenge survives this year, next year we should narrow the field to the players best suited to defend against a super-strong bot.  Maybe by then chessandgo will qualify, but this year he wouldn't have.

Title: Re: Challenge Screening Rules
Post by Fritzlein on Apr 18th, 2015, 5:43pm
Well, now all this discussion is moot.  There is no need to worry about moving the goalposts after a goal has been scored.  :)  I suppose now the thing to do is simply put the winning bots up for a period of open play after the Computer Championship, while the servers are still being rented, and then immediately make them into regular server bots.  Assuming we still have an annual Computer Championship.

Title: Re: Challenge Screening Rules
Post by Belteshazzar on Apr 29th, 2015, 8:25pm
It looks like this is not moot after all.  Does anyone have any thoughts on my idea of screening potential Challenge defenders using the previous year's Fast bots running on the dedicated server?

Title: Re: Challenge Screening Rules
Post by Belteshazzar on May 1st, 2015, 8:58pm
All this talk about psychological pressure gave me another idea.  If the Challenge really does continue next year, the best defender should go first.  This year, that would have been browni.  Had he gone first in the first round, he likely would have felt less pressure.  Even if he had lost, the other two would have known that browni had a reasonable chance of winning his next two, so his loss wouldn't have put too much additional pressure on them.  If all three lose their first game, there will be great pressure in the second round regardless of the order, but if humans start 0-4 including two losses by the best defender, the challenge is probably lost anyway.

Title: Re: Challenge Screening Rules
Post by mattj256 on May 2nd, 2015, 12:00am

on 05/01/15 at 20:58:40, Belteshazzar wrote:
If the Challenge really does continue next year, the best defender should go first.
Yes, that is a good idea, and there is real value in minimizing psychological pressure on the strongest human defender.  (If you're going to go this route, I would say you might as well let the strongest defender choose which position they want.)  The sixth game in particular must have been intense, because the challenge really WAS riding on browni's shoulders.

In the current format each round lasts one week, so it takes three weeks to play nine games.  If you start reordering games by player strength then you have to either accept a longer tournament or ask the humans to play at times that are less than ideal.  The first option is a little bad and the second option is very bad.

Title: Re: Challenge Screening Rules
Post by deep_blue on May 2nd, 2015, 11:33am
Now that the Challenge has been won, I am not sure we need the Screening anymore. Alternatively we could do the following:
All players are allowed to play as many games as they want against the top 2 bots. This at the same time IS the challenge: anyone who plays at least 3 games against the bot has a mini-match that counts, and if you win 3/5 or 4/7 you have, of course, also won. So we could have the whole community trying to win the challenge back from the bots. Also we would see whether even the second-best bot beats humanity convincingly.


