Arimaa Forum (http://arimaa.com/arimaa/forum/cgi/YaBB.cgi)
Arimaa >> Events >> 2013 WCC R5 sharp vs marwin timeout
(Message started by: omar on Mar 5th, 2013, 8:30am)

Title: 2013 WCC R5 sharp vs marwin timeout
Post by omar on Mar 5th, 2013, 8:30am
bot_marwin timed out in this game. From the web server logs it seems like the web server process restarted due to the threading module crashing. See the errors at time: Mar 05 03:16:04

http://silver.arimaa.com/logs/httpd/error_log

Searching for this error lead me to:

https://bugzilla.redhat.com/show_bug.cgi?id=703495

Though it was reported as an OS bug, seems it a Python bug and was fixed in Python version 3.3. However, the default version of Python on the server is 2.6.6.

From the web server logs it seems like it also happened at Mar 03 08:18:38 and bot_marwin (or any other bots) was not running at that time.

So I would categorize this as a server fault and resume the game from the most recent position with the same reserve times. I add 10 seconds to the reserve of the player to move (bot_marwin) to compensate for the delay between setting up the game and starting the bots.


Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by omar on Mar 5th, 2013, 8:50am
I tried to resume the game, but sharp didn't join the game and timed out; although marwin did join the game and made a move. So I tried to resume this game and it seems both bots joined and are playing now.

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by Janzert on Mar 5th, 2013, 11:28am
Omar, if you follow the bug report linked above upstream to the wsgi report you'll see that these messages are harmless and generated as wsgi is being shutdown. So they are not the cause of the problem. Apache seems to be shutting down in this case because it received a SIGHUP (finding the cause of that might lead to the real problem).

Certainly something out of the ordinary happened on the server and the restart seems to be completely appropriate. But the wsgi error is also not the cause just another result. I'd be happy to try and dig into it further if you'd like so possibly it can be prevented from happening again, but there doesn't seem to be any more clues in these logs as to the original cause.

Janzert

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by omar on Mar 5th, 2013, 8:19pm
That's really strange. I wonder what's sending SIGHUP to the web server. I took a quick look but didn't see what might be causing this. It would be great if you could look at it as well. I'll PM you the login info.

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by Janzert on Mar 6th, 2013, 6:58am
I haven't been on the server to check yet, but as I was going to sleep last night I'm 90% sure I figured it out. My guess right now is that this was caused by log rotation. After this morning's WCC game I'll try to get on and confirm the problem.

The easy fix is just don't have games between 3am and 4am server time. The correct fix would be to have the cgi script that starts up the bot daemonize itself, basically detaching from the parent process of the http server. This should take a dozen lines of code or less to accomplish, although I haven't looked at the script yet*.

Janzert

* If the script does this already then I'm back to stumped as to the cause. But I really think this will the root cause right now.

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by omar on Mar 7th, 2013, 12:17am
I thought it might be a log rotate also, but since the logs continued I figured that might not be it.

Demonizing the bot process sounds like a good idea. I'll look into that.

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by Janzert on Mar 8th, 2013, 10:06am
Only 2 of the ssl logs* were rotated, if any logs are rotated the process has to restart though.

Also Omar allowed me to add a bit more logging on the server. Specifically the server state log now includes a 'ps -efjH' call at the end. If you look at the output, for example here at the very end (http://silver.arimaa.com/logs/20130308/14/3901), you can see that while the bot runner scripts are changing the user and parent process the process group and session** are being retained from the web server. So any signal sent to the web server will also be sent to the bots.

Janzert

* ssl_access and ssl_request

** 4th and 5th column of the ps command output

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by rbarreira on Mar 11th, 2013, 10:22am
Could this problem affect screening games?

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by Janzert on Mar 11th, 2013, 10:35am
Yes, it would be possible for this to happen again right now.

I'm (slowly) working on the modifications needed to prevent it in the future, but nothing has been changed yet.

Janzert

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by Janzert on Mar 13th, 2013, 2:40am
I finally got time and all the kinks worked out of my patch to fix this issue and sent it off to Omar. So hopefully in the near future this should no longer be a concern.

Janzert

Title: Re: 2013 WCC R5 sharp vs marwin timeout
Post by tize on Mar 13th, 2013, 2:14pm
Nice done Janzert. And I think the "finally" should be a "already".



Arimaa Forum » Powered by YaBB 1 Gold - SP 1.3.1!
YaBB © 2000-2003. All Rights Reserved.