Author |
Topic: 2013 WCC R5 sharp vs marwin timeout (Read 1715 times) |
|
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
|
|
2013 WCC R5 sharp vs marwin timeout
« on: Mar 5th, 2013, 8:30am » |
Quote Modify
|
bot_marwin timed out in this game. From the web server logs it seems like the web server process restarted due to the threading module crashing. See the errors at time: Mar 05 03:16:04 http://silver.arimaa.com/logs/httpd/error_log Searching for this error lead me to: https://bugzilla.redhat.com/show_bug.cgi?id=703495 Though it was reported as an OS bug, seems it a Python bug and was fixed in Python version 3.3. However, the default version of Python on the server is 2.6.6. From the web server logs it seems like it also happened at Mar 03 08:18:38 and bot_marwin (or any other bots) was not running at that time. So I would categorize this as a server fault and resume the game from the most recent position with the same reserve times. I add 10 seconds to the reserve of the player to move (bot_marwin) to compensate for the delay between setting up the game and starting the bots.
|
|
IP Logged |
|
|
|
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #1 on: Mar 5th, 2013, 8:50am » |
Quote Modify
|
I tried to resume the game, but sharp didn't join the game and timed out; although marwin did join the game and made a move. So I tried to resume this game and it seems both bots joined and are playing now.
|
|
IP Logged |
|
|
|
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #2 on: Mar 5th, 2013, 11:28am » |
Quote Modify
|
Omar, if you follow the bug report linked above upstream to the wsgi report you'll see that these messages are harmless and generated as wsgi is being shutdown. So they are not the cause of the problem. Apache seems to be shutting down in this case because it received a SIGHUP (finding the cause of that might lead to the real problem). Certainly something out of the ordinary happened on the server and the restart seems to be completely appropriate. But the wsgi error is also not the cause just another result. I'd be happy to try and dig into it further if you'd like so possibly it can be prevented from happening again, but there doesn't seem to be any more clues in these logs as to the original cause. Janzert
|
« Last Edit: Mar 5th, 2013, 11:29am by Janzert » |
IP Logged |
|
|
|
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #3 on: Mar 5th, 2013, 8:19pm » |
Quote Modify
|
That's really strange. I wonder what's sending SIGHUP to the web server. I took a quick look but didn't see what might be causing this. It would be great if you could look at it as well. I'll PM you the login info.
|
|
IP Logged |
|
|
|
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #4 on: Mar 6th, 2013, 6:58am » |
Quote Modify
|
I haven't been on the server to check yet, but as I was going to sleep last night I'm 90% sure I figured it out. My guess right now is that this was caused by log rotation. After this morning's WCC game I'll try to get on and confirm the problem. The easy fix is just don't have games between 3am and 4am server time. The correct fix would be to have the cgi script that starts up the bot daemonize itself, basically detaching from the parent process of the http server. This should take a dozen lines of code or less to accomplish, although I haven't looked at the script yet*. Janzert * If the script does this already then I'm back to stumped as to the cause. But I really think this will the root cause right now.
|
|
IP Logged |
|
|
|
omar
Forum Guru
Arimaa player #2
Gender:
Posts: 1003
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #5 on: Mar 7th, 2013, 12:17am » |
Quote Modify
|
I thought it might be a log rotate also, but since the logs continued I figured that might not be it. Demonizing the bot process sounds like a good idea. I'll look into that.
|
|
IP Logged |
|
|
|
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #6 on: Mar 8th, 2013, 10:06am » |
Quote Modify
|
Only 2 of the ssl logs* were rotated, if any logs are rotated the process has to restart though. Also Omar allowed me to add a bit more logging on the server. Specifically the server state log now includes a 'ps -efjH' call at the end. If you look at the output, for example here at the very end, you can see that while the bot runner scripts are changing the user and parent process the process group and session** are being retained from the web server. So any signal sent to the web server will also be sent to the bots. Janzert * ssl_access and ssl_request ** 4th and 5th column of the ps command output
|
« Last Edit: Mar 8th, 2013, 10:06am by Janzert » |
IP Logged |
|
|
|
rbarreira
Forum Guru
Arimaa player #1621
Gender:
Posts: 605
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #7 on: Mar 11th, 2013, 10:22am » |
Quote Modify
|
Could this problem affect screening games?
|
|
IP Logged |
|
|
|
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #8 on: Mar 11th, 2013, 10:35am » |
Quote Modify
|
Yes, it would be possible for this to happen again right now. I'm (slowly) working on the modifications needed to prevent it in the future, but nothing has been changed yet. Janzert
|
|
IP Logged |
|
|
|
Janzert
Forum Guru
Arimaa player #247
Gender:
Posts: 1016
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #9 on: Mar 13th, 2013, 2:40am » |
Quote Modify
|
I finally got time and all the kinks worked out of my patch to fix this issue and sent it off to Omar. So hopefully in the near future this should no longer be a concern. Janzert
|
|
IP Logged |
|
|
|
tize
Forum Guru
Arimaa player #3121
Gender:
Posts: 118
|
|
Re: 2013 WCC R5 sharp vs marwin timeout
« Reply #10 on: Mar 13th, 2013, 2:14pm » |
Quote Modify
|
Nice done Janzert. And I think the "finally" should be a "already".
|
|
IP Logged |
|
|
|
|