Downtime Saturday 21/11/2009 for servers #8-11, 08:32-21:50 GMT (13h 18m), Incident report
Nov 22 2009, 02:01 AM
Owner of [DumB] | Group: Management | Posts: 9,150 | Joined: 4-January 04 | From: Heaven. | Member No.: 37
Incident type: Hardware failure
Servers affected: DumB-003 (VM1, VM2). Game Servers #8*, #9*, #10, #11
Duration: 13 hours, 18 minutes
* Only affected during replacement/rebuild, approx. 16:00-21:00 (~5h).

What happened?
At 08:32 this morning a crucial component in one of our co-located server machines used for game servers unexpectedly failed. Our back-end monitoring immediately notified me about this, and after investigating the issue with DC technicians it was determined that a replacement part was needed. It was not easy to get hold of this at such short notice for same-day delivery on a Saturday; however, the part was eventually tracked down, ordered and delivered to the DC late Saturday afternoon. Rebuilding the server unit and getting the system back online and fully operational was not a trivial task, but we managed to get everything fully restored by 21:50 GMT.

Have any measures been taken to limit the effect of a similar incident in the future?
Yes. We now have spare stock to allow speedy replacement in the unlikely event of the same component failing again in the near future (in this or any of the other co-located servers we have). I have also improved our snapshot procedure so that the service can more easily be restored temporarily on another host while we work on resolving the main issue.

Is there anything we could have done better during the incident?
Yes. Unfortunately I was not at home when this happened, which made communication a little difficult while working to resolve the issue as soon as possible. Some may therefore have been unaware of what was going on, and that the issue was being worked on, up until early afternoon. Information flow will be improved in the future.

Due to a mis-tagged power socket, game servers #1, #3, #4 and #6 suffered a momentary outage at 15:51. The mistake was quickly picked up, and all 4 servers were back online by 15:57. The sockets are now properly tagged to avoid future mix-ups.
A little over 13 hours of downtime is, however, not that bad for a community server hit by a failure of this nature in the middle of a weekend, so at the end of the day I think we can be reasonably pleased with the recovery.

If you missed a Saturday of gaming due to the issues, we hope to see you back on the servers Sunday morning to catch up on all the missed playtime! If you come across any issues, please make sure to report them here, and we will do our best to resolve them.

Happy gaming!

PS: Hardware does not come free! We would greatly appreciate your support in the form of a donation (of any size) to help cover both unforeseen expenses like these and normal running costs. Thanks in advance!