Full Disclosure

Started by Llewellyn, October 13, 2012, 11:22:49 PM

Llewellyn

Fellow Sylvanites,

Seeing as our Realm was down for more than just an hour or two, I figured I should let you know exactly what happened, why, and what we're doing to prevent it in the future.

The server is configured so that speed and save times are optimized, though at an increased risk of data loss.  That risk is reduced by performing nightly backups to a hard drive array with data redundancy.  This gives us some of the fastest world saves possible, every hour of every day, while only risking a few hours of play time if the worst should happen.  Power surges and brownouts, unfortunately, can cause severe issues with any method of combining multiple hard drives, whether for speed or for backups, and we use both.  Seeing as I frequently have power issues, the Uninterruptible Power Supply (UPS) that I purchased should reduce, if not eliminate, this problem.  Ironically, the one that just arrived was defective, and the following day we had a power blip which took out the shard's hard drive.
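
For those curious, here is roughly what the nightly job looks like in spirit. I'm sketching it in Python with made-up paths (/srv/shard/world, /mnt/raid-backup) and a made-up seven-day retention; the real job differs in the details, but the idea is the same.

#!/usr/bin/env python3
# Sketch of a nightly backup job: archive the live world directory
# onto the redundant drive array and prune old archives.
# Paths and retention below are hypothetical, not our exact setup.
import shutil
import time
from pathlib import Path

WORLD_DIR = Path("/srv/shard/world")    # live world saves (hypothetical path)
BACKUP_DIR = Path("/mnt/raid-backup")   # redundant array mount (hypothetical)
KEEP_DAYS = 7                           # prune anything older than a week

def nightly_backup():
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%d")
    # Produces e.g. /mnt/raid-backup/world-2012-10-13.tar.gz
    archive = shutil.make_archive(str(BACKUP_DIR / f"world-{stamp}"),
                                  "gztar", root_dir=str(WORLD_DIR))
    print(f"wrote {archive}")
    # Prune old archives so the array never fills up.
    cutoff = time.time() - KEEP_DAYS * 86400
    for old in BACKUP_DIR.glob("world-*.tar.gz"):
        if old.stat().st_mtime < cutoff:
            old.unlink()
            print(f"pruned {old}")

if __name__ == "__main__":
    nightly_backup()

Run from cron each night, that is what keeps the worst case at a few hours of lost play.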

While trying to repair the failed disk array, one of the admins remotely shut down the machine instead of rebooting it (this is unfortunately an easier mistake to make than you would expect, since it has to be done from the command line).  The good news was that the restart was only meant to verify the array was now functioning as intended; the bad news was that unless we had another power blip, or someone drove an awful lot of miles to press the power button, we were going to have to wait for me to return from vacation.  Ironically, I believe there was a second power blip, the server turned itself back on, and the drive was indeed functioning.  One of the admins made off-site and local backups once the server came back online, so there was a backup plan in case something happened again.
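
To give you an idea of how easy that mistake is: on Linux, halting and rebooting are one flag apart (shutdown -h now versus shutdown -r now).  Something like the little guard below, which defaults to reboot and makes you confirm a halt, is the sort of thing we will be adding; the script itself is just an illustration, not what was running at the time.

#!/usr/bin/env python3
# Hypothetical guard around the shutdown command: default to reboot,
# and refuse to power off without an explicit confirmation.
# "shutdown -r now" reboots, "shutdown -h now" halts -- one flag
# apart, which is exactly how the mix-up happened.
import subprocess
import sys

def restart(halt=False):
    if halt:
        answer = input("This will POWER OFF the box. Type 'halt' to confirm: ")
        if answer.strip() != "halt":
            sys.exit("aborted; nothing was shut down")
        flag = "-h"
    else:
        flag = "-r"  # reboot is the safe default
    subprocess.call(["shutdown", flag, "now"])

if __name__ == "__main__":
    restart(halt="--halt" in sys.argv)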

There are three issues here to address: power blips that make the system unstable and cause random reboots, better remote management for when I am traveling or on vacation again, and better-managed backups.  Here is what we are going to do to help prevent this from happening again.

1) The power issue will be resolved once a non-defective UPS arrives. This should help mitigate the longer random downtimes that happen when the shard has to be restarted by an admin.
2) Remote management will be reconfigured with redundant means of access, including remotely starting and shutting down the server. This should complete the remote management picture, since every other action we might need can already be performed remotely (short of fixing hardware, obviously).
3) We will be looking into offsite backup solutions, along with more frequent and better-verified local backups (there is a rough sketch of what that might look like after this list). We were lucky this time that the backups were there, but if a more catastrophic failure ever affects the entire machine, I want additional options. Offsite backups may cost something like $5 a month, but they guarantee we keep all of our data, and the shard itself, even if a hurricane comes (or my house catches on fire), let alone a power dip.
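
To make point 3 concrete, here is a rough sketch of the shape an offsite job could take: hash the nightly archive, push it offsite, and confirm the remote copy hashes to the same value.  The host name, paths, and the choice of rsync over ssh are all placeholders; we haven't picked a provider yet.

#!/usr/bin/env python3
# Rough sketch of an offsite backup push with verification.
# Host, paths, and tools (rsync over ssh) are placeholder assumptions.
import hashlib
import subprocess
from pathlib import Path

ARCHIVE = Path("/mnt/raid-backup/world-latest.tar.gz")  # hypothetical
REMOTE = "backup@offsite.example.com"                   # hypothetical host
REMOTE_DIR = "/backups/sylvan"

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def push_offsite():
    local_hash = sha256(ARCHIVE)
    # Copy the archive offsite (rsync resumes partial transfers).
    subprocess.check_call(["rsync", "-a", str(ARCHIVE),
                           f"{REMOTE}:{REMOTE_DIR}/"])
    # Ask the remote end to hash its copy and compare.
    remote_hash = subprocess.check_output(
        ["ssh", REMOTE, "sha256sum", f"{REMOTE_DIR}/{ARCHIVE.name}"]
    ).split()[0].decode()
    if remote_hash != local_hash:
        raise RuntimeError("offsite copy does not match; retry the upload")
    print("offsite copy verified:", local_hash)

if __name__ == "__main__":
    push_offsite()

The verification step matters as much as the upload; a backup that has never been read back is a guess, not a backup.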

Again, sorry for the long wait; it couldn't have happened at a worse time, but please know that we don't take this kind of thing lightly.  If an hour or more of play time is lost and the cause was preventable, we will fix the underlying issue rather than band-aid it or assume it's magically "now working fine".

Now that I have returned home from vacation, I have transferred the shard to an alternate server, and I am going to completely overhaul our usual machine so that it is working perfectly. Please bear with us: this backup machine might not handle the load as well, and there may be some random dropped connections. But rest assured, I am working on this as I type (the data is transferring now), and I will keep at it until everything is fixed.
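
For the transfer, I am not just copying and hoping; the plan is to verify every file made it over intact.  A quick sketch of that check, with placeholder paths for the two machines:

#!/usr/bin/env python3
# Sketch of a post-transfer check: walk the source tree and confirm
# every file exists on the destination with an identical hash.
# Both paths are hypothetical stand-ins for the real mounts.
import hashlib
from pathlib import Path

SRC = Path("/srv/shard")            # original machine's data (hypothetical)
DST = Path("/mnt/altserver/shard")  # copy on the alternate server (hypothetical)

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(src, dst):
    bad = []
    for f in src.rglob("*"):
        if f.is_file():
            copy = dst / f.relative_to(src)
            if not copy.is_file() or file_hash(f) != file_hash(copy):
                bad.append(f.relative_to(src))
    return bad

if __name__ == "__main__":
    mismatched = verify(SRC, DST)
    if mismatched:
        for name in mismatched:
            print("MISMATCH:", name)
    else:
        print("all files verified")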

Llewellyn