
Saturn Server Reason For Outage 7/17-7/19/20

July 20, 2020 at 2:29 PM

This report summarizes and goes into detail about the outage and recovery efforts on the Saturn server between 7/17/20 and 7/19/20. Note that all times referenced are in our local time (CDT, GMT-5).

Summary:

On the night of Thursday 7/16/20 we had scheduled a server reboot to help improve server stability following an issue encountered the day before. The uptime of this server was over 1,000 days, and the issue the night before led us to believe a reboot would help clear out a few old processes that may have been impacting server stability. Following the scheduled reboot, the RAID arrays hosting both the primary and backup drive data were no longer present, aside from two disks on the main RAID 10 array. The data was not recoverable in this state, so we began recovery efforts on the morning of 7/17 via re-image and restore from the latest backups, which had been taken the day of the reboot. Exceptionally slow speeds were observed going into the second half of the data restoration process, which led us to investigate possible underlying complications, and we ultimately found that the server recovery had been executed on the wrong array. We evaluated the options for how to proceed and deemed it most suitable to re-initiate recovery on the correct disks, which began on the morning of 7/18. Recovery from backup was then completed in the early afternoon of 7/19.

In depth:

Prior to rebooting a server with significant uptime, it is prudent to be sure proper backups are in order and all systems are operating as expected. We verified that JetBackup (the backup service on the Saturn server) had taken a full set of successful backups, without complications on any accounts, on the same day the reboot was to be performed. We also verified via our hardware monitoring services that no issues were being reported. It is important to note what the RAID array was reporting at this time: OK (CTR, LD, PD, CV). Broken down, this meant no issues were present on any of the key points of the RAID configuration or hardware. The virtual drives (the RAID arrays), the physical disks, the CacheVault (the I/O caching and backup unit), and the controller itself were all scanned within minutes of the reboot and did not report any issues. This monitoring occurs every 5 minutes and checks all of these points each time. We were especially careful to double-check these monitoring points before the reboot because the server had had several disks replaced last year.
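
For illustration, below is a minimal sketch of what a periodic health check along these lines can look like. It assumes a Broadcom/LSI MegaRAID-style controller managed with the storcli64 utility (a reasonable guess given the CacheVault unit, but not a statement about our exact tooling), and the status tokens it looks for can differ between controller firmware and tool versions.

#!/usr/bin/env python3
# Illustrative sketch only -- not our production monitoring. Assumes a
# Broadcom/LSI MegaRAID controller managed via storcli64; output wording
# varies by firmware and tool version.
import subprocess

CHECKS = {
    "CTR": "/c0 show",            # controller summary
    "LD":  "/c0/vall show",       # logical/virtual drives (the RAID arrays)
    "PD":  "/c0/eall/sall show",  # physical disks in all enclosures
    "CV":  "/c0/cv show",         # CacheVault (I/O cache backup unit)
}

# State tokens storcli uses for degraded/failed/missing items; any of these
# in the output means the check should alert rather than report OK.
BAD_STATES = ("Dgrd", "Pdgd", "Offln", "Failed", "Msng", "UBad")

def check(name, target):
    out = subprocess.run(["storcli64", *target.split()],
                         capture_output=True, text=True).stdout
    status = "FAIL" if any(tok in out for tok in BAD_STATES) else "OK"
    return f"{name}={status}"

if __name__ == "__main__":
    # Run from cron every 5 minutes; a real version would page on any FAIL.
    print("RAID health:", ", ".join(check(n, t) for n, t in CHECKS.items()))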

Following the reboot, all disks which had been hot-swapped in the year before had fallen out of their respective RAID arrays. The only two drives remaining with a visible RAID configuration belonged to an unrecoverable portion of the primary RAID array. At this point we made our best effort to recover the failed RAID arrays, but several hours later, having made no progress and after evaluating other options with experts in the field, we decided to start a full rebuild of the operating system (OS) and a recovery from backups. We also updated the RAID card firmware during this process, as we suspected it may have been to blame for the RAID failure, and we tested the integrity of future reboots on the installed drives by performing stress testing prior to beginning the recovery effort (though that is not to say we will be rebooting the server again anytime soon - for sanity's sake!).

During the initial re-configuration of the server, a mistake was made that placed the backup HDDs in RAID 0 instead of RAID 1, which caused the wrong drives to be interpreted as the primary array during the OS re-install. This mistake was not realized until late in the evening of that same day, when it became apparent that something was wrong with the recovery speed. We then evaluated, amongst ourselves and with the experts we consult, whether cloning the drives onto the main array or restarting the recovery process entirely was the better course of action. The clone process could ultimately have failed, would likely have taken several hours to complete, and more backups would still have needed to be restored afterward (about 45% of the data remained to be restored). Restarting the recovery process entirely was the more certain course of action, and the subsequent restore from backup would be much faster. Another option was to continue with the restore process and then attempt the clone or transfer from the restored data later, once everything was back online, but this came with other risks (the primary concern being that the data would sit on a RAID 0 array under extreme load) and would have meant *extremely* slow speeds for the duration. If anything had gone wrong with that process, we would have been forced to restart the recovery completely once again.

It was ultimately decided to restart the recovery process entirely - on the correct drives, and retaining the data from the prior restore operation. We considered this the best course of action not only to get websites back online as quickly as possible but also to limit further downtime during and following the restore process. The recovery process was completed in the early afternoon of 7/19/20, around 1 PM. Server responsiveness and stability immediately returned to normal.

Going forward:

We do everything we can to prevent issues like RAID failure from becoming a reality. RAID failure is the most feared situation in any hosting environment. We checked the array status, we checked the hardware backing the I/O cache, we checked the physical disks - all of this just minutes prior to the scheduled reboot. In fact, the RAID arrays on all of our systems are checked every 5-10 minutes, failed drives are replaced within 24-72 hours, caching is disabled immediately if battery backup or CacheVault units fail, and so on.

Recovery from backups is rarely a quick or easy operation in a shared hosting environment, and we are exceptionally relieved by the outcome in this regard, especially as it relates to data integrity. We regret the oversight made during the first recovery attempt, and this is where we will be making changes to our own internal processes. Generally we reference other active systems and an overall recovery guideline - we keep configurations similar between our shared hosting environments where possible - but that reliance on a generic reference was part of what led to the selection of the wrong drives during this recovery. We will now be retaining a separate recovery plan for each individual system rather than a blanket set of guidelines. We do not want to make worst-case scenarios worse, and this will ensure such complications do not occur again.
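
As a rough illustration of the direction we are taking, the sketch below records the expected array layout for a single server in a per-system recovery plan and compares it against what the OS actually sees before a re-install proceeds. The server entry, sizes, and field names are hypothetical examples, not our actual configuration.

#!/usr/bin/env python3
# Illustrative sketch of a per-server recovery record plus a pre-install
# sanity check. The "saturn" entry, sizes, and field names are hypothetical.
import subprocess

RECOVERY_PLANS = {
    "saturn": {
        # Expected virtual drives as the OS should see them before re-install.
        "primary": {"raid_level": "RAID10", "approx_tb": 4},  # OS + account data
        "backup":  {"raid_level": "RAID1",  "approx_tb": 8},  # JetBackup storage
    },
}

def visible_disks():
    """Return (device, size in TB) for each whole disk presented to the OS."""
    out = subprocess.run(["lsblk", "-b", "-d", "-n", "-o", "NAME,SIZE,TYPE"],
                         capture_output=True, text=True).stdout
    disks = []
    for line in out.splitlines():
        name, size, dtype = line.split()
        if dtype == "disk":
            disks.append((name, round(int(size) / 1e12, 2)))
    return disks

if __name__ == "__main__":
    plan = RECOVERY_PLANS["saturn"]
    print("Expected layout:", plan)
    print("Devices visible:", visible_disks())
    # A human confirms the install target matches the "primary" entry before
    # the OS re-install starts; a backup array built at the wrong RAID level
    # would stand out here as a device with an unexpected size.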

As with any major downtime, lessons were learned that will influence our future preparations and actions. This was easily the worst disaster we've had to manage in our history in web hosting as GeekStorage, and the second worst in my personal experience of nearly 20 years. Our disaster recovery plans were tested heavily over the past few days, and although a mistake was made that prevented a more timely return to service, the loss of data and configurations was minimal, and for that we are very thankful to have planned sufficiently.

We're also now working on sourcing more coffee.

Thank you:

We know stress is constant during an outage, and that concrete answers are often unavailable, which leaves everyone angry, frustrated, or both. We want to thank everyone on the Saturn server for being extremely understanding and patient throughout the outage and recovery effort.

We understand that outages such as this, compounded by the network outage earlier this month, can shake your faith in us as your hosting provider, and can do the same for your business and customers. We hope the above provides an understanding of what happened and how such problems will be prevented to the utmost degree possible in the future, and that this information can be passed on to your clients to help alleviate their concerns going forward as well.