
Outage / Restore - Saturn Server (7/18/2020)

July 18, 2020 at 12:04 PM

 

We were able to identify why backup restoration was slow yesterday on the Saturn server, and why the server itself was so unstable during the restore process. Unfortunately, resolving the cause required another re-image of the server. Please note that no data has been lost. If you made significant changes to your sites yesterday after the restoration of your account was completed, please contact us later today and we can retrieve that data. The good news is that the restores are going much more quickly today (at the rate we expected them to go yesterday). We will prioritize the same accounts requested yesterday as best we can. Yesterday we could run only about 5 concurrent restore operations, and server stability was very poor. After resolving the issue with the server speed (more details on that later when I have more time to write here), we can now run 15-20 concurrent restore operations, each one is going faster individually, and server stability is much better (slowdowns will generally be related to file locks and network saturation from backup transfers, rather than I/O as it was yesterday).

I understand this news is extremely difficult to hear - I wanted to avoid this at all costs and we considered other options, but this was likely the least complicated and fastest solution available. I know some of you are waiting on support ticket responses today, and I apologize for the delay. As we finish restoring the prioritized accounts again, we will update the associated tickets to confirm. If you want to yell at us (or me personally), I completely understand; just please know I am focused on getting you back online, and with the improved speed behind the scenes today, we can do it more quickly.

Update 3:05AM (7/19): The restore process is going smoothly. The restores are not yet at a point where we can give a very accurate timeframe for completion, but they are going faster now that most of the larger accounts have been restored, and we expect full restoration to be complete by early afternoon or sooner, barring any slowdowns (or speedups). All of the prioritization requests have been processed, so if you have any customer sites you would like prioritized, please let us know and we can move them closer to the top of the list without much trouble at this point. Service speed should be quite good; however, as we saw just a few minutes ago, there are still system crons and certain operations that can lock things up for a bit with the extra transfer load added in.

Update 12:08PM (7/19): We are getting very near to the end of the restore process. As long as there aren't a surprising number of large accounts remaining, we should be finishing the process very soon and then speeds will go back to normal without intermittent delays. The restores are on track to finish within 1-2 hours.

Update 12:56PM (7/19): The restoration process has finished at this time. We will compose a full report on the reason for the outage and the problems encountered with the first recovery process as soon as possible - please expect this information within 24-48 hours, as we need to monitor for issues and handle ongoing requests in the immediate term. A very brief synopsis until the full report can be compiled: despite completely normal reports from the RAID array and drives prior to the scheduled reboot on Thursday 7/16 (which occurred around 11:15PM CDT), both RAID arrays (the HDD backup drives and the primary SSD drives) fell apart during the reboot, with only two disks remaining intact. Essentially, drives which had been hot-swapped in since the prior reboot (to replace failing or failed drives) had dropped out of their respective RAID arrays. From that point, RAID array recovery was deemed impossible (or it would simply have taken too long to find the experts necessary to even consider attempting it), so a restore was initiated around 4AM that morning. The first recovery operation ended up being scrapped due to a misconfiguration that caused the operating system to be placed on the HDDs rather than the SSDs. We considered mirroring the HDDs to the SSDs at that point, but in all likelihood the recovery time would have been similar to simply re-imaging again and starting fresh on the SSDs. There is a lot more to add to this, but it is kept brief in the interest of providing timely support over the next 24-48 hours, while it is needed most.
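As an aside for those curious how dropped array members show up in practice: on a Linux host using software RAID (mdadm), a degraded array is visible in /proc/mdstat as an underscore in the member-status brackets. This is a generic illustration, not output from the Saturn server - the server's actual RAID stack may differ (e.g. hardware RAID), and the sample text below is entirely hypothetical.

```shell
# Flag any md array whose member-status brackets show a missing disk
# (an underscore), e.g. [_UU_] means two of four members have dropped out.
check_mdstat() {
  # Reads mdstat-formatted text on stdin; prints the names of degraded arrays.
  awk '/^md/ {name=$1} /\[[U_]+\]/ { if ($0 ~ /_/) print name }'
}

# Hypothetical sample mimicking one degraded and one healthy array:
sample='md0 : active raid10 sdb1[1] sdc1[2]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/2] [_UU_]
md1 : active raid1 sda1[0] sdd1[1]
      524224 blocks super 1.2 [2/2] [UU]'

printf '%s\n' "$sample" | check_mdstat   # prints: md0
```

On a live system you would run `check_mdstat < /proc/mdstat`, or use `mdadm --detail /dev/md0` for per-member state; note that (as happened here) arrays can report healthy right up until a reboot forces re-assembly.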

A couple of important notes:

  • The backup restoration is complete; however, if you notice any issues with the integrity of your files, please use the JetBackup utility in cPanel to re-restore them or, if you made changes during the first restore process and those changes are no longer present, open a support ticket and we can get the files re-copied to your account. The only data lost would be from between the time of the latest JetBackup in your cPanel and the first outage - for all accounts this should be less than a 24-hour window of loss.
  • RvSkin is not yet re-enabled. We are just waiting on an update from the RV staff so we can get this re-installed properly.
  • In a few cases, we've noticed some domains are encountering issues with the PHP version in use. If you notice any PHP issues on your account, first try adjusting the PHP version to your site's preferred version via the cPanel -> MultiPHP Manager area.
  • If you notice any other issues with services on the Saturn server at this point, please let us know. Such issues would not be related to the restore operation and instead would mean we might've missed something somewhere. Most notably, specially requested PHP modules or server utilities may need to be reinstalled, as they would not be on our typical checklist for server re-imaging.
  • IP assignments have likely not been restored in most cases. Please contact us if you need any domains assigned to different IPs than the ones they were restored onto. DNS TTLs should have been reduced during the restores, so IP changes should cause only up to ~1 hour of downtime compared to the typical up-to-24-hours for such changes, and we can schedule these operations for specific timeframes upon request.
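For those who want to verify the reduced TTL themselves: `dig +noall +answer yourdomain.com A` prints the answer records, and the second column of each line is the remaining TTL in seconds. The snippet below is a small, self-contained sketch of reading that column; the domain and record values are placeholders, not real Saturn server data.

```shell
# Extract the TTL (second column) from dig-style answer lines.
ttl_of() { awk '{print $2}'; }

# Hypothetical answer line as dig +noall +answer would print it:
sample_answer='example.com.  300  IN  A  93.184.216.34'

printf '%s\n' "$sample_answer" | ttl_of   # prints: 300
```

In practice you would pipe the live query in (`dig +noall +answer yourdomain.com A | ttl_of`); a value of a few hundred seconds rather than 86400 confirms the TTL was reduced, so an IP change should propagate within minutes to an hour.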

Update 9:15AM (7/20): RV has resolved the issue preventing installation of RvSkin and this feature should now be properly activated.