Saturn Server Reason For Outage 7/17-7/19/20

July 20, 2020 at 2:29 PM

This report will summarize and go into detail about the outage and recovery efforts on the Saturn server between 7/17/20 and 7/19/20. Note timeframe references are in our local time (CDT, GMT-5).

Summary:

On the night of Thursday 7/16/20 we had scheduled a server reboot to help improve server stability following an issue encountered the day before. The uptime of this server was over 1,000 days, and the issue the night before led us to believe a reboot would help clear out a few old processes that may have been impacting server stability. During the scheduled reboot, the RAID arrays hosting both the primary and backup drive data were no longer present aside from two disks on the main RAID 10 array. The data was not recoverable in this state, so we began recovery efforts on the morning of 7/17 via re-image and restore from the latest backups, taken the day of the reboot. Exceptionally slow speeds were observed going into the second half of the data restoration process, which led us to investigate possible underlying complications; we ultimately found that the server recovery had been executed on the wrong array. Options were evaluated for how to proceed, and it was deemed most suitable to re-initiate recovery on the correct disks, which began on the morning of 7/18. Recovery from backup was then completed in the early afternoon of 7/19.

In depth:

Prior to a reboot on a server with significant uptime, it is prudent to be sure proper backups are in order and all systems are operating as expected. We verified JetBackup (the backup service on the Saturn server) had taken a full set of successful backups on the same day the reboot was to be performed, as expected and without complications on any accounts. We also verified via our hardware monitoring services that no issues were being reported. In this case, it is important to note what the RAID array was reporting at the time, which was the following: OK (CTR, LD, PD, CV). Broken down, this effectively meant no issues were present on any of the key points related to the RAID configuration or hardware. Virtual drives (the RAID arrays), physical disks, the CacheVault (I/O caching and backup unit), and the controller itself were all scanned within minutes of the reboot and did not report any issues. This monitoring occurs every 5 minutes, and all points of concern are checked each time. We were very careful in this case to double-check these monitoring points before the reboot because the server did have several disks replaced last year.
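
For illustration only, here is a minimal sketch of the kind of check this monitoring performs, assuming a status string in the same "OK (CTR, LD, PD, CV)" form quoted above. The parsing rules and the print-based alert are simplified placeholders, not our actual monitoring tooling:

```python
import re

# The four monitoring points described above:
# CTR = RAID controller, LD = logical/virtual drives (the arrays),
# PD = physical disks, CV = CacheVault (I/O caching and backup unit).
MONITORED_POINTS = ("CTR", "LD", "PD", "CV")

def parse_raid_status(status_line: str) -> dict:
    """Parse a status line such as 'OK (CTR, LD, PD, CV)' into a
    point -> healthy? mapping. For this sketch, points listed inside
    the OK(...) group are treated as healthy; anything missing is
    treated as a problem (an assumption, not the real report format)."""
    match = re.match(r"OK \((.*)\)", status_line.strip())
    healthy = set()
    if match:
        healthy = {p.strip() for p in match.group(1).split(",")}
    return {point: point in healthy for point in MONITORED_POINTS}

def check(status_line: str) -> None:
    """Run on a schedule (every ~5 minutes in our case) and alert if
    any monitored point is unhealthy."""
    results = parse_raid_status(status_line)
    failed = [point for point, ok in results.items() if not ok]
    if failed:
        # Placeholder for a real pager/alert integration.
        print(f"ALERT: RAID monitoring reports problems with: {', '.join(failed)}")
    else:
        print("RAID monitoring: all points OK")

if __name__ == "__main__":
    check("OK (CTR, LD, PD, CV)")  # healthy, as reported before the reboot
    check("OK (CTR, PD, CV)")      # array (LD) missing -> alert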

Following the reboot, all disks which had been hot-swapped in during the prior year had fallen out of their respective RAID arrays. The only two drives remaining with a visible RAID configuration belonged to an unrecoverable portion of the primary RAID array. At this point we made our best efforts to recover the failed RAID arrays, but after making no progress for several hours and evaluating other options with experts in the field, we decided to start a full rebuild of the operating system (OS) and recovery from backups. We also updated the RAID card firmware during this process, as we suspected it may have been to blame for the RAID failure, and we stress-tested the installed drives through additional reboots prior to beginning the recovery effort to verify the issue would not recur (though that is not to say we will be rebooting the server again anytime soon - for sanity's sake!).

During the initial re-configuration of the server, a mistake was made which put the backup HDD drives in RAID 0 instead of RAID 1, which caused the wrong drives to be interpreted as the primary array during the OS re-install process. This mistake was not realized until late in the evening that same day, when it became apparent something was wrong with the recovery speed. We then evaluated, together with other experts whom we consult, whether cloning the drives onto the main array or restarting the recovery process entirely was the best course of action. The clone process could ultimately have failed, would likely have taken several hours to complete, and more backups would still have needed to be restored afterward (about 45% of the data remained to be restored). Restarting the recovery process entirely was the more certain course of action, and the subsequent restore from backup would be much faster. Another option was to continue with the restore process and then attempt the clone or transfer later from the restored data once everything was back online, but this came with other risks (the data sitting on a RAID 0 array under extreme load being the primary concern) and would have meant *extremely* slow speeds for the duration. If anything went wrong with that process, we would have been forced to restart recovery completely once again.

It was finally decided to restart the recovery process entirely - on the correct drives, and retaining data from the prior restore operation. We considered this the best course of action not only to get websites back online as quickly as possible but also to limit further downtime during and following the restore process. The recovery process was completed on 7/19/20 in the early afternoon, around 1PM. Server responsiveness and stability then immediately returned to normal.

Going forward:

We do everything we can to prevent issues like RAID failure from becoming a reality; RAID failure is the most feared situation in any hosting environment. We checked the array status, we checked the hardware backing the I/O cache, we checked the physical disks - all of this just minutes prior to the scheduled reboot. In fact, RAID arrays on all of our systems are checked every 5-10 minutes, drive replacement occurs within 24-72 hours of a failure, caching is disabled immediately if battery backups or CacheVault systems fail, and so on.

Recovery from backups is rarely a quick or easy operation in a shared hosting environment, and we are exceptionally relieved with the outcome in this regard, especially as it relates to data integrity. We regret the oversight made during the first recovery attempt, and this is where we will be making changes to our own internal processes. Generally we reference other active systems and an overall recovery guideline (we keep configurations similar between our shared hosting environments where possible), but that was part of what led to the selection of the wrong drives during the recovery process in this case. We will now be retaining separate recovery plans for each individual system rather than a blanket set of guidelines. We do not want to make worst-case scenarios worse, and this will ensure such complications do not occur again.

As with any major downtime, lessons were learned and will influence our future preparations and actions. This was easily the worst disaster we've had to manage in our history in web hosting as GeekStorage, and the second worst in my personal experience of nearly 20 years. Our disaster recovery plans were tested heavily throughout the past few days, and although a mistake was made that prevented a more timely return to service, minimal loss of data and configurations occurred, and for that we are very thankful to have planned sufficiently.

We're also now working on sourcing more coffee.

Thank you:

We know stress is constant during an outage, and concrete answers are often unavailable, which leaves everyone angry, frustrated, or both. We want to thank everyone on the Saturn server for being extremely understanding and patient throughout the outage and recovery effort.

We understand outages such as this, compounded by the network outage earlier this month, can shake your faith in us as your hosting provider, and in turn affect your business and your customers. We hope the above helps provide an understanding of what happened and how such problems will be prevented to the utmost degree possible in the future, and that this information can be passed on to your clients to hopefully alleviate their concerns going forward as well.

Outage / Restore - Saturn Server (7/18/2020)

July 18, 2020 at 12:04 PM

We were able to identify the reason backup restoration was slow yesterday on the Saturn server, and why the server itself was very unstable during the restore process. Unfortunately, the cause of this required us to do another re-image of the server. Please note, no data has been lost. If you made significant changes to your sites yesterday after the restoration of your account was completed, please contact us later today and we can retrieve that data. The good news is the restores are going much more quickly today (at the rate we expected them to be going yesterday). We will prioritize the same accounts requested yesterday as best as possible. We were only able to run about 5 concurrent restore operations yesterday, and server stability was very poor. After resolving the issue with the server speed (more details on that later when I have more time to write here), we can now run 15-20 concurrent restore operations, each one is going faster individually, and server stability is much better (slowdowns will generally be related to file locks and network saturation due to backup transfers rather than I/O as it was yesterday).

I understand this news is extremely difficult to hear - I wanted to avoid this at all costs and we considered other options, but this was likely the least complicated and fastest solution available to us. I know some of you are waiting on support ticket responses today, and I apologize for the delay on that. As we finish restoring the prioritized accounts again, we will update the associated tickets to confirm. If you want to yell at us (or me personally), I completely understand; just please know I am focused on getting you back online, and we now have the speed behind the scenes to do it more quickly today.

Update 3:05AM (7/19): The restore process is going along smoothly. Though the restores are not yet at a point where we can give a very accurate timeframe for completion, they are going faster now that most larger accounts have been restored, and we expect full restoration to be complete by early afternoon or earlier, barring any slowdowns (or speedups). All of the prioritization requests have been processed, so if you have any customer sites which you would like to be prioritized, please let us know and we can move them closer to the top of the list without much trouble at this point. Service speed should be quite good; however, there are still, as we just saw a few minutes ago, system crons and certain operations which can lock things up for a bit with the extra transfer load added in.

Update 12:08PM (7/19): We are getting very near to the end of the restore process. As long as there aren't a surprising number of large accounts remaining, we should be finishing the process very soon and then speeds will go back to normal without intermittent delays. The restores are on track to finish within 1-2 hours.

Update 12:56PM (7/19): The restoration process has finished at this time. We will be composing a fully detailed report on the reason for the outage and the problems encountered with the first recovery process as soon as possible - please expect this information within 24-48 hours, as we do need to monitor for issues and handle ongoing requests in the immediate term. A very brief synopsis before the full report can be compiled: Despite completely normal reports from the RAID array and drives prior to the scheduled reboot on Thursday 7/16, which occurred around 11:15PM CDT, both RAID arrays (HDD backup drives and primary SSD drives) completely fell apart aside from two disks during the reboot. Essentially, drives which had been hot-swapped in since the prior reboot (replacing failing or failed drives) had dropped out of their respective RAID arrays. RAID array recovery was deemed impossible from that point (or it would simply have taken too long to find the experts necessary to even consider attempting it), and so a restore was initiated around 4AM that morning. The first recovery operation ended up being scrapped due to a misconfiguration which caused the operating system to be placed on the HDDs rather than the SSDs. We considered mirroring the HDDs to the SSDs at that point, but in all likelihood it would have meant a similar recovery time versus simply re-imaging again and starting fresh with the SSDs. There is a lot more to add to this, but this is kept brief in the interest of providing timely support for the next 24-48 hours while it is needed most.

A couple of important notes:

  • The backup restoration is complete; however, if you do notice any issues with the integrity of your files, please use the JetBackup utility in cPanel to re-restore the files or, if you made changes during the first restore process and those changes are no longer present, open a support ticket and we can get the files re-copied to your account. The only data lost would be from between the time of the latest JetBackup in your cPanel and the first outage - for all accounts this should be less than a 24-hour window of loss.
  • RvSkin is not yet re-enabled. We are just waiting on an update from the RV staff so we can get this re-installed properly.
  • In a few cases, we've noticed some domains are encountering issues with the PHP version in use. If you do notice any PHP issues on your account, first try to adjust the PHP version to the preferred version for your site via the cPanel -> MultiPHP Manager area.
  • If you notice any other issues related to services on the Saturn server at this point, please let us know. Such issues would not be related to the restore operation and instead would mean we might've missed something somewhere. Most notably things like specially requested PHP modules or server utilities may need to be reinstalled as they would not be on our typical checklist for server re-imaging.
  • IP assignments likely were not restored in most cases. Please contact us if you need any domains assigned to different IPs than the ones they were restored onto. DNS TTL should have been reduced during the restores, so IP changes should only cause up to ~1 hour of downtime compared to the typical up-to-24-hours for such changes, and we can schedule these operations for specific timeframes upon request (see the sketch after this list for a quick way to check your domain's current TTL).
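
For those who would like to verify propagation expectations themselves, below is a minimal sketch of checking the TTL currently served for a domain's A record. It assumes the third-party dnspython library and uses a placeholder domain; neither is part of our own tooling:

```python
# pip install dnspython
import dns.resolver

def a_record_ttl(domain: str) -> int:
    """Return the TTL (in seconds) currently served for the domain's A record.
    A reduced TTL means resolvers will pick up an IP change within roughly
    that window instead of the typical up-to-24-hours."""
    answer = dns.resolver.resolve(domain, "A")
    return answer.rrset.ttl

if __name__ == "__main__":
    domain = "example.com"  # placeholder - replace with your own domain
    print(f"{domain}: A record TTL is {a_record_ttl(domain)} seconds")
```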

Update 9:15AM (7/20): RV has resolved the issue preventing installation of RvSkin and this feature should now be properly activated.

Reboot for Saturn Server Scheduled for 7/16/20 between 11:00PM and 12:00AM CDT (GMT-5) [Update - Unscheduled Maintenance & Restoration]

July 16, 2020 at 12:47 AM

An issue was encountered around 12:10AM CDT which caused a short downtime on the Saturn server. We are checking through various configurations and software versions to help improve stability, and, as preventative maintenance, we will be issuing a reboot to the server on 7/16/20 between 11:00PM and 12:00AM CDT (GMT-5). This server has a very high uptime, and some old processes may be to blame for the issues encountered this morning. Expected downtime is between 5 and 10 minutes.

Update 11:22PM: The server reboot has been issued, however, we are seeing a possible hardware issue which is preventing the server from coming back online properly. We are working with staff on site to get this issue resolved as soon as possible and we will post further updates here as soon as we have a solution or more information to provide.

-- 7/17/20 --

Update 12:39AM: We are still working to resolve the issue on the Saturn server. This may be a situation where we need to restore from the most recent backup, and we will begin that process around 1AM if we are unable to remedy it by other means, which we are still investigating. We apologize for this outage and thank you for your patience while we work to get service restored.

Update 2:05AM: We are working on the disaster recovery process now and will begin restoring from backups as soon as possible. We expect data loss to be no more than 24 hours, as the last backup did complete successfully overnight. This will be a much faster recovery solution than trying to work with hardware vendors to identify and remedy (if possible) the issue we encountered with the RAID array on the server. We are not worried about this issue recurring at this time due to the circumstances involved, but we will be investigating further once recovery is complete to ensure as best as possible the issue does not occur again.

Update 4:52AM: The server has been re-provisioned and accounts are coming back online one by one as they are restored at this time. We expect the entire process to take 3-6 hours and will update here with more information as/when possible. 

Update 6:56AM: The process is taking longer than expected, though the restores are going smoothly. We do expect some IP assignment issues and are trying to remedy those as quickly as possible. The restoration process is roughly 20% complete at this time.

Update 10:54AM: We still have a good ways to go with the restoration process, and we are doing our best to prioritize primary reseller accounts where possible so resellers can get in touch with their customers as quickly as possible. That said, there are a lot of factors slowing down the process: high inode usage, various system tasks scanning new files, etc. We do anticipate several more hours before account restores are completed, but the closer we get to the end of the process, the more accounts will be online and active. For any users who were seeing issues with CloudFlare connections, this has now been fixed. There is an issue with all accounts being generated on the primary shared IP of the server, but we have reduced the TTL for the DNS zones and can correct IP assignments with minimal downtime after restores are completed. This has taken much longer than anticipated, and we will have a full report on the cause of the issue, but right now we're focused on getting everything back to normal on the Saturn server. Please note there may be load spikes when larger accounts are restored or when certain indexing/quota operations are performed on the server; these will be intermittent until service is fully restored. If you have questions or are in very dire need of a quick restore of your account, please contact us at the help desk and we will do everything possible to help as quickly as we can.

Update 1:17PM: Still working on restorations and working out some minor issues. One concern to report: it looks like Roundcube settings may have been lost with the new backup software we've been using. We still do not have a firm ETA, but it looks like we're getting through some of the larger accounts, which should mean we see the speed pick up soon.

Update 4:41PM: As mentioned before, we will still see some stability issues during larger account restores. As it happens, these tend to coincide with one another, causing even slower speeds along the way. We're doing what we can to keep the process rolling as quickly as possible.

Update 7:09PM: Continuing to process account restorations. We have not hit any major issues, though the process is obviously taking much longer than initially anticipated. The primary issue is not related to backup transfer speed (from the backup server to the hosting server) but rather seems to be a byproduct of the backup restoration process itself once the data is transferred to the hosting server. We will keep this in mind in future tweaks to our backup configurations, but changing how the backup restoration process works is outside of our control. To give some more context to what happened overnight before I compile more information and give a more detailed report (likely within 48 hours): once the server went down for the scheduled reboot last night, drives which had been hot-swapped in since the last reboot (replacing failed drives late last year) were dropped from the RAID configuration, and we were unable to rebuild the array with the data intact at that point despite our best attempts to do so. We rebuilt the array fresh and performed some tests to ensure the issue would not recur on subsequent reboots, and applied firmware updates in case the issue originally lay there. We also checked our logging services, which indicated all components of the RAID arrays and hardware were in working order prior to the reboot (these points are checked every few minutes automatically by our monitoring software).

We have not yet been able to conduct a thorough investigation since we are working on restoring websites as quickly as possible, but as soon as we have fully recovered the server and had a chance to ensure no lingering issues remain for user websites, we will provide as much information as we can regarding the cause for the outage. I want to personally thank everyone for their patience throughout this issue today. We still have a ways to go, but we're getting there slowly but surely. Further updates will be posted here.

If you do notice anything broken following the restore of one of your accounts, note that you can self-service restores of missing files or databases via JetBackup in cPanel, though they will sit at the end of the queue until all accounts have finished restoring. Missing files or databases would have been a result of timeouts or service restarts during restore processes rather than actual missing data from the backups - so don't worry, the data is there. This only occurred for a handful of accounts, most of which we caught and remedied, but we do intend to scan for anything missed by the initial restore after the full restore operation is finished.

Update 11:44PM: The restore process is ongoing. Please note there may be intermittent service interruption still throughout the night as the remaining accounts are restored and other maintenance is completed.

-- 7/18/20 --

Update 7:35AM: We have identified the source of the slow speed issue, and the backup restoration process is not entirely to blame. A couple of options are being evaluated, but we do expect a bit more downtime and then a much speedier experience afterward while restores continue. We will keep you posted here with more information as soon as possible.

Update 7:40AM: We have evaluated the available options. There will be some more downtime beginning very shortly to address this issue. We do expect a few hours of downtime while we get things sorted. We apologize again for the trouble; unfortunately there have been some hardware issues and then some mistakes afterwards which led to this, but we do know where we're going and we're going to get there much more quickly than we would have otherwise. Thank you for your patience.

Update 9:38AM: Please note ticket responses may be delayed for some time as we are actively working to get service restored promptly and back to full steam. We apologize for the downtime, but we do believe this is preferred to the available alternatives at this time. Further updates to come soon.

Update 11:34AM: I'll start tracking this in a separate blog post. Please check the main Service Announcements page for the next update in the next 30-60 minutes.

Network Connectivity Issues (Resolved - 7/2/2020)

July 2, 2020 at 2:27 AM

Around 12:00AM (CDT) we identified a network connectivity issue which was affecting traffic to all of our US-based hosting services. We got in touch with our upstream provider within a few minutes; they were aware of an issue with two routers on their network and were working to restore connectivity. Connectivity was restored around 1:30AM (CDT). We are now awaiting a detailed report from the datacenter regarding the issue, which we will also post here on the Service Announcements blog as soon as possible. We would like to thank everyone for their patience and understanding; please know this is not normal (neither for us nor our upstream provider) and we do expect provisions will be put in place to prevent similar issues from occurring again in the future.

Reboot for Neptune Server Scheduled for 6/30/2020 between 11:00PM and 12:00AM

June 30, 2020 at 6:22 PM

We have identified an issue on the Neptune server which requires an urgent maintenance reboot overnight tonight. We will perform the reboot between 11:00PM and 12:00AM on 6/30/2020 (tonight). The expected downtime should be less than 15 minutes, and we expect the maintenance to remedy speed issues with email attachment sending operations and backup operations/management.