Reboot for Saturn Server Scheduled for 7/16/20 between 11:00PM and 12:00AM CDT (GMT-5) [Update - Unscheduled Maintenance & Restoration]

July 16, 2020 at 12:47 AM

An issue encountered around 12:10AM CDT caused a short downtime on the Saturn server. We are checking through various configurations and software versions to help improve stability. As preventative maintenance, we will be issuing a reboot to the server on 7/16/20 between 11:00PM and 12:00AM CDT (GMT-5). This server has a very high uptime and some old processes may be to blame for the issues encountered this morning. Expected downtime is 5 to 10 minutes.

Update 11:22PM: The server reboot has been issued; however, we are seeing a possible hardware issue which is preventing the server from coming back online properly. We are working with staff on site to get this issue resolved as soon as possible, and we will post further updates here as soon as we have a solution or more information to provide.

-- 7/17/20 --

Update 12:39AM: We are still working to resolve the issue on the Saturn server. This may be a situation where we need to restore from the most recent backup, and we will begin that process around 1AM if we are unable to remedy the issue by other means, which we are still investigating. We apologize for this outage and thank you for your patience while we work to get service restored.

Update 2:05AM: We are working on the disaster recovery process now and will begin restoring from backups as soon as possible. We expect data loss to be no more than 24 hours, as the last backup completed successfully overnight last night. This will be a much faster recovery solution than trying to work with hardware vendors to identify and remedy (if possible) the issue we encountered with the RAID array on the server. We are not worried about this issue recurring at this time due to the circumstances involved, but we will be investigating further once recovery is complete to ensure, as best as possible, that the issue does not occur again.

Update 4:52AM: The server has been re-provisioned and accounts are coming back online one by one as they are restored at this time. We expect the entire process to take 3-6 hours and will update here with more information as/when possible.

Update 6:56AM: The process is taking longer than expected, though the restores are going smoothly. We do expect some IP assignment issues and are trying to remedy those as quickly as possible. The restoration process is roughly 20% complete at this time.

Update 10:54AM: We are still looking at a good ways to go with the restoration process, and we are doing our best to prioritize primary reseller accounts where possible so resellers can get in and get in touch with their customers as quickly as possible. That said, there are a lot of factors slowing down the process: high inode usage, various system tasks scanning new files, etc. We do anticipate several more hours of account restores to be completed, but the closer we get to the end of the process the more accounts will be online and active. For any users who were seeing issues with CloudFlare connections, this has been fixed now. There is an issue with all accounts being generated on the primary shared IP of the server, but we have reduced TTL for the DNS zones and can correct IP assignments, after restores are completed, with minimal downtime. This has taken much longer than anticipated, and we will have a full report of the cause of the issue, but right now we're focused on getting everything back to normal on the Saturn server. Please note there may be load spikes when larger accounts are restored or when certain indexing/quota operations are performed on the server; these will be intermittent until service is fully restored. If you have questions or are in very dire need of a quick restore of your account, please contact us at the help desk and we will do everything possible to help as quickly as we can.

Update 1:17PM: Still working on restorations and working out some minor issues. One concern to report is it looks like Roundcube settings may have been lost with the new backup software we've been using. We still do not have a firm ETA, but it looks like we're getting through some of the larger accounts which should mean we see the speed pick up soon.

Update 4:41PM: As mentioned before, we will still see some stability issues during larger account restores. As it happens, these tend to coincide with one another causing even slower speeds along the way. We're doing what we can to keep the process rolling as quickly as possible.

Update 7:09PM: Continuing to process account restorations. We have not hit any major issues, though the process is obviously taking much longer than initially anticipated. The primary issue is not related to backup transfer speed (from the backup server to the hosting server); rather, it seems to be a byproduct of the backup restoration process itself once the data is transferred to the hosting server. We will keep this in mind in future tweaks to our backup configurations, but changing how the backup restoration process works is something outside of our control.

To give some more context on what happened overnight before I compile more information and give a more detailed report (likely within 48 hours): Once the server went down for the scheduled reboot last night, drives which had been hot swapped in since the last reboot (replacing failed drives late last year) were dropped from the RAID configuration, and we were unable to rebuild the array with the data intact at that point despite our best attempts to do so. We rebuilt the array fresh, performed some tests to ensure the issue would not recur on subsequent reboots, and applied firmware updates in case the issue lay there originally. We also checked our logging services, which indicated all components of the RAID arrays and hardware were in working order prior to the reboot (these points are checked every few minutes automatically by our monitoring software).

We have not yet been able to conduct a thorough investigation since we are working on restoring websites as quickly as possible, but as soon as we have fully recovered the server and had a chance to ensure no lingering issues remain for user websites, we will provide as much information as we can regarding the cause for the outage. I want to personally thank everyone for their patience throughout this issue today. We still have a ways to go, but we're getting there slowly but surely. Further updates will be posted here.

If you do notice anything broken following the restore of one of your accounts, note that you can self-service restores of missing files or databases via JetBackup in cPanel, though they will sit at the end of the queue until all accounts are finished restoring. Missing files or databases would've been a result of timeouts or service restarts during restore processes rather than actual missing data from the backups - so don't worry, the data is there. This only occurred for a handful of accounts, most of which we caught and remedied, but we do intend to scan for anything missed by the initial restore after the full restore operation is finished.

Update 11:44PM: The restore process is ongoing. Please note there may be intermittent service interruption still throughout the night as the remaining accounts are restored and other maintenance is completed.

-- 7/18/20 --

Update 7:35AM: We have identified the source of the slow speed issue, and the backup restoration process is not entirely to blame. A couple of options are being evaluated, but we do expect a bit more downtime, followed by a much speedier experience while restores continue. We will keep you posted here with more information as soon as possible.

Update 7:40AM: We have evaluated the available options. There will be some more downtime beginning very shortly to address this issue. We do expect a few hours of downtime while we get things sorted. We apologize again for the trouble. Unfortunately, there have been some hardware issues and then some mistakes afterwards which are leading to this, but we do know where we're going and we're going to get there much more quickly than we were otherwise. Thank you for your patience.

Update 9:38AM: Please note ticket responses may be delayed for some time as we are actively working to get service restored promptly and back to full steam. We apologize for the downtime, but we do believe this is preferred to the available alternatives at this time. Further updates to come soon.

Update 11:34AM: I'll start tracking this in a separate blog post. Please check the main Service Announcements page for the next update in the next 30-60 minutes.

Posted in Chicago DC by Matt Eli