Category: "Chicago DC"

Outage / Restore - Saturn Server (7/18/2020)

July 18, 2020 at 12:04 PM

 

We were able to identify why backup restoration was so slow yesterday on the Saturn server, and why the server itself was very unstable during the restore process. Unfortunately, correcting the cause required another re-image of the server. Please note that no data has been lost. If you made significant changes to your sites yesterday after the restoration of your account was completed, please contact us later today and we can retrieve that data. The good news is that the restores are going much more quickly today (at the rate we expected them to be going yesterday). We will prioritize the same accounts requested yesterday as best as possible. Yesterday we could only run about 5 concurrent restore operations and server stability was very poor. After resolving the issue with the server speed (more details on that later when I have more time to write here), we can now run 15-20 concurrent restore operations, each one completes faster individually, and server stability is much better (any slowdowns will generally be related to file locks and network saturation from backup transfers rather than disk I/O, as was the case yesterday).

I understand this news is extremely difficult to hear - I wanted to avoid this at all costs and we considered other options, but this was likely the least complicated and fastest solution available to us. I know some of you are waiting on support ticket responses today, and I apologize for the delay. As we finish restoring the prioritized accounts again, we will update the associated tickets to confirm. If you want to yell at us (or me personally), I completely understand; just please know I am focused on getting you back online, and with the speed improvements behind the scenes, we can do that more quickly today.

Update 3:05AM (7/19): The restore process is going smoothly. We cannot yet give a precise timeframe for completion, but the restores are moving faster now that most of the larger accounts have been restored, and we expect full restoration to be complete by early afternoon or sooner, barring any slowdowns (or speedups). All of the prioritization requests have been processed, so if you have any customer sites you would like prioritized, please let us know and we can move them closer to the top of the list without much trouble at this point. Service speed should be quite good; however, as we saw a few minutes ago, system crons and certain operations can still lock things up for a bit with the extra transfer load added in.

Update 12:08PM (7/19): We are very near the end of the restore process. As long as there isn't a surprising number of large accounts remaining, we should finish very soon, and speeds will return to normal without intermittent delays. The restores are on track to finish within 1-2 hours.

Update 12:56PM (7/19): The restoration process has finished at this time. We will compose a full account of the cause of the outage and the problems encountered with the first recovery attempt as soon as possible - please expect this information within 24-48 hours, as we need to monitor for issues and handle ongoing requests in the immediate term. A very brief synopsis until the full report can be compiled: despite completely normal reports from the RAID array and drives prior to the scheduled reboot on Thursday 7/16, which occurred around 11:15PM CDT, both RAID arrays (the HDD backup drives and the primary SSD drives) fell apart during the reboot, aside from two disks. Essentially, drives which had been hot-swapped in since the prior reboot (replacing failing or failed drives) dropped out of their respective RAID arrays. Recovering the arrays from that point was deemed impossible (or would have taken too long, even just to find the experts necessary to consider attempting it), so a restore was initiated around 4AM that morning. The first recovery operation had to be scrapped due to a misconfiguration which placed the operating system on the HDDs rather than the SSDs. We considered mirroring the HDDs to the SSDs at that point, but in all likelihood that would have taken about as long as simply re-imaging again and starting fresh on the SSDs. There is a lot more to add, but we are keeping this brief in the interest of providing timely support over the next 24-48 hours while it is needed most.

A couple of important notes:

  • The backup restoration is complete; however, if you notice any issues with the integrity of your files, please use the JetBackup utility in cPanel to restore the files again or, if you made changes during the first restore process and those changes are no longer present, open a support ticket and we can get the files re-copied to your account. The only data lost would be from between the time of the latest JetBackup snapshot in your cPanel and the first outage - for all accounts this should be a window of less than 24 hours.
  • RvSkin is not yet re-enabled. We are just waiting on an update from the RV staff so we can get this re-installed properly.
  • In a few cases, we've noticed some domains encountering issues with the PHP version in use. If you notice any PHP issues on your account, first try adjusting the PHP version to the one preferred by your site via the cPanel -> MultiPHP Manager area.
  • If you notice any other issues related to services on the Saturn server at this point, please let us know. Such issues would not be related to the restore operation; instead, they would mean we missed something somewhere. Most notably, specially requested PHP modules or server utilities may need to be reinstalled, as they would not be on our typical checklist for server re-imaging.
  • IP assignments have likely not been restored in most cases. Please contact us if you need any domains assigned to different IPs than the ones they were restored onto. DNS TTLs were reduced during the restores, so IP changes should cause only up to ~1 hour of downtime rather than the typical up-to-24 hours for such changes, and we can schedule these operations for specific timeframes upon request (a quick way to check the current TTL on a domain is sketched just below this list).
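
If you'd like to confirm how quickly an IP change will propagate for one of your domains, below is a minimal sketch of a TTL check using the dnspython library. This is just an illustration: example.com is a placeholder for your own domain, and the value your resolver reports may be a cached, counted-down TTL rather than the authoritative one.

```python
# Minimal TTL check for a domain's A record using dnspython (pip install dnspython).
# "example.com" is a placeholder; substitute the domain you want to check.
import dns.resolver

def current_ttl(domain: str) -> int:
    """Return the TTL (in seconds) the resolver reports for the domain's A record."""
    answer = dns.resolver.resolve(domain, "A")
    return answer.rrset.ttl

if __name__ == "__main__":
    ttl = current_ttl("example.com")
    print(f"A-record TTL: {ttl} seconds (~{ttl / 3600:.1f} hours)")
```

A TTL of roughly 3600 seconds or less means resolvers should pick up an IP change within about an hour, which is the window referenced above.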

Update 9:15AM (7/20): RV has resolved the issue preventing installation of RvSkin and this feature should now be properly activated.

Reboot for Saturn Server Scheduled for 7/16/20 between 11:00PM and 12:00AM CDT (GMT-5) [Update - Unscheduled Maintenance & Restoration]

July 16, 2020 at 12:47 AM

An issue was encountered around 12:10AM CDT which caused a short downtime on the Saturn server. We are checking through various configurations and software versions to help improve stability, and, as preventative maintenance, we will be issuing a reboot to the server on 7/16/20 between 11:00PM and 12:00AM CDT (GMT-5). This server has a very long uptime, and some old processes may be to blame for the issues encountered this morning. Expected downtime is 5-10 minutes.

Update 11:22PM: The server reboot has been issued; however, we are seeing a possible hardware issue which is preventing the server from coming back online properly. We are working with staff on site to get this resolved as soon as possible, and we will post further updates here as soon as we have a solution or more information to provide.

-- 7/17/20 --

Update 12:39AM: We are still working to resolve the issue on the Saturn server. This may be a situation where we need to restore from the most recent backup; we will begin that process around 1AM if we are unable to remedy the issue by other means, which we are still investigating. We apologize for this outage and thank you for your patience while we work to get service restored.

Update 2:05AM: We are working on the disaster recovery process now and will begin restoring from backups as soon as possible. We expect data loss to be no more than 24 hours, as the last backup did complete successfully overnight last night. This will be a much faster recovery solution than trying to work with hardware vendors to identify and remedy (if possible) the issue we encountered with the RAID array on the server. We are not worried about this issue recurring at this time due to the circumstances involved, but we will investigate further once recovery is complete to ensure, as best as possible, that the issue does not occur again.

Update 4:52AM: The server has been re-provisioned and accounts are coming back online one by one as they are restored at this time. We expect the entire process to take 3-6 hours and will update here with more information as/when possible. 

Update 6:56AM: The process is taking longer than expected, though the restores are going smoothly. We do expect some IP assignment issues and are trying to remedy those as quickly as possible. The restoration process is roughly 20% complete at this time.

Update 10:54AM: We still have a good ways to go in the restoration process, and we are doing our best to prioritize primary reseller accounts where possible so resellers can get in touch with their customers as quickly as possible. That said, there are a lot of factors slowing down the process: high inode usage, various system tasks scanning new files, etc. We do anticipate several more hours before account restores are completed, but the closer we get to the end of the process, the more accounts will be online and active. For any users who were seeing issues with CloudFlare connections, this has been fixed. There is an issue with all accounts being generated on the primary shared IP of the server, but we have reduced the TTL for the DNS zones and can correct IP assignments, after restores are completed, with minimal downtime. This has taken much longer than anticipated, and we will have a full report on the cause of the issue, but right now we're focused on getting everything back to normal on the Saturn server. Please note there may be load spikes when larger accounts are restored or when certain indexing/quota operations are performed on the server; these will be intermittent until service is fully restored. If you have questions or are in very dire need of a quick restore of your account, please contact us at the help desk and we will do everything possible to help as quickly as we can.

Update 1:17PM: Still working on restorations and working out some minor issues. One concern to report is that it looks like Roundcube settings may have been lost with the new backup software we've been using. We still do not have a firm ETA, but it looks like we're getting through some of the larger accounts, which should mean the speed picks up soon.

Update 4:41PM: As mentioned before, we will still see some stability issues during larger account restores. As it happens, these tend to coincide with one another, causing even slower speeds along the way. We're doing what we can to keep the process rolling as quickly as possible.

Update 7:09PM: Continuing to process account restorations. We have not hit any major issues, though the process is obviously taking much longer than initially anticipated. The primary bottleneck is not backup transfer speed (from the backup server to the hosting server) but rather a byproduct of the backup restoration process itself once the data has been transferred to the hosting server. We will keep this in mind in future tweaks to our backup configurations, but changing how the backup restoration process works is outside of our control. To give some more context to what happened overnight before I compile more information and give a more detailed report (likely within 48 hours): once the server went down for the scheduled reboot last night, drives which had been hot-swapped in since the last reboot (replacing drives that failed late last year) were dropped from the RAID configuration, and we were unable to rebuild the array with the data intact despite our best attempts to do so. We rebuilt the array fresh, performed some tests to ensure the issue would not recur on subsequent reboots, and applied firmware updates in case that was where the issue originated. We also checked our logging services, which indicated all components of the RAID arrays and hardware were in working order prior to the reboot (these points are checked every few minutes automatically by our monitoring software).
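
For readers curious what checks along these lines look like in practice, here is a minimal sketch of a periodic drive-health poll built around smartctl. This is illustrative only and not our actual monitoring software: the device list, the five-minute interval, and the alert function are all placeholder assumptions.

```python
# Illustrative sketch of a periodic drive-health poll; not production monitoring.
# Assumes smartmontools is installed and the script runs with root privileges.
import subprocess
import time

DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholder device list
INTERVAL_SECONDS = 300              # "every few minutes"

def drive_is_healthy(device: str) -> bool:
    """Run 'smartctl -H' and report whether the overall health assessment passed."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    return "PASSED" in result.stdout

def alert(message: str) -> None:
    """Placeholder for whatever paging or notification hook is in use."""
    print(f"ALERT: {message}")

if __name__ == "__main__":
    while True:
        for dev in DEVICES:
            if not drive_is_healthy(dev):
                alert(f"SMART health check failed on {dev}")
        time.sleep(INTERVAL_SECONDS)
```

As noted above, checks along these lines were reporting every drive and array component as healthy right up until the reboot, which is part of what made this failure so difficult to anticipate.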

We have not yet been able to conduct a thorough investigation since we are working on restoring websites as quickly as possible, but as soon as we have fully recovered the server and had a chance to ensure no lingering issues remain for user websites, we will provide as much information as we can regarding the cause for the outage. I want to personally thank everyone for their patience throughout this issue today. We still have a ways to go, but we're getting there slowly but surely. Further updates will be posted here.

If you notice anything broken following the restore of one of your accounts, note that you can self-service restores of missing files or databases via JetBackup in cPanel, though these jobs will sit at the end of the queue until all accounts have finished restoring. Missing files or databases would be the result of timeouts or service restarts during the restore process rather than data actually missing from the backups - so don't worry, the data is there. This only occurred for a handful of accounts, most of which we caught and remedied, but we do intend to scan for anything the initial restore missed after the full restore operation is finished.

Update 11:44PM: The restore process is ongoing. Please note there may be intermittent service interruption still throughout the night as the remaining accounts are restored and other maintenance is completed.

-- 7/18/20 --

Update 7:35AM: We have identified the source of the slow speeds, and the backup restoration process is not entirely to blame. A couple of options are being evaluated, but we do expect a bit more downtime and then a much speedier experience afterwards while restores continue. We will keep you posted here with more information as soon as possible.

Update 7:40AM: We have evaluated the available options. There will be some more downtime beginning very shortly to address this issue; we expect a few hours of downtime while we get things sorted. We apologize again for the trouble - unfortunately, there have been some hardware issues and then some mistakes afterwards which led to this, but we do know where we're going, and we're going to get there much more quickly than we otherwise would. Thank you for your patience.

Update 9:38AM: Please note ticket responses may be delayed for some time as we are actively working to get service restored promptly and back to full steam. We apologize for the downtime, but we believe this is preferable to the available alternatives at this time. Further updates to come soon.

Update 11:34AM: I'll start tracking this in a separate blog post. Please check the main Service Announcements page for the next update in the next 30-60 minutes.

Network Connectivity Issues (Resolved - 7/2/2020)

July 2, 2020 at 2:27 AM

Around 12:00AM (CDT) we identified a network connectivity issue affecting traffic to all of our US-based hosting services. We got in touch with our upstream provider within a few minutes; they were already aware of an issue with two routers on their network and were working to restore connectivity. Connectivity was restored around 1:30AM (CDT). We are now awaiting a detailed report from the datacenter regarding the issue, which we will also post here on the Service Announcements blog as soon as possible. We would like to thank everyone for their patience and understanding; please know this is not normal (neither for us nor our upstream provider), and we do expect provisions will be put in place to prevent similar issues from occurring in the future.

Reboot for Neptune Server Scheduled for 6/30/2020 between 11:00PM and 12:00AM

June 30, 2020 at 6:22 PM

We have identified an issue on the Neptune server which requires an urgent maintenance reboot overnight tonight. We will perform the reboot between 11:00PM and 12:00AM on 6/30/2020 (tonight). The expected downtime should be less than 15 minutes, and we expect the maintenance to remedy speed issues with email attachment sending operations and backup operations/management.

Unscheduled Reboot for Neptune Shared Hosting Server (11/11/2019)

November 11, 2019 at 5:02 PM

We apologize for a short unscheduled outage today on the Neptune server. We issued a reboot on Neptune around 4:54PM CDT. We are still investigating the cause of the issue, but at this time we believe unusually high memory usage is to blame. We will be actively monitoring for a recurrence of the situation which brought about the issue, and we do not currently expect any further downtime for the Neptune server. Service should be up again as of 5:00PM CDT (total downtime of around 10 minutes).

Update (5:17PM): Customers hosted on IP address 162.249.125.46 may still be offline as a result of the same issue above. We have determined the memory spike was caused by a DDoS attack, which is continuing at this time. We are working on detaching the attacked domain from the server and using other mitigation tactics to get customers on 162.249.125.46 back online as quickly as possible. Thank you for your patience.

Update (6:33PM): Users on 162.249.125.46 should see their sites back online at this time. Please note that connection attempts may feel sluggish due to the ongoing DDoS attack. Services themselves are up and running (speedily), but the network layer is taking a bit longer to respond because it is bearing the brunt of the attack right now in the form of thousands of active connection attempts. We have temporarily configured the server to handle this style of attack, although the downside is the minor connection hiccups we are noticing as a result. We will revert to normal settings once the attack has subsided, and we thank you for your patience.
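
To give a rough sense of the kind of quick check an administrator might run while a connection flood like this is in progress, here is a minimal sketch that tallies established TCP connections per remote address by parsing `ss` output. This is an illustration under stated assumptions, not our actual mitigation configuration; the threshold value in particular is an arbitrary placeholder.

```python
# Illustrative sketch: tally established TCP connections per remote IP using `ss`.
# The threshold below is an arbitrary placeholder, not a production setting.
import subprocess
from collections import Counter

THRESHOLD = 100  # flag any remote IP holding more connections than this

def connections_per_peer() -> Counter:
    """Parse `ss -Htn state established` and count connections by remote address."""
    output = subprocess.run(
        ["ss", "-Htn", "state", "established"],
        capture_output=True, text=True,
    ).stdout
    peers = Counter()
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        peer = fields[-1]            # remote "address:port"
        ip = peer.rsplit(":", 1)[0]  # strip the port
        peers[ip] += 1
    return peers

if __name__ == "__main__":
    for ip, count in connections_per_peer().most_common(10):
        marker = "  <-- over threshold" if count > THRESHOLD else ""
        print(f"{ip}: {count}{marker}")
```

Identifying the heaviest sources this way is what informs the temporary connection-handling settings mentioned above, which we will relax as the attack subsides.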