Category: "Chicago DC"

Service Migration - February 2024

January 30, 2024 at 12:57 PM

In February, we will be migrating all US-based hosting services (Shared Unlimited, Shared Performance, Reseller, VPS, and Dedicated) to a new datacenter. There are also some expected changes coming to all accounts that are important to be aware of. Migrated services will still be hosted in the Chicago area, though some features may change depending on your account type. More information is included below.

Migration

Shared & Reseller Hosting: All shared hosting accounts will be migrated over to new hardware in a new datacenter location in the Chicago area. As part of this migration, IPs will be renumbered for all existing shared hosting clients. All hosting account services will automatically be proxied from the old servers to the new servers during the migration, so we do not expect any downtime for shared hosting clients as long as third-party DNS and proxy services (where in use) are updated to the new hosted IP addresses as quickly as possible following the account migration. Further details on this process will be sent to shared account holders directly via email by Monday, February 5th, 2024.
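
If you manage your DNS with a third-party provider, one quick way to confirm your records point at the new address once you have updated them (shown with a placeholder domain; your new IP will be listed in the migration email) is to query the A record from the command line:

    dig +short yourdomain.com A              # should return the new hosting IP
    dig +short yourdomain.com A @1.1.1.1     # query a public resolver to confirm propagation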

Shared hosting customers will also be moving to an updated operating system (CloudLinux 8) which will feature access to the latest security updates, database & PHP software versions, as well as improved performance overall. We do not expect to reduce feature support with this update; older PHP versions currently supported will still be available on our shared hosting systems, though the .htaccess directives may change. We will take measures to ensure .htaccess directives are automatically updated where necessary so your selected PHP version(s) remain intact through the upgrade.
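
As an illustration only (the exact handler names depend on the PHP versions and packages available on the new servers), a cPanel-style PHP version selection in .htaccess typically looks something like the block below; it is this handler line that would be rewritten automatically where necessary during the upgrade:

    # php -- BEGIN cPanel-generated handler, do not edit
    # Set the "ea-php74" package as the default "PHP" programming language.
    <IfModule mime_module>
      AddHandler application/x-httpd-ea-php74 .php .php7 .phtml
    </IfModule>
    # php -- END cPanel-generated handler, do not edit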

VPS Hosting: All VPS hosting accounts will be migrated over to new hardware and software. IPs will be renumbered, and we will be moving from OnApp to SolusVM for VPS management. We want to keep our VPS services stable and able to offer the latest OS templates, and we believe SolusVM is the better fit for that at this time, so we will be moving away from OnApp during the migration. Because each VPS will need to be re-imaged during this migration (to be ported over to SolusVM), we will need to handle VPS migrations on a case-by-case basis, and we will be in touch with each VPS client by Sunday, February 11th, 2024 to outline the migration process and what to expect.

Expected downtime will also be on a case-by-case basis; cPanel systems should experience no downtime in most cases. New IPs will be assigned to all VPS services, so changes will need to be made in nearly all cases. Please keep an eye on your email and make sure our domain is whitelisted at your email provider so you receive the important details about your service migration. Most cPanel VPS customers will also benefit from this migration in the form of an OS upgrade - CentOS 6 and 7 will be replaced by AlmaLinux 8, offering the latest security benefits and software versions for your VPS.

Dedicated Servers: Dedicated servers will be physically moved from their current location to the new datacenter and will not require IP renumbering. However, this does mean there will be downtime for dedicated server clients as we perform the physical relocation of hardware and transition IP routing over to the new facility. We will send further information on the process and what to expect by Sunday, February 11th, 2024. For clients wishing to transition to a new/updated operating system with cPanel in the near term, we will have alternative options available which would also require less (or no) downtime, but would require IP renumbering and other considerations. We will outline these options in the upcoming email concerning the migration specifics for dedicated server clients.

Service Changes

SpamExperts: We will be retiring support for SpamExperts shortly before the migration window for shared hosting, VPS, and dedicated server customers. We have seen great performance from SpamExperts in filtering inbound spam, but the feature usage has been lower than we expected, and the software itself is closed to our administration, which has caused a number of headaches for us and for customers over the past few years. We will be implementing stronger spam prevention measures on email systems which presently rely on SpamExperts, and we will be looking at robust alternatives again in the near future.

JetBackup: We will be moving to a newer version of JetBackup which we expect to be more stable for our US-based shared hosting clients. EU servers may also be included in the configuration. Additional backup features will become available after the migration, including possible custom backup and location options for individual clients.

DDoS Protection: Although we have not advertised this feature, US-based hosting clients have benefited from 10 Gbps DDoS protection since our last encounter with DDoS attackers a few years ago. The new facility features 40 Gbps DDoS protection, a significant upgrade to our defenses against ransom and takedown attacks directed at us and our customers.

Thank You

We want to say thank you to everyone who has stuck with us through several migrations now, and we hope you continue to enjoy our services in the future. We have seen better connectivity speeds and latency tests at the new facility, and we are looking forward to being able to set some issues aside as part of this migration. If you do have any questions for us, we are always available via email at [email protected], so please feel free to get in touch.

4/25/21 Extended Maintenance Report

April 26, 2021 at 7:49 PM

We want to thank everyone for their patience during the extended maintenance last night and this morning which impacted our US-based hosting services. We would also like to apologize for going far beyond the planned maintenance window. We did schedule this maintenance well in advance, and believed strongly that it would be a 2-3 hour operation, with another hour added for any possible complications to be cleared up. Unfortunately, some aspects of the planned maintenance were changed without being communicated to us beforehand, and these changes effectively turned a 2-3 hour job into a 6-8 hour job.

All of our US server hardware was to be moved to a location with better infrastructure, as communicated to us by our datacenter provider over the past several months. Our biggest concern when moving equipment between locations is always making sure downtime is kept to a minimum - we've managed our fair share of hardware migrations, and there are always complications or things we did not expect that end up slowing down the process substantially. Most commonly this relates to trying to maintain similar cable configurations during migration, hardware being jostled during transport (which increases the risk of hardware failure upon returning it to service), and steps one wouldn't necessarily expect to take very long, like removing the servers from the cabinet and re-racking them in a new cabinet. In February we received more details regarding the expected process for the relocation: the datacenter would be taking care of everything, the process was to be a full cabinet migration without removing and re-racking servers between locations, and the work would be performed by a team that does this type of work professionally. Not only is this the fastest and most reliable method of relocating live servers, but it avoids most of the concerns we typically have about the entire process.

We performed a planned upgrade to the OnApp software hosting our VPS services shortly before the migration was to begin, and unfortunately the update had to be rolled back, which caused a delay of almost two hours before the maintenance began. Around 11:45PM CDT the migration was underway, and we were still anticipating 2-4 hours for the entire process from that point. After 4 hours it was clear something was not going to plan, and when communicating with the technicians on site we were informed they were almost done racking our equipment at the new location and to expect systems to be booting up within an hour. At this point we knew the work was going to go on significantly longer than expected and that one hour was likely very optimistic, so we tried our best to communicate this via Twitter while our main site and services were still down.

Roughly 90 minutes after we received that ETA from the datacenter, the re-racking was completed, and the re-wiring began, which took another 30 minutes or so. One of the older PDUs needed to be replaced, which took roughly 15 minutes. Services were finally being brought back online around 8 hours after the downtime began, with almost all customers back online after 8-9 hours of total downtime.

In short, instead of migrating cabinets of hardware in their entirety from the old facility to the new facility, as we were told to expect, each piece of equipment was individually moved by hand. Instead of moving large cabinets, dozens of individual pieces of equipment were disconnected, removed, re-racked, and re-connected. There are far more possible complications with this strategy, and of course a substantially longer amount of time is required to perform the work. If we had known beforehand that the plan had changed away from a full cabinet relocation, we would have been on site ourselves and split the migration into two separate parts instead of trying to get all the work done in one night. We are frustrated this happened the way it did, but we are also glad the work is complete and no more hardware relocations are expected in the near future.

We want to apologize again for the maintenance being pushed back and running over twice as long as originally planned. We have confirmed all shared, reseller, dedicated server, and managed VPS services are fully restored and operational. If you still see any issues of any kind, please let us know in the help desk or send us an email at [email protected] and we will investigate immediately.

Scheduled Maintenance - 4/25/21

April 7, 2021 at 1:22 PM

On April 25, 2021, maintenance will be performed to migrate all physical hardware at our US location, beginning at 10PM CDT (GMT-5), and we expect the migration window to last 2-4 hours. Our US-based hosting services will be down for the duration of the planned maintenance. We do expect slightly improved network performance and reliability following the relocation of hardware to our datacenter provider's premier location. Some considerations are included below for each service type currently active at our US location.

Shared Hosting (Unlimited & Performance) and Reseller Hosting: Our US-based shared and reseller hosting services will be down for 2-4 hours for the maintenance window. We will shut down servers manually about 10 minutes prior to the beginning of the maintenance window, and service will automatically return once the relocation concludes.

VPS Hosting: All VPS hosting services will be down for 2-4 hours for the maintenance window, and we will begin shutting down VPS services about 30 minutes prior to the maintenance window, at which time the OnApp control panel may also be inaccessible. All VPS services will be brought back online automatically following the maintenance window.

Dedicated Servers: We would advise dedicated server customers to shut down their servers or put their sites into maintenance mode 10-30 minutes prior to the maintenance window to help ensure clean shutdowns can be performed. Servers will be automatically brought back online following maintenance.

We understand this is a significant downtime, so we have scheduled the maintenance window for overnight hours to help ensure as few customers as possible are negatively impacted by this relocation. If you have any questions or concerns, please contact us directly by opening a support ticket under your account or by emailing us at [email protected].

Update (10:06PM 4/25): Please note we will be kicking off the downtime a bit later than anticipated tonight. We have encountered an issue during a planned update we were to perform just prior to the maintenance, and we are getting that rolled back before the maintenance gets fully underway.

Update (10:42AM 4/26): The maintenance and subsequent service verification checks have been completed at this time. Most users should have been back online around 7:45AM, with VPS and dedicated customers following about 30-60 minutes thereafter. If you are still seeing any issues and you are on our VPS services, please reboot your VPS, and if this does not help, contact us at the help desk. If you see issues on shared, reseller, or dedicated server hosting, please contact us via the help desk or [email protected] for assistance. All services should be fully online and operational at this time, and we would like to apologize for the maintenance taking much longer than expected. An additional blog post explaining why the maintenance ran so far past the scheduled window will come later this evening, once we have a chance to triple-check all servers and services and handle pending support requests.

Outage at Chicago Datacenter - 9/18/20

September 18, 2020 at 11:37 PM

Shared hosting customers may be noticing DNS problems this evening as we have experienced an attack on our DNS servers. We have taken steps to get service restored but it may take some time before all domains begin resolving again (we recommend clearing your DNS cache or restarting your device to help speed this process along).
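
For reference, the local DNS cache can typically be cleared with one of the following commands, depending on your operating system (exact commands vary by version; restarting the device accomplishes the same thing):

    ipconfig /flushdns                                                 (Windows)
    sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder      (macOS)
    sudo resolvectl flush-caches                                       (Linux with systemd-resolved)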

Reseller, VPS, and Dedicated Server customers may also have seen an outage lasting roughly 30 minutes during the time when we were working with the datacenter to resolve the DNS connectivity issues. These services should now be back to normal operation.

Saturn Server Reason For Outage 7/17-7/19/20

July 20, 2020 at 2:29 PM

This report will summarize and go into detail about the outage and recovery efforts on the Saturn server between 7/17/20 and 7/19/20. Note timeframe references are in our local time (CDT, GMT-5).

Summary:

On the night of Thursday 7/16/20 we had scheduled a server reboot to help improve server stability following an issue encountered the day before. The uptime of this server was over 1,000 days, and the issue the night before led us to believe a reboot would be helpful to clear out a few old processes that may have been impacting server stability. During the scheduled reboot, the RAID arrays hosting both primary and backup drive data were no longer present aside from two disks on the main RAID 10 array. The data was not recoverable in this state, so we began recovery efforts on the morning of 7/17 via re-image and restore from the latest backups taken the day of the reboot. Exceptionally slow speeds were observed going into the second half of the data restoration process, which led us to investigate possible underlying complications, ultimately finding that the server recovery had been executed on the wrong array. Options were evaluated for how to proceed, and it was deemed most suitable to re-initiate recovery on the correct disks, which began on the morning of 7/18. Recovery from backup was then completed in the early afternoon of 7/19.

In depth:

Prior to a reboot on a server with significant uptime, it is prudent to be sure proper backups are in order and all systems are operating as expected. We verified JetBackup (the backup service on the Saturn server) had taken a full set of successful backups on the same day the reboot was to be performed, as expected and without complications on any accounts. We also verified via our hardware monitoring services that no issues were being reported. In this case, it is important to note what the RAID array monitoring was reporting at the time: OK (CTR, LD, PD, CV). Broken down, this effectively meant no issues were present on any of the key points related to the RAID configuration or hardware. Virtual drives (the RAID arrays), physical disks, the CacheVault (I/O caching and backup unit), and the controller itself were all scanned within minutes of the reboot and did not report any issues. This monitoring occurs every 5 minutes, and all points of concern are checked each time. We were very careful in this case to double-check these monitoring points before the reboot because the server had several disks replaced last year.
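
As a rough illustration only (assuming a Broadcom/LSI controller managed with the storcli utility; the specific tooling varies by system), the manual checks corresponding to each of those monitoring points would look something like this:

    storcli64 /c0 show            # controller status (CTR)
    storcli64 /c0/vall show       # virtual drives / RAID arrays (LD)
    storcli64 /c0/eall/sall show  # physical disks across all enclosures (PD)
    storcli64 /c0/cv show         # CacheVault caching/backup unit (CV)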

Following the reboot, all disks which had been hot-swapped in the year before had fallen out of their respective RAID arrays. The only two drives remaining with a visible RAID configuration belonged to an unrecoverable portion of the primary RAID array. At this point we began our best efforts to try to recover the failed RAID arrays, but several hours later, after making no progress and evaluating other options with experts in the field, we decided to start a full rebuild of the operating system (OS) and recovery from backups. We also updated the RAID card firmware during this process, as we suspected it may have been to blame for the RAID failure, and we tested the integrity of future reboots on the installed drives by doing some stress testing prior to beginning the recovery effort (though that is not to say we will be rebooting the server again anytime soon - for sanity's sake!).

During the initial re-configuration of the server, a mistake was made which put the backup HDD drives in RAID 0 instead of RAID 1 and caused the wrong drives to be interpreted as the primary array during the OS re-install process. This mistake was not realized until late in the evening that same day, when it became apparent there was something wrong with the recovery speed. We then evaluated, amongst ourselves and the other experts whom we consult with, whether cloning the drives onto the main array or simply restarting the recovery process entirely was the better course of action. The clone process could ultimately have failed, would likely have taken several hours to complete, and more backups would still have needed to be restored afterward (about 45% of the data remained to be restored). Restarting the recovery process entirely would be a more certain course of action, and afterward the restore from backup would be much faster. Another option was to continue with the restore process and then try the clone or transfer later from the restored data once everything was back online, but this came with other risks (the data sitting on a RAID 0 array under extreme load being the primary concern) and would have meant *extremely* slow speeds for the duration. If anything went wrong with that process, we would have been forced to restart recovery completely once again.

It was finally decided to restart the recovery process entirely - on the correct drives, and retaining data from the prior restore operation. We considered this to be the best course of action to not only get websites back online as quickly as possible but also to limit further downtime during and following the restore process. The recovery process was finally completed on 7/19/20 in the early afternoon around 1PM. Server responsiveness and stability then immediately returned to normal.

Going forward:

We do everything we can to prevent issues like RAID failure from becoming an actuality. RAID failure is the most feared situation in any hosting environment. We checked the array status, we checked the hardware backing the I/O cache, we checked the physical disks - all of this just minutes prior to the scheduled reboot. In fact, RAID arrays on all of our systems are checked every 5-10 minutes and drive replacement occurs within 24-72 hours of failure, caching is disabled immediately if battery backups or CacheVault systems fail, etc.

Recovery from backups is rarely a quick or easy operation in a shared hosting environment, and we are exceptionally relieved with the outcome in this regard, especially as it relates to data integrity. We regret the oversight made during the first recovery attempt, and this is where we will be making changes to our own internal processes. Generally we reference other active systems and an overall recovery guideline (we keep configurations similar between our shared hosting environments where possible), but this was half of the equation which led to the selection of the wrong drives during the recovery process in this case. We will now be retaining separate recovery plans for each individual system rather than a blanket set of guidelines. We do not want to make worst-case scenarios worse, and this will ensure such complications do not occur again.

As with any major downtime, lessons are learned and will now always influence our future preparations and actions. This was easily the worst disaster we've had to manage in our history in web hosting as GeekStorage, and the second worst in my personal experience of nearly 20 years. Our disaster recovery plans were tested heavily throughout the past few days, and although a mistake was made preventing a more timely return to service, minimal loss of data and configurations occurred and for that we are very thankful to have planned sufficiently.

We're also now working on sourcing more coffee.

Thank you:

We know stress is constant during an outage, and at most times concrete answers are unavailable, which means everyone is angry, frustrated, or both. We want to thank everyone on the Saturn server for being extremely understanding and patient throughout the outage and recovery effort.

We understand outages such as this, compounded by the network outage earlier this month, can shake your faith in us as your hosting provider, and can do the same for your business and customers. We hope the above helps provide an understanding of what happened and how such problems will be prevented to the utmost degree possible in the future, and that this information can be passed on to your clients to hopefully alleviate their concerns going forward as well.