4/25/21 Extended Maintenance Report

April 26, 2021 at 7:49 PM

We want to thank everyone for their patience during the extended maintenance last night and this morning, which impacted our US-based hosting services. We would also like to apologize for going far beyond the planned maintenance window. We scheduled this maintenance well in advance and believed strongly that it would be a 2-3 hour operation, with another hour added to clear up any possible complications. Unfortunately, some aspects of the planned maintenance were changed without being communicated to us beforehand, and these changes effectively turned a 2-3 hour job into a 6-8 hour job.

All of our US server hardware was to be moved to a location with better infrastructure, and our datacenter provider had communicated this to us over the past several months. Our biggest concern when moving equipment between locations is always keeping downtime to a minimum - we've managed our fair share of hardware migrations, and there are always complications or surprises that end up slowing the process substantially. Most commonly these involve trying to maintain similar cable configurations during the migration, hardware being jostled in transport (which increases the risk of hardware failure when it is returned to service), and tasks one wouldn't necessarily expect to take very long, like removing the servers from one cabinet and re-racking them in a new one. In February we received more details about the expected relocation process: we were informed that the datacenter would be taking care of everything, that the process was to be a full cabinet migration without removing and re-racking servers between locations, and that the work would be performed by a team that does this type of work professionally. Not only is this the fastest and most reliable method of relocating live servers, but it also avoids most of the concerns we typically have about the entire process.

We performed a planned upgrade to the OnApp software hosting our VPS services shortly before the migration was to begin. Unfortunately, the update had to be rolled back, which delayed the start of the maintenance by almost two hours. Around 11:45 PM CDT the migration was underway, and we were still anticipating 2-4 hours for the entire process from that time. After 4 hours it was clear something was not going to plan. When we communicated with the technicians on site, we were informed they were almost done racking our equipment at the new location and that we should expect systems to be booting up within an hour. At this point we knew the work was going to go on significantly longer than expected and that one hour was likely very optimistic, so we tried our best to communicate this via Twitter while our main site and services were still down.

Ninety minutes after we received the ETA from the datacenter, the re-racking was complete. The re-wiring then began, which took another 30 minutes or so, and one of the older PDUs needed to be replaced, which took roughly 15 minutes. Services finally began coming back online around 8 hours after the downtime began, and almost all customers were back online after 8-9 hours of total downtime.

In short, instead of migrating cabinets of hardware in their entirety from the old facility to the new facility, as we were told to expect, each piece of equipment was moved individually by hand. Instead of moving large cabinets, the team disconnected, removed, re-racked, and re-connected dozens of individual pieces of equipment. This strategy carries far more possible complications, and of course it requires substantially more time to perform the work. Had we known beforehand that the plan had changed away from a full cabinet relocation, we would have been on site ourselves and split the migration into two separate parts instead of trying to get all the work done in one night. We are frustrated that it happened this way, but we are also glad the work is complete and no more hardware relocations are expected in the near future.

We want to apologize again for the maintenance being pushed back and running more than twice as long as originally planned. We have confirmed that all shared, reseller, dedicated server, and managed VPS services are fully restored and operational. If you still see issues of any kind, please let us know through the help desk or send us an email at [email protected] and we will investigate immediately.

Posted in Chicago DC by Matt Eli