Node2 - Reason for Downtime (08/05/2010)

August 5, 2010 at 3:48 PM

To begin, very early this morning at 3:04AM CDT (GMT-5) we received an alert that Node2 had gone offline. Unlike most crashes, which are a matter of a simple reboot, this was a hard crash: the server would not come back up until we reverted to a previous kernel, as the newer kernels were failing to load. At the time, we simply booted the old kernel and left it in place to keep downtime to a minimum. This initial outage lasted about 40 minutes (~60 minutes for an individual VPS plus the quota re-check). A few hours later, at 7:56AM CDT, we received another notification that Node2 had gone offline. We immediately began going over possible problems and solutions to determine what wasn't working and what could get the server back online and fully functional, to avoid further outages or any issues with service speed.

It was quickly determined that the only working kernel was an old kernel that seemed to have trouble allowing the server to utilize all of the installed RAM. For a VPS node, it is imperative that all of the installed RAM is available to the node without issue; otherwise we encounter subsequent crashes once the RAM and swap are completely utilized.
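For context, below is a minimal sketch of how one might verify that the kernel is actually seeing all of the installed memory on a Linux node by reading /proc/meminfo. The expected-RAM figure is a hypothetical example for illustration, not the actual Node2 configuration.

    # Minimal sketch: compare the RAM the kernel reports against what we expect
    # to be physically installed. Assumes a Linux host; the 32 GiB figure below
    # is a hypothetical example, not the actual Node2 configuration.

    EXPECTED_GIB = 32  # hypothetical installed RAM, for illustration only

    def visible_memory_gib(meminfo_path="/proc/meminfo"):
        """Return MemTotal from /proc/meminfo, converted from KiB to GiB."""
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    kib = int(line.split()[1])  # value is reported in kB
                    return kib / (1024 * 1024)
        raise RuntimeError("MemTotal not found in /proc/meminfo")

    if __name__ == "__main__":
        visible = visible_memory_gib()
        print(f"Kernel sees {visible:.1f} GiB of RAM")
        # MemTotal is always somewhat below the physically installed amount
        # (the kernel reserves some memory), so allow roughly a 10% margin;
        # a shortfall of a full stick's worth points at a hardware problem.
        if visible < EXPECTED_GIB * 0.9:
            print("Warning: significantly less RAM visible than expected")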

Throughout the morning we went over many different possible solutions. Our first plan was to figure out what was (seemingly) wrong with the kernels that would not boot. This came down to trying many different kernels and kernel parameters, and even compiling software that seemed to be interfering with the boot process. Despite all of this, nothing changed the behavior we were seeing.

After many hours of trying for a software solution, it was clear something wasn't right with the RAM. We tried changing BIOS options, but the server still was not recognizing all of the installed RAM. Ultimately the problem was pinpointed to a faulty RAM stick, which had caused all of the problems we saw throughout the software troubleshooting process. After hours of troubleshooting, replacing the RAM at around 1:50PM CDT finally resolved the problems on Node2.

We have seen very few hardware failures present as software/kernel boot problems in this manner. We admittedly spent far too much time banging our heads against the wall on solutions that were not showing any promise, when we should have been questioning what triggered the initial outage earlier in the night. In the end, Node2 was brought back online at 1:57PM CDT, with VPS services coming back online shortly thereafter.

We would like to thank everyone on Node2 for their patience throughout the problems today. As our way of thanking you for your patience and honoring our SLA, you may request a one month service credit for this outage by e-mailing [email protected].

Node2 - Emergency Maintenance (08/05/2010)

August 5, 2010 at 9:50 AM

Due to issues this morning on Node2, we are running some emergency software updates and expect ~45 minutes of downtime during the subsequent service resets. We apologize for this unexpected downtime, and would like to thank everyone for their patience and understanding.

Update 10:50AM: We are encountering some problems running the updates; we hope to have service on Node2 restored ASAP.

Update 12:42PM: We are still working to get the problems resolved; unfortunately we still do not have a solid ETA.

Node5 - Scheduled Maintenance (08/03/2010)

July 30, 2010 at 12:34 PM

To address some possible hardware issues, we have scheduled maintenance for Node5 on August 3rd at 1AM CST. We expect this maintenance to last no longer than one hour. The maintenance window will allow for the replacement of a hard drive cable on Node5, which should resolve some issues we are seeing, and should also have a minor positive impact on performance once completed.

The maintenance for Node5 is still on schedule and will begin at 1AM this coming morning as planned.

The Node5 maintenance is now completed. :)

Los Angeles - Unplanned Outage (07/16/2010)

July 16, 2010 at 7:40 PM

We are currently experiencing an outage at our Los Angeles location, which has rendered services unavailable for the servers listed below. We will post updates as we receive them from the datacenter, and would like to thank everyone for their patience and understanding.

Metis
Iris
Omega
Alpha
Aries
Node1
Node2
Node4

7:50PM Update: It appears there has been a power outage at the Los Angeles datacenter. The power has been restored; however, we are not yet seeing our servers come back online. We are awaiting further updates from the datacenter at this time.

7:56PM Update: These servers are coming back online at this time:

Metis
Iris
Omega
Node1
Node2
Node4

Note for VPS servers: Since this was an unclean shutdown, your VPS will come back online within ~20 minutes of the node itself coming online. Within several hours the node will then perform quota maintenance, which will take your VPS offline again for 10-30 minutes before it comes back online for good once the quota maintenance is completed. We apologize for this inconvenience.

8:51PM Update: The server listed below is back online at this time. We are working on finishing an FSCK on the Aries server, and the Omega server is pending further investigation regarding a disk issue. We hope to have further positive updates for you shortly.

Alpha

8:58PM Update: The server listed below is back online at this time. We are still waiting on further updates regarding the Omega server.

Aries

9:58PM Update: We are still working on getting the issues on the Omega server resolved.

11:40PM Update: We are still waiting on an update from the datacenter regarding the hardware issue that seems to have come up on the Omega server after the power outage earlier today. As of yet we still have not received a response from the datacenter, and we do not have an ETA to provide. The datacenter is likely backlogged with many other similar requests due to the power outage, so it is difficult to guess how long it may take to get this issue resolved. We promise to keep the blog as up to date as possible with new information as we receive it.

12:54AM Update: The datacenter has still not been able to get to our issue to bring the Omega server back online. Unfortunately there is still no ETA for a resolution.

3:48AM Update: The datacenter is currently looking into the hardware issue on Omega. If the issue is what we expect, the server should be back online within the next hour. Another update to follow as soon as we receive word back from the datacenter.

4:58AM Update: Unfortunately we have not yet received an update from the datacenter regarding the root cause of the issue. We are still waiting to hear back from their end.

5:42AM Update: The datacenter has let us know that they will get back to the issues on Omega as soon as they can; there are still many pending issues on other servers that are slowing down their investigation of the troubles on the Omega server.

7:42AM Update: We unfortunately do not have any significant updates to provide at this time regarding the Omega server. We can confirm there is a hardware issue on the Omega server; sadly, this means the server will be at the end of the datacenter's support queue, since there are far more servers still offline that can be fixed more quickly and that they need to attend to first. We still do not have any sort of time frame for when the Omega server will be back online, as we do not yet know exactly what needs to be replaced, and thus how much more work will be required to get the server back online once the hardware is replaced. Optimistically, once the datacenter replaces the faulty hardware, we hope to have the server online very quickly afterwards.

12:54PM Update: Great news, Omega is back online! After a hard drive replacement, the primary RAID array is back to normal operating status. Do note that disk performance may be slowed for some time as the RAID array is rebuilt onto the new hard drive; this can last up to a few days. We would like to thank everyone for their patience and understanding throughout this process.
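For those curious about the rebuild, on nodes that use Linux software RAID (md), progress can be read from /proc/mdstat. The sketch below assumes an md array, which may not match the actual controller in the Omega server; a hardware RAID controller would report rebuild progress through its own vendor tools instead.

    # Minimal sketch: print Linux software-RAID (md) rebuild/resync progress
    # from /proc/mdstat. Assumes an md array is in use, which may not match
    # the actual controller in the Omega server.

    def rebuild_status(mdstat_path="/proc/mdstat"):
        """Return progress lines, e.g. '[=>....] recovery = 12.6% ...'."""
        with open(mdstat_path) as f:
            lines = f.read().splitlines()
        progress = [line.strip() for line in lines
                    if "recovery" in line or "resync" in line]
        return progress or ["No rebuild or resync currently in progress"]

    if __name__ == "__main__":
        for line in rebuild_status():
            print(line)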

Metis Server - Unplanned Outage (05/25/2010) (RESOLVED)

May 25, 2010 at 3:06 PM

While resolving some service issues on the Metis server today, we encountered a problem that required the server to be rebooted immediately. During the reboot it was determined that there was filesystem corruption on the primary hard drive, and we are waiting on a filesystem check to complete at this time. The estimated total downtime will be around 90 minutes, which leaves approximately 30 minutes for the remainder of the filesystem check. If this timeframe needs to be extended, we will update this notice immediately.

UPDATE:

We have determined there are further problems with the filesystems on the Metis server which require an additional filesystem check and repair. We do not have an exact ETA for resolution at this time; however, we expect the remaining filesystem repairs to take between 30 minutes and two hours.

UPDATE:

The filesystem repairs to the Metis server have been completed at this time, and service is coming back online now for users on the Metis server.