All Servers - Planned Maintenance (11/29/2010)

November 28, 2010 at 2:09 PM

All shared & reseller hosting servers will be taken down for scheduled maintenance on 11/29/2010 as we issue reboots to implement new software. The downtime will occur between 10:00PM and 11:59PM CST (GMT-6) and will last roughly 10-15 minutes per server.

This maintenance is to enable CloudLinux on all of our shared & reseller hosting machines.

This update will significantly improve service speeds across all shared & reseller hosting services and adds improved per-account resource tracking & management. Users of the "x3" cPanel theme will also find new statistics in their cPanel display, corresponding to their CPU Resource Usage and Concurrent Connections.
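To illustrate the idea behind per-account resource tracking, below is a minimal Python sketch that totals CPU time per system user by reading /proc on a Linux host. This is only a stand-in for illustration; CloudLinux's LVE accounting happens in the kernel and is far more sophisticated than this.

    #!/usr/bin/env python3
    """Aggregate CPU time per user from /proc.

    Illustrative only: CloudLinux's LVE does its accounting in the
    kernel; this just shows per-account CPU tracking in principle.
    """
    import os
    import pwd

    def cpu_seconds_by_user():
        ticks_per_sec = os.sysconf("SC_CLK_TCK")
        totals = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                uid = os.stat(f"/proc/{pid}").st_uid
                with open(f"/proc/{pid}/stat") as f:
                    # Split after the "(comm)" field so process names
                    # containing spaces or parens don't shift the fields.
                    fields = f.read().rsplit(")", 1)[1].split()
                # fields[11] and fields[12] are utime and stime,
                # both measured in clock ticks.
                utime, stime = int(fields[11]), int(fields[12])
            except (FileNotFoundError, ProcessLookupError, PermissionError):
                continue  # process exited while we were scanning
            totals[uid] = totals.get(uid, 0) + (utime + stime) / ticks_per_sec
        return totals

    if __name__ == "__main__":
        for uid, secs in sorted(cpu_seconds_by_user().items()):
            try:
                name = pwd.getpwuid(uid).pw_name
            except KeyError:
                name = str(uid)
            print(f"{name:<16} {secs:10.1f} CPU-seconds")

Sampling this total at intervals and taking the difference gives a per-account CPU usage rate, which is the kind of statistic the new cPanel display reports.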

If you have any questions regarding this maintenance, please e-mail us at [email protected].

Update @ 2200 hrs: Scheduled maintenance has begun on time. Estimated downtime for each server is approximately 10-15 minutes.

Update @ 2240 hrs: Scheduled maintenance has been completed and all Shared Hosting and Reseller Hosting servers are back online, running CloudLinux!

Chicago - Downtime (09/12/2010)

September 12, 2010 at 12:02 PM

During major network upgrades at the Chicago datacenter very early this morning (2AM CDT), we began to experience heavy packet loss on all Chicago servers. These servers include:

Apollo
Node3
Node5
Node6
Zeus

At around 7AM CDT the heavy packet loss turned into a total inability to access the network. The datacenter is aware of this problem and is working with its hardware vendor to get the issue resolved as soon as possible.
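For anyone who wants to gauge packet loss to a server themselves, one simple approach is to wrap the standard Linux ping utility, as in this Python sketch (the hostnames are placeholders for illustration, not our servers' actual addresses):

    #!/usr/bin/env python3
    """Measure packet loss to a list of hosts via the system ping utility."""
    import re
    import subprocess

    HOSTS = ["apollo.example.com", "zeus.example.com"]  # placeholders

    def packet_loss(host, count=10):
        """Return percent packet loss to host, or None if ping failed outright."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", "2", host],
            capture_output=True, text=True,
        )
        # ping's summary line looks like: "10 packets transmitted,
        # 8 received, 20% packet loss, time 9012ms"
        match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
        return float(match.group(1)) if match else None

    if __name__ == "__main__":
        for host in HOSTS:
            loss = packet_loss(host)
            status = f"{loss:.0f}% loss" if loss is not None else "unreachable"
            print(f"{host:<24} {status}")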

This outage stemmed from a scheduled upgrade of multiple core routers that was not expected to impact service availability for more than a few minutes at a time. Unfortunately, unforeseen complications have resulted in the issues seen this morning. All updates and further information regarding the cause of the outage will be posted to the datacenter's news feed, linked below.

https://support.steadfast.net/index.php?_m=news&_a=viewnews&newsid=285

2:57PM CDT Update: We have received an ETA of 2 hours from the datacenter for resolution of the network issues.

5:45PM CDT Update: Service at the Chicago location should now be back online, though there may still be some packet loss while the datacenter continues working on the network hardware. We have also scheduled a hard drive replacement on the Apollo (reseller) server for 10:30PM; this maintenance will last roughly 30 minutes and should improve speeds significantly once the new hard drive has been mirrored into the RAID array.
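The speed improvement depends on the new drive finishing its mirror rebuild in the array. As an illustration only (assuming a Linux software RAID (md) setup, which may differ from Apollo's actual controller), rebuild progress can be read from /proc/mdstat with a short Python sketch like this:

    #!/usr/bin/env python3
    """Report rebuild/resync progress for Linux software RAID (md) arrays.

    Assumes md software RAID for illustration; hardware RAID controllers
    expose rebuild status through their own vendor tools instead.
    """
    import re

    def mdstat_progress(path="/proc/mdstat"):
        with open(path) as f:
            text = f.read()
        # Progress lines look like:
        #   [==>.........]  recovery = 12.6% (123456/976728) finish=42.1min
        for match in re.finditer(
            r"(recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", text
        ):
            yield match.group(1), float(match.group(2)), float(match.group(3))

    if __name__ == "__main__":
        try:
            progress = list(mdstat_progress())
        except FileNotFoundError:
            progress = []
            print("No /proc/mdstat; this system is not using md RAID.")
        for kind, pct, finish in progress:
            print(f"{kind}: {pct:.1f}% complete, ~{finish:.0f} min remaining")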

Node2 - Reason for Downtime (08/05/2010)

August 5, 2010 at 3:48 PM

To begin, very early this morning, at 3:04AM CDT (GMT-5), we received an alert that Node2 had gone offline. This was a hard crash, not the kind resolved by a simple reboot: the server would not come back up until we reverted to a previous kernel, as the newer kernels were failing to load. To keep downtime to a minimum, we booted the old kernel and left it at that. This initial outage lasted about 40 minutes (~60 minutes for an individual VPS, including the quota re-check). A few hours later, at 7:56AM CDT, we received another notification that Node2 had gone offline. We immediately began reviewing possible problems and solutions to determine what wasn't working and what would get the server back online and fully functional, avoiding further outages or degraded service speed.

It was quickly determined that the only working kernel was an old one that could not utilize all of the RAM installed in the server. For a VPS node, it is imperative that all of the installed RAM be available; otherwise the node suffers subsequent crashes once RAM + swap are completely utilized.
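As an illustration of this kind of check (not necessarily what was run at the time), the following Python sketch compares the RAM the running kernel reports in /proc/meminfo against the amount known to be installed. The 16 GB figure is a placeholder; Node2's actual RAM size isn't stated here.

    #!/usr/bin/env python3
    """Compare the RAM the running kernel sees with the RAM known to be
    installed. EXPECTED_KIB is a placeholder value for illustration."""

    EXPECTED_KIB = 16 * 1024 * 1024  # assume 16 GB installed (placeholder)
    TOLERANCE = 0.05                 # firmware normally reserves a little

    def detected_kib(path="/proc/meminfo"):
        with open(path) as f:
            for line in f:
                # The first line reads e.g. "MemTotal:  16384256 kB"
                if line.startswith("MemTotal:"):
                    return int(line.split()[1])
        raise RuntimeError("MemTotal not found in /proc/meminfo")

    if __name__ == "__main__":
        seen = detected_kib()
        if seen < EXPECTED_KIB * (1 - TOLERANCE):
            print(f"WARNING: kernel sees only {seen} kB of {EXPECTED_KIB} kB "
                  "expected -- possible bad DIMM or kernel/BIOS issue")
        else:
            print(f"OK: kernel sees {seen} kB")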

Throughout the morning we worked through many possible solutions. Our first approach was to figure out what was (seemingly) wrong with the kernels that wouldn't boot. This meant trying many different kernels and kernel parameters, and even recompiling software that appeared to be interfering with the boot process. Despite all of this, nothing changed the behavior we were seeing.

After many hours of attempting a software solution, it was clear something wasn't right with the RAM itself. We tried changing BIOS options, but the server still would not recognize all of the installed RAM. The problem was ultimately pinpointed to a faulty RAM stick, which had caused everything we saw during the software troubleshooting. Replacing the faulty stick at around 1:50PM CDT finally resolved the problems on Node2.

We have rarely seen a hardware failure present itself as software/kernel boot problems in this way. We admittedly spent far too much time banging our heads against the wall on solutions that showed no promise, when we should have been questioning what triggered the initial outage earlier in the night. In the end, Node2 was brought back online at 1:57PM CDT, with VPS services following shortly thereafter.

We would like to thank everyone on Node2 for their patience throughout the problems today. As our way of thanking you for your patience and honoring our SLA, you may request a one-month service credit for this outage by e-mailing [email protected].

Node2 - Emergency Maintenance (08/05/2010)

August 5, 2010 at 9:50 AM

Due to issues this morning on Node2, we are running some emergency software updates and expect ~45 minutes of downtime during the subsequent service resets. We apologize for this unexpected downtime and thank everyone for their patience and understanding.

Update 10:50AM: We are encountering some problems running the updates; we hope to have service on Node2 restored ASAP.

Update 12:42PM: We are still working to get the problems resolved; unfortunately, we do not yet have a solid ETA.