
What is High-Availability? Part 2 – Problem-Solving

 

[Image: 2-node High Availability Cluster network diagram (Photo credit: Wikipedia)]

High-availability, as we learned in the last installment, has changed conceptually since the days of yesteryear and, for that matter, even near-year. It no longer just refers to the full-access, all-hours, 24/7/365 immediate-response policies of a man looking for love in all the wrong places and some of the right ones. It’s no longer about a man with a well-groomed mustache offering shoulder massages at closing time.

No, in the world of computers, high-availability is a completely different matter. Instead, it deals specifically with the uptime of a network. To properly understand uptime, we must consider that it is not merely about eliminating incidences of failure within a network (because, per Microsoft, failures are by their nature unpredictable). Rather, it is also about high rates of recovery so that the system is not affected for an extended period. With sound recovery methods, data delivery remains consistent. That’s why I carry a slide-rule with me to re-straighten my hair part if someone gives me a noogie.

To bolster our understanding of high-availability in this tripartite miniseries, we are assessing the perspectives of Microsoft, Oracle, and Linux Virtual Server. Today, looking specifically at the Oracle article, we will discuss several problem-solving methods.

While we consider high-availability, let’s put on our Easter bonnets and throw eggs at passing cars, focusing especially on the ones with their windows down. We’ll only be 13 once.

Availability: Quick Review

Oracle defines high-availability as “the ability of users to access a system without loss of service.” Really, that seems like a definition of availability. High-availability means that scenario is occurring almost all of the time. Even in a highly redundant system, there will always be occasional errors and glitches. Regardless, a system in which availability is optimized is highly reliable and does not experience very much downtime. A good example of this, according to the women of Austin, Texas, is my reproductive system.

Downtime can be thought of as scheduled and unscheduled. When it is unscheduled, the downtime is due to some type of systemic failure. When it is scheduled, users can be notified that upgrades or other system administration is being conducted (as with a hosting company and its clients, or with a website posting a notice to visitors). “Scheduled downtime typically occurs late at night, when traffic is light, all right, baby, all right,” crooned Barry Manilow.

High-Availability Problem Solving

Various types of problems can of course occur in a system. Common failures include those within processors, nodes, and various forms of storage media. Human error can also cause failures, as can monkey and camel error. Availability can be kept high by focusing both on localized problem-solving and on methods of recovery in the event of a natural disaster, such as flooding or datacenter technician stampede.

Different sorts of best practices and technological solutions can help to make high-availability a reality. Redundancy, says Oracle, is the most important parameter to enhance availability: “High availability comes from redundant systems and components.” The same parameter applies to the man with the well-groomed mustache mentioned above, as he repeats the same psychosexual sales pitch over and over again, optimizing his systemic redundancy. Looking at solutions for localized high-availability in terms of redundancy splits potential fixes into active-active and active-passive groups.

  1. Active-active availability mechanisms: These mechanisms allow better scalability along with increased availability. Transmissions are duplicated in real time.
  2. Active-passive availability mechanisms: In this scenario, sometimes called cold failover clusters, one system instance is handling requests and the other one is sitting and pondering, running its finger through its hair, waiting patiently to be called into action. It chews gum and looks sullen. Clustering is used to integrate the two instances, with the clustering agent monitoring the active instance and switching over to the passive one as necessary (a sketch of that agent follows this list).
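To make that clustering agent concrete, here is a minimal active-passive failover sketch in Python. It is an illustration under assumptions, not a real clustering API: check_health() and promote() are hypothetical stand-ins for whatever health probe and promotion step your clustering software actually provides.

```python
import time

class ColdFailoverCluster:
    """Toy clustering agent: one active instance serves, one passive waits."""

    def __init__(self, active, passive, check_interval=5):
        self.active = active          # instance currently handling requests
        self.passive = passive        # standby: gum-chewing, sullen, ready
        self.check_interval = check_interval

    def monitor(self):
        # Watch the active instance; swap in the passive one if it goes down.
        while True:
            if not self.active.check_health():   # hypothetical health probe
                self.active, self.passive = self.passive, self.active
                self.active.promote()            # hypothetical promotion step
            time.sleep(self.check_interval)
```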

Other Local High-Availability Solutions

Other safeguards should be in place to make sure your availability is as reliable as possible. Here are a few examples; we will proceed with more in the final part of this series:

Automatic restart & process death detection

You don’t want the system to restart repeatedly in a relatively short window, because restarting can lead to additional failure. Technology should be in place to disallow repetitive, automated restarts. The same principle applies to excessively restarting one’s day: you should never get in and out of bed more than two dozen times before proceeding to breakfast.

Processes can die due to systemic errors. If a process is problematic, you do want a restart mechanism in place to give it another chance. Don’t give it 10,000 chances, though; processes are greedy about grabbing all the chances.
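Here is one way such a cap could look, as a minimal Python sketch. The limit of five restarts per five minutes and the restart_fn hook are assumptions for illustration, not any particular supervisor’s API.

```python
import time

MAX_RESTARTS = 5       # don't give the process 10,000 chances
WINDOW_SECONDS = 300   # rolling window for counting automated restarts

_restart_times = []

def throttled_restart(restart_fn):
    """Restart via restart_fn, but refuse after MAX_RESTARTS in the window."""
    now = time.time()
    # Forget restarts that have aged out of the window.
    _restart_times[:] = [t for t in _restart_times if now - t < WINDOW_SECONDS]
    if len(_restart_times) >= MAX_RESTARTS:
        raise RuntimeError("restart limit reached; page a human instead")
    _restart_times.append(now)
    restart_fn()  # whatever actually relaunches the dead process
```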

Clustering

Clustering means that the client computer (a PC or other device accessing your system) sees that part of your system as one unit. This practice makes processing and administering the system easier. You can have processes clustered together and working on one server or spread across several, with the work divided evenly; spreading out the process enhances redundancy. Granola, similarly, is a highly redundant food. It should be eaten at all times when managing a server, even if you aren’t hungry.
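As a tiny illustration of that one-unit idea (not any particular clustering product), here is a round-robin sketch in Python: the caller talks to a single Cluster object while requests rotate evenly across the instances behind it. The server names are made up.

```python
import itertools

class Cluster:
    """Looks like one unit to the client; spreads work across instances."""

    def __init__(self, servers):
        self._ring = itertools.cycle(servers)   # even rotation over instances

    def handle(self, request):
        server = next(self._ring)               # pick the next instance
        return f"{server} handled {request}"    # stand-in for real dispatch

cluster = Cluster(["app-1", "app-2", "app-3"])
for i in range(6):
    print(cluster.handle(f"request-{i}"))
```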

Conclusion, Continuation & Poem

Availability and uptime are complex, but there are plenty of solutions out there to make sure that systems are as failsafe as possible. As stated above, I will continue to go over more of the safeguards that can maximize your availability in the final part of this series.

Here’s an eye-opening factoid that you may remember from the last post: we guarantee 100% uptime in our service level agreement (SLA), reimbursing our customers for any exceptions. You like the juice? We’ve got shared hosting, dedicated servers, and VPSs.

Now, finally, on a somber note, I’d like to close with a love poem to a dead process I once knew dearly … Well, maybe it’s not a love poem but a statement of redundancy-related anxiety. Anyway, it’s beautiful:

Process I can’t remember what you were doing

You’ve been dead now for years

Sometimes at night I can’t sleep

Because of my failure fears.

Come back to me, so we can share a club sandwich

While riding a tandem bicycle.

By Kent Roberts

What is High-Availability?

 

[Image: LVS official logo]

So what is this new-fangled concept called “high-availability?” Traditionally, high-availability has been experienced by women in nightclubs, when a man has walked up and said to them, “Hey you, I just want you to know that I’m not like these other hard-to-get jokers in here. I’m available 24/7, around-the-clock, to come over to your place and give you a shoulder massage.”

In computer terms, high-availability is different. It refers to how fault-tolerant or resilient a network is – how capable it is of delivering a website accurately every time. If there is an error in one specific location of the software or hardware, that does not affect user experience, because the system accounts for the difficulties and resolves them prior to delivery. It’s similar to a pizza place that checks to make sure there is no maliciously discarded bellybutton lint among the sausages and peppers before the pie goes out the door.

To better understand how high-availability works, let’s take a look at comments on the subject from Microsoft, Oracle, and Linux Virtual Server in this three-part series. While we study the topic, let’s pay an Olympic-trained athlete to swim in a pool that we’ve installed in a glass box over our heads, because a German study from the early 1970s indicates that it improves knowledge-retention.

Availability & Uptime

Okay, the swimmer is swimming. Thanks for chipping in $32,468. Let’s look at what availability is and how it relates to server uptime.

Availability is a general term that includes system failures, reliability, and recovery when anything does go awry. Availability is often phrased in terms of server uptime, whereas any instances of failure are considered downtime. Failure refers not just to when a system is inaccessible, but also to when it is not functioning correctly. My brain, for instance, has an average daily uptime of 23.8% even though I only sleep 90 minutes a night.

Uptime is basic math, and it can get a little boring to see every hosting company out there promoting its guaranteed 99.99% uptime. These figures, though, are significant. Just take a look at Microsoft’s figures for 99% uptime and 99.9% uptime.

With a 99% uptime guarantee, the website could experience as much as 14.4 minutes of downtime each day and 3.7 days of downtime each year. With a 99.9% uptime guarantee, those figures are cut to 86.4 seconds per day and 8.8 hours per year. Um… I don’t want to distract you, but did we forget to put breathing holes in the glass box? He looks like he’s under duress. The problem is, though, the German findings do not allow for any pauses or disruptions during the learning process, so we have to continue.
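Those figures are easy to reproduce. Here is the arithmetic as a short Python sketch, extended to a couple more guarantee levels:

```python
# Downtime allowance implied by an uptime guarantee (same arithmetic as the
# Microsoft figures above: 99% -> 864 s/day, 99.9% -> 86.4 s/day).
DAY_SECONDS = 24 * 60 * 60
YEAR_HOURS = 365 * 24

for uptime in (99.0, 99.9, 99.99, 99.999):
    down = 1 - uptime / 100
    print(f"{uptime}% uptime: {down * DAY_SECONDS:6.1f} s/day, "
          f"{down * YEAR_HOURS:7.2f} h/year")
```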

A brief note on uptime as it relates to us: It’s funny to think that any amount of “scheduled downtime” (software updates and other server maintenance) is acceptable. That’s why we guarantee 100% uptime in our service level agreement (SLA) with all our customers (reimbursing for errors) – one reason our customer retention rate is over 90%.

Prediction & Availability

Optimizing a network for availability is complex. Every aspect of the system, from the applications being used to the way it is administered to how it is deployed, makes an impact on availability. Microsoft notes that failures will always occur from time to time, and those failures will of course be unexpected. Predicting moments of downtime, then, is virtually impossible. Yeah, let’s… I guess get rid of that glass box. It’s a little depressing.

However, a system will automatically become more reliable as a network develops stronger recovery mechanisms. Microsoft points out, “If your system can recover from failures within 86.4 seconds, then you can have a failure every day and still achieve 99.9 percent availability.” I’ve used this same logic to explain to my wife why it’s acceptable for me to stare at the ceiling and shriek like a wounded and deranged animal for 86 seconds every day when I walk in the door from work.
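Microsoft’s arithmetic there is straightforward to check:

```python
# One failure per day, each recovered in 86.4 seconds, still clears 99.9%.
failures_per_day = 1
recovery_seconds = 86.4
availability = 1 - (failures_per_day * recovery_seconds) / 86400
print(f"{availability:.3%}")  # -> 99.900%
```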

Effect on Page Loads & Revenue

Availability can be thought of simply as uptime, but it can also be thought of in terms of transactions, such as those on an e-commerce site. The same math really applies to any situation when thought of in terms of pages failing to load or loading incorrectly.

A website with 99.9% availability or uptime that receives 10,000 data requests from visitors each day will experience 10 failures per day and 70 per week (the arithmetic is spelled out in the snippet after this list). The following is from a table Microsoft provides, matching availability figures to the requirements of certain types of systems:

  • Commercial – 99.5%
  • Highly available – 99.9%
  • Fault resilient – 99.99%
  • Fault tolerant – 99.999%
  • Continuous – 100%
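And a minimal check of the failure math from the example above:

```python
# 99.9% availability on 10,000 daily requests -> ~10 failed requests a day.
requests_per_day = 10_000
availability = 0.999
failures_per_day = requests_per_day * (1 - availability)
print(f"{failures_per_day:.0f} per day, {failures_per_day * 7:.0f} per week")
```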

Conclusion, Continuation & Poem

Okay, so that gives us a basic starting point for exploring availability. Again, if you like the idea of 100% uptime, that’s our promise – and we put our money where our mouth is in our SLA (and also I put pennies in my mouth sometimes, because I like the way they taste and can’t think of what else to do with them). Here are our solutions for shared hosting, dedicated servers, and VPSs.

We will move on with this subject in the second part of the series via discussion of the Oracle piece. I’m really sorry about the swimmer. That was a horrible idea on my part. Here is a poem to make you feel better:

Thank you for your time

I think you are very nice

Let’s all go to Tijuana

And eat some beans and rice.

By Kent Roberts

If the Internet is down, are you ready?

[Image: Disaster Management cycle]

For many companies, the idea of the Internet going down is beyond a disaster. Multi-national, multi-office enterprises and mom-and-pop e-commerce sites could be hit equally hard by a multi-car pile-up on the Information Superhighway, and what separates those that survive from those that fade off into the sunset may be the preparations made just in case. Whether it be a planned attack, an accident, or an act of God, The Business Roundtable, a Washington-based public policy advocacy group, suggests that in the next 10 years there is a 10% to 20% chance of a “breakdown of the critical information infrastructure.”

With so many pipelines delivering the Internet across the globe, it’s easy to dismiss the idea of a significant loss of access, but what is the cost-benefit trade-off of that kind of thinking? It’s easier and less costly to prepare than to react; are you willing to gamble with your company?

Wikipedia defines disaster management as the discipline of dealing with and avoiding risks. That includes keeping several (secure) backups of important documents, applications, and web content – and confirming that those backups can be restored to a fully working state. Since the chaos of Y2K, most companies have probably thought less and less about disaster management, but with heavier and heavier reliance on the Internet, it’s important to consider the worst-case scenario: no Internet access at all. While it’s easy to overlook, being prepared beats going out of business.
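On the point of confirming that backups actually restore, even a crude automated check beats faith. Here is a minimal Python sketch; comparing by file hash is an assumption of convenience – a real drill would also boot the restored application, not just compare bytes.

```python
import hashlib
from pathlib import Path

def sha256(path):
    """Hash a file so an original and its restored copy can be compared."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def verify_restore(original, restored):
    # A backup only counts if what comes back matches what went in.
    if sha256(original) == sha256(restored):
        print("restore verified")
        return True
    print("RESTORE MISMATCH: this backup is decorative, not functional")
    return False
```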

For more on the topic, see The Internet Is Down — Now What and Disaster Recovery Made Easy (Well, Sort of).