Tag Archives: Linux Virtual Server

What is High-Availability? Part 3 – Additional Problem-Solving

 

High-availability, as I have discussed in the previous installments of this series, is a concept that has changed and grown over time. In the past, high-availability was the condition exhibited by a man in a dive bar in Duluth, Minnesota, systematically handing out his landscaping business card to all the female patrons with the words, “I have a lot to offer, and I hope you’ll give me a chance with your shrubbery.”

In the age of information technology, however, high-availability has become more reputable. In fact, high-availability is desired by all those conducting business online. It’s the nature of a system with very little downtime.

To review, optimizing an infrastructure for uptime is often wrongly considered to be, simply, an effort at preventing failures from occurring. Per Microsoft, it’s difficult and sometimes impossible to predict when failures will occur. High-availability involves a thorough focus on recovery, decreasing the length of any downtime instances. For this same reason, I run training drills so that when someone knocks my books out of my hands, I can pick them up before many of the other doctoral students notice.

To look at high-availability from a number of different perspectives, we’re looking at articles from Microsoft, Oracle, and Linux Virtual Server. Today, we are continuing to explore the Oracle piece, also briefly noting commentary from the Linux Virtual Server site.

While we review the idea of high-availability, let’s grab the keys to my father’s Cadillac, drive it out into the mountains, and make clucking and whirring noises to attract the Abominable Snowman. Then let’s offer him a fully-loaded bacon double-cheeseburger and tell him he’s the only one who understands us.

Availability: High-Availability Problem Solving, Continued

In the last post, we looked at comments by Oracle on various technologies that can be used to optimize availability. Let’s continue to look at additional safeguards that can be implemented so that a system is less likely to experience downtime. For the same reason, safety, we will wear full body armor on our trip and carry a sack of water balloons to throw at our beloved monster if he becomes enraged.

As a general rule of thumb, redundancy is the core component of recovery. When there are multiple instances operating simultaneously (active-active availability technology) and when additional systemic components are on standby to be activated as needed (active-passive availability technology), failure can, in a sense, become irrelevant. The system remains consistent throughout, just like the snoring soundtrack that will be playing on our boomboxes at home while we are on our critical mission.

Additional Local High-Availability Solutions

Let’s look at a few additional problem-solving tools for use on a local system, courtesy of Oracle.

Routing and state replication

Stateful applications should be able to replicate client state. That way, if the processes handling client requests fail, the application continues to run smoothly – similarly to a request to a Snowman to “calm down.”
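
To make the idea concrete, here is a minimal Python sketch of state replication. Everything in it (`SessionStore`, `ReplicatedSessions`, the session IDs) is a hypothetical name invented for illustration, not anything from the Oracle piece, and it assumes a simple scheme where every write goes to both copies.

```python
class SessionStore:
    """Hypothetical in-memory store for client session state."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}


class ReplicatedSessions:
    """Writes every client state to a primary store and to a replica."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def put(self, session_id, state):
        # Copy the state into both stores so that either one can serve it.
        self.primary.sessions[session_id] = dict(state)
        self.replica.sessions[session_id] = dict(state)

    def fail_over(self):
        # The process behind the primary died: promote the replica.
        self.primary, self.replica = self.replica, self.primary

    def get(self, session_id):
        return self.primary.sessions[session_id]


sessions = ReplicatedSessions(SessionStore("node-a"), SessionStore("node-b"))
sessions.put("snowman-42", {"cart": ["water balloons"]})
sessions.fail_over()
print(sessions.get("snowman-42"))
```

The client's state survives the switch because the replica already held a copy before anything failed.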

Failover

With load balancing in place, every instance has redundant counterparts. That way, when an instance fails, any requests that would otherwise be sent to that instance are instead forwarded to the other, still-functional instances.

Load balancing

If you have more than one instance of a server component intended for the same purpose, load balancing becomes possible, allowing work to be divided evenly among the instances. For that same reason, we will evenly distribute the water balloons.
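
Failover and load balancing fit together, so one sketch can show both. This is a minimal Python illustration, assuming round-robin as the balancing policy (the source does not name one); the `LoadBalancer` class and its method names are hypothetical.

```python
from itertools import cycle


class LoadBalancer:
    """Hypothetical round-robin balancer that skips failed instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._ring = cycle(self.instances)

    def mark_down(self, name):
        self.healthy.discard(name)

    def mark_up(self, name):
        if name in self.instances:
            self.healthy.add(name)

    def pick(self):
        # Walk the ring at most once, skipping instances marked as failed.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances")


lb = LoadBalancer(["a", "b", "c"])
print([lb.pick() for _ in range(3)])  # work is spread across all three
lb.mark_down("b")                     # instance "b" fails
print([lb.pick() for _ in range(4)])  # its requests go to the survivors
```

A real balancer would detect failures with health checks rather than wait to be told, but the forwarding logic is the same.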

Migration

Migration helps when a service allows only one running instance. If that instance fails, the service is started on a different member of the cluster. If necessary, the entire server process can be moved to another system in the cluster.

High-Availability Integration

Part of what makes redundancy difficult is the integrated nature of a system: one part relies on another. Availability must be integrated as well, meaning that those dependencies should not themselves become a cause of downtime. That’s why, when we get to the mountains, it’s every man for himself.

Patches & Rolling

Rolling patches through a cluster, one instance at a time, allows them to be installed and uninstalled without the need for downtime.
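
A rolling patch is easier to picture in code. The following Python sketch is only an illustration under my own assumptions: the function name `rolling_patch`, the node names, and the version strings are all hypothetical, and "patching" is reduced to changing a version label.

```python
def rolling_patch(versions, new_version):
    """Patch nodes one at a time and log how many kept serving.

    versions: dict mapping node name to its current version (hypothetical).
    Returns a list of (node, nodes_in_service_during_patch) tuples.
    """
    if len(versions) < 2:
        raise ValueError("need at least two nodes to patch without downtime")
    in_service = set(versions)
    log = []
    for node in list(versions):
        in_service.remove(node)       # drain: stop sending it requests
        versions[node] = new_version  # patch while the others keep serving
        log.append((node, len(in_service)))
        in_service.add(node)          # return the patched node to the pool
    return log


cluster = {"n1": "1.0", "n2": "1.0", "n3": "1.0"}
for node, serving in rolling_patch(cluster, "1.1"):
    print(f"patched {node} while {serving} other nodes stayed in service")
```

At no point does the number of serving nodes drop to zero, which is the whole trick.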

Configuration

In a cluster, configuration needs to be consistent. When configuration is administered properly, requests are handled in the same way regardless of which component is conducting the work. Configurations should also be synchronized, as should our water-balloon defensive maneuvers, and the administration itself should be conducted in a way that optimizes availability.
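
One way to check that configurations really are in sync is to compare a fingerprint of each node's settings. This Python sketch is an assumption of mine rather than anything Oracle prescribes: the `fingerprint` and `find_drift` helpers and the sample settings are hypothetical, and the most common configuration is treated as the correct one.

```python
import hashlib
import json
from collections import Counter


def fingerprint(config):
    """Hash a config dict in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def find_drift(configs):
    """Return the nodes whose config differs from the most common one."""
    prints = {node: fingerprint(cfg) for node, cfg in configs.items()}
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(node for node, fp in prints.items() if fp != majority)


configs = {
    "n1": {"port": 80, "workers": 4},
    "n2": {"workers": 4, "port": 80},    # same settings, different key order
    "n3": {"port": 8080, "workers": 4},  # drifted
}
print(find_drift(configs))
```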

Clustering & Nodes

As a final note on maintenance of high-availability, let’s take a brief look at the piece from Linux Virtual Server. It underscores the importance of clustering that is similarly advocated in the Oracle article.

Clustering, says the LVS site, allows for redundancy throughout all levels of the system – both hardware and software. The nodes within a cluster can all be running the same operating system and applications. When daemons or nodes fail, if seamless reconfiguration is in place, the additional nodes pick up the slack. We should remember this principle in the mountains, because Terry is coming along, and we all know he’s not great at throwing balloons.

Conclusion & Poem

You can see how extensively the notion of redundancy has been studied and how many technologies have been developed to allow the maximum possible uptime. High-availability, after all, is crucial to allowing businesses to continue to operate, regardless of whether something goes wrong at the level of the server.

Again, bear in mind our 100% uptime guarantee. This guarantee is available to all our shared hosting, dedicated server, and VPS clients.

One final poem in parting… This one, as you can imagine, goes out to the Abominable Snowman, and I personally hope he reads and enjoys it:

Hey you, please don’t eat us

We really think you are good-looking

Your political philosophy is sophisticated and respectable

And I heard you’re a whiz at squirrel cooking.

By Kent Roberts

What is High-Availability? Part 2 – Problem-Solving

 

2 node High Availability Cluster network diagram (Photo credit: Wikipedia)

High-availability, as we learned in the last installment, has changed conceptually since the days of yesteryear and, for that matter, even near-year. It no longer just refers to the full-access, all-hours, 24/7/365 immediate-response policies of a man looking for love in all the wrong places and some of the right ones. It’s no longer about a man with a well-groomed mustache offering shoulder massages at closing time.

No, in the world of computers, high-availability is a completely different matter. Instead, it deals specifically with the uptime of a network. To properly understand uptime, we must consider that it is not merely about eliminating instances of failure within a network (because, per Microsoft, failures are by their nature unpredictable). It is also about fast recovery, so that the system is not affected for an extended period. With sound recovery methods, data delivery remains consistent. That’s why I carry a slide-rule with me to re-straighten my hair part if someone gives me a noogie.

To bolster our understanding of high-availability in this tripartite miniseries, we are assessing the perspectives of Microsoft, Oracle, and Linux Virtual Server. Today, looking specifically at the Oracle article, we will discuss several problem-solving methods.

While we consider high-availability, let’s put on our Easter bonnets and throw eggs at passing cars, focusing especially on the ones with their windows down. We’ll only be 13 once.

Availability: Quick Review

Oracle defines high-availability as “the ability of users to access a system without loss of service.” Really, that seems like a definition of availability. High-availability means that scenario is occurring almost all of the time. Even in a highly redundant system, there will always be occasional errors and glitches. Regardless, a system in which availability is optimized is highly reliable and does not experience very much downtime. A good example of this, according to the women of Austin, Texas, is my reproductive system.

Downtime can be thought of as scheduled and unscheduled. When it is unscheduled, the downtime is due to some type of systemic failure. When it is scheduled, users can be notified that upgrades or other system administration is being conducted (as with a hosting company and its clients, or with a website posting a notice to visitors). “Scheduled downtime typically occurs late at night, when traffic is light, all right, baby, all right,” crooned Barry Manilow.

High-Availability Problem Solving

Various types of problems can of course occur in a system. Common failures include those occurring within processors, nodes, and various forms of media. Human error can also cause failures, as can monkey and camel error. Availability can be kept high by focusing both on localized problem-solving and on methods of recovery in the event of a natural disaster, such as flooding or datacenter technician stampede.

Different sorts of best practices and technological solutions can help to make high-availability a reality. Redundancy, says Oracle, is the most important parameter to enhance availability: “High availability comes from redundant systems and components.” The same parameter applies to the man with the well-groomed mustache mentioned above, as he repeats the same psychosexual sales pitch over and over again, optimizing his systemic redundancy. Looking at solutions for localized high-availability in terms of redundancy splits potential fixes into active-active and active-passive groups.

  1. Active-active availability mechanisms: These mechanisms allow better scalability along with increased availability. Two or more instances are active at once, all handling requests simultaneously.
  2. Active-passive availability mechanisms: In this scenario, sometimes called cold failover clusters, one system instance is handling requests and the other one is sitting and pondering, running its finger through its hair, waiting patiently to be called into action. It chews gum and looks sullen. Clustering is used to integrate the two instances, with the clustering agent monitoring the active instance and switching over to the passive one as necessary.
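
The active-passive arrangement can be sketched in a few lines of Python. This is a toy illustration, not how any particular clustering product works: the `Instance` and `ClusterAgent` classes are hypothetical, and "monitoring" is reduced to reading a `healthy` flag.

```python
class Instance:
    """Hypothetical system instance that can be healthy or down."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ClusterAgent:
    """Hypothetical clustering agent for an active-passive pair."""

    def __init__(self, active, passive):
        self.active = active
        self.passive = passive

    def check(self):
        # Promote the standby if the active instance has failed.
        if not self.active.healthy and self.passive.healthy:
            self.active, self.passive = self.passive, self.active

    def handle(self, request):
        self.check()
        return self.active.handle(request)


primary, standby = Instance("primary"), Instance("standby")
agent = ClusterAgent(primary, standby)
print(agent.handle("request 1"))
primary.healthy = False           # the active instance dies
print(agent.handle("request 2"))  # the gum-chewing standby takes over
```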

Other Local High-Availability Solutions

Other safeguards should be in place to make sure your availability is as reliable as possible. Here are a few examples; we will proceed with more in the final part of this series:

Automatic restart & process death detection

You don’t want the system to continually restart multiple times in a relatively short window. Restarting can lead to additional failure. Technology should be in place to disallow repetitive, automated restarts. The same principle applies to excessively restarting one’s day. You should never get in and out of bed more than two dozen times before proceeding to breakfast.

Processes can die due to systemic errors. If processes are problematic, you do want a restart to be in place to give the process another chance. Don’t give it 10,000 chances though. Processes are greedy about grabbing all the chances.
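
Here is one way to give a process another chance without giving it 10,000 of them. The Python sketch below assumes a sliding time window as the limiting rule; the `RestartLimiter` class, its parameters, and the numbers in the example are all hypothetical choices of mine.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within window_seconds."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.restart_times = deque()

    def allow_restart(self, now):
        # Forget restarts that have fallen out of the time window.
        while self.restart_times and now - self.restart_times[0] >= self.window_seconds:
            self.restart_times.popleft()
        if len(self.restart_times) >= self.max_restarts:
            return False  # too many recent restarts: stop and alert a human
        self.restart_times.append(now)
        return True


limiter = RestartLimiter(max_restarts=3, window_seconds=60)
for second in (0, 10, 20, 30, 61):
    verdict = "restart" if limiter.allow_restart(second) else "give up"
    print(f"process died at t={second}s: {verdict}")
```

In a real monitor, `now` would come from a clock such as `time.monotonic()`; passing it in keeps the sketch easy to follow.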

Clustering

Clustering means that the client computer (PC or other device accessing your system) will consider that part of your system to be one unit. This practice makes processing and administering the system easier. You can have processes clustered together and working on one server or on various servers, with the work divided evenly. It enhances redundancy by spreading out the process. Granola, similarly, is a highly redundant food. It should be eaten at all times when managing a server, even if you aren’t hungry.

Conclusion, Continuation & Poem

Availability and uptime are complex, but there are plenty of solutions out there to make sure that systems are as failsafe as possible. As stated above, I will continue to go over more of the safeguards that can maximize your availability in the final part of this series.

Here’s an eye-opening factoid that you may remember from the last post: we guarantee 100% uptime in our service level agreement (SLA), reimbursing our customers for any exceptions. You like the juice? We’ve got shared hosting, dedicated servers, and VPSs.

Now, finally, on a somber note, I’d like to close with a love poem to a dead process I once knew dearly … Well, maybe it’s not a love poem but a statement of redundancy-related anxiety. Anyway, it’s beautiful:

Process I can’t remember what you were doing

You’ve been dead now for years

Sometimes at night I can’t sleep

Because of my failure fears.

Come back to me, so we can share a club sandwich

While riding a tandem bicycle.

By Kent Roberts

What is High-Availability?

 

So what is this new-fangled concept called “high-availability?” Traditionally, high-availability has been experienced by women in nightclubs, when a man has walked up and said to them, “Hey you, I just want you to know that I’m not like these other hard-to-get jokers in here. I’m available 24/7, around-the-clock, to come over to your place and give you a shoulder massage.”

In computer terms, high-availability is different. It refers to how fault-tolerant or resilient a network is, how capable it is of delivering a website accurately every time. If there is an error in one specific location of the software or hardware, that does not affect user experience because the system accounts for the difficulties and resolves them prior to delivery. It is similar to a pizza place that checks to make sure there is no maliciously discarded bellybutton lint among the sausages and peppers before the pie goes out the door.

To better understand how high-availability works, let’s take a look at comments on the subject from Microsoft, Oracle, and Linux Virtual Server in this three-part series. While we study the topic, let’s pay an Olympic-trained athlete to swim in a pool that we’ve installed in a glass box over our heads, because a German study from the early 1970s indicates that it improves knowledge-retention.

Availability & Uptime

Okay, the swimmer is swimming. Thanks for chipping in $32,468. Let’s look at what availability is and how it relates to server uptime.

Availability is a general term that includes system failures, reliability, and recovery when anything does go awry. Availability is often phrased in terms of server uptime, whereas any instances of failure are considered downtime. Failure refers not just to when a system is inaccessible, but also to when it is not functioning correctly. My brain, for instance, has an average daily uptime of 23.8% even though I only sleep 90 minutes a night.

Uptime is basic math, and it can get a little boring to see every hosting company out there promoting their guaranteed 99.99% uptime. These figures, though, are significant. Just take a look at Microsoft’s figures for 99% uptime and 99.9% uptime.

With a 99% uptime guarantee, the website could experience as much as 14.4 minutes of downtime each day and 3.7 days of downtime each year. With a 99.9% uptime guarantee, those figures are cut to 86.4 seconds per day and 8.8 hours per year. Um… I don’t want to distract you, but did we forget to put breathing holes in the glass box? He looks like he’s under duress. The problem is, though, the German findings do not allow for any pauses or disruptions during the learning process, so we have to continue.
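
Since uptime is basic math, the figures are easy to check. This is a small Python sketch; the function name `downtime_budget` is hypothetical, and it assumes a 365-day year.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: a non-leap year


def downtime_budget(uptime_percent):
    """Return allowed downtime as (seconds per day, hours per year)."""
    down_fraction = 1 - uptime_percent / 100
    per_day_seconds = down_fraction * SECONDS_PER_DAY
    per_year_hours = down_fraction * DAYS_PER_YEAR * 24
    return per_day_seconds, per_year_hours


for pct in (99.0, 99.9):
    day_s, year_h = downtime_budget(pct)
    print(f"{pct}% uptime: {day_s:.1f} s/day, {year_h:.1f} h/year")
```

For 99% this works out to 864 seconds (14.4 minutes) per day and 87.6 hours (about 3.7 days) per year; for 99.9% it is 86.4 seconds per day and roughly 8.8 hours per year.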

A brief note on uptime as it relates to us: It’s funny to think that any amount of “unscheduled downtime” (software updates and other server maintenance) is acceptable. That’s why we guarantee 100% uptime in our service level agreement (SLA) with all our customers (reimbursing for errors) – one reason our customer retention rate is over 90%.

Prediction & Availability

Optimizing for availability of a network is complex. Every aspect of the system, from the applications being used to the way it is administered to how it’s deployed, has an impact on availability. Microsoft notes that failures will always occur from time to time, and those failures will of course be unexpected. Predicting moments of downtime, then, is virtually impossible. Yeah, let’s… I guess get rid of that glass box. It’s a little depressing.

However, a system becomes more available as its recovery mechanisms get stronger. Microsoft points out, “If your system can recover from failures within 86.4 seconds, then you can have a failure every day and still achieve 99.9 percent availability.” I’ve used this same logic to explain to my wife why it’s acceptable for me to stare at the ceiling and shriek like a wounded and deranged animal for 86 seconds every day when I walk in the door from work.
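
The quote can be turned around into a formula: availability depends on how often things fail and how long each recovery takes. A minimal Python sketch follows; the function name and the sample recovery times are my own assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def availability_percent(failures_per_day, recovery_seconds):
    """Availability implied by a failure rate and a recovery time."""
    downtime = failures_per_day * recovery_seconds
    # Clamp at zero: a system cannot be down for more than the whole day.
    return max(0.0, 100 * (1 - downtime / SECONDS_PER_DAY))


# One failure per day, with progressively faster recovery.
for recovery in (864, 86.4, 8.64):
    print(f"{recovery} s recovery: {availability_percent(1, recovery):.2f}% available")
```

Same failure rate, faster recovery, higher availability: that is the point Microsoft is making.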

Effect on Page Loads & Revenue

Availability can be thought of simply as uptime, but it can also be thought of in terms of transactions, such as those on an e-commerce site. The same math really applies to any situation when thought of in terms of pages failing to load or loading incorrectly.

A website with 99.9% availability or uptime that receives 10,000 data requests from visitors each day will experience 10 failures per day and 70 per week. The following is from a table Microsoft provides defining different availability figures as fulfilling the requirements of certain types of systems:

  • Commercial – 99.5%
  • Highly available – 99.9%
  • Fault resilient – 99.99%
  • Fault tolerant – 99.999%
  • Continuous – 100%
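
Both the failed-request math and the list above fit in a short Python sketch. The thresholds are copied from the list; the function names and the "Unclassified" label for anything under 99.5% are my own assumptions.

```python
# Thresholds copied from the list above, highest first.
AVAILABILITY_CLASSES = [
    (100.0, "Continuous"),
    (99.999, "Fault tolerant"),
    (99.99, "Fault resilient"),
    (99.9, "Highly available"),
    (99.5, "Commercial"),
]


def classify(availability_percent):
    """Return the highest class whose threshold the figure meets."""
    for threshold, label in AVAILABILITY_CLASSES:
        if availability_percent >= threshold:
            return label
    return "Unclassified"  # assumption: label for anything below 99.5%


def expected_failures(requests, availability_percent):
    """Number of requests expected to fail at a given availability."""
    return requests * (100 - availability_percent) / 100


per_day = expected_failures(10_000, 99.9)
print(classify(99.9), round(per_day), round(per_day * 7))
```

With 10,000 requests a day at 99.9%, that comes to 10 failures per day and 70 per week, matching the figures above.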

Conclusion, Continuation & Poem

Okay, so that gives us a basic starting point for exploring availability. Again, if you like the idea of 100% uptime, that’s our promise – and we put our money where our mouth is in our SLA (and also I put pennies in my mouth sometimes, because I like the way it tastes and can’t think of what else to do with them). Here are our solutions for shared hosting, dedicated servers, and VPSs.

We will move on with this subject in the second part of the series via discussion of the Oracle piece. I’m really sorry about the swimmer. That was a horrible idea on my part. Here is a poem to make you feel better:

Thank you for your time

I think you are very nice

Let’s all go to Tijuana

And eat some beans and rice.

By Kent Roberts