What is High-Availability? Part 2 – Problem-Solving

 

 2 node High Availability Cluster network diagram (Photo credit: Wikipedia)

High-availability, as we learned in the last installment, has changed conceptually since the days of yesteryear and, for that matter, even near-year. It no longer just refers to the full-access, all-hours, 24/7/365 immediate-response policies of a man looking for love in all the wrong places and some of the right ones. It’s no longer about a man with a well-groomed mustache offering shoulder massages at closing time.

No, in the world of computers, high-availability is a completely different matter. Instead, it deals specifically with the uptime of a network. To properly understand uptime, we must consider that it is not merely about eliminating incidents of failure within a network (because, per Microsoft, failures are by their nature unpredictable). Rather, it is also about recovering quickly so that the system is not affected for an extended period. With sound recovery methods, data delivery remains consistent. That’s why I carry a slide-rule with me to re-straighten my hair part if someone gives me a noogie.

To bolster our understanding of high-availability in this tripartite miniseries, we are assessing the perspectives of Microsoft, Oracle, and Linux Virtual Server. Today, looking specifically at the Oracle article, we will discuss several problem-solving methods.

While we consider high-availability, let’s put on our Easter bonnets and throw eggs at passing cars, focusing especially on the ones with their windows down. We’ll only be 13 once.

Availability: Quick Review

Oracle defines high-availability as “the ability of users to access a system without loss of service.” Really, that seems like a definition of availability. High-availability means that scenario is occurring almost all of the time. Even in a highly redundant system, there will always be occasional errors and glitches. Regardless, a system in which availability is optimized is highly reliable and does not experience very much downtime. A good example of this, according to the women of Austin, Texas, is my reproductive system.

Downtime can be thought of as scheduled and unscheduled. When it is unscheduled, the downtime is due to some type of systemic failure. When it is scheduled, users can be notified that upgrades or other system administration is being conducted (as with a hosting company and its clients, or with a website posting a notice to visitors). “Scheduled downtime typically occurs late at night, when traffic is light, all right, baby, all right,” crooned Barry Manilow.
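To put a number on "almost all of the time," here is a minimal sketch in Python. The function name, its parameters, and the choice of whether scheduled downtime counts against you are all my own assumptions for illustration, not anything from the Oracle article; a real SLA will spell out its own rules.

```python
def availability_percent(period_hours, unscheduled_hours,
                         scheduled_hours=0.0, count_scheduled=False):
    """Return the percentage of a period during which the system was up.

    Hypothetical helper: whether scheduled maintenance counts as downtime
    is a policy choice, so it is exposed as a flag (assumed default: no).
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if unscheduled_hours < 0 or scheduled_hours < 0:
        raise ValueError("downtime hours cannot be negative")

    downtime = unscheduled_hours
    if count_scheduled:
        downtime += scheduled_hours
    if downtime > period_hours:
        raise ValueError("downtime cannot exceed the period")

    return 100.0 * (period_hours - downtime) / period_hours


HOURS_PER_YEAR = 365 * 24

# A year with 8.76 hours of unplanned outage and 4 hours of planned maintenance.
print(round(availability_percent(HOURS_PER_YEAR, 8.76, 4.0), 3))
print(round(availability_percent(HOURS_PER_YEAR, 8.76, 4.0, count_scheduled=True), 3))
```

With only the unplanned outage counted, that year works out to roughly 99.9 percent; counting the maintenance window as well pulls the figure a little lower.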

High-Availability Problem Solving

Various types of problems can of course occur in a system. Common failures include those occurring in processors, nodes, and various forms of media. Human error can also cause failures, as can monkey and camel error. Availability can be kept high by focusing both on localized problem-solving and on methods of recovery in the event of a natural disaster, such as flooding or datacenter technician stampede.

Different sorts of best practices and technological solutions can help to make high-availability a reality. Redundancy, says Oracle, is the most important parameter to enhance availability: “High availability comes from redundant systems and components.” The same parameter applies to the man with the well-groomed mustache mentioned above, as he repeats the same psychosexual sales pitch over and over again, optimizing his systemic redundancy. In terms of redundancy, solutions for localized high-availability split into two groups: active-active and active-passive.

  1. Active-active availability mechanisms: These mechanisms allow better scalability along with increased availability. Two or more system instances are active at once, all handling requests concurrently.
  2. Active-passive availability mechanisms: In this scenario, sometimes called cold failover clusters, one system instance is handling requests and the other one is sitting and pondering, running its finger through its hair, waiting patiently to be called into action. It chews gum and looks sullen. Clustering is used to integrate the two instances, with the clustering agent monitoring the active instance and switching over to the passive one as necessary.
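The active-passive arrangement can be sketched in a few lines of Python. This is a toy model under my own assumptions: the `Instance` and `ActivePassiveCluster` classes and the node names are hypothetical, and the "health check" is just a boolean flag rather than anything a real clustering agent would do.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical system instance with a simple health flag."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActivePassiveCluster:
    """Toy clustering agent: one instance serves, the other waits."""

    def __init__(self, active: Instance, passive: Instance) -> None:
        self.active = active
        self.passive = passive

    def check_and_failover(self) -> bool:
        """Swap roles if the active instance is unhealthy.

        Returns True when a switchover happened.
        """
        if self.active.healthy:
            return False
        if not self.passive.healthy:
            raise RuntimeError("no healthy instance available")
        self.active, self.passive = self.passive, self.active
        return True

    def handle(self, request: str) -> str:
        # Monitor the active instance before every request.
        self.check_and_failover()
        return self.active.handle(request)


primary = Instance("node-a")
standby = Instance("node-b")
cluster = ActivePassiveCluster(primary, standby)

print(cluster.handle("request 1"))  # served by node-a
primary.healthy = False             # simulate a failure
print(cluster.handle("request 2"))  # served by node-b after failover
```

The caller only ever talks to the cluster, so the switch from the working instance to the sullen gum-chewer is invisible from the outside.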

Other Local High-Availability Solutions

Other safeguards should be in place to make sure your availability is as reliable as possible. Here are a few examples; we will proceed with more in the final part of this series:

Automatic restart & process death detection

You don’t want the system to restart over and over within a short window. Restarting can lead to additional failure. Technology should be in place to disallow repetitive, automated restarts. The same principle applies to excessively restarting one’s day. You should never get in and out of bed more than two dozen times before proceeding to breakfast.

Processes can die due to systemic errors. If processes are problematic, you do want a restart to be in place to give the process another chance. Don’t give it 10,000 chances though. Processes are greedy about grabbing all the chances.
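One way to give a process another chance without giving it 10,000 is to cap restarts inside a sliding time window. The sketch below is a minimal illustration in Python; the `RestartLimiter` class, the limit of three, and the 60-second window are my own assumptions, not settings from any particular product.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window_seconds span."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recently allowed restarts

    def allow_restart(self, now: float) -> bool:
        """Record and permit a restart at time `now`, or refuse it."""
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()

        if len(self._restarts) >= self.max_restarts:
            return False  # too many recent restarts; leave the process down

        self._restarts.append(now)
        return True


# Assumed policy for illustration: three restarts per 60 seconds.
limiter = RestartLimiter(max_restarts=3, window_seconds=60)
for t in (0, 10, 20, 30, 61):
    print(t, limiter.allow_restart(now=t))
```

Timestamps are passed in rather than read from a clock so the policy is easy to test; in practice something like `time.monotonic()` would supply `now`. A refused restart is the point where a human should be called in.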

Clustering

Clustering means that the client computer (PC or other device accessing your system) will consider that part of your system to be one unit. This practice makes processing and administering the system easier. You can have processes clustered together and working on one server or on various servers, with the work divided evenly. It enhances redundancy by spreading out the process. Granola, similarly, is a highly redundant food. It should be eaten at all times when managing a server, even if you aren’t hungry.
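Here is a small Python sketch of that idea: the client calls one object, and the work is handed to the nodes in turn. Round-robin is just one simple way to divide work evenly, chosen by me for illustration; the `RoundRobinCluster` class and node names are hypothetical.

```python
from itertools import cycle


class RoundRobinCluster:
    """Presents several nodes to the client as a single unit."""

    def __init__(self, node_names):
        node_names = list(node_names)
        if not node_names:
            raise ValueError("a cluster needs at least one node")
        # cycle() repeats the nodes forever, in order.
        self._nodes = cycle(node_names)

    def handle(self, request: str) -> str:
        # The client never picks a node; the cluster picks the next one.
        node = next(self._nodes)
        return f"{node} handled {request}"


cluster = RoundRobinCluster(["node-a", "node-b", "node-c"])
for i in range(1, 5):
    print(cluster.handle(f"request {i}"))
```

The fourth request wraps around to the first node again, so over time each node sees about the same share of the work.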

Conclusion, Continuation & Poem

Availability and uptime are complex, but there are plenty of solutions out there to make sure that systems are as failsafe as possible. As stated above, I will continue to go over more of the safeguards that can maximize your availability in the final part of this series.

Here’s an eye-opening factoid that you may remember from the last post: we guarantee 100% uptime in our service level agreement (SLA), reimbursing our customers for any exceptions. You like the juice? We’ve got shared hosting, dedicated servers, and VPSs.

Now, finally, on a somber note, I’d like to close with a love poem to a dead process I once knew dearly … Well, maybe it’s not a love poem but a statement of redundancy-related anxiety. Anyway, it’s beautiful:

Process I can’t remember what you were doing

You’ve been dead now for years

Sometimes at night I can’t sleep

Because of my failure fears.

Come back to me, so we can share a club sandwich

While riding a tandem bicycle.

By Kent Roberts
