How do you define availability in a way that makes sense? People typically talk in terms of nines of availability, which expresses uptime as a percentage of total system time. Companies aiming for high availability (HA) talk about four nines (less than an hour of downtime a year), which is what we strived for during my time at Netflix. Five nines availability, which allows only about five minutes of downtime per year, is ambitious; there is probably not one company that currently hits that standard.
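The arithmetic behind those nines is easy to check. Here is a minimal sketch in plain Python (the only assumption is a 365.25-day year) that converts a number of nines into allowed downtime per year:

```python
def downtime_per_year(nines):
    """Allowed downtime for a given number of nines, assuming a 365.25-day year."""
    availability = 1 - 10 ** -nines        # e.g. 4 nines -> 0.9999
    minutes_per_year = 365.25 * 24 * 60
    return minutes_per_year * (1 - availability)

for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_per_year(n):,.1f} minutes of downtime per year")
# 2 nines ~ 5,259.6 min (~3.7 days), 3 nines ~ 526.0 min (~8.8 h),
# 4 nines ~ 52.6 min, 5 nines ~ 5.3 min
```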
But as more and more critical services come online, it's reasonable to assume that the industry will shift toward higher standards, especially given the cost of downtime. For most companies, getting from two nines to three nines provides a significant improvement and has a notable business impact, producing more reliable systems. One way to get there is chaos engineering, which involves injecting failures intentionally to test system resilience and learn from the experience.
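As a purely illustrative sketch (the decorator, the failure rate, and fetch_recommendations are all hypothetical, not any particular chaos tool), fault injection can be as simple as making a dependency fail some fraction of the time and watching whether the rest of the system degrades gracefully:

```python
import functools
import random

def inject_failures(rate=0.05, exc=ConnectionError):
    """Fail a given fraction of calls, the way a flaky dependency might."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if random.random() < rate:
                raise exc(f"chaos: injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return inner
    return wrap

# Hypothetical downstream call, wrapped so roughly 10% of calls fail.
@inject_failures(rate=0.10)
def fetch_recommendations(user_id):
    return ["item-1", "item-2"]
```

Run experiments like this in a test or staging environment first; the point is to surface weaknesses before real failures do.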
Moving to the next level is more difficult, both to achieve and maintain. So the big question is: How many nines are enough for your situation? Ultimately what your end users and customers care about is whether your site is available when they want or need to use it. At the same time, though, nines are an easy shorthand way to talk about the real issue: reliability. You need to track reliability somehow, but you want to do so while keeping the emphasis where it should be: on customer satisfaction and experience.
It isn't the nines that matter so much as the end-user experience, so let's move the discussion beyond "How many nines do I need?" For example, Amazon's EC2 cloud service offers an availability SLA, but it includes much of the expected fine print.
A jump from three-nines to four-nines may save millions, and is a far easier business case: at a hypothetical cost of $250,000 per hour of downtime, cutting annual downtime from roughly 8.8 hours (three-nines) to under an hour (four-nines) is worth around two million dollars a year. Another question is: does the business really need such high availability? Here mainframes may be victims of their own reliability. If a system has no outages for two years, the business will expect and demand that this continue. But the fact remains that outages still happen.
Most businesses with mature mainframe-based applications have developed procedures for handling an outage. For example, ATMs may provide limited service while the back-end host is unavailable. Similarly, retail stores keep a manual way of accepting credit card payments. The fact is that five-nines availability for an entire computing service is impossible to guarantee.
There is too little room for error, and Black Swan (unexpected) events are impossible to eliminate. Many service providers advertising five-nines availability use fine print to make it an easier target.
Four-nines, or even three-nines, may give them what they need for a far smaller price tag.
Stratus, for example, publishes a worldwide uptime meter for its ftServer systems. The total availability of all ftServer systems is computed and updated daily, based on all identified service incidents of any duration for all supported ftServer systems worldwide during the preceding six-month period. Both hardware-related and software-related incidents are included in this availability measurement.
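As a rough sketch of how a fleet-wide figure like that can be computed (the incident and system counts below are made up), availability over a rolling window is simply one minus total incident downtime divided by total possible system-time:

```python
from datetime import timedelta

def fleet_availability(incident_durations, num_systems, window=timedelta(days=182)):
    """Availability across a fleet over a rolling window.

    incident_durations: one timedelta per service incident (hardware- or
    software-related) across all monitored systems in the window.
    """
    total_system_time = num_systems * window
    downtime = sum(incident_durations, timedelta())
    return 1 - downtime / total_system_time

# Made-up numbers: 40 incidents averaging 30 minutes across 5,000 systems.
incidents = [timedelta(minutes=30)] * 40
print(f"{fleet_availability(incidents, 5000):.7f}")   # ~0.9999991
```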
Adding complexity can lead to some great features and benefits, but it decreases the chances of extreme uptime. An average engineer can build a static web page with a few pieces of text that enjoys much higher uptime than Facebook.
Building for high availability comes down to a series of tradeoffs and sacrifices: complexity versus uptime. One of the foundations of high availability is eliminating single points of failure. A single point of failure is any component whose failure would cause the rest of the system to fail. The first example of this is load balancing between multiple servers: users send traffic, and it gets served by either server 1 or server 2.
The load balancer detects when server 1 is offline and sends all traffic to server 2. The most common strategy for failure monitoring and detection in redundant systems is this top-to-bottom technique: the top layer (in this case, the load balancer) does the work of monitoring the layer below. If a problem is detected, the load balancer redirects traffic to server 2. But now we have a new version of the same problem we started with: a single point of failure.
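To make that top-to-bottom monitoring concrete, here is a minimal sketch of the routing decision, assuming two backends and a /healthz endpoint (real load balancers add retries, timeouts, and connection draining):

```python
import random
import urllib.error
import urllib.request

# Assumed backend pool and health endpoint; substitute your own.
BACKENDS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]

def is_healthy(base_url, timeout=2.0):
    """Top-to-bottom check: the layer above probes the layer below."""
    try:
        with urllib.request.urlopen(base_url + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_backend():
    """Route only to backends that pass the health probe.

    If server 1 is down, every request goes to server 2, and vice versa.
    """
    healthy = [b for b in BACKENDS if is_healthy(b)]
    if not healthy:
        raise RuntimeError("no healthy backends available")
    return random.choice(healthy)
```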
What happens if our load balancer fails? We could get around it by adding another load balancer in the top layer. Failover to a redundant load balancer via a DNS change can take a while, because clients and resolvers cache DNS records. A faster approach is a static IP address that floats between the load balancers.
If one is unavailable, traffic will be handled by the other load balancer. This setup is the foundation of how most teams build for high availability. Below the servers there may be databases and more servers, and you may have more redundancies and backup tools throughout the stack.
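If you want to see the floating-IP idea in code, here is a heavily simplified sketch; every name in it (the addresses, the /healthz path, reassign_floating_ip) is hypothetical and stands in for whatever your provider's API or a keepalived/VRRP setup actually provides:

```python
import time
import urllib.error
import urllib.request

# Hypothetical addresses for illustration only.
FLOATING_IP = "203.0.113.10"
PRIMARY_LB = "http://10.0.0.2"
STANDBY_LB = "http://10.0.0.3"

def lb_is_up(url, timeout=2.0):
    try:
        with urllib.request.urlopen(url + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def reassign_floating_ip(ip, target):
    # Placeholder: call your provider's API here to remap the floating IP.
    print(f"remapping {ip} -> {target}")

def watch(poll_seconds=5):
    """Poll the primary; if it stops answering, move the floating IP."""
    while True:
        if not lb_is_up(PRIMARY_LB) and lb_is_up(STANDBY_LB):
            reassign_floating_ip(FLOATING_IP, STANDBY_LB)
        time.sleep(poll_seconds)
```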