Friday, January 22, 2010

On Redundancy - Part 1

In IT, the word "redundancy" is probably used more than in any other industry. A fine example of that would be the invention of Lord Redund Redund on Uncyclopedia, an article that was almost certainly written by a hardcore IT geek.

There are multiple uses for true redundancy when it comes to IT, some more legitimate than others, but in essence it all boils down to one acronym: CYA. In fact, as someone who represents a utility computing company from a technical standpoint, I sometimes find it hilarious to hear statements such as "our system has redundancy upon redundancy upon redundancy" used as a sales pitch, especially considering that the statement itself is highly redundant.

So, let's take a look at forms of redundancy. Probably the closest synonym for it in the IT world would be "clustering" or "pooling", depending on the purpose. Clustering means that multiple distinct systems in the same network are designed in such a way that they appear as a single system to an outside observer. The key property is that the system our observer sees can sustain the failure of at least one component, and easily recover its full capacity once that component is replaced. For example, clusters of HPC machines used in supercomputers can lose half of the infrastructure (which amounts to thousands of server nodes) without ever going offline - they will just operate at half capacity until replacements arrive or additional provisions are made.
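To make the "single system that degrades instead of dying" idea concrete, here is a minimal Python sketch. The `Node` and `Cluster` classes, the round-robin dispatch and the node names are all made up for illustration; a real HPC scheduler is far more involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical cluster member; "healthy" stands in for a real health check.
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    # The outside observer only ever talks to the Cluster, never to a Node.
    nodes: List[Node] = field(default_factory=list)
    _next: int = 0

    def capacity(self) -> float:
        """Fraction of nodes currently able to take work."""
        if not self.nodes:
            return 0.0
        return sum(n.healthy for n in self.nodes) / len(self.nodes)

    def dispatch(self, job: str) -> str:
        """Hand the job to the next healthy node, round-robin."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.healthy:
                return f"{job} -> {node.name}"
        raise RuntimeError("cluster is offline: no healthy nodes")
```

With four nodes and two of them marked unhealthy, `capacity()` drops to 0.5 while `dispatch()` keeps handing out work to the survivors; only when every node is down does the cluster as a whole go offline.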

In Enterprise IT, clustering is generally used in a different fashion. Instead of having a single component of the infrastructure perform a business-critical task such as routing, DNS, Active Directory or countless others, we deploy at least two devices that can instantly switch roles and do each other's job.
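The role switch described above can be sketched as a simple active/standby pair. This is only an illustration: `Device`, `FailoverPair` and the "dns-a"/"dns-b" names are hypothetical, and failure is detected here by a failed request, whereas real products typically rely on heartbeats between the two devices.

```python
class Device:
    """Hypothetical device providing a critical service (DNS, routing, ...)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverPair:
    """Two devices behind one front: one active, one standing by."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # The active device failed: swap roles and retry once.
            # If the standby is down too, the error reaches the caller.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)
```

A caller keeps using `pair.handle(...)` the whole time; if `dns-a` dies, the next request is answered by `dns-b`, which stays active until it fails in turn.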

In this case, the highest and most respected level of clustering would be "Concurrent Runtime" clustering, which means that three or more systems receive the exact same inputs at exactly the same time and execute the same code concurrently. The outputs are then compared, and if there are any discrepancies between systems, a node majority decides which output is accepted as the correct outcome of the action. Clusters of this level are so robust (and expensive) that they are only used in very specialized systems such as Space Shuttle components, although select variations make their way into the Enterprise realm in the form of mainframes, business-critical database server setups, core routers and other devices where the need for near-100% availability far outweighs the already prohibitive cost.
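The majority decision is easy to sketch in Python. Note the assumptions: `majority_vote` and `run_replicated` are names invented for this example, the replicas are plain functions called one after another rather than truly concurrent systems, and outputs must be hashable so they can be counted.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output that a strict majority of replicas agree on."""
    if len(outputs) < 3:
        raise ValueError("need at least three replicas to outvote a faulty one")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: replicas disagree too much")
    return value


def run_replicated(replicas, *args):
    """Feed the same inputs to every replica, then vote on the results."""
    return majority_vote([replica(*args) for replica in replicas])
```

For instance, with two replicas that double their input and one faulty replica that is off by one, `run_replicated` still returns the doubled value, because the two healthy replicas outvote the faulty one. With three replicas that all disagree there is no majority, and the sketch raises an error rather than guessing.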

In the next part of the series, the roles of the more widely used clusters will be explained in greater detail.
