IT Crashes: January 2010

Thursday, January 28, 2010

On Backups

On a late December night, a very sad girl was riding home in a half-empty bus. She was absently gazing out the window at the streets rushing by, letting out an occasional quiet sob. Tears were slowly flowing down her cheeks. It seemed like nothing that was going on around her really mattered... The grief made her lose track of countless overtime hours, and she couldn't even remember the last time she ate...

Her sorrow was unthinkable: a file server she was responsible for crashed, and the backup turned out to be corrupt.

Friday, January 22, 2010

On Virtualization

Surely, most folks in IT have heard about the trend of virtualizing servers in order to save rack space, create an additional level of High Availability, and simplify server management. I jumped on the bandwagon a while ago, but have been careful enough to first virtualize the least critical servers and see how they perform in the long term, before even thinking about virtualizing the high-load production servers.

In my case, the perfect candidates for wide-scale production server virtualization were Terminal Servers, for the following reasons: the load on them is highly predictable, they are mostly memory-bound if there is a restriction on the number of logged-on users, and most importantly they are farm animals. That last point means that if there are any problems with the virtual servers, we can always go back to the idle physical ones, assuming we haven't already pulled them out of the datacenter.

There was an excellent study published by the LoginVSI project showing that terminal servers scale almost linearly in a virtual world, and figures such as 300 active users per dual quad-core physical host with enough RAM are very realistic and achievable.
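To put that figure to work, here is a rough back-of-the-envelope sketch of how many physical hosts a farm would need. The 300 users per host comes from the number quoted above; the function name, the spare-host idea and the sample user counts are my own assumptions for illustration, not anything taken from the study.

```python
import math

# Figure quoted above for a dual quad-core host with enough RAM.
USERS_PER_HOST = 300


def hosts_needed(active_users, users_per_host=USERS_PER_HOST, spare_hosts=0):
    """Estimate physical hosts for a terminal server farm.

    Assumes near-linear scaling, so the host count is simply the user
    count divided by per-host capacity, rounded up. spare_hosts is a
    hypothetical allowance for hosts kept idle in case one fails.
    """
    if active_users < 0 or users_per_host <= 0 or spare_hosts < 0:
        raise ValueError("user and host counts must be non-negative, capacity positive")
    return math.ceil(active_users / users_per_host) + spare_hosts


if __name__ == "__main__":
    # 1000 users need 4 hosts at 300 users each, plus one spare.
    print(hosts_needed(1000, spare_hosts=1))
```

Real sizing would still need a pilot, since the memory footprint per user session varies a lot between application sets.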

Currently, there are three enterprise-ready vendors to choose from for a large-scale virtualization project: VMware, Citrix and Microsoft. They all have their benefits and drawbacks, but for this project, for various reasons that I might explain in another article, I chose Citrix XenServer 5.5 w/ Essentials.

My thoughts on XenServer so far: the technology is great, but support from the vendor takes way too long if the issue is anything but trivial. Oh well, more in the next article.

On Redundancy - Part 1

In IT, the word "redundancy" is probably used more than in any other industry. A fine example of that would be the invention of Lord Redund Redund at Uncyclopedia, an article that almost certainly was written by a hardcore IT geek.

There are multiple uses for true redundancy when it comes to IT, some more legitimate than others, but in essence it all boils down to one acronym: CYA. In fact, as someone who represents a utility computing company from a technical standpoint, I sometimes find it hilarious to hear statements such as "our system has redundancy upon redundancy upon redundancy" as a sales pitch, especially considering that the statement itself is highly redundant.

So, let's take a look at forms of redundancy. Probably the closest synonym to it in the IT world would be "clustering" or "pooling", depending on the purpose. Clustering is when multiple distinct systems in the same network are designed in such a way that they appear as a single system to an outside observer. The key point of this technique is that the system our observer sees can sustain a failure of at least one component, and easily recover its full capacity once that component is replaced. For example, clusters of HPC machines used in supercomputers can lose half of the infrastructure (which amounts to thousands of server nodes) without ever going offline - they will just operate at half capacity until replacements arrive or additional provisions are made.

In Enterprise IT, clustering is generally used in a different fashion. Instead of having one component of the infrastructure perform a business-critical task such as routing, DNS, Active Directory and countless others, we create at least two devices that can instantly switch roles and do each other's job.
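The role switch described above can be sketched in a few lines. This is a toy model only: the `Node` and `FailoverPair` classes and the node names are hypothetical, and a real cluster would detect failures with heartbeats and fencing rather than by catching an exception on a request.

```python
class Node:
    """A stand-in for one device that provides a service (e.g. DNS)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverPair:
    """Two nodes that look like one service to the outside observer."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            # The active node failed: swap roles and retry once on the
            # former standby. If that node is down too, the error
            # propagates to the caller.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


if __name__ == "__main__":
    dns1, dns2 = Node("dns1"), Node("dns2")
    pair = FailoverPair(dns1, dns2)
    print(pair.handle("lookup-1"))
    dns1.healthy = False  # simulate a crash of the active node
    print(pair.handle("lookup-2"))
```

The caller only ever talks to the pair, which is the whole point: it never needs to know which device is doing the work at the moment.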

In this case, the highest and most respected level of clustering would be "Concurrent Runtime" clustering, which means that three or more systems receive the exact same inputs at exactly the same time, and execute the same code concurrently. Then the outputs are compared, and if there are any discrepancies between systems, a node majority decides which output to select as the correct outcome of an action. Clusters of this level are so robust (and expensive) that they are only used in very specialized systems such as Space Shuttle components, although select variations make their way into the Enterprise realm in the form of mainframes, business-critical database server setups, core routers and other devices where the need for near-100% availability far outweighs the already prohibitive cost.
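The majority decision is easy to illustrate. Below is a minimal sketch, assuming each node is just a Python callable and that outputs are hashable values that can be compared directly. The function names are made up for this example, and the nodes are called one after another here, whereas real systems of this kind run them in lockstep on separate hardware.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of nodes.

    Raises RuntimeError if no single value has more than half the votes.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no node majority, outputs disagree")
    return value


def run_replicated(nodes, request):
    """Feed the same input to every node and vote on the results."""
    outputs = [node(request) for node in nodes]
    return majority_vote(outputs)


if __name__ == "__main__":
    good = lambda x: x * 2
    faulty = lambda x: x * 2 + 1  # a node that computes the wrong answer
    # Two good nodes outvote the faulty one, so this prints 42.
    print(run_replicated([good, good, faulty], 21))
```

With three nodes the cluster tolerates one bad answer; with only two nodes a disagreement cannot be resolved, which is why the scheme needs three or more.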

In the next part of the series, the roles of the more widely used clusters will be explained in greater detail.

Thursday, January 21, 2010

On Vendors

In most enterprise IT teams, the word "vendor" is spoken with slight disgust (and a hint of disappointment) every time there is an issue that isn't about as straightforward as a brick falling off a bridge.

Also, as a general rule of thumb, the deeper you understand your system, and the more minor problems and annoyances you have worked through in your to-be-implemented solution, the harder it gets to actually resolve a problem with the vendor. You have to fight through layers and layers of break-fix technical support in order to reach someone who actually understands, in depth, how each component of their own software functions.

You're lucky if the problem is something they've seen before, if it's in their knowledge base, or if you can find it on Google. However, if you search for your error message on Google (assuming there even is such a luxury point of reference as an error message), and the only thing you find is your own forum post on the vendor's website, then you're in trouble. The same applies to the "Your search did not match any documents" message, which should be properly translated as "You're screwed. Good luck."

The worst-case scenario is when you know more about the product than the vendor tech - through experience, for example. Maybe your hobby is to read log files from servers at night and decipher every single line until you understand exactly how they function - the "Enable Verbose Log" check box would be the holy grail in that case.

In this blog, aside from rambling about various day-to-day IT topics, I will also post my personal notes on vendor support, the highlights and downsides of my day/week/month/year, etc.

Every piece of software is bound to crash, no matter how well it was written, which is why "IT Crashes" is the perfect name for this blog.
