Friday, October 8, 2010

On building SANs from scratch

One of my exciting projects has been building a new SAN from scratch. I've gone through a lot of design ideas, but in the end settled on Nexenta Enterprise. In fact, I built two appliances: a single unit for our DR site and a dual-controller HA Cluster with redundant interconnects for the primary site. Here's what I got for the main unit:

2 Supermicro SC823TQ-R500LPB enclosures for the controllers. They have 7 low-profile slots in the back and 6 3.5" disk bays in the front, and fit in 2U.
2 Supermicro X8DTH-iF motherboards. Dual socket Nehalem, 12 RAM slots and 7 PCIe 8x slots are perfect for what I need, and I opted not to get onboard SAS controllers. The "F" in the name stands for a built-in IPMI controller, so I don't have to drive all the way to the datacenter if something comes up.
2 Xeon E5620 CPUs. Same clock speed as E5530, but cheaper and with more cache. At this time I don't think I need dual CPUs in the controllers.
2 2U passive heatsinks. Don't forget these - the parts are specific to the enclosures, so make sure to get the correct part number.
2 LSI 9200-8e SAS controllers. 48Gbps of combined theoretical throughput in two external mini-SAS ports should be plenty.
2 QLogic 2460 controllers in order to export the storage to my servers via Fibre Channel. Interesting observation: unlike with hard drives, vendor-specific FC cards from Dell or IBM are much cheaper than "retail" ones. While this may be borderline unsupported, the chips are the same and it generally makes no difference.
4 3x4Gb DDR3 ECC 1333MHz RAM kits. 24Gb per host should be plenty for now.
4 cheap 160Gb 3.5" 7.2k rpm drives from Western Digital for mirrored boot drives.
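A quick sanity check on the bandwidth and memory figures above (assuming each external SFF-8088 port carries 4 SAS lanes at 6Gbps per lane, which is how the 9200-8e is wired):

```python
# Sanity-check the throughput and RAM math quoted in the parts list.
# Assumption: 4 SAS 2.0 lanes per SFF-8088 port, 6 Gbps per lane.
lanes_per_port = 4
gbps_per_lane = 6
ports_per_hba = 2

hba_throughput_gbps = ports_per_hba * lanes_per_port * gbps_per_lane
print(hba_throughput_gbps)  # 48 Gbps per LSI 9200-8e

# RAM: 4 kits of 3x4Gb sticks, split across the two controllers.
kits, sticks_per_kit, gb_per_stick, hosts = 4, 3, 4, 2
gb_per_host = kits * sticks_per_kit * gb_per_stick // hosts
print(gb_per_host)  # 24 Gb per host
```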

Here comes the interesting part: the storage enclosure and the drives. In order to get more spindles in less space, I opted for a 2U unit with 24 2.5" drive bays. The quest for an enclosure turned out to be a bit of a bummer: the cheaper Supermicro SC216 that I originally wanted does not support SATA-to-SAS interposers (the drive mounts are too short, and there are no screw holes for the interposer-offset drives), and I needed those for the SATA SSDs I was planning to use as read and write cache in the dual-head configuration. SAS SSDs weren't an option because they would cost more than everything else combined.

After some googling, I found the recently released LSI 620J unit, which is double the price of the Supermicro one, but supports 6Gbps SAS on the backplane and doesn't need to be assembled manually. I also got a second SAS expander for it, as well as rails - neither is included by default.

Aside from that, all I really needed for the base setup were 4 3ft SFF-8088 miniSAS cables.

For the drives, I ordered 17 600Gb Savvio 10k.4's - they also have native 6Gbps SAS support. I originally planned to get similar Toshibas because of their higher performance, but got a really good deal on the Savvios, so I couldn't resist.

Originally, the plan was to get four 160Gb Intel X25M SSDs for read cache, and two 32Gb X25E's for write cache, but I have decided to wait a little: these SSDs are sold out in a lot of places, and there are rumors that Intel is about to release bigger and faster SSD drives, hopefully at a lower price.
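For reference, here's roughly how those cache layers would be wired up once the SSDs arrive - a hedged sketch with made-up device names (Nexenta's GUI drives all of this, but it maps onto plain zpool commands underneath):

```shell
# Hypothetical device names - the real ones come from `format`.
# Data pool: the Savvios as mirrored pairs (only two pairs shown;
# the full pool would have eight, with the 17th drive as a hot spare).
zpool create tank \
  mirror c1t0d0 c1t1d0 \
  mirror c1t2d0 c1t3d0 \
  spare c1t16d0

# Write cache (ZIL): mirrored SLC SSDs - sync writes land here first.
zpool add tank log mirror c2t0d0 c2t1d0

# Read cache (L2ARC): striped MLC SSDs - no redundancy needed.
zpool add tank cache c2t2d0 c2t3d0 c2t4d0 c2t5d0
```

The ZIL mirror matters: losing an unmirrored log device can cost you the last few seconds of synchronous writes, while L2ARC devices are safe to stripe because a cache failure just falls back to reading from the pool.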

And now, eye candy:

[build photos omitted]
Sunday, July 25, 2010

On Home Labs

In reply to @CoilDomain's blog post, I have decided to embark on a little mental exercise myself: what would my home lab really be like if I had $50k to throw at it. And, of course, out of all places, I have turned to eBay as my main supplier of cheap hardware. I don't need support for a home lab, right?

First, of course, I would pick up a chassis to hold my beloved blade servers: a time-proven BladeCenter E - $1590.93

It already comes with a couple of Cisco gigabit switches, and I'd get a couple more 2kW PSUs ($100) in order to keep both power domains happy for what comes later. Also, a BladeCenter with all 4 power supply slots filled sounds A LOT quieter than one with non-redundant power - think vacuum cleaner versus a Boeing 747 that is about to take off.

Then, drop in a couple IBM 7870CCU HS22 blades ($1800 a piece), get 4 4Gb chips of RAM so I get 10 gigs per server ($132 per chip), and add two QLogic CIOv cards for those blades, at $563 each.

So why did I get a BladeCenter for just two blades? No reason… except I want to add more servers! Since I really don't need my lab based on the latest and greatest Nehalem line of Xeons, I can just as well play with HS21XM blades that are:
a) based on the previous generation Xeon 5400 series, and
b) cheaper.

These look pretty good, I'll take three ($700 each). Now throw away the 1Gb RAM sticks and get six 4Gb kits at $100 each.

I think I'll play with iSCSI/NFS on these blades. Sounds like fun. FC is good, but I'm "open to other options".

And, just for old times' sake, get a couple bare-metal HS20's with local drives. I can always use them as shelves in my garage, since they're only $175 a piece.

So, my total on servers is: $9994.93.

Of course, I forgot the FC BladeCenter modules, but I'll add those later as I'm pricing out the storage portion of this exercise.

Storage… Storage… so much has been argued about it in the past years. One thing I know for sure: in order to be on top of things, I can't afford to not have SSDs in my home lab. I mean, seriously, it's not like I'm building a production environment. I can always wait for the RMA on a ruined SSD (and believe me, IOmeter does a consistently good job at that if you know what I mean).

And when choosing storage that uses SSDs for both read and write cache and doesn't cost an arm and a leg, ZFS is a pretty obvious choice. To be more specific, Nexenta would fill in my blank, since it actually offers a convenient way to manage that storage. Given the choice, I'd rather not be stuck with a command line interface exclusively.

So, what I need is write-oriented SSDs for ZIL and read-oriented SSDs for L2ARC, and normally the choice would be pretty clear for SLC when it comes to writes, and MLC when it comes to reads. However, there is all the hype about the new SandForce controller that supposedly makes MLC flash as reliable as SLC in write-oriented workloads (10,000 random write IOPS are no joke), so I wouldn't mind playing with it as well.

So, I'll take a 50Gb OWC Mercury Extreme Pro RE for write cache ($210), a 100Gb drive of the same brand for read cache ($399), a 160Gb Intel X25M G2 ($405) to compare the read cache results, and a 32Gb Intel X25E ($380) for a proper SLC write cache. This may seem like a waste of money, but I think that SSD is the future. Total: $1394 on SSD. Wait… let me also grab a 256Gb Crucial RealSSD for $599 in order to give it a run for its money. $1993 it is. I expect the Crucial to fail within a couple months anyway, which means I can return it under warranty and buy new tires for my car.

Of course, no storage system would be complete without spinning rust (which ultimately holds all the data - the SSDs are added for data sprints). Since I think SSD will soon take the crown for low-latency data access, there is no reason to splurge on 15k drives: 10k's will do just as well. So, 10x 600Gb Seagate Savvio 10k.4's it is ($471 a piece). I honestly don't want to go into details about SATA-to-SAS interposers and 24-disk 2U enclosures (partially because I'm under a couple NDA's), but let's just say that $2500 will cover that with ease, especially since redundant SAS controllers are a bit too much for a home lab.

I'm getting a little bit tired of pricing out all the bits and pieces already… I'm going to estimate $5,000 for a decent storage controller (dual quad Nehalem's, 32Gb RAM, FC card + quad gigabit, SAS controller) that will theoretically be able to utilize all of my SSD cache as well as de-dupe everything quickly and efficiently. This takes the total storage costs to $14,203.

Now comes the fun part: networking - both FC and Ethernet. A Cisco ASA 5505 can be found for around $400 on eBay; a Juniper SRX100B would be about $600; a QLogic SANBox 1400 for $900, a 2Gb FC switch module for the Bladecenter is another $600. And, for the most fun part, a fully populated Cisco 6509 chassis for only $1395 - that'll last me a while. :)

So, I have the switches and firewalls, but I don't yet have any routers. Hmm, D-Link? Netgear? Nah, I think I'll get a Cisco 2600 series: it's cheap, abundant, and has more than enough capacity for my home broadband connection ($100 tops). In fact, I think I'll get three of those just to practice weird routing schemes in real life: Cisco emulators will work fine for everything else.

Total for the network portion: $4195. Well, in order to be totally realistic, let's say power/network/FC wiring for everything will add another grand.

So, the total for my dream home lab is $29393.
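Sanity-checking that grand total against the per-section figures quoted above:

```python
# Re-add the running totals from the post.
servers = 9994.93
storage = 14203
network = 4195
wiring = 1000

total = servers + storage + network + wiring
print(round(total))  # 29393
```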

Did I leave anything important out? I don't think so. A tape library, maybe, but for a home lab that's just overkill. Packet shapers? Proxy servers? Deep packet inspection appliances and spam filters? That can all be done either as a virtual appliance or using open source software (and, a lot of the time, both at once).

I think that this setup has plenty of hardware to simulate any production situation I could ever encounter, on a smaller scale. Except for, maybe, 100 VDI users going to http://buffalowildwings.com (courtesy @JoeShonk for the link) all at once. I've got a rudimentary, yet full-featured switched SAN, which can do FC, NFS, iSCSI and CIFS; my network setup is probably way overkill; my storage kicks ass for anything I can think of throwing at it; and a total of 52 gigs of RAM shared across the "server farm" (not including the storage controller) is more than I've seen in most ~100-user production environments. Could I build this for less? Definitely.

More? Well, I could make the costs astronomical just to scare people away from IT. However, the truth is, you don't need brand new under-warranty equipment for a home lab - hell, you don't even need it for production a lot of times, if you take time to properly set everything up and make sure you have no single point of failure.

And, on top of everything, I still have over $20k to spend in case the hypothetical home lab suddenly needs an upgrade.

Oh wait, I forgot a rack. Super-conveniently, the cheapest IKEA corner tables can serve me well: http://lifehacker.com/5459719/build-a-network-rack-with-an-ikea-table

That's all.

Friday, April 23, 2010

On IBM's onsite support

IBM was called to our datacenter to replace a tape drive in a colocated IBM Power i520 box. We have several BladeCenters of our own, and quite a few blade servers. Anyway, 2 IBM engineers came in to replace the Ultrium tape drive (here come the "how many engineers does it take..." jokes). They then proceeded to pull the serial numbers off our blades and call them in (without permission), only to find that there is no IBM hardware maintenance coverage.

Then they started badgering our engineer as to who to talk to about the coverage - even though we explicitly told them we keep plenty of spares and don't need coverage on the old blades - it would cost more than the blade itself.

They had no business touching or pulling the serial numbers on those Blades. They are our property, not the clients', and just because they were called in for a colo box does not mean they can touch everything else.

The consultant we were working with actually filed a complaint to IBM, and the response from the IBM manager was the following:

He said they did nothing wrong and that once they were allowed in the cage, any equipment within was fair game to them. He kept asking why we had a problem with them checking the serial numbers. His attitude was very surprising.

When asked how to prevent this in the future since they are only ones to service the area, he said that when an IBM Engineer is given access to the cage he should be instructed that he is to service only the piece of equipment he was called in for and that he is not given permission to the other equipment in the cage.

Epic fail. This is exactly why we don't buy hardware maintenance coverage from IBM.

On efficient solutions and low-priority cases

Here's a fine example of helpdesk efficiency taken from a real ticketing system.

Day 1. Issue created by Tier1 and escalated to Tier2. Priority set to low.
Description: "JPEGs not opening with Office Picture Viewer (set as Client's default) when opened from email. Viewer opens, but displays x'ed out thumbnails instead of the actual image. Windows Picture and Fax viewer opens JPEGs fine from emails, as does OPV when opening JPEG from desktop. Could this be another terminal server registry issue?"

Day 2. Issue looked at by Tier2 and assigned to a tech.

Day 5. Issue de-escalated from Tier2 with a comment "Is this still an issue?"

3 hours later. Issue re-escalated to Tier2 with a screen shot and a comment: "YouTellMe.jpg and check your inbox."

Day 12. Issue de-escalated back to Tier1. "It's working fine for a test user on [server name], what server is this happening on? Check where IE is storing the temporary internet files, should be to their my documents folder."

9 minutes later. Issue re-escalated to Tier2: "NOT A FIX! Even if this is the case, we will get calls from multiple users and not know they have a problem until they call. This needs to be implemented globally. Also, this doesn't explain why the client only started experiencing this after our last rollout of new servers."

Day 16. Issue de-escalated back to Tier1: "This IS the fix. This is currently, as far as I am aware, the only user that has had this issue. Also, we don't know that other user's may have moved their temporary files save location and if we globally change it then their's won't work. If it changes back after log out and back in then that is one thing, but making a global change for one user's problem is not a solution. Did it work? If so then we have resolved the issue."

45 minutes later, re-escalated to Tier2: "That is not the only user. I had the same issue, and I hadn't made any changes to my temp internet files until you told me to. What about [another_username]?"

Day 24. Issue de-escalated to Tier1 again. "What about him, is he having the same issue? Have you tried doing what I told you for him and did it work?"

20 minutes later. re-escalated to Tier2: "The point is that is not the only user. I've moved a few others' folder location to see if that helped. Besides, who would move their temporary internet files? Per [tech_3], this is an easy global fix."

6 minutes later. [tech_3] comment: "Lemme know what the fix is exactly and we can blanket out the changes needed to all users."

Day 25. Original Tier2 owner replies: "Need to move their Temporary Internet Files to their My Documents. But this only seems to be an issue, from what has said, for users using Microsoft Office Document Imaging and seems somewhat random as I didn't have the issue with the test user I created."

1 hour later. [tech_3] replies: "I would change the location of the temporary internet files for a test user and monitor the registry changes. We can then use that to create a blanket REG file. Let me know if you need my help"

2 hours later. Tier2 owner de-escalates the incident: "This is not going to be able to be done globally and will have to be done on a per user instance. If this has been done for those that have needed it, ie. those who have complained about this particular issue, then close the ticket. It does not need to come back to Tier2 again."

1 hour later. Issue re-escalated to Tier2: "[tech_3] just explained why this can't be implemented globally (encrypted DAT file). This ticket could have been put to rest a long time ago, had I known that. Instead, I was hearing from that it was an easy blanket fix, but meeting ambiguous resistance from Tier2 every time I pushed it up."

Same day. Incident resolved. Tier1 complains about Tier2 to Tier3. Tier3 immediately sends everyone involved an e-mail and re-opens the incident after 5 minutes of identifying what is going on using Procmon: "Change the OutlookSecureTempFolder key to a different location. Do NOT redirect the entire temporary internet files folder to a network drive!!! We want to keep it off the network, not on the network. Revert all changes!"

[HKEY_USERS\\Software\Microsoft\Office\11.0\Outlook\Security]
"OutlookSecureTempFolder"="C:\\Documents and Settings\\\\Local Settings\\Temporary Internet Files\\"

45 minutes later. Tier3 comments: "If this is a widespread problem, put it in the login script or something. Just make sure the folder exists - make the script check for existence, and if it doesn't exist, create something like [program_drive]:\Temp with the script, as a hidden folder with no execute rights. Don't use the [document_drive] for this type of thing - it is purely for redirecting "My Documents"."

Day 26. Tier3 discovers that redirecting temporary internet files for those users broke a few unrelated apps. Changes are rolled back manually for affected users.

Day 40. Incident still sitting in Tier2 queue.

Saturday, March 13, 2010

Unofficial Citrix Support channel on IRC

Stop by at the #citrix channel on irc.freenode.net for any Citrix-related questions and discussions. We have a good group of folks there at all times, and the channel is integrated with the @citrixirc twitter account, with important announcements reaching the channel in real-time.

On Redundancy - Part 2

For most critical systems, 1+1 redundancy is enough. 1+1 means that for every protected system there is an identical replacement system running in parallel, ready to instantly jump in and replace the first system in case of failure. Two distinctions need to be made here, however.

The first distinction is whether the nodes operate in "Active-Active" or "Active-Passive" mode. The former generally requires more setup and is more expensive, but allows a node to fail without losing the state of the system - something called "stateful fail-over". For instance, in the case of network equipment, an Active-Active system will preserve the existing connections that used to go across the failed node by constantly sharing the working memory set and configuration between the two nodes. Such a setup goes by different names depending on the vendor: Live Failover, Fault Tolerance, Full HA, Stateful Load Balancing and many others. The main requirement for stateful Active-Active clustering is some form of shared synchronous storage that operates on a transaction level and ensures that both nodes are aware of all external conditions each one is going through.

Active-Passive clustering is a cheaper form that utilizes a standby system with exactly the same specifications, the role of which is to watch the active system and assume its role only when the original system is unavailable. In general, active-passive systems do preserve the configuration, but require the state to be reset, thus "kicking everyone offline" for a moment and requiring either manual or automatic reconnection.

Using Microsoft server systems as an example, Network Load Balancing would be considered Active-Active clustering: several systems listen on the same set of virtual IP addresses and guarantee that if one of them is down, the others will still receive the traffic and provide a response. Failover Clustering, on the other hand, is Active-Passive, since a failover requires a "virtual restart" of all services on the passive node. Both options should be carefully examined for the right mix, since each has its ups and downs.
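As a toy illustration of the Active-Passive pattern - not any vendor's implementation, just the bare logic of a standby node counting missed heartbeats and taking over:

```python
class ActivePassivePair:
    """Toy model of an Active-Passive cluster: the passive node
    promotes itself once the active node misses enough heartbeats.
    Note the stateless failover: in-flight sessions are dropped."""

    def __init__(self, heartbeat_timeout=3):
        self.heartbeat_timeout = heartbeat_timeout
        self.active = "node-a"
        self.missed = 0
        self.sessions = []          # state lives only on the active node

    def heartbeat_ok(self):
        self.missed = 0

    def heartbeat_missed(self):
        self.missed += 1
        if self.missed >= self.heartbeat_timeout:
            self.failover()

    def failover(self):
        self.active = "node-b" if self.active == "node-a" else "node-a"
        self.missed = 0
        self.sessions = []          # "kicking everyone offline"

pair = ActivePassivePair()
pair.sessions = ["alice", "bob"]
for _ in range(3):                  # node-a goes silent
    pair.heartbeat_missed()
print(pair.active)    # node-b
print(pair.sessions)  # [] - clients must reconnect
```

An Active-Active version would replicate `sessions` to both nodes on every change, which is exactly the expensive synchronous state sharing described above.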

Every environment is different. Support should understand that before suggesting ideas.

Tuesday, February 9, 2010

Frequent problems in IT

Someone asked me to outline the most overlooked problems in IT projects. Here they are:

1) Deadlines. In any project that involves more than two people who are responsible for different sides of the project and don't have access to each other's work, the carefully determined final deadline will never be met, and will need to be extended by N weeks, where N is the complexity rating of the project on a scale from 1 to 10. The only solution is giving both people a way to fix each other's work without waiting for the other person to respond to requests.

2) Instructions. Any instruction written by a person with a higher level of knowledge than the intended audience will inevitably be misunderstood, misused or ignored. It doesn't matter how many screenshots you include or how many footnotes you make, and even a video that you make of everything that is involved will not help. This point contributes to the huge popularity of the "*** for Dummies" series of books. The only solution is having the person with least knowledge write the actual instruction, while getting all the knowledge from peers or Google.

3) Vendor Tech Support. Never wait for the vendor to "get back to you" - follow up with them within a reasonable time and make them keep their word. If they say they will get back to you within 24 hours, contact them in exactly 24 hours and ask for a status update. Don't be afraid to "escalate" the problem if you feel that the support person on the other end is incompetent or doesn't understand the problem in depth. Whenever you schedule a project completion date for something that might involve vendor tech support, try to have at least one issue resolved by them first, so you can test the ground, get a feeling for how good their support really is, and factor a couple unexpected problems into the deadline.

4) Replication. Never willingly take ownership of a problem that cannot be replicated consistently, when it cannot be proven that the problem was created by you. Also, the "I can replicate this consistently" statement is the best way to get a vendor to solve something. There is no solution, since you'll inevitably get stuck with strange, infrequent and untraceable issues at some point.

5) User impact. If you have one problem that affects 50% of users once a month, and another problem that affects 2% of users every day (hypothetically), always solve the 50% problem first. The 2% can get used to the issue or even be trained with a workaround in the meantime, but the 50% will complain endlessly and will inevitably get you in trouble.

6) Coffee. Any person that pours the last cup of coffee from a pot and doesn't make a new pot should be punished. Solution: always leave half a cup of coffee in the pot so you cannot be accused of being that person.
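For what it's worth, the raw numbers in point 5 don't even favor the 50% problem - a back-of-the-envelope calculation (assuming a 30-day month and 100 users) shows the 2% daily problem actually causes more incidents over time, which is exactly why the point is about complaint volume, not math:

```python
users = 100
days_per_month = 30

# Problem A: hits 50% of users once a month.
impact_a = users * 0.50 * 1               # 50 user-incidents/month

# Problem B: hits 2% of users every day.
impact_b = users * 0.02 * days_per_month  # 60 user-incidents/month

print(impact_a, impact_b)  # 50.0 60.0
```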

Thursday, January 28, 2010

On Backups

On a late December night, a very sad girl was riding home in a half-empty bus. She was absently staring out the window at the streets running by, letting out occasional quiet sobs. Tears were slowly flowing down her cheeks. It seemed like nothing going on around her really mattered... Grief had made her lose track of countless overtime hours, and she couldn't even remember the last time she ate...

Her sorrow was unthinkable: a file server she was responsible for crashed, and the backup turned out to be corrupt.

Friday, January 22, 2010

On Virtualization

Surely, most folks in IT have heard about the trend of virtualizing servers in order to save rack space, add a level of High Availability, and simplify server management. I jumped on the bandwagon a while ago, but was careful enough to first virtualize the least critical servers and see how they performed in the long term before even thinking about virtualizing the high-load production servers.

In my case, the perfect candidates for wide-scale production virtualization were Terminal Servers, for the following reasons: the load on them is highly predictable, they are mostly memory-bound if there is a restriction on the number of logged-on users, and, most importantly, they are farm animals. The latter means that if there are any problems with the virtual servers, we can always go back to the idle physical ones, assuming we haven't already pulled them out of the datacenter.

There was an excellent study published by the LoginVSI project showing that terminal servers scale almost linearly in a virtual world, and figures such as 300 active users per dual quad-core physical host with enough RAM are very realistic and achievable.

Currently, there are 3 enterprise-ready vendors that can be used in a large-scale virtualization project: VMware, Citrix and Microsoft. They all have their benefits and drawbacks, but for the purpose of this project, due to various reasons that I might explain in another article, I chose Citrix XenServer 5.5 w/ Essentials.

My thoughts on XenServer so far: the technology is great, but support from the vendor takes way too long, if the issue is anything but primitive. Oh well, more in the next article.

On Redundancy - Part 1

Surely, the word "redundancy" is used more in IT than in any other industry. A fine example would be the invention of Lord Redund Redund on Uncyclopedia, an article that was almost certainly written by a hardcore IT geek.

There are multiple uses for true redundancy in IT, some more legitimate than others, but in essence it all boils down to one acronym: CYA. In fact, as someone who represents a utility computing company from a technical standpoint, I sometimes find it hilarious to hear statements such as "our system has redundancy upon redundancy upon redundancy" as a sales pitch, especially considering that the statement itself is highly redundant.

So, let's take a look at the forms of redundancy. Probably the closest synonym in the IT world would be "clustering" or "pooling", depending on the purpose. Clustering is utilized when multiple distinct systems in the same network are designed in such a way that they appear as a single system to an outside observer. A key factor in this technique is that the system our observer sees can sustain the failure of at least one component and easily recover its full capacity once that component is replaced. For example, clusters of HPC machines used in supercomputers can lose half of the infrastructure (which amounts to thousands of server nodes) without ever going offline - they will just operate at half capacity until replacements arrive or additional provisions are made.

In Enterprise IT, clustering is generally used in a different fashion. Instead of having one component of the infrastructure provide a business-critical task such as routing, DNS, Active Directory and countless others, we create at least two devices that can instantly switch roles and do each others' job.

In this case, the highest and most respected level of clustering would be "Concurrent Runtime" clustering, which means that 3+ systems receive the exact same inputs at exactly the same time and execute the same code concurrently. The outputs are then synchronized, and if there are any discrepancies between systems, a node majority decides which output to select as the correct outcome of the action. Clusters of this level are so robust (and expensive) that they are only used in very specialized systems such as Space Shuttle components, although select variations make their way into the Enterprise realm in the form of mainframes, business-critical database setups, core routers and other devices where near-100% availability far outweighs the already prohibitive cost.
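The node-majority voting described above can be sketched in a few lines - a toy model, nothing like actual lockstep hardware:

```python
from collections import Counter

def majority_output(outputs):
    """Pick the output produced by a majority of replicas.
    With 3+ nodes, a single faulty node is simply outvoted."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority - cluster cannot decide")
    return winner

# Three replicas run the same computation; one glitches.
print(majority_output([42, 42, 41]))  # 42
```

This is also why such clusters use 3+ nodes rather than 2: with only two replicas, a disagreement leaves no way to tell which node is the faulty one.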

In the next part of the series, the roles of the more widely used clusters will be explained in greater detail.

Thursday, January 21, 2010

On Vendors

In most enterprise IT teams, the word "vendor" is spoken with a slight disgust (with a hint of disappointment) every time there is an issue that isn't about as straightforward as a brick falling off a bridge.

Also, as a general rule of thumb, the deeper you go into understanding your system, and as you learn your way through minor problems and annoyances in your to-be-implemented solution, the more complicated it is to actually resolve a problem with the vendor. You have to fight through layers and layers of break-fix technical support in order to get through to someone who actually understands the functioning of each component of their own software on an in-depth level.

You're lucky if the problem is something they've seen before, if it's in their knowledge base, or if you can find it on Google. However, if you search for your error message on Google (assuming you even have the luxury of an error message to search for), and the only thing you find is your own forum post on the vendor's website, then you're in trouble. The same applies to the "Your search did not match any documents" result, which properly translates to "You're screwed. Good luck."

The worst case scenario is when you know more about the product than the vendor tech - through experience, for example. Maybe your hobby is to read log files from servers at night and decipher every single line until you understand exactly how they function - the "Enable Verbose Log" check box would be the holy grail in that case.

In this blog, aside from rambling about various day-to-day IT topics, I will also post my personal notes on vendor support, the highlights and downsides of my day/week/month/year, etc.

All software is bound to crash, no matter how well it was written, which is why "IT Crashes" is the perfect name for this blog.