Friday, October 8, 2010

On building SANs from scratch

One of my exciting projects has been building a new SAN from scratch. I've gone through a lot of design ideas, but in the end settled on Nexenta Enterprise. In fact, I built two appliances: a single unit for our DR site and a dual-controller HA Cluster with redundant interconnects for the primary site. Here's what I got for the main unit:

2 Supermicro SC823TQ-R500LPB enclosures for the controllers. They have 7 low-profile slots in the back and six 3.5" disk bays in the front, and fit in 2U.
2 Supermicro X8DTH-iF motherboards. Dual-socket Nehalem, 12 RAM slots and 7 PCIe x8 slots are perfect for what I need, and I opted not to get onboard SAS controllers. The "F" in the name denotes a built-in IPMI controller, so I don't have to drive all the way to the datacenter if something comes up.
2 Xeon E5620 CPUs. Same clock speed as the E5530, but cheaper and with more cache. At this time I don't think I need dual CPUs in the controllers.
2 2U passive heatsinks. Don't forget these - the parts are specific to the enclosures, so make sure to get the correct part number.
2 LSI 9200-8e SAS controllers. 48Gbps of combined theoretical throughput in two external mini-SAS ports should be plenty.
2 QLogic 2460 controllers in order to export the storage to my servers via Fibre Channel. Interesting observation: unlike with hard drives, vendor-specific FC cards from Dell or IBM are much cheaper than "retail" ones. While this may be borderline unsupported, the chips are the same and it generally makes no difference.
4 kits of 3x4GB DDR3 ECC 1333MHz RAM. 24GB per host should be plenty for now.
4 cheap 160GB 3.5" 7.2k rpm drives from Western Digital for mirrored boot drives.

Here comes the interesting part: the storage enclosure and the drives. To get more spindles in less space, I opted for a 2U unit with 24 2.5" drive bays. Choosing the enclosure turned out to be kind of a bummer: the cheaper Supermicro SC216 that I originally wanted does not support SATA-to-SAS interposers (the drive mounts are too short, and there are no screw holes for the interposer-offset drives), and I needed those for the SATA SSDs I was planning to use as read and write cache in a dual-head configuration (SATA drives are single-ported, so without interposers both heads can't see them). SAS SSDs weren't an option because they would cost more than everything else combined.

After some googling, I found the recently released LSI 620J unit, which is double the price of the Supermicro one, but supports 6Gbps SAS on the backplane and doesn't need to be assembled manually. I also got a second SAS expander for it, as well as rails - neither is included by default.

Aside from that, all I really needed for the base setup were 4 3ft SFF-8088 miniSAS cables.

For the drives, I ordered 17 600GB Savvio 10K.4's - they also have native 6Gbps SAS support. I originally planned to get similar Toshibas because of their higher performance, but I got a really good deal on the Savvios, so I couldn't resist.

Originally, the plan was to get four 160GB Intel X25-M SSDs for read cache and two 32GB X25-E's for write cache, but I have decided to wait a little: these SSDs are sold out in a lot of places, and there are rumors that Intel is about to release bigger and faster SSDs, hopefully at a lower price.

And now, eye candy:







Sunday, July 25, 2010

On Home Labs

In reply to @CoilDomain's blog post, I have decided to embark on a little mental exercise myself: what would my home lab really look like if I had $50k to throw at it? And, of course, of all places, I have turned to eBay as my main supplier of cheap hardware. I don't need support for a home lab, right?

First, of course, I would pick up a chassis to hold my beloved blade servers: a time-proven BladeCenter E - $1590.93

It already comes with a couple Cisco gigabit switches, and I'd get a couple more 2kW PSUs ($100) in order to keep both power domains happy for what comes later. Also, a BladeCenter with all 4 power supply slots filled sounds A LOT quieter than one with non-redundant power. Compare a vacuum cleaner to a Boeing 747 that is about to take off.

Then, drop in a couple IBM 7870CCU HS22 blades ($1800 apiece), get 4 4GB sticks of RAM so I get 10 gigs per server ($132 per stick), and add two QLogic CIOv cards for those blades, at $563 each.

So why did I get a BladeCenter for just two blades? No reason… except I want to add more servers! Since I really don't need to have my lab based on the latest and greatest Nehalem line of Xeons, I can just as well play with HS21XM blades that are:
a) based on the previous generation Xeon 5400 series, and
b) cheaper.

These look pretty good, I'll take three ($700 each). Now throw away the 1GB RAM sticks and get six 4GB kits at $100 each.

I think I'll play with iSCSI/NFS on these blades. Sounds like fun. FC is good, but I'm "open to other options".

And, just for past reference, get a couple bare-metal HS20s with local drives. I can always use them as shelves in my garage, since they're only $175 apiece.

So, my total on servers is: $9994.93.

Of course, I forgot the FC BladeCenter modules, but I'll add those later as I'm pricing out the storage portion of this exercise.

Storage… Storage… so much has been argued about it in the past years. One thing I know for sure: in order to be on top of things, I can't afford to not have SSDs in my home lab. I mean, seriously, it's not like I'm building a production environment. I can always wait for the RMA on a ruined SSD (and believe me, IOmeter does a consistently good job at that if you know what I mean).

And when choosing storage that uses SSDs for both read and write cache and doesn't cost an arm and a leg, ZFS is a pretty obvious choice. To be more specific, Nexenta would fill in my blank, since it actually offers a convenient way to manage that storage. Given the choice, I'd rather not be stuck using a command-line interface exclusively.

So, what I need is write-oriented SSDs for ZIL and read-oriented SSDs for L2ARC, and normally the choice would be pretty clear for SLC when it comes to writes, and MLC when it comes to reads. However, there is all the hype about the new SandForce controller that supposedly makes MLC flash as reliable as SLC in write-oriented workloads (10,000 random write IOPS are no joke), so I wouldn't mind playing with it as well.
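For the curious, wiring those SSDs into a ZFS pool is a one-liner per role. This is only a minimal sketch - the pool name and the `c1t*d0` device names are placeholders, not my actual layout:

```shell
# Mirror the SLC SSDs for the ZIL: losing an unmirrored log device
# while the pool has uncommitted synchronous writes is painful.
zpool add tank log mirror c1t10d0 c1t11d0

# Add the MLC SSDs as L2ARC. Cache devices need no redundancy -
# a failed cache device only costs you a warm cache, never data.
zpool add tank cache c1t12d0 c1t13d0

# Verify the resulting layout.
zpool status tank
```

Note the asymmetry: the log vdev is mirrored, the cache vdevs are not - which is exactly why SLC endurance matters for the ZIL and cheap MLC is fine for L2ARC.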

So, I'll take a 50GB OWC Mercury Extreme Pro RE for write cache ($210), a 100GB drive of the same brand for read cache ($399), a 160GB Intel X25-M G2 ($405) to compare the read cache results, and a 32GB Intel X25-E ($380) for proper SLC write cache. This may seem like a waste of money, but I think that SSD is the future. Total: $1394 on SSDs. Wait… Let me also grab a 256GB Crucial RealSSD for $599 in order to give it a run for its money. $1993 it is. I expect the Crucial to fail within a couple months anyway, which means I can return it under warranty and buy new tires for my car.

Of course, no storage system is complete without spinning rust (which ultimately holds all the data; the SSDs are added for data sprints). Since I think that SSD will soon take the crown for low-latency data access, there is no reason to splurge on 15k drives: 10k's will do just as well. So, 10x 600GB 10k Seagate Savvio 10K.4's it is (@ $471 apiece). I honestly don't want to go into details about SAS-to-SATA multiplexors and 24-disk 2U enclosures (partially because I'm under a couple NDA's), but let's just say that $2500 will cover that with ease, especially since redundant SAS controllers are a bit too much for a home lab.

I'm getting a little bit tired of pricing out all the bits and pieces already… I'm going to estimate $5,000 for a decent storage controller (dual quad-core Nehalems, 32GB RAM, FC card + quad gigabit, SAS controller) that will theoretically be able to utilize all of my SSD cache as well as de-dupe everything quickly and efficiently. This takes the total storage cost to $14,203.

Now comes the fun part: networking - both FC and Ethernet. A Cisco ASA 5505 can be found for around $400 on eBay; a Juniper SRX100B would be about $600; a QLogic SANbox 1400 for $900; a 2Gb FC switch module for the BladeCenter is another $600. And, for the most fun part, a fully populated Cisco 6509 chassis for only $1395 - that'll last me a while. :)

So, I have the switches and firewalls, but I don't yet have any routers. Hmm, D-Link? Netgear? Nah, I think I'll get a Cisco 2600 series: it's cheap, abundant, and has more than enough capacity for my home broadband connection ($100 tops). In fact, I think I'll get three of those just to practice weird routing schemes in real life: Cisco emulators will work fine for everything else.

Total for the network portion: $4195. Well, in order to be totally realistic, let's say power/network/FC wiring for everything will add another grand.

So, the total for my dream home lab is $29393.
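For the skeptics, the subtotals really do add up - here's a quick sanity check using the figures from the sections above (the wiring line is the flat $1k estimate):

```shell
# Servers + storage + networking + wiring, straight from the post.
awk 'BEGIN { printf "%.2f\n", 9994.93 + 14203 + 4195 + 1000 }'
# prints 29392.93, which rounds to the $29,393 total
```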

Did I leave anything important out? I don't think so. A tape library, maybe, but for a home lab that's just overkill. Packet shapers? Proxy servers? Deep packet inspection appliances and spam filters? That can all be done either as a virtual appliance or using open source software (and, a lot of times, both at once).

I think that this setup has plenty of hardware to simulate, on a smaller scale, any production situation I could ever encounter. Except for, maybe, 100 VDI users going to http://buffalowildwings.com (courtesy of @JoeShonk for the link) all at once. I've got a rudimentary yet full-featured switched SAN that can do FC, NFS, iSCSI and CIFS; my network setup is probably way overkill; my storage kicks ass for anything I can think of throwing at it; and a total of 52 gigs of RAM shared across the "server farm" (not including the storage controller) is more than I've seen in most ~100 user production environments. Could I build this for less? Definitely.

More? Well, I could make the costs astronomical just to scare people away from IT. However, the truth is, you don't need brand new under-warranty equipment for a home lab - hell, you don't even need it for production a lot of times, if you take time to properly set everything up and make sure you have no single point of failure.

And, on top of everything, I still have over $20k to spend in case the hypothetical home lab suddenly needs an upgrade.

Oh wait, I forgot a rack. Super-conveniently, the cheapest IKEA corner tables can serve me well: http://lifehacker.com/5459719/build-a-network-rack-with-an-ikea-table

That's all.

Friday, April 23, 2010

On IBM's onsite support

IBM was called to our datacenter to replace a tape drive in a colocated IBM Power i520 box. We have several BladeCenters of our own, and quite a few blade servers. Anyway, two IBM engineers came in to replace the Ultrium tape drive (here come the "how many engineers does it take..." jokes). They then proceeded to pull the serial numbers off our blades and call them in (without permission), only to find that there is no IBM Hardware Maintenance coverage.

Then they started badgering our engineer as to who to talk to about the coverage - even though we explicitly told them we keep plenty of spares and don't need coverage on the old blades - it would cost more than the blade itself.

They had no business touching those blades or pulling their serial numbers. They are our property, not the client's, and just because the engineers were called in for a colo box does not mean they can touch everything else.

The consultant we were working with actually filed a complaint with IBM, and the response from the IBM manager was the following:

He said they did nothing wrong and that once they were allowed in the cage, any equipment within was fair game to them. He kept asking why we had a problem with them checking the serial numbers. His attitude was very surprising.

When asked how to prevent this in the future, since they are the only ones who service the area, he said that when an IBM engineer is given access to the cage, he should be instructed that he is to service only the piece of equipment he was called in for, and that he is not given permission to touch the other equipment in the cage.

Epic fail. This is exactly why we don't buy hardware maintenance coverage from IBM.

On efficient solutions and low-priority cases

Here's a fine example of helpdesk efficiency taken from a real ticketing system.

Day 1. Issue created by Tier1 and escalated to Tier2. Priority set to low.
Description: "JPEGs not opening with Office Picture Viewer (set as Client's default) when opened from email. Viewer opens, but displays x'ed out thumbnails instead of the actual image. Windows Picture and Fax Viewer opens JPEGs fine from emails, as does OPV when opening a JPEG from the desktop. Could this be another terminal server registry issue?"

Day 2. Issue looked at by Tier2 and assigned to a tech.

Day 5. Issue de-escalated from Tier2 with a comment "Is this still an issue?"

3 hours later. Issue re-escalated to Tier2 with a screen shot and a comment: "YouTellMe.jpg and check your inbox."

Day 12. Issue de-escalated back to Tier1. "It's working fine for a test user on [server name], what server is this happening on? Check where IE is storing the temporary internet files, should be to their my documents folder."

9 minutes later. Issue re-escalated to Tier2: "NOT A FIX! Even if this is the case, we will get calls from multiple users and not know they have a problem until they call. This needs to be implemented globally. Also, this doesn't explain why the client only started experiencing this after our last rollout of new servers."

Day 16. Issue de-escalated back to Tier1: "This IS the fix. This is currently, as far as I am aware, the only user that has had this issue. Also, we don't know that other user's may have moved their temporary files save location and if we globally change it then their's won't work. If it changes back after log out and back in then that is one thing, but making a global change for one user's problem is not a solution. Did it work? If so then we have resolved the issue."

45 minutes later, re-escalated to Tier2: "That is not the only user. I had the same issue, and I hadn't made any changes to my temp internet files until you told me to. What about [another_username]?"

Day 24. Issue de-escalated to Tier1 again. "What about him, is he having the same issue? Have you tried doing what I told you for him and did it work?"

20 minutes later. re-escalated to Tier2: "The point is that is not the only user. I've moved a few others' folder location to see if that helped. Besides, who would move their temporary internet files? Per [tech_3], this is an easy global fix."

6 minutes later. [tech_3] comment: "Lemme know what the fix is exactly and we can blanket out the changes needed to all users."

Day 25. Original Tier2 owner replies: "Need to move their Temporary Internet Files to their My Documents. But this only seems to be an issue, from what has said, for users using Microsoft Office Document Imaging and seems somewhat random as I didn't have the issue with the test user I created."

1 hour later. [tech_3] replies: "I would change the location of the temporary internet files for a test user and monitor the registry changes. We can then use that to create a blanket REG file. Let me know if you need my help"

2 hours later. Tier2 owner de-escalates the incident: "This is not going to be able to be done globally and will have to be done on a per user instance. If this has been done for those that have needed it, ie. those who have complained about this particular issue, then close the ticket. It does not need to come back to Tier2 again."

1 hour later. Issue re-escalated to Tier2: "[tech_3] just explained why this can't be implemented globally (encrypted DAT file). This ticket could have been put to rest a long time ago, had I known that. Instead, I was hearing from that it was an easy blanket fix, but meeting ambiguous resistance from Tier2 every time I pushed it up."

Same day. Incident resolved. Tier1 complains about Tier2 to Tier3. Tier3 immediately sends everyone involved an e-mail and re-opens the incident after 5 minutes of identifying what is going on using Procmon: "Change the OutlookSecureTempFolder key to a different location. Do NOT redirect the entire temporary internet files folder to a network drive!!! We want to keep it off the network, not on the network. Revert all changes!

[HKEY_USERS\\Software\Microsoft\Office\11.0\Outlook\Security]
"OutlookSecureTempFolder"="C:\\Documents and Settings\\\\Local Settings\\Temporary Internet Files\\""

45 minutes later. Tier3 comments: "If this is a widespread problem, put it in the login script or something. Just make sure the folder exists - make the script check for existence, and if it doesn't exist, create something like [program_drive]:\Temp as a hidden folder with no execute rights. Don't use the [document_drive] for this type of thing - it is purely for redirecting "My Documents"."

Day 26. Tier3 discovers that redirecting temporary internet files for those users broke a few unrelated apps. Changes are rolled back manually for affected users.

Day 40. Incident still sitting in Tier2 queue.

Saturday, March 13, 2010

Unofficial Citrix Support channel on IRC

Stop by the #citrix channel on irc.freenode.net for any Citrix-related questions and discussions. We have a good group of folks there at all times, and the channel is integrated with the @citrixirc twitter account, with important announcements reaching the channel in real-time.

On Redundancy - Part 2

For critical systems, 1+1 redundancy is usually enough. 1+1 means that for every protected system, there is an identical replacement system running in parallel that can instantly jump in and replace the first system in case of failure. Two distinctions need to be made here, however.

The first distinction is whether the nodes operate in "Active-Active" or "Active-Passive" mode. The former generally requires more setup and is more expensive, but allows a node to fail without losing the state of the system - something that is called "stateful fail-over". For instance, in the case of network equipment, an Active-Active system will preserve the existing connections that used to go across the failed node by constantly sharing the working memory set and configuration between the two nodes. Such a setup goes by different names depending on the vendor: Live Failover, Fault Tolerance, Full HA, Stateful Load Balancing and many others. The main requirement for stateful Active-Active clustering is some form of shared synchronous storage that operates on a transaction level and ensures that both nodes are aware of all external conditions that each one is going through.

Active-Passive clustering is a cheaper form that utilizes a standby system with exactly the same specifications, whose role is to watch the active system and assume its role only when the original system becomes unavailable. In general, active-passive systems do preserve the configuration, but require the state to be reset, thus "kicking everyone offline" for a moment and requiring either manual or automatic reconnection.

Using Microsoft server systems as an example, Network Load Balancing would be considered Active-Active clustering: several systems listen on the same set of virtual IP addresses and guarantee that if one of them is down, the other ones will still receive the communication and provide a response. Fail-over clustering is then considered Active-Passive, since the failover requires a "virtual restart" of all services on the passive node. Both options should be carefully examined for the right mix, since both have their ups and downs.
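The core of any active-passive setup is a watchdog loop on the standby node: poll the active node's health, and promote yourself only when a check fails. Here's a toy sketch of just that logic - the health check is stubbed out (it "fails" on the third poll) so the loop can run anywhere, and none of this is any particular vendor's implementation:

```shell
# Standby node's watchdog loop. In real life, health_ok would ping or
# probe the active peer; here it's a stub that fails on the 3rd poll.
state=passive
health_ok() { [ "$1" -lt 3 ]; }

for poll in 1 2 3 4 5; do
  if ! health_ok "$poll"; then
    state=active   # takeover: claim the shared IP, start services,
    break          # and reset client sessions - the "virtual restart"
  fi
done

echo "$state"
# prints: active
```

The `break` is the whole point of the distinction above: the standby does nothing but watch until the moment of failure, which is exactly why clients get "kicked offline" during the transition.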

Every environment is different. Support should understand that before suggesting ideas.