Monday, March 12, 2007

what about some other clusters you built?

What about my other clusters, you might ask? What makes me qualified to run this one?

Well, my linux-fu is quite strong, and I learned from the best. But linux-fu is not all that is required.

Let me see if I can construct a quick illustrated history of my past work:

(...comes back to blogger after unsuccessfully searching for photos)

Well, illustrations be damned - my former place of business is 'redesigning' their website, so all the photos of my clusters are now unreachable. I've been gone for 1.5 years, and they still have 'this site under construction' horsepucky up there.

(looks some more, this time using 'teh google')

  • 1*
So, here is the first cluster I cut my teeth on:
Impressive? Not by today's standards, that's for sure. But what was impressive about this machine was the year it was built (1997 or '98, I think), its size, and its interconnect. What you see here are 64 Pentium IIIs (whoo!), each with about a 5G hard drive and, I think, 512M of RAM. The interconnect is Myrinet B (if memory serves correctly), a highly proprietary network fabric that consists of these weird hand-cut platinum cables that cost something like $100 a foot or some such weirdness. It was one of the fastest around at the time, with something like 1Gb/s uncompressed bandwidth. It was a real pain in the rear getting this beast to install, let alone run, and I was lucky to be allowed to touch it, never mind completely rebuild it. Most of the scripts were written by my insanely intelligent coworkers/gurus to handle job submission back before you kids got all this fancy job submission and monitoring software.

  • 2*
Next up, we have my 2nd cluster:
This pretty fellow is me, screwing in the 4U dual-proc, bleeding-edge (for 2002) monsters.

This is the final pic, just after running Linpack on it, I believe. 44 dual-proc AMD mobos with 1G of memory each. This system was pretty advanced, and getting all the parts working correctly was difficult, given the lack of Linux drivers available at the time (it probably wouldn't surprise you that the Windows drivers were all available up front, those jerks). Each machine had, I think, a 40G hard drive, and those disk arrays you see there in white above the monitors were the filesystem for /home - mirrored arrays of something like 160G apiece. By the time I left this job, parts of this machine were starting to fail at an alarming rate, and I think only about 20 of the nodes are actually up at this time. I know the disk array was limping along with a broken mirror and one drive reporting errors when I left. The interconnect was a 1G Ethernet network for OS provisioning, etc., and a Myrinet fiber fabric for the inter-process communication. Rocks was the provisioning OS, using Red Hat (9?) at the time. I would like to point out that this machine was conceived of and designed by my brilliant former co-worker and Linux guru, who now has a PhD in neurobiology.

Pretty slick for a small-time department run by a psychopathic old professor who, aside from bringing in tons of oil-related grants for her tenured chair, would stumble through the halls in a muumuu shouting profanity that would make a sailor's ears burn. She could be extremely nice about 10% of the time, and the other 90% was spent berating people or insulting them. I once heard her yell at a Chinese grad student 'Get out of my office, and don't come back until you learn how to [expletive]ing speak [expletive]ing English!' She was never rude to me for some reason, but hated my entire group of coworkers. PhDs are crazy as well most of the time. Maybe 20% are normal, decent humans, while the rest are soulless freaks. Look it up!

  • 3*

After that, I built a cluster using Apple XServes, a complete and utter slow-motion disaster (no pic - hell, just imagine a rack full of silver XServes - it looks cool, but boy does it suck). The company sent a 'tech' to help who knew absolutely nothing about what he was doing. You could tell he had run a help desk for Macs on a network, but never an OS X provisioning cluster for HPC. He wasted over 3 days of my precious time. I ended up complaining to Apple so badly that they sent me and my coworkers three free iPods as restitution. I still use that iPod 4 years later, but that's beside the point.

Don't ever build a cluster out of XServes with OS X! One university built something on the order of a 1000-node version of this, and ended up pushing the OS to each node by hand, with an image loaded onto each node by iPod via FireWire. I kid you not. Completely ridiculous.

I spent about 1 week solid writing my own auto-install scripts and setting up the environment, but didn't get to see the project to fruition, because I got a better job building MUCH larger systems (where I currently work, to be exact).

  • 4*

At my new job, I assembled one of the largest computers in the world (here, I kid not again - this machine is in the top 30 machines in the world as of 2007 - I would get more specific and say the exact rank in the TOP500 list, but then you could probably guess who I am and where I work, so I can't do that). This is not to say I single-handedly built the entire machine - I'm just saying I did most of the work putting the hardware together, from wiring the racks internally to installing the power, the chassis, the nodes and the HCAs in every node. Other intelligent people were involved in the design, administration and software implementation of this machine where I work; I just assembled the majority of it.

The last machine I assembled was insane. Here is a picture of just 1/10th of the cables. This construction almost killed me. And you could have hidden my body in the cables...(and yes, that is a Cray you see there, as well as an SGI)

And here is a pic of part of the final product - just one row, mind you:

So there you have it. I have a very weird job, and how I got here was even weirder.

My, what a fun trip it's been down memory lane!

1 comment: said...

that beige cluster was delivered on halloween, 1997. it's still in the room where that photograph was taken...