Monday, September 17, 2007

cluster fork

You may or may not know, but cluster-fork is actually a command in the Rocks universe. It allows you to run a command or set of commands on every machine in your cluster.
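For the curious, here's roughly what that idea looks like. This is a minimal Python sketch, not how Rocks actually implements cluster-fork: it just loops over a list of hosts and runs the same command on each one over ssh. The helper names (`build_ssh_command`, `run_everywhere`) are made up for illustration, and it assumes passwordless ssh to the nodes is already set up.

```python
import subprocess


def build_ssh_command(host, command):
    """Return the argv list that runs `command` on `host` over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_everywhere(hosts, command, runner=subprocess.run):
    """Run `command` on each host in turn and return {host: exit code}."""
    results = {}
    for host in hosts:
        # `runner` is swappable so the loop can be tried without real ssh.
        completed = runner(build_ssh_command(host, command))
        results[host] = completed.returncode
    return results
```

Something like `run_everywhere(["compute-0-0", "compute-0-1"], "uptime")` would run `uptime` on two nodes, one after the other (those host names are placeholders). A real tool would probably want to run in parallel and collect the output too, but the loop is the whole idea.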

It's also a very apt way to describe why I haven't been updating this blog lately. I've been cluster-forked!

It's gotten to the point where I either do something useful, or update this blog. With the new machine, family and life, I've been afforded little time for such a thing as blogging. When I do get a chance, I will update as much as possible.

So, while I'm sitting here, I'll post a brief synopsis.

Uh, let's see.

We were testing and retesting the hardware, figuring out the correct BIOS settings to get the best performance out of the system, and working out how to remotely manage and monitor thousands of machines in an easy and convenient manner. The blades are designed really well, and we are getting very nice performance percentages out of the new chips (wink wink). I know some of you are dying to know actual numbers, but I haven't had time to ask if I can write about it.

Additionally, we had to get the remote operating system install up and running smoothly, so that when the hardware actually starts to arrive, the blades come up uniformly and with only the software and daemons needed to run jobs.

We installed all the disk servers with their necessary little bits and pieces, all the while learning the ins and outs of the unique and extremely engineered hardware from Sun.

One might wonder how to go about controlling a massive cluster bigger than my house. Well, we're using Sun's neat embedded service management voodoo hardware to monitor and remotely connect to the machines. Basically, it's a little computer that's embedded in the back of the machine you are running. It has its own IP address and its own processor, and it can power on, turn off (gracefully or immediately) and monitor the server's health, all via ssh, or https(!).

There are two ways to do this: through ipmitool (a command-line interface to access, query and control the machines), or with the Sun-produced browser-based Java GUI hoohadilly. I threw that last word in there for people who don't know what I'm talking about (hi mom!).

I use both. I am a strong advocate of command-line, script-based control of machines, and it will never be replaceable. However, I have taken a great liking to the Java juju, I must admit. It's nice to pull up a browser and watch a computer boot in another place like I'm standing in the machine room with it. I can control it with both keyboard and mouse. It's a giant network-based KVM. I can access over 4000 machines from my one computer, and it's pretty cool.
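To give a flavor of the command-line route, here is a small Python sketch that builds and runs `ipmitool chassis power` commands against one of those embedded service processors. Treat it as a sketch under assumptions, not our actual scripts: the helper names and action labels are mine, and whether a given service processor accepts the `lanplus` interface depends on how it's configured. Note that ipmitool speaks IPMI over the network here, which is a separate path from the ssh and https access mentioned above.

```python
import subprocess

# Map friendly action names to `ipmitool chassis power` subcommands.
POWER_ACTIONS = {
    "status": "status",
    "on": "on",
    "graceful-off": "soft",   # asks the OS to shut down cleanly
    "immediate-off": "off",   # cuts power right away
}


def ipmi_power_command(bmc_host, user, action):
    """Build the ipmitool argv for a power action on one service processor."""
    if action not in POWER_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over the network (assumed to be enabled)
        "-H", bmc_host,    # address of the embedded service processor
        "-U", user,
        "-E",              # read the password from $IPMI_PASSWORD
        "chassis", "power", POWER_ACTIONS[action],
    ]


def power(bmc_host, user, action, runner=subprocess.run):
    """Run the power action and return ipmitool's exit code."""
    return runner(ipmi_power_command(bmc_host, user, action)).returncode
```

With `IPMI_PASSWORD` set in the environment, `power("10.0.0.42", "root", "status")` would ask that one box whether it's powered on (the address and user name are placeholders). Wrap that in a loop over a host list and you have the script-based control I keep going on about.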

Had to run the ethernet networking fabric for all of that. Twelve 48-port switches, all connected to two 24-port leaf switches that are uplinked via 10-GigE lines. That's a lot of ethernet, btw. We run the lines, then velcro the bundles to the floor supports underneath our suspended floors.

Lots of 10-GigE cards installed. Lotsa fiber cards in PCI-E slots.

We have started receiving 18-wheeler shipments almost daily of huge amounts of hardware. 16 huge racks come in each shipment, and soon 200-400 blades will start arriving every day as well.

I'm sure that if our department weren't hosting two HPC conferences right now, we'd have actual photo evidence of all the work, but I will try to ask around for pics of what's going on.

This system didn't feel big until we got 45 racks and pushed those mammerjammers into place. That's when it felt huge. You can no longer walk between rows through the spaces where the racks would go, and the distance to go around to the sides, where the only openings are, is considerable. Now, you have to plan before you go somewhere, lest you have to turn back to get something you forgot. I'm not kidding. I jogged down the aisle and it took far too long to get to the end. It's really impressive.

Lots and lots of other details are being glossed over, because I am tired, and need to get up early to meet the next truck shipment in the morning.

more later...

below, blogger says I posted this at 8:47 pm or something like that. that isn't true. it's 11:47pm. why must blogspot lie to the world and make little children cry?