Friday, June 29, 2007

...and now for something completely different

Here's some random moron smashing his head into the crate that the Magnum was shipped in. What a tool.

And, since this is UT, a picture of the Magnum with the ever-present longhorns mounted on top. Eat that, every other university!

First Beta Rack of the new Sun Constellation system

The new beta version of the Sun Constellation series has arrived! This is the back of it as it is being unloaded from the crate. Please keep in mind that no one else in the world has one of these. Pretty awesome.

Here is the front of the rack, after being moved into place, with no blades in it.

And here is a blurry shot of the actual blades that will be going in: each has four CPU sockets, and each socket will hold a quad-core AMD Barcelona chip. Once again, no one else has any of these; they are all beta units.

(Editor's note: these blades are not quad-core; they are just run-of-the-mill dual-core systems.)

We've installed and turned on two of the blades, and they are really fast! We've also tested the Magnum switch, and it passed all the tests with flying colors. It is just as fast as they claimed it would be. There is only one tiny, software-related issue left, and then we should be good to go.

Tuesday, June 26, 2007

Biggest Switch in the World for the Biggest Computer in the World

So, today, one of the many heads of Sun is announcing the new InfiniBand switch, called Magnum, which will be the biggest switch of its kind in the world. With 3,456 ports in a fat-tree configuration, it will be the first of its kind, although I'm sure it will ship under a different name (who wants to fight a trademark battle with a condom manufacturer?).
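
For a rough sense of where a number like 3,456 can come from, here is a back-of-the-envelope sketch. A full fat tree built from identical switch elements tops out at 2 * (radix / 2) ** levels end ports. The 24-port element size and the three levels below are my assumptions for illustration, not something stated in the announcement, and the function name is made up.

```python
def fat_tree_ports(radix: int, levels: int = 3) -> int:
    """Maximum end ports of a full fat tree built from radix-port switches.

    Every level except the top splits its ports half down, half up, and
    the top level points all of its ports down, which works out to
    2 * (radix / 2) ** levels.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return 2 * (radix // 2) ** levels


# Assumed values: 24-port elements, 3 levels.
print(fat_tree_ports(24))  # 3456
```

With those assumed values the formula lands exactly on the advertised port count, which is at least consistent with a three-level tree of 24-port parts.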

This switch will be one of two for the entire system, and together they will form the spine and nervous system of the biggest supercomputer in the world.

As you can see in the above picture, there was a huge, head-sized hole punched in the side of the crate when it arrived here. After unpacking, we determined that it was only superficial. I just know that made some people sweat, seeing as how we are receiving the ONLY copy of this switch in the entire world - wouldn't want it to get screwed up in transit.

Here are the extra parts that come with the main chassis - the line cards, and all the cable management, etc.

Here is the empty box! How exciting! This is right after it was rolled off the pallet.

And now, the back! OOOOOOOh. AAAAAAAAh.

And a little magnum frontal action, too! This is pre-line card/filler panel install. Please start salivating now.

The great thing about this is, no one else has one! You can probably order it after today, but good luck getting it any time this year!

Odds and Ends

It's been 16 days since my last post, and there are many good reasons for that.

We've been chugging along, installing and testing, testing and reinstalling. Lots and lots of work to do. I've been putting in insane hours trying to get everything ready for the main components to arrive. Here are some neat details. First, a picture of Knoppix booting on one of the X4600s.
Each penguin in that picture represents a CPU. In 3 months, you would see that boot with 16 penguins! Each of those would represent 1 of the 4 cores in a new quad-core AMD Barcelona chip.

Next, I managed to figure out the way that Thumpers, or X4500s, boot. Grub on an X4500 is a very weird, fickle thing. You have to relax your brain and accept that it will not act the way you think it should.

When you are installing Rocks on a thumper, it can only see the first 24 drives, only one of which is bootable. It is very unfortunate that Sun decided to put the two bootable drives on the same scsi controller, since if you lose the controller, you can't boot. But whatever.

Here's how to get a thumper to boot with grub in Linux. Install to /dev/sdy: during the install, the OS (via anaconda) sees the bootable disk as sdy, which is the 25th disk, or disk 24 if you count from zero. At the end, you want to install grub to this disk. A quick synopsis, in the grub shell: device (hd1) /dev/sdy; root (hd1,0); setup (hd1).

Once you have marked the drive with the grub magic, you now need to set up grub.conf to point to, get this, hd0. Yes, I know you set grub up on what you defined as hd1, but just accept this and move on.

While the machine is booting, the BIOS of the X4500 only presents grub with two disks - disk 0 and disk 1. So in grub.conf, you need to point to 0, like this: root (hd0,0). Then, tell it where you installed root (like /dev/sdy1), so that once grub loads, it can find the kernel, etc. That is where I got hung up - I kept thinking grub would see what I defined as hd1, but it couldn't see it - it could only see hd0.

I know I'm not really explaining it well, but that's all that you need to do.
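
To make the hd1-versus-hd0 business a bit more concrete, here is a minimal sketch that prints a grub.conf entry of the shape described above. The helper name, the title, and the kernel and initrd paths are placeholders I made up for illustration; only root (hd0,0) and root=/dev/sdy1 come from the steps above.

```python
def grub_stanza(title, kernel, initrd, linux_root, bios_disk=0, part=0):
    """Return a GRUB legacy menu entry as text.

    bios_disk is the disk number the BIOS hands to grub at boot time
    (hd0 on the X4500 described above), not the hdN label that was used
    in the grub shell when the boot record was written.
    """
    return "\n".join([
        f"title {title}",
        f"root (hd{bios_disk},{part})",
        f"kernel {kernel} ro root={linux_root}",
        f"initrd {initrd}",
    ])


# Placeholder title and paths; substitute whatever your install produced.
entry = grub_stanza(
    title="Example Linux",
    kernel="/boot/vmlinuz-EXAMPLE",
    initrd="/boot/initrd-EXAMPLE.img",
    linux_root="/dev/sdy1",
)
print(entry)
```

The point is only that the two names live in different worlds: hd0 is what the BIOS shows grub at boot, while /dev/sdy1 is what the Linux kernel calls the same partition once it is running.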

Sunday, June 10, 2007

installing like a madman

OK, so where was I?

Last Friday, we started to receive the first $X dollars worth of equipment (known as the 'starter system'). That equates to 8 X4600s, 16 X4100s, a 4G fiber-channel disk array and 4 thumpers (X4500s). Doesn't sound like much, until you start to work with them and see just how nicely they are built. Here is a pic of the inside of an X4600, and yes, those are 8 CPU slots, each of which can hold either a dual-core or a quad-core AMD chip!
Here is a fuzzy shot of the disk controller, the disks that will hold our Lustre metadata, and the thumpers on the right.

Here is the view I have to my right while I am working on this. This data center feels like a set from Star Wars. By the way - it is pretty bad for your health to stay in a brand new machine room for extended periods of time. Besides the fact that all the air is blowing everywhere at 65 degrees, there is a bunch of dust and particulate matter flying around that I can feel in my lungs. Not a good sign for my health!

Here are the X4600s sitting nicely on the ground, waiting to be installed:

bad pbr sig

I had posted about this error earlier, then I deleted the post because I thought it was stupid. Then, I noticed a lot of people searching Google for this post after I removed it. So I am posting my solution to help out.

I have come across this problem only on Sun hardware, but it may occur on others. I encountered it running CentOS Linux, using grub.

This error, Bad PBR Sig, means that your Partition Boot Record (the boot sector at the start of the boot partition) has become borked. This normally happens when you are installing an OS on a machine that the OS is not familiar with, and the installer writes the record to the wrong place.

In my case, I was installing a thumper (Sun X4500). My version of Linux doesn't see all 48 hard drives in it, and so the install process is pretty lengthy. First you install an OS to /dev/sda, then you have to copy the OS, once it's installed, to /dev/sdy, which is the first bootable disk on the system. In doing so, if grub writes its boot record to /dev/sda, then you won't be able to boot, as the BIOS doesn't see /dev/sda as bootable, just sdy and sdac. I think this is due to the layout of the disks and their closeness in position to the scsi channel.
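
If names like sdy and sdac are hard to place among 48 drives, this small sketch converts a Linux sd name into its zero-based position in the naming sequence (sda, sdb, ... sdz, sdaa, sdab, ...). The function name is mine. It also assumes the kernel handed out names in plain detection order, so the number is a position in the naming sequence and not necessarily a physical slot.

```python
def sd_index(name: str) -> int:
    """Zero-based position of an sdX name: sda -> 0, sdz -> 25, sdaa -> 26."""
    suffix = name[2:]
    if not name.startswith("sd") or not suffix:
        raise ValueError(f"not an sd device name: {name!r}")
    if not all("a" <= ch <= "z" for ch in suffix):
        raise ValueError(f"unexpected characters in {name!r}")
    n = 0
    for ch in suffix:
        # Letters act like digits 1..26 in a base-26 number with no zero.
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1


for dev in ("sda", "sdy", "sdac", "sdav"):
    print(dev, sd_index(dev))
```

So sdy is disk 24 counting from zero (the 25th disk), sdac is disk 28, and the 48th and last drive would be sdav.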

Anyways, the way I fixed it was to use grub to write out the correct record to the correct drive.

You need to write the device map under /boot/grub/ (device.map on a stock grub install) to tell grub which hard drive is which. Mine looked like this:

(fd0) /dev/fd0
(hd0) /dev/sda
(hd1) /dev/sdy
(hd2) /dev/sdac

I installed the OS first to hd0, moved it over to hd1, and mirrored it to hd2. I will be booting from a mirrored boot partition.

In grub, you set everything up like this. First, type grub at the shell prompt to start the grub shell. Then, inside grub:

grub> device (hd1) /dev/sdy
grub> root (hd1,0)
grub> setup (hd1)
grub> device (hd2) /dev/sdac
grub> root (hd2,0)
grub> setup (hd2)

This will mark the correct disks with boot records.
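
If you have more than a couple of disks to mark, typing the same three commands over and over gets old. Here is a minimal sketch that just generates the command text for a list of disks; the function name is made up, and it deliberately does not run grub or touch any disk.

```python
def grub_setup_commands(disks):
    """Build grub shell commands that write a boot record to each disk.

    disks is a list of (grub_name, linux_device) pairs, for example
    [("hd1", "/dev/sdy")]. Partition 0 is assumed to hold /boot/grub.
    """
    lines = []
    for grub_name, device in disks:
        lines.append(f"device ({grub_name}) {device}")
        lines.append(f"root ({grub_name},0)")
        lines.append(f"setup ({grub_name})")
    lines.append("quit")
    return "\n".join(lines) + "\n"


script = grub_setup_commands([("hd1", "/dev/sdy"), ("hd2", "/dev/sdac")])
print(script, end="")
```

The output is the same sequence as above with a quit at the end. GRUB legacy can read commands like these on standard input when started with grub --batch, but check the text by eye before feeding it to anything that writes boot sectors.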

Then, make sure you boot from the correct drive in grub.conf:

title CentOS-4 x86_64 (2.6.9-42.ELsmp)
root (hd1,0)
kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/md1 rhgb quiet
initrd /initrd-2.6.9-42.ELsmp.img

That's it.

Monday, June 4, 2007

Gotta get a handle on this

Here's an interesting physical problem we came across. We have received 4 thumpers so far (a thumper is a Sun X4500), and 2 of the 4 have different system controller handles. One allows the IB cables to be plugged in with plenty of room to spare. The other is too large, and you can barely manage to squeeze the connector in. It locks, but this worries us, as any stress on the IB cables invites failure down the road. It looks like the version that allows the easiest access (figure 1, below) was designed after someone actually tried to use the old version, shown in figure 2.

Figure 1: Example X4500 with notched handle which allows for correct access of Port 0 on each of the PCI-X Infiniband HCAs. This is the desired configuration (note that the IB cable is connected).

Figure 2: Example X4500 without notched handle which does not allow easy access to Port 0 on each of the PCI-X Infiniband HCAs (IB cable not connected).