Archive for February, 2008

When the cloud is gone

February 29th, 2008 by Peter

InfoQ had an interesting article on the recent outage of Amazon’s storage service S3. The article indirectly gives a beginner’s lecture on fault tolerance metrics and offers some insight into the true relevance of SLAs in practice. We learn again that even with a new term for distributed computing (“cloud”), the problems remain the same:

http://www.infoq.com/news/2008/02/s3-outage-trust-slas

[Update]

I am starting to get a little sneak preview of old people’s wisdom. See here:

There is nothing new about the failures themselves; it is more about the (misplaced) trust in the vendor’s capabilities for dependability. Dependable systems cost real money, and a lot of it. The old rule still applies: you get what you pay for.

Debugging SGE DRMAA applications

February 25th, 2008 by Peter

My latest DRMAA-based C application failed silently on an SGE 6 installation. Even with strace, it was not possible to figure out why the application already got stuck in drmaa_init(). I found the relevant trick to get some useful output here:

Source a magic debugging script (source sge-root/util/dl.csh) and use the new command dl 1 to enable some SGE debugging output on the console.

I also found that DRMAA functions do not trigger a flush of stdout. If you printf the last error buffer and continue with the next DRMAA function call, you might see nothing. You should therefore use some kind of flushing debug macro, or call setlinebuf(stdout) at the beginning of your program.
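Here is a minimal sketch of both options in plain C. The names debug_printf and make_stdout_line_buffered are my own, purely for illustration, and not part of DRMAA or SGE. The second helper uses setvbuf(), which is the ISO C way of getting the same effect as setlinebuf(stdout).

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: printf-style output that is flushed right away,
 * so the message is visible even if the next library call hangs. */
static int debug_printf(const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    int written = vfprintf(stdout, fmt, args);
    va_end(args);
    fflush(stdout);   /* force the buffered text out immediately */
    return written;   /* number of characters written, as with printf */
}

/* Hypothetical helper: switch stdout to line buffering.
 * Call it once at program start, before any other output.
 * Returns 0 on success. */
static int make_stdout_line_buffered(void)
{
    return setvbuf(stdout, NULL, _IOLBF, 0);
}
```

In a DRMAA program you would call make_stdout_line_buffered() as the first thing in main(), or print the error buffer with debug_printf() directly after each DRMAA call that did not succeed.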

Human Computer Games

February 18th, 2008 by Peter

An old colleague pointed me to this one. It’s really funny, especially if you know old computer games and their soundtracks:

http://notsonoisy.com/gameover/index.html

Installing Rocks Cluster on really old hardware

February 5th, 2008 by Peter

I am currently installing a cluster of old PIII boxes at BTH. Every machine has 4-20GB of hard disk space and 512MB of memory. Since I was tired of doing all the machine installation by myself (Debian, NFS, NIS, SGE, Ganglia, Java, …), I searched for a better ‘out-of-the-box’ solution. The Rocks cluster distribution provides everything I expected. You install a front node with all the software packages, and the dumb compute nodes install themselves over PXE. The great thing is that all relevant cluster stuff is already integrated, so you get a full-fledged SGE+Ganglia+Globus Head+(their)NIS+MPI cluster in one day:

www.rocksclusters.org

Now the bad part: The documentation is lousy, like in all purely academic projects. My main problem was the age of the machines, since the sheer amount of software installed normally expects at least 1GB of RAM and 10GB of hard disk space. Here is my set of experiences:

  • Give the front node at least 1GB of RAM. With 512MB, you get an obscure VFS error message during frontend installation, since the ramdisk gets full. The compute nodes work fine with 512MB.
  • insert-ethers is only needed the first time a compute node is connected to the cluster. After the MAC address is registered, the node will always reinstall via PXE, so you can go through as many reinstall rounds as you need to fix a problem with a particular compute node.
  • If the compute nodes have too small hard disks, switch to manual partitioning. Rocks expects a root partition, a swap partition, and a partition mounted under /state/partition1. With the SGE, Globus, Ganglia, Java, and HPC rolls, the root partition needs at least 3GB with Rocks 4.3.0. The activation of manual compute node partitioning is described here.
  • New cluster users are created as described here.
    Keep in mind that all compute nodes must be up and running to receive the update immediately.
  • Resist the temptation to install everything from the Net. It takes ages, and in my case the SGE installation on the frontend was incomplete afterwards, which led to another round of frontend reinstallation (in case you need it, look here). Burn the CDs or the DVD.

You are currently browsing the troeger.eu blog archives for February, 2008.