
Official Debian repository for Condor available

July 28th, 2010 by Peter

Six years (!) after my initial Debian ITP (see here) and all the hard work on a custom Condor Debian package, the Condor team has finally done it too:

http://www.cs.wisc.edu/condor/debian/

The Condor people now offer their own Debian repository for the Condor DEB files, which have existed for a while longer. This lets you keep a Debian-based cluster decently up to date with the latest available version. Great thing. Use it.

If you were using my Condor Debian packages in the past, you should stop doing so and migrate carefully. Both packages use the same user names, and the Wisconsin version of the installation scripts seems to be a little bit picky. I recommend purging my package (for example with apt-get remove --purge condor) and checking for any leftovers before you start with the new repository.

By default, the installation runs in personal mode, which requires no user interaction during installation. The package looks like a completely new development, which is sad: some people (check the ITP) spent quite some time on things such as debconf support.

DRMAA Version 2 – Call for Action

September 5th, 2008 by Peter

I recently announced the start of the specification work for DRMAA2.

Most interested people know that the DRMAA group is extremely conservative about major API changes. Of course, this is one of the reasons for the broad adoption of the spec. The upcoming months are therefore one of the few opportunities to trigger major changes in the API layout. Challenge us with JSDL, SAGA, OGSA-BES or anything else; we are quite open to ideas. Contact details are available on drmaa.org.

DRMAA IDL Spec 1.0 is out

April 25th, 2008 by Peter

After more than two years of work, OGF finally posted our IDL spec as proposed recommendation GFD.130.

The official availability of this document has some non-obvious implications, due to earlier decisions in the working group. All future language binding documents should from now on be derived from this specification, especially if they describe the DRMAA API for an object-oriented language such as Java or Python.

This reduces the degree of freedom for the language binding authors, which is (in our understanding) a good thing. Instead of modeling something that looks “DRMAAish” based on the C-centric GFD.022 document, people can now take the IDL definitions and get an immediate idea of what to model as a class, interface, attribute or method. This enforces some consistency across language borders, even though we (hopefully) left enough room for decisions to accommodate language-specific styles. I remember some heavy discussion about “Pythonic” interfaces and whether the IDL spec would ever allow deriving them. We will see.

I want to emphasize that I appreciate the great work done by people such as Tim Harsch (DRMAA for Perl) or Enrico Sirola (DRMAA for Python). We mainly wanted to give them a better input than GFD.022, which was never intended to map to object-oriented languages. The Java binding meanwhile simply took over the descriptions from the IDL spec, instead of re-inventing the wheel everywhere. So for Dan and me, the concept seems to work.

The next step is to quickly push out IDL-based “official” binding documents for all the OO languages people are interested in. This is now an easy step, since you mainly need to map the DRMAA IDL interface to a DRMAA-[Python|Ruby|Perl|Occam|Haskell|C#|whatever] interface. That's it.

I will try to take care of Python and C#, the main challenge being not to throw away the already existing preliminary implementations. Dan has his hands on Java. Any other volunteers?

Parallel Monte Carlo

March 27th, 2008 by Peter

I was recently asked for some initial material on running larger Monte Carlo simulations on clusters and Grids.

In general, the solution is obvious, since Monte Carlo simulation is one of the classic embarrassingly parallel problems (some introduction slides). You can find application reports for areas such as financial derivatives pricing.

The easiest explanation I could find is this one by Paul Gray. It shows an example of what a parallel Monte Carlo simulation can look like.

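A known problem of parallel MC is generating trustworthy random numbers, i.e., random number streams that remain statistically independent across the parallel workers, as described here and here.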
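To make the embarrassingly parallel structure concrete, here is a minimal POSIX shell sketch (not taken from the linked material) that estimates pi by Monte Carlo. Each background awk process stands in for an independent cluster job with its own seed, and the parent only sums the hit counts. The function names `mc_worker` and `mc_pi` are hypothetical, and seeding each worker with its index is the naive approach that runs straight into the random number problem mentioned above: distinct seeds alone do not guarantee statistically independent streams.

```sh
#!/bin/sh
# Minimal sketch: embarrassingly parallel Monte Carlo estimate of pi.
# mc_worker and mc_pi are hypothetical helper names for illustration.

# One worker: draw $2 random points in the unit square using seed $1
# and print how many fall inside the quarter circle x^2 + y^2 <= 1.
mc_worker() {
    awk -v seed="$1" -v n="$2" 'BEGIN {
        srand(seed)
        hits = 0
        for (i = 0; i < n; i++) {
            x = rand()
            y = rand()
            if (x * x + y * y <= 1.0) hits++
        }
        print hits
    }'
}

# Start $1 workers with $2 samples each in the background,
# wait for all of them, then combine the partial results.
mc_pi() {
    workers=$1
    samples=$2
    tmp=$(mktemp -d) || return 1
    w=1
    while [ "$w" -le "$workers" ]; do
        # Naive seeding (seed = worker index): simple, but distinct seeds
        # do not guarantee independent random number streams.
        mc_worker "$w" "$samples" > "$tmp/part.$w" &
        w=$((w + 1))
    done
    wait
    # Sum all hit counts and scale: pi is approximately 4 * hits / total.
    cat "$tmp"/part.* | awk -v total="$((workers * samples))" \
        '{ hits += $1 } END { printf "%.4f\n", 4 * hits / total }'
    rm -rf "$tmp"
}
```

Calling `mc_pi 4 100000` runs four workers with 100,000 points each and prints an estimate near 3.14. On a real cluster, each `mc_worker` call would be a separately submitted job and the final summation a small post-processing step; a proper parallel random number generator would replace the naive per-worker seeding.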

Debugging SGE DRMAA applications

February 25th, 2008 by Peter

My latest DRMAA-based C application failed silently on an SGE 6 installation. Even with strace, it was not possible to figure out why the application already got stuck in drmaa_init(). I found the relevant trick for getting some useful output here:

Source a magic debugging script (source sge-root/util/dl.csh) and use the new command dl 1 to enable some SGE debugging output on the console.

I also noticed that DRMAA functions do not flush stdout. If you printf the last error buffer and then continue with the next DRMAA function call, you might see nothing. You should therefore use some kind of flushing debug macro, or call setlinebuf(stdout) at the beginning of your program.

Installing Rocks Cluster on really old hardware

February 5th, 2008 by Peter

I am currently installing a cluster of old PIII boxes at BTH. Every machine has a 4-20 GB hard disk and 512 MB of memory. Since I was tired of doing all the machine installation myself (Debian, NFS, NIS, SGE, Ganglia, Java, …), I searched for a better ‘out-of-the-box’ solution. The Rocks cluster distribution provides everything I expected. You install a front node with all the software packages, and the dumb compute nodes install themselves over PXE. The great thing is that all relevant cluster components are already integrated, so you get a full-fledged SGE+Ganglia+Globus Head+(their)NIS+MPI cluster in one day:

www.rocksclusters.org

Now the bad part: the documentation is lousy, like in all purely academic projects. My main problem was the age of the machines: the sheer amount of installed software normally expects at least 1 GB of RAM and 10 GB of hard disk. Here is my set of experiences:

  • Give the front node at least 1 GB. With 512 MB, you get an obscure VFS error message during frontend installation, because the ramdisk fills up. The compute nodes work fine with 512 MB.
  • insert-ethers is only needed the first time a compute node is connected to the cluster. After its MAC address is registered, the node will always reinstall from PXE, so you can go through as many rounds as needed to fix problems with that particular compute node.
  • If the compute nodes have hard disks that are too small, switch to manual partitioning. Rocks expects a root partition, a swap partition, and a partition mounted under /state/partition1. With the SGE, Globus, Ganglia, Java, and HPC rolls, the root partition needs at least 3 GB in Rocks 4.3.0. The activation of manual compute node partitioning is described here.
  • New cluster users are created as described here.
    Keep in mind that all compute nodes must be up and running to receive the update immediately.
  • Resist the temptation to install everything from the Net. It takes ages, and in my case the SGE installation on the frontend was incomplete afterwards, which led to another round of frontend reinstallation (if needed, look here). Burn the CDs or the DVD.

SGE 6.1u2 problem with Debian testing

December 10th, 2007 by Peter

We upgraded our SGE cluster machines to the Debian testing release, which, among other things, contains the latest libc 2.6:

marie# dpkg -l|grep libc6
ii libc6 2.6.1-1+b1 GNU C Library: Shared libraries
ii libc6-i686 2.6.1-1+b1 GNU C Library: Shared libraries

On the master node in particular, this libc version is needed to install libmotif3 from the Debian repositories, which is a prerequisite for the qmon tool.

After upgrading everything, the SGE daemons refused to start via the old /etc/init.d/ scripts, while starting the binaries directly still worked. I figured out that the Sun shell scripts rely on the output of SGE_ROOT/util/arch, which returned the following on our updated installation:

UNSUPPORTED-lx24-GLIBC-2.6-x86

With bash -x arch, I got the following execution dump:


...
++ strings /lib/libc.so.6
++ grep 'GNU C Library'
+ libc_string='GNU C Library stable release version 2.6.1, by Roland McGrath et al.'
+ '[' 0 -ne 0 ']'
++ echo GNU C Library stable release version 2.6.1, by Roland McGrath et al.
++ tr ' ,' '\n'
++ grep '2\.'
++ cut -f 2 -d .
+ libc_version=6
+ case $libc_version in
+ unsupported=UNSUPPORTED-
+ lxrelease=24-GLIBC-2.6
+ ARCH=UNSUPPORTED-lx24-GLIBC-2.6-x86
...

I am not a shell script expert, but fetching the libc version by using strings on the binary looks a little bit weird.
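A less fragile way to obtain the glibc version, assuming a glibc-based system, is `getconf GNU_LIBC_VERSION`, which prints a string such as `glibc 2.6.1`. The sketch below uses hypothetical helper names and is not part of the SGE scripts; the parsing is kept in its own function so it can be tried on a literal string.

```sh
#!/bin/sh
# Sketch: determine the glibc minor version without running strings on libc.so.6.
# libc_minor_from_string and libc_minor are hypothetical helper names.

# Extract the minor number from a string such as "glibc 2.6.1".
libc_minor_from_string() {
    # The second whitespace-separated field is e.g. "2.6.1";
    # its second dot-separated component is the minor version.
    echo "$1" | awk '{ print $2 }' | cut -d . -f 2
}

# Query the running system; returns non-zero if getconf does not know
# GNU_LIBC_VERSION (for example on a non-glibc system).
libc_minor() {
    version=$(getconf GNU_LIBC_VERSION 2>/dev/null) || return 1
    libc_minor_from_string "$version"
}
```

For example, `libc_minor_from_string "glibc 2.6.1"` prints `6`, the same value the arch script derives from the strings output in the trace above.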

Anyway, looking for the libc_version variable in the arch script brought up the following:


# verify the GNU C lib version
# For an alternative means to determine GNU C lib version see
# http://www.gnu.org/software/libc/FAQ.html#s-4.9
case $lxmachine in
   amd64)
      libc_string=`strings /lib64/libc.so.6 | grep "GNU C Library"`
      ;;
   ia64)
      libc_string=`strings /lib/libc.so.6.1 | grep "GNU C Library"`
      ;;
   *)
      libc_string=`strings /lib/libc.so.6 | grep "GNU C Library"`
      ;;
esac

# retrieving libc version failed
if [ $? -ne 0 ]; then
   unsupported="UNSUPPORTED-"
   lxrelease="${lxrelease}-GLIBC"
else
   libc_version=`echo $libc_string | tr ' ,' '\n' | grep "2\." | cut -f 2 -d "."`
   case $libc_version in
      2)
         unsupported="u"
         ;;
      3|4|5)
         ;;
      *)
         unsupported="UNSUPPORTED-"
         lxrelease=24-GLIBC-2.${libc_version}
   esac
fi
;;

As you can see in the last part, all libc minor versions up to 2.5 are recognized. I couldn’t find any known problem with libc 2.6 on the SGE web pages, so I added the missing “|6” to the “3|4|5)” branch, and everything works now.
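The effect of the one-line patch can be reproduced in isolation. The sketch below is a simplified, hypothetical excerpt rather than the real arch script: it applies the same extraction pipeline to a libc string and then runs the patched case statement. The starting value `lxrelease=24` and the `-x86` suffix are assumptions chosen to match the trace shown earlier.

```sh
#!/bin/sh
# Simplified reproduction of the patched glibc check (not the real arch script).
# Assumptions for illustration: lxrelease starts as 24, machine suffix is x86.

classify_glibc() {
    libc_string=$1
    unsupported=""
    lxrelease=24
    # Same extraction as in the arch script ("2.6.1," -> "6"); '\n\n' maps
    # both the space and the comma to a newline explicitly.
    libc_version=$(echo $libc_string | tr ' ,' '\n\n' | grep "2\." | cut -f 2 -d ".")
    case $libc_version in
    2)
        unsupported="u"
        ;;
    3|4|5|6)   # "|6" is the added entry for glibc 2.6
        ;;
    *)
        unsupported="UNSUPPORTED-"
        lxrelease=24-GLIBC-2.${libc_version}
        ;;
    esac
    echo "${unsupported}lx${lxrelease}-x86"
}
```

With the patch, the 2.6.1 string from the trace yields `lx24-x86` instead of `UNSUPPORTED-lx24-GLIBC-2.6-x86`, while a later release such as 2.7 still falls through to the `UNSUPPORTED-` branch, so every new minor version needs another edit of this kind.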

The rise and fall of CORBA

October 16th, 2007 by Peter

Amid all the bloated statements about CORBA, Web services, and the past development of middleware systems, the ACM article by Michi Henning, a widely recognized CORBA guru, is a really nice read.

He argues that the main reason for CORBA’s fall was a deficiency in OMG’s standardization procedures. I share all his thoughts, even with regard to WS technology, and his arguments about standardization remind me of my own DRMAA standardization work. It is good to see that our internal DRMAA group credo (focus on implementations, no unnecessary features, open source) fits his suggestions.

Just read the article …

Victory !!!

October 9th, 2007 by Peter

OGF recently announced that DRMAA has reached the final status of a grid recommendation. After several years of work by the implementors, the users, and the group members, this is a very happy moment for all of us.

It must be noted that OGF needed more than one year to acknowledge the status change, which shows (at least to me) that the document process still has room for improvement. Greg Newby and all the other steering committee members are already working on this, but according to Andrew Grimshaw it will take time.

It is also a pity that OGF officials do not attach more importance to this event. While OGSA is hyped in every press release, the work of the still-existing non-OGSA community is more or less ignored by the PR division of OGF. Maybe it is time for an adoption rate analysis of all the different group outputs ;-)

Document process at OGF

January 9th, 2007 by Peter

For our DRMAA standardization activities, I analyzed the GGF document process several months ago. Since OGF still relies on the old document, the rules have not changed so far. Why is this a problem?

At the moment, there is no way to fix issues in a finished document without restarting the OGF recommendation document process. The GGF / OGF document process was originally derived from the IETF document process (RFC 2026). I found out that most parts are identical, except for a section where RFC 2026 allows changes to a document that do not lead to a status change. This special rule was not adopted by the GGF. This knowledge could serve as a starting point for a discussion with the GFSC about enhancing the document process.

  • You are currently browsing the archives for the DRMAA category.