I attended the MRSC 2009 conference, which took place at the Zuse Institute in Berlin. Some impressions:
- The conference bills itself as a “many-core and reconfigurable supercomputing conference”. In fact, it was more about FPGA-based hardware accelerators than about many-core technologies. The mixture of industry and academia talks was quite interesting. And the catering was really good.
- Prof. Reinefeld, the host of the conference, gave a nice introduction to his interest in FPGA accelerators. One of the main reasons is power consumption: the ZIB facilities went from around 90 kW in 1997 to 260 kW in 2002 and 660 kW in 2008. He also told us that roughly 40% of today's power consumption goes into cooling alone. Specialized FPGA accelerators could make it possible to keep up the pace of speedups AND reduce power consumption significantly.
- I learned that FPGA boards have some general properties that are relevant for programming them. They run at a comparatively low frequency (hundreds of MHz), can support tailored data types very well (if programmed accordingly), and normally provide thousands of ‘cores’ in a SIMD-like fashion. FPGAs mostly have a memory bandwidth problem, meaning that they produce results faster than they can be written back. All algorithms simply run in pure hardware. The latest (very impressive) trend is FPGAs that fit into a standard x86 CPU socket. The FPGA tool vendor provides an FSB or HyperTransport implementation, which allows you to add a specialized CPU with full RAM access to your SMP system.
- All speakers agreed that standard n-core processors, FPGAs, and GPUs all have their right to exist. Standard processors are best at control flow and sequential work, GPUs are perfect for floating-point work, and FPGAs are well suited for the optimized processing of application-specific data types.
- Since I am a fan of standardization, the corresponding activities OpenFPGA and OpenAccelerator must be mentioned.
- Martin Herbordt gave a nice talk about the advantages of FPGA acceleration. I found the slides in a 2006 version. He explained programming models for FPGAs, showing that such applications need heavy restructuring, and that performance is highly sensitive to implementation quality, even more than with MPP programming. His presentation showed that tailoring an FPGA can yield fast implementations of things that would be hard on a standard processor (e.g. random number generation or coordinate transformation). This doesn’t come for free: FPGA-enabled algorithms need to fit the ‘vector-like’ architecture. Stream processing through a series of ALUs seems to be one favourite approach. He explained an example where BLAST (an indexing problem) was accelerated by turning it into a streaming problem.
- Phillip Maar from the University of Potsdam presented a combined approach to designing multiprocessor system-on-chip (MPSoC) solutions. They take a C program and first parallelize it automatically, by clan partitioning, into an MPI application (yes, this was the weak part). In the next step, they perform a functional, cycle-accurate simulation of this program to design an optimal FPGA layout. In order to bring the MPI program logic onto the chip, they created a hardware version of an MPI subset (SocMPI).
- CAPS presented the HMPP workbench, which allows you to parallelize C and Fortran software with preprocessor directives. The most interesting aspect is the support for hybrid systems with GPUs, FPGAs, multiple cores and other execution engines. The software has the concept of “codelets”, which are functions to be executed on a remote device or specialized core. The source code always remains independent of the target accelerator.
- Mitrionics was also an interesting company, mentioned by nearly everybody. Their major product is a virtual processor implementation for FPGAs, which can be stripped down according to the application's needs before it goes on chip. It reminded me of the operating system concepts in embedded systems, where you compile your own version by putting together only the relevant modules (e.g. Windows CE). In the Mitrionics case, you program against the virtual processor function set (so-called tiles), and not the FPGA chip itself. The highly optimized set of standard ‘tiles’ is Turing-complete. All tiles in the virtual processor can be compiled with the bit width you need. This saves precious space on the FPGA chip. Mitrionics has its own parallel programming language for the virtual processor solution. The speaker spent some time explaining why automated parallelization of sequential code can never work. His main argument was that the compiler would need to know all possible parallelization strategies for sequential control flow patterns in advance. (BTW, I completely agree that today's parallelizing compilers perform only intelligent algorithm pattern matching, sometimes supported by annotations such as OpenMP.) The Mitrionics language is based on data dependencies only, without any description of execution order. Somebody from the audience pointed out similarities to the data-flow languages of the 80s, so those might become interesting again.
- Somebody from HP Labs talked about their view of the world, which was very generic. HP seems to count on heterogeneous system integration, because they see the power wall as a near-future problem.
- Microsoft showed the recent parallel computing extensions in Windows 7 and .NET. Nothing new, but nicely explained. For Windows 7, they name better NUMA support (some lightly extended Win32 scheduling APIs) and user-mode scheduling (UMS) as the major things. The first is not really impressive: they basically provide group affinity support and extended information about the core-cache relationships. If you know Linux cpusets, it’s more or less the same. UMS is interesting, since it immediately reminded me of NT4 fibers. The Microsoft guy (from Redmond, not Cambridge!) confirmed that it’s more or less the same, but Dave Probert states it in a different way. Everything will come with Visual Studio 2010, and no, this will not be based on Phoenix.
- Silicon Graphics had a nice presentation with some internal details. The speaker explained how SG was surprised by the rise of GPUs in the last months. Hardware vendors (ClearSpeed, NVidia, XDI) and tool providers (RapidMind, Mitrionics, Alinea) together provide a lot of alternatives to their solutions. He showed how SG is now providing tailored hybrid clusters, and talked a little bit about the main problems with them: power-on sequence, non-atomic resource allocation, SW incompatibilities and online diagnosis flaws. He also showed a really cool research prototype with 180 x 2 Atom processor nodes in one (!) 3U chassis. Plans go up to 10,000 cores per rack, all running under CentOS. Here is a picture.
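
A footnote on the Mitrionics point about programs that state data dependencies only, with no execution order: the idea can be sketched in a few lines of Python. This is only an illustration, not the Mitrionics language and not a claim about how their tools work. The node names, the `deps` and `ops` tables, and the `run` helper are all made up for the example, and it assumes Python 3.9 or newer for the standard `graphlib` module. The program only says which value needs which other values; the order, and the groups of operations that could run side by side, fall out of the dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical computation: each entry names the values it depends on.
# Nothing in these two tables states an execution order.
deps = {
    "a": set(),
    "b": set(),
    "sum": {"a", "b"},
    "diff": {"a", "b"},
    "product": {"sum", "diff"},
}

ops = {
    "a": lambda v: 6,
    "b": lambda v: 2,
    "sum": lambda v: v["a"] + v["b"],
    "diff": lambda v: v["a"] - v["b"],
    "product": lambda v: v["sum"] * v["diff"],
}


def run(deps, ops):
    """Evaluate all nodes, deriving the order from the dependencies alone.

    Returns the computed values and the list of 'waves'. The nodes inside
    one wave do not depend on each other, so they could run in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    values = {}
    waves = []
    while sorter.is_active():
        # Everything whose inputs are already available.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        for name in ready:
            values[name] = ops[name](values)
            sorter.done(name)
    return values, waves


if __name__ == "__main__":
    values, waves = run(deps, ops)
    print(waves)
    print(values["product"])
```

Here the evaluation is sequential, wave by wave, but `sum` and `diff` end up in the same wave, which is exactly the kind of parallelism a data-flow description exposes without the programmer spelling it out.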