[ubuntu-uk] High Performance Computing

Ian Pascoe softy.lofty.ilp at btinternet.com
Mon Aug 6 21:12:02 BST 2007


Tony

Thanks, some great insights and some nice bedtime reading too!

The only point to pick you up on is that LTSP, from v5 or maybe even from the
last v4, can allow applications to physically run on the terminal, utilising
the LTSP server as the HDD.  As you correctly say, this still involves a lot
of network overhead, so LTSP also has the option to utilise local storage
as well for swap and temp.

Or that's my understanding having waded my way through the LTSP user guide!

E

-----Original Message-----
From: ubuntu-uk-bounces at lists.ubuntu.com
[mailto:ubuntu-uk-bounces at lists.ubuntu.com]On Behalf Of Tony Travis
Sent: 05 August 2007 22:41
To: British Ubuntu Talk
Subject: Re: [ubuntu-uk] High Performance Computing


Ian Pascoe wrote:
> G'day all
>
> Anyone out there involved with, or has theoretical experience with HPC
> using clusters?  Before I approach the various projects and make myself
> look a complete twonk, I'd appreciate some views and thoughts.  Please?

Hello, Ian.

I've built a 92-node Ubuntu 6.06.1 LTS + openMosix Beowulf cluster that
I use for bioinformatics:

	http://bioinformatics.rri.sari.ac.uk

> Been looking at Rocks Clusters (http://www.rockclusters.org) which
> provides the HPC platform based on REL 4, and the Linux Terminal Server
> project.  The reason is a theoretical one of utilising redundant computing
> power, ie old out of spec machines that are to be chucked, in an
> environmentally nice way.

Rocks is about distributed computing, and LTSP is about using 'thin'
clients to run applications on a powerful central server. Using 'old'
PC's as 'thin' clients makes sense, but the overhead of network latency
makes it less attractive to use old PC's for distributed HPC computing.

The reliability of old PC's may not be good enough for them to be used
in an HPC cluster. I know this sounds disappointing, but I have a lot of
experience of using COTS (Commodity Off The Shelf) PC's as cluster nodes
and even new COTS PC's can be unreliable. This matters if you want to do
'serious' work, because undetected hardware errors can silently corrupt
your results.

In particular, COTS PC's don't have ECC (Error Correcting Code) memory
and recent PC's don't even have parity checking memory. In desktop use
COTS PC memory is reliable enough to run for a few hours without error,
but not when these PC's are run for months without rebooting, as HPC
compute nodes. I posted a message about this on the openMosix Wiki:

http://howto.x-tend.be/openMosixWiki/index.php/Additions_to_the_FAQ

> So the solution I came up with was to run the computing nodes as diskless
> workstations, getting their kernel / apps from the LTSP server, and
> dealing with the cluster server for the work queue.  This sounds pretty
> straightforward to me.

I think you've misunderstood what LTSP is for: the apps run on the LTSP
server, and are just displayed on the 'thin' clients. What you are
describing is different, and is what I do on our Beowulf cluster. The
compute nodes (COTS PC's) PXE boot from one of the cluster servers, and
they run an NFSROOT kernel.
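
For what it's worth, the PXE side of that is little more than a pxelinux
config entry pointing the kernel at an NFS root. This is only a rough
sketch; the server address, export path and kernel filename below are
made up for illustration:

	DEFAULT nfsroot
	LABEL nfsroot
	  # kernel image served over TFTP (filename is an example)
	  KERNEL vmlinuz-2.4.26-om1
	  # mount the root filesystem over NFS instead of a local disk
	  APPEND root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot ip=dhcp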

> However, all references I find to cluster computing show that the
> computing nodes each are headless systems; which is fine but I wanted to
> look at reducing the green footprint by taking the heavy power
> requirements out of the equation, ie the HDDs etc.

In fact it's the CPU that consumes most of the power, especially when
it's working hard. My compute nodes are actually 'dataless' rather than
diskless: the reason is, again, network latency. The idea is to use a
local disk on each node for swap and /tmp.
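
To give you an idea, /etc/fstab on such a 'dataless' node ends up looking
something like this (the NFS server and device names are just placeholders):

	# / comes over NFS; only swap and /tmp touch the local disk
	192.168.1.1:/srv/nfsroot   /      nfs    defaults   0 0
	/dev/hda1                  none   swap   sw         0 0
	/dev/hda2                  /tmp   ext3   defaults   0 0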

> I have already identified some technical aspects that knock this on the
> head - the main one being I envisage two separate ethernet networks, one
> for LTSP and the other for the cluster, but neither software supports
> more than one NIC on a terminal / computing node.

My Beowulf is based on the design of EPCC's (Edinburgh Parallel Computing
Centre) BOBCAT (Budget-Optimised Beowulf Cluster using Affordable
Technology). A particular feature of this architecture is the use of two
separate network fabrics: one is the 'system' network, the other is the
'application' network. The original BOBCAT web site no longer exists,
but these links might be of interest:

http://www.hoise.com/primeur/00/articles/monthly/AE-PR-10-00-1.html
http://www.dl.ac.uk/TCSC/DisCo/TechPapers/Beowulf/node7.html

There is also a more detailed report about this type of Beowulf cluster
(PostScript format) at:

	http://www.dl.ac.uk/TCSC/DisCo/TechPapers/Beowulf/beowulf.ps
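
On the node side, the two fabrics just mean two NICs on two subnets, e.g.
in /etc/network/interfaces (the addresses below are invented; the point is
keeping boot/NFS traffic and inter-node job traffic apart):

	# eth0: 'system' network (PXE boot, NFS root, admin)
	auto eth0
	iface eth0 inet static
	    address 192.168.1.10
	    netmask 255.255.255.0

	# eth1: 'application' network (inter-node job traffic)
	auto eth1
	iface eth1 inet static
	    address 10.0.1.10
	    netmask 255.255.255.0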

> The next problem is that of storage space on the diskless terminal.  By
> utilising the LTSP server as the processor rather spoils the whole thing,
> so I've looked at the terminal running the kernel and any apps locally,
> using the LTSP server to host the files required by the kernel / apps to
> run.  This will reduce the load on the LTSP network.  However, the
> terminal will still require to store stuff temporarily, like the swap
> partition, so I thought about using either flash drives, too expensive,
> or USB pen drives, preferred.

I think you're confusing two things: LTSP runs applications centrally,
but displays output on distributed clients. Rocks runs distributed
applications, but you could, if you wanted, display output from HPC on a
'thin' client. However, the idea of running distributed applications on
'thin' clients is not very good. The openMosix software I run can be used
to do what you want, but by CPU 'cycle stealing' to distribute jobs
between powerful workstations that may sometimes be idle:

	http://openmosix.sourceforge.net/
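
Using it is deliberately unexciting: assuming the openMosix user-space
tools are installed, you just start a job and let the kernel migrate it to
whichever node is idle. The job name below is obviously a placeholder:

	# start a CPU-bound job; openMosix is free to migrate it
	mosrun ./long_simulation &
	# watch the load spread across the cluster nodes
	mosmon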

> I chose Rocks over other projects mainly due to its pedigree and support
> infrastructure, and LTSP as it seems to work with practically everything.

I think you are more impressed with Rocks' pedigree than you should be!

> "But why?" I hopefully hear you groan.  As I say it's all theoretical, but
> doing some research there is certainly need for this type of setup.  Maybe
> not for a top level production system, but one that just plods along and
> does the job.

If you want to learn about HPC, your approach is fine, but if you want to
get a job done then buy a new quad-core AMD64 motherboard, which will
outperform a network of many 'old' PC's...

> Sorry, I know it's not exactly Ubuntu orientated .... but this area
> really interests me.

Well, I'd better put my flame-proof underpants on too as I've gone on a
bit about it, but the Beowulf cluster I've built does run Ubuntu 6.06.1
LTS with a linux-2.4.26-om1 openMosix kernel, recompiled with NFSROOT
for the PXE 'dataless' compute nodes. I've made deb's of this if you're
interested:

	http://bioinformatics.rri.sari.ac.uk/~ajt/openmosix
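
The NFSROOT recompile itself mostly comes down to a handful of 2.4 kernel
config options: roughly, you need something like the following built in
before rebuilding (check against your own .config, this is just the gist):

	# IP autoconfiguration at boot, so the node can find its NFS root
	CONFIG_IP_PNP=y
	CONFIG_IP_PNP_DHCP=y
	# NFS client built in, plus root-over-NFS support
	CONFIG_NFS_FS=y
	CONFIG_ROOT_NFS=y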

Best wishes,

	Tony.
--
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687

--
ubuntu-uk at lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-uk
https://wiki.kubuntu.org/UKTeam/




