(j3.2006) (SC22WG5.3620) [ukfortran] Preparing for the Tokyo meeting

John Wallin jwallin
Wed Nov 5 15:58:04 EST 2008

Hi Nick,

Here is a re-post of my response as per John Reid's request - with a few minor changes.

You have some interesting points.  Here are some thoughts:


----- Original Message -----
From: "N.M. Maclaren" <nmm1 at cam.ac.uk>
Date: Wednesday, November 5, 2008 8:40 am
Subject: Re: (j3.2006) Preparing for the Tokyo meeting

> I have just spotted this one, too.  I am sending to J3 only.
> John Wallin wrote:
> >
> > I think it is certainly true that coarrays are manifestly useless on
> > systems that are uniprocessor von Neumann machines.  However, you can't
> > buy those any more.  Everything now and in the future is going to be
> > multicore, at least from Intel's point of view.
> While that is true, multi-user (shared) systems are NOT going away, and
> parallel applications are a right b*gg*r to schedule on such things.
> Some administrators forbid them, and jump hard on users caught running
> them.  But, in many more, they cause trouble to other users and/or run
> very badly (including taking many times as much CPU as they need).
> So serial Fortran will not go away any time soon, and some people will
> positively want coarray-free compilers (or a fixable mode).

I think that most people would not want to use a code that only runs on a single processor of a larger machine if they had access to even a basic PC.  The performance of a low-end PC with multiple cores will almost always beat that of a single core on a larger multi-core machine.

Of course, there may be pragmatic reasons why we must use part of a multi-user multi-core machine.  The big one would be classes where programming is being taught.  I think that most of the conflicts can be resolved in two ways.

1) You can lower the priority of the jobs using the renice command.  Two of my graduate students do large multicore runs on my desktop (sometimes using the graphics card).  I don't even notice when they are on, because their priority is lowered.  I have even done timing tests on multicore assignments from my HPC class on my desktop while these large runs were in the background.  The speed-up of the higher-priority runs (i.e. MY runs) was the same as if no one else was logged in at all.

2) It would seem sensible to be able to set user limits on the number of images available.  This is clearly an implementation issue, not something intrinsic to the language.  (See the note by Alex.)
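
As a rough illustration of point 2 (a sketch only, not taken from any real code): a coarray program that asks num_images() at run time, rather than assuming a fixed count, keeps working unchanged whatever image limit an administrator imposes.  The program name, the problem size ntotal, and the variable names are all made up for this example.

```fortran
program image_count_agnostic
  implicit none
  integer, parameter :: ntotal = 1000   ! hypothetical problem size
  integer :: partial[*]                 ! one partial sum per image
  integer :: me, n, lo, hi, i, img, total

  me = this_image()
  n  = num_images()    ! whatever the system or administrator allows

  ! Block decomposition of 1..ntotal over however many images exist.
  lo = ((me - 1) * ntotal) / n + 1
  hi = (me * ntotal) / n

  partial = 0
  do i = lo, hi
    partial = partial + i
  end do

  sync all             ! every partial sum is complete before image 1 reads it

  if (me == 1) then
    total = 0
    do img = 1, n
      total = total + partial[img]   ! one-sided read from image img
    end do
    if (total /= ntotal * (ntotal + 1) / 2) error stop "wrong sum"
    print '(a,i0)', "sum = ", total
  end if
end program image_count_agnostic
```

Run on one image or on many, the same source gives the same answer; only the decomposition changes.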

> > With coarrays, you will need to define processor teams and set up the
> > algorithm to work on a heterogeneous machine, but the big advantage is
> > that you can do it without all the horrific overhead associated with
> > using MPI.  (I have some good examples of how horrific this can be for
> > the morbidly curious.)  ...
> If you have any such examples that apply to distributed memory systems
> without special RDMA/coarray hardware, please Email them to me.

MPI is a distributed memory system, so my treecode runs on everything.

Here is a simple example of an MPI coding horror.  I have to pass a Fortran derived type between nodes.

Here is the code I need to use to even make this exchange possible.   Please note that a lot of code has been deleted so I don't overwhelm everyone's email.

  integer (kind = IEEE_INTEGER) :: mpi_sph_type
  integer (kind = IEEE_INTEGER), parameter ::  sphblock_length = 3
  integer (kind = IEEE_INTEGER), parameter ::  sphblock_counts(1:3) = (/4,37,5/)
  integer (kind = IEEE_INTEGER), parameter ::  sphblock_size(1:3)   = (/1,1,ndim/)
  integer (kind = IEEE_INTEGER), parameter ::  sph_mpi_type(1:3) = (/MPI_INTEGER, &

  type (sph_particle_type) :: sph_particle_tmp

  call  MPI_ADDRESS(sph_particle_tmp%particle_number ,   address(1), ierr)
  call  MPI_ADDRESS(sph_particle_tmp%h,                  address(2), ierr)
  call  MPI_ADDRESS(sph_particle_tmp%dw(1),              address(3), ierr)
  do i = 1, sphblock_length
    ptypes(i) = sph_mpi_type(i)
    block_length(i) = sphblock_counts(i) * sphblock_size(i)
    displacements(i) = address(i) - address(1)
  end do
  call MPI_TYPE_STRUCT(sphblock_length, block_length, displacements, &
       ptypes, mpi_sph_type,ierr)
  call MPI_TYPE_COMMIT(mpi_sph_type,ierr)

You might wonder what the 37 in sphblock_counts is.  It is the number of consecutive double precision numbers within this particular data structure.

The important thing to note is that every time a grad student adds a single element to the data structure, you have to alter the block counts and sizes by hand.   This leads to huge problems debugging and maintaining the code if the base structures are modified.  (And this code is the best way I have found for doing it.)

Also, this pattern repeats for all of the 15 or so data structures.   Each time you declare a new data type, you add more MPI code.

Setting up the synchronized sends and receives is also very difficult to code.  You need to copy elements into buffers (after allocating space), set up an order of sends and receives that everyone agrees on, transfer the messages, and then merge the data from the buffers into the original array.

The overhead on this code is huge.  About 40% of the MPI code is bookkeeping needed only because of the message passing.

Nearly all of this could be eliminated by using coarrays.  I believe that something like 75% (of this 40%) of the MPI and supporting code could be deleted.
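
To make the comparison concrete, here is a minimal sketch of what the same kind of exchange might look like with coarrays.  It is an illustration under assumptions, not my actual code: the derived type is a cut-down, hypothetical stand-in for sph_particle_type (only three components, with ndim assumed to be 3), and the block size of 4 particles per image and the ring pattern are invented for the example.

```fortran
program coarray_particle_exchange
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ndim = 3        ! assumed value
  integer, parameter :: nlocal = 4      ! hypothetical particles per image

  type :: sph_particle_type             ! hypothetical cut-down version
    integer  :: particle_number
    real(dp) :: h
    real(dp) :: dw(ndim)
  end type sph_particle_type

  type(sph_particle_type) :: particles(nlocal)[*]   ! visible to all images
  type(sph_particle_type) :: incoming(nlocal)       ! purely local copy
  integer :: me, n, left, i

  me = this_image()
  n  = num_images()

  do i = 1, nlocal
    particles(i)%particle_number = 100 * me + i
    particles(i)%h  = real(me, dp)
    particles(i)%dw = real(i, dp)
  end do

  sync all    ! every image has filled its block before anyone reads

  ! Fetch the whole block from the left-hand neighbour in a ring.
  ! No type maps, block counts, displacements, or pack/unpack buffers.
  left = merge(n, me - 1, me == 1)
  incoming = particles(:)[left]

  sync all    ! all reads are done before any block is modified again

  if (any(incoming%particle_number /= [(100 * left + i, i = 1, nlocal)])) &
    error stop "exchange failed"
  if (me == 1) print '(a,i0,a)', "exchange ok on ", n, " image(s)"
end program coarray_particle_exchange
```

Adding a component to the derived type requires no change to the exchange itself, which is exactly the maintenance problem the MPI version has.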

In short, coarrays would make my head hurt less.

I can post the code with these MPI horrors if you promise not to distribute.   Just let me know...

> > The final part of the argument was that Fortran provides a convenient
> > expression of algorithms that are applicable to any architecture.  This
> > is only partially true.  All of us have tuned our codes to run more
> > efficiently on vector machines.  ...
> Er, the vast majority of Fortran users have never USED or even SEEN a
> vector machine (and, no, I don't count SSE etc.)  You have, and I have,
> but the kiddies - including post-docs here :-) - I teach never have.
> Some have never even HEARD of them!

My point was that if you do a bad enough job of coding, then you get bad performance.  Broken pipelines and memory access pattern issues play the dominant roles in single-node performance problems.  These are not addressed by Fortran directly.  Inefficient code will run, but a good programmer can take steps to make it run more efficiently.

The best example of the effect of memory access patterns is probably renumbering the nodes in finite element codes.  This makes a huge difference in performance on both serial and parallel machines.  We found speed-ups of a factor of 5 or greater on these types of codes, just from using an algorithm that fits the architecture.

In fact, the underlying principle of LAPACK was to fix the problems that LINPACK had with "modern" CPUs.  It didn't require a change of languages, just a change of approach.

With the new multicore/multi-box architectures, we actually need a change in the language to write codes for these machines.   

> > In fact, this is the reason why coarrays need to be included in part 1.
> > However, it is not true that it will run at the same efficiency on all
> > platforms.  That detail is still left to the programmer since compilers
> > really can't figure this out and architectures are rapidly changing.

I would suggest talking to the UPC forum about this.   They have a lot of experience with this, and can address it directly.

However, here are my two cents:
I think a well-written general routine would work on most architectures.  You certainly would sacrifice simplicity for efficiency, but I think it would be possible to make it run well on almost everything.  My treecode is an example of this.  It runs well on multicore machines and on clusters.  There are no problems on any of the architectures, since I designed it for the lowest common standard.

Of course, I also made a purely serial version of my MPI code, mostly because MPI wasn't installed everywhere.  Doing that was a nightmare of #IFDEFs and conditional compilation.  Yes - I do get good, scalable performance, but the cost of doing it was extremely high both in time and in terms of sanity.

With coarrays included in part 1, at least I could be sure I didn't have to write two versions of the code.

> In my view, the real concern is whether they will work RELIABLY on all
> platforms.  This means without artificial deadlock, with only a bounded
> performance degradation, with specified and comprehensible behaviour,
> and without the code doing the wrong thing on occasion because the
> specification is essentially unimplementable.

It is very possible to create codes with deadlocks on any machine.  Of course, higher latency combined with lower bandwidth brings these problems to the forefront more quickly.  However, this would seem to be a problem with the program rather than the language.  Non-synchronized memory doesn't behave well, so users need to beware.  Similar problems exist in all parallel languages.

I think coarrays can express these algorithms more easily and cleanly.  This should (and I would say does, on systems that have coarrays) make it possible to write algorithms with a more concise syntax.  Of course, it doesn't mean that a poorly written algorithm won't deadlock.  Then again, programmers can already write serial code that will iterate forever with no exit conditions.

The point of coarrays is to make it easier to express these ideas so mistakes are minimized, and so programmers can write one code that runs on different machines.  

> And that is why I am going to Tokyo!  We need to ensure that is possible
> before including them.
> Regards,
> Nick Maclaren,
> University of Cambridge Computing Service,
> New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
> Email:  nmm1 at cam.ac.uk
> Tel.:  +44 1223 334761    Fax:  +44 1223 334679
