(j3.2006) (SC22WG5.3645) [ukfortran] A comment on John Wallin's comments on Nick MacLaren's comments
Aleksandar Donev
donev1
Thu Nov 6 19:16:22 EST 2008
On Thursday 06 November 2008 15:39, N.M. Maclaren wrote:
> >Deadlocks introduced by the implementation should be reported as bugs to
> >the vendor.
> Unless the standard makes it clear whether the programmer or implementor
> is at fault, that is merely a waste of time.
I have been discussing these things with Nick in private for some time now, so
let me say a few words to WG5 (since I won't be in Tokyo---I think arguing
over e-mail is somewhat inefficient and wasteful).
I agree with Nick that we should make a decision concerning "progress"
and somehow enshrine it in the standard, even if we cannot say anything
in standardese. This, at the very least, will give guidance to both
programmers and implementors as to what kinds of codes/implementations
not to write. I do not agree that leaving it up to the vendors to "get
it right" is good enough here.
As Nick rightly mentions, tunability, i.e., transparency of performance
to the programmer, is very, very important and cannot be replaced by
faith, especially when that faith is not portable.
This kind of thing plagued, and still plagues, array syntax: nice
syntax is great, but if it cannot be intuitively mapped onto performance
there is a problem. Coarrays are 1000 times better than, for example,
HPF in this respect, due to the transparency of remote operations.
However, Nick's example (pasted from his e-mail below, since the WG5
site is not working for me) is not that clear cut. I think mixing the
spin-loop example into this discussion is a sidetrack; the example and
explanation pasted below are IMO more illustrative.
The essential feature in these examples is that the progress of the program
depends on some notion of uniform or at least persistent progress of all of
the images, and of all the communication (regardless of what images are
doing). Sure, a good implementation *and* a good machine will automatically
ensure that, but is it in some sense required or expected? What Nick and I
seemed to agree on is listed below as our "intent". Please consider this
carefully instead of endlessly arguing about job/thread schedulers.
The difference between the serial and parallel cases is somewhat vague,
but I believe it is real. In particular, in the parallel case there is
no way any heuristic can actually determine how to "progress" the images
in situations where it cannot ensure that everything (CPU loops, I/O, OS
interrupts, communication) progresses at the same time (which would be
possible if you had a separate co-processor or RDMA engine for
communication, and a separate core or CPU for every image).
As a worst-case scenario, consider running n_images=2 as two threads on
a single CPU with a single core. It seems reasonable that users may want
to do this for debugging, unless we warn them against it. Should we?
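To make that scenario concrete, here is a minimal sketch of my own (the
program name, the variable buf, and the loop bound are arbitrary
illustrative choices). The coarray buf is never defined after its
initialization, so the unordered remote read on image 2 is conforming;
the only question is when it can complete.
PROGRAM SingleCore
  ! Illustrative sketch: run with n_images = 2, e.g. as two threads
  ! time-sliced on a single core.
  INTEGER :: buf[*] = 0
  REAL    :: t
  INTEGER :: i
  IF (THIS_IMAGE() == 1) THEN
    t = 0.0
    DO i = 1, 100000000        ! image 1 is busy in ordinary Fortran code
      t = t + SQRT(REAL(i))
    END DO
    PRINT *, 'image 1 done:', t
  ELSE
    PRINT *, buf[1]            ! remote read from image 1: when is it serviced?
  END IF
  SYNC ALL
END PROGRAM SingleCore
If the implementation services remote requests only at image control
statements, image 2's PRINT cannot complete before image 1 reaches its
SYNC ALL; and if image 2 busy-waits for it on the only core without ever
being preempted, image 1 may never run at all.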
Best,
Aleks
===========================
5) While discussing paper N1744 via e-mail, the first author realised
that he had made a mistake in including the issue of 'progress' together
with user-defined ordering. While the combination is by far the most
likely to cause deadlock in practice, and is resolved by proposal (4)
above, the issue can arise even with SYNC IMAGES. Because this is a
complicated matter, there is an explanation appended below.
It is clear that not much can be required on all systems, but much more
can be requested on suitable ones. Of the widespread parallel
interfaces, only MPI has specified exactly when progress is required in
normative text. Consider the case of image P changing an object Z on
image R, and a subsequent ordered segment on image Q accessing that
object as Z[R].
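To make this pattern concrete, here is a minimal sketch (not from the
paper; the image numbers, the variable z, and the busy loop are
arbitrary illustrative choices): image 1 plays P, image 2 plays Q, and
image 3 plays R, which is busy in ordinary Fortran code while the
accesses to Z are attempted.
PROGRAM PQR
  ! Illustrative sketch only; needs at least three images.
  INTEGER :: z[*] = 0
  REAL    :: t
  INTEGER :: i
  SELECT CASE (THIS_IMAGE())
  CASE(1)                      ! P: define Z on image R
    z[3] = 42
    SYNC IMAGES ( (/ 2 /) )    ! orders this segment before Q's read
  CASE(2)                      ! Q: read Z[R] in a subsequent, ordered segment
    SYNC IMAGES ( (/ 1 /) )
    PRINT *, z[3]
  CASE(3)                      ! R: busy in ordinary Fortran code meanwhile
    t = 0.0
    DO i = 1, 100000000
      t = t + SQRT(REAL(i))
    END DO
    PRINT *, 'image 3 done:', t
  END SELECT
  SYNC ALL
END PROGRAM PQR
Whether image 1's assignment and image 2's PRINT can complete while
image 3 is still in its loop, rather than only once image 3 reaches its
SYNC ALL, is the question; the intent below refers to exactly these
accesses.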
We are in agreement about the intent, but not about how to express it.
Our intent is:
1) Where possible, those accesses will proceed without major delay
irrespective of what image R is doing at the time. In particular, a
"good" implementation will not introduce deadlock into a program that
does not have it.
2) At the very least, they will proceed whenever image R enters or
leaves an image control (and probably I/O) statement, and within a short
time if R is blocked in an image control statement. This is
implementable on all plausible systems, and is roughly what MPI
specifies.
3) It is unclear whether anything should be said about the case when
image R is executing ordinary Fortran code (e.g. a CPU loop). While
specifying that the accesses will proceed without major delay is clearly
possible, it does constrain implementations. This needs careful
consideration.
Explanation of the Progress Issue
---------------------------------
The question is whether images P and Q can communicate through a coarray
on image R, irrespective of what R is doing at the time. This is
extremely hard to implement on some systems, at least when R is in a
call to a companion processor, performing I/O or in a long-running
'pure' CPU loop.
For example:
PROGRAM Progress
  INTEGER :: one[*] = 0
  SELECT CASE (THIS_IMAGE())
  CASE(1)
    one[9] = 123+one[8]        ! requires responses from images 8 and 9
    SYNC IMAGES ( (/ 2 /) )
  CASE(2)
    SYNC IMAGES ( (/ 1 /) )
    PRINT *, one[9]
  CASE(8)
    one[2] = 456+one[1]        ! requires responses from images 1 and 2
    SYNC IMAGES ( (/ 9 /) )
  CASE(9)
    SYNC IMAGES ( (/ 8 /) )
    PRINT *, one[1]
  END SELECT
END PROGRAM Progress
Consider a processor where an image services requests for coarray data
that it owns only when it reaches an image control statement; this is
common among MPI implementations, and is also what the reference
implementation of UPC does. The above program will deadlock, because
image 1 will not reach its SYNC IMAGES until images 8 and 9 have
responded, and image 8 will not reach its SYNC IMAGES until images 1 and
2 have responded; images 1 and 8 therefore each wait for a response that
the other can give only after reaching its own SYNC IMAGES. Obviously,
that is a poor implementation of coarrays, but that is not the point at
issue. The question is whether it is a conforming processor in the sense
of 1.4 paragraph 2.