(j3.2006) (SC22WG5.4931) WG5 ballot on first draft TS 18508, Additional Parallel Features in Fortran

N.M. Maclaren nmm1
Wed Mar 13 15:43:09 EDT 2013


Please answer the following question "Is N1967 ready for forwarding to
SC22 as the DTS?" in one of these ways.

    3) No, for the following reasons.

Regards,
Nick Maclaren.



I have not had time to cross-check on all of the details of N1967
against Fortran 2008, so these are not necessarily all of my objections.
I have drafted some proposals for resolution, but have not included them
in this.  I will post them separately.



REASONS FOR VOTING NO
---------------------
---------------------

Generic
-------

    1.1) The wording refers to cases when the execution of a statement
is not successful, but Fortran 2008 refers to error conditions.  This is
confusing, at best, and they should use compatible terminology.  It is
more serious when one considers node failure.

    1.2) That is not the only aspect in which the details differ.
The wording and other details need a systematic check and improvement.

    1.3) I am distinctly unhappy about the number of places where
semantics are defined for error conditions that are caused by
infrastructure failure, which is not in accordance with the Fortran
standard's previous practice.  STAT_FAILED_IMAGE is mentioned later, but
this is also done for events.

    1.4) The current dominating standard for parallel processing is MPI,
and its basic model has proven to be solid over many years.  This TS
provides many comparable facilities, but does not seem to have included
the comparable constraints needed for correctness and implementability.
This applies particularly to teams, but also to collectives.


Teams
-----

I have serious difficulty even understanding the basic model, and it
appears to make little sense.  FORM SUBTEAM is specified to be an
ordinary statement creating a variable, and all synchronisation is in
CHANGE TEAM, using a variable defined by a previous FORM SUBTEAM
statement.  All of the descriptions of which team is being referred to
are in terms of a variable, and not a value.  The following are a few of
the issues this causes.

    2.1) What happens if only some images in the current team have
called FORM SUBTEAM?  How does CHANGE TEAM know which other images to
wait on?

    2.2) In the following, do alf and bert indicate the same subteam?
And is it allowed to create two different teams at the same level, as in
bert and colin?  And how do other images know which of these FORM
SUBTEAM statements matches the FORM SUBTEAM statement on their image?

    TEAM_TYPE alf, bert, colin, dave
    FORM SUBTEAM (13, alf)
    FORM SUBTEAM (13, bert)
    FORM SUBTEAM (666, colin)
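
For comparison with 1.4, MPI's MPI_Comm_split is collective over the
parent communicator, so the matching question in 2.1 and 2.2 cannot
arise there.  The following is a rough Python model of that constraint,
not Fortran and not anything N1967 specifies; the function name
form_subteams and the colour dictionary are invented for illustration.

```python
def form_subteams(parent_images, colour_of):
    """Toy collective split: every image of the parent must supply a colour.

    parent_images and colour_of are hypothetical stand-ins for the current
    team and for the first argument each image passes to FORM SUBTEAM.
    """
    missing = [img for img in parent_images if img not in colour_of]
    if missing:
        # In MPI terms, the collective call was not made on these ranks.
        raise ValueError(f"images {missing} did not take part in the split")
    teams = {}
    for img in sorted(parent_images):
        teams.setdefault(colour_of[img], []).append(img)
    return teams

# All four images take part, so the subteams are well defined.
teams = form_subteams([1, 2, 3, 4], {1: 13, 2: 13, 3: 666, 4: 666})
```

With a rule of this kind, a split in which image 3 never calls the
statement is an error at the point of the split, rather than a question
that CHANGE TEAM has to answer later.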

    2.3) Fortran defines intrinsic assignment for derived types; even if
that were locked out, several argument passing mechanisms imply implicit
copying.  The nearest that Fortran has to the concept of two variables
being the same is association.  It is not within the remit of this TS to
add major new, fundamental semantic concepts to Fortran, such as
unassociatable objects.  For example, it is unclear what either of the
following means:

    TEAM_TYPE alf, bert
    FORM SUBTEAM (13, alf)
    bert = alf
    FORM SUBTEAM (42, bert)

or:

    TEAM_TYPE alf
    FORM SUBTEAM (13, alf)
    CALL ugh(alf)

    SUBROUTINE ugh (arg)
        TEAM_TYPE arg
        FORM SUBTEAM (42, arg)
    END SUBROUTINE ugh

    2.4) The following is allowed by the specification, but it makes no
sense.  Specifying synchronisation by how often CHANGE TEAM is called
doesn't work if its argument may be variable and there are no further
constraints.

    TEAM_TYPE array(NUM_IMAGES())
    ! Set up somehow
    REAL :: junk
    CALL RANDOM_NUMBER(junk)
    CHANGE TEAM (array(INT(junk*THIS_IMAGE())+1))
        ...
    END TEAM

    2.5) Allowing subteam variables in CHANGE TEAM with no further
constraints allows non-hierarchical team usage, which was not the intent
of N1930 T3.

    TEAM_TYPE alf, bert, colin
    FORM SUBTEAM (13, alf)
    CHANGE TEAM (alf)
        FORM SUBTEAM (42, bert)
    END TEAM
    FORM SUBTEAM (666, colin)
    CHANGE TEAM (colin)
        CHANGE TEAM (bert)
            ...
        END TEAM
    END TEAM
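
The kind of constraint that is missing can be sketched as a toy model
in Python (not Fortran).  The Team class and the parent check are
assumptions made for illustration; N1967 specifies neither.

```python
class Team:
    """Toy model of a team value: a name plus the team it was formed from."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def is_hierarchical_change(current, target):
    """Hypothetical rule: CHANGE TEAM may only enter a child of the current team."""
    return target.parent is current

initial = Team("initial")
alf = Team("alf", parent=initial)      # FORM SUBTEAM (13, alf)
bert = Team("bert", parent=alf)        # formed inside CHANGE TEAM (alf)
colin = Team("colin", parent=initial)  # FORM SUBTEAM (666, colin)

ok_outer = is_hierarchical_change(initial, colin)  # CHANGE TEAM (colin)
ok_inner = is_hierarchical_change(colin, bert)     # CHANGE TEAM (bert) within colin
```

Under such a rule the outer CHANGE TEAM (colin) would be accepted and
the inner CHANGE TEAM (bert) rejected, because bert was formed from alf
and not from colin.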
   
    2.6) The issue described in 2.4 also allows SYNC TEAM to synchronise
teams which are not the current team or one of its descendants.  This
is, at best, a recipe for deadlock.  Even allowing it on ancestors
introduces a conflict with N1930 T1 and T2.  Also, I cannot see that the
statement is required by N1930, or actually necessary.  It can be done
by temporarily changing team and calling SYNC ALL.

There are other problems, too, such as:

    2.7) In the specification of CHANGE TEAM, the current team when the
CHANGE TEAM was executed is not necessarily the parent of the team that
is being changed to, so specifying synchronisation of the parent team is
incorrect.

    2.8) I have tried to convince myself that correct programs will not
deadlock, and I have tried to convince myself that correct programs can
deadlock, and have failed with both.  The design is just too complicated
to be sure it is correct.

    2.9) The design very dubiously meets the requirement N1930 T2,
because an image belongs to all of the teams that it has formed and can
use them, which is the cause of the SYNC TEAM problems.

    2.10) There are some very nasty issues to do with resource leakage
if these facilities are used in a library.  FORM SUBTEAM creates a
handle to something or other, but there is no way to release that
handle.  This would be easily soluble only if its function were subsumed
into CHANGE TEAM.

    2.11) It has omitted the qualification in LOCK and UNLOCK that
semantics are defined only for successful execution of the statements.
This is a variant of reason 1.1.


Conclusion: the constraints on team actions and the semantics of teams
need a complete rethink.



Collectives
-----------

    3.1) CO_REDUCE requires commutativity but not associativity of
OPERATOR, which makes no sense.  MPI requires associativity but not
commutativity, which at least makes sense.  It should require both.
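
A small Python sketch (not Fortran, and not any real CO_REDUCE
implementation) shows why commutativity alone is not enough.  The
operation mean2 is a made-up example that is commutative but not
associative, and the two reduction orders stand in for two processors
that combine the images' contributions in different shapes.

```python
from functools import reduce

def mean2(a, b):
    """Hypothetical OPERATOR: commutative, but not associative."""
    return (a + b) / 2.0

def reduce_left(op, values):
    """Combine as ((v1 op v2) op v3) op v4, a simple linear order."""
    return reduce(op, values)

def reduce_pairwise(op, values):
    """Combine as a balanced binary tree, as a tree-based collective might."""
    values = list(values)
    while len(values) > 1:
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carried to the next round
        values = paired
    return values[0]

contributions = [1.0, 2.0, 3.0, 4.0]
left = reduce_left(mean2, contributions)      # 3.125
tree = reduce_pairwise(mean2, contributions)  # 2.5
```

Both orders are legitimate ways to implement a reduction, yet they
disagree for mean2.  For an operation that is both commutative and
associative, such as addition on these values, they agree.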

    3.2) Also, it does not specify anything about the consistency of
OPERATOR, which is a recipe for problems.  I have serious difficulty in
understanding the combination of C730, C1218, C1220, C1234 and 12.4.3.6
paragraph 7, but can believe that the requirement for an elemental
function means that it must be the same function.  However, that is not
enough (semantically) because of global or parent scope variables and
THIS_IMAGE().  This should be improved.
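
The THIS_IMAGE() case can be illustrated by a deliberately trivial
Python sketch; make_operator and its image-dependent term are invented
for the purpose and do not correspond to anything in the TS.

```python
def make_operator(this_image):
    """Hypothetical OPERATOR whose result depends on the executing image."""
    def op(a, b):
        # Same source text on every image, different behaviour on each.
        return a + b + this_image
    return op

def combine_on(image, a, b):
    """Apply the 'same' operator as it would run on the given image."""
    return make_operator(image)(a, b)

result_on_image_1 = combine_on(1, 10, 20)  # 31
result_on_image_2 = combine_on(2, 10, 20)  # 32
```

The function is textually identical everywhere, but the result of the
reduction depends on which image happens to perform each combination.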

    3.3) The specification of the ordering of collective subroutines
makes sense and is what we agreed, but remains confusing.  A NOTE should
be added to clarify our intent.



Events
------

    4.1) I am baffled by the reference to INTENT(INOUT) in C602 and
C603.  In particular, both EVENT POST and EVENT WAIT necessarily both
read and write the variable, so it seems bizarre to lock out the only
case that makes semantic sense.  Neither of those statements makes any
reference to whether their event-variable may be INTENT(IN), INTENT(OUT)
or PROTECTED, none of which make semantic sense.  The only thing that I
can assume is that the sense of the condition has got inverted.  This
needs fixing.

    4.2) Page 14 (6.4 EVENT WAIT) lines 7-11 are surely erroneous in the
case where the EVENT POST fails, and probably when the EVENT WAIT does.
This matter is not as simple as it appears to be, because it has a
significant impact on permitted serial optimisations.  See Data
Consistency below.

    4.3) The word 'later' is thoroughly ill-defined in a parallel
context, especially when it is applied to general semaphores.  In
particular, it raises the question of which one of several possible uses
of EVENT POST the EVENT WAIT synchronises with.  As the
specification stands, this means that they must NOT be image control
statements, because that would introduce a logically recursive
definition into the standard.  I.e. the sequence of their execution
controls the ordering, but the ordering controls the sequence of their
execution!

This needs specifying properly, and would be vastly simplified if events
were changed from being general semaphores to being binary ones.  See
Data Consistency below.
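
The matching problem can be seen in a single-threaded Python toy (not
Fortran, and not the TS's event semantics).  CountingEvent is a
hypothetical general semaphore that, like any counting semaphore, keeps
only a count.

```python
class CountingEvent:
    """Toy general (counting) semaphore: a post only increments a count."""
    def __init__(self):
        self.count = 0

    def post(self, poster):
        # The poster's identity is deliberately discarded; only the count survives.
        self.count += 1

    def try_wait(self):
        """Consume one post if any is pending; report whether one was."""
        if self.count > 0:
            self.count -= 1
            return True
        return False

def state_after(post_order):
    """Post in the given order, wait once, and return what is observable."""
    event = CountingEvent()
    for poster in post_order:
        event.post(poster)
    waited = event.try_wait()
    return waited, event.count

a_then_b = state_after(["image 2", "image 3"])
b_then_a = state_after(["image 3", "image 2"])
```

The two runs are indistinguishable to the waiting image, so nothing in
the state of the event says which EVENT POST the EVENT WAIT has
synchronised with.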

    4.4) There is nothing said about global consistency, which is
well-known to be a potential problem with such actions (as with
atomics).  In particular, it might seem obvious to assume sequential
consistency, but that does NOT immediately follow.  Whatever model is
chosen needs specifying.  See Data Consistency below.

Obviously, that choice has a major impact on the EVENT_QUERY intrinsic,
especially as it is defined only when it is ordered with respect to all
EVENT POST and EVENT WAIT statements.



Atomic Intrinsics
-----------------

There are at least two structural problems with these.

    5.1) The first is that their ATOM argument is not required to be a
coarray, unlike ATOMIC_DEFINE and ATOMIC_REF.  This leaves the behaviour
undefined if an atomic coarray object is an actual argument to a
procedure that does not declare it as a coarray but then calls these
procedures.  That needs fixing.

    5.2) The second is that these extensions are truly baffling in the
context of Fortran 2008 13.1 paragraph 3 and Note 13.1.  I am not sure
what to do, but supporting the fetch-and-operate paradigm means that the
global data consistency problem simply has to be addressed.  There are
several options, but all have extremely unobvious and serious
consequences.  See Data Consistency below.


Data Consistency
----------------

    6.1) This is not a simple matter, and WG5 will be making a serious
mistake if it proceeds to add facilities of the nature proposed in the
TS without putting some serious thought into the data consistency model.
In Fortran 2008, we evaded this by selecting a simple and extremely
proscriptive model for SYNC IMAGES and kicking the consistency of
atomics into the long grass.  This is no longer viable, for two reasons:

    1) Fortran events are general semaphores.  While these are
well-studied, they are nowadays usually avoided in favour of other
mechanisms.  However, I have so far been unable to find any precise
description of the ordering semantics for general semaphores, or
convince myself that I understand that aspect.  Dijkstra himself
pointed out that they are no more powerful than binary semaphores.
See 4.2 in:

    http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html

Worse, the current specification defines behaviour if they are
unsuccessful, and I have absolutely no idea what implications that might
have.

In particular, events affect segment ordering and, if we do not specify
anything, that is going to break the properties of the data consistency
model that we agreed (after much discussion!) in Fortran 2008.  Either
we need to simplify them very considerably (probably to binary,
unconditional semaphores), or we need to call in some specialist
expertise.  I am completely out of my depth with the current
specification.

    2) I really can't see any way to make sense of these atomics except
by enforcing consistency.  In particular, it is NOT automatic that
operations on even a single atomic variable are sequentially consistent,
which was the topic of such debate in Fortran 2008.  However, thinking
about what anything other than sequential consistency might mean with
the fetch-and-operate intrinsics makes my head hurt.

There is also the point that the simple, inconsistent atomics that we
defined in Fortran 2008 are extremely useful on systems that have no
hardware or operating system support for global consistency, because
they can often be implemented efficiently, whereas consistent atomics
need to be emulated by using locks or equivalent.  There are also a lot
of uses for atomics that do not require any particular consistency.
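
As a rough picture of what emulation by locks means, here is a Python
sketch using threads in place of images.  LockedAtomicInt and fetch_add
are invented names; this is not how any Fortran processor is known to
implement the TS intrinsics.

```python
import threading

class LockedAtomicInt:
    """Hypothetical consistent atomic integer, emulated with a lock."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        """Add delta and return the previous value as one indivisible step."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def load(self):
        with self._lock:
            return self._value

def worker(atom, n, seen):
    for _ in range(n):
        seen.append(atom.fetch_add(1))  # record every value fetched

atom = LockedAtomicInt()
seen = []
threads = [threading.Thread(target=worker, args=(atom, 1000, seen))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every fetched value is distinct and the final value is the total number
of increments, but each operation pays for a lock acquisition, which is
the cost that the simple Fortran 2008 atomics avoid.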

    6.2) When it comes to consistency, there is no logical difference
between a PGAS model and shared memory, and one of the few good designs
I have seen is the C++ memory model.  For a clear and fairly simple
description of the issues, see sections 1 to 3 of:

    http://www.hpl.hp.com/techreports/2008/HPL-2008-56.html
 in:
    http://www.hpl.hp.com/personal/Hans_Boehm/pubs.html

Evidence of the model's consistency is in:

    http://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf
 in:
    http://www.cl.cam.ac.uk/~pes20/

Regrettably, I cannot claim to be enough of an expert to guarantee to
validate such a design for Fortran, though I am enough of one to spot
obvious flaws.

There is the question of whether the atomics should also synchronise
other data uses.  I believe that our assumption is that they should NOT
necessarily do so, which is a significant difference from the C++ model.
This is the point at which I get out of my depth.  I am almost certain
that the Fortran design makes sense, but parallel semantics is so
deceptive that I am chary of assuming I am right.

    6.3) While I believe that the time has come to do it properly, I
accept the argument that this would derail the schedule.  However, I
believe that it is essential that we (a) integrate events with the basic
sequentially consistent segment model, (b) decide and define something
about atomics and (c) try to avoid taking decisions that will prevent
a proper solution later.

    6.4) Arising from this, I believe that the easiest solution to the
event problem is to simplify them to be binary semaphores and explicitly
require all image control statements to be sequentially consistent.
This last is not a change, but merely a more explicit statement of what
the standard currently says.  I am a bit nervous about even allowing
EVENT POST on an already posted event variable to return an indication
of the fact, but suspect that it will be so often demanded that it is
unavoidable, and it does not add any problems that have not already been
introduced by the ACQUIRED_LOCK= specifier in the LOCK statement.
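
A binary event of the kind suggested here could behave like the
following single-threaded Python toy.  BinaryEvent, post and try_wait
are hypothetical names, and the value returned by post models the
indication discussed above; none of this is TS syntax.

```python
class BinaryEvent:
    """Toy binary semaphore: the event is either posted or it is not."""
    def __init__(self):
        self.posted = False

    def post(self):
        """Post the event; return True if it was already posted."""
        already = self.posted
        self.posted = True
        return already

    def try_wait(self):
        """Consume the post if there is one; report whether there was."""
        if self.posted:
            self.posted = False
            return True
        return False

event = BinaryEvent()
first_post = event.post()       # False: the event was clear
second_post = event.post()      # True: already posted, nothing accumulates
first_wait = event.try_wait()   # True: consumes the single post
second_wait = event.try_wait()  # False: nothing left to consume
```

Because at most one post is ever outstanding, a successful wait can be
matched with a single post, which is what makes the ordering easy to
specify.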

    6.5) I believe that the simplest solution to the atomics problem is
to explicitly define sequential consistency on a single atomic variable
for the new atomics, and to explicitly state that the sequence for two
variables need not be consistent.  This leaves ATOMIC_DEFINE and
ATOMIC_REF as anomalies, and I believe that we should provide
alternative store and fetch atomics that are consistent with the new
ones, and leave ATOMIC_DEFINE and ATOMIC_REF as processor-dependent.
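
The difference between consistency on one variable and consistency
across two can be illustrated with the classic store-buffering litmus
test.  The Python sketch below enumerates every sequentially consistent
execution of two threads; the encoding of operations as tuples is an
assumption of the sketch, and the threads stand in for images.

```python
# Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
T1 = [(1, "store", "x"), (1, "load", "y")]
T2 = [(2, "store", "y"), (2, "load", "x")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    loaded = {}
    for thread, kind, var in schedule:
        if kind == "store":
            memory[var] = 1
        else:
            loaded[thread] = memory[var]
    return (loaded[1], loaded[2])  # (r1, r2)

outcomes = {run(s) for s in interleavings(T1, T2)}
```

Under sequential consistency the outcome r1 = 0 and r2 = 0 never
appears.  A model that orders the operations on each variable
separately, but not across the two variables, would permit it.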



Image Failure
-------------

    7.1) This is not a minor addition.  No language has ever managed to
standardise recovery of an application from general system-generated
errors or infrastructure failure, and even POSIX does not attempt it.
There are fundamental reasons why this should not be attempted in a
portable language.

Fortran has not so far defined any recovery facilities from even the
simplest cases of I/O errors (e.g. erroneous data in formatted input),
which have been supported by many languages for the past 40 years.
Image failure is significantly harder to recover from than even
system-generated I/O errors such as a disk failure.  Furthermore, I/O
error recovery is required by a vastly larger number of programs than
even use coarrays, let alone those that want this facility.

Fortran has not defined any form of the now-conventional exception
handling, because of difficulties in integrating it with the existing
language.  Even Ada has not attempted to define recovery from
system-generated errors, let alone infrastructure failure.  C's signal
facility permits this sort of failure trapping, but is merely a
standardised syntax for entirely undefined (sic) semantics;
implementations that simply crash exist and are conforming.

When an image fails, it will usually be while it is active, which at the
very least means that any data it might have defined in its active
segment (including coarray data on other images) becomes undefined.  Any
other images that may have been interacting with the image when it
failed (whether via coarray data that it owns, collectives, events or
other) also reach an undefined state.  Because Fortran permits the
buffering of file operations over image control statements, the output
and error units, and any shared files also become undefined, even if
they were not being accessed at the time or even used in the image that
failed.

Lastly, there is nothing stopping processors from providing this
facility as an extension, and that would be a far better way to do it,
at least until such time as this is shown to be feasible in at
least most processors.

This feature should be removed.



