[J3] FORM TEAM statement NEW_INDEX= specifier & failed images

Nathan Weeks weeks at iastate.edu
Sat May 11 12:52:27 EDT 2019


Hi all,

Thanks for the helpful clarification (and for identifying where the standard
is unclear). I'll note that this issue affects the first failed-images example
in section C.6.8 of the Fortran 2018 standard, so there is motivation for
clarifying the standard itself.

One drawback of modifying the standard to assign stopped/failed images to
new teams in a processor-dependent manner when FORM TEAM includes
NEW_INDEX= is that the resulting team wouldn't be able to use collective
routines. I suppose the program could work around this: check whether FORM
TEAM sets the STAT= variable to STAT_FAILED_IMAGE, and in that case execute
a FORM TEAM without NEW_INDEX= to cull non-active images before trying FORM
TEAM with NEW_INDEX= again. An advantage, I suppose, is that it would give
applications additional flexibility?
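
To make the workaround concrete, here is a minimal sketch. It assumes that
every active image observes STAT_FAILED_IMAGE from the first FORM TEAM and
therefore takes the same branch, and that the processor supports teams and
failed images. The odd/even partition, the helper name "partition", and the
variable names are hypothetical and only for illustration.

```fortran
program form_team_retry
  use, intrinsic :: iso_fortran_env, only: team_type, stat_failed_image
  implicit none

  type(team_type) :: active_team, new_team
  integer :: stat, team_id, new_idx

  ! First attempt: partition the current team directly.
  call partition(team_id, new_idx)
  form team (team_id, new_team, new_index=new_idx, stat=stat)

  if (stat == stat_failed_image) then
    ! Cull non-active images: one team, no NEW_INDEX=, so the processor
    ! chooses the image indices in the resulting team.
    form team (1, active_team, stat=stat)
    if (stat /= 0 .and. stat /= stat_failed_image) &
      error stop "FORM TEAM (cull) failed"

    change team (active_team)
      ! this_image() now refers to the culled team, so the partition is
      ! recomputed before retrying with NEW_INDEX=.
      call partition(team_id, new_idx)
      form team (team_id, new_team, new_index=new_idx, stat=stat)
      ! new_team is a child of active_team here, so any CHANGE TEAM to
      ! new_team has to happen inside this construct.
    end team
  else if (stat /= 0) then
    error stop "FORM TEAM failed"
  end if

contains

  ! Hypothetical partition: odd image indices form team 1, even ones team 2.
  ! (this_image() + 1) / 2 gives consecutive indices 1, 2, ... in both teams.
  subroutine partition(id, idx)
    integer, intent(out) :: id, idx
    id  = merge(1, 2, mod(this_image(), 2) == 1)
    idx = (this_image() + 1) / 2
  end subroutine partition

end program form_team_retry
```

Note that a further image failure between the cull and the retry is still
possible, so the second FORM TEAM can report STAT_FAILED_IMAGE again.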

I'm particularly interested in this issue as I'm implementing a prototype
of OpenCoarrays that includes failed images support for teams.  FWIW, I
have a research poster that advertises a (somewhat dated) prototype
implementation (where FORM TEAM excludes NEW_INDEX=):
https://lib.dr.iastate.edu/cs_conf/49/
(the Docker image, which emits a lot of debugging info, is based on
OpenCoarrays ~2.4.0 + failed images extensions).

Given the current implementation (MPI + ULFM under the hood), it would be
easier to either (1) just not create the team, or (2) exclude non-active
images from new teams created with FORM TEAM (..., NEW_INDEX=...), and
consider NEW_INDEX= to behave (in a processor-dependent way) like the
MPI_Comm_split() "key" value, used simply to determine ordering of the
images in the resulting team.  But I don't think it would be insurmountably
difficult to implement a failed/stopped-images-fill-in-gaps (where
necessary?) approach (though I admittedly haven't thought about it too
hard).
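
To illustrate option (2), here is a small sketch of what treating NEW_INDEX=
like the MPI_Comm_split() "key" would mean for one new team: active images
are ranked by the requested value, ties are broken by parent image index
(as MPI_Comm_split() does with the old rank), and non-active images get no
index. The function name compact_indices and the use of 0 for "not in the
team" are hypothetical conventions for this example, not anything in the
standard or in OpenCoarrays.

```fortran
program key_ordering
  implicit none
  ! Requested NEW_INDEX= values for five images of the parent team;
  ! image 3 is assumed to have failed.
  integer, parameter :: requested(5) = [1, 2, 3, 4, 5]
  logical, parameter :: alive(5) = [.true., .true., .false., .true., .true.]

  ! Active images get indices 1..4 in requested order; image 3 gets 0.
  print '(*(i0,1x))', compact_indices(requested, alive)

contains

  pure function compact_indices(keys, active) result(new_index)
    integer, intent(in) :: keys(:)     ! requested NEW_INDEX= per parent image
    logical, intent(in) :: active(:)   ! .true. if that image is active
    integer :: new_index(size(keys))
    integer :: i, j

    new_index = 0                      ! 0 marks "excluded from the new team"
    do i = 1, size(keys)
      if (.not. active(i)) cycle
      ! Rank = 1 + number of active images that sort before image i.
      new_index(i) = 1
      do j = 1, size(keys)
        if (.not. active(j)) cycle
        if (keys(j) < keys(i) .or. (keys(j) == keys(i) .and. j < i)) &
          new_index(i) = new_index(i) + 1
      end do
    end do
  end function compact_indices

end program key_ordering
```

With this interpretation the image that requested index 4 ends up with
index 3, which is exactly the processor-dependent renumbering at issue.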

--
Nathan


On Thu, May 9, 2019 at 10:13 AM <j3-request at mailman.j3-fortran.org> wrote:

>
> Message: 3
> Date: Thu, 9 May 2019 14:12:57 +0000
> From: Bill Long <longb at cray.com>
> To: General J3 interest list <j3 at mailman.j3-fortran.org>
> Subject: Re: [J3] FORM TEAM statement NEW_INDEX= specifier & failed
>         images
> Message-ID: <A2E9865F-EB94-4DA4-81E0-B8F2E75C6E62 at cray.com>
> Content-Type: text/plain; charset="utf-8"
>
>
> > On May 8, 2019, at 8:07 PM, Malcolm Cohen via J3 <
> j3 at mailman.j3-fortran.org> wrote:
> >
> > <<<
> > I'm not convinced that Jon's example is unreasonable.
> > >>>
> >
> > It's definitively Not Standard Conforming!
>
> That, I think, we can agree.  The way the standard is written, the only
> way for an image to join a new team is to execute FORM TEAM.  NOTE: Exactly
> the same issue under discussion applies to STOPPED images as well as FAILED
> ones.  An image that is stopped will not execute FORM TEAM either.  Stopped
> images are a bit different in that data on stopped images can still be
> accessed. Such accesses are still possible through the image index of the
> stopped image in the parent team.
>
>
> >
> > <<<
> > We currently allow the value of an image index to be larger than the
> number of active images in a team.
> > >>>
> >
> > We currently allow the user to drink hot coffee too, which is about as
> relevant.
> >
> > Seriously, he suggested the image index being larger than NUM_IMAGES().
>
> Only if the failed (and stopped) images are not part of the new teams,
> which they are not by the argument above. NUM_IMAGES() includes failed and
> stopped images in a team.  If we want to require that the sum of the
> num_images() values for the new teams is the same as num_images() for the
> parent [something that I think would be intuitive], then the processor
> could assign failed and stopped images to new teams in a
> processor-dependent way that "fills in" gaps in the image index values if
> NEW_INDEX is specified, or adds them at the end of the image index range if
> there are no gaps (or if NEW_INDEX is not specified).
>
> So the question is what we really want.  Either the failed and stopped
> images of the parent are discarded from the new teams (the current
> wording), or they are included in the new teams.  Which option is
> more useful and less error-prone for the programmer?  I think the
> arguments / examples presented so far suggest that including the stopped
> and failed images is preferred. Others, of course, might have the opposite
> view.  But this is the question that should be discussed, not whether the
> example is conforming to the current standard (since we have a clear answer
> for that already).
>
> Cheers,
> Bill
>
>
>
>
> > That is a horse of a completely different colour.  I assert that this is
> rather likely to lead to disastrous outcomes, like accessing data of
> non-existent images (but image number less than num_images!), or syncing
with such, etc.  It's not that unlikely that the runtime system itself has
> tables sized by the number of images, which could lead to sync all not
> syncing with the images whose numbers are out of range, using random
> garbage instead of network addresses, or indeed seg faulting.
> >
> > <<<
> > Otherwise, we are basically requiring that the user go through an
> exercise to track down which images are failed
> > >>>
> >
> > If only we had an intrinsic that returned the failed image numbers?
> >
> > <<<
> > before executing FORM TEAM and then communicating these boundary image
> index numbers at runtime
> > >>>
> >
> > Indeed, the user had better do this if he cares about his data, as the
> failed images Have Already Lost His Data!  Carefully forming a boundary
> image setup without knowing what data is available is a complete waste of
> time. (He might not even *have* any boundary data left, or all the images
that would have belonged in "team 3", which might be a critical team for
> his algorithm to work, might have failed.)
> >
> > If he doesn't care about the data why bother running the program at all.
> >
> > Or if your concern is how he is going to recover without "going
non-standard", we would need to add that capability. There are two obvious
> possibilities: (1) NEW_INDEX is simply inoperative when an image has
> failed; (2) invalid NEW_INDEX is allowed, at least when images have failed,
but doesn't form a team.  There are many details to be worked out, but we
> could certainly do something about it.  STAT_INVALID_NEW_INDEX would be one
> possibility, but something more lightweight could also work.
> >
> > In principle I support "doing something", as long as we can all agree on
> what the something is, as otherwise there is a race condition anyway in
> between checking for failed images and executing FORM TEAM with NEW_INDEX.
> >
> > Cheers,
> > --
> > ..............Malcolm Cohen, NAG Oxford/Tokyo.
>
> Bill Long
>      longb at cray.com
> Principal Engineer, Fortran Technical Support &   voice:  651-605-9024
> Bioinformatics Software Development                      fax:  651-605-9143
> Cray Inc./ 2131 Lindau Lane/  Suite 1000/  Bloomington, MN  55425
>