[J3] FORM TEAM statement NEW_INDEX= specifier & failed images
Nathan Weeks
weeks at iastate.edu
Tue May 14 19:51:46 EDT 2019
Hi John,
My interpretation of the description of the FAILED_IMAGES result value
(16.9.77 p5) is that a standard-conforming processor need only update
the array of FAILED_IMAGES at image control statements and collective
routines (although I think the standard should be amended to also cover
the case where an image selector's STAT= specifier is assigned the
value STAT_FAILED_IMAGE).
Indeed, that's how my OpenCoarrays fork operated: in the underlying
MPI collective routine implementing an image control statement or
(Fortran) collective routine, an MPI process failure would cause an
MPI error handler to be invoked, which used the ULFM routines
MPIX_Comm_failure_ack()/MPIX_Comm_failure_get_acked() to return a
group of failed processes to be unioned with a group of all MPI
processes known to have failed. This MPI group would be translated
into image indexes in the current team and returned by
FAILED_IMAGES().
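
As a rough illustration of that bookkeeping, here is a minimal
pure-Python model. It is not MPI and not the OpenCoarrays code: sets of
world ranks stand in for MPI groups, and the names FailureTracker,
on_failure_reported and failed_images are hypothetical, chosen only for
this sketch. It assumes a team is described by the ordered list of world
ranks of its members, so that image index i corresponds to position i
(counting from 1).

```python
class FailureTracker:
    """Toy model of the failed-process bookkeeping (no real MPI calls)."""

    def __init__(self):
        # World ranks of every process known to have failed so far.
        self.known_failed = set()

    def on_failure_reported(self, newly_acked):
        # Stands in for the error-handler step: the group of newly
        # acknowledged failures is unioned with the known-failed group.
        self.known_failed |= set(newly_acked)

    def failed_images(self, team_ranks):
        # Stands in for translating the failed group into image indexes
        # of the current team. Image indexes start at 1; failed ranks
        # that are not members of this team are skipped.
        return sorted(
            index
            for index, rank in enumerate(team_ranks, start=1)
            if rank in self.known_failed
        )


tracker = FailureTracker()
current_team = [0, 2, 4, 6, 8]           # world ranks of images 1..5
tracker.on_failure_reported({4})         # first failure notification
tracker.on_failure_reported({4, 8, 3})   # later notification; rank 3 is outside the team
print(tracker.failed_images(current_team))
```

In this model the failed set changes only inside on_failure_reported,
which mirrors the point above: the list is refreshed only when an image
control statement or collective routine actually observes a failure.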
--
Nathan
On Sun, May 12, 2019 at 2:51 PM John Reid <John.Reid at stfc.ac.uk> wrote:
>
> Nathan,
> >
> > I think there's still a problem with the FORM TEAM statement in the
> > program from C.6.8. Suppose the program is executed by 11 images, so 1
> > is intended to be a spare. If image 9 in the initial team fails
> > immediately before it executes the first FORM TEAM statement, then
> > image 10 in the initial team, which executes FORM TEAM with a
> > team-number == 1 and NEW_INDEX == 10 (== me), will have specified a
> > NEW_INDEX= value greater than the number of possible images in the new
> > team. (In general, it appears that if an image whose image index in
> > the initial team is > 1 and < images_used fails in the "setup" DO
> > construct before the FORM TEAM statement, a similar situation can
> > occur).
>
> Yes, this has not been allowed for but it is a low-probability event.
> Image 9 was active when image 1 referenced FAILED_IMAGES. Nevertheless,
> we should cover the case. We seem to need to test status after the FORM
> TEAM statement.
> >
> > Additionally, if this is an error condition for FORM TEAM, per 11.6.9
> > p5 ("If an error condition other than detection of a failed image
> > occurs, the team variable becomes undefined"), the simulation_team
> > team variable would be undefined---and I assume execution of a
> > subsequent CHANGE TEAM statement would result in undefined behavior?
>
> Yes, we need a test after the FORM TEAM statement.
>
> It looks as if we need to set up an interp request.
>
> John.
>
>
> >
> > Best,
> >
> > --
> > Nathan
> >
> > On Sun, May 12, 2019 at 7:38 AM John Reid <John.Reid at stfc.ac.uk> wrote:
> >>
> >> Nathan,
> >>
> >> Nathan Weeks via J3 wrote:
> >>> Hi all,
> >>>
> >>> Thanks for the helpful clarification (and identifying where the standard
> >>> is unclear). I'll note that this issue impacts the first failed-images
> >>> example in section C.6.8 of the Fortran 2018 standard, so there is
> >>> motivation for clarification in the standard itself.
> >>
> >> I think we were a bit hasty in choosing to assign failed images to new
> >> teams in a processor-dependent manner. We definitely want the C.6.8
> >> example to work. It was always a design objective that following an
> >> image failure, it would be possible to form a new team of active images
> >> and continue the calculation there. We don't want any failed images in
> >> the team because we want to be able to test for newly failed images.
> >>
> >> Anyway, 11.1.5.2, para 5 says
> >>
> >> 5 Successful execution of a CHANGE TEAM statement performs an implicit
> >> synchronization of all images of the new team that is identified by
> >> team-value. All active images of the new team shall execute the same
> >> CHANGE TEAM statement. On each image of the new team, execution of the
> >> segment following the CHANGE TEAM statement is delayed until all other
> >> images of that team have executed the same statement the same number of
> >> times in the original team.
> >>
> >> It is clearly expected that all images of the team are active. The
> >> adjective "active" is not used in the first and third sentences. It
> >> should be deleted from the second, for consistency.
> >>
> >> To go back to your question:
> >>
> >> "What happens in the case where an image specifies both NEW_INDEX= and
> >> STAT= in a FORM TEAM statement, and the image index specified for
> >> NEW_INDEX= turns out to be greater than the number of images in the
> >> new team due to image failure during the execution of FORM TEAM?",
> >>
> >> I think this is an error condition. Note that in C.6.8 the NEW_INDEX
> >> values are carefully set.
> >>
> >> Cheers,
> >>
> >> John.