(j3.2006) (SC22WG5.5072) Ballot on N1983
John Reid
John.Reid
Sun Aug 11 14:01:23 EDT 2013
Please answer the following question "Is N1983 ready for forwarding to
SC22 as the DTS?" in one of these ways.
3) No, for the following reasons.
1. Reasons for no vote
1.1 Continued execution in the presence of failed images
It is not clear how execution is intended to continue in the presence of
failed images.
For most calculations, the failure of one image leads to the failure of the
whole calculation. To recover from this, the program probably needs to revert
to a previous "check point" and continue the execution from there using
images that have not failed. One possibility is not to use all the
images for the calculation but keep a few "spares". Execution is within a
CHANGE TEAM construct, with spares in a separate team and idle. When an image
fails, the CHANGE TEAM construct is left, a new team is formed by
substituting a spare for the failed image, the check point data are recovered,
and the CHANGE TEAM construct is re-entered. This avoids any need in the main
code for remapping of data - it only has to detect failed images and
exit the construct if there are any.
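The spare-image scheme above can be modelled outside Fortran. What follows
is a minimal sketch in Python, not TS syntax: the names form_team and run,
the failure-injection table fail_at, and the checkpoint interval are all
hypothetical and chosen only for illustration. The inner loop stands in for
the body of the CHANGE TEAM construct; leaving it corresponds to exiting
the construct when a failed image is detected.

```python
def form_team(active, spares, failed):
    """Return (new_team, remaining_spares), substituting a spare for each failed image."""
    spares = list(spares)
    new_team = []
    for image in active:
        if image in failed:
            if not spares:
                raise RuntimeError("no spare image left")
            new_team.append(spares.pop(0))  # the spare takes the failed image's slot
        else:
            new_team.append(image)
    return new_team, spares


def run(active, spares, n_steps, fail_at, checkpoint_every=2):
    """Model of checkpoint/restart. fail_at maps a step number to the image
    that fails while that step is attempted (hypothetical failure injection)."""
    pending = dict(fail_at)
    checkpoint = 0  # last step whose state was saved
    step = 0
    log = []
    while step < n_steps:
        failed = set()
        # Stand-in for the construct body: work until done or a failure is seen.
        while step < n_steps and not failed:
            if step in pending:
                failed.add(pending.pop(step))
                break
            step += 1
            if step % checkpoint_every == 0:
                checkpoint = step
        if failed:
            # Construct left: form the new team and roll back to the check point.
            active, spares = form_team(active, spares, failed)
            log.append((step, sorted(failed), checkpoint))
            step = checkpoint
    return active, spares, log


team, spares_left, log = run([1, 2, 3, 4], [5, 6], 6, {3: 2})
```

In this model the substituted spare occupies the position of the failed
image, so the main loop never remaps data; it only detects the failure and
leaves.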
Some calculations are "massively parallel". Most of the work is done
completely independently on separate images. Perhaps one image acts as
"master" handing out tasks and collecting results. As long as the master does
not fail, the calculation can continue happily with failed images. The master
sends the work that it gave to a failed image to the next image that is free.
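A minimal sketch of this master/worker pattern, again as a Python model
rather than Fortran. The function name master, the table fails_on (which
worker fails on which task), and the squaring "work" are hypothetical; a
real program would detect the failure through a STAT= value rather than a
lookup table.

```python
from collections import deque


def master(tasks, workers, fails_on):
    """Hand out tasks; if a worker fails, give its task to the next free worker.

    fails_on maps a worker to the task on which it fails (hypothetical
    failure injection for the purpose of the model)."""
    queue = deque(tasks)
    free = deque(workers)
    results = {}
    while queue:
        if not free:
            raise RuntimeError("all worker images have failed")
        task = queue.popleft()
        worker = free.popleft()
        if fails_on.get(worker) == task:
            # The worker has failed: it is not returned to the free list,
            # and its task goes back to the front of the queue.
            queue.appendleft(task)
            continue
        results[task] = (worker, task * task)  # stand-in for the real work
        free.append(worker)
    return results


results = master([1, 2, 3, 4], [2, 3], {2: 1})
```

Here image 2 fails on its first task, and the calculation completes on
image 3 alone.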
I will assume that we want to cater for both situations. Even in the first
case, the images of the parent team need to execute in a team that has failed
images while they form the new team and recover the check point data.
The collective procedures are not massively parallel. They should surely fail
if any of the images of the team have failed. However, the last paragraph of
page 15 says "If an image has failed, but no other error condition occurred,
the argument is assigned the value STAT_FAILED_IMAGE.". If this behaviour is
retained, the effect of failed images on the result needs to be described.
The effect of SYNC ALL and SYNC TEAM in the presence of failed images should
be to synchronize the images that have not failed. This should be stated.
I am not sure about SYNC IMAGES.
The FORM SUBTEAM statement should work in the presence of failed images. I am
inclined to think that if a subteam has a nonfailed image, all its images
should be nonfailed.
The CHANGE TEAM statement should work in the presence of failed images.
The ALLOCATE and DEALLOCATE statements should work in the presence of
failed images.
For locks and events, at most one other image is involved and its failure
must be regarded as an error.
1.2 Cosubscripts of arrays declared in ancestor teams
New syntax (R624) was added during the Delft meeting to allow a coarray
to be addressed by the cosubscripts of a team other than the current team.
It is not restricted to be the current team or an ancestor, but I think
that was the intention. Because there is no means of specifying the
mapping between cosubscripts when teams change, the new syntax should be
restricted to refer to the team in which the coarray was declared.
Alternatively, a mechanism for specifying the mapping should be added.
I suggest that it should be as for the association of a dummy coarray with
the corresponding actual coarray. A possibility is the statement
   new cosubscripts (<cosubscript-decl-list>)
where
   <cosubscript-decl> is <coarray-name> []
                         <lbracket> <explicit-coshape-spec> <rbracket>
2. Other comments
2.1 FORM SUBTEAM
What happens if NEW_INDEX is absent? Is the mapping from parent image index
to child image index processor dependent? Or is THIS_IMAGE(DISTANCE=1)
monotonically increasing?
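To make the second alternative concrete, here is a small Python model of a
monotonic mapping. It is only one possible answer to the question, not what
N1983 specifies; the function name form_subteams and its argument are
hypothetical. Each parent image supplies a subteam id, and child indices
within a subteam are assigned in order of increasing parent image index.

```python
def form_subteams(subteam_ids):
    """subteam_ids[i] is the subteam id given by parent image i+1.

    Returns {parent_index: (subteam_id, child_index)} with child indices
    assigned in increasing order of parent image index."""
    next_index = {}
    mapping = {}
    for parent_index, sid in enumerate(subteam_ids, start=1):
        next_index[sid] = next_index.get(sid, 0) + 1
        mapping[parent_index] = (sid, next_index[sid])
    return mapping


mapping = form_subteams([1, 2, 1, 2, 1])
```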
2.2 REDUCE
For CO_MAX, CO_MIN, and CO_SUM, there is a corresponding transformational
function (MAXVAL, MINVAL, and SUM, respectively), so it is easy to write code
for the common case where the max, min, or sum of all the elements of the
arrays on all the images is wanted. We need to add REDUCE to play the same
role for CO_REDUCE.
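The intended pairing can be illustrated with a Python model. The names
co_reduce and reduce_all are hypothetical stand-ins: co_reduce combines one
value per image, as the collective would, and the local functools.reduce
call plays the part of the proposed REDUCE function. The operation is
assumed to be associative and commutative.

```python
from functools import reduce


def co_reduce(per_image_values, op):
    """Model of the collective: combine one value from each image."""
    return reduce(op, per_image_values)


def reduce_all(per_image_arrays, op):
    """Reduce all the elements of the arrays on all the images."""
    # Local step on each image: the role the proposed REDUCE would play.
    local = [reduce(op, array) for array in per_image_arrays]
    # Collective step across images.
    return co_reduce(local, op)


product = reduce_all([[1, 2], [3, 4], [5, 6]], lambda a, b: a * b)
```

Without the local step, the user has to write the loop over array elements
by hand for every user-defined operation.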
2.3 Error condition for a collective
The last paragraph of page 15 says that if an error condition occurs and STAT is
present, the effect is as if SYNC MEMORY were executed. This seems wrong
because RESULT has intent out so one expects it to become undefined.
Do we expect SOURCE to be used by the implementation for workspace
when RESULT is absent? If so, an error condition should cause SOURCE to
become undefined.
2.4 Examples
More examples are needed, particularly of continued execution in the presence
of failed images.
3. Edits
[15:22] After "the beginning" add "to the end".