(j3.2006) (SC22WG5.5072) Ballot on N1983

John Reid John.Reid
Sun Aug 11 14:01:23 EDT 2013


Please answer the following question "Is N1983 ready for forwarding to 
SC22 as the DTS?" in one of these ways. 

3) No, for the following reasons.

1. Reasons for no vote

1.1 Continued execution in the presence of failed images

It is not clear how execution is intended to continue in the presence of 
failed images. 

For most calculations, the failure of one image leads to the failure of the 
whole calculation. To recover from this, the program probably needs to revert 
to a previous "check point" and continue the execution from there using
images that have not failed. One possibility is not to use all the
images for the calculation but keep a few "spares". Execution is within a
CHANGE TEAM construct, with spares in a separate team and idle. When an image 
fails, the CHANGE TEAM construct is left, a new team is formed by
substituting a spare for the failed image, the check point data are recovered, 
and the CHANGE TEAM construct is re-entered. This avoids any need in the main 
code for remapping of data - it only has to detect failed images and
exit the construct if there are any. 
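
The spare-image scheme above can be sketched as a toy model in Python. This 
is not Fortran and not a proposal for the TS: images are plain integers, and 
the names recover_team, active, spares and failed are hypothetical, chosen 
only for illustration. The sketch assumes a spare takes over the team 
position of the image it replaces.

```python
def recover_team(active, spares, failed):
    """Return (new_active, remaining_spares) after substituting spares.

    active: image indices doing the calculation (the working team)
    spares: idle image indices held in reserve
    failed: set of image indices detected as failed
    """
    remaining = [s for s in spares if s not in failed]  # a spare may fail too
    new_active = []
    for image in active:
        if image in failed:
            if not remaining:
                raise RuntimeError("no spare left to replace image %d" % image)
            # The spare takes over the failed image's position in the team.
            new_active.append(remaining.pop(0))
        else:
            new_active.append(image)
    return new_active, remaining


# Images 1..6 compute, images 7 and 8 are spares; image 3 fails.
team, spares = recover_team([1, 2, 3, 4, 5, 6], [7, 8], {3})
```

Because the spare occupies the position of the failed image, positions 
within the working team are unchanged, which is why the main code would need 
no remapping of data once the check point data are restored. 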

Some calculations are "massively parallel". Most of the work is done 
completely independently on separate images. Perhaps one image acts as 
"master" handing out tasks and collecting results. As long as the master does 
not fail, the calculation can continue happily with failed images. The master 
sends the work that it gave to a failed image to the next image that is free. 
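
The master's bookkeeping for this case might look like the following 
sequential Python simulation. It is a sketch under stated assumptions: 
run_master and the fails_on failure model are hypothetical, and a real 
program would detect failure through STAT values rather than a lookup table. 

```python
from collections import deque


def run_master(tasks, workers, fails_on):
    """Simulate a master handing out tasks; return {task: worker that did it}.

    fails_on maps a worker to the task during which it fails
    (a hypothetical failure model, for illustration only).
    """
    pending = deque(tasks)
    free = deque(workers)
    done = {}
    while pending:
        if not free:
            raise RuntimeError("all worker images have failed")
        task = pending.popleft()
        worker = free.popleft()
        if fails_on.get(worker) == task:
            pending.appendleft(task)  # give the lost task to the next free image
            continue                  # the failed worker is never reused
        done[task] = worker
        free.append(worker)           # worker is free again
    return done


result = run_master(["a", "b", "c", "d"], [2, 3, 4], {3: "b"})
```

Here image 3 fails while working on task "b", so the master hands "b" to 
image 4 and the calculation completes without image 3. 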

I will assume that we want to cater for both situations. Even in the first 
case, execution must proceed in the parent team, which has failed images, 
while the new team is formed and the check-point data are recovered. 

The collective procedures are not massively parallel. They should surely fail 
if any of the images of the team have failed. However, the last paragraph of 
page 15 says "If an image has failed, but no other error condition occurred, 
the argument is assigned the value STAT_FAILED_IMAGE.". If this behaviour is 
retained, the effect of failed images on the result needs to be described.

The effect of SYNC ALL and SYNC TEAM in the presence of failed images should 
be to synchronize the images that have not failed. This should be stated. 
I am not sure about SYNC IMAGES. 

The FORM SUBTEAM statement should work in the presence of failed images. I am
inclined to think that if a subteam has a nonfailed image, all its images 
should be nonfailed.  

The CHANGE TEAM statement should work in the presence of failed images.

The ALLOCATE and DEALLOCATE statements should work in the presence of 
failed images. 

For locks and events, at most one other image is involved and its failure
must be regarded as an error. 

1.2 Cosubscripts of arrays declared in ancestor teams

New syntax (R624) was added during the Delft meeting to allow a coarray
to be addressed by the cosubscripts of a team other than the current team. 
It is not restricted to be the current team or an ancestor, but I think
that was the intention. Because there is no means of specifying the 
mapping between cosubscripts when teams change, the new syntax should be  
restricted to refer to the team in which the coarray was declared.
Alternatively, a mechanism for specifying the mapping should be added.
I suggest that it should be as for the association of a dummy coarray with
the corresponding actual coarray. A possibility is the statement
   new cosubscripts (<cosubscript-decl-list>) 
where
   <cosubscript-decl> is <coarray-name>
                         <lbracket> <explicit-coshape-spec> <rbracket>
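
To illustrate what such a mapping would have to define, here is the usual 
column-major conversion from cosubscripts to an image index, written in 
Python. That the suggested statement would behave this way is an assumption; 
the function image_index and its arguments are hypothetical, and no bounds 
checking is done. 

```python
def image_index(cosubs, lower, upper):
    """Image index (1-based) for cosubscripts, in column-major order.

    lower: lower cobound of every codimension
    upper: upper cobound of every codimension except the last, which is '*'
    """
    if len(upper) != len(lower) - 1 or len(cosubs) != len(lower):
        raise ValueError("corank mismatch")
    index, stride = 1, 1
    for k, (c, lo) in enumerate(zip(cosubs, lower)):
        index += (c - lo) * stride
        if k < len(upper):
            stride *= upper[k] - lo + 1  # extent of codimension k
    return index


# The same image index reached through two different coshape declarations:
# [2,*] with cosubscripts (2,3), and [0:1,0:*] with cosubscripts (1,2).
first = image_index((2, 3), lower=(1, 1), upper=(2,))
second = image_index((1, 2), lower=(0, 0), upper=(1,))
```

Both calls give image index 6, which shows that a cosubscript list is 
meaningful only together with the coshape it is interpreted against. 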


2. Other comments

2.1 FORM SUBTEAM
What happens if NEW_INDEX is absent? Is the mapping from parent image index 
to child image index processor dependent? Or is THIS_IMAGE(DISTANCE=1) 
monotonically increasing? 
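
The monotonic alternative can be made concrete with a small Python sketch. 
It assumes, purely for illustration, that child image indices are handed out 
in increasing order of parent image index; form_subteams is a hypothetical 
name and this is not a statement of what N1983 specifies. 

```python
def form_subteams(subteam_ids):
    """Map parent image index -> (subteam id, child image index).

    subteam_ids[i] is the subteam id chosen by parent image i+1.
    Child indices are assigned in increasing parent-index order
    (the monotonic option); a processor-dependent order is the alternative.
    """
    next_index = {}
    mapping = {}
    for parent, team_id in enumerate(subteam_ids, start=1):
        next_index[team_id] = next_index.get(team_id, 0) + 1
        mapping[parent] = (team_id, next_index[team_id])
    return mapping


# Six parent images split into two subteams, odd and even.
mapping = form_subteams([1, 2, 1, 2, 1, 2])
```

With this rule, parent images 1, 3 and 5 become images 1, 2 and 3 of 
subteam 1, so the ordering within each subteam follows the parent ordering. 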

2.2 REDUCE
For CO_MAX, CO_MIN, and CO_SUM, there is a corresponding transformational
function so it is easy to write code for the common case where the max, min, 
or sum of all the elements of the arrays on all the images is wanted. We
need to add REDUCE to play the same role for CO_REDUCE. 
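
The intended pairing can be sketched in Python, with one list standing in 
for the array on each image. The functions local_reduce and co_reduce are 
hypothetical stand-ins for the proposed REDUCE and for CO_REDUCE applied to 
a scalar; they model only the arithmetic, not the in-place, cross-image 
behaviour of the real collective. 

```python
from functools import reduce
from operator import mul


def local_reduce(array, operation):
    """Stand-in for the proposed REDUCE: combine all elements of one array."""
    return reduce(operation, array)


def co_reduce(values_per_image, operation):
    """Stand-in for CO_REDUCE on a scalar: combine one value from each image."""
    return reduce(operation, values_per_image)


arrays = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]  # one array per image
# Product of every element on every image: reduce locally, then across images.
total = co_reduce([local_reduce(a, mul) for a in arrays], mul)
```

This mirrors the way SUM combines with CO_SUM: the local reduction yields 
one scalar per image and the collective combines those scalars. 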

2.3 Error condition for a collective
The last paragraph of page 15 says that if an error condition occurs and STAT is 
present, the effect is as if SYNC MEMORY were executed. This seems wrong 
because RESULT has intent out so one expects it to become undefined. 
Do we expect SOURCE to be used by the implementation for workspace
when RESULT is absent? If so, an error condition should cause SOURCE to 
become undefined. 

2.4 Examples
More examples are needed, particularly of continued execution in the presence 
of failed images. 

3. Edits
[15:22] After "the beginning" add "to the end".










