[J3] FORM TEAM statement NEW_INDEX= specifier & failed images

Bill Long longb at cray.com
Thu May 9 10:12:57 EDT 2019


> On May 8, 2019, at 8:07 PM, Malcolm Cohen via J3 <j3 at mailman.j3-fortran.org> wrote:
> 
> <<< 
> I’m not convinced that Jon’s example is unreasonable.
> >>> 
>  
> It’s definitively Not Standard Conforming!

On that, I think, we can agree. The way the standard is written, the only way for an image to join a new team is to execute FORM TEAM. Note that exactly the same issue applies to STOPPED images as well as FAILED ones: a stopped image will not execute FORM TEAM either. Stopped images differ a bit in that their data can still be accessed; such accesses remain possible through the image index of the stopped image in the parent team.


>  
> <<< 
> We currently allow the value of an image index to be larger than the number of active images in a team.
> >>> 
>  
> We currently allow the user to drink hot coffee too, which is about as relevant.
>  
> Seriously, he suggested the image index being larger than NUM_IMAGES().  

Only if the failed (and stopped) images are not part of the new teams, which they are not by the argument above. NUM_IMAGES() includes failed and stopped images in a team. If we want to require that the sum of the num_images() values for the new teams be the same as num_images() for the parent [something that I think would be intuitive], then the processor could assign failed and stopped images to new teams in a processor-dependent way that “fills in” gaps in the image index values if NEW_INDEX is specified, or adds them at the end of the image index range if there are no gaps (or if NEW_INDEX is not specified).
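
To make the scenario concrete, here is a minimal sketch of the kind of code under discussion. It is not Jon's actual example; the block split into two teams and the names half, split, idx, and istat are assumptions chosen purely for illustration.

```fortran
program form_team_new_index_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none

  type(team_type) :: half            ! hypothetical name for the new team
  integer :: me, n, split, team_no, idx, istat
  integer, allocatable :: failed(:)

  me = this_image()
  n  = num_images()                  ! counts failed and stopped images too
  split = (n + 1) / 2                ! assumed block split: 1..split, split+1..n

  ! Each image derives its team number and requested index from its
  ! position in the parent team, without consulting any other image.
  if (me <= split) then
     team_no = 1
     idx = me
  else
     team_no = 2
     idx = me - split
  end if

  ! Images this image currently knows to have failed in the current team.
  failed = failed_images()
  if (size(failed) > 0) print *, 'image', me, 'sees failed images:', failed

  ! With no failed or stopped images, every idx lies in 1..(size of its
  ! new team).  If, say, image 1 has failed, team 1 is formed from
  ! split-1 images under the current wording, yet image "split" still
  ! requests NEW_INDEX=split, which is larger than that team's size.
  form team (team_no, half, new_index=idx, stat=istat)
  if (istat /= 0) print *, 'image', me, ': FORM TEAM returned stat =', istat
end program form_team_new_index_sketch
```

The point of the sketch is only that idx is computed from the parent's num_images(), so whether the requested values stay in range depends entirely on whether failed and stopped images count as members of the new teams.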

So the question is what we really want: either the failed and stopped images of the parent are discarded from the new teams (the current wording), or they are included in the new teams. Which option is more useful and less error-prone for the programmer? I think the arguments and examples presented so far suggest that including the stopped and failed images is preferred. Others, of course, might have the opposite view. But this is the question that should be discussed, not whether the example conforms to the current standard (since we have a clear answer for that already).

Cheers,
Bill




> That is a horse of a completely different colour.  I assert that this is rather likely to lead to disastrous outcomes, like accessing data of non-existent images (but image number less than num_images!), or syncing with such, etc.  It’s not that unlikely that the runtime system itself has tables sized by the number of images, which could lead to sync all not syncing with the images whose numbers are out of range, using random garbage instead of network addresses, or indeed seg faulting.
>  
> <<< 
> Otherwise, we are basically requiring that the user go through an exercise to track down which images are failed
> >>> 
>  
> If only we had an intrinsic that returned the failed image numbers…
>  
> <<< 
> before executing FORM TEAM and then communicating these boundary image index numbers at runtime
> >>> 
>  
> Indeed, the user had better do this if he cares about his data, as the failed images Have Already Lost His Data!  Carefully forming a boundary image setup without knowing what data is available is a complete waste of time. (He might not even *have* any boundary data left, or all the images that would have belonged in “team 3”, which might be a critical team for his algorithm to work, might have failed.)
>  
> If he doesn’t care about the data why bother running the program at all.
>  
> Or if your concern is how he is going to recover without “going non-standard”, we would need to add that capability. There are two obvious possibilities: (1) NEW_INDEX is simply inoperative when an image has failed; (2) invalid NEW_INDEX is allowed, at least when images have failed, but doesn’t form a team.  There are many details to be worked out, but we could certainly do something about it.  STAT_INVALID_NEW_INDEX would be one possibility, but something more lightweight could also work.
>  
> In principle I support “doing something”, as long as we can all agree on what the something is, as otherwise there is a race condition anyway in between checking for failed images and executing FORM TEAM with NEW_INDEX.
>  
> Cheers,
> -- 
> ..............Malcolm Cohen, NAG Oxford/Tokyo.

Bill Long                                                                       longb at cray.com
Principal Engineer, Fortran Technical Support &   voice:  651-605-9024
Bioinformatics Software Development                      fax:  651-605-9143
Cray Inc./ 2131 Lindau Lane/  Suite 1000/  Bloomington, MN  55425



