[J3] FORM TEAM statement NEW_INDEX= specifier & failed images

Bill Long longb at cray.com
Wed May 8 18:27:18 EDT 2019


I’m not convinced that Jon’s example is unreasonable.  We currently allow the value of an image index to be larger than the number of active images in a team.  For example, if you start with 10 images, they are initially numbered 1,2,3,4,5,6,7,8,9,10.  If image 8 fails, then the remaining images have image index numbers of 1,2,3,4,5,6,7,9,10, even though there are only 9 active images.  Note that num_images() still returns 10, so the user is not likely to be surprised if there is an image index value larger than the number of active images. 
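
A minimal sketch of how a program could see that gap, assuming the processor supports the Fortran 2018 FAILED_IMAGES intrinsic and failed-image detection (the program and variable names are made up for illustration):

```fortran
program count_failed
  implicit none
  integer :: n_total, n_failed

  ! Per the description above, num_images() keeps counting failed images.
  n_total  = num_images()
  ! failed_images() returns the indices of images known to have failed.
  n_failed = size(failed_images())

  if (this_image() == 1) then
    print '(a,i0,a,i0)', 'num_images() = ', n_total, &
                         ', known failed images = ', n_failed
  end if
end program count_failed
```

With 10 images and image 8 failed, one would expect the first number to stay at 10 while the second becomes 1, though when a failure becomes known to a given image is processor dependent.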

In most cases, I’m guessing that users will not specify image index values using the NEW_INDEX= specifier in FORM TEAM. Rather they would just take whatever the processor gives them.  However, there are reasonable cases where using NEW_INDEX= might be desired.  For example, you start with an initial team where each image has data relevant to a particular area of the earth (climate or weather model, for example).  You want to split the images into LAND and OCEAN teams.  But it is useful to know that specific images represent the beach areas so that boundary conditions can be exchanged across teams.  Suppose you start with 20 images and you want 1..10 to go to the ocean (where they are numbered 1..10) and 11..20 to go to the land (where they are numbered 1..10).  By using NEW_INDEX= you can be sure that land image 1 and ocean image 10 are the ones communicating.  You want these numbers to be correct, even if image 8 in the parent team has failed.  Otherwise, we are basically requiring that the user go through an exercise to track down which images have failed before executing FORM TEAM and then communicate these boundary image index numbers at runtime.
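
The land/ocean split could be sketched roughly as follows. The names ocean, land, per_team, surface_team, team_no and new_idx are hypothetical, the sketch assumes exactly 20 images at startup, and whether it remains conforming when a parent image such as 8 has failed is exactly the point under discussion:

```fortran
program land_ocean_teams
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none

  ! Hypothetical names and sizes chosen for this sketch only.
  integer, parameter :: ocean = 1, land = 2
  integer, parameter :: per_team = 10
  type(team_type) :: surface_team
  integer :: me, team_no, new_idx

  if (num_images() /= 2*per_team) error stop 'this sketch assumes 20 images'

  me = this_image()
  if (me <= per_team) then
    team_no = ocean
    new_idx = me               ! parent images 1..10  -> ocean images 1..10
  else
    team_no = land
    new_idx = me - per_team    ! parent images 11..20 -> land images 1..10
  end if

  ! Each image chooses its own index in the new team.
  form team (team_no, surface_team, new_index=new_idx)

  change team (surface_team)
    ! Inside the construct, this_image() is the index within the new team.
    if (team_no == ocean .and. this_image() == per_team) &
      print *, 'ocean beach image, parent index ', me
    if (team_no == land .and. this_image() == 1) &
      print *, 'land beach image, parent index ', me
  end team
end program land_ocean_teams
```

Because every image computes its new index from its parent index alone, the beach images are known without any runtime exchange of index numbers.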

Cheers,
Bill


> On May 8, 2019, at 3:17 AM, Malcolm Cohen via J3 <j3 at mailman.j3-fortran.org> wrote:
> 
> Actually it’s even simpler than that. If only nine images execute the FORM TEAM statement, whether that’s because there were originally 10 and one failed, or whether there were only nine images to start with, and one of them specifies NEW_INDEX=10, anything can happen.  You say it “will be assigned” the image index of 10, but since it’s already left the standard it could just as well set fire to your underpants.
>  
> Non-standard = non-standard.  The rules in 11.6.9p4 are violated, end of “the standard” story.
>  
> A “checking” implementation might detect and report this as an error, maintaining the rest of the Fortran semantics, but technically the wheels have already fallen off.
>  
> Similarly, the assertion about what is assigned to STAT= if that rule violation is not detected is unwarranted.  It might assign STAT_FAILED_IMAGE to it, or it might buy you a new cup of coffee.  One might *hope* that the implementation does something “sensible”, but (a) that is just a pious hope; (b) reasonable people may differ as to what is “sensible”.  Personally I might lean towards “a new cup of coffee” being more sensible for this particular rule violation (or some stronger drink, seeing as how image indices outside the range 1-num_images could well lead to deadlocks, seg faults, or just plain incomprehensible behaviour).
>  
> Cheers,
> -- 
> ..............Malcolm Cohen, NAG Oxford/Tokyo.
>  
> From: J3 <j3-bounces at mailman.j3-fortran.org> On Behalf Of Steidel, Jon L via J3
> Sent: Wednesday, May 8, 2019 1:01 AM
> To: General J3 interest list <j3 at mailman.j3-fortran.org>
> Cc: Steidel, Jon L <jon.l.steidel at intel.com>; Malcolm Cohen <malcolm at nag-j.co.jp>
> Subject: Re: [J3] FORM TEAM statement NEW_INDEX= specifier & failed images
>  
> Malcolm said:
>  
> This cannot happen.  A failed image does not participate in program execution so cannot specify a team number.
>  
> Say we have 10 images, and NEW_INDEX=THIS_IMAGE(), and we are creating a single team of 10 images (just for simplicity’s sake).  On the new team, each image will have the same image index as on the parent team.  Prior to executing the FORM TEAM statement, image 8 fails.  Nine images then execute the FORM TEAM statement.  Images 9 and 10 will be assigned the image indices 9 and 10, respectively.
>  
> If the implementation checks for out of range image indices in a FORM TEAM statement, image 10 may detect that its NEW_INDEX is out of range, and if STAT= is not specified, image termination may be initiated (depending on whether the implementation views this as a fatal error).  If STAT= is specified, by the rules of 11.6.11 p5, stat-var would be assigned a positive value different from STAT_STOPPED_IMAGE and STAT_FAILED_IMAGE.
>  
> If the implementation does not detect out of range image indices, and if STAT= is specified and no other error condition is detected, stat-var would then be assigned the value STAT_FAILED_IMAGE.
>  
> -jon
> From: J3 [mailto:j3-bounces at mailman.j3-fortran.org] On Behalf Of Malcolm Cohen via J3
> Sent: Monday, May 6, 2019 8:55 PM
> To: 'General J3 interest list' <j3 at mailman.j3-fortran.org>
> Cc: Malcolm Cohen <malcolm at nag-j.co.jp>
> Subject: Re: [J3] FORM TEAM statement NEW_INDEX= specifier & failed images
>  
> <<< 
> What happens in the case where an image specifies both NEW_INDEX= and
> STAT= in a FORM TEAM statement, and the image index specified for
> NEW_INDEX= turns out to be greater than the number of images in the
> new team due to image failure during the execution of FORM TEAM?
> >>> 
>  
> This cannot happen.  A failed image does not participate in program execution so cannot specify a team number.
>  
> FORM TEAM is not a multi-step operation with exposed internal failure modes, so either the effect is as if the image failed “immediately before FORM TEAM” (in which case STAT_FAILED_IMAGE is returned, and the NEW_INDEX= values must match the number of active images in the new team), or the effect is as if the image failed “immediately after FORM TEAM”, in which case there is no error (STAT==0), it’s just that some of the images in the new team failed very quickly afterwards.
>  
> This is not terribly explicit, but I don’t see any other way to interpret the text that’s actually there.  For there to be weird effects with exposed internal failure modes *we’d have to have allowed that* and we did not.
>  
> Cheers,
> -- 
> ..............Malcolm Cohen, NAG Oxford/Tokyo.
>  
>  
> 

Bill Long                                                                       longb at cray.com
Principal Engineer, Fortran Technical Support &   voice:  651-605-9024
Bioinformatics Software Development                      fax:  651-605-9143
Cray Inc./ 2131 Lindau Lane/  Suite 1000/  Bloomington, MN  55425



