(j3.2006) (SC22WG5.4933) WG5 ballot on first draft TS 18508, Additional Parallel Features in Fortran

Bill Long longb
Thu Mar 14 09:06:50 EDT 2013



On 3/13/13 2:43 PM, N.M. Maclaren wrote:
> Image Failure
> -------------
>
>      7.1) This is not a minor addition.  No language has ever managed to
> standardise recovery of an application from general system-generated
> errors or infrastructure failure, and even POSIX does not attempt it.
> There are fundamental reasons why this should not be attempted in a
> portable language.


I agree that this is not minor. We would not have included the 
capability if the issue were not so urgent and important. I fear that 
the explanation of the feature was not clear enough.

What is proposed is very similar to the way we treat I/O errors.  There 
is a mechanism for notification of a problem (STAT=,  like I/O) and a 
way to identify where the error occurred (the index values of the failed
images; the I/O unit number is already available to users).  Unlike I/O,
where we have singled out some failure modes (end-of-file, for example), 
we did not specify particular modes of failure for images. In current 
experience, it is almost always a non-recoverable memory error, but I 
think we should wait for more data before being more specific.   The 
current spec is intentionally minimal.
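
To make the I/O analogy concrete, here is a minimal sketch. It assumes
the syntax later standardized in Fortran 2018 (STAT_FAILED_IMAGE and
FAILED_IMAGES), which may differ in detail from the draft TS under
ballot; the program and variable names are illustrative only.

```fortran
program stat_analogy
  use, intrinsic :: iso_fortran_env, only: iostat_end, stat_failed_image
  implicit none
  integer :: ios, st, u, x

  ! I/O: the status comes back through IOSTAT=, and the unit number
  ! already tells the user where the problem occurred.
  open (newunit=u, status='scratch', action='readwrite')
  rewind (u)
  read (u, *, iostat=ios) x          ! empty file, so end-of-file
  if (ios == iostat_end) print *, 'end of file on unit', u
  close (u)

  ! Images: the status comes back through STAT=, and FAILED_IMAGES()
  ! (Fortran 2018 name, assumed here) reports which images failed.
  sync all (stat=st)
  if (st == stat_failed_image) print *, 'failed images:', failed_images()
end program stat_analogy
```

In both halves the program is notified rather than killed, and it is
left to the user to decide what to do next.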

The recovery aspect is almost secondary.  In the case of I/O, there is 
no general recovery option either.  The user may decide that the file 
that triggered an error was not that important, and the program can 
continue. Or they might intentionally read past the end of the file as a 
mechanism for stopping a loop, and have "recovery" implicit in the code. 
  Or in the case of a write failure, perhaps writing to a different file 
is an option.  In the worst case, writing out some check-point data and 
aborting is still better than an immediate abort.   For the failed image 
case, the main option is to form a new team that omits the failed
images and to continue after changing to the new team.  The facility is
included as part of the TS because this is the first time we had the 
capability of changing the execution team.   It is likely that in many 
cases the choice will be the same as for I/O failure - to write out 
check-point data and abort. This is still a much better option than to 
be killed without any option to do something.
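
The recovery path described above might look roughly like the sketch
below. It again assumes Fortran 2018 syntax (FORM TEAM, CHANGE TEAM,
STAT_FAILED_IMAGE), which may not match the draft TS exactly, and it
assumes that FORM TEAM with STAT= still forms the team on the active
images when it reports a failed image. The team number 1 and the name
"survivors" are arbitrary choices for illustration.

```fortran
program reform_team
  use, intrinsic :: iso_fortran_env, only: team_type, stat_failed_image
  implicit none
  type(team_type) :: survivors
  integer :: st

  sync all (stat=st)
  if (st == stat_failed_image) then
     ! Every image that is still active asks to join team number 1;
     ! failed images cannot take part, so the new team omits them.
     form team (1, survivors, stat=st)
     if (st == 0 .or. st == stat_failed_image) then
        change team (survivors)
           ! Continue the computation on the surviving images, or
           ! write out check-point data here before stopping.
        end team
     else
        error stop 'could not form a team of surviving images'
     end if
  end if
end program reform_team
```

A program that cannot usefully continue would replace the body of the
CHANGE TEAM construct with its check-point output followed by a stop.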

I strongly disagree with a proposal to have this be a "vendor 
extension".   That is the enemy of portable code.   For implementations 
that do not support any detection, we included the usual escape that the 
set of conditions that constitute an image failure is processor 
dependent. The set could be empty for some vendors.  But for the rest, 
for whom this is an important issue, having a standard syntax is the 
only way to promote code portability.

Cheers,
Bill




-- 
Bill Long                                           longb at cray.com
Fortran Technical Support    &                 voice: 651-605-9024
Bioinformatics Software Development            fax:   651-605-9142
Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101




