(j3.2006) (SC22WG5.4077) [ukfortran] Some minor points on the Draft Final CD

N.M. Maclaren nmm1
Wed Sep 9 08:28:34 EDT 2009


On Sep 9 2009, Bill Long wrote:

Thanks very much for the corrections.

>This is covered in 6.6, including a Note explicitly discussing the 
>"ragged" case.  Users can (and some do) check the number of images at 
>runtime and stop if the number is not consistent with assumptions they 
>have made in the program.  However, we don't want to restrict all 
>programs in this way - it violates the idea of being able to write code 
>that is independent of the number of images.

I missed that NOTE - sorry!  We HAD discussed it then, and I had forgotten.
Actually, 6.6 doesn't cover it, as that is about image selectors, and
declarations don't use image selectors.  I will rephrase, suggesting
another NOTE in the relevant place, because this is exactly the sort of
thing that implementors are likely to misunderstand, and there is no
normative text to beat them over the head with.

>A restriction like this would prevent sync images from being used for 
>wave-front synchronization, one of its main uses.  It would also 
>prevent the example in Note 8.37, which is also very useful.  The 
>functionality you are suggesting is covered by SYNC TEAM which, as part 
>of the grand compromise, was deferred to a TR.

Good points.  I will see if I can think of a serious ambiguity; if I can't,
I will drop this one, or perhaps suggest a NOTE.

>> 3) I think that the current wording overspecifies error termination in
>> paragraph 4 of 2.3.5 (Execution sequence).  Specifically, requiring one
>> image to be able to force the others into termination without them
>> executing any special statement is a heavy burden on implementors, and
>> may not always be feasible.  
>
>It does not seem any more of a burden than an implementation of CALL 
>ABORT() [or MPI_Abort()] which effectively does the same thing.

CALL ABORT? Is that a Cray-specific feature? If it refers to the C 
function, then I can assure you that it does NOT normally cause clean 
termination (despite the ambiguity in C, which can be read to imply that it 
does). As far as I know, my C run-time design was one of the few that did 
clean up properly after abort(), and it was HELL to implement!
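
To make the abort() point concrete, here is a minimal sketch, assuming a
POSIX system (fork, pipe, waitpid).  It checks whether a handler registered
with atexit() gets run when a child process ends via abort() rather than
exit().  The names cleanup, cleanup_fd and cleanup_ran are mine and purely
illustrative; what abort() does to open streams is implementation-defined,
so treat this as a probe of one system's behaviour, not as a statement of
what any standard requires.

```c
/* Sketch: does an atexit() handler run when a process calls abort()?
   POSIX-only; all names below are illustrative, not from any standard. */
#include <assert.h>
#include <signal.h>     /* SIGABRT, for callers inspecting the wait status */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int cleanup_fd = -1;   /* write end of the pipe, used by cleanup() */

/* Stands in for "clean termination" work: it just reports that it ran. */
static void cleanup(void)
{
    const char mark = 'c';
    if (write(cleanup_fd, &mark, 1) != 1) {
        /* nothing useful can be done about a failed write here */
    }
}

/* Runs a child that registers cleanup() and then ends via abort()
   (use_abort != 0) or exit(0) (use_abort == 0).
   Returns 1 if cleanup() ran in the child, 0 if it did not, -1 on error.
   *status receives the child's wait status. */
int cleanup_ran(int use_abort, int *status)
{
    int fds[2];
    assert(status != NULL);
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                 /* child */
        close(fds[0]);
        cleanup_fd = fds[1];
        atexit(cleanup);
        if (use_abort)
            abort();                /* abnormal termination */
        exit(0);                    /* normal termination */
    }

    close(fds[1]);                  /* parent keeps only the read end */
    char buf;
    ssize_t n = read(fds[0], &buf, 1);   /* 0 bytes (EOF): handler never wrote */
    close(fds[0]);
    if (waitpid(pid, status, 0) != pid)
        return -1;
    return n == 1 ? 1 : 0;
}
```

A caller would compare cleanup_ran(0, &st) with cleanup_ran(1, &st), and
could check WIFSIGNALED(st) and WTERMSIG(st) == SIGABRT after the second
call.  On the common implementations I would expect the handler to run
only in the exit() case, which is the sense in which abort() does not
terminate cleanly.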

I can assure you that MPI_Abort does NOT do the same thing!  It does what I
propose the wording to be!  Here is its specification in full:

        This routine makes a "best attempt" to abort all tasks in the
    group of comm. This function does not require that the invoking
    environment take any action with the error code.  However, a Unix or
    POSIX environment should handle this as a return errorcode from the
    main program.
        It may not be possible for an MPI implementation to abort only
    the processes represented by comm if this is a subset of the
    processes.  In this case, the MPI implementation should attempt to
    abort all the connected processes but should not abort any
    unconnected processes. If no processes were spawned, accepted or
    connected then this has the effect of aborting all the processes
    associated with MPI_COMM_WORLD.

The first paragraph in the specification of MPI_Finalize makes it clear
that my interpretation is the one MPI means:

        This routine cleans up all MPI state. Each process must call
    MPI_FINALIZE before it exits. Unless there has been a call to
    MPI_ABORT, each process must ensure that all pending non-blocking
    communications are (locally) complete before calling MPI_FINALIZE.

>> On the other hand, when it isn't possible,
>> the standard can't say anything useful.  To put it in other words, is
>> "error termination" intended to mean that the program has failed, or
>> does it really mean "unilateral termination"?
>
>Forcing all the images to terminate is important to users who are 
>charged for their resource usage. It does mean "unilateral termination".

Indeed but, as someone who has managed 8+ different parallel systems by
5+ different vendors, it's hell to implement and generally extremely
unreliable.  I spent a LOT of time writing job termination scripts to
increase the frequency of this happening, because the vendors had failed
to do so.  And even then, I couldn't make it bulletproof.  I don't think
that Fortran should imply that it's reliable.

Regards,
Nick Maclaren.




