(j3.2006) (SC22WG5.5400) Preliminary result of the TS straw vote
John Reid
John.Reid
Fri Dec 12 04:53:30 EST 2014
Dear all,
> Here is the first draft result of the TS straw vote. Please check that
> your view and comments have been correctly included. Please let me know
> of any errors by 9 a.m. (UK time) on Friday 12 Dec.
>
> The conclusion is absent at the moment. I want to discuss this with the
> Editor, Bill Long.
>
Here is a revision of the draft. Adding Van's vote is the only change to
the results. With four no votes, the vote has failed - consensus has
not been reached.
I have discussed the way forward with Bill. We believe that it should be
possible to reach consensus with far fewer changes than have been made at
recent J3 meetings. Therefore, I propose that the co-array email group,
led by Bill, consider all the comments and prepare a fresh version ASAP,
hopefully by 31 December. I will then conduct a 14-day straw vote on the
new version in January. The fall-back position is that J3 work on the
document at the February meeting.
With best wishes,
John.
-------------- next part --------------
ISO/IEC JTC1/SC22/WG5 N2038-2
Result of the WG5 straw ballot on N2033
John Reid
N2035 asked this question
Please answer the following question "Is N2033 ready for forwarding to
SC22 as the DTS?" in one of these ways.
1) Yes.
2) Yes, but I recommend the following changes.
3) No, for the following reasons.
4) Abstain.
The numbers of answers in each category were:
1 for 1) Yes (Chen)
5 for 2) Yes, but I recommend the following changes
(Bader, Long, Nagle, Reid, Whitlock)
4 for 3) No, for the following reasons
(Cohen, Corbett, Snyder, Maclaren)
2 for 4) Abstain (Moene, Muxworthy)
The straw ballot has failed - consensus has not been reached.
However, I believe that we may be able to reach consensus with
far fewer changes than have been made at recent J3 meetings. Therefore,
I request that the co-array email group, led by the Editor Bill Long,
consider all the comments and prepare a fresh version by 31 December.
I will then conduct a 14-day straw vote on the new version in January.
Here are the comments and reasons. I have included the comments of
Tobias Burnus, who did not vote.
Reinhold Bader
2) Yes, but I recommend the following changes.
(A) Section 5.9
Now that the TS has the concept of stalled images, I think that image
control statements without a STAT= specification that involve a
failed image could now relatively easily be made to result in the
executing image becoming stalled, instead of terminating the program.
This would make development of fail-safe packages much easier,
because the fail-safety can be designed in a top-down manner, i.e.,
library code that synchronizes, allocates or deallocates need not
necessarily be modified.
Suggested edits to N2033:
[14:19-] Add a new paragraph
"If an <<image-selector>> identifies an image that has failed
and a corresponding team, the executing image becomes a stalled
image."
[[textually separate identification from consequences. This is
not only needed for image control statements, but also atomic
and collective invocations without a STAT.]]
[14:19] Replace "If an ... stalled image" by
"If an image has stalled with respect to a team other than
the initial team, it remains stalled"
[14:24] Replace "If an ... stalled image" by
"If an image has stalled with respect to the initial team,
it remains stalled"
[36:36+] Replace "or an image ... initiated." by
"error termination is initiated. Otherwise, if an image involved
in the execution of the statement has stalled or failed, the
executing image becomes stalled. The stalled image's team
is the current team if an END TEAM, FORM TEAM, SYNC ALL,
SYNC MEMORY, or SYNC IMAGES statement is executed; it is the team
specified by the value of the <<team-variable>> in the execution
of a CHANGE TEAM or SYNC TEAM statement, and it is the team
identified by the <<image-selector>> in execution of a LOCK,
UNLOCK or EVENT POST statement."
[45:14-17] Replace para by
"If the implementation is capable of managing stalled images,
this example will continue execution in the face of failing
images even if synchronization statements, collective
or atomic subroutine invocations, or coarray allocations and
deallocations inside the change team block do not specify a
STAT argument."
(B) Section 7.2
It is not specified what happens if no STAT argument is specified.
Suggested edit:
[17:32] Add new para
"If no STAT argument is present in an invocation of an
atomic subroutine and the coindexed argument is determined to
be located on a failed image, the executing image becomes stalled;
the team is that identified by the <<image-selector>>.
Otherwise, if an error condition occurs, error termination is
initiated."
(C) Section 7.3
Here, the case without a STAT argument needs modification.
[17:24-25] Replace para by
"If no STAT argument is present in an invocation of a
collective subroutine and a failed or stalled image is identified
in the current team, the executing image becomes stalled with
respect to that team, and the argument A becomes undefined;
otherwise, if an error condition occurs, error termination is
initiated."
(D) Collective intrinsics CO_BROADCAST (7.4.10) and
CO_REDUCE (7.4.13)
There still seems to be some missing text with respect to invoking
these intrinsics with objects of derived type that have POINTER
components.
Here are suggestions for edits:
[22:24] Before "A becomes defined", insert "Except for ultimate
POINTER components, ".
[22:25] After "SOURCE_IMAGE.", add " The association status
and value of any ultimate POINTER component of A is not changed."
[24:5] After "computed value" add ", except for ultimate POINTER
components of A, "
[24:6] After "team.", add " The association status and value of
any ultimate POINTER component of A is not changed."
[24:14] After "operation.", add " The implementation of OPERATOR
shall not perform an ALLOCATE statement on any ultimate POINTER
component of the function result."
(Tobias Burnus has requested that finalizers be executed for both
the A argument as well as the OPERATOR function result. The above
edits constitute an attempt to do without this, inasmuch as
we're talking about INTENT(INOUT) arguments, while finalizers
are normally only executed for INTENT(OUT).
If his suggestion is followed instead, it should be noted that the
finalizers must be PURE procedures, because the intrinsics are;
allowing the A argument of CO_BROADCAST to be polymorphic would
then also be precluded, because the PUREity of the actually
executed finalizer could not be determined).
(E) Section 7.5.3 (MOVE_ALLOC)
Edit for support of stalling if executed without STAT argument:
[30:4-5] Replace para by
"If no STAT argument is present in an invocation of MOVE_ALLOC,
and a failed or stalled image is identified in the current team, the
executing image becomes stalled with respect to that team; otherwise,
if an error condition occurs, error termination is initiated."
_______________________________________________________________________
Tobias Burnus
First, thanks for the work in the draft. One item I want to raise now
before I forget it or it is past 8 December:
The DTS does not address finalization of CO_BROADCAST and CO_REDUCE for
derived types which have finalizers.
For CO_BROADCAST, simply adding a statement like the following should be
sufficient and, implementation-wise, it should be simple, as one can
simply finalize it before the actual data transfer: In the description
of "A" append: "On all images of the current team except the image
specified by SOURCE_IMAGE, A is finalized before it becomes defined."
For CO_REDUCE, the implementation will be more difficult; still, I
believe it makes sense to require finalization. Possible wording: "If
RESULT_IMAGE is not present, A is finalized and the computed value is
assigned to A on all images in the current team. If RESULT_IMAGE is
present, A is finalized and the computed value is assigned to A on image
RESULT_IMAGE and A on all other images in the current team is finalized
and becomes undefined."
This might need some refinement, as intermediate results ("tmp =
operator(a,b)") also have to be finalized at some point - assuming that "A"
is used for those - and I am not sure whether that's already implied.
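A minimal sketch of where such intermediate results arise, written against
the Fortran 2018 form of CO_REDUCE (an assumption; the draft TS calls the
second argument OPERATOR, Fortran 2018 calls it OPERATION, and the positional
call below does not depend on the name). The program name and the function
imax are hypothetical. The data are default integers, so no finalizer is
involved; for a type with a final subroutine, each result returned by the
function would be one of the intermediate values discussed above.

```fortran
program co_reduce_sketch
  implicit none
  integer :: a

  a = this_image()

  ! No RESULT_IMAGE: the computed value is assigned to A on every image
  ! of the current team. The processor may call imax several times, on
  ! different images, to combine partial results.
  call co_reduce(a, imax)

  print *, 'image', this_image(), 'has', a

contains

  ! The operation must be pure and mathematically associative.
  pure function imax(x, y) result(r)
    integer, intent(in) :: x, y
    integer :: r
    r = max(x, y)
  end function imax

end program co_reduce_sketch
```

Which image evaluates each call to imax is left to the processor, which is
why the question of where intermediate results are finalized matters.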
______________________________________________________________________
Malcolm Cohen
NO, for the following reasons.
I agree with Robert Corbett's vote.
I am somewhat taken aback that we've suddenly added this new concept
(stalled images) with far-reaching effects (and more proposed in other
comments) at the last minute.
It needs to be clear that it is possible to implement the "reliability"
(failed/stalled/whatever image) features efficiently on a variety of
architectures. It should not require incompatible changes to an existing
coarray implementation (which the current draft certainly seems to do). I
have no problem with some "bells and whistles" potentially requiring extra
work, but a reasonably effective subset needs to be workable without heroic
efforts, and without affecting programs that do not use the feature.
Additional minor comment:
Re finalization, I agree with Tobias Burnus' comments that it would be good
for this to be spelled out in detail for CO_BROADCAST and CO_REDUCE. For
the latter it should say that the result of applying the function is
finalized, including the final function application (the latter is as if
the output variable were assigned an expression that is the last function
reference). It should, I think, also be stated that the finalizations of
the intermediate function results are done on the image that actually
invoked the function, so that any deallocations are handled by the image
that did the allocations.
______________________________________________________________________
Robert Corbett
My vote is "3) No, for the following reasons."
I voted "yes" or "abstain" on recent ballots regarding the draft TS
because the features specified in the drafts ranged from good to
tolerable and because I thought it would be good to have the TS
completed so that implementors could gain experience with the
features before they became part of an edition of the Fortran
standard. The addition of stalled images in their present form
is sufficiently objectionable that I am compelled to vote "No."
My primary objection is to the requirements given in the third
paragraph of Clause 5.9 [14:19-23]. I do not see how the specified
semantics can be implemented without compromising the performance of
codes that do not make use of the feature. I am not certain that
the semantics can be implemented at all in some common environments.
At a secondary level, I find the specification of stalled images to
be unclear. Some points follow.
Clause 3.7 [5:41]
What does it mean for an image to have "encountered" an
<image-selector>? I know we use the usual meaning of a word when
we do not specify its meaning, but that rule is inadequate for this
case. For example, if an image executes a statement that contains an
<image-selector>, but that <image-selector> is part of an operand that
is not evaluated, has the <image-selector> been "encountered?" My
guess is that it has not, but I cannot tell that from the draft TS.
Clause 5.9, paragraph 3 [14:20-23]
When does a stalled image transfer control to the END TEAM statement?
Can it happen immediately or must it wait until all other images that
are part of the same team have completed, failed, or stalled?
Clause 5.9, paragraph 3 [14:20-23]
Are the deallocations and finalizations subject to any requirements
w.r.t. the order in which they are performed? For example, during
normal execution, allocatable objects that are part of an instance of
an internal procedure will be deallocated before the allocatable
objects that are part of the related instance of the host procedure.
Is there any requirement that that ordering be respected by the
stalled image?
_____________________________________________________________________
Bill Long
Yes, but I recommend the following changes.
N2033: [14:23] Delete ",without synchronization of coarray deallocations".
Tom Clune, and others since, have noted that this phrase increases the
uncertainty of how the recovery of a stalled image is expected to be
implemented. Additionally, it conflicts with a basic tenet of coarrays
that the existence of a coarray should be consistent across the images
where the coarray was allocated. If a stalled image prematurely
deallocates a coarray, accesses from an active image might produce
nonsense results, or even fail. This would be an undesirable exception
to our normal rules.
-------------------------
Additional general comments:
Nick explained the rationale behind the stalled image classification.
I would just add one background note. Most of the modes of inter-image
activity involve statements (image control statements or calls to
intrinsics) that have an optional STAT= specifier or STAT argument.
In those cases, an abnormal state can be detected by a programmer and
explicitly acted upon with statements in the program. If the program
fails to use these facilities (no STAT= specified, or omits the optional
STAT argument) and an error condition occurs, the program aborts, as has
long been the case. The one exception to this model is a simple
reference or definition of a variable on a remote image using the
image-selector syntax. There is no "STAT" method available there, nor
would it make much sense, since the designator that includes the image
selector could be in many places of a complicated expression or statement.
The stalled image facility addresses this case, plugging an otherwise
serious hole.
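The contrast can be sketched as follows, assuming a processor that supports
the failed-image constant STAT_FAILED_IMAGE from ISO_FORTRAN_ENV. The
program and variable names are hypothetical, and the sketch does not model
stalled images; it only shows that SYNC ALL has a place for STAT= while the
coindexed reference x[...] in the form used here does not.

```fortran
program stat_sketch
  use, intrinsic :: iso_fortran_env, only: stat_failed_image
  implicit none
  integer :: x[*]     ! scalar coarray, one copy per image
  integer :: st, y

  x = this_image()

  ! Image control statement: an error condition is reported through STAT=
  ! instead of terminating the program.
  sync all (stat=st)

  if (st == stat_failed_image) then
    print *, 'image', this_image(), 'saw a failed image at SYNC ALL'
  else if (st /= 0) then
    error stop 'unexpected error at SYNC ALL'
  else if (num_images() > 1) then
    ! Coindexed reference: nothing in this syntax can report a status.
    ! Image 1 reads from image 2; every other image reads from image 1.
    y = x[merge(2, 1, this_image() == 1)]
    print *, 'image', this_image(), 'read', y
  end if
end program stat_sketch
```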
There is substantial opinion that implementing stalled image recovery
is not easy. I do not disagree. In simplest terms, it is equivalent to
implementing the infrastructure to handle an exception handling mechanism.
It is a bit simpler - the handler is basically internal to the runtime
rather than user-specified, and if the relevant END TEAM statement lacks
a STAT= specifier, the code would end up aborting anyway, so there is no
need to do much before then. However, the basic process of unwinding the
call stack (if there is one) that grew after the CHANGE TEAM statement
execution is more or less the same as for an exception handler. Given
that exception handlers already exist in other languages, and certainly
at the system level, the argument that implementors do not know how to
do this seems weak at best. I understand grumbling about hard work, not
claims of inability.
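For orientation, a minimal sketch of the construct in question, using the
FORM TEAM / CHANGE TEAM / END TEAM syntax as it appears in Fortran 2018
(assumed here to match the draft closely enough for illustration). The
names team_sketch and half are hypothetical. Stalled-image recovery itself
is a concept of the draft and is not expressed in the code; the sketch only
shows where the STAT= specifiers sit, and that any call stack to be unwound
grows between CHANGE TEAM and END TEAM.

```fortran
program team_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: half
  integer :: st

  ! Split the images into two teams, numbered 1 and 2.
  form team (1 + mod(this_image(), 2), half, stat=st)
  if (st /= 0) error stop 'FORM TEAM failed'

  change team (half, stat=st)
    ! Work inside the team. Procedure calls made here are what a runtime
    ! would have to unwind before reaching END TEAM.
    print *, 'image', this_image(), 'of', num_images(), &
             'in team', team_number()
  end team (stat=st)

  ! With STAT= on END TEAM the program can inspect the status here.
  if (st /= 0) print *, 'END TEAM reported status', st
end program team_sketch
```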
The more general question of whether Fortran should include fault
tolerance on a timely schedule at all is really a question of Fortran's
future relevance in the HPC market place. And that is the only market
where Fortran has a significant fraction of programming language mindshare.
The need for this capability is in the 2018-2020 "exascale" time frame.
If we miss that window, we're seriously disadvantaged. The Fortran 2015
standard (with compilers available ~2018) is our last opportunity to meet
the schedule. Alternatives like MPI and SHMEM are actively making progress
in this area, realizing the same target dates are looming.
The idea that vendors need to implement a facility like fault tolerance
before including it in the standard is out of touch with the realities of
modern-day compiler development. It might have been viable in the past,
but today's compiler vendors will implement a feature AFTER it is in the
standard, not before. Not only is this an economic reality, but also a
positive for program portability. In many cases from the past where
vendors implement new facilities outside the standard, the features end
up being "extensions" that don't go away but perpetually lead to
non-portable code for programmers who use them. On platforms with
multiple Fortran compilers, this is a recurring frustration.
Finally, Tobias raised, and Malcolm elaborated and provided details on
the issue of finalization in the context of CO_BROADCAST and (especially)
CO_REDUCE. This issue is a side effect of the introduction of intrinsic
subroutines that allow INTENT(INOUT) arguments of types that specify
finalization. This case was not envisioned (or relevant) when the
current "4.5.6.3 When finalization occurs" was written. Modification to
the TS to account for this would be in Clause 8. I see this as
essentially an integration issue. While this is important, the TS
process also does allow for subsequent modifications during integration,
so I don't see this as an issue that should block the TS from progressing
to a vote.
_______________________________________________________________________
Nick Maclaren
NO, for the following reasons.
Reason 1
--------
I agree with Robert Corbett and Malcolm Cohen about stalled images, but
believe that they have understated the issue. The requirement is to
handle the 'knock-on' effect of image A failing, image B getting stuck
as a consequence, and image C then needing to interact with image B. I
agree with the authors that the concept is essential if support is to be
provided for failed images, and that is one of the reasons that I have
consistently voted against the whole feature or abstained.
I have implemented error recovery in run-time systems, have used and
worked on it in several contexts, and know that I am not smart enough to
specify it for a language like Fortran. Of the thousand or so language
and environment specifications I have seen, I have never seen one
specify this successfully, even for a single environment. It might be
possible in Haskell, but Fortran is not Haskell. From the lack of
convergence of these documents and the comments on the mailing list,
this TS seems to be failing in the ways that so many others have failed
before it. It is doubtful that adding this facility takes "full account
of the state of the art" (see the ISO Directives).
I believe that there is no chance whatsoever that this issue can be
resolved, and WG5 still keep to the schedule agreed in Las Vegas (see
N2020 and N2024). Indeed, I doubt that it could be done with even a
year's delay. Solving this problem is not within the state of the art,
despite considerable efforts in a great many contexts over the past
half-century.
I believe that the whole feature of support for failing and stalled
images should be removed, possibly specified in another TS, and not
integrated until there is significant implementation and user experience
in a fairly wide variety of environments.
Reason 2
--------
Many or most of the comments in N2013 on events have still not been
addressed, nor have some of the ones on atomics and collectives. In
particular, there are assumptions of cross-facility coherence and
progress but no normative text requiring them - indeed, quite the
opposite. It is doubtful that the current TS is "consistent, clear and
accurate" (see the ISO Directives). This is extremely serious, as
adopting an inconsistent set of assumptions will make it almost
impossible to deliver the target specified in 1. Scope, paragraph 2,
even ignoring the problem of the schedule.
I do not believe that this issue is as intractable as the previous one,
because specifying data and control flow and progress are within "the
state of the art". However, I am doubtful that the facilities in the TS can
be implemented efficiently without special hardware or operating system
support, while still delivering the consistency and progress that seem
to be assumed. However, even if there are no consistency problems to be
resolved, I do not believe that accepting these aspects of this TS is
compatible with keeping to the agreed schedule.
I believe that this area needs further clarity, even if not a polished
specification, before the TS should be accepted. I am not repeating the
relevant comments in N2013, because there is little point - there has
been little relevant change to the drafts.
___________________________________________________________________
Dan Nagle
Yes, but I recommend the following change.
[27:14-15] change "a nonzero value" to "a positive value"
Error values are positive.
__________________________________________________
John Reid
Yes, but I recommend the following changes.
[10:19] Delete "or be the value of a team variable for the initial
team".
Reason. Execution of FORM TEAM is always required.
[10:38], [12:21], [29:34], [34:8], [34:13]. At the end of the sentence
add "since execution last began in this team" (wavy underlined on page
34).
Reason. We need to allow for teams changing during the execution of the
program. At the October meeting, these words were added at [13:5],
[35:26], [35:37], and [36:1].
[13:1] Change "the team" to "team".
Reason. Definite article is wrong here.
[13:5] Remove space before period.
[14:9] Change "detect that an image has stalled" to "manage a stalled
image".
[14:20] After "becomes a stalled image" add ". If the processor does
not have the ability to manage a stalled image, the executing image
becomes a stalled image for the rest of the execution of the program.
If the processor has the ability to manage a stalled image, the
executing image becomes a stalled image"
Reason. I think the intention is to allow implementations not to
support stalled images transferring control to the END TEAM statement.
Stalling will still happen and will need to be permanent.
_______________________________________________________________________
Van Snyder
No, for similar reasons to Robert Corbett, Malcolm Cohen, and Nick
Maclaren. I am concerned that we have added incompletely
thought-through things at the last minute.
_______________________________________________________________________
Stan Whitlock
Yes, but I recommend the changes in Bill Long and John Reid's ballots.