(j3.2006) [ukfortran] (SC22WG5.5454) Response to TS ballot

N.M. Maclaren nmm1
Sun Feb 22 15:50:25 EST 2015


On Feb 21 2015, Bill Long wrote:
>
> Sorry, Nick. I was mistaken in assuming that the "infeasible" task was 
> the compiler and runtime work. I'm glad that we have a demonstration of 
> success in that area.

Well, no, we don't.  What we have is evidence that it is possible to
provide error recovery with callback to user code on mainframe systems,
at least for single-image (but multi-tasking) systems.  I was doing it
for a single, non-optimising compiler for one system, and the only
other projects that I know of (IBM CEL and DEC VMS) were for single
systems, and may never have been completed.  I have heard rumours that
it has been done in other single-language, single-system cases, but
investigation has never turned up any evidence.

What I can also say, from subsequent experience, is that even the
simpler task we were tackling would have been completely impossible
on ANY modern mainstream system, let alone in a portable fashion,
because of the lack of adequate system interfaces and the tendency of
system calls to misbehave or hang, unpredictably, on encountering any
hardware or underlying system failure.

In particular, I can witness that the current mainstream networking
interfaces have that defect, badly, including name lookup and TCP/IP.
That is one of the reasons that MPI says that any continuation after
it has detected an error is undefined.

> Did you provide documentation with your project? (I'm guessing so.) Based 
> on that, any suggestions for our writing task would be most appreciated.

No.  I was planning to, but one of the things that I learnt was that
it was too hard.  In particular, any objects active when failure occurs
(including those being passed as arguments but otherwise unused) can
end up in a worse-than-undefined state.  I can witness that IBM CEL
was not planning to specify the state of objects, because I was a
consultant on that project.

I have a document that I wrote for WG14 proposing something for C,
based on my implementation experience, but it would not be a great help
to WG5.  The main reason is that C's object model is vastly simpler
than Fortran's, but the proposal also needed the introduction of an
exception recovery point (which, in Fortran terms, would be a SYNC ALL
across all images, teams notwithstanding).  That requirement is one of
the reasons that I voted against being able to access images not in
the current team, or in an active subteam.

> In terms of "specifying the circumstances when recovery is possible", I 
> see two aspects. One is whether there are syntax and semantics specified 
> that allow the program to resume execution at a place that is not part of 
> the normal execution sequence. Second is whether the algorithm is such 
> that it is possible to restart. Our current position is that the second 
> is entirely the programmer's responsibility. We only supply help with the 
> first.

Yes.  And it is purely the first I am talking about.

> The task of "specifying the state of the program following such recovery" 
> is more of a challenge. It depends on how the programmer is trying to 
> restart. If, for example, the program includes checkpoints, the desired 
> action after the END TEAM statement would be to branch to code that 
> restored state from the most recent checkpoint and resume execution just 
> after that checkpoint. The method we're supplying additional help for is 
> to branch back to the beginning of the current CHANGE TEAM construct and 
> re-execute it. There are a lot of potential states for objects in the 
> program. Paper 15-138 is an attempt at describing what happens. 
> Suggestions for improvement are welcome.

Don't start from here.  I don't believe that it can be done, and all
evidence from the past half century is that it is too hard for the best
experts, and possibly flatly impossible.


Regards,
Nick.



