(j3.2006) (SC22WG5.4934) [ukfortran] WG5 ballot on first draft TS 18508, Additional Parallel Features in Fortran

Keith Bierman khbkhb
Thu Mar 14 11:58:38 EDT 2013


On Thu, Mar 14, 2013 at 9:49 AM, Tobias Burnus <burnus at net-b.de> wrote:

> N.M. Maclaren wrote:
> ...
>

MPI is arguably the poster child for a parallel processing tool that is
too low-level.



> Am actually wondering whether it helps to look at how MPI and MPI
> implementations do it.
>
>
> PS: While my impression is that most programs currently do not attempt
> to do error recovery, there seems to be quite some demand to support
> failure recovery. I think the line of thought is that with hundreds of
> thousands of processors,


Yes, this is quite a common problem. People routinely use a
checkpoint/restart facility to hack around it, or other vendor- and/or
application-specific hacks. I'm certain that's why the Committee is
attempting to address this.
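
For readers unfamiliar with the checkpoint/restart workaround mentioned above, here is a minimal sketch in Python. It is only an illustration of the general idea, not any vendor's facility and not what the TS proposes; the names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter (which simulates a fault) are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, fail_at=None):
    """Sum 0..n_steps-1, checkpointing after every step.

    `fail_at` is a hypothetical knob that simulates a hardware fault
    just before the given step is executed.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure resumes from the last completed step instead of starting over. Real applications checkpoint far less often than every step, since the I/O cost is usually the limiting factor.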



> there will be several hardware defects per day.
> Still, I don't really know whether that's more a hype or some real
> trend.


Error rates remain relatively constant: as circuits get smaller, the
amount of real estate expended to maintain the effective error rate tends
to go up. Is that what you mean by a trend?

>
> 'Note that it is provably impossible to reliably detect all kinds of faults
> *Node "down" may be node "really, really slow"
>

If a node is performing exceptionally badly, it's often a sign of hardware
failure (e.g. retries, alternate paths, etc.). Obviously, there are
exceptions (such as hardware that punts on underflow and falls back to a
software implementation).

> .
> ....
> *Note poor analysis of hardware / software faults in some studies -
> persistent faults can sometimes be identified, but transient faults
> (hardware upsets, timing/race issues in software) means source of many
> faults unknown
> *Challenge: What are the likely faults and how will system software
> respond to them? How should a programming model interact with the
> system? How much should the programmer participate in managing different
> kinds of faults?'
> _
>
The most common faults (hardware faults, that is, not software bugs) are
transient, and are related to multiple alpha particle hits. Simply
abandoning the work done and reforming a team (with the OS performing
whatever sanity checks it deems fit, plus a power cycle of the node) seems
likely to be helpful in many cases.
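
To make the "abandon the work and reform a team" idea concrete, here is a rough sketch in Python. It is a toy model under stated assumptions, not the mechanism in the draft TS: nodes are plain integers, `NodeFailure`, `run_with_reform`, and the `compute` callback are hypothetical names, and the OS sanity checks and power cycle are not modelled at all.

```python
class NodeFailure(Exception):
    """Raised by the (hypothetical) compute callback when a node fails."""

    def __init__(self, node):
        super().__init__(f"node {node} failed")
        self.node = node


def run_with_reform(items, team, compute, max_attempts=3):
    """Run `compute(node, item)` for every item, reforming the team on failure.

    On a NodeFailure the partial results of the attempt are thrown away,
    the failed node is dropped, and the whole job is retried on the
    surviving nodes. Returns (results, surviving_team).
    """
    team = list(team)
    for _ in range(max_attempts):
        if not team:
            break
        try:
            # Distribute the items round-robin over the current team.
            results = [
                compute(team[i % len(team)], item)
                for i, item in enumerate(items)
            ]
            return results, team
        except NodeFailure as failure:
            # Abandon the work done so far and reform the team without
            # the node that reported the fault.
            team = [n for n in team if n != failure.node]
    raise RuntimeError("no attempt completed successfully")
```

For example, with a four-node team in which node 2 always fails, the first attempt is abandoned and the second completes on nodes 0, 1, and 3. Discarding all partial work is the crudest possible policy; combining it with checkpoints would bound how much is lost.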