[J3] (SC22WG5.6128) [EXTERNAL] Re: Two things from IFIP WG 2.5 meeting
Van Snyder
van.snyder at jpl.nasa.gov
Mon Jul 29 16:43:20 EDT 2019
On Mon, 2019-07-29 at 13:29 +0000, Bill Long wrote:
> > On Jul 27, 2019, at 10:20 AM, Van Snyder via J3
> <j3 at mailman.j3-fortran.org> wrote:
> >
> > Specification that a DOT_PRODUCT produce a correctly-rounded result
> does
> > not depend upon the kind of the arguments. If a processor has a
> super
> > accumulator that only works for binary64 (or binary32), it could use
> it
> > for those precisions, and use a software method otherwise. The
> > processor could detect at run time whether the CPU (or a
> coprocessor)
> > provides a super accumulator, and use the appropriate method.
> >
>
> The focus on dot product seems odd for a couple of reasons. Most
> processors use IEEE floating point representations where the exponent
> range for 64-bit reals (used whenever accuracy is important) is ~
> 10**300. The number of particles in the universe is ~10*80, so
> overflow seem very unlikely for any reasonably formed problem.
> Similarly for underflow, as tiny values are ~10**-300. The numerical
> problem I see as potential is for a very long vector sum at the end
> where the values are significantly different in size or have lots of
> cancellations because of alternating signs. There are a couple of
> techniques that can help using existing hardware. First, vectorize
> the operation which results in parallel accumulation of partial sums
> that are combined at the end. (Note that this only happens at
> non-zero optimization levels). Vectorization alone is sufficient for
> the vast majority of users. For widely varying sized terms in the
> summation, resulting in truncation errors in the additions, you could
> sort the list and sum staring at the small end.
There are more things in heaven and earth, Horatio, than are dreamt of
in your philosophy.
As the paper remarks, dot product is ubiquitous.
The issue isn't overflow. It's getting correct bits in the result.
Somehow, I trust experts who are specialists in this area (Kulisch,
Rump, Ogita, Oishi, Dekker, Neumaier, Knuth, Kahan...) who say that such
simplistic solutions aren't adequately reliable. Sorting is O(n log n),
so for vectors longer than eight elements or so, Rump's algorithm (based
on compensated summation) would be faster. Sorting requires branches.
Bin sorting on the exponent is faster but requires work space. Summing
the bins might be slower than algorithms based upon compensated
summation. If there are no sign changes, accumulating small-to-large
works best. If there are sign changes, accumulating large-to-small
works best. But none of those simplistic approaches are reliable.
The paper cited in the proposal considers Bill's simplistic methods.
It's in Tutorials/Ogita-Rump-Oishi.pdf. Nick Higham devoted an entire
chapter to this in "Accuracy and Stability of Numerical Algorithms."
Dekker's algorithm has branching. Neumaier uses Dekker's algorithm.
Knuth's doesn't branch, but requires six flops per add. Even so it's
typically 50% faster than Dekker's. Extended precision (e.g. XBLAS) can
be silently inadequate (and Rump's algorithm is 40% faster).
Ogita and Oishi were at Waseda University in Tokyo in 2005. Maybe we
could invite them to give a presentation, if they're still nearby.
> The relevant question for WG5 is whether this issue is in the scope of
> the Fortran standard. For the small number of cases where these
> round-off issues matter to a program, the user might want to either
> write separate code, or use one of the professionally written
> libraries (NAG, IMSL, …) for the dot_product computation. Fortran
> specifies the result in mathematical terms. I think it is unwise to
> be specifying implementation details.
The description of NORM2 includes a recommendation "without undue
overflow or underflow." That's a tiny bit more emphatic than "processor
dependent approximation."
We've done many things for small numbers of users. HYPOT, for example,
which is entirely redundant to CABS and NORM2.
The problem with implementing reliable algorithms is that optimizers are
eager to subvert them, and subverting optimizers is difficult.
Does NAG (or IMSL) provide a correctly-rounded dot product procedure?
Maybe they believe XBLAS are sufficient?
If one has a super accumulator in the (co)processor, would a NAG (or
IMSL) dot product function detect it and exploit it?
The proposal EXPLICITLY DOES NOT SPECIFY IMPLEMENTATION DETAILS! DID
YOU READ IT OR JUST JERK YOUR KNEE?
> > I recall descriptions of neural network training failing to converge
> in
> > even 100,000 iterations without correct linear algebra. Iterative
> > refinement was tried, but made only a small dent, because of poor
> > conditioning of the dot product. It finally worked when an accurate
> dot
> > product was used. IIRC, this was described in a paper by Victor
> > Pereyra, in connection with variable projection and separable
> nonlinear
> > least-squares problems. I've asked correspondents for more
> examples.
>
> The linear algebra needs in neural network training codes are beyond
> the skill set of most of the Python programmers who dominate the
> field. They typically just (ultimately) call the library routines
> supplied by the GPU vendors appropriate for the target hardware. If
> those routines are insufficient for some reason, comments to that
> effect should be directed to the vendors.
Neural network training is only one example, not the entire universe of
problems.
Bit it's always handy to throw in a red herring when you don't have much
else.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.j3-fortran.org/pipermail/j3/attachments/20190729/f9bbaeb7/attachment.html>
More information about the J3
mailing list