[J3] (SC22WG5.6116) [EXTERNAL] Re: [ukfortran] Two things from IFIP WG 2.5 meeting
Van Snyder
van.snyder at jpl.nasa.gov
Fri Jul 26 13:20:19 EDT 2019
On Fri, 2019-07-26 at 13:24 +0100, N.M. Maclaren wrote:
> On Jul 26 2019, Van Snyder wrote:
>
> >Two things that ought to be of interest to the Fortran standard were
> >discussed at the recent meeting of IFIP Working Group 2.5 (Numerical
> >Software).
> >
> >1. The dot product is an ubiquitous operation in mathematically-related
> >problems, not just in library modules. Floating-point arithmetic is not
> >associative. The dot product can be arbitrarily badly conditioned. The
> >numerical software community would like to have either an exact dot
> >product or a correctly-rounded dot product. Using a super-accumulator
> >as originally described by Kulisch, the exact dot product can be
> >computed as fast as operands can be fetched. IBM implemented the exact
> >dot product in hardware for the 370, in an option called ACRITH.
> >Kulisch and his students implemented the exact dot product on a PCI
> >card. Demmel and his students implemented the exact dot product on a
> >PCI card. Both Kulisch and Demmel observed that the PCI card computed
> >an exact dot product six times faster than a double-precision floating
> >point method using a Pentium.
>
> In order to do that for IEEE 754 extended precision, you need a huge
> accumulation 'register' (c. 4,200 bytes, if I recall). That's not a
> problem in many cases (e.g. the hardware ones you mention) but becomes
> one for many others (e.g. some highly parallel implementations).
It's not 4,200 bytes; it's 4,288 bits for binary64 accumulation of 2^20
terms, and 640 bits for binary32. On a processor having a billion
transistors, is a 4,288-bit accumulator even visible without an electron
microscope?
For binary128, 49,536 bits would be required, but it was not requested
for IEEE 1788. The attached description that Kulisch and I wrote gave
the size for binary128. A 4,288-bit accumulator would work for
binary128, but only for vectors much shorter than 2^20 elements (and
would work for vectors that large if no intermediate fixed-point
overflow occurs). If anybody comes up with a problem in which
accumulation for binary128 vectors doesn't work by rounding the inputs
to binary64 as they're fetched, we can discuss using a super accumulator
(as opposed to a correctly-rounding algorithm) at that juncture.
As mentioned in the original message, an alternative to fixed-point
accumulation in a super accumulator is an algorithm that produces a
correctly-rounded result. A modified intrinsic DOT_PRODUCT function
would be agnostic (in the standard) as to which method the processor
uses. Rather than an optional argument, it could be a different
specific function for the generic DOT_PRODUCT, so the processor wouldn't
need to check at run time (in the case without the additional argument
that requests the correct result). Of course, the cost of the check
would be trivial compared to the cost of the accumulation, except when
vectors are extremely short. If the cost of correct accumulation is a
problem, this might put pressure on chip developers to incorporate
super-accumulator fixed-point accumulation in the CPU.
An exact dot product doesn't have to be in the CPU. It could be
provided by a coprocessor, maybe even one on a PCI card. Some Fortran
processors might detect at run time whether it's present in the
computer, and use it instead of a (slow) correctly-rounding algorithm.
Surely we all remember vector coprocessors.
> Are you quite sure that the numerical software community really DOES
> want that?
To the extent that IFIP WG 2.5 (Numerical Software) is representative of
the numerical software community, yes, the community DOES want this.
Since 2000, I attended about half of WG 2.5 meetings. This was
discussed at most of those. The dot product is ubiquitous. See
footnote 2 on page 3 of the attachment. Ill-posed problems cannot be
solved if it's not correct. Defect correction and iterative refinement
of ill-posed linear problems fail if it's not correct. Bounds in
interval arithmetic can unnecessarily expand. Nearly-identical
eigenvalues and their correspinding eigenvectors go wrong. Even Monte
Carlo methods can get into trouble if it's not correct.... There are
certainly many users of numerical software who are struggling with
incorrect results and don't realize that an exact dot product would
solve their problems.
> I agree that it wants something better than the obvious code
> gives, but I very much doubt that more than a few people could make use
> of so-called last-bit accuracy.
This is a red herring. Last-bit accuracy is less important at the
end-of-the-line than getting at least some of the bits IN THE RESULT
correct. In ill-posed problems, if the dot product is not accumulated
correctly, the number of correct bits IN THE RESULT can be very small
-- even zero. Getting the last bit of the result correct is a
by-product of correct accumulation, not the end goal. I think a
vanishingly small number would want to replace the last N bits of the
result with random junk, just because they didn't need last-bit
correctness.
If you want papers that describe the problem, I can get them. Siegfried
Rump gave very nice demonstrations of the problem at the WG 2.5 meeting
in Sydney last year, and at ICIAM in Valencia last week (and many other
times when I was not in attendance).
> Even the people who want repeatability
> can't have that without forbidding many important forms of optimisation
> and any reasonable performance / numerical improvements in the less
> tractable special functions and most forms of solver, not to say
> forbidding many important non-deterministic methods. And that's a
> fundamental restriction of the basic mathematics.
Another red herring. The problem isn't repeatability. It's getting the
correct answer at the end. The "correct answer" doesn't (generally)
have to be bit-for-bit repeatable, but it does need to have at least a
few correct bits. And a repeatable bit-for-bit correct result would
certainly not be harmful.
> >Computing a correctly-rounded dot product is possible using conventional
> >floating point arithmetic, at a cost of up to six times the cost of the
> >obvious method. It is difficult for users to implement it because
> >processors sometimes do not respect parentheses. For example, a
> >compensated-summation loop might be written
> It's not that simple, unfortunately. While almost all Fortran compilers
> have at least an option to honour the standard together with optimisation,
> surprisingly few C and C++ ones do in this area. They do NOT just need to
> honour parentheses, but to use the same arithmetic throughout; using
> native Intel 80-bit for calculation and IEEE 754 for storage also knackers
> that code. I baulked at teaching the necessary hacks in my C++ course, and
> I was notorious for not sparing students the sordid details. I can post
> my example code if you like.
This is a red herring. We are discussing the Fortran standard, after
all, not the C or C++ or 754 standard. BTW, Kahan did discuss the
80-bit vs. 64-bit problem in the paper attached to the original message.
> Furthermore, I suspect that the factor of six might well be exceeded in
> some cases (e.g. systems with very limited L1 cache and a particularly
> foul data distribution and IEEE 128-bit). There's no getting round that
> 4,200 byte requirement.
Some time during the last twenty or thirty years, I proposed a STRICT
block, in which calculations would follow strict arithmetic rules.
I.e., "turn off goofy standard-violating optimizations here, but go
ahead and use them elsewhere." An alternative is to hide some parts of
the compensation in different procedures, and hope the processor doesn't
inline them. This would indeed increase the cost beyond a factor of
six. Is that worse than getting the wrong answer?
A super accumulator for dot products of double precision vectors having
up to 2^20 elements would indeed require 4,288 bits (not bytes).
Kulisch's and Demmel's and IBM's implementations of exact dot product
did not use the processor, and therefore didn't use the L1 cache. They
computed the dot product as fast as operands could be fetched using DMA.
There is no way to do it faster. The cache is irrelevant if the exact
dot product is provided by a coprocessor.
If an exact dot product using a super accumulator were incorporated into
a processor, it could still operate as fast as operands could be fetched
from memory -- and occasionally faster if some of the operands were in
cache. It would still be six times faster than a double-precision
floating-point accumulation -- and would always produce a correct
result, unless the vectors were longer than 2^20 elements AND the result
overflowed. Indeed, if it were incorporated in the processor, there
would never be an excuse to use the obvious floating-point algorithm
under-the-covers for DOT_PRODUCT. Exact accumulation would be six times
faster for ALL dot products, not just extremely long ones.
>
> Regards,
> Nick Maclaren.
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DotProd-8.pdf
Type: application/pdf
Size: 103369 bytes
Desc: not available
URL: <https://mailman.j3-fortran.org/pipermail/j3/attachments/20190726/81b6c2b2/attachment-0001.pdf>
More information about the J3
mailing list