(j3.2006) Fast accurate NORM2

Van Snyder Van.Snyder
Tue Dec 15 17:07:42 EST 2015


On Tue, 2015-12-15 at 21:46 +0000, Bill Long wrote:
> On Dec 15, 2015, at 3:26 PM, Van Snyder <Van.Snyder at jpl.nasa.gov> wrote:
> 
> > There's a bit more of interest in the article on the faithfully rounded
> > l2 norm of n-vectors.
> > 
> > The error of the netlib norm routine depends upon the vector length; the
> > error of the reported routine does not.
> > 
> > The reported routine is three times faster than the netlib routine,
> > except for very short vectors.
> > 
> > Since it's faster and more accurate than what many consider to be the
> > best contemporary l2-norm routine, is it unreasonable to hope that the
> > reported routine will be used for the intrinsic NORM2 function?
> 
> Alternatively, why has this not replaced the current NetLib routine?
> If it did, the benefits to NORM2 might be automatic.

This is precisely the reason I sent the message.  Hopefully, the source
of routines to implement intrinsic functions is not limited to netlib.

The routine just appeared in the most recent issue of ACM Transactions
on Mathematical Software.  It was a research article publication, with a
link to software, not an algorithm submission.  That is, ACM TOMS did
not publish the software.  It therefore is not part of CALGO, and
therefore not automatically posted at netlib.  In any case, routines at
netlib are not replaced.  New ones are added but old ones are never
removed.

One of the facets of CALGO algorithm policy is that code be portable.
The reported routine incorporates Intel SIMD instructions using C
builtins (whatever that means).

Under "Disclaimers" the authors (or more likely an Intel flack) wrote
"Software and workloads in performance tests may have been optimized for
performance only on Intel processors...."  Only one of the authors
(Ping-Tak Peter Tang) is an Intel employee.

The authors do remark that they implemented and tested the routine on
other processors.  Some of their advantage is lost on processors for
which the cost of divide is not as large compared to multiply as it is
on Intel processors.  They then remark that those processors have FMA
instructions, and that using those instructions regains their advantage.

They also remark that the algorithm is amenable to parallelization.  So
far, they have limited their exploitation of parallelism to SIMD
instructions, but they remark that future work will include
multi-threading.

The primary source of the advantage is that the loop has no branches or
divides.  The netlib routine has both.

> Cheers,
> Bill
> 
> 
> > 
> > On Mon, 2015-12-14 at 13:25 -0800, Van Snyder wrote:
> >> On Mon, 2015-12-14 at 13:57 -0700, Keith Bierman wrote:
> >>> 
> >>> On Mon, Dec 14, 2015 at 1:49 PM, Van Snyder <Van.Snyder at jpl.nasa.gov>
> >>> wrote:
> >>>        Efficient Calculations of Faithfully Rounded l2-Norms
> >>>        of n-Vectors."
> >>> 
> >>> ?Sounds nice. Is there a copy of the sw online?
> >> 
> >> From page 24:16 of the article:
> >> 
> >> "The complete set of codes, together with testing and performance
> >> measurement auxiliary sources, is available at
> >> 
> >>  http://www.christoph-lauter.org/faithfulnorm.tgz
> >> 
> >> under an open source license.
> >>  We implemented and tested our faithfully-rounded division-free l2-norm
> >> with faithful reporting of overflow and underflow....
> >>  We used IEEE754 binary64 as working precision and restricted ourselves
> >> to an SIMD environment, targeting in particular Intel SSE/AVX units,
> >> with or without the IEEE754 FMA insruction...."
> >> 
> >> Incidentally, Jim Demmel's students have implemented Kulisch's method to
> >> compute exact dot products.  Their implementation runs six times faster
> >> than a floating-point dot product, let alone a correctly-rounded one
> >> that doesn't overflow or underflow.
> >> 
> >>> A quick google peek turned up some slides which suggest that
> >>> intermediate computations with twice the word size are required (but
> >>> I'm not certain its the same work).
> >>> 
> >>> 
> >>> If so, worked details for double-double with one and two roundings
> >>> might be of interest for folks with platforms (e.g. POWER) that
> >>> support it (usually much cheaper than "quad" precision). 
> >>> 
> >>> 
> >>> ?
> >>> 
> >>> 
> >>> Keith Bierman
> >>> khbkhb at gmail.com
> >>> kbiermank AIM
> >>> 303 997 2749
> >> 
> >> 
> >> _______________________________________________
> >> J3 mailing list
> >> J3 at mailman.j3-fortran.org
> >> http://mailman.j3-fortran.org/mailman/listinfo/j3
> > 
> > 
> > _______________________________________________
> > J3 mailing list
> > J3 at mailman.j3-fortran.org
> > http://mailman.j3-fortran.org/mailman/listinfo/j3
> 
> Bill Long                                                                       longb at cray.com
> Fortran Technical Support  &                                  voice:  651-605-9024
> Bioinformatics Software Development                     fax:  651-605-9142
> Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101
> 
> 





More information about the J3 mailing list