(j3.2006) Question from a colleague
Tobias Burnus
burnus
Wed Jun 11 16:11:58 EDT 2014
Van Snyder wrote:
> Here's a question from a colleague:
>
> In code expected to be used on several platforms, is one of these
> formulations generally (if not universally) more likely to be more
> efficient, or should I expect significant processor-dependent variation?
I think on most systems, the second version will be faster.
> xsqr_pls_ysqr=(xx**2+yy**2)
> !-------------
> ! Asymptotic expression for extremely large |z|
> mask=(xsqr_pls_ysqr>=1.0e16_rk)
> allocate(index_sub(count(mask)))
> index_sub=pack(indx_array,mask)
> w(index_sub)=one_sqrt_pi*(abs(yy(index_sub))+j1*(xx(index_sub)))/ &
> (xsqr_pls_ysqr(index_sub));
> deallocate (index_sub)
Memory allocation is relatively expensive. In addition, this version has
several loops:
(1) mask=(xsqr_pls_ysqr>=1.0e16_rk)
(2) count(mask)
(3) index_sub=pack(indx_array,mask)
(4) "w(index_sub) =" line.
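To make the passes concrete, here is a rough sketch of the first three written as explicit loops. It is only an illustration, not what any particular compiler emits. The kind rk, the size n, the array s (standing in for xsqr_pls_ysqr) and its sample values are assumptions, as is the guess that indx_array holds 1..n.

```fortran
program pack_version_passes
  implicit none
  integer, parameter :: rk = kind(1.0d0)   ! assumed kind; not shown in the original
  integer, parameter :: n = 6              ! hypothetical problem size
  real(rk) :: s(n)                         ! stands in for xsqr_pls_ysqr
  logical  :: mask(n)
  integer, allocatable :: index_sub(:)
  integer  :: i, k, cnt

  ! Made-up sample data: elements 1, 3 and 5 meet the threshold
  s = [1.0e17_rk, 2.0_rk, 5.0e16_rk, 0.5_rk, 1.0e16_rk, 3.0_rk]

  ! Pass 1: mask = (xsqr_pls_ysqr >= 1.0e16_rk)
  do i = 1, n
     mask(i) = s(i) >= 1.0e16_rk
  end do

  ! Pass 2: count(mask)
  cnt = 0
  do i = 1, n
     if (mask(i)) cnt = cnt + 1
  end do

  ! Explicit heap allocation, sized by the count
  allocate(index_sub(cnt))

  ! Pass 3: pack(indx_array, mask), assuming indx_array is [1, 2, ..., n]
  k = 0
  do i = 1, n
     if (mask(i)) then
        k = k + 1
        index_sub(k) = i
     end if
  end do

  ! Pass 4 would be the indirect (gather/scatter) assignment w(index_sub) = ...
  print *, index_sub

  deallocate(index_sub)
end program pack_version_passes
```

Each pass walks the full array (or the selected subset) once, and the last one accesses memory indirectly through index_sub.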
> xsqr_pls_ysqr=(xx**2+yy**2)
> !-------------
> ! Asymptotic expression for extremely large |z|
> where (xsqr_pls_ysqr>=1.0e16_rk) &
> w=one_sqrt_pi*(abs(yy)+j1*(xx))/(xsqr_pls_ysqr);
In this version, how the statement is handled depends on how smart the
compiler is. The compiler might generate a temporary mask variable and
then use it in the loop, or it might put the condition directly in the
loop. If it generates a temporary, it might do so on the stack, which is
faster than an explicit allocate on the heap; thus, even that part could
be faster.
In addition, the assignment to w will generate a loop, which might be
vectorizable using masked assignment (with or without a mask temporary).
By contrast, vector subscripts are very difficult for the compiler to
vectorize.
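As a sketch of the "condition directly in the loop" case, the program below compares the where statement with a hand-written single loop that a compiler could plausibly reduce it to. Whether a given compiler actually does this is processor dependent. The kind rk, the size n, the sample data, and the meanings of one_sqrt_pi (1/sqrt(pi)) and j1 (the imaginary unit) are assumptions, since the original message does not define them.

```fortran
program where_as_single_loop
  implicit none
  integer, parameter :: rk = kind(1.0d0)                        ! assumed kind
  integer, parameter :: n = 4                                   ! hypothetical size
  real(rk), parameter :: one_sqrt_pi = 0.5641895835477563_rk    ! assumed: 1/sqrt(pi)
  complex(rk), parameter :: j1 = (0.0_rk, 1.0_rk)               ! assumed: imaginary unit
  real(rk)    :: xx(n), yy(n), xsqr_pls_ysqr(n)
  complex(rk) :: w(n), w_loop(n)
  integer :: i

  ! Made-up sample data: elements 1 and 3 are in the large-|z| regime
  xx = [1.0e9_rk, 1.0_rk, -3.0e8_rk, 0.0_rk]
  yy = [2.0e8_rk, 2.0_rk, -4.0e8_rk, 1.0_rk]
  xsqr_pls_ysqr = xx**2 + yy**2
  w      = (0.0_rk, 0.0_rk)
  w_loop = (0.0_rk, 0.0_rk)

  ! The where form from the question
  where (xsqr_pls_ysqr >= 1.0e16_rk) &
     w = one_sqrt_pi*(abs(yy) + j1*xx)/xsqr_pls_ysqr

  ! One fused loop: no mask temporary, no index array, unit-stride access
  do i = 1, n
     if (xsqr_pls_ysqr(i) >= 1.0e16_rk) then
        w_loop(i) = one_sqrt_pi*(abs(yy(i)) + j1*xx(i))/xsqr_pls_ysqr(i)
     end if
  end do

  ! Largest difference between the two results
  print *, maxval(abs(w - w_loop))
end program where_as_single_loop
```

The loop body touches each element once with contiguous access, which is the pattern that masked vector instructions handle well.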
And finally, the second version is in my opinion vastly more readable.
Thus, I would use the second version.
Tobias