(j3.2006) Question from a colleague

Tobias Burnus burnus
Wed Jun 11 16:11:58 EDT 2014


Van Snyder wrote:
> Here's a question from a colleague:
>
> In code expected to be used on several platforms, is one of these
> formulations generally (if not universally) more likely to be more
> efficient, or should I expect significant processor-dependent variation?

I think on most systems, the second version will be faster.

>          xsqr_pls_ysqr=(xx**2+yy**2)
>          !-------------
>          ! Asymptotic expression for extremely large |z|
>          mask=(xsqr_pls_ysqr>=1.0e16_rk)
>          allocate(index_sub(count(mask)))
>          index_sub=pack(indx_array,mask)
>          w(index_sub)=one_sqrt_pi*(abs(yy(index_sub))+j1*(xx(index_sub)))/ &
>              (xsqr_pls_ysqr(index_sub));
>          deallocate (index_sub)

Memory allocation is relatively expensive. In addition, this version has 
several loops:

(1) mask=(xsqr_pls_ysqr>=1.0e16_rk)
(2) count(mask)
(3) index_sub=pack(indx_array,mask)
(4) "w(index_sub) =" line.
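
To make the four passes listed above concrete, here is a minimal sketch that writes each one as an explicit loop and checks the result against the array-syntax form. The array size, the sample data, the kind rk, and the values of one_sqrt_pi and j1 are assumptions chosen for illustration; they are not given in the original question. A real compiler may fuse or reorder some of these passes.

```fortran
program packed_version_passes
  implicit none
  integer, parameter :: rk = kind(1.0d0)                        ! assumed working precision
  integer, parameter :: n = 6                                   ! hypothetical problem size
  real(rk), parameter :: one_sqrt_pi = 0.5641895835477563_rk    ! assumed to be 1/sqrt(pi)
  complex(rk), parameter :: j1 = (0.0_rk, 1.0_rk)               ! assumed imaginary unit
  real(rk)    :: xx(n), yy(n), xsqr_pls_ysqr(n)
  complex(rk) :: w(n), w_ref(n)
  logical     :: mask(n)
  integer     :: indx_array(n), i, k, cnt
  integer, allocatable :: index_sub(:), index_ref(:)

  ! Made-up sample data: elements 2, 4 and 6 have very large |z|.
  xx = [1.0_rk, 2.0e8_rk, 3.0_rk, -4.0e8_rk, 5.0_rk, 6.0e8_rk]
  yy = [1.0_rk, -1.0e8_rk, 2.0_rk, 3.0e8_rk, -4.0_rk, 0.0_rk]
  indx_array = [(i, i = 1, n)]
  xsqr_pls_ysqr = xx**2 + yy**2
  w = (0.0_rk, 0.0_rk)
  w_ref = (0.0_rk, 0.0_rk)

  ! Pass (1): build the mask.
  do i = 1, n
     mask(i) = xsqr_pls_ysqr(i) >= 1.0e16_rk
  end do

  ! Pass (2): count(mask), needed to size the index array.
  cnt = 0
  do i = 1, n
     if (mask(i)) cnt = cnt + 1
  end do

  allocate(index_sub(cnt))          ! heap allocation

  ! Pass (3): pack(indx_array, mask).
  k = 0
  do i = 1, n
     if (mask(i)) then
        k = k + 1
        index_sub(k) = indx_array(i)
     end if
  end do

  ! Pass (4): the assignment through the vector subscript (indirect access).
  do k = 1, cnt
     i = index_sub(k)
     w(i) = one_sqrt_pi*(abs(yy(i)) + j1*xx(i))/xsqr_pls_ysqr(i)
  end do

  ! Reference: the array-syntax form from the question.
  allocate(index_ref(count(mask)))
  index_ref = pack(indx_array, mask)
  w_ref(index_ref) = one_sqrt_pi*(abs(yy(index_ref)) + j1*xx(index_ref))/ &
                     xsqr_pls_ysqr(index_ref)

  if (any(index_sub /= index_ref)) error stop "index arrays differ"
  if (any(abs(w - w_ref) > 1.0e-12_rk*abs(w_ref))) error stop "results differ"
  print '(a,i0)', 'elements selected: ', cnt

  deallocate(index_sub, index_ref)
end program packed_version_passes
```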

>          xsqr_pls_ysqr=(xx**2+yy**2)
>          !-------------
>          ! Asymptotic expression for extremely large |z|
>          where (xsqr_pls_ysqr>=1.0e16_rk) &
>            w=one_sqrt_pi*(abs(yy)+j1*(xx))/(xsqr_pls_ysqr);

In this version, it depends on how smart the compiler is. The compiler 
might generate a temporary mask variable and then use it in the loop, or 
it might put the condition directly in the loop. If it generates a 
temporary, it might do so on the stack, which is faster than an explicit 
allocate on the heap; thus, even that part could be faster.
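
As a rough illustration of the two strategies just described, the sketch below writes out by hand (a) a lowering with a temporary mask and (b) a lowering with the condition inside the loop, and compares both with the WHERE statement. This is only an assumption about what a compiler might generate, not the output of any particular compiler; the sample data, the kind rk, and the values of one_sqrt_pi and j1 are likewise assumed.

```fortran
program where_lowerings
  implicit none
  integer, parameter :: rk = kind(1.0d0)                        ! assumed working precision
  integer, parameter :: n = 6                                   ! hypothetical problem size
  real(rk), parameter :: one_sqrt_pi = 0.5641895835477563_rk    ! assumed to be 1/sqrt(pi)
  complex(rk), parameter :: j1 = (0.0_rk, 1.0_rk)               ! assumed imaginary unit
  real(rk)    :: xx(n), yy(n), xsqr_pls_ysqr(n)
  complex(rk) :: w_where(n), w_tmp(n), w_fused(n)
  logical     :: tmp_mask(n)        ! fixed-size local, stands in for a stack temporary
  integer     :: i

  ! Made-up sample data: elements 2, 4 and 6 have very large |z|.
  xx = [1.0_rk, 2.0e8_rk, 3.0_rk, -4.0e8_rk, 5.0_rk, 6.0e8_rk]
  yy = [1.0_rk, -1.0e8_rk, 2.0_rk, 3.0e8_rk, -4.0_rk, 0.0_rk]
  xsqr_pls_ysqr = xx**2 + yy**2
  w_where = (0.0_rk, 0.0_rk)
  w_tmp   = (0.0_rk, 0.0_rk)
  w_fused = (0.0_rk, 0.0_rk)

  ! The WHERE form from the question.
  where (xsqr_pls_ysqr >= 1.0e16_rk) &
    w_where = one_sqrt_pi*(abs(yy) + j1*xx)/xsqr_pls_ysqr

  ! (a) Temporary mask first, then a masked assignment loop.
  do i = 1, n
     tmp_mask(i) = xsqr_pls_ysqr(i) >= 1.0e16_rk
  end do
  do i = 1, n
     if (tmp_mask(i)) w_tmp(i) = one_sqrt_pi*(abs(yy(i)) + j1*xx(i))/xsqr_pls_ysqr(i)
  end do

  ! (b) Condition placed directly in a single loop, no temporary.
  do i = 1, n
     if (xsqr_pls_ysqr(i) >= 1.0e16_rk) &
        w_fused(i) = one_sqrt_pi*(abs(yy(i)) + j1*xx(i))/xsqr_pls_ysqr(i)
  end do

  ! Both hand-written forms should agree with WHERE (up to rounding).
  if (any(abs(w_tmp   - w_where) > 1.0e-12_rk*abs(w_where))) error stop "(a) differs"
  if (any(abs(w_fused - w_where) > 1.0e-12_rk*abs(w_where))) error stop "(b) differs"
  print '(a,i0)', 'elements assigned: ', count(xsqr_pls_ysqr >= 1.0e16_rk)
end program where_lowerings
```

Both loops access w, xx and yy with unit stride, which is what makes the masked form a better candidate for vectorization than the indirect access through index_sub.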

In addition, the assignment to w will generate a loop, which might be 
vectorizable using masked assignment (with or without a mask temporary). 
By contrast, vector subscripts are very difficult for the compiler to 
vectorize.

And finally, the second version is in my opinion vastly more readable. 
Thus, I would use the second version.

Tobias


