(j3.2006) Integration of co-arrays with the intrinsic shift functions
Mon Jul 23 13:40:14 EDT 2007
> So I'm still not convinced why we are breaking faith with the
> past (e.g., SUM) and defining collectives as subroutines.
The co-intrinsics are *much* better as subroutines than as functions, if they
do not in fact have to be subroutines. Let me repeat some of the points:
> > x=co_shift(x,1)+co_shift(x,-1)
> > either unless (in this order):
> > 1) it is guaranteed that all images execute this statement
> Of course, it is a collective, so all images must execute the
> statement or it is a non-conforming program.
That is easy to require for a CALL statement (we already do it). Functions are
*different*. They are evaluated. They don't even have to be executed; they may
be executed in a different order, once or multiple times, etc. For example, is
this OK:
How about this:
Expressions are misbehaving beasts and a cause of arguments even in a serial
context. Allowing parallelism complications into that mess is a disaster.
Do you want each image to execute exactly the same statement for any statement
that involves an expression that *may* result in the execution of co_shift?
What can be different about the statements on different images (eg, constant
Specifics are everything. Just waving hands will simply not do. The "secret"
discussions you mention considered the above issues and concluded that the
complications were too big.
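
To make the evaluation point concrete, here is a minimal serial sketch. It is
not one of the examples from the original discussion; the names `shifted`,
`shift_sub`, and `skip` are made up for illustration, and the ordinary CSHIFT
intrinsic stands in for co_shift, which is only a proposal.

```fortran
program eval_sketch
  implicit none
  logical :: skip
  real :: x(3), y(3)

  x = [1.0, 2.0, 3.0]
  skip = .true.

  ! Function form: once skip is true the value of the condition is known,
  ! so the processor is free to call shifted() or not. If shifted() were a
  ! collective and skip differed among images, some images might never
  ! take part in it.
  if (skip .or. any(shifted(x) > 2.0)) then
     print *, 'branch taken'
  end if

  ! Subroutine form: a CALL statement is executed whenever control
  ! reaches it, so "every image executes this statement" is easy to state.
  call shift_sub(x, y)
  print *, y

contains

  ! Hypothetical stand-in for a function-style shift.
  pure function shifted(a) result(b)
    real, intent(in) :: a(:)
    real :: b(size(a))
    b = cshift(a, 1)
  end function shifted

  ! Hypothetical stand-in for a subroutine-style shift.
  pure subroutine shift_sub(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(:)
    b = cshift(a, 1)
  end subroutine shift_sub

end program eval_sketch
```

Whether `shifted` is actually called in the IF condition is up to the
processor, which is exactly the difficulty for a collective.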
> > 2) x is a co-array
> X (as actual argument must be a co-array as specified in the function
Yes, this is OK.
> > 3) co_shift returns a co-array, that is, returns an array of the
> > same shape on every image
> co_shift returns a local array. I assume this removes your problem.
No, it *leaves* the problem. There are performance implications to returning a
local array. Co-arrays can be addressed among images more efficiently than
local arrays, on almost any machine. They are designed for that purpose. If
the result is a local array, compilers will typically need to create a
temporary co-array, compute the function and store the result there, and then
copy the result where it is needed. Of course, a lot of this can be optimized
away with some more sophisticated expression optimization. But, I believe,
few existing compilers optimize expressions involving some of our more
complex array functions (like co_shift) to the degree they would optimize a
simple loop performing the same operation, even in a purely serial context.
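
As a rough sketch of the extra work described above, the following spells out
by hand what a compiler might have to generate for a function-style result: a
temporary co-array, the remote transfer into it, and a final local copy. This
assumes a co-array-capable compiler; all names (`tmp`, `y`, `right`) are
hypothetical, and the hand-written circular exchange only stands in for
co_shift, whose exact semantics are not fixed here.

```fortran
program temp_coarray_sketch
  implicit none
  integer, parameter :: n = 4
  real :: x(n)[*]     ! the data, one copy per image
  real :: tmp(n)[*]   ! temporary co-array the compiler might have to create
  real :: y(n)        ! local array standing in for the function result
  integer :: me, np, right

  me = this_image()
  np = num_images()
  right = merge(1, me + 1, me == np)   ! circular neighbour (assumed layout)

  x = real(me)

  ! Step 1: compute the result into the temporary. Each image pushes its
  ! data into its neighbour's tmp, which is why tmp must be a co-array.
  tmp(:)[right] = x(:)
  sync all            ! wait until every image has received its data

  ! Step 2: copy the temporary to where the result is needed.
  y = tmp

  print *, 'image', me, 'received', y(1)
end program temp_coarray_sketch
```

With a subroutine whose result argument is itself a co-array, step 2 and the
separate temporary would not be needed.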
> How can you say that data parallel is not the place of Fortran? It
> is already there! Consider FORALL, WHERE, CSHIFT, EOSHIFT (perhaps
> even MATMULT), array notation, ...
Yes, and it is still there on each image (which might have lots of internal
fine-grained parallelism). Data parallelism is great when it is easy to
express and when compilers can figure it out. The reason we are adding
*explicit* parallelism, with explicit data distribution and flow control, is
to allow much more than that. Also, IMHO, most of the above constructs
and intrinsics, other than array notation, are:
1) Not used heavily in real kernels due to differences in optimization among
2) A source of frustration for the vendors trying to optimize benchmarks (matmul
being a famous monster), resulting in a waste of resources
3) Badly designed to do anything other than vanilla data-parallelism (this is
why we added DO CONCURRENT even though we have FORALL).
4) Used most often as shortcuts for simple and short loops (this is a good
thing, but not a justification for a major change to co-arrays at this date).