(j3.2006) [ukfortran] (SC22WG5.5059) WG5 vote on draft TS on further coarray features
Bill Long
longb
Mon Aug 5 12:06:50 EDT 2013
On 8/5/13 9:56 AM, N.M. Maclaren wrote:
> On Aug 5 2013, Bill Long wrote:
>>>>>>
>>>>>> > Passim. The specification is messy and restrictive, and should be
>>>>>> > changed. For example, it is not possible to reduce INTENT(IN)
>>>>>> > examples.
>>>>
>>>> The INTENT(IN) case seems too trivial to justify changing a spec that
>>>> is increasingly in production use. If you want
>>>>
>>>> call co_sum( <expr>, X)
>>>>
>>>> just write
>>>>
>>>> X = <expr>
>>>> call co_sum(X)
>>>>
>>>> instead. The second form avoids the compiler having to create a temp
>>>> for <expr>, which you would want to avoid anyway if X is an array.
>>>
>>> Er, no, it doesn't. You have just created a named temporary: X.
>>
>> Huh? X is the RESULT argument in the first example. It is where the
>> user wants the answer. It has to be an already existing variable, not
>> a temp.
>
> That is only the case when you want the result on all images.
X has to exist on every image, regardless of whether all images or just
one gets the result.

Reductions involve a tree operation. The images accumulate the result
while going up the tree. If all images get the result, it is broadcast
back down the same tree. If only one gets the result, it is assigned on
that image and the other images just return.

To minimize the space needed and the number of copy operations, the RESULT
variable (if supplied, and SOURCE if not) is used as the internal
scratch space for the tree. The outcome is just what we all want:
minimal memory use and minimal copying.
>
> I also forgot to mention that this needs an EXTRA copy in all cases,
> which can cause a LOT of inefficiency. Like this:
>
> With the two-argument form, the source is read once and the result
> written once.
> With your solution, the source is read, the result written, the
> result read and the result written.
>
>>> Indeed, one of the main reasons to want a proper two-argument form
>>> IS to avoid an unnecessary array copy and the consequent inefficient
>>> use of space.
>>
>> Quite the opposite. The one-argument form uses less space. The
>> two-argument form requires double the space. The two-argument form,
>> indeed, involves a copy from SOURCE to RESULT as part of the
>> operation, since we are not changing the value of SOURCE.
>
> Again, not in the case when only one image wants the result.
>
>>> Consider a large number of images and reducing onto a single result
>>> image. You are now forcing all of the other images to copy the input
>>> argument.
>>
>> If SOURCE is INTENT(IN) such a copy is required anyway.
>
> Why on earth is that needed? MPI doesn't do that.
MPI can't even move a value to another image without making a copy. And
it does so using extra internal buffers, sucking memory away from the
user, instead of using already existing user variables.
>
>> If SOURCE is left INTENT(INOUT) then the copy can be avoided.
>
> Not in well-engineered code, where read-only arguments are INTENT(IN).
You seem to be hung up on the names SOURCE and RESULT. Would you be
happier if the interface were

    co_sum (RESULT [,SOURCE, ...])

which is more like assignment (where the result is to the left of the
source)? With that configuration, the optional SOURCE would be INTENT(IN).
RESULT would be INTENT(INOUT). This change would not be too disruptive
since calls like

    call co_sum(X)

would be unchanged. Since this seems to be a common usage pattern, many
of the existing codes would still be conforming.
> For heaven's sake, let's not go introduce features where good software
> engineering is incompatible with performance - that way lies C++!
>
>> The time for the copy is independent of the number of images, so it is
>> not that material. Making a new temp is potentially the more
>> problematic action.
>
> NO, it is NOT - not on any current multi-core system. None of them
> have enough memory system capacity to allow all cores to copy data in
> parallel (or, often and increasingly, more than a very small number
> of them).
Well, that is a problem only if the data is not in cache. I agree that
the memory bandwidth on modern processors is not nearly what it was in
days past. But we all have the same issue - the Intel Xeon chips found
in most supercomputers are the same ones used in white-box servers.
Fortunately, in the case of the reductions, these memory operations tend
to be slightly time-shifted, so they don't seem to be a problem. We
intentionally do not have a global barrier at the beginning of the
collective subroutines, partly for this reason. The barriers are all
internal, as part of the tree traversal, and occur after these copies.
>
> The point is that, for every user who uses machines like the ones you
> are aiming for, there are tens or hundreds who use commodity desktops
> and small servers. And, if Fortran coarrays run like a drain if each
> image is assigned to a core, that's a very, very bad idea.
Properly implemented coarrays should be very fast on the shared-memory
configuration of most desktops and small servers. Certainly better than
MPI, which involves additional internal memory copies. Assigning an
image to each core is not a requirement. Indeed, assigning an image to
each SMP node and using OpenMP within the node is quite common.
Cheers,
Bill
>
>
> Regards,
> Nick Maclaren.
>
--
Bill Long longb at cray.com
Fortran Technical Support & voice: 651-605-9024
Bioinformatics Software Development fax: 651-605-9142
Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101