(j3.2006) [ukfortran] (SC22WG5.5059) WG5 vote on draft TS on further coarray features

N.M. Maclaren nmm1
Mon Aug 5 13:22:51 EDT 2013


On Aug 5 2013, Bill Long wrote:
>>>>>>>
>>>>>>> > Passim. The specification is messy and restrictive, and should be
>>>>>>> > changed. For example, it is not possible to reduce INTENT(IN)
>>>>>>> > examples.
>>>>>
>>>>> The INTENT(IN) case seems too trivial to justify changing a spec that
>>>>> is increasingly in production use.  If you want
>>>>>
>>>>> co_sum( <expr>, X)
>>>>>
>>>>> just write
>>>>>
>>>>> X = <expr>
>>>>> co_sum(X)
>>>>>
>>>>> instead.  The second form avoids the compiler having to create a temp
>>>>> for <expr>, which you would want to avoid anyway if X is an array.
>>>>
>>>> Er, no, it doesn't.  You have just created a named temporary: X.
>>>
>>> Huh?  X is the RESULT argument in the first example. It is where the
>>> user wants the answer.  It has to be an already existing variable, not
>>> a temp.
>>
>> That is only the case when you want the result on all images.
>
>X has to exist on every image, independent of whether all images or just 
>one gets the result.

Actually, no, it doesn't.  It could be an OPTIONAL argument that is
required only on the root.  But it also doesn't need to be full size
on the other images.
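
As an illustration only, here is a minimal single-process Python sketch of
a reduction onto one root image. It is not coarray code: the name
`reduce_to_root` and the list-of-lists model of the images are assumptions
made for this example. The sources are only read, and a full-size result
exists on the root alone.

```python
def reduce_to_root(sources, root=0):
    """Sum one array per simulated image onto the root image only.

    `sources` is only read, never written (the INTENT(IN) case).
    Returns one entry per image: the summed list on `root`, None elsewhere.
    """
    length = len(sources[0])
    result = [0] * length            # full-size buffer exists on the root only
    for src in sources:              # each image's source is read exactly once
        for k in range(length):
            result[k] += src[k]
    return [result if image == root else None for image in range(len(sources))]
```

Calling `reduce_to_root([[1, 2], [3, 4], [5, 6]], root=1)` leaves the three
source lists untouched and returns a result on image 1 only.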

>Reductions involve a tree operation.  The images accumulate the result 
>while going up the tree.  If all images get the result, it is broadcast 
>back down the same tree.  If only one gets the result, it is assigned on 
>that image and the other images just return.

That is ONE implementation.  There are many others.
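
To make that concrete, here is a hedged Python sketch (a single-process
simulation, not coarray code; both function names are hypothetical) of two
different schemes that give every image the same sum: the tree followed by
a broadcast, as described above, and recursive doubling, which has no root
and no separate broadcast phase.

```python
def tree_allreduce(values):
    """Binary-tree sum onto image 0, then broadcast back to all images."""
    n = len(values)
    acc = list(values)               # one partial sum per simulated image
    step = 1
    while step < n:                  # going up the tree
        for i in range(0, n, 2 * step):
            if i + step < n:
                acc[i] += acc[i + step]
        step *= 2
    return [acc[0]] * n              # broadcast from the root


def recursive_doubling_allreduce(values):
    """Pairwise exchange at doubling distances; needs a power-of-two count."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of images")
    acc = list(values)
    step = 1
    while step < n:
        # every image adds the partial sum held by its partner i XOR step
        acc = [acc[i] + acc[i ^ step] for i in range(n)]
        step *= 2
    return acc
```

Both return the total on every image for the same input; they differ in
which images hold partial sums at each step, which is exactly the sort of
detail a standard should not fix.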

>To minimize the space needed and number of copy operations, the RESULT 
>variable (if supplied, and SOURCE if not) is used as the internal 
>scratch space for the tree.   The outcome is just what we all want - 
>minimal memory use and minimal copying.

That is a case of one particular implementation driving the standard,
which is Just Not On.

>> I also forgot to mention that this needs an EXTRA copy in all cases,
>> which can cause a LOT of inefficiency.  Like this:
>>
>>     With the two-argument form, the source is read once and the result
>> written once.
>>     With your solution, the source is read, the result written, the
>> result read and the result written.
>>
>>>> Indeed, one of the main reasons to want a proper two-argument form
>>>> IS to avoid an unnecessary array copy and the consequent inefficient
>>>> use of space.
>>>
>>> Quite the opposite.  The one-argument form uses less space.  The
>>> two-argument form requires double the space.  The two-argument form,
>>> indeed, involves a copy from SOURCE to RESULT as part of the
>>> operation, since we are not changing the value of SOURCE.
>>
>> Again, not in the case when only one image wants the result.
>>
>>>> Consider a large number of images and reducing onto a single result
>>>> image.  You are now forcing all of the other images to copy the input
>>>> argument.
>>>
>>> If SOURCE is INTENT(IN) such a copy is required anyway.
>>
>> Why on earth is that needed?  MPI doesn't do that.
>
>MPI can't even move a value to another image without making a copy.  And 
>it does so using extra internal buffers, sucking memory away from the 
>user, instead of using already existing user variables.

That is not true, and the MPI designers went to great trouble to ensure
that it was not required.  I have used several MPIs which did not make an
internal copy.

>>> If SOURCE is left INTENT(INOUT) then the copy can be avoided.
>>
>> Not in well-engineered code, where read-only arguments are INTENT(IN).
>
>You seem to be hung up on the names SOURCE and RESULT.  Would you be 
>happier if the interface were
>
>co_sum (RESULT [,SOURCE, ...])
>
>which is more like assignment (where result is to the left of source). 
>With that configuration, the optional SOURCE would be INTENT(IN). 
>RESULT would be INTENT(INOUT).   This change would not be too disruptive 
>since calls like
>
>call co_sum(X)
>
>would be unchanged.  Since this seems to be a common usage pattern, many 
>of the existing codes would still be conforming.

I need to think about that, but very probably.  My objection to the
current proposal is not to the syntax but the semantics.

>> For heaven's sake, let's not go introduce features where good software
>> engineering is incompatible with performance - that way lies C++!
>>
>>> The time for the copy is independent of the number of images, so it is
>>> not that material. Making a new temp is potentially the more
>>> problematic action.
>>
>> NO, it is NOT - not on any current multi-core system.  None of them
>> have enough memory system capacity to allow all cores to copy data in
>> parallel (or, often and increasingly, more than a very small number
>> of them).
>
>Well, that is a problem only if the data is not in cache.

NO, IT IS NOT!!!!!

Firstly, data are not either "in cache" or not - they can be in any one
of several levels of cache and, MUCH more importantly, can be shared or
exclusive (and, on some architectures, more than that).  The performance
difference of write accesses can be MUCH lower than that of read ones.

Secondly, ANY cache traffic (including fetches from most of the levels)
can cause overload effects that slow down completely unrelated accesses.
That can cause unrelated software to fail, occasionally even crashing
the system.

> I agree that 
>the memory bandwidth on modern processors is not nearly what it was in 
>days past.  But we all have the same issue - the Intel Xeon chips found 
>in most supercomputers are the same ones used in white-box servers.

Yes, and they have exactly the problems I mention.

>Fortunately, in the case of the reductions, these memory operations tend 
>to be slightly time-shifted, so don't seem to be a problem.  We 
>intentionally do not have a global barrier at the beginning of the 
>collective subroutines partly for this reason.  The barriers are all 
>internal, as part of the tree traversal, and occur after these copies.

Oh, really?  Well, I have seen such problems.  See above about overload.

>> The point is that, for every user who uses machines like the ones you
>> are aiming for, there are tens or hundreds who use commodity desktops
>> and small servers.  And, if Fortran coarrays run like a drain if each
>> image is assigned to a core, that's a very, very bad idea.
>
>Properly implemented coarrays should be very fast on the shared-memory 
>configuration of most desktops and small servers. Certainly better than 
>MPI which involves additional internal memory copies.  Assigning an 
>image to each core is not a requirement.  Indeed, assigning an image to 
>each SMP node, and using OpenMP within the node is quite common.

It should be, but the current specification is NOT the way to ensure
that for some applications.


Regards,
Nick Maclaren.



