(j3.2006) (SC22WG5.4478) Rice University Commentary on ISO/IEC JTC1/SC22/WG5 N1835 (Addition/Modification of CAF Features)

David Muxworthy d.muxworthy
Thu Jun 16 05:21:13 EDT 2011


The following message was blocked by the sc22wg5 mailserver.
David Muxworthy
-------------------------------------------------------------
From: Bill Scherer <scherer at rice.edu>
Subject: Rice University Commentary on ISO/IEC JTC1/SC22/WG5 N1835  
(Addition/Modification of CAF Features)
Date: Wed, 15 Jun 2011 19:26:49 -0500
Cc: cafcompiler-l at rice.edu
To: sc22wg5 at open-std.org, John.Reid at stfc.ac.uk


Attached, please find Rice University's Commentary on ISO/IEC JTC1/ 
SC22/WG5 N1835 (Addition/Modification of CAF Features).

Thank you for your consideration.

A Critique of ISO/IEC JTC1/SC22/WG5 N1835 (Addition/Modification of CAF Features)
---------------------------------------------------------------------------------

    Laksono Adhianto, John Mellor-Crummey, Guohua Jin, Karthik  
Murthy, Dung Nguyen,
            William N. Scherer III, Scott Warren, and Chaoran Yang
{laksono, johnmc, jin, Karthik.S.Murthy, dxnguyen, scherer, scott,  
chaoran}@rice.edu


In this article, we provide commentary on the feature additions/ 
modifications
to J3/08-131r1 based on the discussion held in Sept 2010. This  
document is
available online as ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/ 
N1835.txt.  Our
commentary is based on our experiences with developing and using the  
Rice
Coarray Fortran 2.0 (Rice CAF 2.0) programming language, runtime, and
translator.


Proposal 1.
-----------

We generally support this proposal; however, we believe that a larger  
set of
intrinsics would be useful. In particular, the full set of collectives
supported by MPI seems worth considering.

Although we did not implement Rice CAF 2.0 collectives in this  
manner, having
an optional result parameter seems reasonable to us.


Proposal 2.
-----------

We agree that "raw" atomic operations are useful for development of
high-performance synchronization and concurrency routines. We suggest  
that
the committee consider the equivalent feature set from the Java  
programming
language, which appears in the java.util.concurrent library, as it  
has been
very successful in that community. Specifically, it supports two key  
features
that are missing from this proposal:

(1) atomic swap, also known as fetch-and-store, is necessary for the
implementation of commercially important algorithms including the  
acquire()
routine for the widely used MCS queue-based lock.  Although atomic  
swap can
be simulated via a looped CAS construct, this is an imperfect  
approximation
because the CAS loop can fail arbitrarily many times (starvation) before
success; atomic swap is guaranteed to complete within a bounded  
length of
time.
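The distinction can be seen in a small C11 sketch (C is used here because the Fortran spelling of these intrinsics is not yet fixed; all names are ours). A native swap completes in one bounded step, while the CAS-based emulation must loop and can, under contention, fail indefinitely:

```c
#include <stdatomic.h>

/* Illustrative C11 sketch (names are ours, not the proposal's).
   A native atomic swap always completes in one bounded step: */
int swap_native(_Atomic int *loc, int new_val) {
    return atomic_exchange(loc, new_val);
}

/* Emulating swap with CAS must loop; under contention the CAS can
   fail arbitrarily many times before succeeding (starvation): */
int swap_via_cas(_Atomic int *loc, int new_val) {
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, new_val)) {
        /* 'old' now holds the refreshed value; retry */
    }
    return old;
}
```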

(2) CAS on pointer values -- equivalent to the
java.util.concurrent.AtomicReference class -- is necessary for the
implementation of virtually all concurrent algorithms that are  
currently in
use.  In C, support for integers is sufficient because its more  
powerful cast
operations allow the programmer to cast a pointer to an integer type;
however, the equivalent functionality is not present in Fortran due  
to its
stronger typing.
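To illustrate why pointer CAS matters, consider a C11 sketch of the classic lock-free stack push, which performs the CAS directly on a pointer-valued head, much as java.util.concurrent's AtomicReference permits; without pointer-to-integer casts, Fortran cannot express this pattern today (names here are ours):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative C11 sketch of a lock-free stack push: the CAS operates
   directly on a pointer value, not an integer. */
typedef struct node { int value; struct node *next; } node;

void push(_Atomic(node *) *head, node *n) {
    n->next = atomic_load(head);
    while (!atomic_compare_exchange_weak(head, &n->next, n)) {
        /* n->next was refreshed with the current head; retry */
    }
}
```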

We note that the restriction of types to exclude variables of type  
real seems
arbitrary; however, we have no opinion on whether they should be  
explicitly
included as possible targets of the atomic instructions.

Finally, we observe that some level of protection against the so- 
called ABA
problem is desirable. The ABA problem occurs when a CAS is made  
against a
value that has changed but has then accidentally changed back to its  
original
value between when it was first read and when the CAS is effected.   
In this
case, it is usually wrong (algorithmically) for the CAS to succeed; this
leads to subtle corruption and difficult-to-diagnose race conditions in
code.
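The problematic interleaving can be replayed in a single thread as a C11 sketch (illustrative names and values only):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Single-threaded replay of the ABA interleaving.  Returns true when
   the CAS succeeds even though the location changed value twice
   between the initial read and the CAS. */
bool aba_cas_succeeds(void) {
    _Atomic int loc = 100;               /* value "A"                  */
    int expected = atomic_load(&loc);    /* this image reads A ...     */
    atomic_store(&loc, 200);             /* ... another image writes B */
    atomic_store(&loc, 100);             /* ... then writes A back     */
    /* The CAS sees A again and succeeds, masking the change: */
    return atomic_compare_exchange_strong(&loc, &expected, 300);
}
```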

We additionally refer the committee to the C++ atomics  
standardization work
by Hans Boehm and Lawrence Crowl [1].


Proposal 3.
-----------

3b) We generally concur that restrictions should only be present when
absolutely necessary.

3c) In our view, there is already enough confusion in the world about  
the
difference between global and local synchronization.  They are very  
different
things; combining them into a single sync statement will only serve to
increase the confusion.

3d) We see no problem with allowing functions to have side effects.  Rather
than an IMPURE attribute that marks a function as having side effects,
however, we espouse a PURE attribute that is an explicit promise, made by the
programmer, that a function is side-effect free.

3e) Fundamentally, we disagree with requiring MPI in addition to  
Fortran in
order to have a complete programming model: [Coarray] Fortran should  
stand on
its own.  There is substantial utility to having a rich set of  
collectives;
and compiler support for them can greatly ease the burden on the  
programmer
(and reduce opportunities for error) when using them.  For example,  
in the
Rice CAF 2.0 implementation, we have built support in the compiler to
automatically compute sizes of data and to generate callback functions.

3f) Teams are needed for coupled codes and are very useful for linear  
algebra
applications.  Again, we disagree strongly with requiring MPI in  
addition to
Fortran in order to have a complete programming model.  This is  
particularly
true when an all-coarray Fortran program could be aesthetically  
pleasing.

3g) We dislike notify and query as we strongly prefer first-class  
events.
Instead of directly synchronizing with another processor, we find it  
a far
better programming model to synchronize with an event that is logically
connected to remote data.  Further, events provide a safe  
synchronization
space: If a library method notifies an event, that notification  
cannot be
picked up by a waiting operation in user code, but with direct
processor-to-processor synchronization, the same cannot be said.   
Debugging
synchronization errors of this form is slow, tedious, and painful.

3h) Rather than have an intrinsic isMyLock that is specific to locks, we
propose extending imageof() from handling just copointers to also  
handling
locks and events.  However, we note that many implementations will  
wish to
use a test-and-test-and-set lock, for which lock ownership  
information is not
normally stored with the lock.

Rather than isLocked(), we would suggest adding a trylock() function  
that
attempts to acquire a lock if it is unlocked and fails otherwise.
Programmers should not write their own spin loops.  Locks can  
implement their
own spins, including spin-then-yield code as appropriate.  This gains
efficiency since no traversal of data structures is necessary to find  
the
memory location to spin on.
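A minimal sketch of the proposed trylock() semantics, written in C11 over a simple test-and-set lock since the Fortran form is not specified here; the lock owns its own state, and callers never write their own spin loops:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative trylock over a test-and-set lock (names are ours). */
typedef struct { atomic_flag held; } lock_t;

bool trylock(lock_t *l) {
    /* test-and-set returns the previous state: false means the lock
       was free, so we have just acquired it */
    return !atomic_flag_test_and_set(&l->held);
}

void unlock(lock_t *l) {
    atomic_flag_clear(&l->held);
}
```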


On the subject of locks, we note that formal locksets allow multi-lock
locking to occur in a canonical order; this provides a degree of safety
against cyclic deadlock in multi-lock codes.  A very simple canonical  
order
would be the address of the lock variables.

3i) While compatibility is useful, we reiterate our stance that  
Fortran 2008
should stand on its own.  For example, the compiler can generate
multithreaded or CUDA code from a do concurrent loop.  Requiring CUDA +
OpenMP + MPI + CAF is far less aesthetically appealing than an all-CAF
solution.


Proposal 4.
-----------

This proposal is subsumed by our approach to copointers, the details  
of which
appear in Appendix II.  In particular, we observe that adding the  
cotarget
attribute to a non-coarray variable makes it a coscalar by requiring  
that it
be allocated in shared memory space.

We see no need for the relocate() statement nor for the image=  
qualifier to
the allocate statement.  Functionality equivalent to relocate() can be
achieved by just reallocating the scalar and copying data from the old
location to the new.  Functionality equivalent to the image=  
qualifier can be
achieved by placing a conditional around the allocation statement:

         if (mype .eq. 4) then
            allocate(foo)
         endif

We note that for caching purposes, it suffices to copy a coscalar to  
a local
variable.

In general, the heap is not symmetric; providing optimizations based  
on an
assumption otherwise seems ill advised.


Proposal 5.
-----------

We are in full agreement that asynchronous collective operations are  
useful
and desirable.  In fact, we have used them to good effect in  
developing Rice
CAF 2.0 implementations for the High Performance Computing Challenge  
(HPCC)
benchmarks [2].

Rice CAF 2.0 supports two variants of asynchrony for collectives.  In  
the
explicit model, an event variable is supplied as a parameter to the
collective.  Upon completion of the collective operation, the event is
notified.  This allows the programmer to determine when the collective
operation has completed so that subsequent code, predicated on  
completion of
the collective, may be executed.

         co_sum_async(some_coarray, some_event)  ! kick off a reduction
         ...                                     ! overlap computation with it
         event_wait(some_event)                  ! ensure it has completed


In contrast, in the implicit model, the programmer omits the event  
variables
and instead calls an explicit "cofence" to be sure that all pending
operations have completed:

         co_sum_async(some_coarray)  ! kick off an asynchronous reduction
         ...                         ! overlap computation with the reduction
         cofence                     ! ensure it has completed

For more details of the cofence, see Appendix I.


In addition to collectives, we have found substantial benefit in  
supporting
two other asynchronous functions:

(1) An asynchronous barrier offers the same functionality as does a
split-phased barrier.  Triggering the barrier is equivalent to a  
notify, and
waiting on the event/blocking with a cofence is equivalent to awaiting
completion of the barrier.

(2) A predicated asynchronous copy allows data to be transferred to/ 
from a
remote image as soon as it is ready, and automatically notifies when
the copy
has completed.  This is useful, for example, in a scenario where we have
initialization to perform and need data from a partner:

          copy_async(my_buffer, remote_buffer[partner], pred_event, &
                     data_copied_event)
          ...
          ! perform other initialization while waiting for the data
          ...
          event_wait(data_copied_event)      ! make sure we have the data
          ! proceed with computation

Here, we have overlapped the computation of our initialization with the
communication of data into my_buffer from partner's remote_buffer.


Proposal 6.
-----------

Issue A) We believe that this is a non-issue.  The allocation of  
coarray 'a'
on team c would overwrite the pointer to a on overlapping members;  
there can
be only one 'a' on any image.  This would of course be a programming  
error
that could be checked at runtime when trying to allocate an already- 
allocated
pointer.

In the Rice CAF 2.0 implementation, coarrays are registered after  
allocation;
the name duplication conflict would manifest in this stage if it had not
previously been detected.

Issue B) As detailed in Tony Skjellum's rationale for MPI libraries [3],
reindexing is crucial if support libraries are to be developed.  We  
agree
that having ranks > 1 poses several logistical problems from a language
viewpoint.  This is precisely why we oppose having more than one rank  
for
codimensions.

However, to give the functionality of multiple dimensions, we support
topologies.  In particular, with a cartesian topology, one can write  
code
that appears to index multiple ranks.  The indexing is reduced by the
topology to a linearized one-dimensional index into the single  
physical rank
for the coarray.  This resolves the issues described here.
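The reduction a cartesian topology performs can be sketched in a few lines of C (row-major and 0-based for brevity; the names are ours, not the CAF 2.0 API):

```c
/* A 2-D cartesian topology over nx*ny images maps a logical coindex
   (i, j) to the linearized one-dimensional index of the coarray's
   single physical corank. */
typedef struct { int nx, ny; } cart2d;

int linearize(cart2d t, int i, int j) {
    return i * t.ny + j;
}
```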

Issue C). We believe that teams are very useful for many applications,
including coupled codes and linear algebra applications to name two.   
We urge
the committee not to remove them from the Fortran 2008 specification.


Proposal 7.
-----------

This proposal is subsumed by our approach to copointers, the details  
of which
appear in Appendix II.


Proposal 8.
-----------

We agree that this proposal has appeal.  In fact, an early version of  
our CAF
2.0 implementation supported asymmetric coarrays.  But when we tried  
it, it
caused chaos with reshaping of arrays.

Suppose, for example that we have 2D arrays of different sizes.  Now  
suppose
we pass column 3 to a local subroutine, which then tries to access that
column on another image.  We see no reasonable way to handle the case  
where
that column does not exist on the remote image.  Further, even if the  
remote
image *does* have a third column in the coarray, what if the columns  
are of
differing lengths? The subroutine has no good way to know the bounds  
of the
column on the remote image. A semantic problem occurs when we attempt to
access the entire column (via a ':' operator): does the colon refer  
to the
local or remote bounds?

For all of these reasons, we dropped support for asymmetric coarrays  
from our
CAF 2.0 compiler.


Proposal 9.
-----------

We note that when reading a standard, it is useful to have names that  
are
logically associated appear near to each other in the standard,  
including the
index and a table of intrinsics.  For this reason, we have adopted  
event_wait
and event_notify in the Rice CAF 2.0 implementation.

9.1) As detailed in our memory model notes (see Appendix I), notify  
should be
a "release" operation.  Coarray operations that appear after a notify  
can
execute before the notify, but no coarray operations before a notify  
should
execute after the notify.  This is needed to make events reasonable:  
If a
programmer writes to a remote coarray then performs a notify to  
signal that
the write has completed, the write had better not be delayed until  
after the
notification!

Similarly, query should be an "acquire" operation (antisymmetric  
dependences).
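The C11 memory model offers a concrete analogy (an illustration only, not the proposed Fortran semantics): notify corresponds to a release store and query to an acquire load.

```c
#include <stdatomic.h>

int payload;              /* stands in for data written to a coarray */
_Atomic int flag = 0;     /* stands in for the event being notified  */

void notify_side(void) {
    payload = 42;         /* the coarray write ...                   */
    /* ... may not be delayed past the release ("notify"): */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int query_side(void) {
    /* the acquire ("query") guarantees the payload write is visible */
    while (!atomic_load_explicit(&flag, memory_order_acquire)) { }
    return payload;
}
```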

In general, the semantics of notify should be non-blocking.   
Notification
should occur after the communication completes, but there is no need  
to block
the caller until that time.  This would just make it harder to overlap
communication latency with computation (which is crucial for extracting
maximum performance in HPC environments).

9.2) It seems strange to separate the image number and event name  
when they
could be combined into a single parameter.  For example, the second  
statement
below seems far more intuitive and in keeping with existing coarray  
syntax:

         notify(3, some_event(i))       ! As proposed
         notify(some_event(i)[3])       ! Implemented in Rice CAF 2.0

9.3) We disagree with restricting the number of outstanding notifies  
to one.
For example, a bounded buffer implementation could take advantage of  
-- and
would require -- higher limits.

9.4) We note that image numbers should be relative to a team.  For  
example,
in the following call, j is relative to the team some_team, not an  
absolute
image number:

         notify(some_event(i)[j@some_team])

9.5) Please don't conflate notify and query with (asynchronous)  
barriers.
Point-to-point and collective operations should be kept as separate
operations.


On the subject of events, similar to the locksets we proposed earlier  
in this
document, we propose eventsets.  As implemented in Rice CAF 2.0, these
collections of events offer programmers the following convenient
functionality:

         notifyall:    perform a notify on each member event
         waitall:      wait for each member event to be notified
         waitany:      wait for one member event to be notified, similar to
                       the socket library select() method
         waitanyfair:  wait for one of the member events that has received
                       the fewest notifications to be notified

Since it may not be obvious, the intent behind waitanyfair is that by  
calling
it in a loop exactly N times, where N is the cardinality of the  
eventset, it
is guaranteed that each component event will have been notified  
exactly once
at the termination of the loop.
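The selection rule can be sketched as follows (an illustrative C sketch with hypothetical names; the Rice CAF 2.0 implementation may differ). Calling it with one pending notification per event consumes each event exactly once over N calls, matching the guarantee above.

```c
/* waitanyfair selection rule: among events with pending notifications,
   consume one from an event that has had the fewest notifications
   consumed so far. */
#define NEVENTS 3

int waitanyfair(int pending[NEVENTS], int consumed[NEVENTS]) {
    int pick = -1;
    for (int e = 0; e < NEVENTS; e++)
        if (pending[e] > 0 && (pick < 0 || consumed[e] < consumed[pick]))
            pick = e;
    if (pick >= 0) { pending[pick]--; consumed[pick]++; }
    return pick;   /* -1 when no notification is pending */
}
```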


References
----------

[1] Hans-J. Boehm, Lawrence Crowl. C++ Atomic Types and Operations.  
ISO/IEC
JTC1 SC22 WG21 N2427 = 07-0297 - 2007-10-03.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html.

[2] HPC Challenge benchmark. http://icl.cs.utk.edu/hpcc.

[3] A. Skjellum, N. E. Doss, and P. V. Bangalore. Writing libraries in
MPI. In A. Skjellum and D. S.  Reese, editors, Proceedings of the  
Scalable
Parallel Libraries Conference, pages 166–173. IEEE Computer
Society Press,
October 1993.



Appendix I: Commentary on the Fortran 2008 Memory Model
-------------------------------------------------------

In this section, we present our views/comments on the memory model  
described
in the draft Fortran 2008 standard (while the memory model has not been
formally described in the F2008 standard, our views are based on  
information
mined from it, especially Section 8.5: Image Execution Control).


Comment #1: The draft standard does not define the consistency  
requirements
within a segment.

We recommend processor consistency for coarray reads/writes within a  
segment.

The absence of any form of consistency within a segment allows  
aggressive
compiler/hardware reorderings; this forces the programmer to introduce
numerous memory fences in the program for correctness, making the code
harder to optimize.


Comment #2: The current memory model effects a difficult programming  
model.

We believe that the average programmer should not be exposed to the
intricacies of the memory model (such as needing to use sync_memory) in
order to write correct code.
	
The current memory model supports a "performance-first" approach by allowing
aggressive compiler/hardware optimizations that reorder operations within a
segment, or between segments that are not ordered via image control
constructs. Programmers must therefore use sync_memory in many places to
avoid subtle race conditions, especially when asynchronous operations are
employed.


Comment #3: The current memory model lacks predicated fences.

We believe that predicated fences, such as our cofence, are a necessary
addition.

The current memory fences (sync_memory) are not sufficiently flexible to
provide the required performance tuning that advanced programmers  
would need.
As currently described, sync_memory acts as a barrier for all memory and
coarray operations. However, advanced programmers need constructs to
separately capture the local/global completion of coarray operations,
especially asynchronous ones.

The cofence construct allows programmers to control the local  
completion of
put, get, and implicitly synchronized asynchronous operations. The  
cofence
API is as follows:

         cofence({DOWNWARD=PUT/GET/PUT_GET}, {UPWARD=PUT/GET/PUT_GET})

Cofence takes two optional arguments.  The first specifies which categories
of implicit asynchronous operations (puts, gets, or both) may move downward
across the cofence, and the second specifies which may move upward.
Depending upon the argument values passed, the cofence allows puts, gets, or
both to pass across it in the specified direction.

Let us consider a collective asynchronous broadcast operation to  
understand
the use of cofences in tuning performance.

         ! process p is performing a broadcast
         broadcast_async(buffer, p)
         cofence(DOWNWARD=GET, UPWARD=PUT_GET)

         ! after the cofence, buffer can be safely overwritten
         buffer = ...

         ! wait for global completion of the broadcast
         cofence

In the above code sample, process p is performing an asynchronous
broadcast. Once p sends the broadcast data to its children (i.e., the
broadcast is locally complete in p), p does not need to participate in the
remainder of the broadcast. Process p can thus overlap useful work, such as
preparing the next iteration of the buffer, with waiting for the broadcast
to complete.


While waiting for this local completion, it is more efficient to allow other
"get" memory operations to move later (pass downward), and "put/get" memory
operations to move earlier (pass upward), relative to the cofence.  A full
memory barrier would not permit these reorderings.

The broadcast is globally complete when all participating processes  
obtain
the broadcast data. Global completion is important for process p if
activities after the broadcast in p are dependent directly/ 
transitively on
the assumption that the other processes have received the broadcast  
data.


Comment #4: It is not clearly stated (but it is implied) that functions
should not have side effects. This should be clarified in the standard.



Appendix II: Copointers in Rice CAF 2.0
---------------------------------------

CAF 2.0 adds global pointers to the Fortran language in support of  
irregular
data decompositions, distributed linked data structures, and parallel  
model
coupling. The definition and use of these new "copointers" is as  
similar as
possible to ordinary Fortran pointers: they are declared with new  
attributes
analogous to 'pointer' and 'target', manipulated with the existing '=>'
pointer assignment statement, and inspected with the existing pointer
intrinsics. Accessing data via copointers is as similar as possible to
existing coarray accesses, with implicit access to the local image and
explicit access to remote images using a square-bracket notation. CAF  
2.0's
copointers may point to values of any type, including coarrays; we  
believe
that copointers to coarrays will be especially valuable for parallel  
model
coupling in systems like the Community Earth System Model. Copointers  
can be
implemented easily and efficiently in existing CAF compilers; we have  
already
begun adding them to our prototype CAF 2.0 compiler.

The rest of this note explains the copointer concept in more detail,  
then
describes how copointers are declared, created, copied, dereferenced,  
and
inspected. It closes by mentioning a few nonobvious semantic details and
sketching an implementation strategy.

The approach here is tutorial rather than formal and terminology is  
for the
most part programmer-oriented rather than compatible with the Fortran
standard documents. For instance, we usually say "variable" rather  
than the
standards' "entity" and "points to" rather than "is associated with".  
But not
always!

COPOINTERS AND COTARGETS

Copointers are typed "global pointers" which can point to storage on any
processor ("image") in a parallel computer. Each copointer points to a
specific typed block of storage (a Fortran "entity") allocated on a  
specific
image. Despite the "co" in their name, copointers are not distributed
across
images like coarrays; each copointer is a small scalar value residing  
on a
single image. Apart from their global reach, the semantics of  
copointers is
nearly identical to the semantics of ordinary Fortran pointers:  
copointer
variables and copointer components of derived types may be declared,  
set to
point to other entities, copied, dereferenced, sectioned via  
subscripting to
yield copointers to subentities, and examined via the existing Fortran
pointer intrinsics. It may be helpful to think of a copointer as a  
pair <i,p>
where 'i' is an image number and 'p' is an ordinary Fortran pointer  
valid on
'i', although the implementation may be different.

Cotargets are entities which may become the destination of a  
copointer. Such
entities must be declared with the 'cotarget' attribute, just as  
potential
destinations of ordinary pointers must be declared with the 'target'
attribute. If a CAF2 implementation relies on special "shared memory"  
regions
for efficient communication between images, then it will allocate  
entities
with the 'cotarget' attribute in such a region. Cotarget entities are  
in all
other respects ordinary entities and may be used locally without  
restriction.

Copointer values may be freely copied, even from one image to  
another, and
each new copy points to the same specific storage block on the same  
specific
image as does the original copointer. Creating and copying copointers  
are
cheap, purely local operations. So is dereferencing a copointer that  
happens
to point to the image doing the dereferencing. Dereferencing a copointer
which points to a different image requires the same sort of  
communication as
a corresponding off-image coarray reference.

DECLARING COPOINTERS AND COTARGETS

Copointer and cotarget entities are declared with the usual Fortran
declaration syntax augmented with new 'copointer' and 'cotarget'
attributes. For instance, to declare an integer array and a copointer  
which
can point to it, we write

         integer, dimension(10), cotarget :: a1
         integer, dimension(:), copointer :: p1

This makes 'a1' an array of 10 integers allocated in shared memory  
and 'p1' a
copointer variable of compatible type. Copointers may point to  
entities of
any type, subject to the limitations of Fortran's attribute syntax as
explained in the next paragraph. In particular CAF 2.0 allows  
copointers to
coarrays, providing an expressive and efficient mechanism for model  
coupling
in large parallel codes. The 'copointer' and 'cotarget' attributes  
may be
combined with other Fortran attributes just as 'pointer' and 'target'  
may
be. For instance,

         type(t), dimension(:,:), save, contiguous, copointer :: p2

declares a copointer entity 'p2' which points to two-dimensional  
arrays of
elements of derived type 't', which retains its association across  
subprogram
invocations, and which can only be associated with contiguous cotarget
arrays.

Declaring cotargets needs no further explanation. To describe how  
copointer
types are declared we must first consider a key syntactic feature of
Fortran's existing type declarations: namely, that the textual order  
in which
an entity's attributes are given is insignificant. This feature both  
resolves
potential ambiguities and limits the set of data types which can be
expressed. For instance, both of the following declarations specify type
"pointer to array of integer":

         integer, pointer, dimension(:) :: p3
         integer, dimension(:), pointer :: p4

Since the order of appearance of 'pointer' and 'dimension' does not  
matter,
the ambiguity in interpretation is resolved by a rule we can write as
"pointer < dimension"; that is, 'pointer' has lower syntactic  
priority than
'dimension' and so is applied later during type formation, giving
"pointer(dimension1(integer))" as the specified type. Because of this  
rule,
there is no way to express the type "array of pointer to integer" in
Fortran. However, the missing type can be simulated by wrapping a  
pointer in
a derived type:

         type :: t; integer, pointer :: p; end type
         type(t), dimension(10) :: a2
         ! initialize a2 ...
         a2(1)%p => null()

We can now describe the precise syntactic interpretation of 'copointer' in CAF
'copointer' in CAF
2.0 by the following rules:

         pointer < copointer < codimension < dimension

These precedence relations are consistent with the existing syntax of  
Fortran
2008 and give an unambiguous interpretation of every possible  
combination of
these four attributes in a type declaration. For instance, both of the
following declarations specify a copointer to a coarray of corank 1,  
rank 2,
and element type integer:

         integer, dimension(:,:), codimension(:), copointer :: p5
         integer, copointer :: p6(:,:)[*]

In each declaration, the three attributes 'copointer', 'codimension',  
and
'dimension' occur and are interpreted in that order to give the type
"copointer(codimension1(dimension1(integer)))".

Like Fortran 2008's, CAF 2.0's attribute interpretation rules resolve
ambiguity at the cost of limiting the set of types which can be directly
expressed. For instance, "array of copointer" can't be expressed but  
can be
simulated with derived types just as shown above for "array of pointer".

CREATING AND COPYING COPOINTERS

Copointers are created and copied via Fortran's existing 'allocate' and
pointer assignment statements in the same way as ordinary pointers.  
There are
four cases to consider.

(1) A copointer is created when an 'allocate' statement is executed  
with a
copointer variable as its argument. The allocated storage comes from the
current image's shared memory region so that it can be accessed from any
other image. A copointer to that storage is created and stored in the
argument variable.

(2) A copointer is created when a pointer assignment statement's  
right hand
side (RHS) is a plain data reference; a new copointer to the RHS is  
assigned
to the variable on the left hand side (LHS). (In Fortran terminology,  
the LHS
entity "becomes copointer associated with" the RHS data ref.) The RHS  
must
have the 'cotarget' attribute. The RHS may be either a reference to  
local
data on this image or a reference to remote data on another image; in  
either
case, a copointer is created which points to the RHS data. Of course,  
for an
RHS to reference remote data it must be a coarray reference or a
copointer-dereference expression (next section). For instance, the  
following
two statements both create copointers, one pointing to a local array  
and one
pointing to an array on another image:

         integer, dimension(:), copointer :: p7, p8  ! copointer to array of integer
         integer, dimension(10), cotarget :: a3[*]   ! coarray of array of integer
         p7 => a3                                    ! copointer to a3's local array
         p8 => a3[9]                                 ! copointer to a3 on image 9

(3) When a pointer assignment statement's RHS is an ordinary (i.e.  
local)
pointer, the local pointer cannot be copied as-is into the LHS  
because its
type is not correct. Instead, the pointer is converted into a  
copointer and
assigned to the LHS; this is a form of copointer creation. For instance:

         integer, dimension(:), pointer :: r  ! pointer to array of integer
         r => a3                              ! creates pointer to a3's local array
         p7 => r                              ! converts local pointer to copointer

(4) A copointer is copied when a pointer assignment statement's RHS is
already a copointer. Given the previous declarations, the following  
statement
copies an existing copointer:

         p7 => p8

DEREFERENCING COPOINTERS

Copointers may be "dereferenced" to get a data reference that can be  
used in
either RHS or LHS contexts. In general the data reference is remote, so
loading from it and storing into it require communication with another
image. For this reason, CAF 2.0 requires copointers to be explicitly
dereferenced via a new "co-dereference operator" ([ ]) to indicate this
communication cost in the source code. This is in contrast to Fortran's
implicit dereferencing of ordinary pointers. For instance, the  
previously
introduced variable 'p7' is a copointer to array of integer, so 'p7[ ]'
is just
an array of integer, and the following assignments copy integers and  
integer
arrays between this image and some other image:

         integer :: k
         integer, dimension(10) :: a4
         k = p7[ ](1)
         a4 = p7[ ]
         p7[ ](1) = a4(1)
         p7[ ] = a4

For additional expressiveness, CAF 2.0 allows a copointer to be
dereferenced implicitly when it is known that the copointer points to
local data. This indicates in the source code that the dereference
operation requires no communication. The result of an implicit dereference
is undefined if the copointer points to another image. For instance, if
the value of 'p7' is a copointer to this image we can write:

         k = p7(1)
         a4 = p7
         p7(1) = a4(1)
         p7 = a4

COPOINTER INTRINSIC FUNCTIONS

CAF 2.0 extends the pointer-related intrinsic procedures of Fortran 2008
to work with copointers as well. For instance, 'associated(p7)' returns a
logical indicating whether 'p7' is associated with a target, and
'p7 => null()' sets 'p7' to disassociated status.

In addition, CAF 2.0 provides a new intrinsic function 'imageof', which
returns the image number to which an associated copointer points. Its
result is undefined if the copointer is disassociated.
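
As a sketch of how these intrinsics fit together (our illustrative usage,
assuming the declarations from the earlier examples):

         if (associated(p8)) then
            k = imageof(p8)    ! image number that p8 points to
         end if
         p8 => null()          ! p8 is now disassociated
         ! imageof(p8) would now be undefined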

SEMANTIC DETAILS

Here are a few related details of CAF 2.0 semantics.

(1) A copointer value may be implicitly converted into an ordinary pointer
when it is known that the copointer points to local data. The result is
undefined if it points to another image. For instance, if the value of
'p7' is a copointer to this image we can write:

         r => p7

(2) Fortran 2008 forbids associating an ordinary pointer with a remote
data reference (a coindexed object, i.e. all or part of a coarray).
Similarly, CAF 2.0 forbids associating an ordinary pointer with the result
of dereferencing a copointer. Thus the following statement is incorrect:

         r => p7[ ]     ! not allowed, even though RHS is type-compatible
                        ! with 'r' ("array of integer")

(3) As mentioned above, CAF 2.0 allows all possible combinations of the
four type-determining attributes. In addition to our new attributes, this
extends Fortran 2008's use of existing attributes by allowing "pointer to
coarray". CAF 2.0 also eliminates Fortran 2008's restrictions on nesting
coarrays and on embedding coarrays within arrays.

IMPLEMENTATION

CAF 2.0's copointers can be easily and efficiently implemented so that
copointer dereferencing is no more expensive than a corresponding coarray
reference, and typically cheaper. To add copointers to a compiler which
already implements coarrays, one has only to factor the code generation
for a coarray reference into two parts: a generalized address calculation
to determine which bytes are needed from which image, followed by a
communication operation to obtain those bytes across the interconnect.
Then the code for a copointer dereference is just the communication code,
because a copointer's representation essentially caches the result of an
address calculation.

Specifically, our prototype CAF 2.0 compiler represents a copointer value
as a pair <i,p> where 'i' is an image number and 'p' is an ordinary
Fortran pointer valid on image 'i'. Our prototype dereferences a copointer
to remote storage by sending its pointer 'p' to the image 'i' that created
it, dereferencing the pointer normally on 'i', and receiving the fetched
bytes in reply. This is about the same communication cost as a
corresponding coarray reference. On a machine whose interconnect hardware
supports one-sided communication, the CAF 2.0 runtime could decode 'p' and
use the corresponding addresses, strides, and lengths to initiate
low-level hardware communication directly.
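
The pair representation described above might be sketched as a derived
type like the following (an illustrative sketch only; the type and
component names, and the use of 'c_ptr' as a stand-in for an ordinary
Fortran pointer, are our assumptions rather than the prototype's actual
code):

         use iso_c_binding, only: c_ptr
         type :: copointer_rep
            integer     :: image   ! image number 'i' on which 'p' is valid
            type(c_ptr) :: p       ! ordinary pointer, meaningful only on image 'i'
         end type copointer_rep

A dereference sends 'p' to image 'image', which dereferences it locally
and replies with the requested bytes.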

Our prototype's representation does make an assumption about the
underlying Fortran compiler's storage allocator: the allocator must
tolerate our copying and storing of pointers beyond its reach. For
instance, the allocator must not do reference counting, garbage
collection, or storage compaction by moving blocks and updating pointers,
because the allocator cannot see our copies of pointers on other images.
The Fortran language does not require any of this, and in fact all
commonly used Fortran compilers satisfy our assumption. However, a simple
change of representation would permit implementing CAF 2.0 on an allocator
that does not satisfy the assumption: the pointer component 'p' is
replaced by an opaque handle 'h' that can be looked up on image 'i' to
yield a corresponding pointer. Instead of sending 'p' to the remote image,
one would send 'h' at the same cost, and the rest of the implementation
would be unchanged.


--
Bill Scherer
Research Scientist
Department of Computer Science
Rice University
Houston, Texas, USA




