(j3.2006) (SC22WG5.5226) Vote on draft TS

John Reid John.Reid
Tue Apr 15 18:31:59 EDT 2014


Please answer the following question "Is N2007 ready for forwarding to 
SC22 as the DTS?" in one of these ways. 

2) Yes, but I recommend the following changes. 

1. Failed images

There has not been adequate discussion of an image stalling because of 
the failure of another image, which will cause it to fail despite there
being nothing wrong with its hardware. All we have is a warning in NOTE
5.7. If restarting is planned, as in the example A.1.2, such images 
should be available for reuse. Such stalling anyway indicates that the
team calculation has gone wrong, so I suggest adding a RENDEZVOUS 
specifier to the SYNC ALL statement. If a stalled image is detected, all
images of the team would re-execute the most recently executed 
RENDEZVOUS FOR ALL statement with STAT= value set to STAT_FAILED_IMAGE,
ignoring all pending synchronizations initiated since that FOR ALL was 
last executed. Alternatively, all images of the team could exit the
change team construct. That would give the required effect in A.1.2
without any code change. 

The programmer who wishes to guard against stalling because of accessing
an image that has failed may do so as follows:
    IF(ALL(FAILED_IMAGES()/=i)) THEN
       a = a[i]
    ELSE
       ! Do something else
    END IF
This is illustrated in my rewrite of A.2.1 below. I think we may need 
a logical function that tests a given image for failure. 
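
A minimal sketch of such a function, written in terms of FAILED_IMAGES,
might look as follows. The module name failure_tests and the function
name image_failed are hypothetical and not part of N2007; the sketch
assumes that FAILED_IMAGES() returns the indices of the failed images
of the current team.
    MODULE failure_tests
       IMPLICIT NONE
    CONTAINS
       ! Hypothetical helper: true if image i of the current team has failed
       LOGICAL FUNCTION image_failed(i)
          INTEGER, INTENT(IN) :: i
          image_failed = ANY(FAILED_IMAGES() == i)
       END FUNCTION image_failed
    END MODULE failure_tests
With such a helper, the guard above could be written as
IF (.NOT. image_failed(i)) a = a[i].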

2. Teams

I am unhappy with the idea of code using an image selector for an 
ancestor team that is not visible in the scope. For example, library 
code should leave such work to the calling code. 

We have chosen to address an ancestor in an image selector by using its
team variable name. We should be doing the same for other references to
an ancestor. An ancestor team may be at different team distances on 
executing images of different descendant teams. I suggest the 
replacement of DISTANCE by TEAM as arguments of TEAM_ID, NUM_IMAGES,
and THIS_IMAGE, the removal of DISTANCE as an argument of GET_TEAM,
and the removal of TEAM_DEPTH. 

3. Edits

[10:27] Change "Apart from its final upper bound, its" to "Its".
[There is nothing special about the codimension-decl here, so
there is no need to say anything about the final codimension.]

[10:30] Change "established." to "established, apart from its 
final upper cobound".
[Here, the final upper cobound is likely to be different.]

[13:10-13] Replace by "The value of the default integer scalar 
constant STAT_FAILED_IMAGE is positive and different from the value 
of STAT_STOPPED_IMAGE, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, or 
STAT_UNLOCKED. If the processor detects that an image of the current 
team has failed, the". 
[With the addition of FAIL IMAGE, we want a processor that cannot
detect a truly failed image to respond to the execution of 
FAIL IMAGE.] 

[15:6-24] Should an event variable be atomic?

[15:34] Wording like that at [17:12-14] is needed here. 

[23:26] EVENT_QUERY should be an atomic subroutine. 

[40:4, 40:25-32] To make the code tolerate failing spare images, 
replace the DO loop by the following:
     k = images_used
     DO i = 1, size(failed_img)
        IF (failed_img(i) == 1) ERROR STOP 'cannot recover'
        DO k = k+1, num_images()
           IF (all(failed_img(:)/=k)) EXIT
        END DO
        IF (me == k) THEN
           me = failed_img(i)
           id = 1
           EXIT
        END IF
     END DO
If this is done, [40:4] should be deleted.     

[43:6-44:34] This example has bugs. I think the message would 
be clearer if this example were replaced by an example that is
not tolerant to failed images, followed by a modification that
is. Here is a draft. 

A.2.1 EVENT_QUERY example

The following example illustrates the use of events via a program in 
which image 1 acts as master and shares out work items to the other 
images. Only one work item at a time can be active on a worker image, 
and each deals with the result (e.g. via I/O) without directly feeding 
data back to the master image.

Because the work items are not expected to be balanced, the master 
keeps cycling through all the images to find one that is waiting for 
work.

An event is posted by each worker to indicate that it has completed 
its work item. Since the corresponding variables are needed only on 
the master, we place them in an allocatable array component of a 
coarray. An event on each worker is needed for the master to post the 
fact that it has made a work item available for it.

PROGRAM work_share
   USE, INTRINSIC :: iso_fortran_env
   USE :: mod_work, ONLY: & ! Module that creates work items
      work, & ! Type for holding a work item
      create_work_item, &  ! Function that creates work item
      process_item, & ! Function that processes an item
      work_done ! Logical function that returns true if all work done
   TYPE(event_type) :: submit[*] ! Whether work ready for a worker 
   TYPE :: asymmetric_event
      TYPE(event_type), ALLOCATABLE :: event(:)
   END TYPE
   TYPE(asymmetric_event) :: free[*] ! Whether worker is free
   TYPE(work) :: work_item[*] ! Holds all the data for a work item
   INTEGER :: count, i, n, nbusy[*]
   
   IF (this_image() == 1) THEN
      ! Get started
      ALLOCATE(free%event(2:num_images()))
      nbusy = 0 ! This holds the number of workers working
      DO i = 2, num_images() ! Start the workers working
         IF (work_done()) EXIT
         nbusy = nbusy + 1
         work_item[i] = create_work_item()
         EVENT POST (submit[i])
      END DO 
      ! Main work distribution loop
      master : DO  
         image : DO i = 2, num_images() 
            CALL EVENT_QUERY(free%event(i), count)
            IF (count == 0) CYCLE ! Worker is not free
            EVENT WAIT (free%event(i))
            nbusy = nbusy - 1
            IF (work_done()) CYCLE
            nbusy = nbusy + 1
            work_item[i] = create_work_item()
            EVENT POST (submit[i])
         END DO image
         IF ( nbusy==0 ) THEN ! All done. Exit on all images.
            DO i = 2, num_images() 
               EVENT POST (submit[i])
            END DO
            EXIT master
         END IF
      END DO master
   ELSE
      ! Work processing loop
      worker : DO 
         EVENT WAIT (submit)
         IF (nbusy[1] == 0) EXIT
         CALL process_item(work_item)
         EVENT POST (free[1]%event(this_image()))
      END DO worker
   END IF
END PROGRAM work_share


A.2.1a EVENT_QUERY example that tolerates image failure

This example is an adaptation of the example of A.2.1 to make it able 
to execute in the presence of the failure of one or more of the worker
images. The function create_work_item now accepts an integer argument 
to indicate which work item is required. It is assumed that the work 
items are indexed 1, 2, ... . It is also assumed that if an image fails
while processing a work item, that work item can subsequently be 
processed by another image. 

The test ANY(FAILED_IMAGES()==i) is used to determine whether a 
particular image has failed. 

PROGRAM work_share
   USE, INTRINSIC :: iso_fortran_env
   USE :: mod_work, ONLY: & ! Module that creates work items
      work, & ! Type for holding a work item
      create_work_item, &  ! Function that creates work item
      process_item, & ! Function that processes an item
      work_done ! Logical function that returns true if all work done
   TYPE(event_type) :: submit[*] ! Whether work ready for a worker 
   TYPE :: asymmetric_event
      TYPE(event_type), ALLOCATABLE :: event(:)
   END TYPE
   TYPE(asymmetric_event) :: free[*] ! Whether worker is free
   TYPE(work) :: work_item[*] ! Holds all the data for a work item
   INTEGER :: count, i, k, kk, n, nbusy[*], np, status
   INTEGER, ALLOCATABLE :: working(:) ! Items being worked on
   INTEGER, ALLOCATABLE :: pending(:) ! Items pending after image failure
   
   IF (this_image() == 1) THEN
      ! Get started
      ALLOCATE(free%event(2:num_images()))
      ALLOCATE(working(2:num_images()), pending(num_images()-1))
      nbusy = 0  ! This holds the number of workers working
      k = 1 ! Index of next work item
      np = 0 ! Number of work items in array pending
      DO i = 2, num_images() ! Start the workers working
         IF (work_done()) EXIT
         work_item[i] = create_work_item(k)         
         working(i) = k
         k = k + 1
         nbusy = nbusy + 1
         EVENT POST (submit[i], STAT=status)
         IF (status==STAT_FAILED_IMAGE) THEN
            working(i) = 0
            k = k - 1
            nbusy = nbusy - 1
         END IF
      END DO 
      ! Main work distribution loop
      master : DO  
         image : DO i = 2, num_images() 
            IF (ANY(FAILED_IMAGES()==i)) THEN ! Image has failed
               IF (working(i)>0) THEN ! It failed while working
                  np = np + 1
                  pending(np) = working(i)
                  working(i) = 0
               END IF 
               CYCLE image
            END IF
            CALL EVENT_QUERY(free%event(i), count)
            IF (count == 0) CYCLE image ! Worker is not free
            EVENT WAIT (free%event(i))
            nbusy = nbusy - 1
            IF (np>0) THEN
               kk = pending(np)
               np = np - 1
            ELSE
               IF (work_done()) CYCLE image
               kk = k
               k = k + 1
            END IF
            nbusy = nbusy + 1
            working(i) = kk
            work_item[i] = create_work_item(kk)
            EVENT POST (submit[i],STAT=status)
            ! If image i has failed, this will not hang and the failure 
            ! will be handled on the next iteration of the loop
         END DO image
         IF ( nbusy==0 ) THEN ! All done. Exit on all images.
            DO i = 2, num_images() 
               EVENT POST (submit[i],STAT=status)
               IF (status==STAT_FAILED_IMAGE) CYCLE
            END DO
            EXIT master
         END IF
      END DO master
   ELSE
      ! Work processing loop
      worker : DO 
         EVENT WAIT (submit)
         IF (nbusy[1] == 0) EXIT worker
         CALL process_item(work_item)
         EVENT POST (free[1]%event(this_image()))
      END DO worker
   END IF
END PROGRAM work_share


