(j3.2006) (SC22WG5.5898) 3 levels of parallelism?
Bill Long
longb
Thu Jul 6 13:52:15 EDT 2017
> On Jul 6, 2017, at 7:28 AM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
>
>
>> On Jul 5, 2017, at 6:05 PM, Brian Friesen <bfriesen at lbl.gov> wrote:
>>
>> On Wed, Jul 5, 2017 at 12:49 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
>> I think the concern is that in practice, the user needs to be able to control the division of work between the two levels to optimize performance. A model where the compiler will make the decisions will be easier to use, but ...
>>
>> The history of optimization in real codes is littered with the difference between "in theory" and "in practice".
>>
>> But isn't the point of the language to be descriptive, not prescriptive? Controlling the division of work among different types of parallelism seems like a job for programming models like OpenMP, OpenACC, etc. They evolve rapidly and follow architecture trends. OpenMP already supports both thread- and SIMD-level parallelism. (Don't know about OpenACC.)
>
> The problem is that despite the best efforts of compiler developers, it is still often the case that a human that understands the problem can do a better job of decomposing the work.
Indeed, experience suggests that this is the case, at least if the programmer actually understands the problem and how to decompose it to run in parallel. Compiler developers are part of the equation here, but more important are the language constructs that make decomposition and parallel execution less painful for the programmer.
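As a minimal sketch of the thread- plus SIMD-level point above: one combined OpenMP directive can express both levels over the same loop. The routine and its arguments here are invented purely for illustration.

    subroutine saxpy(n, a, x, y)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a, x(n)
      real, intent(inout) :: y(n)
      integer :: i
      ! Outer level: distribute iterations across threads.
      ! Inner level: vectorize each thread's chunk.
      !$omp parallel do simd
      do i = 1, n
         y(i) = a*x(i) + y(i)
      end do
    end subroutine saxpy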
> Sure the language should provide means to let the compiler do as well as it can, but it should also allow the user to assert a bit more control. Perhaps the best example of the dichotomy was HPF. There the user was generally freed of concerns about domain decomposition, but was also generally deprived of performance. And the only HPF codes that I ever saw that scaled well generally cheated and used MPI under the hood.
HPF was a failed idea almost from the start. In fact, the whole one-sided phenomenon starting with SHMEM and then coarrays (in the 1990s) arose partly out of frustration with HPF. So, that is probably HPF's greatest contribution - being so bad that people were forced to start over with a radically different scheme.
> The general assessment appears to be that user-control of domain decomposition at the coarsest levels for nontrivial applications is generally better than what a general-purpose compiler can achieve. Coarrays definitely leave that control to the user and make no attempt to determine the division of work.
Intentionally. The SPMD model, used by both Fortran and MPI, has proved to provide the best scaling and performance in real distributed-memory applications.
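A minimal coarray-style sketch of that division of labor, with invented names and a simple block distribution, might look like the following; the point is that the programmer, not the compiler, picks the decomposition.

    program spmd_decomp
      implicit none
      integer :: me, np, nglobal, chunk, lo, hi
      real, allocatable :: u(:)

      me = this_image()
      np = num_images()
      nglobal = 1000000

      ! Each image computes the extent of its own block of the global range.
      chunk = (nglobal + np - 1)/np
      lo = (me - 1)*chunk + 1
      hi = min(me*chunk, nglobal)
      allocate(u(lo:hi))

      u = real(me)     ! purely local work on this image's block
      sync all         ! synchronization is also under user control
    end program spmd_decomp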
Cheers,
Bill
>
> As another extreme case, consider why we don't use OpenMP for large-scale parallelism. There is nothing (or at least very little) in the model that would preclude using it in a distributed context. But the lack of ability for the user to indicate memory-affinity and other such concerns means that the scaling is terrible in almost every nontrivial case. (And I'm probably out-of-date. Recent incarnations have probably let the user provide some such information to aid the compiler/library.) Nonetheless, some experts have scaled OpenMP applications across 100s of nodes (on an SGI) -- demonstrating performance comparable to MPI. How did they do it? At the top of the code they allocate special buffers that are used to communicate between threads. Then the bulk of the application is entirely within one all-encompassing parallel loop. The extents of the local domain managed by a given thread are calculated and suitable thread-private arrays are allocated. First-touch rules are used to achieve good affinity. The end result looks a lot like MPI (or perhaps a bit more like CAF). Again - the performance was achieved by explicit domain decomposition.
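A rough sketch of the pattern just described, with every name invented for illustration: one enclosing parallel region, per-thread extents computed by hand, thread-private work arrays first-touched by their owning thread, and a shared buffer for exchanging edge values.

    program openmp_spmd_style
      use omp_lib
      implicit none
      integer, parameter :: nglobal = 1000000
      integer :: me, nthr, chunk, lo, hi, i
      real, allocatable :: work(:)       ! becomes a thread-private array below
      real, allocatable :: boundary(:)   ! shared buffer for inter-thread exchange

      allocate(boundary(0:255))          ! enough slots for up to 256 threads in this sketch
      boundary = 0.0

      !$omp parallel default(shared) private(me, nthr, chunk, lo, hi, i, work)
      me   = omp_get_thread_num()
      nthr = omp_get_num_threads()

      ! Extents of the local domain managed by this thread.
      chunk = (nglobal + nthr - 1)/nthr
      lo = me*chunk + 1
      hi = min((me + 1)*chunk, nglobal)

      ! Thread-private allocation plus first touch, so the pages end up
      ! close to the thread that will use them.
      allocate(work(lo:hi))
      do i = lo, hi
         work(i) = 0.0
      end do

      ! Bulk of the "application": local work, with an edge value exchanged
      ! through the shared buffer.
      do i = lo, hi
         work(i) = work(i) + real(me)
      end do
      if (hi >= lo) boundary(me) = work(hi)
      !$omp end parallel
    end program openmp_spmd_style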
>
> Within a node, memory access and such is a bit more symmetric. Compilers can often do a good job. But my understanding from various talks about optimization on GPUs is that the user needs to spend a bit more time parameterizing the decomposition and then experimentally tuning the parallelism by running the code with varying parameters. (I have no hands-on experience, so I'm intentionally being abstract/vague.)
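In the same hedged spirit, "parameterizing the decomposition" with OpenMP offload might look like the sketch below, where the team and thread counts are runtime knobs (read from the command line purely for illustration) so the same binary can be re-run with different values while tuning.

    program gpu_tuning_knobs
      implicit none
      integer :: nteams, tlimit, i, n
      real, allocatable :: a(:), b(:)
      character(len=32) :: arg

      n = 1000000
      allocate(a(n), b(n))
      b = 1.0

      ! Tuning parameters supplied at run time, e.g.  ./a.out 120 256
      call get_command_argument(1, arg)
      read(arg, *) nteams
      call get_command_argument(2, arg)
      read(arg, *) tlimit

      !$omp target teams distribute parallel do &
      !$omp&   num_teams(nteams) thread_limit(tlimit) map(to: b) map(from: a)
      do i = 1, n
         a(i) = 2.0*b(i)
      end do
    end program gpu_tuning_knobs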
>
>>
>> Or did you have something more subtle in mind?
>
> Not known for subtlety. But I sometimes achieve obscurity. :-)
>
> I really am not angling for any agenda here. And to be honest if I'd connected array notation with vectorization, I probably would not have started the thread.
>
> Cheers,
>
> - Tom
>
> PS Nothing in this message was meant to disparage compiler developers in any way. They have the task of optimizing _all_ codes and appeasing (to varying degrees) all customers. I have no reason to suspect they have done anything other than a superlative job given the daunting difficulty of the task.
>
Bill Long, longb at cray.com
Principal Engineer, Fortran Technical Support & Bioinformatics Software Development
voice: 651-605-9024   fax: 651-605-9143
Cray Inc./ 2131 Lindau Lane/ Suite 1000/ Bloomington, MN 55425