(j3.2006) (SC22WG5.5898) 3 levels of parallelism?
Clune, Thomas L. GSFC-6101
thomas.l.clune
Thu Jul 6 08:28:20 EDT 2017
On Jul 5, 2017, at 6:05 PM, Brian Friesen <bfriesen at lbl.gov> wrote:
On Wed, Jul 5, 2017 at 12:49 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
I think the concern is that in practice, the user needs to be able to control the division of work between the two levels to optimize performance. A model where the compiler will make the decisions will be easier to use, but …
The history of optimization in real codes is littered with the difference between “in theory” and “in practice”.
But isn't the point of the language to be descriptive, not prescriptive? Controlling the division of work among different types of parallelism seems like a job for programming models like OpenMP, OpenACC, etc. They evolve rapidly and follow architecture trends. OpenMP already supports both thread- and SIMD-level parallelism. (Don't know about OpenACC.)
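(For concreteness, both levels in today's OpenMP look roughly like the sketch below. It is a minimal, made-up example rather than anything from a real code: the outer loop is divided among threads and the inner loop is vectorized.)

    program two_levels
       implicit none
       integer, parameter :: n = 1024, m = 1024
       real :: a(m,n), b(m,n), c(m,n)
       integer :: i, j

       a = 1.0
       b = 2.0

       ! Coarse level: iterations of the j loop are divided among threads.
       !$omp parallel do private(i)
       do j = 1, n
          ! Fine level: the i loop is vectorized.
          !$omp simd
          do i = 1, m
             c(i,j) = a(i,j) + b(i,j)
          end do
       end do
       !$omp end parallel do

       print *, c(1,1)
    end program two_levels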
The problem is that, despite the best efforts of compiler developers, it is still often the case that a human who understands the problem can do a better job of decomposing the work. Sure, the language should provide means to let the compiler do as well as it can, but it should also allow the user to assert a bit more control. Perhaps the best example of the dichotomy was HPF. There the user was largely freed of concerns about domain decomposition, but was also largely deprived of performance. The only HPF codes that I ever saw scale well generally cheated and used MPI under the hood. The general assessment appears to be that, for nontrivial applications, user control of domain decomposition at the coarsest levels is better than what a general-purpose compiler can achieve. Coarrays definitely leave that control to the user and make no attempt to determine the division of work.
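To make the coarray point concrete, here is a minimal sketch of what I mean by leaving the decomposition to the user. Everything in it (the array name, the even split with no remainder handling) is made up for illustration; the point is only that the programmer, not the compiler, computes who owns what:

    program caf_decomposition
       implicit none
       integer, parameter :: n_global = 1024   ! assume divisible by num_images()
       integer :: me, np, nlocal, lo, hi
       real, allocatable :: u(:)[:]

       me = this_image()
       np = num_images()

       ! The programmer, not the compiler, decides how the global index
       ! range 1:n_global is split across images.
       nlocal = n_global / np
       lo = (me - 1) * nlocal + 1
       hi = lo + nlocal - 1

       allocate(u(nlocal)[*])      ! same local size on every image
       u = real(me)                ! work on the locally owned slab

       sync all
       print *, 'image', me, 'owns global indices', lo, ':', hi
    end program caf_decomposition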
As another extreme case, consider why we don't use OpenMP for large-scale parallelism. There is nothing (or at least very little) in the model that would preclude using it in a distributed context. But the user's limited ability to express memory affinity and other such concerns means that the scaling is terrible in almost every nontrivial case. (And I'm probably out of date; recent incarnations have probably let the user provide some such information to aid the compiler/library.) Nonetheless, some experts have scaled OpenMP applications across 100s of nodes (on an SGI), demonstrating performance comparable to MPI. How did they do it? At the top of the code they allocate special buffers that are used to communicate between threads. Then the bulk of the application is entirely within one all-encompassing parallel loop. The extents of the local domain managed by a given thread are calculated, and suitable thread-private arrays are allocated. First-touch rules are used to achieve good affinity. The end result looks a lot like MPI (or perhaps a bit more like CAF). Again, the performance was achieved by explicit domain decomposition.
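Roughly the shape I have in mind, sketched here as a single parallel region (the single outer parallel loop variant is morally the same). The names, the even split, and the details are my own invention, and I have omitted the shared communication buffers for brevity:

    program spmd_style_openmp
       use omp_lib
       implicit none
       integer, parameter :: n_global = 1000000
       real, allocatable :: local(:)
       integer :: me, nthreads, nlocal, lo, hi

       ! One all-encompassing parallel region; each thread acts much like
       ! an MPI rank.  (Shared buffers for inter-thread exchange omitted.)
       !$omp parallel private(me, nthreads, nlocal, lo, hi, local)
       me       = omp_get_thread_num()
       nthreads = omp_get_num_threads()

       ! Each thread computes the extents of the local domain it manages
       ! (remainder handling omitted).
       nlocal = n_global / nthreads
       lo = me * nlocal + 1
       hi = lo + nlocal - 1

       ! Thread-private allocation; the first touch below places the pages
       ! near the thread that owns them, giving good memory affinity.
       allocate(local(nlocal))
       local = 0.0

       ! ... bulk of the application works on local(:), which holds
       !     global indices lo:hi ...
       local = local + real(me)

       deallocate(local)
       !$omp end parallel
    end program spmd_style_openmp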
Within a node, memory access and such is a bit more symmetric. Compilers can often do a good job. But my understanding from various talks about optimization on GPUs is that the user needs to spend a bit more time parameterizing the decomposition and then experimentally tuning the parallelism by running the code with varying parameters. (I have no hands-on experience, so I'm intentionally being abstract/vague.)
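Purely as a hand-waving illustration of "parameterize, then tune by re-running": something like the toy loop blocking below, where the tile size comes from the environment rather than being baked in. This is a CPU-side analogue only; I am not claiming it reflects any particular GPU programming model:

    program tile_tuning
       implicit none
       integer, parameter :: n = 4096
       real, allocatable :: a(:,:)
       integer :: i, j, bi, bj, tile, istat
       character(len=16) :: env

       ! Decomposition parameter read from the environment, so the same
       ! binary can be re-run with different tile sizes and timed.
       call get_environment_variable('TILE', env, status=istat)
       if (istat /= 0 .or. len_trim(env) == 0) then
          tile = 64                ! default
       else
          read(env, *) tile
       end if

       allocate(a(n,n))
       a = 0.0

       ! Blocked sweep; the block (tile) size is the tuning parameter.
       do bj = 1, n, tile
          do bi = 1, n, tile
             do j = bj, min(bj + tile - 1, n)
                do i = bi, min(bi + tile - 1, n)
                   a(i,j) = a(i,j) + 1.0
                end do
             end do
          end do
       end do

       print *, 'tile =', tile, '  a(1,1) =', a(1,1)
    end program tile_tuning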
Or did you have something more subtle in mind?
Not known for subtlety. But I sometimes achieve obscurity. :-)
I really am not angling for any agenda here. And to be honest if I'd connected array notation with vectorization, I probably would not have started the thread.
Cheers,
- Tom
PS Nothing in this message was meant to disparage compiler developers in any way. They have the task of optimizing _all_ codes and appeasing (to varying degrees) all customers. I have no reason to suspect they have done anything other than a superlative job given the daunting difficulty of the task.