(j3.2006) (SC22WG5.5898) 3 levels of parallelism?
Clune, Thomas L. (GSFC-6101)
thomas.l.clune at nasa.gov
Wed Jul 5 15:49:16 EDT 2017
I think the concern is that, in practice, the user needs to be able to control the division of work between the two levels to optimize performance. A model in which the compiler makes the decisions would be easier to use, but ...
The history of optimization in real codes is littered with the difference between "in theory" and "in practice".
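For concreteness, here is a minimal sketch of the kind of explicit two-level control I mean, written with OpenMP directives (threads on the outer loop, SIMD lanes on the inner); the routine and array names are purely illustrative, not from any real code:

  subroutine add2d(m, n, a, b, c)
    integer, intent(in)  :: m, n
    real,    intent(in)  :: a(n,m), b(n,m)
    real,    intent(out) :: c(n,m)
    integer :: i, j
    ! Outer level: the user assigns columns to threads.
    !$omp parallel do
    do j = 1, m
       ! Inner level: the user requests vectorization explicitly.
       !$omp simd
       do i = 1, n
          c(i,j) = a(i,j) + b(i,j)
       end do
    end do
  end subroutine add2d

Here the split between threads and vector lanes is fixed by the user rather than chosen by the compiler.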
On Jul 5, 2017, at 3:30 PM, Brian Friesen <bfriesen at lbl.gov> wrote:
On Wed, Jul 5, 2017 at 12:01 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
Thanks. I should have realized that array notation was the missing bit.
It will be interesting to see whether Nvidia views the situation in a similar light. Gary? ...
I can (naively) imagine a scenario in which a GPU compiler would do the same thing that Bill mentioned for CPUs:
> Note that, based on the provided semantics, the compiler can choose threading, or vectorization, or both, for the loop, depending on the code involved in the loop body.
It seems to me that GPUs have two levels of intra-node parallelism as well, namely warps and threads. So a GPU compiler encountering a DO CONCURRENT construct could choose to divide the loop iteration space into particular configurations of warps and threads based on the amount of work in the loop, just as a CPU compiler would. Unless I grossly misunderstand GPU parallelism (Gary?).
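As a purely illustrative sketch of the contrast, here is a DO CONCURRENT loop (names are mine, not from the thread); the construct asserts only that iterations are order-independent, so the mapping onto threads, vector lanes, or warps is left entirely to the compiler:

  subroutine saxpy_dc(n, a, x, y)
    integer, intent(in)    :: n
    real,    intent(in)    :: a, x(n)
    real,    intent(inout) :: y(n)
    integer :: i
    ! Iterations may execute in any order; a CPU compiler can map
    ! them onto threads and/or vector lanes, and a GPU compiler
    ! onto warps and threads, without any change to the source.
    do concurrent (i = 1:n)
       y(i) = a*x(i) + y(i)
    end do
  end subroutine saxpy_dc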