[J3] Performance Portability and Fortran: Making Fortran cool again
Ondřej Čertík
ondrej at certik.us
Wed Jan 16 14:16:57 EST 2019
Hi Bill,
On Wed, Jan 16, 2019, at 10:25 AM, Bill Long via J3 wrote:
> Hi Ondrej,
>
> This sort of insight is very valuable. Thanks for posting it.
>
> There seems to be a lot of focus on using GPUs. (Maybe that’s why
> they asked Gary - who works for NVIDIA - to participate?)
>
> I would point out that a DO CONCURRENT construct has semantics that are
> quite compatible with execution on a GPU. Typically, DO CONCURRENT
> constructs are threaded, using the same underlying infrastructure as
> OpenMP. I’ve mentioned adding GPU support to our compiler
> developers, but the chicken-and-egg problem is “no customer is asking
> for this”. If customers, especially ones as large and visible as
> LANL, ask, you might get it. If the standard needs tweaks to better
> enable GPU execution of DO CONCURRENT, that is something we should
> look into.
Yes, that is exactly the kind of discussion we need to have. In my experience, exactly as Damian Rouson said above, Fortran developers do not participate in such discussions; they just use whatever tools compiler vendors give them and get the job done. That is fine, but I think we also need discussions like this, so that people can see that there is a way forward, or at least a path, for Fortran, and so that demand for some of these features becomes visible. So I am at least trying to get the conversation started.
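To make the DO CONCURRENT point concrete, here is a minimal sketch of my own (not from any particular compiler's documentation) of a saxpy loop written with DO CONCURRENT. The construct asserts that the iterations are independent, which is exactly the property a compiler needs in order to thread the loop on a CPU or, in principle, offload it to a GPU:

    subroutine saxpy_dc( n, a, x, y )
        integer, intent(in) :: n
        real, intent(in) :: a, x(n)
        real, intent(inout) :: y(n)
        integer :: i
        ! Iterations carry no dependencies, so the compiler may execute
        ! them in any order, in parallel, or on an accelerator.
        do concurrent (i = 1:n)
            y(i) = a * x(i) + y(i)
        end do
    end subroutine saxpy_dc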
>
> > On Jan 15, 2019, at 11:54 PM, Ondřej Čertík via J3 <j3 at mailman.j3-fortran.org> wrote:
> >
> >
> > Probably not what you want to hear, but many people at my Lab are moving away from Fortran to C++/Kokkos, because Fortran currently doesn't have a clear path forward I am afraid. That's one reason I decided to be active here.
> >
> > Kokkos allows the same code/loop to run in parallel on a CPU and a GPU, and switches the array memory layout accordingly.
> >
> > The closest thing Fortran has is OpenACC/OpenMP, which allows the same loop to run on both CPU and GPU, but it doesn't seem to have a mechanism to switch the memory layout the way Kokkos does.
> >
> > Another problem is that only a few Fortran compilers support OpenACC currently, while Kokkos runs on most major C++ compilers.
>
> I think OpenACC is being replaced by the GPU features in the newer
> OpenMP specs. The relevant metric here is OpenMP support. Users
> should not be using OpenACC for new code.
Yes.
>
> >
> > So I would say that currently there is no equivalent of Kokkos in Fortran, so we can't do performance portability in the Kokkos's sense.
> >
> > However, if you give up on having the same code base run on both CPU and GPU, then Fortran has CUDA Fortran, which I think very naturally extends Fortran with a few keywords and constructs to run on a GPU. I think the resulting code is simpler than CUDA C, and I would argue simpler than Kokkos. But it only targets GPUs, which some people don't like; on the other hand, I also know people who think that is precisely the solution: structure the code so that only a small part has to be targeted at the GPU specifically, using CUDA Fortran.
> >
> > So you can mention CUDA Fortran in your slide.
>
> CUDA tends not to be portable. Will it work with the GPUs from AMD or Intel?
I think it currently doesn't, but the changes that CUDA Fortran makes to the language look very portable to me. Here is an example:
! Kernel definition
attributes(global) subroutine ksaxpy( n, a, x, y )
    real, dimension(*) :: x, y
    real, value :: a
    integer, value :: n
    integer :: i  ! local index; the VALUE attribute applies only to dummy arguments
    i = (blockidx%x-1) * blockdim%x + threadidx%x
    if ( i <= n ) y(i) = a * x(i) + y(i)
end subroutine

! Host subroutine
subroutine solve( n, a, x, y )
    real, device, dimension(*) :: x, y
    real :: a
    integer :: n
    ! Launch the kernel; round the grid size up so all n elements are covered
    call ksaxpy<<<(n+63)/64, 64>>>( n, a, x, y )
end subroutine
The only changes relative to standard Fortran are the "device" attribute in the declaration of the x, y arrays, the "attributes(global)" annotation on the subroutine, the chevron kernel-launch syntax "call ksaxpy<<<grid, block>>>( ... )", and the predefined blockidx, blockdim, and threadidx variables.
Unfortunately I am still learning about GPUs, so I don't know whether this maps nicely onto how GPUs from other companies work, but I think this is the direction in which it should be implemented in Fortran (we can argue about the exact syntax, but that's a separate discussion).
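For completeness, here is a hedged sketch (the names x_d and y_d are my own, not from the example above) of what the "device" attribute implies on the host side in CUDA Fortran: assignment between host and device arrays performs the memory transfers implicitly:

    real, device, allocatable :: x_d(:), y_d(:)
    real :: x(1024), y(1024), a
    a = 2.0
    allocate( x_d(1024), y_d(1024) )
    x_d = x                          ! implicit host-to-device copy
    y_d = y
    call solve( 1024, a, x_d, y_d )  ! kernel runs on the device arrays
    y = y_d                          ! implicit device-to-host copy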
Regarding "DO CONCURRENT": yes, that should absolutely run on a GPU. But I think you need to be able to specify more information. Here is an example of a loop using OpenACC (later this will become just OpenMP):
https://github.com/NVIDIA-developer-blog/code-samples/blob/ea178c7ffb013f6410f442456ef5f3e84c4d16fb/posts/002-openacc-example/step3/laplace2d.f90#L82
and as you can see, to get optimal performance on a (particular?) GPU, one has to specify the "gang(32)" and "vector(16)" parameters. The same issue arises on a CPU: OpenMP pragmas typically carry tuning information that has to be communicated from the developer to the compiler.
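For readers who don't follow the link, the relevant loop looks roughly like this (a simplified sketch of the linked laplace2d kernel; the clause values are the ones under discussion and would likely need retuning for a different GPU):

    !$acc kernels loop gang(32), vector(16)
    do j = 2, m-1
        !$acc loop gang(16), vector(32)
        do i = 2, n-1
            ! Jacobi update: average of the four neighbours
            Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
        end do
    end do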
So I don't know what the design should be, but Fortran is already heading in this direction: it has parallel constructs in the language itself, which makes for a very pleasant programming experience. It therefore makes perfect sense to me to figure out how to extend the language, in a portable way, to be able to program these new platforms.
Ondrej