(j3.2006) [Fwd: Fortran 2008 query]
Bill Long
longb
Tue Oct 31 16:52:35 EDT 2017
Hi Tom,
Thanks for the detailed exposition of the issues faced by your models. Well within the spirit of "use cases". (Also, notably absent is any mention of needing preprocessing, Cmake, or configure :) ! )
Some observations below...
> On Oct 26, 2017, at 5:57 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
>
>
>> On Oct 26, 2017, at 5:47 PM, Bill Long <longb at cray.com> wrote:
>>
>>>
>>> On Oct 26, 2017, at 4:07 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
>>>
>>> Not on any systems acquired at NASA. Granted we're no longer doubling memory-per-core with each generation of our cluster, but it is still growing. Previous machine was 128 GB on 28 cores. Next machine is 256 GB on 40 cores. And this is primarily driven by the requirements of our Earth system model (including data assimilation).
>>
>> Well, this is today. Van was talking about the future. Memory-cpu bandwidth and latency are increasingly a bottleneck. The current DIMM model for memory will not last on high-performance systems. Future memory schemes will involve (faster) memory incorporated into the processor chip, or stacked very close. And less memory/chip. With more cores, definitely less memory/core. Has your favorite Intel salesman not mentioned the 72-core chips?
>>
>>>
>>> We of course wish to reduce the memory footprint, as it would likely mean we could buy (and use) more cores. But in practice it is very hard to make headway.
>>
>> Indeed, changing the memory layout is hard. Basically it's equivalent to writing better parallelism into the code so that the data can be distributed over a larger set of nodes. This takes thought, and sometimes a disruptive redesign. Fortunately, Fortran is better designed for this than other languages.
>
> Actually that is not our major problem. It is just the sheer number of arrays that are all decomposed in the same manner. Basically the "state" of a modern Earth system model is very data intensive.
>
> We know that some of the arrays that currently persist across time-steps could be deallocated while other components execute. We don't know how much could be saved that way but maybe as much as 2x. Perhaps even more could be saved by recomputing some arrays at each entry. Here we would be trading off between memory and computation. But could be a win in some cases.
Allocating and deallocating memory frequently can be expensive. Automatic or local arrays (typically stack fodder) are cheaper. Recomputing things instead of saving them for reuse far in the future is becoming more profitable.
>
> The holy grail would be to distribute the components themselves across processors. Here the problem is that it implies a fair bit of data movement as these systems are coupled. The cases where it can generally be worthwhile are things like ocean-atmosphere where the overlap is just a 2D region. Then you just have to solve the load balance problem that varies with each compiler and version of the model. Oh and the time-stepping algorithm goes from stable to weakly unstable because you have to use a lagged state. In practice it is shown to be stable enough to work (almost always). GFDL has already done this, and I think that NCAR has these capabilities. I suspect my org will go that way within the next 5 years, but it is only a modest improvement in scalability - less than 2x.
>
> Currently to use more processors for a fixed grid size we have to drop down to a footprint that is less than 30x30 per core. Our dynamical core uses 3 guard cells on each side, which therefore already induces a fairly large penalty. And before you say that we should use a different algorithm, this is the core that recently won a bakeoff by NOAA to choose their next dynamical core. At the end of the day, this core produces faster results for a given computing resource than its competitors. Apparently was not even close. Alternatively we can scale further by going to larger grid sizes - indeed for demonstration purposes we often do that. But for weather forecasting we need to produce a 10 day forecast in ~ 2-3 hours. Larger grids don't help us meet this target.
>
> Longer term I think we could squeeze a modest improvement in scalability from the dynamical core by better overlap of communication and computation. But it could also be a lot of work with little payoff. (The core is admittedly implemented in a way that makes it difficult to make this change.)
In my experience, overlap of communication and computation is useful and much easier to achieve with coarrays. The compiler has enough information to "back-schedule" remote loads before the values will be needed. While you can do this manually with MPI, it is a lot easier for the language semantics to do it for you. And the code is easier to read and maintain.
>
> We will gain another ~25% in scalability in the very near future by switching to hybrid MPI+OpenMP. Basically this reduces the fraction of the computations that happen in the guard cells. It also improves the length of the inner loops which I think helps with vectorization but I don't actually have any hardware counter proof of this (yet).
MPI+OpenMP is so legacy. Why not coarrays + DO CONCURRENT? This approach does seem to be a likely win, saving both total memory and communications. Although on-node communications are relatively inexpensive. Hopefully you are getting vectorization already. The hardware vectors on the current systems tend to be really short. What matters more is contiguous memory access. (In contrast to the old vector systems where longer was better and there was no penalty for discontiguous access.)
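A rough illustration of the guard-cell arithmetic behind that point. This is only a sketch: the 30x30 footprint and the 3 guard cells per side come from the quoted text, while the 60x60 hybrid tile (four threads sharing one rank) and the helper name `guard_fraction` are assumptions made for the example, and the fraction of points is only a proxy for the fraction of computation.

```python
def guard_fraction(nx, ny, guard=3):
    """Fraction of a tile's allocated points that are guard (halo) cells."""
    total = (nx + 2 * guard) * (ny + 2 * guard)
    return 1.0 - (nx * ny) / total

# One rank per core with a 30x30 footprint (figures from the thread).
per_core = guard_fraction(30, 30)

# Hypothetical hybrid layout: 4 threads share a single 60x60 rank,
# so only the outer edge of the larger tile needs guard cells.
hybrid = guard_fraction(60, 60)

print(f"30x30 per rank: {per_core:.1%} of points are guard cells")
print(f"60x60 per rank: {hybrid:.1%} of points are guard cells")
```

The larger the tile owned by one rank (or image), the smaller the share of points spent on guard cells, which is the effect the hybrid approach is after.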
>
> I.e. we believe we are facing a bleak future in terms of exploiting exascale machines. And it is not just NASA. I think all of the major data assimilation systems across the world are in a similar boat. They see how another ~2x of scalability can be squeezed out from code improvements but will see increasing run-times as they increase the resolution. Each 2x improvement in resolution lets us use 4x more processes but costs 8-10x more in computation. Vertical grid does not double resolution as frequently, but time-steps generally go down by a factor of 2 when the resolution doubles.
The usual weather/climate argument for exascale that I've seen is "finer grid -> more accurate prediction". Interesting to see that there are downsides to that argument.
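The downside can be put in numbers using the factors quoted above (4x more processes and 8-10x more computation per doubling of resolution). The sketch below assumes the work scales perfectly onto the extra processes, which is an optimistic assumption of the example rather than a claim from the thread; `runtime_growth` is a hypothetical helper name.

```python
def runtime_growth(doublings, work_factor, proc_factor=4.0):
    """Relative wall-clock time after a number of resolution doublings.

    Each doubling multiplies the work by work_factor and the usable
    process count by proc_factor; perfect scaling is assumed.
    """
    return (work_factor / proc_factor) ** doublings

for k in range(1, 4):
    low = runtime_growth(k, 8.0)    # 8x more computation per doubling
    high = runtime_growth(k, 10.0)  # 10x more computation per doubling
    print(f"{k} doubling(s): run time grows by {low:.2f}x to {high:.2f}x")
```

Even under ideal scaling the run time at least doubles with each doubling of resolution, which is why a fixed target such as a 10 day forecast in 2-3 hours gets harder, not easier, on a finer grid.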
>
> Note: the role of ensembles is increasing - so we'll definitely want larger computers as we go along. But there is a limit to how many members one has in an ensemble before you reach diminishing returns.
>
> Maybe there will be a breakthrough somewhere, but this community would not be the first one to find itself unable to (fully) adapt to computing technology as it evolves. There are some serious scientific algorithms that are fundamentally serial (or close to it).
It seems that the parallelism is in the data (partition of components like air, ocean, ...), rather than in the algorithms used for computations on that data.
Cheers,
Bill
>
>
> Sorry for the lengthy post, but thought it might be useful to demonstrate that we've at least given some thought to this.
>
> Cheers,
>
> - Tom
>
>
>
>
>>
>> Cheers,
>> Bill
>>
>>>
>>>
>>>> On Oct 26, 2017, at 4:27 PM, Bill Long <longb at cray.com> wrote:
>>>>
>>>>
>>>>> On Oct 26, 2017, at 2:58 PM, Van Snyder <Van.Snyder at jpl.nasa.gov> wrote:
>>>>>
>>>>>>
>>>>>> We could say that it is a processor-dependent value greater or equal to 1023.
>>>>>
>>>>> Ridiculously small. Why design a language for machines that were obsolete twenty years ago? We should be designing for the future, not the past.
>>>>
>>>> You mean the real future where the amount of memory/core continues to decline, and where broadcasting the executable out to all the participating processors becomes slower as the size of the executable increases (because of initialized arrays)? The better option in cases like this is usually to read the data in from an unformatted file when the program starts. Maybe just to image 1 and broadcast it from there.
>>>>
>>>> Cheers,
>>>> Bill
>>>>
>>>>
>>>> Bill Long longb at cray.com
>>>> Principal Engineer, Fortran Technical Support & voice: 651-605-9024
>>>> Bioinformatics Software Development fax: 651-605-9143
>>>> Cray Inc./ 2131 Lindau Lane/ Suite 1000/ Bloomington, MN 55425
>>>>
>>>>
>>>> _______________________________________________
>>>> J3 mailing list
>>>> J3 at mailman.j3-fortran.org
>>>> http://mailman.j3-fortran.org/mailman/listinfo/j3
>>>
>>
>
Bill Long longb at cray.com
Principal Engineer, Fortran Technical Support & voice: 651-605-9024
Bioinformatics Software Development fax: 651-605-9143
Cray Inc./ 2131 Lindau Lane/ Suite 1000/ Bloomington, MN 55425