(j3.2006) [Fwd: Fortran 2008 query]
Clune, Thomas L. GSFC-6101
thomas.l.clune
Tue Oct 31 17:29:30 EDT 2017
On Oct 31, 2017, at 4:52 PM, Bill Long <longb at cray.com> wrote:
Hi Tom,
Thanks for the detailed exposition of the issues faced by your models. Well within the spirit of "use cases". (Also, notably absent is any mention of needing preprocessing, CMake, or configure :) !)
Yes - this response was just about the issue of focusing on scalability. I don't see CMake as very relevant to this list, except insofar as it does technically make my code non-standard-conforming due to line length and use of CPP/FPP and __FILE__. We have good reasons why we (sparingly) use these layers.
BTW, this is an excellent example of "use case" vs "implementation". If Fortran adds enough other features, we might be able to forgo FPP. But it is not a short list. In this particular case, if Fortran provided a procedure which returns the file_name of the source that contains it (as a deferred-length string) and the line number of the call, or something very similar, I'd immediately stop using __FILE__ and __LINE__.
(And maybe some of the recent stuff about compiler info has this, and I just missed it?)
Some observations below...
On Oct 26, 2017, at 5:57 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
On Oct 26, 2017, at 5:47 PM, Bill Long <longb at cray.com> wrote:
On Oct 26, 2017, at 4:07 PM, Clune, Thomas L. (GSFC-6101) <thomas.l.clune at nasa.gov> wrote:
Not on any systems acquired at NASA. Granted we're no longer doubling memory-per-core with each generation of our cluster, but it is still growing. Previous machine was 128 GB on 28 cores. Next machine is 256 GB on 40 cores. And this is primarily driven by the requirements of our Earth system model (including data assimilation).
Well, this is today. Van was talking about the future. Memory-CPU bandwidth and latency are increasingly a bottleneck. The current DIMM model for memory will not last on high-performance systems. Future memory schemes will involve (faster) memory incorporated into the processor chip, or stacked very close. And less memory/chip. With more cores, definitely less memory/core. Has your favorite Intel salesman not mentioned the 72-core chips?
We of course wish to reduce the memory footprint, as it would likely mean we could buy (and use) more cores. But in practice it is very hard to make headway.
Indeed, changing the memory layout is hard. Basically it's equivalent to writing better parallelism into the code so that the data can be distributed over a larger set of nodes. This takes thought, and sometimes a disruptive redesign. Fortunately, Fortran is better designed for this than other languages.
Actually that is not our major problem. It is just the sheer number of arrays that are all decomposed in the same manner. Basically the "state" of a modern Earth system model is very data intensive.
We know that some of the arrays that currently persist across time-steps could be deallocated while other components execute. We don't know how much could be saved that way but maybe as much as 2x. Perhaps even more could be saved by recomputing some arrays at each entry. Here we would be trading off between memory and computation. But it could be a win in some cases.
Allocating and deallocating memory frequently can be expensive. Automatic or local arrays (typically stack fodder) are cheaper. Recomputing things instead of saving them for reuse far in the future is becoming more profitable.
Much depends on the details here. And the dollar cost comes in. We could probably suffer a 10% reduction in run time if by reducing the memory requirement we could buy a computer with significantly more cores. (Allows other work to happen and/or to run in a faster but less efficient regime on more cores.)
The holy grail would be to distribute the components themselves across processors. Here the problem is that it implies a fair bit of data movement as these systems are coupled. The cases where it can generally be worthwhile are things like ocean-atmosphere where the overlap is just a 2D region. Then you just have to solve the load balance problem that varies with each compiler and version of the model. Oh and the time-stepping algorithm goes from stable to weakly unstable because you have to use a lagged state. In practice it is shown to be stable enough to work (almost always). GFDL has already done this, and I think that NCAR has these capabilities. I suspect my org will go that way within the next 5 years, but it is only a modest improvement in scalability - less than 2x.
Currently to use more processors for a fixed grid size we have to drop down to a footprint that is less than 30x30 per core. Our dynamical core uses 3 guard cells on each side, which therefore already induces a fairly large penalty. And before you say that we should use a different algorithm, this is the core that recently won a bakeoff by NOAA to choose their next dynamical core. At the end of the day, this core produces faster results for a given computing resource than its competitors. Apparently it was not even close. Alternatively we can scale further by going to larger grid sizes - indeed for demonstration purposes we often do that. But for weather forecasting we need to produce a 10 day forecast in ~ 2-3 hours. Larger grids don't help us meet this target.
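As a rough illustration of the guard-cell penalty mentioned above, here is a minimal sketch. It assumes a square 2D patch per core with guard cells on all four sides, and that cost scales with the number of cells touched; the function name halo_overhead is hypothetical and nothing here is taken from the actual dynamical core. Under those assumptions a 30x30 patch with 3 guard cells touches 36x36 cells, a factor of 1.44 over the interior alone, and the factor grows as the patch shrinks.

```python
def halo_overhead(nx, ny, guard):
    """Ratio of (interior + guard) cells to interior cells for one 2D patch.

    Assumption: guard cells surround the patch on all four sides.
    """
    interior = nx * ny
    padded = (nx + 2 * guard) * (ny + 2 * guard)
    return padded / interior


if __name__ == "__main__":
    # Overhead for progressively smaller per-core patches, 3 guard cells each.
    for n in (120, 60, 30, 15):
        print(n, round(halo_overhead(n, n, 3), 2))
```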
Longer term I think we could squeeze a modest improvement in scalability from the dynamical core by better overlap of communication and computation. But it could also be a lot of work with little payoff. (The core is admittedly implemented in a way that makes it difficult to make this change.)
In my experience, overlap of communication and computation is useful and much easier to achieve with coarrays. The compiler has enough information to "back-schedule" remote loads before the values will be needed. While you can do this manually with MPI, it is a lot easier for the language semantics to do it for you. And the code is easier to read and maintain.
Agreed, but until coarrays are widely (and robustly) portable we can't make huge headway on this. I had a good intern in 2016 that I wanted to look at this for our dynamical core, but the existing limitations in the non-Cray implementations prevented much progress. We'll evaluate again sometime in the "near" future. Or another relevant group will and raise this in our priorities.
We will gain another ~25% in scalability in the very near future by switching to hybrid MPI+OpenMP. Basically this reduces the fraction of the computations that happen in the guard cells. It also improves the length of the inner loops, which I think helps with vectorization, but I don't actually have any hardware counter proof of this (yet).
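To make the guard-cell argument for hybrid MPI+OpenMP concrete, the sketch below compares the fraction of touched cells that are guard cells for one 30x30 patch per rank against a rank whose threads share a merged patch. The 2x2 thread arrangement and the name guard_fraction are assumptions for illustration only; this is not an attempt to reproduce the ~25% figure, which depends on details of the real model.

```python
def guard_fraction(n, guard, threads_per_side=1):
    """Fraction of all touched cells that are guard cells.

    Assumption: threads_per_side**2 threads share one square subdomain of
    side n * threads_per_side, with a single guard layer around the whole.
    """
    side = n * threads_per_side
    padded = (side + 2 * guard) ** 2
    return (padded - side ** 2) / padded


if __name__ == "__main__":
    pure_mpi = guard_fraction(30, 3)       # one 30x30 patch per rank
    hybrid = guard_fraction(30, 3, 2)      # four threads share a 60x60 patch
    print(f"pure MPI: {pure_mpi:.3f}, hybrid 2x2: {hybrid:.3f}")
```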
MPI+OpenMP is so legacy... Why not coarrays + DO CONCURRENT? This approach does seem to be a likely win, saving both total memory and communications. Although on-node communications are relatively inexpensive. Hopefully you are getting vectorization already. The hardware vectors on the current systems tend to be really short. What matters more is contiguous memory access. (In contrast to the old vector systems where longer was better and there was no penalty for discontiguous access.)
MPI+OpenMP works reliably today. If I understand correctly, we will need to wait for F2015 support for the LOCAL clause to reliably use DO CONCURRENT. Our loops are almost all canonical I,J,K in the "correct" ordering. Halos at the ends of the loops keep this from being fused as one long 1D loop in many cases. But other issues also keep the compiler from vectorizing on a not-too-infrequent basis. In the end we think we're doing ok on vectorization, but believe we could extract some more and do have someone working on that.
I.e. we believe we are facing a bleak future in terms of exploiting exascale machines. And it is not just NASA. I think all of the major data assimilation systems across the world are in a similar boat. They see how another ~2x of scalability can be squeezed out from code improvements but will see increasing run-times as they increase the resolution. Each 2x improvement in resolution lets us use 4x more processes but costs 8-10x more in computation. Vertical grid does not double resolution as frequently, but time-steps generally go down by a factor of 2 when the resolution doubles.
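The arithmetic behind that last point can be sketched as follows. The assumption (mine, not stated in the thread) is that the extra processes scale perfectly, so the run-time factor per doubling of resolution is simply the cost factor divided by the process factor; run_time_growth is a hypothetical name. With 4x more processes and 8-10x more computation, each doubling then multiplies run time by 2 to 2.5.

```python
def run_time_growth(doublings, cost_factor=8.0, process_factor=4.0):
    """Run-time multiplier after a number of resolution doublings.

    Assumption: the additional processes are used with perfect efficiency,
    so each doubling costs cost_factor / process_factor in wall-clock time.
    """
    return (cost_factor / process_factor) ** doublings


if __name__ == "__main__":
    for cost in (8.0, 10.0):
        for k in (1, 2, 3):
            print(f"cost {cost}x, {k} doubling(s): {run_time_growth(k, cost)}x run time")
```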
The usual weather/climate argument for exascale that I've seen is "finer grid -> more accurate prediction". Interesting to see that there are downsides to that argument.
How is this surprising to you? A finer grid has an increased cost regardless of domain, and nothing scales perfectly. Some algorithms can do 3D grids with 3D decomposition and fare better, but even those often have a time-step criterion that slows absolute performance even in the presence of perfect scaling. Granted this is for algorithms that solve PDEs, but that's a pretty big segment of the HPC space that Cray is in.
Note: the role of ensembles is increasing - so we'll definitely want larger computers as we go along. But there is a limit to how many members one has in an ensemble before you reach diminishing returns.
Maybe there will be a breakthrough somewhere, but this community would not be the first one to find itself unable to (fully) adapt to computing technology as it evolves. There are some serious scientific algorithms that are fundamentally serial (or close to it).
It seems that the parallelism is in the data (partition of components like air, ocean, ...), rather than in the algorithms used for computations on that data.
Again with atmos and ocean you are correct because they only share a 2D interface. But I can't, say, put pressure on one process and wind velocity on another. Various algorithms involve both quantities. The existing design is pretty much optimal from a data-locality point of view. If fabric bandwidth were infinite, then we could pursue the approach you suggest. But realistically memory access is faster than the fabric and thus it is better to have all of the data in the atmosphere (for one subdomain) on a single process with possibly a few exceptions. Unfortunately, it really is that coupled.
Cheers,
- Tom
Cheers,
Bill
Sorry for the lengthy post, but thought it might be useful to demonstrate that we've at least given some thought to this.
Cheers,
- Tom
Cheers,
Bill
On Oct 26, 2017, at 4:27 PM, Bill Long <longb at cray.com> wrote:
On Oct 26, 2017, at 2:58 PM, Van Snyder <Van.Snyder at jpl.nasa.gov> wrote:
We could say that it is a processor-dependent value greater or equal to 1023.
Ridiculously small. Why design a language for machines that were obsolete twenty years ago? We should be designing for the future, not the past.
You mean the real future where the amount of memory/core continues to decline, and where broadcasting the executable out to all the participating processors becomes slower as the size of the executable increases (because of initialized arrays)? The better option in cases like this is usually to read the data in from an unformatted file when the program starts. Maybe just to image 1 and broadcast it from there.
Cheers,
Bill
Bill Long longb at cray.com
Principal Engineer, Fortran Technical Support & voice: 651-605-9024
Bioinformatics Software Development fax: 651-605-9143
Cray Inc./ 2131 Lindau Lane/ Suite 1000/ Bloomington, MN 55425
_______________________________________________
J3 mailing list
J3 at mailman.j3-fortran.org
http://mailman.j3-fortran.org/mailman/listinfo/j3