[J3] Question to compiler developers about compilation speed
Clune, Thomas L. (GSFC-6101)
thomas.l.clune at nasa.gov
Tue Oct 29 15:32:42 UTC 2024
A bit of background: the Earth system model that I support has millions of lines of source code, so build times can be a major factor in software development. The scientists generally don't care much, since they usually build the model in batch and spend their time monitoring the execution and analyzing the results. But the model developers, and especially my team of infrastructure developers, spend a considerable amount of time building the model in order to verify changes. In particular, this creates significant latency for our CI tests.
A raw optimized build takes O(15 minutes) on a current-generation server node. This time already includes a ~10x speedup from parallelization. Unoptimized debug builds are somewhat faster (~30%), but not as much as you might expect, due to parallelization bottlenecks. Our CI is generally slower because it runs on only 2 cores in the cloud, and routinely takes > 30 minutes to pass. (And of course the infrastructure layers and their tests require less time, but we need to verify that the model as a whole still works, and that dominates.)
Current generations of HPC nodes now have O(100 cores), which leads me to ask how much faster we could build the model. We know from profiling the build that module dependencies (unsurprisingly) play a significant role in producing bottlenecks. Further, we have a relatively small number of files that are disproportionately expensive to compile on their own. We can usually split those into separate modules and achieve superlinear speedup.
My thought was to push the envelope on what submodules can do. Consider the extreme approach where every module procedure is placed into its own submodule. (And yes, > 99% of our source code is in the form of modules.) My expectation was that in this scenario the "depopulated" modules would compile quickly, and we could keep dozens of threads busy building the independent submodules. And indeed, scalability does improve with this approach, in that we can keep more threads busy than before. But I was quite surprised to find that the actual build time did not significantly improve. First, we found that each new submodule often takes a similar amount of time to compile as the original module. Second, we also found that the depopulated module was often nearly as expensive to compile as the original. The details vary, but the conclusions generally hold across multiple compilers (Intel, GNU, NAG).
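For concreteness, the extreme layout described above might look like the following sketch (module and procedure names are hypothetical, not from our actual code):

```fortran
! "Depopulated" parent module: declarations and interfaces only.
module flux_mod
  implicit none
  interface
    module function net_flux(incoming, outgoing) result(f)
      real, intent(in) :: incoming, outgoing
      real :: f
    end function net_flux
  end interface
end module flux_mod

! One submodule, in its own file, per module procedure; each can be
! compiled independently once flux_mod's interface has been compiled.
submodule (flux_mod) net_flux_impl
contains
  module function net_flux(incoming, outgoing) result(f)
    real, intent(in) :: incoming, outgoing
    real :: f
    f = incoming - outgoing
  end function net_flux
end submodule net_flux_impl
```

With dozens of such implementation files per module, the hope was that the parallel build could fan out across them, gated only by the (presumably cheap) parent module.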
Note that creating submodules has still been beneficial for avoiding recompilation cascades, but for that purpose we can put all of the module procedures into a single submodule.
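The cascade-avoidance variant needs only one submodule per module: consumers depend on the parent's interface alone, so editing an implementation body recompiles just the one submodule file, not every dependent module. Again a hypothetical sketch:

```fortran
! Parent module: interface only, rarely changes.
module state_mod
  implicit none
  interface
    module subroutine advance(dt, x)
      real, intent(in) :: dt
      real, intent(inout) :: x(:)
    end subroutine advance
  end interface
end module state_mod

! All implementations live in one submodule; changes here recompile
! only this file, leaving users of state_mod untouched.
submodule (state_mod) state_impl
contains
  module subroutine advance(dt, x)
    real, intent(in) :: dt
    real, intent(inout) :: x(:)
    x = x + dt
  end subroutine advance
end submodule state_impl
```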
My questions to compiler developers:
* Do these results surprise you?
* Is there a general explanation for why what I’m attempting is limited?
* Is there other advice on how to optimize for build time?
Thanks in advance,
* Tom
PS We’ve also looked at placing our source files on solid-state disk, but disk access does not appear to be a significant part of the build cost.