GPU Parallelization of NIM

The Boulder HPC Facility: Exploring New Computing Technologies for NOAA

Figure: The icosahedral grid used by the FIM and NIM models.

Background: The Non-hydrostatic Icosahedral Model (NIM) is a next-generation global model that builds on the success of its predecessor, the Flow-following finite volume Icosahedral Model (FIM). FIM was run at the Texas Advanced Computing Center (TACC) during the 2008 and 2009 hurricane seasons, and it is scheduled to go into operations in 2010. Both FIM and NIM use the icosahedral horizontal grid and were designed to run efficiently on thousands of processors. NIM is being developed to run at cloud-resolving scales (3-4 km) and is being designed to run on GPUs, multi-core, and other architectures.

The NIM model is divided into two basic components: dynamics and physics. NIM dynamics was developed by a team of modelers, software engineers, and parallelization experts with decades of experience in code development, parallelization, and optimization. Experience working with NVIDIA GPUs on select FIM routines allowed additional architectural considerations to be incorporated into NIM code design and development.

The Fortran-to-CUDA translator was used to generate an initial version of the code, which was further modified and optimized by hand. Parallelization of NIM is proceeding in multiple stages:

Dynamics: Single Node Parallelization

Status: Completed
Results: The CUDA code runs 25 times faster on the GPU than the optimized code does on the CPU (Intel Harpertown). We plan to compare these results, generated with our Fortran-to-CUDA translator, to those from the PGI GPU compiler (Beta version available).

Dynamics: Multiple Node Parallelization

Status: Starting in November 2009.


Status: a suitable package has not been selected.

Dynamics: Single Node Parallelization

Background: Like most weather and climate models, NIM is a streaming code: run-time performance is dominated by fetching and storing data, and in most cases very little time is spent doing calculations. NIM uses indirect addressing to access horizontal points on the grid. Earlier studies indicate negligible performance impact from indirect indexing on the CPU, because an inner k-loop amortizes the cost of each index lookup over the vertical levels. We found no performance impact on the GPU either, because run-time performance there is dominated by loads and stores to and from GPU global memory.
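
The indirect-addressing pattern described above can be sketched in host code as follows. This is a minimal illustration, not NIM source: the names `prox` and `neighbor_sum`, the count of six neighbors per cell, and the toy ring connectivity are all assumptions made for the example. Only the 96 vertical levels come from the text. The point is that each neighbor index is looked up once and then reused across the whole inner k-loop.

```cuda
#include <cstdio>
#include <vector>

// Hypothetical sizes and names; NIM's real arrays are not shown in the text.
constexpr int NZ = 96;     // vertical levels (from the text)
constexpr int NPROX = 6;   // neighbors per cell (assumed for illustration)

// Sum each cell's neighbors at every vertical level.
// Layout: field[k + NZ * cell], so the vertical index k is contiguous.
void neighbor_sum(const std::vector<int>& prox,     // [NPROX * ncells]
                  const std::vector<float>& field,  // [NZ * ncells]
                  std::vector<float>& out,          // [NZ * ncells]
                  int ncells)
{
    for (int cell = 0; cell < ncells; ++cell) {
        for (int k = 0; k < NZ; ++k) out[k + NZ * cell] = 0.0f;
        for (int n = 0; n < NPROX; ++n) {
            const int nb = prox[n + NPROX * cell];  // one indirect lookup ...
            for (int k = 0; k < NZ; ++k) {          // ... reused for all NZ levels
                out[k + NZ * cell] += field[k + NZ * nb];
            }
        }
    }
}

int main() {
    const int ncells = 8;
    std::vector<int> prox(NPROX * ncells);
    // Toy connectivity: a ring, not a real icosahedral grid.
    for (int c = 0; c < ncells; ++c)
        for (int n = 0; n < NPROX; ++n)
            prox[n + NPROX * c] = (c + n + 1) % ncells;

    std::vector<float> field(NZ * ncells, 1.0f), out(NZ * ncells);
    neighbor_sum(prox, field, out, ncells);
    std::printf("out[0] = %.1f\n", out[0]);  // each cell sums NPROX ones
    return 0;
}
```

In this sketch the lookup cost is paid `NPROX` times per cell while the inner loop runs `NPROX * NZ` times, which is the amortization the paragraph refers to.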

Optimizations we identified that helped GPU run-time performance also improved CPU performance. For example, the most CPU-intensive routine consumed over 40 percent of the dynamics runtime. This routine called a BLAS routine from an inner loop over one million times per model timestep. After we replaced this general-purpose routine with a hand-coded one, the CPU code ran 10 times faster.

Key factors in achieving 25x speedup over the optimized CPU time were:

  1. Minimize data transfers: All of NIM dynamics is executed on the GPU, eliminating the need to transfer data between the CPU and GPU during the computation. The only data transfers required are for (1) model initialization, (2) communication of halos (when we run on multiple nodes), and (3) model output.
  2. Coalesce loads and stores: Coalesced loads and stores occur when adjacent threads access data that is adjacent and aligned in memory. Thread blocks in NIM were defined over the 96 vertical levels, and data was contiguous in memory over the vertical ("k") dimension. In some routines, increasing the rank of arrays to include the vertical dimension improved coalescing of loads and stores from global memory and led to large performance gains.
  3. Maximize GPU utilization: NIM contains 96 vertical levels. Since 96 is a multiple of the warp size (32 threads), the unit of computation, good utilization of the GPU hardware was achieved. Scaling tests indicate that 96 threads per block, while fewer than the 128 or 256 threads NVIDIA recommends, is sufficient for good performance.
  4. Minimize shared memory and register use: The use of registers and shared memory has a direct impact on performance. If usage of either resource is high, the number of thread blocks that can be resident at once will be reduced, limiting occupancy and performance. The programmer has limited control over the use of registers; the only technique we have found useful is to break large routines, where register usage is high (> 50 registers per thread), into multiple kernels. Shared memory declarations are controlled by the programmer. This fast memory is only beneficial when it replaces multiple loads or stores from global memory.
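
The factors listed above can be illustrated with a small CUDA sketch. It is not taken from NIM: the kernel name, the `prox` index array, the six-neighbor count, and the toy ring connectivity are assumptions. It launches one block per horizontal cell with 96 threads (one per vertical level), keeps k contiguous so that adjacent threads touch adjacent addresses, copies data to the GPU once and back once, and asks the runtime how many such blocks can be resident per multiprocessor. The occupancy query is a newer runtime call than the 2009-era toolkit would have offered and is used here only as a convenient way to inspect the effect of register and shared memory use.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s\n",                    \
                         cudaGetErrorString(err_));                     \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

constexpr int NZ = 96;     // vertical levels = threads per block (3 warps)
constexpr int NPROX = 6;   // neighbors per cell (assumed for illustration)

// One block per horizontal cell, one thread per vertical level.
__global__ void neighbor_sum_kernel(const int* prox, const float* field,
                                    float* out)
{
    const int cell = blockIdx.x;
    const int k = threadIdx.x;
    float sum = 0.0f;
    for (int n = 0; n < NPROX; ++n) {
        const int nb = prox[n + NPROX * cell];  // same address for the whole block
        sum += field[k + NZ * nb];              // adjacent threads, adjacent floats
    }
    out[k + NZ * cell] = sum;                   // coalesced store
}

// Runs the kernel on a toy grid and returns the first output value.
float run_neighbor_sum(int ncells)
{
    std::vector<int> h_prox(NPROX * ncells);
    for (int c = 0; c < ncells; ++c)
        for (int n = 0; n < NPROX; ++n)
            h_prox[n + NPROX * c] = (c + n + 1) % ncells;  // toy ring, not a real grid
    std::vector<float> h_field(NZ * ncells, 1.0f), h_out(NZ * ncells);

    int* d_prox = nullptr;
    float* d_field = nullptr;
    float* d_out = nullptr;
    CHECK(cudaMalloc(&d_prox, h_prox.size() * sizeof(int)));
    CHECK(cudaMalloc(&d_field, h_field.size() * sizeof(float)));
    CHECK(cudaMalloc(&d_out, h_out.size() * sizeof(float)));

    // Transfer at initialization only.
    CHECK(cudaMemcpy(d_prox, h_prox.data(), h_prox.size() * sizeof(int),
                     cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_field, h_field.data(), h_field.size() * sizeof(float),
                     cudaMemcpyHostToDevice));

    neighbor_sum_kernel<<<ncells, NZ>>>(d_prox, d_field, d_out);
    CHECK(cudaGetLastError());

    // Transfer for output only (this copy also waits for the kernel).
    CHECK(cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
                     cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_prox));
    CHECK(cudaFree(d_field));
    CHECK(cudaFree(d_out));
    return h_out[0];
}

int main() {
    std::printf("out[0] = %.1f\n", run_neighbor_sum(1024));

    // How many 96-thread blocks fit on one multiprocessor for this kernel?
    int blocks_per_sm = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, neighbor_sum_kernel, NZ, 0));
    std::printf("resident blocks per SM at %d threads/block: %d\n",
                NZ, blocks_per_sm);
    return 0;
}
```

Because each cell's column is 96 floats (384 bytes), every column in this layout starts on a 128-byte boundary relative to the allocation, so the three warps of a block each read one aligned, contiguous segment.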

Dynamics: Multiple Node Parallelization

The Scalable Modeling System (SMS) was used for distributed-memory parallelization on the CPUs and resulted in good performance and scaling. We expect that only minor code changes will be needed to extend this to multiple GPU nodes. We expect to use pinned memory to improve transfer times between CPU and GPU.
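
As a rough sketch of what pinned (page-locked) memory could look like for halo transfers, the example below stages a halo through pinned host buffers with asynchronous copies. It is an assumption-laden illustration rather than the planned NIM or SMS implementation: the function name, buffer names, and halo size are hypothetical, and the inter-node exchange is replaced by a simple host loop.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s\n",                    \
                         cudaGetErrorString(err_));                     \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

// Moves a halo of n floats GPU -> CPU -> GPU through pinned buffers.
// Returns the first value received back on the host side.
float exchange_halo_pinned(size_t n)
{
    const size_t bytes = n * sizeof(float);
    float* h_send = nullptr;  // pinned host buffer for outgoing halo
    float* h_recv = nullptr;  // pinned host buffer for incoming halo
    float* d_send = nullptr;
    float* d_recv = nullptr;
    CHECK(cudaMallocHost(&h_send, bytes));  // page-locked allocation
    CHECK(cudaMallocHost(&h_recv, bytes));
    CHECK(cudaMalloc(&d_send, bytes));
    CHECK(cudaMalloc(&d_recv, bytes));
    CHECK(cudaMemset(d_send, 0, bytes));    // outgoing halo: all 0.0f

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Pinned buffers let the copy proceed by DMA without an extra staging copy.
    CHECK(cudaMemcpyAsync(h_send, d_send, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Stand-in for the inter-node exchange (hypothetical; not SMS or MPI).
    for (size_t i = 0; i < n; ++i) h_recv[i] = h_send[i] + 1.0f;

    CHECK(cudaMemcpyAsync(d_recv, h_recv, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    const float first = h_recv[0];
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_send));
    CHECK(cudaFree(d_recv));
    CHECK(cudaFreeHost(h_send));
    CHECK(cudaFreeHost(h_recv));
    return first;
}

int main() {
    // Hypothetical halo: 96 levels x 1024 boundary points.
    std::printf("h_recv[0] = %.1f\n", exchange_halo_pinned(96 * 1024));
    return 0;
}
```

Pinned allocations reduce the amount of memory the operating system can page, so a design along these lines would typically allocate the halo buffers once at start-up and reuse them every timestep.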