Computational Science – ICCS 2020. 2020 May 26;12137:102–117. doi: 10.1007/978-3-030-50371-0_8

A Massively Parallel Algorithm for the Three-Dimensional Navier-Stokes-Boussinesq Simulations of the Atmospheric Phenomena

Maciej Paszyński, Leszek Siwik, Krzysztof Podsiadło, Peter Minev
Editors: Valeria V. Krzhizhanovskaya, Gábor Závodszky, Michael H. Lees, Jack J. Dongarra, Peter M. A. Sloot, Sérgio Brissos, João Teixeira
PMCID: PMC7302277

Abstract

We present a massively parallel solver using the direction splitting technique and stabilized time-integration schemes for the solution of the three-dimensional non-stationary Navier-Stokes-Boussinesq equations. The model can be used for modeling atmospheric phenomena. The time integration scheme enables an efficient direction splitting algorithm with a finite difference solver. We show how to incorporate the terrain geometry into the simulation and how to perform the domain decomposition. The computational cost is linear, O(N), over each sub-domain, and close to O(N/c) in parallel over 1024 processors, where N is the number of unknowns and c is the number of cores. This holds even when we run the parallel simulator over complex terrain geometry. We analyze the parallel scalability experimentally up to 1024 processors on the PROMETHEUS Linux cluster with multi-core processors. The weak scalability of the code shows that increasing the number of sub-domains and processors from 4 to 1024, where each processor processes a subdomain of Inline graphic internal points (Inline graphic box), increases the total computational time from 120 s to 178 s for a single time step. Thus, we can perform a single time step with over 1,128,000,000 unknowns within 3 min on a Linux cluster. The number of unknowns results from the fact that we have three components of the velocity vector field, one component of the pressure, and one component of the temperature scalar field over 256,000,000 mesh points. The direction splitting solver is not an iterative solver; it solves the system accurately since it is equivalent to Gaussian elimination. Our code is interfaced with the mesh generator reading the NASA database and providing the Earth terrain map. The goal of the project is to provide a reliable tool for parallel, fully three-dimensional computations of atmospheric phenomena.

Keywords: Massive parallel computations, Alternating direction solver, Navier-Stokes Boussinesq, Finite difference method

Introduction

Air pollution is receiving a lot of attention nowadays. It is especially visible in the Kraków area in Poland (compare Fig. 1), as Kraków is one of the most polluted cities in Europe [1]. People living there are increasingly aware of the problem, which has given rise to various movements that try to improve air quality. Air pollution grows because of multiple factors, including traffic, climate, heating in the winter, and the city's architecture. The ability to model atmospheric phenomena such as thermal inversion over complicated terrain is crucial for reliable simulations of air pollution. Thermal inversion occurs when a layer of warm air stays over a layer of cool air; the warm air holds down the cool air and prevents pollutants from rising and scattering.

Fig. 1. Pollution with fog and thermal inversion over the same area near Kraków between October 2019 and January 2020 (photos by Maciej Paszyński)

We present a massively parallel solver using the direction splitting technique and stabilized time-integration schemes for the solution of the three-dimensional non-stationary Navier-Stokes-Boussinesq equations.

The Navier-Stokes-Boussinesq system is widely applied for modeling atmospheric phenomena [2], oceanic flows [3], and geodynamics [4]. The model can be used for modeling atmospheric phenomena, in particular those resulting in a thermal inversion, as well as several other important atmospheric phenomena [5, 6]. It may even be possible to run a climate simulation of the entire Earth's atmosphere using the approach presented here. The time integration scheme results in a Kronecker product structure of the matrices, which enables an efficient direction splitting algorithm with a finite difference solver [7], since the matrix is a Kronecker product of three tri-diagonal matrices resulting from the discretizations along the x, y, and z axes. The direction splitting solver is not an iterative solver; it is equivalent to the Gaussian elimination algorithm.

We show how to extend the alternating directions solver into non-regular geometries, including the terrain data, still preserving the linear computational cost of the solver. We follow the idea originally used in [8] for sequential computations of particle flow. In this paper, we focus on parallel computations, and we describe how to compute the Schur complements in parallel with linear cost, and how to aggregate them further and still have a tri-diagonal matrix that can be factorized with a linear computational cost using the Thomas algorithm. We also show how to modify the algorithm to work over the complicated non-regular terrain structure and still preserve the linear computational cost.

Thus, if well parallelized, the parallel factorization cost is close to O(N/c) in every time step, where N is the number of unknowns and c is the number of cores. We analyze the parallel scalability of the code up to 1024 multi-core processors on the PROMETHEUS Linux cluster [9] from the CYFRONET supercomputing center. Each subdomain is processed with an Inline graphic finite difference mesh. Our code is interfaced with the mesh generator [10] reading the NASA database [11] and providing the Earth terrain map. The goal of the project is to provide a reliable tool for parallel, fully three-dimensional computations of the atmospheric phenomena resulting in thermal inversion and pollution propagation.

In this paper, we focus on the description and scalability of the parallel solver algorithm, leaving the model formulation and large-scale massively parallel simulations of different atmospheric phenomena for future work. This is a challenging task in itself, as it requires acquiring reliable data for the initial state, forcing, and boundary conditions.

Navier-Stokes Boussinesq Equations

The equations in the strong form are

[Equations (1)-(5): not reproduced in this text version]

where u is the velocity vector field, p is the pressure, Pr is the Prandtl number, g is the gravity force, Ra is the Rayleigh number, and T is the temperature scalar field.
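Because the displayed equations are not available in this text version, the following is only a sketch of a commonly used non-dimensional form of the Navier-Stokes-Boussinesq system, written with the symbols named above (u, p, T, and Pr, Ra, g for the Prandtl number, the Rayleigh number, and gravity). The domain \Omega, the final time t_f, and the sign convention for g are assumptions made here and may differ from the authors' formulation; boundary and initial conditions are omitted.

```latex
\begin{aligned}
\partial_t \mathbf{u} + (\mathbf{u}\cdot\nabla)\,\mathbf{u} + \nabla p
  - \mathrm{Pr}\,\Delta \mathbf{u} &= \mathbf{g}\,\mathrm{Pr}\,\mathrm{Ra}\,T
  && \text{in } \Omega \times (0, t_f], \\
\nabla\cdot\mathbf{u} &= 0
  && \text{in } \Omega \times (0, t_f], \\
\partial_t T + \mathbf{u}\cdot\nabla T - \Delta T &= 0
  && \text{in } \Omega \times (0, t_f].
\end{aligned}
```

In this form the temperature drives the flow only through the buoyancy term on the right-hand side of the momentum equation, while the velocity transports the temperature through the advection term.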

We discretize using the finite difference method in space and a time integration scheme that results in a Kronecker product structure of the matrices.

We use a second-order in time, unconditionally stable time integration scheme for the temperature equation and for the Navier-Stokes equation, with a predictor-corrector scheme for the pressure. For example, we can use the Douglas-Gunn scheme [13], performing a uniform partition of the time interval Inline graphic as

[Unnumbered equation: not reproduced in this text version]

and denoting Inline graphic. In the Douglas-Gunn scheme, we integrate the solution from time step Inline graphic to Inline graphic in three substeps as follows:

[Equation (6): not reproduced in this text version]
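For orientation, the classical Douglas-Gunn splitting [13] for a problem of the form \partial_t u + (L_1 + L_2 + L_3) u = f can be written as below. The notation (time step \tau, directional operators L_1, L_2, L_3, forcing f^{n+1/2}) is assumed here for illustration and may differ from the notation used in (6).

```latex
\begin{aligned}
\left(1 + \tfrac{\tau}{2} L_1\right) u^{n+1/3} &= \tau f^{n+1/2}
  + \left(1 - \tfrac{\tau}{2} L_1 - \tau L_2 - \tau L_3\right) u^{n}, \\
\left(1 + \tfrac{\tau}{2} L_2\right) u^{n+2/3} &= u^{n+1/3} + \tfrac{\tau}{2} L_2\, u^{n}, \\
\left(1 + \tfrac{\tau}{2} L_3\right) u^{n+1} &= u^{n+2/3} + \tfrac{\tau}{2} L_3\, u^{n}.
\end{aligned}
```

Eliminating the intermediate values gives (1 + \tau L_1/2)(1 + \tau L_2/2)(1 + \tau L_3/2)(u^{n+1} - u^n) = \tau (f^{n+1/2} - (L_1 + L_2 + L_3) u^n), a factorized perturbation of the Crank-Nicolson scheme. Each substep inverts an operator that contains derivatives in a single direction only.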

For the Navier-Stokes equations, Inline graphic, Inline graphic, and Inline graphic, and the forcing term represents Inline graphic plus the convective flow and the pressure terms Inline graphic treated explicitly as well. The pressure is computed with the predictor/corrector scheme. Namely, the predictor step

[Equation (7): not reproduced in this text version]

with Inline graphic and Inline graphic computes the pressure to be used in the velocity computations, the penalty steps

[Equation (8): not reproduced in this text version]

and the corrector step updates the pressure field based on the velocity results and the penalty step

[Equation (9): not reproduced in this text version]

These steps are carefully designed to stabilize the equations as well as to ensure the Kronecker product structure of the matrix, which results in a solver with linear computational cost. The mathematical proofs of the stability of the formulations, motivating such predictor/corrector (penalty) steps, can be found in [7, 12] and the references therein.

For the temperature equation, Inline graphic, Inline graphic, and Inline graphic, and the forcing term represents the advection term treated explicitly Inline graphic.

For mathematical details on the problem formulation and its mathematical properties, we refer to [12].

Each equation in our scheme contains only derivatives in one direction, so they are of the following form

[Equation (10): not reproduced in this text version]

or the update of the pressure scalar field. Thus, when employing the finite difference method, we end up either with Kronecker product matrices whose sub-matrices are tri-diagonal, or with point-wise updates of the pressure field

[Equation (11): not reproduced in this text version]

where Inline graphic or Inline graphic, depending on the equation, which is equivalent to

[Equation (12): not reproduced in this text version]

These systems have a Kronecker product structure Inline graphic, where the sub-matrices are aligned along the three axes of the coordinate system; one of these sub-matrices is tri-diagonal, and the other two are scaled identity matrices. From the point of view of the parallel matrix computations discussed in this paper, it is important that in every time step we have to factorize in parallel a system of linear equations with the Kronecker product structure.

Factorization of the System of Equations Possessing the Kronecker Product Structure

The direction splitting algorithm for the Kronecker product matrices implements three steps, whose result is equivalent to the Gaussian elimination algorithm [14], since

[Equation (13): not reproduced in this text version]

Each of the three systems is tri-diagonal,

[Equation (14): not reproduced in this text version]

and we can solve it at a linear O(N) computational cost. First, we solve along the x direction, second, along the y direction, and third, along the z direction.
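The three directional solves can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' Fortran code: the helper names (`thomas_solve`, `tri_matvec`, `along_axis`, `model_tri`) and the model operator are hypothetical, and the grid is a small nested list. Each grid line is solved with the Thomas algorithm, and the result is checked by applying the three tri-diagonal operators back to the solution.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tri-diagonal system in O(n) operations (Thomas algorithm).

    Row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i];
    lower[0] and upper[-1] are ignored. No pivoting is performed, so the
    matrix is assumed to be, for example, diagonally dominant.
    """
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):                      # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # backward substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def tri_matvec(lower, diag, upper, x):
    """Multiply a tri-diagonal matrix by a vector (used only for checking)."""
    n = len(x)
    return [(lower[i] * x[i - 1] if i > 0 else 0.0) + diag[i] * x[i]
            + (upper[i] * x[i + 1] if i < n - 1 else 0.0) for i in range(n)]


def along_axis(grid, axis, fn):
    """Apply fn (list -> list) to every grid line parallel to the given axis."""
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    out = [[[0.0] * dims[2] for _ in range(dims[1])] for _ in range(dims[0])]
    a, b = [ax for ax in range(3) if ax != axis]
    for p in range(dims[a]):
        for q in range(dims[b]):
            ijk = [0, 0, 0]
            ijk[a], ijk[b] = p, q
            line = []
            for m in range(dims[axis]):
                ijk[axis] = m
                line.append(grid[ijk[0]][ijk[1]][ijk[2]])
            for m, value in enumerate(fn(line)):
                ijk[axis] = m
                out[ijk[0]][ijk[1]][ijk[2]] = value
    return out


def model_tri(n, tau=0.1):
    """Hypothetical 1D operator I - tau * (second difference)."""
    return [-tau] * n, [1.0 + 2.0 * tau] * n, [-tau] * n


dims = (3, 4, 5)
tris = [model_tri(n) for n in dims]            # operators along x, y, z
f = [[[float(i + 2 * j + 3 * k) for k in range(dims[2])]
      for j in range(dims[1])] for i in range(dims[0])]

u = f
for axis, tri in enumerate(tris):              # solve along x, then y, then z
    u = along_axis(u, axis, lambda line, t=tri: thomas_solve(*t, line))

check = u
for axis, tri in enumerate(tris):              # apply the Kronecker product operator
    check = along_axis(check, axis, lambda line, t=tri: tri_matvec(*t, line))

residual = max(abs(check[i][j][k] - f[i][j][k])
               for i in range(dims[0]) for j in range(dims[1]) for k in range(dims[2]))
print("max residual:", residual)
```

Every grid point is visited a fixed number of times per direction, so the total work grows linearly with the number of unknowns.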

Introduction of the Terrain

To obtain a reliable three-dimensional simulator of the atmospheric phenomena, we interconnect several components. We interface our code with a mesh generator that provides an excellent approximation to the topography of the area [10], based on the NASA database [11]. The resulting mesh generated for the Kraków area is presented in Fig. 2.

Fig. 2. The computational mesh generated based on the NASA database, representing the topography of the Kraków area.

In our system of linear equations, we have several tri-diagonal systems with multiple right-hand sides, factorized along the x, y, and z directions. Each unknown in the system represents one point of the computational mesh. In the first system, the rows are ordered according to the x coordinates of the points. In the second system, the rows are ordered according to the y coordinates of the points, and in the third system, according to the z coordinates. When simulating atmospheric phenomena such as thermal inversion over a prescribed terrain with the alternating directions solver and the finite difference method, we check whether a given point is located in the computational domain. The unknowns representing points located inside the terrain (outside the atmospheric domain) are removed from the system of equations. This is done by identifying the indexes of the points along the x, y, and z axes in the three systems of coordinates. Then, we modify the systems of equations so that the corresponding row in each of the three systems is reset to 0 with the diagonal entry set to 1, and the corresponding entries of the three right-hand sides are set to 0.

For example, if we want to remove the point (r, s, t) from the system, we perform the following modification in the first system.

The rows in the first system follow the numbering of points along the x axis. The number of right-hand side columns corresponds to the number of lines along the x axis perpendicular to the OYZ plane. Each column of the right-hand side corresponds to the (y, z) coordinates of a point in the OYZ plane. We select the column corresponding to the point (s, t). We factorize the system with this column separately, replacing the r-th row of the matrix by the identity on the diagonal and putting zero on the right-hand side. The other columns in the first system are factorized in the standard way.

[Equation (15): not reproduced in this text version]

An analogous situation applies to the second system, this time with right-hand side columns representing lines perpendicular to the OXZ plane. We factorize the (r, t) column in the second system separately, setting the row in the matrix to the identity on the diagonal and using 0.0 on the right-hand side. The other columns in the second system are factorized in the standard way.

[Equation (16): not reproduced in this text version]

Similarly, in the third system we factorize the (r, s) column separately. The other columns in the third system are factorized in the standard way.

[Equation (17): not reproduced in this text version]

Using this trick for all the points in the terrain, we can factorize the Kronecker product system at linear computational cost over the complex terrain geometry.
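The row replacement described above can be illustrated on a single grid line. The sketch below is an assumption-laden toy example: the helper `mask_terrain_rows`, the flag list `inside_terrain`, and the model coefficients are hypothetical, and only one right-hand side column is treated. Rows that belong to terrain points become identity rows with a zero right-hand side, and the line is then solved with the usual Thomas algorithm.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored, no pivoting."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def mask_terrain_rows(lower, diag, upper, rhs, inside_terrain):
    """Return copies in which terrain rows are identity rows with zero RHS."""
    lower, diag, upper, rhs = list(lower), list(diag), list(upper), list(rhs)
    for i, masked in enumerate(inside_terrain):
        if masked:
            lower[i], diag[i], upper[i], rhs[i] = 0.0, 1.0, 0.0, 0.0
    return lower, diag, upper, rhs


# One grid line along x with six points; points 2 and 3 lie inside the terrain.
n = 6
lower, diag, upper = [-1.0] * n, [4.0] * n, [-1.0] * n
rhs = [float(i + 1) for i in range(n)]
inside_terrain = [False, False, True, True, False, False]

masked_system = mask_terrain_rows(lower, diag, upper, rhs, inside_terrain)
u = thomas_solve(*masked_system)
print(u)
```

The matrix stays tri-diagonal after the modification, so the cost per line is unchanged, and the solution is exactly zero at the masked points while the remaining rows see a zero value at their terrain neighbors.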

Parallel Factorization with Domain Decomposition Preserving the Linear Computational Cost

The computational domain is decomposed into several cube-shaped sub-domains. We generate systems of linear equations over each sub-domain separately, and we enumerate the variables so that the interface unknowns are located at the end of the matrices. We compute the Schur complement of the interior variables with respect to the interface variables, in parallel over each of the subdomains. The important observation is that the Schur complement matrices are also tri-diagonal. This is because the subdomain matrix is tri-diagonal, and the Schur complement computation can be implemented as forward eliminations performed in the three sub-systems, each of them stopped after processing the interior nodes of the particular system. Later, we aggregate the Schur complements into one global matrix by a global gather operation. This matrix is also tri-diagonal and can be factorized at linear cost. Finally, we scatter the solution, and we use the partial solutions from the global matrix to run the backward substitution in each of the systems in parallel. These operations are illustrated in Fig. 3.

Fig. 3. Illustration of the parallel solver algorithm.

We perform this operation three times, for three submatrices of the Kronecker product matrix, defined along three axes of the coordinate system. We provide algebraic details below.

Thus, assuming we have Inline graphic rows to factorize in the first system (Inline graphic rows in the interior, and Inline graphic rows on the interface), we run the forward elimination over the first matrix, along the x direction, and we stop it before processing the r-th row (denoted by red color). This partial forward elimination, stopped at the r-th row, ensures that below that row we have the Schur complement of the first Inline graphic rows, related to the interior points of the domain, with respect to the next Inline graphic rows, related to the interface points (the Schur complement is denoted by blue color). This Schur complement matrix is indeed tri-diagonal:

[Equations (18)-(19): not reproduced in this text version]

We perform this operation on every sub-domain, then we gather the tri-diagonal Schur complements on processor one and aggregate them into one matrix along the x direction. The matrix is still tri-diagonal, and we solve it at a linear O(N) computational cost using Gaussian elimination in the form of the Thomas algorithm.

Next, we scatter the partial solutions and substitute them into the sub-systems over the subdomains. We do it by replacing the last Inline graphic rows by the identity matrix and placing the solutions into the right-hand side blocks. Namely, on the right-hand side we replace rows from Inline graphic (denoted by blue color) by the solution obtained in the global phase, to obtain:

[Equations (20)-(21): not reproduced in this text version]

and we run backward substitutions over each subdomain in parallel.
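The interior elimination, the global interface solve, and the back substitution can be sketched for one tri-diagonal system as follows. This is a sequential illustration under stated assumptions, not the authors' implementation: the function `schur_solve` and the list `interface` are hypothetical, the Schur complement is formed with explicit interior solves instead of the stopped forward elimination (the resulting matrix is the same), and the loop over subdomains stands in for the parallel phase with gather and scatter.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored, no pivoting."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def schur_solve(lower, diag, upper, rhs, interface):
    """Solve a tri-diagonal system via a tri-diagonal interface (Schur) system.

    interface is a sorted list of row indices separating the subdomains.
    """
    n, m = len(diag), len(interface)
    s_lo, s_up = [0.0] * m, [0.0] * m
    s_di = [diag[g] for g in interface]
    s_rhs = [rhs[g] for g in interface]
    bounds = [-1] + list(interface) + [n]
    local = []
    for k in range(m + 1):                     # "parallel" phase, one subdomain each
        left, right = bounds[k], bounds[k + 1]
        lo, hi = left + 1, right - 1           # interior rows lo..hi
        has_left, has_right = left >= 0, right < n
        if lo > hi:                            # neighbouring interface rows
            if has_left and has_right:
                s_up[k - 1], s_lo[k] = upper[left], lower[right]
            local.append(None)
            continue
        q = hi - lo + 1
        tri = (lower[lo:hi + 1], diag[lo:hi + 1], upper[lo:hi + 1])
        e_first = [lower[lo] if has_left else 0.0] + [0.0] * (q - 1)
        e_last = [0.0] * (q - 1) + [upper[hi] if has_right else 0.0]
        z = thomas_solve(*tri, rhs[lo:hi + 1])  # interior response to the RHS
        x = thomas_solve(*tri, e_first)         # response to the left interface
        y = thomas_solve(*tri, e_last)          # response to the right interface
        if has_left:
            s_di[k - 1] -= upper[left] * x[0]
            s_up[k - 1] -= upper[left] * y[0]
            s_rhs[k - 1] -= upper[left] * z[0]
        if has_right:
            s_lo[k] -= lower[right] * x[-1]
            s_di[k] -= lower[right] * y[-1]
            s_rhs[k] -= lower[right] * z[-1]
        local.append((lo, z, x, y))
    u_if = thomas_solve(s_lo, s_di, s_up, s_rhs)  # global tri-diagonal interface solve
    u = [0.0] * n
    for k, g in enumerate(interface):
        u[g] = u_if[k]
    for k, piece in enumerate(local):          # back substitution per subdomain
        if piece is None:
            continue
        lo, z, x, y = piece
        ul = u[bounds[k]] if bounds[k] >= 0 else 0.0
        ur = u[bounds[k + 1]] if bounds[k + 1] < n else 0.0
        for j in range(len(z)):
            u[lo + j] = z[j] - x[j] * ul - y[j] * ur
    return u


n = 11
lower, diag, upper = [-1.0] * n, [4.0] * n, [-1.0] * n
rhs = [float(i + 1) for i in range(n)]
u_direct = thomas_solve(lower, diag, upper, rhs)
u_schur = schur_solve(lower, diag, upper, rhs, interface=[3, 7])
print(max(abs(a - b) for a, b in zip(u_direct, u_schur)))
```

Each subdomain performs a fixed number of tri-diagonal solves over its interior rows, and the interface system has one row per interface unknown, so all three phases keep a linear cost.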

Next, we plug the solutions into the right-hand side of the second system, along the y axis, and we continue with the partial factorization. Now, we have Inline graphic rows in the interior and Inline graphic rows on the interface.

We compute the Schur complements in the same way as for the first sub-system, so we skip the algebraic details here. We perform this operation on every sub-domain, then we collect the Schur complements on processor one and aggregate them into the global matrix along the y direction. The global matrix is tri-diagonal, and we solve it with the Thomas algorithm. Next, we scatter the partial solutions, substitute them into the sub-system on each subdomain, and solve by backward substitutions.

Finally, we plug the solution into the right-hand side of the third system, along the z axis, and we continue with the partial factorization. Now, we have Inline graphic rows in the interior and Inline graphic rows on the interface. The partial eliminations follow the same lines as for the two other directions, so we skip the algebraic details.

We repeat the computations for this third direction: we compute the Schur complements on every sub-domain and collect them into one global system, which is still tri-diagonal and can be solved with the linear-cost Thomas algorithm.

Next, we substitute the partial solution into the sub-systems. We replace the last Inline graphic rows by the identity matrix, place the solutions into the right-hand side, and run the backward substitutions over each subdomain in parallel.

Parallel Scalability

[Algorithm 1 and Algorithm 2: listings not reproduced in this text version]

The solver is implemented in Fortran 95, with OpenMP (see Algorithm 1) and MPI used for parallelization. It does not use any other libraries, and the code is highly optimized. We report in Fig. 4 and Table 1 the weak scalability for three different subdomain sizes, Inline graphic, Inline graphic, and Inline graphic. The weak scalability for the subdomains of Inline graphic internal points shows that increasing the number of processors from 4 to 1024, while simultaneously increasing the number of subdomains from 4 to 1024 and the problem size from Inline graphic to Inline graphic, increases the total computational time from 120 s to 178 s for a single time step. Thus, we can perform a single time step with over 1,128,000,000 unknowns (three components of the velocity vector field, and one component each of the pressure and the temperature scalar fields, over 256,000,000 mesh points) within 3 min on a cluster. For the numerical verification of the code, we refer to [15].

Fig. 4. Weak scalability for subdomains with Inline graphic, Inline graphic, and Inline graphic internal points, one subdomain per processor, up to 1024 processors (subdomains). We increase the problem size with the number of processors. For the ideal parallel code, the execution time remains constant.

Table 1. Weak scalability up to 1024 processors (subdomains). Each grid box contains one subdomain with Inline graphic, Inline graphic, or Inline graphic internal points, respectively, one subdomain per processor. A dash marks a configuration with no reported time.

Subdomains = Processors   Grid          Inline graphic Time [s]   Inline graphic Time [s]   Inline graphic Time [s]
1                         (1, 1, 1)     19                        58                        -
2                         (1, 1, 2)     23                        63                        -
4                         (1, 1, 4)     23                        66                        120
8                         (2, 1, 4)     63                        85                        157
16                        (2, 2, 4)     36                        97                        152
32                        (4, 2, 4)     42                        100                       150
64                        (4, 4, 4)     49                        115                       157
128                       (8, 4, 4)     63                        129                       160
256                       (8, 8, 4)     72                        144                       166
512                       (16, 8, 4)    -                         -                         170
1024                      (16, 16, 4)   -                         -                         178
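The weak scalability figures can be turned into a parallel efficiency with a few lines of arithmetic. The sketch below uses the times reported in Table 1 for the largest subdomain size; defining the efficiency as the time on 4 processors divided by the time on p processors is a convention chosen here for illustration, not a quantity reported by the authors.

```python
# Time [s] per time step for the largest subdomain size (last column of Table 1).
weak_times = {4: 120, 8: 157, 16: 152, 32: 150, 64: 157,
              128: 160, 256: 166, 512: 170, 1024: 178}

base = weak_times[4]                           # reference run on 4 processors
efficiency = {p: base / t for p, t in weak_times.items()}

for p, e in efficiency.items():
    print(f"{p:5d} processors: weak-scaling efficiency {e:.2f}")
```

With this definition, the increase from 120 s to 178 s corresponds to an efficiency of about 0.67 on 1024 processors relative to the 4-processor run.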

We report in Fig. 5 and Table 2 the strong scalability for six different simulations, each one with box size Inline graphic, with 8, 16, 32, 64, 128 and 256 subdomains. Since the number of unknowns is the number of nodes multiplied by the number of fields (three components of the velocity vector field, one component of the pressure scalar field, and one component of the temperature scalar field), we obtained between Inline graphic million and Inline graphic million unknowns. The plots show superlinear speedup, which is related to better cache usage on smaller subdomains, together with optimized memory transfers to the computational kernel and the loop unrolling technique, as illustrated in Algorithm 2.

Fig. 5. Strong scalability for meshes of different sizes, for different numbers of processors. The larger meshes can be run only on the maximum number of processors.

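Table 2. Strong scalability up to 256 processors. Columns give the time [s] for the given number of processors; a dash marks a configuration with no reported time.

ndofs * 1,000,000   4     8     16    32    64    128   256
5                   120   63    -     -     -     -     -
10                  -     157   36    -     -     -     -
20                  -     -     152   42    -     -     -
40                  -     -     -     150   49    -     -
80                  -     -     -     -     157   63    -
160                 -     -     -     -     -     160   72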
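The superlinear speedup can be read off Table 2 by comparing, for each mesh, the time on p processors with the time on 2p processors. The sketch below assumes that the two times in each row belong to consecutive processor counts (4 and 8 for the first row, 8 and 16 for the second, and so on); this column assignment is inferred from the matching entries of Table 1 and should be treated as an assumption.

```python
# (ndofs in millions, p, time on p [s], time on 2p [s]), as read from Table 2.
strong_rows = [
    (5, 4, 120, 63),
    (10, 8, 157, 36),
    (20, 16, 152, 42),
    (40, 32, 150, 49),
    (80, 64, 157, 63),
    (160, 128, 160, 72),
]

# Doubling the number of processors ideally halves the time (speedup 2).
speedups = {ndofs: t_p / t_2p for ndofs, _, t_p, t_2p in strong_rows}
superlinear = [ndofs for ndofs, s in speedups.items() if s > 2.0]

for ndofs, p, t_p, t_2p in strong_rows:
    print(f"{ndofs:4d}M dofs, {p:3d} -> {2 * p:3d} processors: "
          f"speedup {speedups[ndofs]:.2f}")
```

Under this reading, five of the six meshes run more than twice as fast on twice as many processors, which is the superlinear behavior attributed in the text to better cache usage on smaller subdomains.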

In Fig. 6, we show some snapshots from the preliminary simulations. Here, we focused on the description and scalability of the parallel solver algorithm, leaving the model formulation and large-scale massively parallel simulations of different atmospheric phenomena for future work. This will be a challenging task in itself, as it requires acquiring reliable data for the initial state, forcing, and boundary conditions.

Fig. 6. Snapshots from the simulation

Conclusions

We described a parallel algorithm for the factorization of Kronecker product matrices. These matrices result from the finite-difference discretizations of the Navier-Stokes-Boussinesq equations. The algorithm allows for simulations over non-regular terrain topography. We showed that the Schur complements are tri-diagonal, and that they can be computed, aggregated, and factorized at linear computational cost. We analyzed the weak scalability on the PROMETHEUS Linux cluster from the CYFRONET supercomputing center. We assigned a subdomain with an Inline graphic finite difference mesh to each processor, and we increased the number of processors from 4 to 1024. The total execution time for a single time step increased from 120 s to 178 s. Thus, we could perform computations for a single time step with around 1,128,000,000 unknowns within 3 min on a Linux cluster. This corresponds to 5 scalar fields over 256,000,000 mesh points. In future work, we plan to formulate the model parameters, initial state, forcing, and boundary conditions to perform massively parallel simulations of different atmospheric phenomena.

Acknowledgments

This work and the visit of Prof. Petar Minev at AGH University are supported by the National Science Centre, Poland, grant no. 2017/26/M/ST1/00281.

Contributor Information

Valeria V. Krzhizhanovskaya, Email: V.Krzhizhanovskaya@uva.nl

Gábor Závodszky, Email: G.Zavodszky@uva.nl.

Michael H. Lees, Email: m.h.lees@uva.nl

Jack J. Dongarra, Email: dongarra@icl.utk.edu

Peter M. A. Sloot, Email: p.m.a.sloot@uva.nl

Sérgio Brissos, Email: sergio.brissos@intellegibilis.com.

João Teixeira, Email: joao.teixeira@intellegibilis.com.

Maciej Paszyński, Email: paszynsk@agh.edu.pl.

References

1. European Environment Agency: Air Quality in Europe - 2017 report. 13/2017
2. Marras S, et al. A review of element-based Galerkin methods for numerical weather prediction: finite elements, spectral elements, and discontinuous Galerkin. Arch. Comput. Methods Eng. 2016;23:673–722. doi: 10.1007/s11831-015-9152-1.
3. Song Y, Hou T. Parametric vertical coordinate formulation for multiscale, Boussinesq, and nonBoussinesq ocean modeling. Ocean. Model. 2006;11:298–332. doi: 10.1016/j.ocemod.2005.01.001.
4. Schaeffer N, Jault D, Nataf H-C, Furnier A. Turbulent geodynamo simulations: a leap towards Earth’s core. Geophys. J. Int. 2017;211(1):1–29. doi: 10.1093/gji/ggx265.
5. Zhang Z, Moore JC. Mathematical and Physical Fundamentals of Climate Change, Chap. 11 - Atmospheric Dynamics, pp. 347–405 (2015)
6. Zeytounian R. Asymptotic Modeling of Atmospheric Flows. Heidelberg: Springer; 1990.
7. Guermond J-L, Minev PD. High-order time stepping for the Navier-Stokes equations with minimal computational complexity. J. Comput. Appl. Math. 2017;310:92–103. doi: 10.1016/j.cam.2016.04.033.
8. Keating J, Minev P. A fast algorithm for direct simulation of particulate flows using conforming grids. J. Comput. Phys. 2013;255:486–501. doi: 10.1016/j.jcp.2013.08.039.
9. Bubak M, Kitowski J, Wiatr K, editors. eScience on Distributed Computing Infrastructure. Cham: Springer; 2014.
10. https://github.com/Podsiadlo/terrain
11. Farr TG, et al. The Shuttle Radar Topography Mission. Reviews of Geophysics, vol. 45, no. 2 (2005)
12. Guermond JL, Minev PD. A new class of massively parallel direction splitting for the incompressible Navier–Stokes equations. Comput. Methods Appl. Mech. Eng. 2011;200:2083–2093. doi: 10.1016/j.cma.2011.02.007.
13. Douglas J, Gunn JE. A general formulation of alternating direction methods. Numerische Mathematik. 1964;6(1):428–453. doi: 10.1007/BF01386093.
14. Golub GH, Van Loan C. Matrix Computations. 3rd ed. Baltimore: Johns Hopkins University Press; 1996.
15. Takhirov A, Frolov R, Minev P. Direction splitting scheme for Navier-Stokes-Boussinesq system in spherical shell geometries. arXiv:1905.02300 (2019)

Articles from Computational Science – ICCS 2020 are provided here courtesy of Nature Publishing Group
