The lattice Boltzmann method is a powerful technique for the computational modeling of a wide variety of complex fluid flow problems including single and multiphase flow in complex geometries. It is a discrete computational method based upon the Boltzmann equation. It considers a typical volume element of fluid to be composed of a collection of particles that are represented by a particle velocity distribution function for each fluid component at each grid point. The time is counted in discrete time steps and the fluid particles can collide with each other as they move, possibly under applied forces. The rules governing the collisions are designed such that the time-average motion of the particles is consistent with the Navier-Stokes equation.
This method naturally accommodates a variety of boundary conditions such as the pressure drop across the interface between two fluids and wetting effects at a fluid-solid interface. It is an approach that bridges microscopic phenomena with the continuum macroscopic equations. Further, it can model the time evolution of systems.
Because the lattice Boltzmann method is resource-intensive, we have implemented it with a parallel computation scheme. In general, running simulations on large systems (greater than 100x100x100 grid points) is not practical due to the lack of memory resources and long processing times. Because of these extreme demands on memory and computation, and the fact that the LB method generally needs only nearest neighbor information, the algorithm is an ideal candidate for parallel computing.
The code was implemented in C with MPI for portability. There are multiple features that enable large problems to be run quickly:
Single Program Multiple Data (SPMD) Model
The data volume is divided into spatially contiguous blocks along one axis; multiple copies of the same program run simultaneously, each operating on its own block of data. Each copy of the program runs as an independent process and typically each process runs on its own processor. At the end of each iteration, data for the planes that lie on the boundaries between blocks are passed between the appropriate processes and the iteration is completed.
Memory Management
In order to run large problems, we use multiple techniques to keep within-processor memory needs as small as possible.
Since only nearest neighbor information is needed, all computation within a process is performed with only three temporary planes, Thus temporary memory requirement grows at a much smaller rate than problem size.
For fluid flow in complex geometries, we have both active sites (that hold fluid) and inactive sites (that consist of material such as sandstone). For efficient use of memory we use an indirect addressing approach where the active sites point to fluid data and the inactive sites point to NULL. Hence only minimal memory needs to be devoted to inactive sites. At each active site we point to the necessary velocity and mass data for each fluid component. Over the course of an iteration we visit each active cell in the data volume and calculate the distribution of each fluid component to be streamed to neighboring cells. New mass and velocity values are accumulated at each active cell as its neighbors make their contributions.
We ran a series of performance tests on multiple machines. We found in all cases that the non-parallelizable part of the computation accounts for between 0.7% and 3% of the total computational load. In one of the test cases the performance data from the SGI Origin 2000 closely matches this formula (T is the total time in seconds for an iteration; N is the number of processors): T = 0.090 + 11.98/N. The non-parallizable part of the computation is 0.090 seconds, while the parallelizable portion of the computation uses 11.98 seconds. So, for example, a single iteration took 12.08 seconds on one processor but only 1.11 seconds on 12 processors.
Other timing tests indicate that the time for the parallelizable portion of the code is roughly proportional to the number of active sites over the entire volume, while interprocess communication time is roughly proportional to the size of a cross-section of the volume. So as we process larger systems, the time for the parallelizable portion of the code should increase proportionally with the cube of the linear size of the system, while the non-parallelizable portion should increase with the square of the linear size of the system. This means that for larger systems, a larger proportion of the time is in the parallelizable computation and greater benefits can be derived from running on multiple processors.
Return to High Performance Computing