Parallel multigrid and multiscale flow solvers for high-performance-computing architectures


Abstract/Contents

Abstract
To realize the potential of the latest High-Performance-Computing (HPC) architectures for reservoir simulation, scalable linear solvers are necessary. The objective of this work is therefore to design, demonstrate, and assess the performance of highly efficient flow-solver algorithms that exploit the computational power of emerging HPC architectures. First, we designed and demonstrated a massively parallel black-box multigrid solver [4, 32] capable of handling highly heterogeneous structured 2D problems. The parallel implementation exploits the inherent parallelism in every module of the algorithm, in both the setup stage and the solution stage. The algorithm was implemented on two inherently different shared-memory parallel architectures, namely the multi-core architecture and the massively parallel GPU architecture. The GPU implementation is found to be consistently faster than the multi-core implementation running on 12 Intel® Xeon® E5-2620 cores, achieving up to a 3.5x speed-up for the 16-million-cell highly heterogeneous problem derived from the bottom layer of the SPE10 Second Dataset Benchmark [27].

Then, as an extension of the massively parallel 2D black-box multigrid solver, we developed a massively parallel version of the semicoarsening multigrid solver [36, 91], which can handle highly heterogeneous and anisotropic 3D reservoir models. In this solver, the massively parallel 2D black-box multigrid solver serves as a key "building block" in the setup and solution kernels. The implementation parallelizes the 2D solver kernel in both the setup and solution stages, without changing the order of the steps in the original algorithm. The solver was again implemented and tested on two inherently different parallel architectures, namely the multi-core architecture and the massively parallel multi-GPU architecture. The multi-GPU implementation is found to be faster than the multi-core implementation running on 12 Intel® Xeon® E5-2620 cores for models whose planes are large enough (at least 1 million cells) to achieve good utilization of the GPU resources. Despite the favorable multi-GPU to multi-core performance results, the solver scales less well on the multi-core architecture for models with relatively few planes and deep V-cycles. In addition, the solution stage on the multi-GPU architecture is less scalable than the setup stage, owing to both the small number of cycles and the multiple host-device data movements required per 'full' relaxation sweep.
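To illustrate the setup/solution-stage structure shared by the multigrid solvers described above, the following is a minimal, schematic V-cycle sketch in Python/NumPy. It is not the thesis implementation: it solves a 1D Poisson model problem with standard linear interpolation, Galerkin coarsening, and weighted-Jacobi smoothing, whereas the black-box and semicoarsening solvers [4, 32, 36, 91] use operator-induced transfer operators for heterogeneous 2D/3D problems and run in parallel on multi-core and (multi-)GPU hardware. All function names and parameters below are illustrative only.

    # Minimal geometric multigrid V-cycle for the 1D Poisson problem -u'' = f.
    # Illustrative sketch only; stands in for the parallel black-box /
    # semicoarsening multigrid kernels described in the abstract.
    import numpy as np

    def poisson_matrix(n):
        # Standard 3-point finite-difference Laplacian on n interior nodes, h = 1/(n+1).
        h = 1.0 / (n + 1)
        return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
                - np.diag(np.ones(n - 1), -1)) / h**2

    def interpolation(n_coarse):
        # Linear interpolation from n_coarse coarse nodes to 2*n_coarse + 1 fine nodes.
        n_fine = 2 * n_coarse + 1
        P = np.zeros((n_fine, n_coarse))
        for j in range(n_coarse):
            P[2 * j, j] = 0.5
            P[2 * j + 1, j] = 1.0
            P[2 * j + 2, j] = 0.5
        return P

    def setup(A, levels):
        # "Setup stage": build the grid hierarchy once (coarse operators via R A P).
        ops = [(A, None)]
        for _ in range(levels - 1):
            P = interpolation((A.shape[0] - 1) // 2)
            A = P.T @ A @ P          # Galerkin coarse-grid operator
            ops.append((A, P))
        return ops

    def v_cycle(ops, level, b, x, nu=2, omega=2.0 / 3.0):
        # "Solution stage": one V-cycle with weighted-Jacobi pre/post relaxation.
        A, _ = ops[level]
        if level == len(ops) - 1:
            return np.linalg.solve(A, b)         # direct solve on the coarsest grid
        D_inv = 1.0 / np.diag(A)
        for _ in range(nu):
            x = x + omega * D_inv * (b - A @ x)  # pre-smoothing
        _, P = ops[level + 1]
        r_coarse = P.T @ (b - A @ x)             # restrict the residual
        e_coarse = v_cycle(ops, level + 1, r_coarse,
                           np.zeros_like(r_coarse), nu, omega)
        x = x + P @ e_coarse                     # prolongate and correct
        for _ in range(nu):
            x = x + omega * D_inv * (b - A @ x)  # post-smoothing
        return x

    n = 2**7 - 1                                 # 127 interior nodes
    A = poisson_matrix(n)
    b = np.ones(n)
    ops = setup(A, levels=4)
    x = np.zeros(n)
    for _ in range(10):
        x = v_cycle(ops, 0, b, x)
    print("residual norm:", np.linalg.norm(b - A @ x))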
Based on this assessment of the limitations to the scalability of the 3D semicoarsening multigrid solver, a parallel algebraic multiscale solver (AMS) [112, 108], whose basic algorithm does not suffer from those limitations, is considered next. The design and implementation of a scalable AMS on shared- and distributed-memory architectures, including the decomposition, memory allocation, data flow, and compute kernels, are described in detail; these adaptations are necessary to obtain good scalability on state-of-the-art HPC systems. The specific methods and parameters, such as the coarsening ratio (Cr), the basis-function solver, and the relaxation scheme, have a significant impact on the asymptotic convergence rate and the parallel computational efficiency. The balance between convergence rate and parallel efficiency as a function of the coarsening ratio (Cr) and the local-stage parameters is analyzed in detail.

The performance of AMS is demonstrated using heterogeneous 3D reservoir models, including geostatistically generated fields and models derived from SPE10, with problem sizes ranging from several million to 2 billion cells. The parallel AMS shows excellent scalability on shared-memory architectures: for a 128-million-cell problem, a speed-up of more than twelve-fold is achieved on a 20-core architecture (dual-socket multi-core Intel® Xeon® E5-2690-v2). The solver robustness and scalability are also compared with the state-of-the-art algebraic multigrid (AMG) solver SAMG [99]. Finally, the performance of the parallel AMS is demonstrated on a large cluster of modern multi-core systems with a total of 32 nodes and ∼900 cores. While the solver exhibits good multi-node scalability, the coarse-scale system factorization kernel can limit the scalability for extremely large cases. An approach based on aggressive coarsening is suggested and shown to be effective in improving the scalability of the parallel AMS in such cases.
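As a rough illustration of the two-stage structure of AMS referred to above (a global coarse-scale correction followed by a local relaxation stage), the sketch below applies a simplified multiscale-style preconditioner inside a Richardson iteration. It is not the AMS of [112, 108]: the prolongation here is a piecewise-constant aggregation operator parameterized by a coarsening ratio cr rather than locally computed multiscale basis functions, the coarse system is inverted densely, and the example is serial and 1D. All names and parameters are assumptions made for illustration.

    # Schematic two-stage multiscale-style preconditioner, in the spirit of AMS:
    # a global coarse-grid correction built from a prolongation P, plus a local
    # relaxation stage, applied inside a Richardson iteration. Illustrative only.
    import numpy as np

    def heterogeneous_laplacian_1d(n):
        # Simple 1D heterogeneous diffusion operator as a stand-in fine-scale system.
        k = 1.0 + 9.0 * (np.arange(n + 1) % 7 == 0)       # jumpy coefficient field
        A = np.zeros((n, n))
        for i in range(n):
            A[i, i] = k[i] + k[i + 1]
            if i > 0:
                A[i, i - 1] = -k[i]
            if i < n - 1:
                A[i, i + 1] = -k[i + 1]
        return A

    def aggregation_prolongation(n, cr):
        # Piecewise-constant prolongation: each block of cr fine cells maps to one
        # coarse degree of freedom (cr plays the role of the coarsening ratio).
        nc = (n + cr - 1) // cr
        P = np.zeros((n, nc))
        for i in range(n):
            P[i, i // cr] = 1.0
        return P

    def ams_like_setup(A, cr):
        # Setup stage: build P and factorize/invert the coarse-scale system once.
        P = aggregation_prolongation(A.shape[0], cr)
        Ac_inv = np.linalg.inv(P.T @ A @ P)
        return P, Ac_inv

    def ams_like_apply(A, P, Ac_inv, r, n_relax=1, omega=2.0 / 3.0):
        # Stage 1 (global): coarse correction  x = P Ac^{-1} P^T r
        x = P @ (Ac_inv @ (P.T @ r))
        # Stage 2 (local): weighted-Jacobi relaxation on the updated residual
        D_inv = 1.0 / np.diag(A)
        for _ in range(n_relax):
            x = x + omega * D_inv * (r - A @ x)
        return x

    n, cr = 1024, 8
    A = heterogeneous_laplacian_1d(n)
    b = np.ones(n)
    P, Ac_inv = ams_like_setup(A, cr)
    x = np.zeros(n)
    for _ in range(50):                                   # preconditioned Richardson
        r = b - A @ x
        x = x + ams_like_apply(A, P, Ac_inv, r)
    print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))

In practice the two-stage preconditioner would be wrapped in a Krylov method (e.g., GMRES or CG) rather than plain Richardson; the sketch keeps the outer loop explicit to show where the setup and apply kernels are invoked.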

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2015
Issuance monographic
Language English

Creators/Contributors

Associated with Manea, Abdulrahman M
Associated with Stanford University, Department of Energy Resources Engineering.
Primary advisor Tchelepi, Hamdi
Thesis advisor Aziz, Khalid
Thesis advisor Clapp, Robert G. (Robert Graham)

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Abdulrahman M. Manea.
Note Submitted to the Department of Energy Resources Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2015.
Location electronic resource

Access conditions

Copyright
© 2015 by Abdulrahman Mohammad Manea
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
