The Gyrokinetic Toroidal Code developed at Princeton University/PPPL (GTC-P) is a highly scalable particle-in-cell (PIC) code that solves the 5D Vlasov-Poisson equation and makes efficient use of modern parallel computer architectures at the petascale and beyond. It is designed to take advantage of advances in computer science, such as multithreading on leading computing systems. Multiple levels of parallelism, including inter-node domain and particle decomposition, intra-node shared-memory partitioning, and vectorization, have pushed the scalability of the PIC method to extreme scales. Because GTC-P does not depend on any third-party libraries, it is highly portable and has been compiled and successfully deployed on a wide variety of computing platforms, including the top 7 supercomputers worldwide. In particular, it has efficiently utilized up to 1.5M cores on systems that include Tianhe-2 (China), Titan (US), Sequoia (US), K computer (Japan), Mira (US), Piz Daint (Switzerland), Stampede (US), and Blue Waters (US).
GTC-P is a hybrid MPI/OpenMP code written in C. The CPU version also supports the Intel Xeon Phi architecture in native, symmetric, and offload modes. The GPU version is written in CUDA.
Bei Wang (Princeton University), Stephane Ethier (PPPL), Kamesh Madduri (Penn State University), Khaled Ibrahim (LBL)
This code is developed at Princeton University and the Princeton Plasma Physics Laboratory (PPPL).
For access to the source code, please contact Bei Wang (firstname.lastname@example.org).
For access to the documentation, please contact Bei Wang (email@example.com).
B. Wang, S. Ethier, W. Tang, T. Williams, K. Ibrahim, K. Madduri, S. Williams, and L. Oliker, "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems," SC13: Proceedings of the 2013 International Conference on High Performance Computing, Networking, Storage and Analysis, Article 82. http://dl.acm.org/citation.cfm?id=2503258
History of the GTC Code Family:
GTC-P is based on the original GTC code developed by Z. Lin at UC Irvine. GTC was written in Fortran 90 and has three levels of parallelism: a one-dimensional domain decomposition in the toroidal dimension, a particle decomposition within each toroidal domain, and multithreaded, shared-memory loop-level parallelism implemented with OpenMP. This method scales very well to a large number of processor cores with respect to the number of particles. However, it can suffer from performance bottlenecks when simulating large-scale plasmas, e.g., ITER (the largest device currently under construction in France), because the grid-related computation and memory requirements on each process increase. With only a one-dimensional domain decomposition in the toroidal dimension, each process keeps a copy of the full poloidal grid, and the number of points in the poloidal grid increases by a factor of 4 when the minor radius of the plasma device doubles.
To effectively address open questions in fusion plasma physics, such as the scaling of the energy confinement time with system size, a key additional level of domain decomposition in the radial dimension was introduced, and the resulting code was named the Gyrokinetic Toroidal Code at Princeton (GTC-P). The first version of GTC-P was written in Fortran 90 and reduced memory requirements through the additional domain decomposition. However, this version showed substantial load imbalance at high processor counts and lacked multithreading capability for some grid-based subroutines (e.g., the Poisson solver), which impeded performance on multicore and manycore architectures. Recently, we have developed a new version of GTC-P by completely rewriting the original Fortran code in C, simplifying the porting process to GPUs, Intel Xeon Phis, and forthcoming multicore technologies.
Additionally, the rewrite made low-level optimizations easier to exploit. The code is designed to take advantage of advances in computer science, such as multithreading on leading computing systems. Multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, intra-node shared-memory partitioning, and vectorization within each core, have allowed the code to scale effectively to extreme scales.
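To illustrate why the radial decomposition matters, the following C sketch estimates the per-process share of the poloidal grid with and without it, and shows one possible mapping from a flat process rank to toroidal, radial, and particle-decomposition coordinates. It is not taken from the GTC-P source. The function names, the `RankCoords` type, and the rank ordering are hypothetical, and the quadratic growth of grid points with minor radius is an assumption chosen to match the factor of 4 quoted above. Ghost cells and load balancing are ignored.

```c
#include <assert.h>

/* Assumption: the number of points in one poloidal plane grows with the
 * square of the minor radius (doubling the radius gives 4x the points). */
static long poloidal_grid_points(long base_points, long radius_factor)
{
    return base_points * radius_factor * radius_factor;
}

/* Points of one poloidal plane held by a single process.
 * npe_radial == 1 models the 1D (toroidal-only) decomposition, where each
 * process keeps the full plane. Larger values model the added radial
 * decomposition; ghost cells are ignored in this sketch. */
static long grid_points_per_process(long plane_points, long npe_radial)
{
    assert(npe_radial > 0);
    return (plane_points + npe_radial - 1) / npe_radial; /* ceiling division */
}

/* Hypothetical position of a process in the three-level decomposition. */
typedef struct {
    int toroidal; /* index of the toroidal domain */
    int radial;   /* index of the radial domain within that toroidal domain */
    int particle; /* index among processes sharing the same grid domain */
} RankCoords;

/* Hypothetical rank ordering: the particle index varies fastest, then the
 * radial index, then the toroidal index. The real code may order ranks
 * differently. */
static RankCoords rank_to_coords(int rank, int nradial, int nparticle)
{
    RankCoords c;
    assert(rank >= 0 && nradial > 0 && nparticle > 0);
    c.particle = rank % nparticle;
    c.radial   = (rank / nparticle) % nradial;
    c.toroidal = rank / (nparticle * nradial);
    return c;
}
```

Under these assumptions, a plane of 1000 points grows to 4000 points when the minor radius doubles, and splitting it across 4 radial domains brings the per-process share back to 1000 points.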