Performance Analysis and Optimization of Parallel Scientific Applications on CMP Clusters

Xingfu Wu
Valerie Taylor
Charles Lively
Sameh Sharkawi


Chip multiprocessors (CMPs) are widely used for high-performance computing.
Further, these CMPs are being configured in a hierarchical manner to compose
a node in a cluster system. A major challenge to be addressed is efficient
use of such cluster systems for large-scale scientific applications. In this
paper, we quantify the performance gap resulting from using different numbers
of processors per node; this information is used to provide a baseline for
the amount of optimization needed when using all processors per node on CMP
clusters. Using three scientific applications, we conduct a detailed
performance analysis to identify how applications can be modified to
efficiently utilize all processors per node. The applications are a 3D
particle-in-cell magnetic fusion application, the Gyrokinetic Toroidal Code
(GTC), a Lattice Boltzmann Method for simulating fluid dynamics (LBM), and
an advanced Eulerian gyrokinetic-Maxwell equation solver for simulating
microturbulent transport in plasma (GYRO). In terms of refinements, we use
conventional techniques such as loop blocking, loop unrolling and loop
fusion, and develop hybrid methods for optimizing MPI_Allreduce and
MPI_Reduce. Using these optimizations, the application performance
for utilizing all processors per node was improved by up to 18.97% for
GTC, 15.77% for LBM and 12.29% for GYRO on up to 2048 total processors
on the CMP clusters.
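
As an illustration of one of the conventional techniques named above, the
following is a minimal sketch of loop blocking in C, applied to a matrix
transpose. It is not taken from GTC, LBM or GYRO; the function names and the
tile size BLOCK are hypothetical, and a real tile size would be tuned to the
cache hierarchy of the target CMP node.

```c
#include <stddef.h>

/* Hypothetical tile size (assumption); tune to the target cache in practice. */
#define BLOCK 32

/* Reference version: dst[j][i] = src[i][j] for an n x n row-major matrix.
 * Writes to dst stride through memory by n elements on every inner step. */
void transpose_naive(const double *src, double *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked version: same result, but the iteration space is split into
 * BLOCK x BLOCK tiles so each tile of src and dst is reused while it is
 * likely still resident in cache. Edge tiles are clipped to n. */
void transpose_blocked(const double *src, double *dst, size_t n) {
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions compute the same transpose; only the traversal order differs.
Whether the blocked version is faster depends on the matrix size and the
cache sizes of the machine, so any benefit should be measured rather than
assumed.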

Article Details

Special Issue Papers