Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications

YY Jin and HJ Wang and XC Tang and ZH Guo and YQ Zhao and T Hoefler and T Liu and X Liu and JD Zhai, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 36, 308-325 (2025).

DOI: 10.1109/TPDS.2024.3485789

It is challenging to scale parallel applications to modern supercomputers because of load imbalance, resource contention, and communication between processes. Profiling and tracing are the two main performance analysis approaches for detecting these scalability bottlenecks. Profiling is low-cost but lacks the detailed dependence information needed to identify root causes. Tracing records plentiful information but incurs significant overhead. To address these issues, we present ScalAna, which employs static analysis techniques to combine the benefits of profiling and tracing: it provides tracing's analyzability at an overhead similar to that of profiling. ScalAna uses static analysis to capture the program structure and data dependence of parallel applications, and leverages lightweight profiling approaches to record performance data at runtime. A parallel performance graph is then generated from both the static and the dynamic data. Based on this graph, we design a backtracking detection approach to automatically pinpoint the root causes of scaling issues. We evaluate the efficacy and efficiency of ScalAna using several real applications with up to 704K lines of code and demonstrate that our approach can effectively pinpoint the root causes of scaling loss with an average overhead of 5.65% for up to 16,384 processes. Fixing the root causes detected by our tool yields up to 33.01% performance improvement.
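
The idea of backtracking over a performance graph can be illustrated with a minimal sketch. This is not ScalAna's implementation: the Vertex class, the scales_poorly threshold, the strong-scaling assumption, and the sample timings below are all hypothetical, chosen only to show how following dependence edges backward from a slow vertex can lead to a candidate root cause.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """Hypothetical graph vertex: one code region (loop, call, communication)."""
    name: str
    time_small: float  # measured seconds at the smaller process count
    time_large: float  # measured seconds at the larger process count
    deps: list = field(default_factory=list)  # names of vertices this one depends on


def scales_poorly(v, scale_factor, tolerance=1.5):
    """Assumed criterion: under strong scaling the ideal time is
    time_small / scale_factor; flag vertices that exceed it by `tolerance`."""
    ideal = v.time_small / scale_factor
    return v.time_large > tolerance * ideal


def backtrack_root_causes(graph, start, scale_factor):
    """Walk dependence edges backward from `start`, following only
    poorly scaling predecessors. A vertex with no such predecessor
    is reported as a candidate root cause."""
    roots, seen, stack = [], set(), [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        bad_deps = [d for d in graph[name].deps
                    if scales_poorly(graph[d], scale_factor)]
        if bad_deps:
            stack.extend(bad_deps)   # keep backtracking
        else:
            roots.append(name)       # nothing upstream explains the slowdown
    return roots


# Made-up timings for a run scaled from P to 4P processes.
graph = {
    "init": Vertex("init", 1.0, 0.25),
    "compute_loop": Vertex("compute_loop", 8.0, 6.0, deps=["init"]),
    "mpi_allreduce": Vertex("mpi_allreduce", 2.0, 5.0, deps=["compute_loop"]),
}

candidates = backtrack_root_causes(graph, "mpi_allreduce", scale_factor=4)
print(candidates)
```

In this toy graph the collective looks slow, but the walk stops at compute_loop, the earliest poorly scaling region it depends on. A real tool would derive the vertices and edges from static analysis and the timings from runtime profiling, as the abstract describes.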
