Identifying Performance Inefficiencies of Parallel Program With Spatial and Temporal Trace Analysis
ZB Xuan and X Sun and X You and HL Yang and ZZ Luan and Y Liu and DP Qian, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 36, 1387-1400 (2025).
DOI: 10.1109/TPDS.2025.3566735
Performance inefficiencies can lead to performance anomalies in parallel programs. Existing performance analysis tools either have a limited detection scope or require significant domain knowledge to use, which limits their practical adoption for identifying performance inefficiencies. In this paper, we propose STAD, a performance analysis tool for parallel programs that considers both spatial and temporal patterns within trace data. STAD captures the spatial communication patterns between processes using a spatial communication pattern graph. It then adopts a dynamic graph neural network-based unsupervised model to learn the evolving temporal patterns along the timeline. Additionally, STAD diagnoses the root causes of performance anomalies by exploiting the aggregated features of anomalies along the call tree. Our evaluation results demonstrate that STAD can effectively detect performance anomalies with acceptable overhead and diagnose root causes attributed to both the program itself and the running environment.