dpdata: A Scalable Python Toolkit for Atomistic Machine Learning Data Sets

JZ Zeng and XL Peng and YB Zhuang and HD Wang and FB Yuan and D Zhang and RX Liu and YZ Wang and P Tuo and YZ Zhang and YX Chen and YF Li and CT Nguyen and JM Huang and AY Peng and M Rynik and WH Xu and ZZ Zhang and XY Zhou and T Chen and JH Fan and WR Jiang and BW Li and DN Li and HX Li and WS Liang and RH Liao and LP Liu and CX Luo and L Ward and KW Wan and JJ Wang and P Xiang and CQ Zhang and JC Zhang and R Zhou and JX Zhu and LF Zhang and H Wang, JOURNAL OF CHEMICAL INFORMATION AND MODELING, 65, 11497-11504 (2025).

DOI: 10.1021/acs.jcim.5c01767

Seamless management of atomistic data sets is a critical prerequisite for the successful development and deployment of machine learning potentials (MLPs). Here, we present dpdata, an open-source Python library designed to streamline every aspect of MLP data handling. Built upon a flexible, plugin-based architecture, dpdata supports reading, writing, and converting between a broad range of file formats, from popular quantum-chemistry packages and molecular-dynamics engines to specialized MLP frameworks. Users may define custom data types, formats, drivers, and minimizers, enabling effortless extension to emerging software. Key utilities include automated train-test splitting, coordinate perturbation for active learning, outlier-energy removal, Delta-learning data set generation, error-metric computation, and unit conversion. Through efficient NumPy-backed storage and system-level operations, dpdata achieves significant memory savings and inference speedups over configuration-by-configuration tools such as ASE. We also highlight its practical impact: dpdata has been used in published studies for format conversion, data storage, and coordinate perturbation, and in other projects for data processing.
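
To make two of the utilities named in the abstract concrete, the following is a minimal sketch of outlier-energy removal followed by a train-test split, written in plain standard-library Python. It does not use dpdata and is not dpdata's implementation: the function names `remove_energy_outliers` and `split_train_test`, the standard-deviation criterion, the thresholds, and the energy values are all assumptions made for illustration.

```python
import random
import statistics


def remove_energy_outliers(energies, threshold=2.0):
    """Return indices of frames whose energy lies within `threshold`
    population standard deviations of the mean (assumed criterion)."""
    mean = statistics.fmean(energies)
    std = statistics.pstdev(energies)
    if std == 0.0:
        # All energies are identical, so there is nothing to discard.
        return list(range(len(energies)))
    return [i for i, e in enumerate(energies) if abs(e - mean) <= threshold * std]


def split_train_test(n_frames, test_fraction=0.2, seed=0):
    """Shuffle frame indices reproducibly and return (train_idx, test_idx)."""
    rng = random.Random(seed)
    indices = list(range(n_frames))
    rng.shuffle(indices)
    n_test = int(round(n_frames * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# Hypothetical per-frame energies in eV; the last frame is a deliberate outlier.
energies = [-10.0, -10.1, -9.9, -10.2, -10.0, -9.8, -10.1, -10.0, -9.9, 50.0]

kept = remove_energy_outliers(energies, threshold=2.0)
train_idx, test_idx = split_train_test(len(kept), test_fraction=0.25, seed=0)

# Map split positions back to the original frame numbers.
train_frames = [kept[i] for i in train_idx]
test_frames = [kept[i] for i in test_idx]
```

A single extreme value inflates the standard deviation it is judged against, so the threshold in a criterion of this kind has to be chosen with the data in mind; a more robust statistic such as the median absolute deviation is a common alternative.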
