Proportional clustering-based undersampling for imbalanced data classification

CS Zhang and ZR Shi and WW Lu and Z Jin and S Feng and ML Xu, KNOWLEDGE AND INFORMATION SYSTEMS, 67, 12299-12333 (2025).

DOI: 10.1007/s10115-025-02593-1

Class imbalance is a major challenge in machine learning and data mining, as it hinders the detection of rare but important instances. Clustering-based undersampling methods are widely used to address this issue. However, they often struggle to choose an appropriate clustering algorithm and to identify representative instances, resulting in suboptimal resampling. This paper proposes a clustering-based undersampling method that uses the DBSCAN algorithm for clustering. The number of instances to select from each cluster is determined proportionally, and a linear model optimized by Differential Evolution is used to identify the specific instances to retain, completing the undersampling process. Comparative experiments are conducted on 30 datasets using five classifiers. The results demonstrate that the proposed method significantly outperforms baseline methods in terms of MCC, F-measure, and AUC. Additional experiments further show the impact of clustering algorithms on resampling performance and highlight the effectiveness of the learning-to-rank algorithm in selecting representative instances.
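The proportional step mentioned in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `proportional_quota`, its parameters, and the largest-remainder rounding rule are all assumptions made for illustration, and the abstract does not say how fractional quotas are rounded. The cluster labels could come from any clustering step, such as DBSCAN.

```python
from collections import Counter


def proportional_quota(cluster_labels, n_keep):
    """Split n_keep slots across clusters in proportion to cluster size.

    Hypothetical helper: rounding uses the largest-remainder rule, which is
    an assumption and not taken from the paper.
    """
    sizes = Counter(cluster_labels)
    total = sum(sizes.values())
    if not 0 <= n_keep <= total:
        raise ValueError("n_keep must be between 0 and the number of instances")

    # Integer arithmetic avoids floating-point rounding surprises:
    # the exact share of cluster c is n_keep * size / total.
    quota = {}
    remainder = {}
    for label, size in sizes.items():
        quota[label], remainder[label] = divmod(n_keep * size, total)

    # Hand the slots lost to flooring to the clusters with the largest remainders.
    leftover = n_keep - sum(quota.values())
    by_remainder = sorted(sizes, key=lambda c: remainder[c], reverse=True)
    for label in by_remainder[:leftover]:
        quota[label] += 1
    return quota


# Majority-class instances grouped into three clusters of sizes 60, 30, and 10.
labels = [0] * 60 + [1] * 30 + [2] * 10
print(proportional_quota(labels, 20))
```

With 20 instances to retain, the clusters of size 60, 30, and 10 receive 12, 6, and 2 slots. Choosing which instances fill each quota is a separate step, which the paper handles with a linear model optimized by Differential Evolution.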
