Profiling prediction errors to diagnose performance limitations in machine learning models for geogenic contaminated groundwaters

2025-12-31

Hailong Cao, Xianjun Xie, Ziyi Xiao, Wenjing Liu,
Profiling prediction errors to diagnose performance limitations in machine learning models for geogenic contaminated groundwaters,
Journal of Cleaner Production,
Volume 525,
2025,
146605,
ISSN 0959-6526,
https://doi.org/10.1016/j.jclepro.2025.146605.
(https://www.sciencedirect.com/science/article/pii/S0959652625019559)
Abstract: Modeling of geogenic contaminated groundwaters increasingly adopts machine learning approaches. Although prediction errors consistently occur in machine learning models, they are typically quantified solely in terms of magnitude and are not subjected to systematic attribution analysis. Consequently, efforts to improve model performance have not fully leveraged error characterization. Using a nationwide groundwater fluoride dataset from India, this study investigated the sensitivity of prediction errors to three performance determinants: randomness, modeling algorithms, and training data. A comprehensive error profiling framework was developed, incorporating neighborhood characteristics, class probability density, low-dimensional distribution patterns, and geographical characteristics. The results revealed that 38 % of the errors were completely independent of the performance determinants, constituting a distinct category of stable errors. These stable errors exhibit three defining characteristics: noisy labeling, distant positioning from class boundaries, and spatial distribution patterns intermingled with correctly classified samples. These characteristics exclude the possibility that stable errors represent unlearned rare classes and instead indicate that they originate in human-induced data quality issues. Notably, removing these errors enhanced model robustness against overfitting while maintaining performance levels, despite a 17 % reduction in training data size. Based on these insights, the study proposes integrating error profiling as a standard component of groundwater machine learning modeling pipelines.
Keywords: Machine learning; Geogenic contaminated groundwaters; Error profiles; Noisy labels; Data quality
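
Illustrative sketch: the abstract describes stable errors as those that persist regardless of randomness, modeling algorithm, or training data. One way to picture that idea, which is not the authors' implementation, is to collect predictions from several runs and flag the samples that are misclassified in every one. The function name `stable_error_mask`, the `(n_runs, n_samples)` array layout, and the toy labels below are all assumptions made for illustration.

```python
import numpy as np


def stable_error_mask(y_true, predictions):
    """Return a boolean mask of samples misclassified in every run.

    y_true: array of shape (n_samples,) with the reference labels.
    predictions: array of shape (n_runs, n_samples); each row is one run.
    The layout is hypothetical: a "run" could differ in random seed,
    algorithm, or training subset.
    """
    y_true = np.asarray(y_true)
    predictions = np.asarray(predictions)
    if predictions.ndim != 2 or predictions.shape[1] != y_true.shape[0]:
        raise ValueError("predictions must have shape (n_runs, n_samples)")
    wrong = predictions != y_true  # broadcasts y_true across runs
    return wrong.all(axis=0)       # True only if wrong in every run


# Toy data (invented for illustration, not taken from the study).
y_true = np.array([0, 1, 1, 0, 1])
predictions = np.array([
    [0, 0, 1, 1, 1],  # run 1
    [0, 0, 1, 0, 1],  # run 2
    [1, 0, 1, 0, 1],  # run 3
])

stable = stable_error_mask(y_true, predictions)
ever_wrong = (predictions != y_true).any(axis=0)

print("stable error indices:", np.flatnonzero(stable).tolist())
print("share of erroneous samples that are stable:",
      stable.sum() / ever_wrong.sum())
```

In this toy case only the sample at index 1 is wrong in all three runs, so it is the only candidate stable error; samples that are wrong in some runs but not others would count as sensitive to the varied determinant.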