Beyond the surface: Quasi-SMILES machine learning approaches for precise estimation of organic sorption
Raouf Hassan, Alireza Baghban,
Beyond the surface: Quasi-SMILES machine learning approaches for precise estimation of organic sorption,
Materials Today Communications,
Volume 49,
2025,
114126,
ISSN 2352-4928,
https://doi.org/10.1016/j.mtcomm.2025.114126.
(https://www.sciencedirect.com/science/article/pii/S2352492825026388)
Abstract: This study develops machine learning models to predict the soil sorption of organic compounds (logKd), a critical factor influencing their environmental fate. The aim is to build robust models that capture the complex interactions between soil properties and chemical characteristics and that predict sorption accurately across diverse soil environments. Using a dataset of 20,945 experimental records covering 419 organic compounds and 1,037 soil types, we applied a range of algorithms including XGBoost, LightGBM, Random Forest, support vector machines (SVMs), linear regression, and others. Input features included equilibrium concentration (log Ce), soil-to-solution ratio (log SS ratio), soil organic content (SOC%), cation exchange capacity (CEC), pH, pKa, pKb, and Kd/Kf. Model performance was assessed using R-squared, mean squared error (MSE), and visualization tools, and data reliability was checked through Monte Carlo outlier detection. XGBoost, LightGBM, and Random Forest performed best, with R-squared values up to 0.9957 and MSE as low as 0.0067 on test data, significantly outperforming the other methods. SHAP analysis identified Kd/Kf as the dominant predictor, followed by log Ce and log SS ratio, highlighting their critical roles in sorption processes. These findings show that machine learning can deliver precise, reliable predictions of soil sorption and offer valuable insights for environmental risk assessment and pollutant management. Unlike previous studies, which typically rely on smaller datasets or linear regression approaches, this work combines one of the largest curated soil sorption datasets to date with ensemble learning and interpretable AI tools. The combination yields both highly accurate predictions and mechanistic insight into nonlinear soil–compound interactions, a significant advance over earlier modeling efforts.
Keywords: Soil sorption prediction; Organic compounds retention; Machine learning models; SHAP analysis; Monte Carlo outlier detection
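Note on the reported metrics: the abstract evaluates models by R-squared and MSE. A minimal sketch of how these two quantities are conventionally defined is given below, assuming observed and predicted logKd values held in plain Python lists. The function names and the sample numbers are illustrative assumptions and are not taken from the study, whose own evaluation code is not shown in this record.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted logKd values (not from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print("MSE:", mse(observed, predicted))
print("R-squared:", r_squared(observed, predicted))
```

An R-squared close to 1 together with a small MSE, as reported for the tree-based ensembles, means the predictions account for nearly all of the variance in the held-out logKd values.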