Report by: Vladimir Chupakhin (Simulations Plus)
Contact: vlad.chupakhin@simulations-plus.com
- Compounds were standardized with ADMET Predictor v12 (Simulations Plus) [1].
- For each duplicated data entry, a single representative of the non-dominant entry was retained.
Physicochemical descriptors and ADMET-related model predictions were computed with ADMET Predictor (Simulations Plus) [1] and used as features for an ensemble of 50 CatBoost regression models [2]. Each model was refitted on the subset of features with non-zero SHAP values [3], and the final prediction was the median of the ensemble predictions.
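The two aggregation rules in this step (keeping features with non-zero SHAP values and taking the median over the ensemble) can be illustrated with a minimal, library-free sketch. It assumes the SHAP values and per-model predictions have already been computed elsewhere; the function names, feature names, and toy numbers below are hypothetical and are not taken from the actual pipeline.

```python
from statistics import median


def nonzero_shap_features(shap_rows, feature_names, tol=0.0):
    """Return the features whose mean absolute SHAP value exceeds tol.

    shap_rows holds one list of SHAP values per compound, ordered like
    feature_names. tol=0.0 corresponds to a strict "non-zero" rule.
    """
    n_rows = len(shap_rows)
    selected = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_rows) / n_rows
        if mean_abs > tol:
            selected.append(name)
    return selected


def median_ensemble(per_model_predictions):
    """Combine predictions (one list per model) by the per-compound median."""
    return [median(values) for values in zip(*per_model_predictions)]


# Toy inputs (assumed): two compounds, three features, three ensemble members.
shap_rows = [[0.20, 0.0, -0.05], [-0.10, 0.0, 0.15]]
features = ["logP", "unused_descriptor", "pKa"]
kept = nonzero_shap_features(shap_rows, features)

predictions = [[1.0, 2.0], [1.4, 2.2], [0.9, 5.0]]
final = median_ensemble(predictions)

print(kept)   # ['logP', 'pKa']
print(final)  # [1.0, 2.2]
```

The median makes the combined prediction insensitive to a single outlying ensemble member, as with the value 5.0 for the second compound in the toy data.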
To improve model performance, we added three further feature groups: fragment-based, metabolic-site, and symbolic-regression-derived features:
- Fragment Features: Computed explicitly for both the training (5479) and test (1524) sets, with only the overlapping fragments (646) used in modeling. Fragment descriptors were based on RDKit Morgan fingerprints with radius 3 [4].
- Metabolic Site Features: Generated using a predefined set of SMARTS rules (30-69).
- Symbolic Regression Features: Derived from symbolic regression models fitted on the same training set [5].
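The restriction to overlapping fragments described in the fragment bullet above can be sketched as a set intersection followed by a binary encoding. This is only an illustration under stated assumptions: the fragment identifiers are hypothetical integers standing in for whatever identifiers the Morgan fingerprinting step produces, and the function and variable names are invented for this example.

```python
def shared_fragments(train_frags, test_frags):
    """Return the fragment identifiers that occur in both training and test sets."""
    train_vocab = set().union(*train_frags.values())
    test_vocab = set().union(*test_frags.values())
    return train_vocab & test_vocab


def fragment_matrix(frags_by_compound, vocabulary):
    """Encode each compound as a 0/1 vector over the sorted fragment vocabulary."""
    columns = sorted(vocabulary)
    rows = {
        compound: [1 if frag in frags else 0 for frag in columns]
        for compound, frags in frags_by_compound.items()
    }
    return columns, rows


# Hypothetical fragment identifiers per compound (assumed toy data).
train_frags = {"train_1": {101, 202, 303}, "train_2": {202, 404}}
test_frags = {"test_1": {202, 303, 505}}

shared = shared_fragments(train_frags, test_frags)
columns, train_rows = fragment_matrix(train_frags, shared)
_, test_rows = fragment_matrix(test_frags, shared)

print(columns)     # [202, 303]
print(train_rows)  # {'train_1': [1, 1], 'train_2': [1, 0]}
print(test_rows)   # {'test_1': [1, 1]}
```

Dropping fragments that appear in only one of the two sets keeps the training and test matrices on identical columns and removes features that could not carry signal at prediction time.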
All features were first filtered by their SHAP values, averaged over the 50 ensemble runs, before being used to build the final models.
The final submission was based on the TabPFN model [6] because, in a 5-fold cross-validation benchmark, it gave a 1–5% gain in quality metrics over CatBoostRegressor.
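A comparison of this kind can be sketched as follows. The report does not state which quality metrics were used, so RMSE is an assumption here, and the two models are deliberately trivial stand-ins (a mean predictor and a one-nearest-neighbour predictor) rather than CatBoost or TabPFN. All function names are hypothetical.

```python
import math
import random


def kfold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, folds[i]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cv_rmse(fit_predict, X, y, k=5, seed=0):
    """Mean RMSE over k folds; fit_predict(X_train, y_train, X_test) returns predictions."""
    scores = []
    for train_idx, test_idx in kfold_splits(len(y), k, seed):
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        scores.append(rmse([y[i] for i in test_idx], preds))
    return sum(scores) / k


def relative_gain_percent(baseline_error, candidate_error):
    """Percentage by which the candidate lowers an error metric relative to the baseline."""
    return 100.0 * (baseline_error - candidate_error) / baseline_error


# Stand-in models (assumed), used only to exercise the protocol.
def mean_model(X_train, y_train, X_test):
    mean_y = sum(y_train) / len(y_train)
    return [mean_y for _ in X_test]


def nearest_neighbour_model(X_train, y_train, X_test):
    preds = []
    for x in X_test:
        nearest = min(range(len(X_train)), key=lambda i: abs(X_train[i][0] - x[0]))
        preds.append(y_train[nearest])
    return preds


X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]

baseline = cv_rmse(mean_model, X, y)
candidate = cv_rmse(nearest_neighbour_model, X, y)
gain = relative_gain_percent(baseline, candidate)
print(f"baseline RMSE={baseline:.2f}, candidate RMSE={candidate:.2f}, gain={gain:.1f}%")
```

Using the same seed for both models means they are scored on identical folds, so the reported gain reflects the models rather than the split.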
References
[1] Simulations Plus. ADMET Predictor v12. Retrieved from https://www.simulations-plus.com/software/admet-predictor/
[2] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems. Retrieved from https://catboost.ai/
[3] Lundberg, S.M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems.
[4] RDKit. Open-source Cheminformatics. Retrieved from https://www.rdkit.org/
[5] Cranmer, M. (2023). Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl. arXiv:2305.01582. Retrieved from https://github.com/MilesCranmer/PySR
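[6] Hollmann, N., Müller, S., Eggensperger, K., & Hutter, F. (2023). TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. International Conference on Learning Representations. Retrieved from https://github.com/automl/TabPFN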