@tobigithub
Created May 13, 2020 22:54
Unidentified peaks remain a major problem in untargeted metabolomics by LC-MS/MS. Confidence in peak annotations increases when MS/MS matching is combined with retention time. Here we show how retention times can be predicted from molecular structures. Two large, publicly available datasets were used to train machine learning models: the Fiehn hydrophilic interaction liquid chromatography (HILIC) dataset of 981 primary metabolites and biogenic amines, and the RIKEN Plant Specialized Metabolome Annotation (PlaSMA) database of 852 secondary metabolites, which uses reversed-phase liquid chromatography (RPLC). Five machine learning algorithms for building retention time prediction models have been integrated into the Retip R package: random forest, Bayesian-regularized neural network, XGBoost, light gradient-boosting machine (LightGBM) and Keras. A complete workflow for retention time prediction was developed in R and can be freely downloaded from the GitHub repository (https://www.retip.app). Keras outperformed the other machine learning algorithms on the test set with minimal overfitting, as verified by small differences in error between the training, test and validation sets. Keras yielded a mean absolute error (MAE) of 0.78 minutes for HILIC and 0.57 minutes for RPLC. Retip is integrated into the mass spectrometry software tools MS-DIAL and MS-FINDER, allowing a complete compound annotation workflow. In a test application on mouse blood plasma samples, we found a 68% reduction in the number of candidate structures when searching all isomers in the MS-FINDER compound identification software. Retention time prediction increases the identification rate in liquid chromatography and subsequently leads to improved biological interpretation of metabolomics data.
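
The two quantities the abstract reports, a mean absolute error and a reduction in candidate structures, can be illustrated with a minimal Python sketch. This is not Retip's or MS-FINDER's code. The retention times, the isomer names, the `Candidate` class, the `filter_by_retention_time` helper and the tolerance of twice the MAE are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted retention times."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate structure with a predicted retention time in minutes."""
    name: str
    predicted_rt_min: float


def filter_by_retention_time(candidates, observed_rt_min, tolerance_min):
    """Keep candidates whose predicted retention time lies within the tolerance window."""
    return [
        c for c in candidates
        if abs(c.predicted_rt_min - observed_rt_min) <= tolerance_min
    ]


# Made-up retention times (minutes) for a small held-out set.
observed = [1.0, 2.5, 4.0, 6.0]
predicted = [1.5, 2.0, 4.5, 5.0]
mae = mean_absolute_error(observed, predicted)

# Made-up isomer candidates for a single peak observed at 5.2 minutes.
candidates = [
    Candidate("isomer_A", 3.0),
    Candidate("isomer_B", 5.5),
    Candidate("isomer_C", 9.0),
    Candidate("isomer_D", 5.1),
]

# Assumption: a window of twice the MAE; the source does not state the window used.
kept = filter_by_retention_time(candidates, observed_rt_min=5.2, tolerance_min=2 * mae)
reduction = 1 - len(kept) / len(candidates)

print(f"MAE: {mae:.3f} min")
print("kept:", [c.name for c in kept])
print(f"candidate reduction: {reduction:.0%}")
```

A wider window keeps more true positives but removes fewer candidates, so in practice the tolerance would be tuned against the prediction error of the trained model.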