@shug3502
Last active March 18, 2025 15:44

Overview of our submission to the Polaris Antiviral Competition 2025

Validation

Noting the high similarity between the holdout set and a subset of the training set, we elected to use 5-fold random splits for cross-validation.
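As a rough illustration of what a 5-fold random split looks like, here is a minimal standard-library sketch. The function name `kfold_indices`, the seed, and the sample count are assumptions made for this example; the write-up does not say which tooling was used to generate the folds.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Return (train_idx, valid_idx) pairs for a random k-fold split.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        valid = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, valid))
    return splits


# Toy usage: 23 compounds split into 5 folds.
splits = kfold_indices(n_samples=23, n_folds=5, seed=42)
for fold_id, (train_idx, valid_idx) in enumerate(splits):
    print(fold_id, len(train_idx), len(valid_idx))
```

Each compound appears in exactly one validation fold, so a per-property metric can be averaged over the five folds.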

Data

We identified suspicious data points and removed them from the training set. For HLM and MLM, these were measurements of 0, which were presumably out-of-bounds measurements; we elected to remove them rather than place them at the assay minimum. These measurements break the global trend between clearance and LogD, show a large change in clearance relative to their nearest neighbours, and decrease model performance in random-split cross-validation.

In the potency task, we filtered out unusually low measurements (pIC50 < 3). We also removed suspicious measurements that break the global trend between MERS and SARS pIC50 and that sit at the extremes of the distributions with a large change in pIC50 relative to their nearest neighbours.

For each ADME property, we gathered public data from ChEMBL and kept only compounds with high similarity to compounds in the training or holdout sets. The public data was included in the final model submission for each property where augmenting the training set improved cross-validation metrics: LogD, Solubility, Mouse Liver Microsomal clearance (MLM), and Human Liver Microsomal clearance (HLM).
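The simpler filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy compound identifiers and values, the representation of fingerprints as sets of on-bits, the use of Tanimoto similarity, and the 0.5 threshold are all hypothetical, since the write-up does not specify the similarity measure or cutoff. The trend-based and nearest-neighbour checks are not covered here.

```python
def remove_zero_clearance(measurements):
    """Drop (compound_id, value) pairs whose clearance value is exactly 0."""
    return [(cid, value) for cid, value in measurements if value != 0]


def remove_low_potency(measurements, min_pic50=3.0):
    """Drop (compound_id, pIC50) pairs with pIC50 below the cutoff."""
    return [(cid, value) for cid, value in measurements if value >= min_pic50]


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def filter_public_by_similarity(public_fps, reference_fps, threshold=0.5):
    """Keep public compounds whose best similarity to any reference
    (training or holdout) compound reaches the threshold.

    The threshold value is an assumption, not taken from the write-up.
    """
    kept = []
    for cid, fp in public_fps.items():
        best = max((tanimoto(fp, ref) for ref in reference_fps), default=0.0)
        if best >= threshold:
            kept.append(cid)
    return kept


# Toy data, invented purely for illustration.
hlm = [("cpd-1", 12.5), ("cpd-2", 0.0), ("cpd-3", 48.0)]
potency = [("cpd-1", 5.2), ("cpd-4", 2.1)]
public_fps = {"pub-a": {1, 2, 3, 4}, "pub-b": {10, 11}}
reference_fps = [{1, 2, 3, 5}, {7, 8}]

clean_hlm = remove_zero_clearance(hlm)
clean_potency = remove_low_potency(potency)
kept_public = filter_public_by_similarity(public_fps, reference_fps)
```

In practice the fingerprints would come from a cheminformatics toolkit, and whether the filtered public data stays in the training set would still be decided by comparing cross-validation metrics with and without it.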

Models

We explored model architectures such as MolE [1] and MolGPS [2], as well as modelling methods available in the MolFlux package [3]. For each property, we selected the architecture with the best cross-validation metrics, which led us to choose MolGPS-based models for every endpoint.
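The per-property selection step amounts to picking the best-scoring architecture for each endpoint. The sketch below shows that logic with placeholder architecture names and made-up scores; none of the numbers are results from the competition, and `select_best_model` is a hypothetical helper.

```python
def select_best_model(cv_scores, higher_is_better=True):
    """Pick the best-scoring architecture for each endpoint.

    cv_scores maps endpoint -> {architecture name -> mean CV metric}.
    """
    chooser = max if higher_is_better else min
    return {
        endpoint: chooser(per_model, key=per_model.get)
        for endpoint, per_model in cv_scores.items()
    }


# Placeholder scores, invented for illustration only.
cv_scores = {
    "LogD": {"arch_a": 0.71, "arch_b": 0.78},
    "HLM": {"arch_a": 0.55, "arch_b": 0.49},
}
best = select_best_model(cv_scores)
print(best)
```

For error metrics such as MAE, pass `higher_is_better=False` so the lowest score wins.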

[1] - Méndez-Lucio, O., Nicolaou, C.A. & Earnshaw, B. MolE: a foundation model for molecular graphs using disentangled attention. Nat Commun 15, 9431 (2024)

[2] - arXiv:2404.11568

[3] - https://github.com/Exscientia/molflux
