
@baoilleach
Created June 27, 2023 19:17
Notes from Ninth Joint Sheffield Conference on Chemoinformatics
I have no Twitter notes from the first day. Here are my notes from Days 2 and 3...
#shef2023 Adele Hardie (Uni Edinburgh) on an sMD/MSM approach for rational design of allosteric modulators.
Have come up with a workflow to predict allostery. Examples from two protein systems.
Orthosteric inhibition is where you stick a molecule into the active site, blocking it. Allosteric inhibition is where the molecule interacts somewhere else and affects protein activity. How can we predict this? Using MD.
Diff methods have diff cost. We use classical mechanics to compute the energies of the system, bonds, angles, torsion angles. The constants come from sets of precomputed params called forcefields. We can look at systems as big as protein-ligand, and ns timescales.
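The classical-mechanics energy terms mentioned here (bonds, angles, torsions) can be sketched in a few lines. The functional forms are the standard ones; the force constants and equilibrium values below are invented for illustration, not taken from any real forcefield.

```python
import math

def bond_energy(r, k=300.0, r0=1.53):
    """Harmonic bond stretch: E = k * (r - r0)^2 (toy kcal/mol, Angstrom)."""
    return k * (r - r0) ** 2

def angle_energy(theta, k=50.0, theta0=math.radians(109.5)):
    """Harmonic angle bend around a tetrahedral equilibrium angle."""
    return k * (theta - theta0) ** 2

def torsion_energy(phi, v=1.4, n=3, gamma=0.0):
    """Periodic torsion: E = V/2 * (1 + cos(n*phi - gamma))."""
    return 0.5 * v * (1.0 + math.cos(n * phi - gamma))

# Energy of one slightly-distorted fragment (made-up geometry).
total = bond_energy(1.55) + angle_energy(math.radians(111.0)) + torsion_energy(math.radians(60.0))
```

In a real forcefield these terms are summed over every bond, angle and torsion in the system, with parameters looked up by atom type.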
We can do Markov State Modelling (MSM), where we model probs of states (conformations). If the probabilities of the active vs inactive state change in the presence of a ligand then it's a modulator. Difficulty is that this is millisecond-to-second timescale - too slow for MD.
The solution is to make the change happen instead of waiting for it. Steered MD (sMD) - steer the system from one conf to another, via one or more collective variables (CVs). Modify the system E based on deviations from ....
This is a biased trajectory - we don't want to use this for our model. We save snapshots from the sMD trajectory and then use these as seeds for further equilibrium MD - we can leverage parallel computing and observe state transitions. These get combined into the Markov model.
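The seed-then-count step can be sketched minimally: estimate a transition matrix from many short equilibrium trajectories, then extract stationary state populations. This is a bare-bones sketch (real MSM packages such as PyEMMA/deeptime handle lag-time selection, reversibility and uncertainty); the trajectories below are invented.

```python
def transition_matrix(trajs, n_states, lag=1):
    """Row-normalised transition probabilities estimated from many short
    (seeded) trajectories at a given lag time."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajs:
        for a, b in zip(traj, traj[lag:]):
            counts[a][b] += 1.0
    T = []
    for row in counts:
        s = sum(row)
        T.append([c / s if s else 0.0 for c in row])
    return T

def stationary(T, n_iter=1000):
    """Power-iterate a uniform distribution to the stationary populations."""
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p

# Two states (0 = inactive/open, 1 = active/closed); short trajectories
# started from sMD snapshots stand in for the seeded equilibrium runs.
trajs = [[0, 0, 1, 1, 1], [1, 1, 1, 0, 1], [0, 1, 1, 1, 1]]
pi = stationary(transition_matrix(trajs, 2))
```

Comparing the stationary populations with and without the ligand is what identifies a modulator in the workflow described above.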
PTP1B system: A difficult drug target with large amounts of exptal data. Activity characterized by the conf of the WPD loop: active closed, inact open. When we introduce cmpd 1, the active state prob drops considerably. But not when we repeat it...!
The problem was that the steered MD was steering the loop too fast for the rest of the allosteric network to catch up with it. The network *needs* to be taken into account. Can also do it with covalently bound inhibitors by forcing a distance constraint.
How does the allosteric effect work? Shows poses with PI stacking. The active conf has a down rotamer and the inactive an up rotamer. This helps explain how the ligand works. Also shows the covalent linkage example. The MD can help explain why we are seeing these changes.
#shef2023 Second example. EPACs - exchange proteins directly activated by cAMP, assoc with large movement of RR (regulatory region). Key point here is about selection of features. We define certain important features of the protein that are relevant to activation: RR and hinge RMSD.
#shef2023 Difficult to match a large region though, so we use a particular angle as a feature that helps distinguish between active and inactive states. Important to choose relevant features when building MSMs.
#shef2023 We've shown a new way to combine sMD and MSMs to model allosteric modulation. Steering the allosteric network alongside the steered trajectory is important. In future, will be looking at ion channel PKD2, another project in collab with UCB.
#shef2023 Not yet a high-throughput method, but want to work towards a useful tool for screening.
#shef2023 All code is on GitHub at https://github.com/michellab/AMMo
#shef2023 David Palmer (Uni Strathclyde) on Simultaneous Entropy, Enthalpy and Free Energy Prediction using a Physics-Informed Neural Network and Multi-task Learning.
#shef2023 Some work on calculating thermodynamic props and enthalpies and entropies. Subtitle: "AI & Chemistry: New tricks for old dogs?" New things to do with 'old' chemistry methods.
#shef2023 In the cancer diagnostics company he works in, they have a microscope slide which acts as the crystal for the FTIR. There's an autosampler that moves the microscope slide over the aperture. They sell this. If you apply ML to the output, you can detect cancer.
#shef2023 Shows a retrospective study where they can detect a whole range of cancers. Can tune the algorithm towards high sensitivity. Important to diagnose at early stages. Application of new AI to old IR.
#shef2023 AI gives you a different view. RNN for time domain modelling of FTIR spectra. The dogma is that there are differences from one sample to another. So you need to preprocess, e.g. get derivatives, scale it, smooth it, EMSC (fit to a reference). Gets rid of variance.
#shef2023 Preprocessing improves your ROC curve. Repeatability gets better and accuracy gets better. Many studies about the best preprocessing. But there's an elephant in the room. What about the FT? The largest part of the preprocessing.
#shef2023 Can feed the raw spectra into the ML model, and then the pre-processing steps have little effect.
#shef2023 Data augmentation for generative models for cancer data. Low prevalence disease - so need more data. Can simulate fake spectra from real spectra (WGAN). If used for augmentation, can improve results from classifier.
#shef2023 Going to focus now on modelling solvation. Chris Hunter - the solvent influences practically everything. Putting the solute in the solvent influences it. Relative water density around solute. Instead of using molecule as descriptor, use the solvent structure.
#shef2023 How can we model the solvent? Either explicit solvent (MD, MC) or implicit, which doesn't tell us anything about the structure of the solvent.
#shef2023 3D RISM (3D ref interaction site model). Works with density distribution fns. The density around the solute is given on a 3D grid, whereas the solvent is described at interaction sites. There are various equations (!), and we approximate the bridge.
#shef2023 Gives you hydration free E and partial molar volume. GAsol was the method we developed for this. Places water molecules in the density computed by 3d RISM - code is available.
#shef2023 We are going to predict bioaccumulation factor, BCF. Train 3D CNN on 3D RISM input, just uses solvent density. Works fairly well, but is a bit of a pain. Quite slow - hours for some solutes. Disk space and memory requirements.
#shef2023 Distrib fns are not invariant to rot of solute. Dependent on solute and solvent confs. Not very convenient.
#shef2023 So we were interested in whether we could use a simpler model, 1D RISM. No 3D distrib. Only defined in terms of distance from atoms. Can solve equations within a few minutes. Can get solvation free Es (goes back to Chandler's work in early 80s).
#shef2023 Descriptors are physically interpretable, and there is no sampling noise. However, there are no open-source, maintained 1D RISM solute-solvent solvers.
#shef2023 We developed pyRISM - Abdullah Ahmad. Open source solver. XRISM and DRISM. The number of distribution fns is dependent on the no. of atoms in the molecule - not what we want. Omit the integration - just have continuous.
#shef2023 Can we improve the accuracy? Create data set for solvation free E. The standard theory is wildly inaccurate. Let's replace the free E functional with a DL free E functional. Results are now much more accurate. Multiple solvents + multiple temperatures. All neutral mols.
#shef2023 The next step is to look at ionised solutes. Still works quite well - models trained separately. Can be extended to predict entropy, enthalpy and free E using multi-task learning (at same time) and transfer learning.
#shef2023 Have recently found a great dataset from WH Green, CombiSolv-Exp and are building on this.
#shef2023 Richard Gowers on The Open Free Energy Consortium
#shef2023 Software project around free E calculations. It's funded by a pre-competitive group of partners. Going to explain about alchemical free E methods. What can they achieve? Will describe a novel algorithm for mapping atoms.
#shef2023 If we could predict the free E of binding, "it would be nice". Capturing this process using conventional MD is impossible. But we are interested in multiple molecules which makes it worse. Actually - it makes it easier. The thermodynamic cycle, for the relative change..
#shef2023 Taking one ligand converting it into ligand - alchemical transformation. Occurs on timescales accessible to MD sims. Shows examples of converting OH into CH3. Have a parameter that controls the gradual switching of one molecule "off" and the other "on".
#shef2023 This moves the model through a non-physical "alchemical" state. The trick is that you have to smoothly convert between the two to get overlap. This is the crux of it.
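The gradual "switching" described above is, in its simplest form, a linear mixing of the two end-state potentials by the coupling parameter lambda. This is a minimal sketch (production codes use soft-core potentials to avoid endpoint singularities); the energy values are invented.

```python
def alchemical_energy(u_a, u_b, lam):
    """Linear alchemical coupling: U(lambda) = (1 - lambda)*U_A + lambda*U_B.
    lambda = 0 is pure ligand A (e.g. with OH); lambda = 1 is pure ligand B
    (e.g. with CH3)."""
    return (1.0 - lam) * u_a + lam * u_b

# A schedule of lambda "windows" smoothly switches one molecule off and the
# other on; adjacent windows must overlap for the free E estimate to work.
windows = [i / 10.0 for i in range(11)]
profile = [alchemical_energy(-50.0, -42.0, lam) for lam in windows]
```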
#shef2023 This is the method we are building solutions around. We are taking academic code and then tidying it up, raising the quality and maintaining it beyond the normal 3 year cycle of PhD funding. Everything is MIT licensed and made available. Robust, scalable.
#shef2023 Here's some benchmarking. Method is perses-like RBFE protocol. HREX. 3 repeats w/11 windows per edge. 1 ns equil, 5 ns prod, HMR / 4 fs. [Ed: no idea what this is!] Min spanning graph + LOMAP [I've heard of this]
#shef2023 Shows some benchmarking results for diff systems. These are absolute free Es. We've done a maximum likelihood to convert from relative to absolute to compare to exptal results. It's about getting the rank ordering. MUE 0.95, RSD = 1.35 kcal/mol. Compare vs PMX.
#shef2023 Cost? 1 GPU-day - about $10 in AWS bills. Accuracy of around ~1 kcal/mol. This is state of the art. Highly dependent on the actual target/ligand. Even when absolute acc is off, often rank order is still correct.
#shef2023 Now about OpenFF - a cousin organisation under the same umbrella. The novel thing they did is to use SMARTS patterns - obvious to everyone in this room, but wasn't standard practice.
#shef2023 Shows the infrastructure they use for the training loop to develop the FF before release. This was one of the first times this has been made transparent. Has led to improvements in how FFs are produced. Current is Sage, OpenFF 2.0.0. Outperforms other public small mol FF.
#shef2023 In a few years, is catching up on OPLS3. What's coming? Version 3 will cover proteins as well. 3.1 will use graph NNs to generate AM1-BCC-like charges.
#shef2023 Usage example in Python. "import openfe", etc. Is available as Python package, as CLI. Various software vendors will provide GUIs. All on GitHub. openfe, gufe, pdbinf, lomap, kartograf.
#shef2023 Gufe - Grand Unified Free E. Helps to convert the thermodynamic cycle into code. PDBinf for protein informatics, to overcome problems assigning bond orders in RDKit to non-standard amino acids.
#shef2023 These templates are available from the PDB. The atom-mapping problem. Which atoms are transmuted into which atoms in an alchemical transformation. And need to define a free E network. Trad tool for this is LOMAP.
#shef2023 The extra cycles give you more robust network and more accuracy. We've been looking at a diff approach, Kartograf, which is based on the geometries. We are working with pre-docked ligands. MCS not best way - instead use distance-matrix between atoms in each molecule.
#shef2023 Solve using a linear-sum-assignment algo to determine correspondence. Chemistry comes in late in the project. Doesn't get tripped up by stereochemistry compared to LOMAP. Comparison to LOMAP.
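The assignment step can be shown on a toy distance matrix. Real tools use the Hungarian algorithm (e.g. scipy's `linear_sum_assignment`); brute force over permutations is enough for a tiny illustrative example. The distances below are invented.

```python
import itertools
import math

def best_assignment(dist):
    """Brute-force linear-sum-assignment on a small square distance matrix:
    find the atom pairing that minimises the total distance. O(n!) - only
    suitable for toy inputs."""
    n = len(dist)
    best, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(dist[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

# Hypothetical pairwise distances between 3 atoms of two pre-docked ligands;
# small diagonal entries mean the poses already overlay atom-for-atom.
dist = [
    [0.2, 2.1, 3.0],
    [2.0, 0.4, 1.9],
    [3.1, 1.8, 0.3],
]
mapping, cost = best_assignment(dist)
```

Because it works on geometry rather than graph substructure, this kind of mapping is not tripped up by stereochemistry the way an MCS search can be.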
#shef2023 We are working on implementing abs binding free Es - takes about 10x computational effort.
https://try.openfree.energy
https://docs.openfree.energy
#shef2023 Uschi Dolfus (Hamburg) on Full modification control over retrosynthetic routes for guided opt of lead structures
#shef2023 A cooperation project with Bayer. Synthesisability often trips us up in in silico design of lead candidates. But often only applied as a postprocessing step. Ready-made solutions neglect the user's expertise.
#shef2023 We give the med chemist an opportunity to settle early on a route and then generate synthetically accessible structural analogs. Already published. Synthesia.
We start with a lead structure, a retrosynthetic route and building blocks. We create the desired structural diversity but preserve the applicability of the retrosyn route.
#shef2023 Shows the verification of tree compatibility. We want to change B into S; if the generic reaction can't be applied, then we dismiss the candidate substitution. This algo is for exchange of reactants. Only provides limited control.
#shef2023 If we want to exchange multiple reactants at the same time, we need to verify all subtrees originating from nodes which are exchanged. Nodes are sorted and processed in reverse topological order. Tends to result in larger struct change in the target cmpd and more analogs.
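For a tree, a post-order traversal gives exactly this reverse topological order: every reactant subtree is visited before the node that consumes it. A minimal sketch with an invented retrosynthesis tree:

```python
def topological_order(tree, root):
    """Post-order DFS over a retrosynthetic tree: children (reactants) appear
    in the output before the node built from them."""
    order = []
    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        order.append(node)
    visit(root)
    return order

# target <- intermediates <- purchasable building blocks (hypothetical names)
tree = {"target": ["int1", "int2"], "int1": ["bb1", "bb2"], "int2": ["bb3"]}
order = topological_order(tree, "target")
```

Verifying substitutions in this order means a candidate exchange can be dismissed as soon as any generic reaction in its subtree fails to apply.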
#shef2023 Showing example of futibatinib. Generated an initial retro route with AiZynthFinder. In a real-life example we expect the user to provide this. Here's a route with 6 rxns.
#shef2023 As a next step, we decided to do a reaction exchange. The med chemist needs to provide a list of generic rxns, e.g. from lab journals. AiZynthFinder can provide a list from USPTO rxns. Create a new tree for each proposed substitute.
#shef2023 We decided to exchange rxn no 4. We need an acid chloride for our first reactant. Replaced acid chloride with carboxylic acid.
#shef2023 During tree traversal, automatically detect rxn nodes that block otherwise suitable substitutes and skip rxn node if possible. Enables shortening of route if suitable reactant structures are available. Shows example - skipping deprotection rxn.
#shef2023 Up to this point, algo is only for specific info on where/how to mod the route. Where have no prior knowledge, have implemented an approach to help. Can specify which part of target structure you want to have modified. Algo identifies responsible nodes via atom mapping.
#shef2023 Enable exchange w/o prior knowledge of the route.
Now, application scenarios. Structural diversity for single structure - keep synthetic accessibility. Also, can be used for bioisosteric linker replacements based on route compatibility.
#shef2023 We use the list of common linkers provided by P Ertl. We were able to exchange a linker in the molecule by identifying 5 out of these linker substitutes based on Enamine Global building blocks. Various linkers can be replaced. V useful as often more complex in synthesis
Synthesia is about modification control over retrosyn routes. The tool is dependent on the quality of the route, the availability of building blocks and the generic rxns. The software will be available soon.
#shef2023 Sebastien Guesné (Lhasa) on Beyond balanced accuracy: Balanced Matthews' Correlation Coefficient
#shef2023 What I would like you to remember, if you are doing classification, you are facing a confusion matrix. One of the metrics is accuracy, and balanced accuracy. Balanced accuracy is a calibrated metric. This can be applied to other metrics like Matthews Corr Coeff.
#shef2023 It's all about solving the problem of imbalanced data. One of the troubles is when you have a shift in prevalence. Can cause problems.
#shef2023 Definitions of TP, FN, FP and TN. Positive prevalence is (TP+FN)/N. -ve prevalence is (FP+TN)/N. Sensitivity = TP/(TP+FN). Spec is TN/(TN+FP). Model perf against test sets improves iff delta sens and delta spec are both > 0.
#shef2023 Showing example with results on internal test set and external test set. 90% prevalence of +ve. But what about if prevalence has changed? MCC increases, accuracy increases. Metrics such as MCC or accuracy cannot be compared - they are dependent on the prevalence.
#shef2023 Need to try to maintain prevalence or cannot compare. If not possible, use the balanced accuracy (or sens and spec). Unless there is a way beyond balanced accuracy....
#shef2023 Various papers published on imbalanced data., e.g. "Master your metrics with calibration"., "The impact of class imbalance in classification performance metrics based on the binary confusion matrix".
#shef2023 From the initial confusion matrix you can derive a matrix containing just Sen, Pre+ and N. That is, TP, FN, FP and TN can be represented in those terms, e.g. TP = Sen.Pre+.N. Using this in the defn of accuracy gives Acc = Sen x Pre+ + Spe x (1-Pre+).
#shef2023 For a given sens and spec, can plot acc versus test set prevalence. Get straight line. Vals at Prev 0 and 1 are the spec and sens. Get horizontal line if both values the same. When Pre+ = 0.5, then we get the balanced accuracy: Acc0.5 = (Sen + Spe)/2.
#shef2023 Shows eqn for MCC in terms of TP and TN, etc. Now substitute in the defns in terms of Sen, Spe and Pre+. Shows plot of MCC versus test set prevalence. Umbrella shape. Vals for 0 and 1 are always 0. Now substitute for prevalence of 0.5, we get...
#shef2023 Shows eqn for balanced MCC.
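The substitution described above can be written directly as code: express the confusion-matrix entries in terms of Sen, Spe and Pre+ (N cancels), evaluate MCC, and fix the prevalence at 0.5 for the balanced version. A minimal sketch of that derivation:

```python
import math

def mcc_at_prevalence(sen, spe, prev):
    """MCC rewritten in terms of sensitivity, specificity and positive
    prevalence; the test-set size N cancels out of the ratio."""
    tp = sen * prev
    fn = (1 - sen) * prev
    tn = spe * (1 - prev)
    fp = (1 - spe) * (1 - prev)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def balanced_mcc(sen, spe):
    """MCC evaluated at 50% prevalence, by analogy with balanced accuracy."""
    return mcc_at_prevalence(sen, spe, 0.5)

bmcc = balanced_mcc(0.9, 0.8)
```

At prev = 0.5 this simplifies algebraically to (Sen + Spe - 1) / sqrt((1 + Sen - Spe)(1 + Spe - Sen)), which depends only on sensitivity and specificity, making it comparable across test sets with different prevalences.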
Now going to show use cases. Starting with applicability domain. Removing cmpds could shift the prevalence. Similarly with comparing external and internal test sets. Non-stationary data streams - the composition of the dataset is changing.
#shef2023 Another application is when you do a cluster split. Diff clusters may have diff prevalence.
Back to tale of two MCCs. Shows that balanced MCC takes into account the prevalence. When sens and spe decrease, the balanced metrics should go in the same direction.
#shef2023 *Mind the prevalence* - Pros: fair comparison. Cons: does not represent the metric at the true prevalence.
#shef2023 Ref to GHOST: in the questions.
https://pubs.acs.org/doi/10.1021/acs.jcim.1c00160
#shef2023 Question about kappa coefficient: shows backup slide with the same thing done.
#shef2023 Marc Lehner (ETH) on DASH: Dynamic attention-based substructure hierarchy for partial charge assignment
#shef2023 Showing MD simulation moving around particles. Non-bonded interactions modelled by Coulomb Potential. Involves qi, qj partial charges. Started in 1916, Lewis partial charges. Gasteiger/Marsili in 1978 developed to predict/interpret rxns.
#shef2023 Not v suitable for MD. With QM, Dewar developed AM1 in 1985. Underpolarized for MD simuls. MMFF94 charges were optimized for MD. RESP was an atom-in-molecule type approach. AM1-BCC 2002 - still used. AM1 with bond charge corrections. Well suited for MD. Too slow for big mols.
#shef2023 Even better DDEC 2010, MBIS 2016. More advanced, but inc comp cost drastically. Needs high level of theory QM calcs. So people started developing ML models to predict these, e.g. 2018, RF model for DDEC (Bleiziffer JCIM 2018).
#shef2023 We did the same. GNN - AttentiveFP architecture. It works, but has no explainability, no confidence interval, and poor software robustness.
#shef2023 To explain it, we can use the GNNExplainer from PyTorch Geometric. Numerical vals are extracted for the attention of each node. Shows whether the GNN did a good or bad job, but it's v. comp expensive and doesn't give any confidence interval.
#shef2023 We decided to go a diff route. DASH! We discretize the molecule into fps, based on features like atom type, how many bonds, the formal charge. Then we build up a substructure based on these fps.
#shef2023 We use the attention from the GNN to prioritize the selection at each branch point. We build up the fragment by adding atoms in this way.
#shef2023 Cumulative attention is used as the stop criteria. Instead of 122 atom types, we have 10^7 patterns. Shows that DASH compared to other methods. Much better.
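A toy version of this kind of rule hierarchy, in the spirit of DASH but not the actual implementation: atoms are described by coarse feature tuples (element, bond count, formal charge), and a tree is walked by adding neighbour features in priority order until a cumulative-importance threshold is reached. All tree contents, weights and charges below are invented for illustration.

```python
# Each node: (charge_estimate, importance_gain, children keyed by neighbour
# feature tuple). Feature tuple = (element, n_bonds, formal_charge).
TREE = {
    ("C", 4, 0): (0.05, 0.6, {
        ("O", 2, 0): (0.25, 0.3, {}),   # carbon next to an ether-like oxygen
        ("C", 4, 0): (-0.02, 0.2, {}),  # carbon next to another carbon
    }),
    ("O", 2, 0): (-0.40, 0.7, {}),
}

def assign_charge(atom, neighbours, stop=0.8):
    """Descend the hierarchy, refining the charge estimate with each added
    neighbour feature, until cumulative importance passes the stop criterion
    or no deeper rule exists."""
    charge, weight, children = TREE[atom]
    for nb in neighbours:
        if weight >= stop or nb not in children:
            break
        charge, weight_gain, children = children[nb]
        weight += weight_gain
    return charge

q = assign_charge(("C", 4, 0), [("O", 2, 0)])
```

The point of the tree structure is that lookup is a cheap traversal at assignment time, while the expensive GNN attention is only needed once, when building the hierarchy.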
#shef2023 Did we improve on all points, or did we lose something? How fast? We are in the range of the GNN [ed: it's a log plot....]. Much faster than AM1-BCC.
#shef2023 The DASH tree is a rule-based fragment assignment method. Everything integrated into OpenFF.
#shef2023 Max Beckers (NIBR) on Leveraging Large-scale in silico ADMET Predictions to Estimate Small Molecule Developability
#shef2023 Has been working on leveraging historical optimization data. What can we learn from the past? New insights, patterns?
#shef2023 Previously worked on reconstruction of Novartis chemical series, tracing cmpds during optimisation and analysis of property evolution over time (JCIM 2022). Permeability was the only prop that went in the wrong direction.
#shef2023 Now to use the data to get prospective tools for cmpd and series evolution. We take the SMILES, then predict ADMET, in vivo PK and SAFETY profile (MELLODDY plus internal models). No measured data. No target activities.
#shef2023 Trained a neural net on the cmpds, along with annotated milestones. Predict likelihood of cmpd to progress beyond PK studies. Solubility, plasma protein binding, clearance, safety assays.
#shef2023 Dataset contains cmpds that progressed quite far along. The application of the test dataset. Showing the results. Using explainable AI. Explaining bPK scores. Shapley-value analysis.
#shef2023 To control for any bias due to similar molecules appearing in the test set, we curated a public dataset that resembles in-house cmpd archives. Cmpds that reached the clinic showed quite a separation from cmpds that did not. No diff for clinical phases.
#shef2023 Looked at alternative ML approaches. E.g. GNN, XGBoost. One last source of train-test leakage, exploiting MELLODDY test-folds. May have seen some of the data in the public set. Curated a dataset.
#shef2023 Application to 3 in-house projects to see how the predictions agree with what they have seen. Shows the series progression.
#shef2023 Application to in silico generated virtual cmpds. We enumerated around exit vectors of three scaffolds. Works quite well. Looked at in-house projects also. Some challenges remain. The amount of training data. New modalities are rare in training data.
#shef2023 False negatives make training hard. Further applic to de novo gen of molecules. Screening followups. Monitoring progress of optimization projects. DC identification.
#shef2023 Question about making data available: will make code available, be as open as possible.
#shef2023 Moritz Walter (BI) on Integrating heterogeneous assay data for ML-based ADME prediction
#shef2023 Drug discov is a MPO problem. Potency, PK/ADME, safety. Absorption, distribution, metabolism/excretion.
#shef2023 There are several tiers of assays. Tier 0 - LogD. Tier 1 is microsomal stab (LM), and SOL. Tier 2 - HEP, PPB, Caco2. Tier 3 - F (bioavail), Vd, Cl. Cheap/high throughput assays required to test large nos of cmpds. Only measure promising candids in cmplx assays.
#shef2023 Q is can we use ML preds to prioritise cmpds for synthesis/to replace expts? We are going to use multi-task modelling. We could have different models for each of the endpoints; instead, we have one model that predicts multiple end points.
#shef2023 The theory is that it learns a more meaningful representation than a single task model. Assays are related. Data-poor assays might benefit from signal in data-rich assays. Implementation is Chemprop (a graph convolutional NN).
#shef2023 Input is the chemical graph. All the predictions are from an ensemble of 5. Are multitask (MT) models superior to ST (singletask) models? Also look at RF model. When predicting higher tier assays, is there a benefit in including available exptl data of lower tiers.
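Multi-task training on sparse assay data typically masks missing labels so each endpoint only contributes loss where it was actually measured. A minimal, framework-free sketch of that masking (the real implementation here is Chemprop; the numbers are invented):

```python
def masked_mse(preds, targets):
    """Mean squared error over measured entries only; None marks a compound
    that was never run in that assay."""
    errs = [
        (p - t) ** 2
        for pred_row, targ_row in zip(preds, targets)
        for p, t in zip(pred_row, targ_row)
        if t is not None
    ]
    return sum(errs) / len(errs)

# 3 compounds x 2 endpoints (e.g. a Tier 0 and a Tier 1 assay); the second
# compound was only measured in the first assay.
preds = [[1.0, 0.5], [2.0, 1.0], [0.0, 0.2]]
targets = [[1.0, 0.0], [2.5, None], [0.0, 0.2]]
loss = masked_mse(preds, targets)
```

This is what lets data-poor higher-tier assays share a representation with data-rich lower-tier ones without requiring every compound to be measured everywhere.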
#shef2023 Temporal data split. Train up to 2020, and evaluate on 2021. Training set sizes: Tier 0 120K, Tier 1 50K-125K, Tier 2 1K-17K, Tier 3 2K-10K. MT-Chemprop outperforms the ST approach and the RF approach (AlvaDesc). R2 is ~0.9 for Tier 0, ~0.6 for Tier 1, 0.5-0.6 for Tier 2, 0.2-0.4 for Tier 3.
#shef2023 PPB_rat. Logit scale: 50% PPB is 0, 90% is ~0.95, 99% is 2, 99.9% is 3. Actually makes a difference: 1% free drug vs 0.1%. Understanding the success of MT models. Which auxiliary assays are the most useful? And why?
#shef2023 Plot of diff between R2 of MT vs ST, vs the abs Pearson coefficient of data points measured in both assays. Also compared to size of dataset. Size of aux dataset more relevant than correlation for success of MT model.
#shef2023 There is an additional benefit if exptal data of lower tiers is available. Data-rich assays seem to be the most useful assays in the MT-model (despite low correlation to target assay).
#shef2023 Q about potential bias with MT models as cmpds measured in Tier 3 do well at all the lower tiers. A: are aware of this, but don't have a solution right now.
#shef2023 Tuomo Kalliokoski (Orion) on Efficient SBVS of ultra-large enumerated chemical spaces using machine learning boosted docking (HASTEN)
#shef2023 "Efficient" means something you can actually use in drug discovery.
Orion is over 100 years old, 3.5K workers.
#shef2023 I know that SBVS with docking has a severe issue with false positives. This talk is not about fixing the numerous problems with docking. This is only about making it much faster.
#shef2023 SBVS. We have a large set of small molecules. We dock against a target structures, and sort by their fit to the target structure. The better the score, the greater the prob that the compound will be active in an assay.
#shef2023 Ultra-large enumerated chemical spaces. Billions of virtual cmpds that can be made economically by request within weeks. Can be used in hit finding. What's the problem? Dock them all. If you can dock 1B / week, it will take a month - cannot brute force.
#shef2023 The trick is that ML preds are much faster than docking. Pick some cmpds by random. Dock them. Build a model. Predict scores, and then choose more compounds, etc. The original idea from Gentile - Deep Docking. ML is Chemprop - v nice work from MIT. Why repeat?
#shef2023 To be frank, I could not run the Deep Docking code myself. Also, I was interested in accelerating smaller scale docking. I wanted to use Schrodinger software + Chemprop. And I wanted to do some in the cloud and some local. Published in MolInf. Slides will be available.
#shef2023 Pick random set. Convert to 3D. Dock. Then use docked scores to train ML model using ChemProp. Then sort the predictions and pick top N.
#shef2023 How to quantify the performance? Recall of virtual hits. A cmpd that has good enough docking score (such that it fits the top 1% of the screen, or an absolute value cutoff like <= -10.0). Estimated recall based on a random sample.
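The recall metric can be made concrete: of the compounds whose true docking score clears the cutoff (top 1%, or an absolute value like <= -10.0), what fraction did the ML-guided screen actually dock? The scores below are invented; more negative is better.

```python
def virtual_hit_recall(all_scores, docked_ids, cutoff=-10.0):
    """Fraction of true virtual hits (score <= cutoff) that were among the
    compounds actually docked by the ML-guided screen."""
    hits = {cid for cid, s in all_scores.items() if s <= cutoff}
    if not hits:
        return 0.0
    return len(hits & set(docked_ids)) / len(hits)

# Hypothetical docking scores for a 5-compound library; the screen chose to
# dock a, e and b.
all_scores = {"a": -11.2, "b": -9.1, "c": -10.5, "d": -7.3, "e": -12.0}
recall = virtual_hit_recall(all_scores, docked_ids=["a", "e", "b"])
```

In practice the full-library hit count isn't known, so recall is estimated from a random sample that was docked exhaustively, as the notes say.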
#shef2023 HASTEN validation study on a million scale. The Deep Docking paper has data for 12 targets with 3M cmpds each docked with FRED. Data for one internal target (HAT) that was docked with Glide with core constraints. Note the improvement from active learning vs just a single iteration.
#shef2023 Works fine with 100M dataset as well. Lyu et al Nature 2019, 566, 224. Shows scale of MolPort versus 100M AMPC study vs Enamine REAL. In the Pipeline Derek Lowe had post on 19 Mar 2021 that no-one can dock 1B molecules.
#shef2023 Used Finnish national supercomputing resources docked 2x1.56B cmpds to two academic drug targets (168 d using 640 cores on CSC Mahti supercomputer). Glide HTVS versus SurA and GAK.
#shef2023 You only need to dock 0.5% of the library to find >80% of the high-scoring cmpds. One should exclude cmpds failing to dock from the ML training instead of giving them an arbitrary score. This gave us the confidence to tackle Enamine REAL.
#shef2023 You don't need a supercomputer. 64-core processors, AMD Ryzen. For ML, can have consumer-grade card. Need quite a bit of diskspace. We are limited by software licenses (64 x 3), 21 h per iteration. ML 3 h iteration 2, 6h iteration 3, 9 h iteration 4. Can use AWS.
#shef2023 He pre-filters set of in-stock cmpds from Enamine and MolPort, and dock those, import to HASTEN. Then use on Enamine REAL.
#shef2023 Example. Time allowed for the screening: 1 month. Giga-scale docking in drug discovery. Enamine 5.5B. Can filter for HAC, RotBonds, TPSA down to 1.6B. Can also control for stereocenters and num aromatic rings.
#shef2023 These cmpds were diverse. The problem was the virtual hit triage. We ordered 300 out of 2.8M. You need to use multiple criteria.
#shef2023 HASTEN v 1.1 now out. Several bugfixes and opts for giga-scale screening implemented. Available on GitHub via version 1.1 branch.
#shef2023 You could have an ensemble, to get deg of uncertainty and then use Bayesian sampling, but that takes time, and it works well enough now.
#shef2023 Rajarshi Guha (Vertex) on Virtual Screening of Virtual Libraries using a Genetic Algorithm
#shef2023 Based in Boston, sites in Oxford and San Diego. VS an important topic in our group.
#shef2023 Douglas Adams quote on the mind-bogglingly big size of 'chemical' space. Lyu et al, the larger the space, the more likely you are to find chemical matter. We want to be able to use arbitrary scoring fns. We want to keep up with the increasing size of virtual libs.
#shef2023 Many ways to search in VS. Combinatorial approaches: don't enumerate - operate on BBs and rxns. Scoring fns need to be additive, e.g. infiniSee from BioSolveIT. Alternatives are brute force, guided/adaptive/active learning, and sampling. Sampling scales with reactants.
#shef2023 We've been using Thompson sampling. This work inspired by Gabby, but much simpler. Create a random pop of products from a rxn. The reagents used for each product rep its genes. Apply crossover/mutation to gen a child pop. Over generations, the pop will evolve towards greater fitness.
#shef2023 We first define an individual. The reagents for the rxn are the genotype; the phenotype is the product. The fitness is an objective fn. Select a rxn randomly, then reagents randomly from that.
#shef2023 Crossover swaps the reagents. Mutation randomly mutates one of them. Now we can apply the evolutionary cycle. Has early termination check.
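The genotype/phenotype setup above can be sketched as a tiny GA over reagent choices. The reagent lists and fitness function below are invented stand-ins (real fitness would be Tanimoto, ROCS or docking on the enumerated product), and the selection scheme is a simple elitist one for illustration.

```python
import random

R1 = ["acidA", "acidB", "acidC"]    # hypothetical reagent pool, slot 1
R2 = ["amine1", "amine2", "amine3"] # hypothetical reagent pool, slot 2

def fitness(ind):
    # Stand-in scoring fn: reward one particular reagent pairing.
    return (ind[0] == "acidB") + (ind[1] == "amine3")

def crossover(p1, p2):
    # Swap reagent slots between two parents.
    return [p1[0], p2[1]]

def mutate(ind, rate=0.3):
    out = list(ind)
    if random.random() < rate:
        slot = random.randrange(2)
        out[slot] = random.choice(R1 if slot == 0 else R2)
    return out

def evolve(pop_size=20, n_gen=30, seed=1):
    random.seed(seed)
    pop = [[random.choice(R1), random.choice(R2)] for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Only the sampled individuals ever get enumerated and scored, which is what makes the approach tractable on a 33B-member space.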
#shef2023 There are some params, e.g. size of pop, num gen, probs of crossover/mutation, etc., etc. It's relatively robust to the choice of params. The GA is implemented to support scaling. We are using Nextflow (Tower). AWS Batch. Scoring on Lambdas, S3 results storage.
#shef2023 We have 271 GAs in parallel on each of the reactions. We use Tanimoto sim using 2D circular fps, < 30min. ROCS shape sim, 3-4 h. FRED docking, 7-9 hours.
#shef2023 Used Enamine REALSpace (33B) with 2D and 3D scoring functions. For a subset of reactions, we explicitly enumerated the product space and id'd the optimal hit (for a given scoring fn). Assess ability to 're-discover' pre-existing mols. Compare to a null model and alternatives.
#shef2023 The GA is time and sample efficient for 2D similarity. Identified 76 rxns with product space with < 500K mols. Shows example of optimising Tanimoto to query molecule. There are rxns where it can shift the distrib, but also those where there is not a big sep.
#shef2023 The GA finds the most similar hit in many cases, 31/76 times. When you don't find the optimal, it still finds something v. similar. Also works with Shape. We did it with ROCS. Similar results. Only samples small fraction of product space.
#shef2023 The ROCS-scored pop evolves to better fitness. 37 reactions with < 100K products were searched by brute force. The GA ids the best hit 13/37 times.
#shef2023 Can it rediscover pre-existing molecules? Out of 10 runs, 3 were able to find the query molecule.
The other times, it was often able to find something v similar. We did this with some other query molecules.
#shef2023 Searching Enamine REALSpace. 33B virtual mols. Pop size = 100, Ngen = 250, 5 runs. Took 26 min and sampled 0.0061% of the full library.
#shef2023 Null model is random sampling. GA is doing quite a bit better than random. How does it compare to Thompson sampling - this is our work horse method that we use. There is overlap. The GA gives more examples at the upper end of the range.
#shef2023 Ren et al (Chem Sci, 2023) employed generative models and SBDD. Tanimoto sim and ROCS searches in REALSpace gives us hits that explore the RHS. A docking fitness fn could have led to exploration of the LHS and identified the pyrazole (or bioisosteres).
#shef2023 A GA is a viable approach to VS of virtual libraries, with arbitrary scoring fns
*Do we need generative models if we have access to make on demand virtual libs?*
They are synthesisable and available. We can search them flexibly. They can get us close to where we want.
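The GA loop described in this talk can be sketched in plain stdlib Python. This is a minimal, hypothetical version (the truncation selection, uniform crossover and parameter defaults are my own, not the speaker's implementation): a chromosome is one reagent index per reaction component, and `fitness` stands in for the 2D similarity / ROCS / docking scoring functions.

```python
import random

def run_ga(reagent_pools, fitness, pop_size=100, n_gen=250,
           p_crossover=0.8, p_mutation=0.1, seed=42):
    """Sketch of a GA over a combinatorial reagent space (illustrative only)."""
    rng = random.Random(seed)
    n_slots = len(reagent_pools)

    def random_individual():
        # one reagent index per reaction component
        return tuple(rng.randrange(len(pool)) for pool in reagent_pools)

    def mutate(ind):
        # replace each gene with a random reagent with probability p_mutation
        return tuple(rng.randrange(len(reagent_pools[i]))
                     if rng.random() < p_mutation else g
                     for i, g in enumerate(ind))

    def crossover(a, b):
        # uniform crossover over reagent slots
        return tuple(a[i] if rng.random() < 0.5 else b[i]
                     for i in range(n_slots))

    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(n_gen):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[:pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(elite, 2)
            child = crossover(a, b) if rng.random() < p_crossover else a
            children.append(mutate(child))
        pop = elite + children
    return max(pop, key=fitness)
```

Because only sampled individuals are ever scored, the expensive enumeration of the full product space is avoided, which is the point made above about sampling a tiny fraction of REALSpace.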
#shef2023 Lauren Reid (MedChemica) on SARkush: Automated Markush-like Structure Generation using Matched Pairs and Generic Atom Scaffolds
#shef2023 The name is a combo of SAR and Markush. A structural rep designed to communicate SAR, e.g. X for aliphatic ring atom, E for linker atoms, R for side chains, x for aromatic atom. We want to decompose SAR into a table of R groups.
#shef2023 Markush structures are important as they are an established way to communicate SAR. Can provide input to Free-Wilson or QSAR approaches. But manual production is time consuming, and decomposition algos require the user to describe a core.
#shef2023 SARkush tool brief. Automatically clusters cmpds, removing the need for users to separate cmpds into series and cores. Input all of your SAR data and get back a Markush table.
#shef2023 Shows an example output, an Excel spreadsheet with SARkush structural depictions. For each you also get a text file with all of the decomposition info and the data. There's a GUI in development. Shows a video of it working. Data uploaded to server and it runs the job.
#shef2023 Can change how it does the clustering. Inspect the smallest and largest member to see if you like the clustering. Can look at statistics assoc with that SARkush. There's also a network view in development to give an idea of the chemical space covered.
#shef2023 The algo described. We find MMPs in the data. Separately we do scaffold clustering. These are combined with the MMPs to build a network of cmpds. Disconnected parts indicate the subseries, then we do another round of scaffold clustering within those series.
#shef2023 Finally we do a desymmetrised SARkush decomposition. Everything will be explained...
#shef2023 Step 1 is to create a generic atom scaffold. Cut off ring and branched centre substituents. Set the atoms to generic (with isotope labels). Use RDKit canonicalisation to create scaffold SMILES. We can match identical generic scaffolds.
#shef2023 After doing this for the dataset we merge cmpd clusters with matching scaffolds. We merge larger scaffolds into smaller if substructure present and atom overlap criteria (80%) is passed.
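As a rough illustration of generic-scaffold matching (a simplification, not MedChemica's actual algorithm with isotope-labelled generic atoms), RDKit's Murcko scaffold can be made generic and canonicalised so that compounds sharing a ring framework collapse to the same SMILES:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def generic_scaffold_smiles(smiles):
    """Canonical SMILES of the generic (all-C, all-single-bond) scaffold."""
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)        # strip side chains
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)  # atoms -> C, bonds -> single
    return Chem.MolToSmiles(generic)

# pyridine- and benzene-based analogues collapse to the same generic ring
assert generic_scaffold_smiles("Cc1ccncc1") == generic_scaffold_smiles("CCc1ccccc1")
```

Identical generic-scaffold SMILES can then be used as cluster keys, as in the merging step described above.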
#shef2023 Now about MMPs. Pairs of cmpds that differ by a small chemical change. MedChemica's algo has two methods: MCSS and fragment-and-index. Now showing example of how network is built from clusters. Edges are added if any MMPs exist between clusters.
#shef2023 Now separate disconnected networks. These represent diff chemical series. Further clustering with more relaxed cutoff. Why important to do this? We want to avoid the "regression to benzene" problem. If we have a very large and diverse dataset, there's a risk that everything.
#shef2023 ...is clustered into benzene. Separating series into sub networks reduces this risk.
Now decomposition algo. If we have a symmetrical scaffold (e.g. benzene), you would want the substituents in the same position in each scaffold.
#shef2023 For the first cmpd, just match in any orientation. For the subsequent cmpds, iterate over each possible match and get the highest possible scoring alignment. As we move on, we score against all the previous ones.
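The symmetry issue can be illustrated with RDKit: a symmetric core matches a compound in several equivalent orientations, and `GetSubstructMatches(..., uniquify=False)` enumerates all of them so a consistency score against earlier decompositions can pick one. The scoring function below is a toy stand-in, not the SARkush one:

```python
from rdkit import Chem

core = Chem.MolFromSmarts("c1ccccc1")
mol = Chem.MolFromSmiles("Cc1ccc(O)cc1")

# uniquify=False enumerates every symmetry-equivalent atom mapping of the core
matches = mol.GetSubstructMatches(core, uniquify=False)
print(len(matches))  # 12 orientations of a benzene core (6 rotations x 2 reflections)

def score(match, previous_match):
    # hypothetical consistency score: scaffold positions mapped to the same atoms
    return sum(a == b for a, b in zip(match, previous_match))

# pick the orientation best aligned with an earlier compound's decomposition
best = max(matches, key=lambda m: score(m, matches[0]))
```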
#shef2023 COVID Moonshot. A crowd-funded project to develop SARS-CoV-2 MPro inhibitors. 3220 structures.... They started with this initial hit and explored around: 143 cmpds associated with SARkush #6. Then they change a monocycle to a bicycle, giving SARkush #1, ring closure SK#2
#shef2023 Finally, all of the active cmpds (from chiral sep) gave SARkush#60 (56 cmpds). This led to the preclinical candidate (DNDi-6510). Shows plot of how the potency progressed through time in terms of SARkush structures.
#shef2023 Can look at the variable groups through SARkush over time. Can be used on its own or as an input to Free-Wilson or QSAR for example. Temporal analysis quite informative. In future, Moonshot has transformed into ASAP Discovery ().
AI-driven Structure-enabled Antiviral Platform (ASAP)
ASAP uses artificial intelligence and computational chemistry to accelerate structure-based open science antiviral drug discovery and deliver oral antivirals for pandemics with the goal of global, equ…
http://asapdiscovery.org
#shef2023 GUI will be released later in the year.
#shef2023 Henriette Willems (ALBORADA Drug Discovery Institute) PI5P4K Subtype-selective Inhibitors: Three Binding Modes from One Privileged Motif
#shef2023 Have talked on this before, part 1, virtual screening. Now part 2, what worked well and what didn't. ALBORADA is one of 3 institutes funded by Alzheimers Research UK (Cambridge, Oxford, UCL). Each operates as a company with 30 people. In house, except for DMPK, ADME...
#shef2023 We are interested in enhancing protein clearance. Misfolded proteins are a problem, shows plaques. Autophagy is where misfolded proteins are engulfed and expelled. Inhibiting PI5P4K upregulates autophagy, getting rid of misfolded proteins.
#shef2023 PI5P4K lipid kinases. Not on the main kinase tree. It's separate - low homology with other kinases. Has 3 isoforms with high homology. alpha, beta and gamma.
#shef2023 No xtal structures with inhibitors in the PDB at that time. No potent ligands in ChEMBL at that time. That's life - just needed to forge ahead. We screened the BioAscent cmpd lib. 30K. We did some filtering, GOLD screen, then MOE pharmacophore screen. 6K.
#shef2023 Redocked with more expensive protocol, more filtering, and finally 960 cmpds selected by cluster and docking rank. 1% hit rate would be good, i.e. 10 cmpds. We found 9 hits for alpha, and 5 hits for gamma.
#shef2023 I looked back at the various methods in a retrospective analysis. If we had just used the docking, we would have just found half of them. But docking and then rescoring, would have found them all. Shows a picture of the hits. Published 2023 (Willems, RSC Med Chem).
#shef2023 Some singletons but also a few related cmpds. There was selectivity. Also published in Rooney JMC, 2023 (?).
#shef2023 Now SAR by catalog. For hit 1, bought 39, 13 active, best was 7.1 pIC50. Goes through the rest. Not as good. e.g. bought 60, 2 active. Another 0 actives. Another no relevant cmpds to purchase. Similar story for gamma, though best hit was 20 bought, 10 active, 7.4 pIC50.
#shef2023 So 14 hits down to 2 series very quickly. In parallel we were working on this literature cmpd, not very soluble, and couldn't dock it. Hard to know how it bound and how to improve it. HDX-MS suggested certain residues are involved in binding, but there's no pocket.
#shef2023 Ended up doing trad med chem on this. Describes profile of key cmpds. Clearance, solubility, efflux ratio, etc.
Now, we were in a position to get xtal structure. We were in for a surprise...
#shef2023 Cmpd 40 turned out to bind in this allosteric site. We had not previously seen this site. The loop moves up to the ATP site blocking it. Very nice binding pose. Three H bonds, very tight, looks good but... This pocket is not there in the apo structure.
#shef2023 There are quite a few residues that needed to move to fit the ligand in.
Cmpd 1607 also had surprises. The core has flipped compared to ATP - doesn't bind in same way.
#shef2023 Also got structures for the alpha cmpds. This activation loop moves to become part of the active site forming a new lipophilic pocket. There's a rearrangement of the water molecules. But there's another xtal structure published now, and it doesn't have the loop movement
#shef2023 Then if we compare the structures for our two hits, does one structure inform the other. Not really. That's why the SAR doesn't really match at all, and why we were puzzled. The structures aren't really in the same place due to water network.
#shef2023 When I look back at the docking I did, the use of hinge constraints resulted in incorrect binding mode for gamma ligand. However, constraint needed for docking to apo alpha structure. The pharmacophore probably helped here to pick out the interactions the mol makes.
#shef2023 Predicting the conformation of the SM can also be a problem. There were two possibilities for a particular torsion, QM calcs give around the same energy. Gives rise to strange SAR due to influencing E of particular torsion.
#shef2023 Summary: three diff series were identified. The best results from VS were obtained by docking, followed by a pharmacophore screen using the docked pose as input. Thanks to ALBORADA DDI and Peak Proteins.
#shef2023 Samuel Genheden (AZ) on AiZynthFinder: Developments and Learnings from Three Years of Industrial Application
#shef2023 Talking about their retrosynthesis tool. Here's all the people that contributed. [Ed: There's 25-30 faces.] You are given a molecule, and you want to know how to synthesise it. Break a bond, get precursors, and keep going until you can buy it. There's a big search tree.
#shef2023 3 years ago we published the 1st version. 2020 J Cheminf. A template-based retrosyn approach. We extracted general transformation rules for templates. Then we trained a one-step retrosyn method to rank templates. Apply top-ranked templates to the target to get precursor
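One-step template application of the kind described can be sketched with RDKit reaction SMARTS. The amide-disconnection template below is illustrative only, not one of AiZynthFinder's extracted templates:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# retro template: disconnect an amide into carboxylic acid + amine
retro = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[NH1:3][C:4]>>[C:1](=[O:2])[OH1].[NH2:3][C:4]")

target = Chem.MolFromSmiles("CC(=O)NCc1ccccc1")   # N-benzylacetamide
precursor_sets = retro.RunReactants((target,))

for mols in precursor_sets:
    for m in mols:
        Chem.SanitizeMol(m)
    print(".".join(Chem.MolToSmiles(m) for m in mols))
# acetic acid + benzylamine
```

In the full system, a model ranks thousands of such templates for the target, and each applicable one spawns precursor nodes in the search tree.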
#shef2023 Then we re-implemented the seminal work of Segler MCTS to balance exploration and exploitation to do the tree search. We were not satisfied with the performance at that time. We put it out there. It's easy to install, it's extendable, comes with models trained on patents
#shef2023 Most of the code we use internally is on GitHub - all the science is on GitHub.
Let's look at recent developments.
#shef2023 In 2020, we only supported a template-based 1 step retrosyn. We extended this via an expansion policy. We have a SMILES-based policy. Model Zoo library developed by PhD students. Can plug any 1 step retrosyn model in. We can directly look up precursors in our internal rxns
#shef2023 We don't need to predict how to make, just look it up. If we've made something in the past we can add it to the search tree. Chen 2020 has been converted to ONNX and can be used. We have also looked at template-free retrosyn. Treat it as a language problem.
#shef2023 We have pretrained a BERT model to learn the SMILES, and fine tuned for many tasks, e.g. retrosyn. Irwin 2022 Mach Learn Sci Technol. We can plug into AiZ and have compared it to current approach. Shows results comparing across rxn classes. Chemformer works well.
#shef2023 There's an issue with reproducibility in the field. We put out a package to help solve this AiZynthTrain, 2023, JCIM.
#shef2023 Now route finding. PaRoutes. We realised that there are no agreed methods to compare route predictions. A target can be synthesised in more than one way, but how similar are they? Genheden 2022, Digital Discov. This benchmark set contains two sets of ref routes.
#shef2023 Key is a top-n metric of finding the exact ref routes. We used this to compare diff search algos. E.g. Retro* (from A*), MCTS, DFPN. DFPN is the clear loser. MCTS and Retro* are v similar, but more diverse with MCTS.
#shef2023 We tried to benchmark single step vs multi step. The students took four datasets with diff diversity and size. Hassen 2022 arXiv. They trained, and did a route finding exercise. Our baseline template-based model is lagging behind more recent Chemformer.
#shef2023 Very sensitive to models trained on diff data. Search time important. Chemformer extremely slow compared to AZF.
#shef2023 We recently published on choosing the right hyperparams. Westerlund 2023 ChemRxiv. The tree depth, tot no of iterations, width. Global opt vs manual grid search vs ML prediction. Two factors we want to balance: median search time, solved targets.
#shef2023 The conclusion is that the hyperparameters have a big influence. The global optimisation favoured the mean search time too much. We think a manual scan is the best. We should find 10% more solutions.
#shef2023 Now large-scale expts for typical target sets. A target set of 70K AZ designed structures. In their design tool they automatically get a link to AZF. We train on ELN+Pistachio+Reaxys, and stock is AZ stock + Vendors. Also other sets with different train and stock.
#shef2023 For all targets except GDB we find synthetic routes for 71%. GDB is v challenging (11%). Longer routes in general for AZ and...
Acylation is the most common class. How much of the template space is being used? Only 10% of the 180K templates are used for AZ and Reinvent
#shef2023 Do we really need 180K templates? Or are we 'bad' at exploiting the templates?
#shef2023 Some challenges. Route scoring and comparison. We are using it in production. We don't know how big the impact of our changes is over time. What is the best ref set for benchmarking?
#shef2023 Challenge is to predict human-like routes. Currently used as an idea generator. Problems with routes that chemists have to deal with.
#shef2023 Aras Asaad (Oxford Drug Design) on Persistence Homological Statistical Summaries for Ligand-based Virtual Screening
#shef2023 Talking about a field of mathematics for representing molecules in a persistent and stable way. There are different ways to rep a 3D mol. Super-positional methods, e.g. ROCS, Align it, ESPsim, etc. Non super methods include USR-CAT, Whales, Electroshape, RGMolSA, ....
#shef2023 Today will talk about TDA (Topological Data Analysis). A rather new field of maths. Two branches: 1. persistent homology, 2. mapper algorithm (all about projecting data into diff dimensions while preserving homology - not talking about this...)
#shef2023 Algebraic topology (groups, rings, fields, ideals) to study topological spaces. All about the shape, not the geometry. How things are connected and their closeness. To study this we use a simplicial complex. The main tool is called homology (not *that* homology).
#shef2023 What is a simplicial complex (SC)? It's a process of building a graph from vertices (atoms). A 1-simplex is an edge, a 2-simplex is a triangle. Put this together to build a graph. How many holes/loops/cavities do I have - this is what it can tell.
#shef2023 How to build SCs from data? If the distance between any two points is less than threshold then connect. Consider a sequence of distance thresholds and analyse the pattern of change in the topology of the corresponding SCs.
#shef2023 This process is known as a filtration of the space. The information is stored as a barcode, a persistence barcode. Keep track of the holes as you increase the distance. At smaller d, it's telling you about the geometry. At larger d, it's telling you about the topology.
#shef2023 We have this for connected components, the loops and cavities also. It's stable. A small perturbation will only affect the small d. Can also plot of Death vs Birth of features: persistence diagram. Each bar in barcode is interpretable. We need to featurize for ML.
#shef2023 Shows example of ilfenprodil. Input can be 3D atom positions, or 4D with partial charges, or 5D with lipophilicity as well. Sometimes the topology is telling you useful information. Shows three dominant bars, but another bar at later state - tells you the foldedness.
#shef2023 How to featurize? Can do it as persistence image and use it in ML. Can do persistence landscapes or graph. How about just using statistical summaries: average of birth, death, lifespan, etc. Paper on arXiv 2022.
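For the simplest homology class (H0, connected components), the filtration-and-barcode idea plus the statistical summaries can be sketched without any TDA library: single-linkage merges over increasing distance give one finite bar per component death. This is an illustration of the concept only, not ODD's pipeline:

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Finite H0 bars of a Vietoris-Rips filtration: every point is born at
    d=0; a component dies at the edge length that merges it into another."""
    edges = sorted((math.dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    deaths = []
    for d, i, j in edges:                   # Kruskal-style single linkage
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)                # one bar [0, d) per merge
    return deaths                           # plus one infinite bar, omitted

def summary_features(deaths):
    # statistical summaries of the barcode, usable as ML features
    return {"mean": sum(deaths) / len(deaths), "max": max(deaths)}

# three nearby points and one far away: two short bars, one long bar
bars = h0_barcode([(0, 0), (1, 0), (0, 1), (10, 10)])
```

The long bar encodes the topology (two well-separated clusters); the short bars encode local geometry, matching the small-d / large-d distinction above.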
#shef2023 Used DUD-E dataset. Confs generated using an internal ODD pipeline. ML classifier is Light-GBM classifier. [other details...]
#shef2023 Shows targets that work v well and those that don't work v well. [Ed: Would be good to see comparison to baseline result, e.g. ECFP4 FP].
#shef2023 Comparison to ES5D - clear that our results are better. The method is stable regardless of what target is chosen. Conclusion is that the method encodes both global top features as well as geom features. Performance is SOTA tested on DUD-E, MUV and in-house antimicrobial
#shef2023 Benoit Baillif (Uni Cambridge) on Applying Atomistic Neural Networks to Bias Conformer Ensembles towards Bioactive-like Conformations
#shef2023 Try to find bioactive conformations. V important for pose id, VS. The CSD conf generator retrieves a bioactive conf for 90% of the molecules in the Platinum dataset, but which of these confs is most bioactive-like? "ARMSD"
#shef2023 Collected data from PDBbind. Filtered as needed to match to LigandExpo structures. Generated 250 confs with CSD conf generator. 1.8M confs.
#shef2023 Did random split, but also scaffold split. 80% train, 10% val, 10% test. Using an atomistic NN (AtNN) to predict the RMSD. Embed the atomic numbers, 3D encode the coords. There's a basis fn, interaction blocks, processed atom embeddings giving single atom values....
#shef2023 Sum pooling gives a single value for the ARMSD. Using three AtNNs with inc level of expressiveness. SchNet with bond length. DimeNet++: also has angle. ComENet: also has dihedral.
#shef2023 The first baseline would be random order: randomly shuffling the confs. CSD probability: prob of torsion angle values based on the CSD. MMF94s energy. The last baseline is a bioact baseline based on finding similar molecule and then doing an MCS.
#shef2023 TFD - torsion fingerprint deviation. Using BEDROC metric for early enrichment analysis. alpha = 20 (?). Generate confs, predict ARMSD, rank by ARMSD, and then compare.
#shef2023 The more expressive the NN the better it is able to find bioactive conformations. Similar results for the scaffold split, except that SchNet did not outperform the baseline.
#shef2023 AtNNs show higher enrichment for tested mols having a large MCS to the closest training molecule. If we have a large substructure in the training set, the activity based baseline works well.
#shef2023 Analysed the results based on the target. AtNNs early enrichment is consistent mostly on over-represented classes.
Showing example with 3IVH vs 3N4L. The conf energies are not very helpful, but the ComENet pred is quite good.
#shef2023 The method fails when closest training molecule shows additional atom branches.
#shef2023 Now the application to PDBbind re-docking. Here the idea is to reduce the no of confs you are using for VS. We retrieve already 50% (vs max of 70%) with only 1% of confs.
#shef2023 AtNNs can help accelerate bioact conf id. Test fewer confs to reduce comput time. The main limitation is for molecules with scaffolds unseen during training. Recommended is for H2L, LO, id new ligands within the known chemical space.
#shef2023 Roger Sayle (NextMove) on FNGRPRNTS: Processing Just the Bits you Need, and None of the 1s you Don’t
#shef2023 About chemical similarity and Tanimoto coeff. The von Neumann bottleneck. Previous work Swamidass and Baldi....
#shef2023 Tanimoto coefficient is intersection over union. Various set theory equations. |A| + |B| = |A U B| + |A intersect B|. If A and B are constant, Tanimoto (and others) can be calculated from just the intersect.
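The identity means that with |A| and |B| precomputed per molecule, each comparison only needs the intersection popcount. A minimal sketch using Python ints as arbitrary-width bitmaps (names are illustrative):

```python
def tanimoto(a_bits, b_bits, popcnt_a, popcnt_b):
    c = bin(a_bits & b_bits).count("1")      # |A intersect B|
    return c / (popcnt_a + popcnt_b - c)     # |A U B| = |A| + |B| - |A intersect B|

a, b = 0b1011, 0b0011
assert tanimoto(a, b, bin(a).count("1"), bin(b).count("1")) == 2 / 3
```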
#shef2023 How big is the problem? 1B molecules requires 252GB as 2048-bit fps, but 34GB as 256-bit. 1TB of mem can hold Enamine REAL 2023 as 1024-bit fps. ZINC22 (35B) does not even fit as 256-bit fps.
#shef2023 John von Neumann. CPU-RAM-DISK. Rate-limiting step is how fast you can get data between these. Compares different RAM generations, DDR3 thru DDR5. Max is 1.19B 256-bit FP/s. Often multiple channels. So modern procs can do 6B to 14B 256-bit FP/s.
#shef2023 Crash course in computer electronics. Each core can speak faster to the memory beside it, vs the memory further away. "numactl -H". So long as I read from the memory close to me, it's nice and fast. NUMA.
#shef2023 FPSim2 doesn't speed up if you increase the number of threads. But if you turn on a kernel setting about NUMA, it suddenly speeds up.
#shef2023 As you add more and more threads, you eventually hit a bottleneck. There's a theoretical limit at 80% or 90% of the max. It doesn't matter what we do, we have 60 or so CPUs idle. In a nutshell, "To go faster we need to look at less data" - John Mayfield.
#shef2023 Swamidass and Baldi Bit Bounds. JCIM 2007. Actually used by Daylight in Thor and Merlin prior also. Partition the data into buckets based on POPC. Can rule out needing to look in particular buckets depending on similarity cutoff. But depends on FP and db.
#shef2023 Pruning ability really isn't that great. If I do a db search and want to find the top X% hits, how much of the db do I need to search? For Enamine REAL, even to find the nearest 5 or 10 nbrs I need to look at almost 100% of the data. For ChEMBL, maybe you save 10%.
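The bound itself is simple: a fingerprint with popcount p can score at most min(p, q)/max(p, q) Tanimoto against a query with popcount q, so whole popcount buckets fall below a similarity threshold and can be skipped. A sketch (hypothetical function name):

```python
def buckets_to_search(query_popcount, bucket_popcounts, threshold):
    """Swamidass-Baldi popcount bound: keep only buckets whose best-case
    Tanimoto against the query can reach the threshold."""
    q = query_popcount
    return [p for p in bucket_popcounts
            if min(p, q) / max(p, q) >= threshold]

# query with 32 bits set at threshold 0.8: only buckets 26..40 can contain hits
kept = buckets_to_search(32, range(1, 129), 0.8)
```

At low thresholds (or for the nearest-neighbour searches mentioned above) the kept range widens towards the whole database, which is exactly why the pruning disappoints in practice.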
#shef2023 Describing Arthor. If most of the bits are zero, there is no need to look at those bits when calculating the intersection. Arthor 4.0 insight is to use an inverted index. Rather than have the bits for each mol we can have the mols for each bit.
#shef2023 Compression techniques are needed though and must balance space vs decoding speed. A query with half the bits set in theory does half the memory accesses. Can skip sections of the database. It's only the bits in common that we care about.
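The inverted-index idea can be sketched in a few lines: store the molecule ids per bit, and a query accumulates intersection counts by touching only the lists for its own on-bits. This shows the concept only, without the compression Arthor needs at scale:

```python
from collections import defaultdict

def build_index(fingerprints):
    """fingerprints: list of sets of on-bit positions, one set per molecule."""
    index = defaultdict(list)
    for mol_id, bits in enumerate(fingerprints):
        for bit in bits:
            index[bit].append(mol_id)
    return index

def intersections(index, query_bits):
    """Accumulate |A intersect B| per molecule, visiting only query on-bits."""
    counts = defaultdict(int)
    for bit in query_bits:
        for mol_id in index.get(bit, ()):
            counts[mol_id] += 1
    return counts

fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
idx = build_index(fps)
```

Molecules sharing no bits with the query are never touched at all, which is the "only the bits in common" point above.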
#shef2023 Intersect/union are efficiently implemented on bitmaps. Shows how unrolled loops are faster for totting things up.
#shef2023 Bit twiddling. More bit twiddling.
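The kind of bit twiddling meant here is, for example, the classic 64-bit SWAR popcount, which counts set bits with a handful of arithmetic ops instead of a per-bit loop (shown in Python for clarity; in C this compiles to a few instructions or a single POPCNT):

```python
def popcount64(x):
    """Count set bits in a 64-bit word via parallel (SWAR) partial sums."""
    x = x - ((x >> 1) & 0x5555555555555555)                         # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)  # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                         # 8-bit sums
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56    # total in top byte

assert popcount64(0xFFFF) == 16
```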
#shef2023 Is there a density or POPC where this becomes less efficient? V1 of our software maxes out at 0.6 B/s. Shows that the latest version works much better. Before 10 B/s. Now 30-50 B/s. Now scaling effects.
#shef2023 Back to Baldi bounds. Partial sum bounds. As I am adding things up, I can do early termination. Process from the rarest things first to the most common bit.
#shef2023 I am delaying the inevitable for another year or two. Dbs are getting too big. GPUs not cost effective for this sort of thing.