My notes from the 11th International Conference on Chemical Structures 2018
#11thICCS Hitesh Patel presents SAVI - Synthetically accessible virtual inventory
Q: What can I make easily, reliably, safely and cheaply? Make a db of 1 billion mols where you know that this is the case. 1-step rxns. Freely available db.
Building blocks (Sigma) + transforms (LHASA) + cheminformatics engine (Xemistry) = SAVI.
Use some of the transforms used in LHASA (written in CHMTRN/PATRAN) but additional ones also.
Related: Enamine REAL db; CHIPMUNK db (TU Dortmund), ChemPass, ChemAxon Reactor, Proximal Lilly, Pfizer GVL, BI CLAIM
"CHMTRN/PATRAN transforms are smarter than SMARTS" - uses scores - if this, then dec score by 10, if that, then inc score; if the other, then kill the reaction
I think this is from the Corey group. Not sure of the details.
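(ed: a minimal sketch of the scoring idea, in Python rather than CHMTRN/PATRAN syntax. The rule names, the base score of 50 and the +/-10 adjustments are hypothetical, chosen only to mirror the "dec score, inc score, kill" description above.)

```python
from typing import Callable, List, Optional

# Each rule inspects a reactant description and returns a score delta,
# or None to kill the reaction outright. All rules here are hypothetical.
Rule = Callable[[dict], Optional[int]]

def hindered(reactant: dict) -> Optional[int]:
    # Assumed rule: penalise steric hindrance near the reacting centre.
    return -10 if reactant.get("hindered") else 0

def activated(reactant: dict) -> Optional[int]:
    # Assumed rule: reward an activating group.
    return 10 if reactant.get("activated") else 0

def incompatible_group(reactant: dict) -> Optional[int]:
    # Assumed rule: an incompatible functional group kills the reaction.
    return None if reactant.get("incompatible") else 0

RULES: List[Rule] = [hindered, activated, incompatible_group]

def score_transform(reactant: dict, rules: List[Rule], base: int = 50) -> Optional[int]:
    """Return the final score, or None if any rule kills the reaction."""
    score = base
    for rule in rules:
        delta = rule(reactant)
        if delta is None:
            return None
        score += delta
    return score
```

Unlike a bare SMARTS match, which is yes/no, this kind of rule set yields a graded score plus a veto.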
Original run (2016) resulted in 283M unique products from 14 transforms.
Procedure generated 999 novel simple aromatic (SA) rings. (As defined by Peter Ertl)
Next version of SAVI: ~600K building blocks, ~50 transforms covering ~20 chemistries. Example of new one: Suzuki-Miyaura coupling. More than 1 billion possible reactant pairs. One problem is that particular transforms lead to more products.
One of the challenges is how to handle possible mixtures of products. Also everyone stores reactions in different ways - no standardised form (ed: reaction SMILES?)
More and better annotations are coming such as indicating the reacting group. Also to come: Predict reaction conditions with machine learning. Additional reaction steps. "Grow" molecule in a particular direction for binding sites.
Transform rules are manually inspected by synthetic chemists in CADD group. Iterative process.
#11thICCS Peter Pogany - Fast molecular searching tools and their extension at GSK
Search tools at GSK: Uses MadFast from Chemaxon; SmallWorld from NextMove; FFSS from GSK; Fraggle from GSK
Reduced graphs represented as SMILES where particular features are represented using unusual elements, e.g. [Sc] for aromatic.
The reduced graph fp is useful for pharmacophoric search. Attachment points are largely ignored, for example.
Describes use of edit distance to measure chemical similarity, with SmallWorld from @nmsoftware. Then Fraggle search via fragmentation and then Tversky search - also requires postprocessing.
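(ed: for reference, a sketch of the Tversky index on fingerprints held as sets of "on" bits. The alpha/beta weights that Fraggle actually uses were not given in the talk; the values in the usage note are illustrative only.)

```python
def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky index: |A&B| / (|A&B| + alpha*|A-B| + beta*|B-A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0
```

With alpha = beta = 1 this reduces to Tanimoto. With alpha = 1, beta = 0 it measures how much of the query `a` is contained in `b`, which is why it suits fragment-based search: `tversky({1, 2, 3}, {1, 2, 3, 4, 5}, 1, 0)` is 1.0, while the Tanimoto value for the same pair is 0.6.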
Why performance matters. We cluster the whole screening collection (>2M) every weekend. Highly parallelisable. Similarity and clustering are the rate-limiting step in cmpd acquisition. Real-time search is necessary.
Dbs are getting bigger quickly. ZINC and Enamine REAL. Need to know if dev cmpds are in these databases as can't patent them if so.
MadFast stores fps in memory and so is faster. Do an all against all similarity calc every weekend for sphere exclusion clustering.
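(ed: a minimal sketch of sphere exclusion clustering, assuming the simple sequential variant and fingerprints held as sets of "on" bits. The 0.7 threshold is a placeholder, not the value used at GSK, and this is not the MadFast API.)

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sphere_exclusion(fps, threshold=0.7):
    """Return clusters as lists of indices; the first index is the sphere centre."""
    unassigned = list(range(len(fps)))
    clusters = []
    while unassigned:
        centre = unassigned[0]
        # Everything still unassigned and within the sphere joins the centre.
        members = [centre] + [
            i for i in unassigned[1:]
            if tanimoto(fps[centre], fps[i]) >= threshold
        ]
        clusters.append(members)
        assigned = set(members)
        unassigned = [i for i in unassigned if i not in assigned]
    return clusters
```

Each pass compares one centre against all remaining compounds, so the comparisons are independent and easy to run in parallel, which fits the "highly parallelisable" remark.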
Overlap calculations use InChIKeys. This caused problems due to hash collisions, so they are moving to SMILES (?)
Get data from SureChEMBL. Data available within days of the patent becoming available.
Q: DB sizes are not increasing linearly, but faster than that. What will you do? A: We're okay for now (up to 3/4 billion) but might be a problem later.
#11thICCS Chad Allen on the Analysis of ToxCast & Tox21 cmpd set using GHS toxicity annotations and in-silico derived protein-target descriptors
Motivation: the need for tox data outstrips the output of traditional toxicology. In-silico methods can help.
Including heterogeneous data can improve performance of tox models. Wanted to repeat approach of Alex Tropsha on a larger dataset.
Problem: generating the dataset with sufficient overlap of data domains.
Introducing GHS pictograms, which are derived from quantitative data. Categories are international standards. Rich source of data for tox annotations. Several public regulatory GHS data sources, e.g. ECHA.
Dataset of 3,336 cmpds. For each we had qHTS data, (plus 3 other things that I missed)
CAS numbers are required for looking up GHS data.
Used in-house target prediction tool: PIDGIN. Available on GitHub.
A lot of details on how the dataset was put together and cleaned up. In the end, was binned into toxic/non-toxic.
Shows nice plot that GHS classifications correlate with one another across sources. But different administration routes have low correlation, so focussed on oral toxicities after this point.
Compared to ToxAlerts. Number of alerts did not correlate with GHS prediction. But presence/absence of reactive/unstable/toxic ToxAlerts had modest correlation. Conclusion is that ToxAlerts probably shouldn't be used as a filter.
Classes appeared to have different intra and interclass similarity, and may be separable in chemical space. Similarly when distinguished in terms of protein target. Used LDA (linear discrim analysis) to try to separate.
Models are about as good as existing chemical descriptor based models, so why use? More interpretable. This is the subject of my current work.
#11thICCS Greg Landrum on How do you build and validate 1500 models and what can you learn from them?
Really... "the Monster Model Factory". Have >1500 datasets from ChEMBL that I want to build models for. Needs to be automated. Ideally we can learn something about what makes a model work vs not work.
CRISP-DM - standard process for data mining solutions - see wikipedia
Key steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy. With KNIME, create a workflow that does each step. A meta workflow.
Extracting the data. Filtering ChEMBL. Need at least 50 activities. <100 nM. Finally 2.5 million data points and 1.5 million cmpds. Sidebar: Data is biased towards active compounds. Ratio of act:inacts not realistic. To fix, add assumed inactives.
It's a Knime workflow, so it could be cronjobbed.
Clean-up the chemical structures with #rdkit - only allow standard organic subset. Generate fps.
H2O library for gradient boosting. RF, Naive Bayes. 10 different stratified random partitions. Take the best of these models based on EF at 5% (EF5). Model params came from a full param opt on 70 assays. Used to pick a standard set.
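(ed: a sketch of the enrichment factor at 5%, assuming higher score means "more likely active" and labels of 1 for active, 0 for inactive. The function name and the rounding of the top fraction are my choices, not taken from the talk's workflow.)

```python
import math

def enrichment_factor(scores, labels, fraction=0.05):
    """EF = (hit rate in the top fraction) / (hit rate in the whole set)."""
    n = len(scores)
    n_top = max(1, math.ceil(n * fraction))
    # Rank compounds by predicted score, best first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        return 0.0
    return (hits_top / n_top) / (total_hits / n)
```

An EF5 of 1 is what random ranking gives on average; 0 means no actives at all in the top 5%.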
Surprised to find a fairly low number of trees (100) for gradient boosting. (All slides will be on slideshare right after the talk)
Execution: build/test workflows on laptops. The server is on AWS. The actual running took place on AWS with distributed executors - a new/coming feature of KNIME.
Performance. mean AUC is 0.958 and s.d. is 0.070. Cohen's kappa not quite so good. Looks too good to be true. Literally.
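(ed: for reference, Cohen's kappa corrects the observed agreement for the agreement expected by chance. A minimal sketch; returning 0.0 for the degenerate case where chance agreement is 1 is my convention.)

```python
def cohens_kappa(y_true, y_pred):
    """kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal class frequencies.
    p_exp = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    if p_exp == 1:
        return 0.0
    return (p_obs - p_exp) / (1 - p_exp)
```

On an imbalanced set, a model that predicts "inactive" for everything can reach 95% accuracy with 95 inactives and 5 actives, yet its kappa is 0.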
To validate or check model generalizability, use the model built on one assay from a target ID to predict act across the other assays. Like prospective evaluation, or as close as we can do it.
Now can see that AUCs for some target are close to random but others pretty good. Similarly for EF5 - some worse than random, but others very good.
What's happening? Shows 5-HT6 example. EF5 of 0. Model compounds very different from test. Have overfitted the training data. Or have built a model to predict whether or not a cmpd is taken from a particular paper. Need to consider this and be careful not to fool yourself.
Which fps were picked? (ed: nice ring diagram) Which method/fp pair is best for each assay? Random forest doesn't appear at all (!).
Still work in progress in drawing conclusions.
RDKit UGM (Cambridge) and Knime Fall Summit (Austin, Texas) coming up.
#11thICCS Sereina Riniker on Machine learning of partial charges from QM calcs and the applic in fixed-charge force fields and cheminf
A classical fixed-charge force field has parameters for bonded atoms and non-bonded. The non-bonded are most important for interactions. Review by me in JCIM 2018, 58, 565. Bonded are from crystallography. Charges come from QM, fitted to liquid properties.
QM-derived partial charges. Extraction from electron density is an underdetermined maths problem. Most try to fit to ESP with Kollman-Singh, semi-empirical with bond order corrections, e.g. AM1-BCC. Issues: low quality QM (to decrease cost), conformational dependence.
Goal a ML model to predict partial charges. Building on work on internal data from Pfizer (JCC, 2013, 34, 1661). Used 2d descriptors - no conformational dependence.
Create dataset. Select cmpds such that all substructures are present. About 130K from ZINC/ChEMBL. Focussed on organic subset, except for B. Used atom-centered atom pairs with maxlength of 2 (#rdkit). Molecules prioritized with rare substructures. All >= 4 times present.
Benchmarking versus coupled-cluster (CCSD) calcs. DDEC (density-derived electrostatic and chemical method) has low conform dependence and good benchmark results. This is a charge extraction method published in 2016.
A little error on each atom, so total doesn't add up. Spread out the excess charge but avoid making the good predictions worse by putting the excess charge onto the atom where the random forest prediction has a large prediction range (uncertainty).
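(ed: a sketch of the charge correction, assuming the excess is shared out in proportion to each atom's prediction range. The talk only said the excess goes where the range is large; the proportional weighting and the even-split fallback are my assumptions.)

```python
def redistribute_charges(predicted, spread, total_charge=0.0):
    """Shift predicted atomic charges so they sum to total_charge.

    predicted: per-atom charges from the model
    spread:    per-atom prediction range (uncertainty), same length
    """
    excess = sum(predicted) - total_charge
    total_spread = sum(spread)
    if total_spread == 0:
        # No uncertainty information: fall back to an even split.
        weights = [1 / len(predicted)] * len(predicted)
    else:
        weights = [s / total_spread for s in spread]
    # Atoms with a confident prediction (small spread) barely move.
    return [q - excess * w for q, w in zip(predicted, weights)]
```

For example, charges of [-0.50, 0.30, 0.26] with ranges [0.0, 0.02, 0.04] have an excess of 0.06; the first atom is left untouched and the last one absorbs two thirds of the correction.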
Prediction accuracy. Worse for P, less data and large potential range of charges. Iodine less data but smaller range of charges so not so bad.
Works well even on small molecules, e.g. acetone, or larger ones, e.g. FDA approved drugs. Speed: not as fast as Gasteiger, but much better than semi-empirical.
Compared to reference for heat of solvation.
What can we use for in cheminf? Any ideas? Implemented Labute molecular descriptors based on these. Still preliminary. Results about the same as Gasteiger-based ones, but looking for additional applications.
Prediction accuracy easy to extend by adding cmpds. Future work: vdW atom types should also be adjusted.
Dataset availability, see JCIM 2018, 58, 579 for the URL. Includes Python scripts to train and try it out.
Peter Ertl suggests looking at reactivity.
#11thICCS Prakash Chandra Rathi on AI for predicting molecular ESPs
ESPs v. useful for optimising lead cmpds. Shows example with electrostatic clash. Changed structure to pull electrons away from pi cloud. Much better binding.
At Astex, the PLIff scoring function is used a lot in VS. Knowledge-based using info from PDB. Voronoi partitioning to calculate solvent accessible areas, contact areas, contact geometries.
V. important how you type your atoms, because you calculate the propensity of an atom of one type to interact with another type.
Atom typing in PLIff is unaware of ESP around atoms. Wanted to fix/change this. Incorporating this info should improve results, hopefully.
Can approximate molecular ESP surfaces by extrema. Assign atomic features that represent ESP extrema, similar to Cresset field points. Assigned lone pairs, sigma hole.
~57K molecules from eMolecules as training set. Validation set: ~5K diverse mols from ChEMBL (having reasonable activity). QM calcs, 50 days on 60 CPUs, B3LYP.
Simple lookup model is not appropriate. Shows analysis of variability depending on local environment. learning (ed: 1st mention today)
(explaining DNNs and graph convolutional neural networks) The more convolutional layers, the further away atoms can influence each other. Used 6 hidden layers: 3 fully connected (FC), 3 graph convolutional (GC), and a final FC for output.
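(ed: a sketch of why more graph-convolution layers let more distant atoms matter, assuming each layer only aggregates information from directly bonded neighbours. This models the reach of the layers, not the network itself.)

```python
def receptive_field(adjacency, atom, n_layers):
    """Atoms whose features can reach `atom` after n_layers neighbour aggregations."""
    seen = {atom}
    frontier = {atom}
    for _ in range(n_layers):
        # One layer extends the reach by exactly one bond.
        frontier = {nbr for a in frontier for nbr in adjacency[a]} - seen
        seen |= frontier
    return seen
```

On a six-atom chain, atom 0 sees only atoms 0 and 1 after one layer; under this assumption the 3 GC layers mentioned above would reach atoms up to three bonds away.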
Performance on the training set. R2 is 0.96 for calculated versus predicted ESP extrema. Performance on unseen validation set: R2 is 0.88. Quite happy with this.
Work in progress: improving PLIff. Incorporate feature-feature contacts.
#11thICCS Jochen Sieg talking about bias control in structure-based virtual screening with machine learning
When you build a model in SBVS, is the predictor really generalizing? High correlation does not imply causation. We want to distinguish between patterns in the data we want to learn (causal patterns) and those we don't want to learn (non-causal).
A non-causal example would be one based entirely on the molecular weight.
Datasets: DUD, DUD-E, MUV. What they have in common is the attempt to unbias versus certain features, e.g. MW, LogP.
The original DUD missed out on unbiasing net charges. We've used our SMARTS Miner on DUD, and it's possible to find discriminative patterns. Bias depends on the combination of dataset, method, descriptor and other methods.
(ed: I missed something, what are the 5 unbiased features?)
DeepVS is a literature example of a docking-based convolutional neural network. Validated with DUD. Reported results are almost as good with and without protein information. We suspected a non-causal bias.
We believe that DeepVS learns the 2d dissimilarity in the DUD dataset.
Conclusion is that a particular dataset's unbiasing technique may not work with different descriptors and methods. (ed: I think he's saying that this is not a problem with DUD, but with how people use DUD for things other than it was unbiased for)
(ed: personally I have always regarded these unbiased datasets as maximally biased, since they choose actives and inactives in a different way - I think that's the underlying cause of the results shown here)
How to build better datasets? Avoid non-causal data patterns, match simple properties, or use uniform sampling in simple descriptor space. Use baseline experiments to find and remove problems.
#11thICCS Oliver Koch on an exhaustive assessment of computer-based drug discovery methods by HTS data
We were challenged to test our methods against HTS results. Not a simple benchmarking exercise, but a real life task.
UNC 119A was the target. No inhibitor was known at the start. An unexplored protein target. Was able to do unbiased assessment - no prior knowledge.
Several VS workflows. Created models of the protein. Searches with Ligandscout or MOE followed by docking with GOLD. Also used pharmacophores instead of protein info as well as ligand-based methods.
The HTS campaign was against 143K cmpds. Hitrate of 1.6% (>75% inhibition). Quite high hit-rate. Probably because of huge binding site - everything can go in.
Using DrugScore PPI for hotspot analysis, and also used MD to assess protein flexibility. We combined protein features into a pharmacophore model in an automated fashion.
Automated pharmacophore generation didn't work well - model too big - hard to hit everything, but if features were optional too many things were hit. Rational generation (researcher bias) of models worked better, which focused on different binding site features.
We found different molecules when focusing on different binding site features.
Used Superstar from the @ccdc_cambridge . Knowledge-based pharmacophores based on non-bonded interactions from the CSD. Several probes used, e.g. N+. Good hit rates.
Using the X-ray model versus model (for the protein) had only minor changes in the feature placements, but gave different molecules (some overlap).
Now showing overall results comparing pure docking versus combination approaches. Ensemble docking with GOLD. Docking sometimes improves results over pharmacophore on its own, but not always. May depend on quality of the pharmacophore.
Once you have results, you can use initial results to improve the pharmacophore models in an iterative process. (Shows graph of how results improved over time)
Unfortunately the most active hit was not found by any of our methods. However, the hit rate for those results that were found was good.
Caveats: Only a single target tested. A huge pocket - difficult. Using proprietary screening database that is biased. No known true actives.
Take-home. Huge influence of the researcher actually sitting in front of the computer. Docking or pharmacophore works well, but hard to say which is best.
#11thICCS Willem Jespers on behalf of EB Lenselink on Lessons learned in benchmarking VS for polypharmacology
VS has been quite successful - hit rates often exceed 10%. DB sizes getting quite big. ZINC 800M, Enamine REAL > 300M.
On average drugs hit six protein targets in the body, but are designed to hit just one. In oncology they try to hit multiple targets within one disease type to avoid resistance. 3.4% success rate of drugs in oncology.
Rational polypharmacology: design/find mols for targets with known synergy. There is an increasing availability of data (genomics, bioacts, 3d structs) that can help to find these synergies.
"DREAM challenge" - dialog for reverse eng assessments and methods. Not for profit, open source, translational med. Fun to compete in an open challenge but also useful. Spent 3/4 on challenge using our in-house workflows.
DREAM challenge was VS for polypharm. A whole bunch of FDA approved drugs for anticancer kinase inhibs. The organisers gave 3 targets and 4 antitargets. Find inhibitors.
Started with ZINC15 db "in stock". 5 compounds to be chosen for the challenge, and these would be tested. Shows workflow of filters involving docking with Glide, etc, etc.
Janssen biosignatures. Github project_BBDD. JK Wegner et al submitted. Logistic models trained on different features. Data fusion of info from different sources, e.g. exptal endpoints.
Proteochemometrics. Use available data (e.g. ChEMBL, ExCAPE-DB, Eidogen). Combine cmpd descriptors and protein descriptors. RF model. Filtered data on various properties, e.g. MW, activity. 643K datapoints on 371 kinases.
PCM random search (sklearn) to improve RF model by reducing the number of descriptors (?) to avoid overfitting. Shows default versus random search optimised RF model. Clear but not major improvement across the board.
Structure-base docking. For all actives, generating decoys with DUD-E. Docked on available xtal structures. Glide (sp) -docking, XP-redocked the top 10% but did not improve the enrichment. Used combined BEDROC and ... ROC.
Interaction fps. Deng, CREDO, Elements, SYBYL, SPLIF, Glide, 2D, SPLIF+. We found SPLIF to work very well for separating actives from inactives.
GM Sastry JCIM 2013, "Z2" ensemble (?). This worked best when docking.
Metadynamics. Take ligand, apply RMSD bias to force it out, 5 replicas of 10ns with MD. Persistence score - how many interactions are reproduced. Also a pose score.
Compared docking versus docking+metadynamics. (ed:missed results - I think it improved them)
Shows video indicating that one of the nice docking results starts to move out of the active site when MD is applied. So, not so good actually.
Metadynamics can improve affinity ranking. Further work needed though.
How to combine all these models. Confidence weighting of models. Which do they trust the most. Day of visual inspection by the team. Lots of shouting. Best part of the project. Selection of top 5. Results will be announced in Oct. About 30 groups involved.
Lessons learned? Participation in challenge very useful, time constraint good, data is everything, random search can improve hyperparameters, tailor VS protocol per target, metadynamics can (partly) rescue docking ranking. We are developing and using these methods in Leiden today.
"Bart would like to thank me. I'd like to thank him."
#11thICCS Ruth Brenk on selectivity determining features in proteins with conserved binding sites
How to rationally design selective inhibitors: steric clash, electrostatic interactions, allostery, flexibility, hydration (Huggins et al 2012). The more conserved it is, the harder it'll likely be.
N-myristoyltransferase (NMT). Co-and post-translational mods of proteins for membrane targeting. Target for cancer and African sleeping sickness.
Describes work from Frearson et al Nature 464, 7289, 728, 2010. Not selective at first, and very difficult to work in selectivity. Didn't understand why.
Selectivity of HsNMT1 (human) versus LmNMT (Leishmania - sleeping sickness)
Highly conserved binding sites though only 34% sequence identity. Shows 3d alignment of binding sites. Very similar if not identical. Why are some cmpds selective and others not?
In NMT the C terminus folds back into the binding site, slightly unusual.
First step. Get the xtal structures of cmpds in binding sites. Only 3 residues different between the two binding sites, but didn't seem to be relevant for binding. We did crossover experiments...swapping the binding sites (? I missed how).
After swapping the binding sites, the LmNMT-3x (triple mutant) was inactive, but HsNMT1-3x was active and could be measured. It had the same level of inhibition as the original Lm. So the three residues were important, and specifically it was a single residue that was the origin.
Measured binding energetics. Binding energy, enthalpy and entropy. For the non-selective, the two binding sites have the same energetics on binding. For selective, there are more favourable entropic contributions for the modified binding site. Why?
We couldn't figure out why from looking at the binding site. To investigate we went to MD simulations. Determine the order parameters (S2) for bond vectors - param of 0 is fully flexible and 1 is rigid.
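(ed: a sketch of one common way to estimate S2 from an MD trajectory, assuming the frames have already been aligned to remove overall tumbling and that each bond vector is normalised to unit length. Which estimator the speaker used was not stated.)

```python
def order_parameter(vectors):
    """S2 from unit bond vectors (x, y, z), one per trajectory frame.

    S2 = 3/2 * (<xx>^2 + <yy>^2 + <zz>^2 + 2<xy>^2 + 2<xz>^2 + 2<yz>^2) - 1/2
    """
    n = len(vectors)

    def avg(i, j):
        # Time average of the product of components i and j.
        return sum(v[i] * v[j] for v in vectors) / n

    diag = avg(0, 0) ** 2 + avg(1, 1) ** 2 + avg(2, 2) ** 2
    off = avg(0, 1) ** 2 + avg(0, 2) ** 2 + avg(1, 2) ** 2
    return 1.5 * (diag + 2 * off) - 0.5
```

A bond that never moves gives S2 = 1; a bond that points equally often along +/-x, +/-y and +/-z gives S2 = 0, matching the rigid and fully flexible limits above.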
The flexibility does not change much for LmNMT no matter what's bound. Quite different for HsNMT1. Big change for the selective compound. Can rationalise the change in entropy due to the selective compound binding.
Looking again at the binding site, we can hypothesise why that particular residue changes its flexibility. Could be because the selective compound is bulky and thus prevents the Gln moving so much.
Now looking at another compound with another binding mode that is also selective.
Binding affinity did not change when swapping the binding site residues. But we can see a water molecule in the Human structures but not the Leish structures, that must be expelled when binding.
Decided to change 8 residues around the water molecule in the human NMT and replace with those from Leish, to investigate this. What happens? This abolishes the selectivity. The Ki for the non-selective cmpd is unchanged but reduces Ki for the selective.
This indicates that this environment is important for the selectivity. To investigate further, we co-crystallised. 4 molecules in the asymmetric unit. We saw the water molecule in 3 of the units, but not the 4th. So it looks like the binding of the water is reduced.
Now could design selective inhibitors, based on this information. Activity not as good, but can be optimised - they are selective though.
#11thICCS Steven Oatley on Active search for computer-aided drug design
The disease we are working on is Idiopathic pulmonary fibrosis (IPF) - progressive lung disease. Scar tissue in the lungs. Shortness of breath, dry cough. Often fatal within 2-5 years. Worse for smokers. Genetic factor.
Starts via micro injuries. Death of epithelial cells, and then abnormal healing process leading to a cycle of repetitive lung remodelling.
Integrins. Large transmembrane signaling proteins. Responsible for regulation of cell cycle and immune responses (e.g. scar formation). Two strands, alpha and beta. av (alpha v) is the alpha subunit that is of interest. Also b3,6 (beta 3,6) of interest here.
Crystal structures available in combination with TGFbeta1 (ed: I think?). To prevent the natural ligand from binding, we use a RGD mimetic. Naphthyridine fragment as arginine mimic.
185K cmpds to test. Active learning using an adaptive Markov chain based approach. We start by generating 10 cmpds and send them to a docking program (OpenEye FRED) for pass/fail. Chain mixing via a random process. Prob(new is hit) / Prob(test is hit). (ed: missing details)
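(ed: I missed the details, so this is only a guess at how such a ratio is typically used: a Metropolis-style acceptance step. It assumes "test" means the compound currently in the chain and that the move is accepted with probability min(1, ratio). The function and parameter names are hypothetical.)

```python
import random

def accept(p_new, p_current, rng):
    """Accept the proposed compound with probability min(1, p_new / p_current)."""
    if p_current <= 0:
        # Nothing to lose: always move away from a zero-probability state.
        return True
    ratio = p_new / p_current
    return rng.random() < min(1.0, ratio)
```

A proposal that looks more likely to be a hit is always accepted; a worse one is still accepted some of the time, which keeps the chain exploring. For example, `accept(0.1, 0.4, random.Random(0))` succeeds about one time in four over many draws.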
Works thru an example.
Start with parent cmpd. Sample an available attachment pt. Sample a random valid fragment for this attachment pt.
Molecule is treated as a graph. Then turned into MOL file, pH corrected, conformers generated, (RMSD filtering, and energy cutoff).
Describes docking details. Required to have metal chelate bond, and a particular H bond. Scored with chemgauss4.
In less than 10 rounds, the chances of predicting a hit were much better than random. Overall results much better.
Comparison with previous work. Our predicted most active are substituted similarly to previous work. Also we find many of the same molecules.
Looking at improved procedure. Better docking - tried lots of different combinations of parameters. Problem we only have 30 known compounds as the benchmark. Reports r2 values (ed: not sure what this value represents?)
More thorough conformational sampling does not improve results, but is much slower. Maybe because exhaustive sampling just jams everything in something.
Mol Inform 2018, 37, 1
Rarey asks about the different meanings of active learning. Rather than optimising an objective function, people optimise the information gain of the experiment.
#11thICCS Paul Hawkins on conformational sampling macrocycles in solution and in the solid state
The iron triangle in conformation generation: want fast, good and cheap; i.e. fast, accurate and small ensemble.
MD vs torsion sampling. Slow versus fast. Distance geometry: fast, don't require 3d, stochastic, implicit solvent.
Atoms placed randomly in space, and then minimisation of a distance constraint function. MMFF94 afterwards. Energy cutoff, and RMSD de-duplication to create an ensemble.
Validation against the solid-state. Three sources: CSD (train), PDB (test) and BIRD (validate).
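(ed: a sketch of the last two steps, the energy cutoff and RMSD de-duplication. The 10.0 energy window and 0.5 RMSD threshold are placeholders, and the RMSD here skips the superposition a real tool would do first. Not the OMEGA implementation.)

```python
import math

def rmsd(a, b):
    """RMSD between two conformers as equal-length lists of (x, y, z); no realignment."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def build_ensemble(conformers, energy_window=10.0, rmsd_cutoff=0.5):
    """conformers: list of (energy, coords). Returns kept conformers, lowest energy first."""
    if not conformers:
        return []
    ranked = sorted(conformers, key=lambda c: c[0])
    e_min = ranked[0][0]
    kept = []
    for energy, coords in ranked:
        if energy - e_min > energy_window:
            break  # everything after this is even higher in energy
        # Keep only conformers that differ from all those already kept.
        if all(rmsd(coords, k[1]) > rmsd_cutoff for k in kept):
            kept.append((energy, coords))
    return kept
```

Processing in energy order means that, of two near-identical conformers, the lower-energy one is the one that survives.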
Compare simple stats of the three dbs. Ring size distrib and molecular complexity.
How to assess? Whole molecule RMSD? But side-chain support is often low. Ring-only? Too something. In the end, use ring+beta atom (ed: seems to be attached double bonds, plus...?).
Solvent modelling via Poisson-Boltzmann, and Sheffield model. (ed: Didn't quite get how this is incorporated.) Accuracy seems to be the same for all, so try to discriminate via speed or ensemble size. Now looking at quality - Sheffield model is better than in vacuo model.
Now comparing features of a whole bunch of programs. Macromodel, MD, MOE, Prime, OMEGA. Only MD can be discriminated...
Can identify problem structures for OMEGA: sidechain to backbone interactions cause trouble. Three outliers in performance. Direction for future work.
Compared to other approaches it's about 10 times faster than the next fastest.
Moving onto discussing solution structure. Example: Lokey peptide. Solid-state easy to reproduce. In soln, intra-molecular H bonds driven by polarity. Shows interactive display for torsion sampling compared to a reference.
It's possible to incorporate NMR constraints directly in DG approach. Should improve results. Works much better - focussed sampling in torsion space.
#11thICCS Moira Michelle Rachman on Automated frag evolution (FrEvolAted) applied to frags bound to NUDT21
Fragment chemical space is a lot smaller, so a larger part of chemical space can be explored more efficiently. Fragments are smaller and so more likely to bind: higher hit rates, more chemically tractable.
Some strategies in FBDD. Fragment linking, merging and growing - these are diff approaches. Growing is the most popular. Shows example of this from Astex (Murray ACS MedChemLett 2015, 6, 798)
Try to automate frag evolution while taking into a/c syn accessibility. Will alleviate the cost of doing fragment growth.
Start with ZINC "in stock", 15M. Would like to move to 200M and billions of cmpds. Start with sim search between ref and library for ligands of similar size.
Uses MACCS and Morgan for sim search. Then LigPrep for those selected. Then MCS superposition between fragments and their similarity hits. Tether docking (rDOCK) keeping the MCS in place. Some attrition at this point - bad scoring in docking.
Final step is MMGBSA minimisation to correct docking scores. Finally dynamic undocking (DUck), to see if the interaction we want is present and how strong it is. Now we have children cmpds which are used in the next iteration.
We select the children with the most parents and also consider the parents' scores. One day per iteration depending on how GPUs are available.
We keep growing 2 heavy atoms at a time. My job was to make the platform more robust across diff targets. For example, for XChem, a collab with Anthony Bradley and Frank von Delft.
NUDT21 is a cleavage factor involved in pre-mRNA processing. Only Nudix protein that binds a signaling molecule and RNA thru seq-specific recog. Suggests potential for a role of small molecules in reg of mRNA processing.
Approach 1: virtual screen of 1100 cmpds based on synthetic feasibility, followed by visual inspection, 19 molecules selected for synthesis/testing. Also approach 2 based on purchasability. Approach 3: the fragment evolution approach from ZINC in-stock.
Shows visualization of the relationship between one parent and all of its children, highlight the MCSes.
When superimposing them, there is attrition due to conformation strain or RMSD is poor. We see that very early on we are removing "bad children" so that they don't become "bad parents" in the next generation.
Docking should not discriminate between families. In every iteration, scores are increasing. In every iteration we are choosing children that score better than their parents.
MMGBSA scores are increasing over time.
Protocol is able to scaffold hop. Efficient at removing ligands. More novelty compared to approach 1.
Areas of improvement. We sacrifice ligands for the sake of time. We converge towards certain areas of chemical space, leading to fewer hits for other spaces. Plan to use larger datasets to improve this.
#11thICCS Johannes Kirchmair Hit Dexter 2.0: machine learning for triaging hits from biochemical assays
People are still talking about PAINS. The editors of various med chem journals teamed up to describe how to id false positive hits and reject them. J. Med. Chem. were the 1st to adopt PAINS as a decision framework to id cmpds that should be tested in detail.
The applicability of PAINS is limited. (P Kenny referenced)
Frequent hitters are not necessarily bad actors and v.v. Some aggregators and reactive cmpds but also true promiscuous cmpds.
Bad actors trigger false assay readouts and often (but not always) cause false positive readouts. Aggregators under v. specific assay conditions. PAINS cmpds cause problems under v. specific assay conditions.
PAINS: 480 SMARTS patterns. Describes their origin. Derived from 100K cmpds screened at high conc with a single screening technology. These cmpds had previously passed a garbage filter and were screened under detergent-containing conds (may miss reactive cmpds and aggregators).
PAINS are not related to frequent-hitters. Only a few (family A) are frequent hitters (emphasised by the original authors).
PAINS should not be used as a hard filter (will lose hits) and their absence does not imply a cmpd's benignity. Presence of PAINS not a problem per se, but may affect its developability.
Other approaches. Similarity-based approaches such as Aggregator Advisor. Structurally-similar molecules are potential aggregators also. Not intended as a hard filter.
"Badapple". Underappreciated, but very interesting. J Cheminf 2016, 8, 29. A scaffold-based score for the likelihood of a cmpd being a frequent hitter or ....
Our approach. An ML model for pred of frequent hitters. Highlight cmpds for which extra caution should be taken when reading assay readouts.
Derived from PubChem assay data. Protein clustering first. MW filter, salt filter, element, duplicate filter (InChI) and quality checks.
How many cmpds have been measured in multiple assays (against diverse targets). ATR is active to test ratio. The number of protein clusters for which a cmpd was measured as active versus total no of clusters on which tested.
Very few cmpds that are highly promiscuous. Shows histogram of dataset percent versus ATR.
NP vs P vs HP (non-promiscuous to highly prom). Created thresholds for these based on the mean ATR, mean+1sigma, mean+3sigma. 20% are P, 3% are HP.
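(ed: a sketch of the ATR and the class thresholds. How exactly the three thresholds map to classes was not spelled out; this assumes NP is below the mean, P starts at mean+1sigma, HP at mean+3sigma, and compounds between the mean and mean+1sigma are left unassigned. That mapping is my reading, not confirmed by the talk.)

```python
from statistics import mean, pstdev

def atr(n_active_clusters, n_tested_clusters):
    """Active-to-tested ratio over protein clusters."""
    return n_active_clusters / n_tested_clusters

def classify(value, mu, sigma):
    """Return 'NP', 'P', 'HP', or None for the assumed unassigned band."""
    if value >= mu + 3 * sigma:
        return "HP"
    if value >= mu + sigma:
        return "P"
    if value < mu:
        return "NP"
    return None

def classify_all(atr_values):
    mu = mean(atr_values)
    sigma = pstdev(atr_values)
    return [classify(v, mu, sigma) for v in atr_values]
```

For example, a cmpd active on 3 of the 12 protein clusters it was tested on has an ATR of 0.25.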
Dataset is really diverse. 399 protein clusters from 429 originally.
HP cmpds have log P higher by 1 unit. Less flexibility, and higher ratio of arom atoms to aliph atoms.
Model devel. Tried MACCS and Morgan - went with Morgan. Went with extremely randomized trees versus RF or SVM. Optimized hyperparams with grid search. SMOTE - synthetic minority oversampling. scikit-learn and RDKit.
Shows performance of classifier for discriminating different classes of promiscuity. 10-fold CV. Pretty good: AUC > 0.9, MCC = 0.61. (ed: what's MCC?) Independent test set with molecules that are not similar.
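(ed: MCC here is presumably the Matthews correlation coefficient, a single-number summary of a binary confusion matrix that ranges from -1 to +1 and stays informative on imbalanced data. A minimal sketch; returning 0.0 when a row or column of the matrix is empty is a common convention, not something from the talk.)

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A perfect classifier scores +1, a coin flip about 0, and a perfectly wrong one -1.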
More testing of model. Looked at effect of similarity of molecules to training set. They fall off at ... 0.4 Tanimoto (? I think)
More testing: Dark chemical matter (DCM) dataset. Over 80 or 90% are correctly classified as non-promiscuous. Also tested on Enamine HTS collection. Also aggregators dataset (John Irwin).
Most surprising result was the test on approved drugs from DrugBank. More predicted to be HP in DrugBank compared to aggregators!
Hit Dexter 2.0 website just went live.
GSK just published the 15 noisiest approved drugs in their assays. SLAS Discovery 2018.
The website gives a heatmap against various tests. Hit Dexter just gets one of the GSK molecules wrong. That molecule is very far away from the training set data (information which is reported in the webpage).
We believe Hit Dexter is able to predict freq hitters with high accuracy. Not intended as a hard filter. Help in design of screening libs, hit triage and follow-up, and id true prom cmpds. Validation is in progress on a large proprietary dataset.
#11thICCS Roger Sayle of @nmsoftware on Recent advances in chemical & biological search systems: Evolution vs. revolution
Databases are growing at rates exceeding Moore's Law. Dbs that are twice as big take twice as long. But sublinear methods will not slow down that fast as dbs get bigger. At 1M mol/s searching, ChEMBL in 2s, PubChem 1.5min, Enamine REAL in 10mins.
Looking first at substructure searching. Use of binary fps to prescreen possible matches improves perf for typical queries. However, the use of fps does not affect the worst case: a query like [X5] will bring many search systems to their knees.
Shows performance of different tools on Andrew Dalke's benchmark set of SMARTS queries. Ref to substructure-search-faceoff. OB 2 days, ChemAxon 2h, BIOVIA 50 minutes, SaCHEM (J. Cheminf.) in 16m. Total time is dominated by pathological queries.
Where is the time going and how can it be improved? If you search for indole in eMolecules without a prescreen, 1% on file I/O, 59% on ring perception + aromaticity, and very little actually on the search.
To test, store all molecules in memory and do the search. 6s versus 120s for this search. Large memory footprint and there is contention between threads in multiprocessor systems.
"Arthor" Use compact on-disk representation of molecules that can be searched directly without reading into memory or creating molecule object. Can do the search in 3s.
Goes back to the earlier comparison graph. 27m for the brute force search, 46s for prescreen and 12s if 8 threads.
No live demo, but here's a short video. Available as postgres cartridge.
Moving onto similarity search. Traditionally calculated as the Tanimoto coeff between binary vectors. Shows CUDA code from @olexandr with three popcounts. Could be done in two or even one.
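Ed: a sketch of what "three popcounts, two or even one" could mean, using Python ints as binary fps. This is my reading rather than the code shown on the slide: three popcounts count A, B and A&B; two count A&B and A|B; one is enough if the per-fingerprint bit counts are stored ahead of time. All names are illustrative.

```python
def popcount(x):
    """Number of set bits (a hardware POPCNT instruction in real implementations)."""
    return bin(x).count("1")

def tanimoto_three(a, b):
    """Three popcounts: |A|, |B| and |A & B|."""
    common = popcount(a & b)
    union = popcount(a) + popcount(b) - common
    return common / union if union else 0.0

def tanimoto_two(a, b):
    """Two popcounts: |A & B| and |A | B|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def tanimoto_one(a, n_bits_a, b, n_bits_b):
    """One popcount per comparison, assuming bit counts were precomputed and stored."""
    common = popcount(a & b)
    union = n_bits_a + n_bits_b - common
    return common / union if union else 0.0
```

The one-popcount form fits naturally with storing fps grouped by their number of set bits, since the count for the database fp is then known without touching the fp itself.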
Choice of fps. ECFP4 is used these days rather than path-based fps like Daylight used. Baldi bounds make sense for Daylight but no longer valid for ECFP4 as they are very sparse.
Trick#1: Use hardware popcount to speed things up.
Trick#2: Sort fps by the number of bits set. Useful for Baldi bounds but also for data storage, e.g. these are all the fps with 40 bits set.
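Ed: a sketch of why grouping fps by bit count helps. The bound (as I understand it from Swamidass and Baldi) is that Tanimoto(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|), so whole groups of fps can be skipped for a given threshold. The bucket layout and function names below are my own illustration, not the speaker's data structure.

```python
from collections import defaultdict

def popcount(x):
    return bin(x).count("1")

def build_buckets(fps):
    """Group fingerprints by their number of set bits."""
    buckets = defaultdict(list)
    for fp in fps:
        buckets[popcount(fp)].append(fp)
    return buckets

def threshold_search(buckets, query, threshold):
    """Return (hits, buckets_scanned) for Tanimoto >= threshold."""
    a = popcount(query)
    hits = []
    scanned = 0
    for n, bucket in buckets.items():
        biggest = max(a, n)
        # Upper bound on the similarity for any fp with n bits set
        if biggest == 0 or min(a, n) / biggest < threshold:
            continue  # no fp in this bucket can reach the threshold
        scanned += 1
        for fp in bucket:
            common = popcount(query & fp)
            score = common / (a + n - common)
            if score >= threshold:
                hits.append((score, fp))
    return hits, scanned
```

As the notes say, this pruning is effective for dense path-based fps but prunes much less for sparse ECFP4-style fps, where most bit counts sit in a narrow range.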
Trick#3: reciprocal multiplication. Avoid floating-point division; can just use integer multiplication with a reciprocal table instead.
Trick#4: the bottleneck is not the search thru the db, it's sorting the results NLogN. Avoid this with a counting sort.
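Ed: one way a counting sort could replace the N log N sort of the hit list, assuming scores are quantised into a fixed number of integer bins. The bin count and names are hypothetical; the talk did not say how the scores are binned.

```python
def counting_sort_hits(hits, n_bins=100):
    """Order (mol_id, score) pairs by descending score in O(N + n_bins).

    Scores are assumed to lie in [0, 1]. Hits that fall into the same bin
    keep their input order, so the ranking is exact only to bin resolution.
    """
    bins = [[] for _ in range(n_bins + 1)]
    for mol_id, score in hits:
        bins[int(score * n_bins)].append((mol_id, score))
    ordered = []
    for bucket in reversed(bins):
        ordered.extend(bucket)
    return ordered
```

Each hit is touched a constant number of times, so the cost grows linearly with the number of hits instead of as N log N.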
Trick#5: Just-in-time compilation. Write a program to work out the optimal code on-the-fly. Code-specialization.
Trick#5a: skip empty words. Benzene only has three words with bits set, so only three popcounts are needed.
#5b: For words with a single bit set, can avoid popcount using an AND.
#5c: Coalesce memory reads using Viterbi dynamic programming.
#5d: Popcount combining. If P and Q have no bits in common, can do both at the same time.
Done using graph coloring. Shows Abilify. Draw a line between words if they have a bit in common. Graph coloring is used to minimise the number of popcounts by choosing words that don't have a bit in common.
Using this approach, can reduce the number of popcounts from 5K to 3K across the entire dataset.
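Ed: my attempt at sketching popcount combining. If query words P and Q share no bit positions, then popcount(P & A) + popcount(Q & B) equals popcount((P & A) | (Q & B)), so one popcount does the work of two. Below, a simple greedy colouring stands in for whatever colouring algorithm is actually used (the talk did not say), and all names are mine.

```python
def popcount(x):
    return bin(x).count("1")

def group_disjoint_words(query_words):
    """Greedy colouring: words in one group share no bit positions."""
    groups = []  # each entry is [combined_mask, [word indices]]
    for i, word in enumerate(query_words):
        if word == 0:
            continue  # empty query words never contribute (trick 5a)
        for group in groups:
            if group[0] & word == 0:
                group[0] |= word
                group[1].append(i)
                break
        else:
            groups.append([word, [i]])
    return [indices for _, indices in groups]

def common_bits(query_words, target_words, groups):
    """Count bits shared by query and target with one popcount per group."""
    total = 0
    for indices in groups:
        merged = 0
        for i in indices:
            merged |= query_words[i] & target_words[i]
        total += popcount(merged)
    return total
```

The grouping depends only on the query, so it is computed once and reused for every fp in the database.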
Compares to previous work vs MadFast, ChemFP, OEGraphSim, and OpenBabel. Maxes out at around 500M FPs/sec versus 100FPs/s for ChemFP. GPUs faster but cannot support large databases as don't fit into GPU memory.
Future work is direct support for multiple GPU cards (federated search). Can support larger databases in this case. Future work is direct gen of NVidia SASS via cubin binaries. Pushed some fixes up to GCC 9 to speed up popcounts.
Now talking about protein seq searching. If you want to name peptides in terms of other ones, you need fast searching of seq dbs. e.g. PDB 1CRN can be named as [L25I]P01542 if you can quickly search UniProt.
Algorithm to search: Longest common prefix. Rather than use BLAST etc use a suffix array. Store all suffixes in a database and sort them. Much bigger, but can be searched with a binary search rather than having to iterate over the whole database.
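Ed: a toy sketch of the suffix idea. Every suffix of every sequence is stored and sorted, and a query is located by binary search instead of scanning the whole database. The sequences and ids below are made up, and a real implementation would store offsets rather than copies of the suffix strings.

```python
import bisect

def build_suffix_index(sequences):
    """sequences: dict of id -> sequence. Returns sorted (suffix, id, offset) tuples."""
    entries = []
    for seq_id, seq in sequences.items():
        for offset in range(len(seq)):
            entries.append((seq[offset:], seq_id, offset))
    entries.sort()
    return entries

def find_occurrences(entries, query):
    """Binary search for the first suffix >= query, then walk the matching run."""
    position = bisect.bisect_left(entries, (query,))
    hits = []
    while position < len(entries) and entries[position][0].startswith(query):
        hits.append(entries[position][1:])
        position += 1
    return hits

# Hypothetical toy database
index = build_suffix_index({"seqA": "TTCCPSIVARSNF", "seqB": "MKVARS"})
```

All suffixes that start with the query sit next to each other in sorted order, which is why a single binary search followed by a short walk is enough.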
Now moving onto graph-edit distance search (SmallWorld). "Fight big data with even bigger data". Precalculate all of the substructures of all molecules. Have much bigger db but can search quickly. Sublinear-scaling search method.
Turn the entirety of chemical space into a Facebook or LinkedIn friends-of-my-friends view. Instead of 340M molecules, you have 68 billion subgraphs. But those subgraphs can be searched much more quickly - don't have to look at them all to find nearest nbrs.
Graph-edit distance (GED) is the min no of edit operations to turn one molecular graph into another. Add/remove ring bonds, terminal bonds, link atoms.
Shows demo. Can find matched pairs of the queries, super and sub structures in real-time. One of the outcomes is the atom-mapping between the query and the hits.
Shows ExScientia example and how graph edit distance could have found the changes they used but traditional FPs could not have.
Rarey asks about counting fps and unfolded fps. Answer: would be a nice functionality to add.
#11thICCS Henriëtte Willems on Strategies for assembling an annotated library for phenotypic screening
Phenotypic screening is an alternative to target-based drug discovery. Could have high-content imaging screen, find phenotypic hits, then find the target/biology.
Instead could base it on a pathway
Has been v successful in the past. 28 out of 50 first in class FDA approved small mols have come from this.
Disease relevant, greater conf that hits will deliver the desired therapeutic effect. But difficult to derive SAR, and so mostly you want to know the target and so have to find it out somehow.
Which cmpds to screen? Usually relatively small nos of cmpds, with well-annotated pharmacology. SGC has criteria for tool cmpds - but quite stringent. "Orthogonal" chemical probes, diff chemical structures with act for same target (likely to have different off-targets).
Dark chemical matter - might have hit, but wouldn't know why.
ARUK (Alzheimer's Res UK) approach in collab with LifeArc. Started by mining ChEMBL. Then looking at availability by SPECS there were 5K cmpds.
Wanted selective cmpds but hard to decide on which are selective. What's an acceptable ratio between activity on off-targets and targets? Absolute or relative potency? Similar targets vs different targets? What about if you don't know the subtype activity (e.g. some ChEMBL data)?
Used Windows score (Bosc BMC Bioinf 2017 18 17). How many targets are hit in a particular window of activity relative to the primary potency.
We ended up defining three groups of cmpds. Green: active for 1 and inactive/weakly active for at least one other. Blue and Yellow groups also based on number of off targets and potency differences.
Some issues: protein targets for diff species have diff identifiers (ChEMBL ids not sufficient). Transporters and CYPs were included. Duplicate cmpds.
Workflow all implemented in Knime. MedChem Express and SPECS were searched for availability. MCE affordable and SPECS sources from diff vendors.
LifeArc had an existing phenotypic library. Partly based on similarity, so did not have ChEMBL annotation. Chemical Probes website and various vendor websites are not set up to search with lists of molecules.
Some troublesome cmpds shown. It's in ChEMBL, but via PubChem assay and activity comment: "inconclusive". An example of an approved drug, but no target associated. No mammalian protein target for another. Another inhibits a whole family of proteins.
Some examples of well annotated and selective cmpds in ChEMBL. Hydroquinone? Amantadine? Pepcid?
We wanted to cover a broad range of protein families. Shows ring plot. Seems like good coverage. Which targets not covered? There were gaps - needed to go back and fill them. Where else to look? Difficult to mine SureChEMBL for a non-specialist.
Look to PDB? Molport and MCule - good portals with lists of SMILES. Knime nodes to search. Custom synthesis? Too time-consuming and expensive.
Discussed including cmpds of same chemotypes as actives to help deconvolute targets (controls). Currently looking to investigate other vendors. Also thinking more about 'dark chemical matter'.
#11thICCS Christos Nicolaou on Advancing automated syn via rxn data mining and reuse
The automated synthesis and purification labs (ASL and ASP). "Pretty cool robots". Scientists come up with workflows, click a button and submit, and if all goes well you get the end product at the other side.
The proximal Lilly collection (PLC). Given our cmpd collection, how can we create a database to cover chemical space more fully. Virtual syn engine-->PLC--->in silico design and selection. "Old news"
Use our ELN as the source of reaction information. Info just sitting there. We fished it out of ELN and other databases to create a single homogenous data repo: Synthory, the Lilly Rxn Repo.
Classify with @nmsoftware NameRxn into RXNO reaction ontology.
What did we find? Overall similar to other papers. Top 20 cover 50%. 88% of reactions in ASL are recognized.
Want to use this info for structural design. Synthetic route pred (ChemoPrint). The training is done as follows. Many steps but straightforward. Reverse rxns. Calc rxn signatures. Define rxn classes based on those signatures (uniqification). Take a representative from each class
Signatures differ on atom typing. Want to take into account charge?, aromaticity? etc.
RRT (reverse rxn template) repo. Can now use it to carry out rxns. For a hypothesised structure, search for match, checking building block availability, and then...
Typically we end up with many many routes. Everything can be debrominated if it has a bromine. Need to assess synthetic feasibility of a hypothesis and then make/test the cmpd.
Which route should I push onto the robot? We have not looked at other functional groups the building block might have, or duplicate functional groups.
How do we assess routes? Synthex. We have synthesis execution templates - all those rxns that have been run on the robot already. Start by asking the experts: heuristic rules - how often have they been done, how easy, how successful.
Forward rxn check, brute force. Try whether the reactants match the templates. If more than one matches, then they might be undesirable products.
That's slow, so we prefer neural nets to make a prediction. Give it the two reactants and it tells you which are going to fire.
Where are we now? The Big Picture. Everything is done via a web interface that is presented back to the chemist and they make the final decision. This is being implemented, and lots of data is starting to come in. There will be lots of learning to be done.
Example on public data for testing. USPTO reactions from @dan2097. Shows some example routes sorted by support. Everything looks reasonable. Of course there are others not so reasonable.
Now on Lilly Data. Identify route for 10K latest Lilly numbers. Focussed on single-step rxns. Route found in 49.9% of cases. 255 rxn types. ASL feasible: 35.8%. In short, a big chunk of the chemistry the chemists are doing could be done on the robot.
We are reusing Lilly synthetic knowledge. Connecting theoretical syn routes to actionable items.
Are we there yet? The Robo-Chemist.
Routine reactions are/will be feasible via automation, but not complex, more challenging chemistries. But the "routine" set will grow to support synthetic chemists. And the future chemist will increasingly lean on machines for these chemistries.
We are open sourcing it. Either today or tomorrow. It's going out on GitHub. LillyMol project on EliLillyCo GitHub page. Expanding on code previously released.
#11thICCS Modest von Korff on Targeting of the disease-related proteome by small molecules
The disease is the phenotype - what you observe in the human. It's outside the normal conditions. MeSH terms cover 4500 diseases. A disease has to be severe, with no sufficient treatment available, for us to start working on a new drug.
A protein can be regarded as a switch which you need to trigger to go back to the normal state away from the disease state.
Ion targets are still an emerging class of drug targets and also include anti-targets (hERG).
We want to relate the genes and the diseases. We developed Gene2Disease to do this. Shows network diagram of info from text-mining collecting all disease/gene relationships. Q: How specific is a gene for a disease?
Pure counting of co-occurrence in abstracts is not specific. Some genes and some targets are more interesting and so are mentioned more often. Relevance estimator: Gini coefficient. Presented at Pac Symp Biocomp 2017.
(Note to self: try to get correct definition of this estimator from the authors)
Looking at ChEMBL, some targets tested a lot, some molecules tested on many targets. Build up relationships between diseases, sequences and chemical structures.
Cmpd to cmpd similarity measured with Flexophore pharmacophore descriptor, part of Data Warrior. Distance histograms from pharmacophoric pts from conformations of a molecule. Published in 2008.
Can measure similarity of proteins based on the cmpds tested against in ChEMBL.
How do we make a map? Used the Data Warrior approach, a 2D rubber band scaling, a forcefield-like arrangement in 2D space. 200K points laid out in 2D.
Compares results on ChEMBL targets, and human proteome. The explored part of the human proteome is dwarfed by the un-explored part. Zoom into hypertension relevant proteins. Now zooming into mitochondrially encoded cytochrome B, e.g. CDK.
Uses size of box to indicate relevance and colour to indicate how well explored. Two targets close together, both relevant, but only one explored. Target cliff. These may be valuable starting points for drug discovery.
#11thICCS Natalia Aniceto on Gearing transcriptomics towards HTS: Cmpd shortlisting from gene expression using in silico information
Gene expression provides additional insight beyond phenotype into biological processes, which can be used to find new drugs. Original paradigm: focussing on single target. But current paradigm is to use systems biology to look at a set of genes.
Treatment -> transcriptomics signature -> phenotypic readout. The signature gives additional insight that may be useful to treat the disease.
What data is available? Library of Integrated Cellular Signatures (LINCS) from the Broad Institute. Allows systematically mapping gene to mode of action to disease. Difficult to query it except via exptal data. Not suitable for HTS.
Try to find a tool to select cmpds that elicit a desired cellular shift *without* prior exptal measures. So could be used for virtual cmpds.
First idea might be to predict gene expression of molecules and then use those predictions instead of exptal measures. But doesn't work very well.
Could we rank cmpds according to rel likelihood to match a target signature instead? Less importance placed on single point-prediction accuracy. Get set of cmpds with increased likelihood to contain the best candidate.
How to do it? Variable Nearest Nbrs (v-NN); a variation of kNN. Liu et al JCIM 2017, 57, 2194. As closest nbrs might not be close, v-NN uses a hard threshold for distance - mean is weighted by distance. This is a more controlled way of doing predictions.
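Ed: a small sketch of a v-NN style prediction as I understood it. Only nbrs within a hard distance cutoff contribute, and the prediction is a distance-weighted mean of their values. The Gaussian-style weight exp(-(d/h)^2), the parameter names d0 and h, and their default values are assumptions on my part; check the cited paper for the exact form.

```python
import math

def vnn_predict(neighbours, d0=0.3, h=0.2):
    """neighbours: list of (distance, value) pairs for the training cmpds.

    Returns the weighted mean over nbrs with distance <= d0, or None when
    no nbr is close enough (the query is outside the applicability domain).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    n_used = 0
    for distance, value in neighbours:
        if distance > d0:
            continue  # hard threshold: far nbrs are ignored entirely
        weight = math.exp(-((distance / h) ** 2))
        weighted_sum += weight * value
        weight_total += weight
        n_used += 1
    if n_used == 0 or weight_total == 0.0:
        return None
    return weighted_sum / weight_total
```

Unlike plain kNN, the number of nbrs used varies per query, and a query with no close nbrs simply gets no prediction rather than an unreliable one.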
Robustness of the signature is based on repeated replicates. Gives standard deviation.
Tested with LINCS data, 19.5K cmpds. Also on internal Novartis PANOMICS data and 215K cmpds.
Refers to Martin et al JCIM 2017 57 2077 regarding pQSAR profile. Predicted assay profile based on RF model of past results.
Shortlisting on similarity to query *and* reliability. Shows effect of relaxing reliability criteria on ability to find known answers. (Ed: she should perhaps try a 2D heat map of reliability threshold vs similarity vs results)
Applicability domain: In a prospective sense, we don't know how many nbrs to include. Can we separate higher ranking from lower ranking queries? Shows heat map giving overview of the results.
I think this is the first time I heard it. I missed a small no of talks.
A new transcriptomics signature shortlisting procedure suitable for screening in large scale libraries. "1 cmpd - > gene expression" shifted to "gene expression -> N cmpds".
#11thICCS Cheminformatics session dedicated to Peter Willett (who is retiring).
Val Gillet corrects this. He is *not* retiring, but reached a 40 year milestone of contribs to cheminf. Peter cannot make it today due to family reasons.
Has 53-page CV which Val is going to try to condense. 578 publications incl. 16 books. Well-known in bibliometrics and info retrieval (not just cheminf).
Most-cited is GOLD paper, then 1998 chemical similarity searching (standard ref). Another is for document clustering.
Supervised 72 PhD students. Many have gone on to leading positions in pharma, biotech and software companies around the globe. Has collaborated with most of the major pharma companies. Held editorial roles on 18 journals. Reviewer for 126. Grant proposals for 22 agencies.
Youngest ever recipient of Skolnik. First recipient of Mike Lynch Award. Only 2nd non-US recipient of COMP award. (Also lots more on the slide)
Bob Clark video tribute to Peter Willett. Tripos had first refusal on projects they funded at Sheffield. Peter understood that the last 10% of a user-friendly software took 90% of the time. He wanted the software used for real-world problems not just within the group.
#11thICCS Robert Schmidt on Comparison and Analysis of Molecular Patterns on the Example of SMARTS
Gives example: [O,N]-!@c(:c):[a!c] and visualization via Smarts Viewer. Red indicates negation - matches everything except whatever.
The molecular pattern comparison problem. Difficult to compare patterns. Are two patterns the same? One may match a subset of the others.
Motivation: why interesting? Lots of SMARTS patterns filters. Any difference between Dundee filters and BMS filters? How similar is one set of filters to another?
How do we compare two SMARTS nodes? Extended valence states JCIM 2014, 54, 3, 756. Annotate each node with property values. 20071 atom type states that can be annotated with SMARTS. Set bits to true for all feasible states.
For edges, we only need 9 bits to cover all possibilities. Distinguish between cyclic and non-cyclic ones.
What's the relationship between CCN and CCC[N,O]? Create the fps for the nodes and edges. Match via a clique-based approach to get correspondences. (ed: a faster approach would be to use the same approach as used for SMARTS matching)
Node similarity is based on statistical analysis of node occurrences.
Is pattern A more specific than pattern B? Equivalent to asking whether each node of A is more specific than the corresponding node of B.
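Ed: a toy illustration of the "more specific" test using bitsets. Each node is a bitset of the atom states it can match, and node a is at least as specific as node b when a's states are a subset of b's. The three-state universe and the fixed node-to-node correspondence are simplifications of mine; the real tool uses about 20K states, also compares edges, and finds the correspondence via the clique step.

```python
# Hypothetical three-state universe: one bit per atom state
C, N, O = 0b001, 0b010, 0b100

def node_is_subset(a, b):
    """True if every state allowed by node a is also allowed by node b."""
    return a & ~b == 0

def pattern_at_least_as_specific(pattern_a, pattern_b):
    """Patterns are equal-length lists of node bitsets, already aligned 1:1."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("this sketch assumes a precomputed 1:1 node mapping")
    return all(node_is_subset(a, b) for a, b in zip(pattern_a, pattern_b))

ccn = [C, C, N]          # toy stand-in for CCN
ccn_or_o = [C, C, N | O]  # toy stand-in for CC[N,O]
```

Here CCN is more specific than CC[N,O] because its last node allows only a subset of the states, while the reverse test fails.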
SMARTScompare can also handle recursive SMARTS via recursive calls. (ed: impressive!)
Gives examples taken from BMS set and Inpharmatica set. Is there any pattern that is more generic or more specific?
Took 400 patterns with ~1000 matches within 370M molecules from ZINC. How similar are patterns from set A to set B? Compare patterns in terms of matches to molecules. (ed: missed some details - sorry)
Allows search for specific patterns, search for generic patterns, similarity assessments.
Future work will be supporting remaining SMARTS features. Improve recursive SMARTS, and developing a unique SMARTS repr.
Ed: I think this would be useful even for preparing a SMARTS to match a functional group, as you could find matches to other existing patterns and see whether it would make sense to adjust your own one.
Ed: Questioner is wondering about applicability. Doesn't get how awesome this is. It's the sort of thing we've discussed in-house from time to time.
#11thICCS Andreas Goller on Anisotropic atom react descs for the pred of liver metabolism, ames tox and H bonding
Refers to DFTB+ for calculation. Check out
and - another open source QM package I'd never heard of.
#11thICCS Timur Madzhidov on Creating atom-to-atom mapping in chemical rxn using ML methods
Gives example of reaction search pattern. You need atom-atom mapping to ensure that the corresponding atoms match on both sides of the arrow.
ICClassify is described from InfoChem. Based on spheres of atom environments.
Lots of approaches: fragment-based, MCS based, optimization techniques. Most based on principle of min chemical dist. Proceed with least redistribution of valence electrons (Jochum, Gasteiger, Ugi 1980).
Should AAM correspond to atom mechanism or not? Wendy Warr says it is important to remember that the mapping takes no a/c of the true rxn mechanism. Chen says different. Who's right? Both. For db, exptal meaning is not needed, only that for the same type of rxn the same AAM is...
...always created. The easiest way to do it is to produce a chemically meaningful one.
PMCD (principle of min chemical dist) does not always distinguish. Isotope labelling expts can distinguish. Some mappers may give both with equal probs.
For Diels-Alder there is an AAM that does not correspond to the correct mechanism but has min dist. Need to a/c for hydrogen - degenerate mappings.
Another case, where two mappings are actually correct. Major and minor product. Different probabilities.
PMCD can not work for some reactions, and is intrinsically not a 1-to-1 mapping. AAM is a very complex problem. Heuristics are needed. So...machine learning.
In the field of ML, this sort of problem is called structured (output) learning. Output not a number but something structured.
Gives Diels-Alder example. An atom can be mapped onto atom x or y but no other. Imagine a function giving the probability of a particular mapping being correct. In training, should be 0 prob for wrong mappings, and 0.5 for the two possible correct mappings.
Use a binary vector to represent reagents and products (atom pairs). We used Naive Bayes and neural networks to give the probability.
Gives examples for esterifications. (ed: Probabilities appear to be per atom of reactant.)
Use this matrix of probabilities to maximise the probability over the entire mapping. Munkres algorithm. Bipartite graph matching. (ed: perhaps dynamic programming would be more efficient?)
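Ed: a sketch of the assignment step. Given a matrix of probabilities P[i][j] that reactant atom i maps to product atom j, pick the one-to-one mapping with the highest overall probability. To keep the example dependency-free I brute-force all permutations, which only works for a handful of atoms; the talk used the Munkres (Hungarian) algorithm, which solves the same problem in polynomial time. Names and numbers are illustrative.

```python
from itertools import permutations

def best_mapping(prob):
    """Return the tuple m, where reactant atom i maps to product atom m[i].

    Brute force stand-in for the Munkres algorithm: maximises the product
    of the chosen probabilities over all one-to-one mappings.
    """
    n = len(prob)
    best, best_score = None, -1.0
    for candidate in permutations(range(n)):
        score = 1.0
        for i, j in enumerate(candidate):
            score *= prob[i][j]
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Optimising over the whole mapping matters: in a matrix like [[0.6, 0.5], [0.55, 0.1]], taking the single best entry first (0.6) forces the second atom into a 0.1 match, whereas the swapped mapping scores higher overall.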
Problems with molecular symmetry, or when atoms have very similar nbrs. Neighboring atoms may map to non-nbrs. Try to correct for this in the majority of cases.
Comparison vs ChemAxon and Indigo. ChemAxon is the best, ML method is not far behind and Indigo is further back. Needs more data to train. Shows specific examples where different tools do better or worse.
So far only 5 different types of reactions. But wanted to support all possible rxns. The training set is ~2700 reactions. (ed: did anyone catch where the rxns come from?)
Only for balanced rxns at present. To come, unbalanced rxns. DNN perhaps? Could be applied and adapted for many graph alignment tasks (substructure search, MCS, biological network alignment)