@baoilleach
Created July 3, 2022 20:14
Notes from International Conference on Chemical Structures 2022
Monday morning - Analysis of Large Chemical Datasets
--------------------------------------------
https://twitter.com/ConferenceNoel/status/1536235381313753090
I missed the first tweet as I was setting up this Twitter a/c but it should have been:
#2022iccs Maximilian Beckers (Novartis) on 25 years of small molecule optimization at Novartis: A retrospective analysis of chemical series evolution
#2022iccs A chemical series is a subjective concept. Kruger JCIM 2020 published automated id of chemical series.
#2022iccs Specificity of a scaffold is the probability of a random match of a scaffold. More meaningful scaffolds have fewer random matches per scaffold.
#2022iccs The dataset includes a whole bunch of different properties from their Novartis in-house dataset. Filtering removes bifunctional degrader and others (e.g. >5 amide bonds). 310K cmpds in the end.
#2022iccs Ran the scaffold analysis of the dataset. 72% of the compounds were assigned to a scaffold. Median is 60 cmpds assigned to a scaffold; typical one sided/long-tailed distribution.
#2022iccs Time progression of chemical series. Peak of one year from first to last member in series. But many over multiple years. Sometimes, single compound registered; then nothing; then the project starts years later. Cases where later cmpd was found as a hit for diff proj.
#2022iccs Scaffolds can be used in multiple optimisations. Data can be seen as a network of partially overlapping chemical series. Shows min spanning tree. Colour by disease area. 70% of all series are in the biggest connected component. Merge related scaffolds to reduce redundancy.
#2022iccs After merge, cut all weak connections (Jaccard < 0.5).
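A minimal sketch of the pruning step above, assuming each series is represented as a set of compound ids and the edge weight is the Jaccard index of two such sets. The notes do not say which sets were compared in the talk, so the toy data and the names `jaccard` and `strong_edges` are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def strong_edges(series: dict, cutoff: float = 0.5) -> list:
    """Return (name1, name2, jaccard) for series pairs at or above the cutoff."""
    edges = []
    for (n1, s1), (n2, s2) in combinations(sorted(series.items()), 2):
        score = jaccard(s1, s2)
        if score >= cutoff:  # weaker connections are cut
            edges.append((n1, n2, score))
    return edges


# Hypothetical toy data: series name -> set of compound ids.
toy_series = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {4, 9},
}
print(strong_edges(toy_series))
```

Only the A-B edge survives here: the two series share three of five distinct compounds, while C overlaps each of the others by a single compound.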
#2022iccs Cmpd optimisation during series. Chemical structures evolve in similar ways during opt. Roughly 3 HA are gained. Fraction sp3 from 0.3 to 0.36. Structural evolution reflects MedChem common sense. Aliphatic rings inc, dec in aromaticity.
#2022iccs Biggest change in SA score (inc); molecules becoming more complex. Did statistic tests for each property (corrected for multiple testing).
#2022iccs The series are steered by chemists in certain directions. The Fsp3 is not completely explained by the inc in HAC. Similarly for other properties.
#2022iccs How do the ADMET properties change over time? Solubility tends to inc by 0.2 log units. Lipophilicity slightly decs. Perm gets worse. Met stab improves. Cyp and hERG improve.
#2022iccs Compared to traces based on HAC, all but one of the properties go strongly in the opposite direction. However, permeability matches it exactly, indicating that chemists do not control for it.
#2022iccs Now looking at qualitative evolution of ADMET properties. Classify series start and end into desired and undesired values. Again permeability goes against the trends; tends to be worse.
#2022iccs How likely it is to rescue a liability if poor. The hardest thing to resolve is hERG and clearance. Also looked at partial rescues; not quite so bad. Blue bar indicates random chance.
#2022iccs Reconstructing the development history of a chemical series. Consider mol sim in addition to the registration dates. Genealogy of cmpds. Starting cmpd shown; several routes tried before going down the final route.
#2022iccs Looking now at the structure evolution with growing dissim from root. Clear trends not observed for ADMET, as they already have acquired information.
#2022iccs Increasing heavy atom count has unfavorable effects (even just 3 extra), particularly on permeability - should optimize towards atom count aware metrics.
#2022iccs What makes a successful series? Try to add early prediction of promising series. Thanks to Mik Stiefl and Mik Fechner.
https://twitter.com/ConferenceNoel/status/1536242043160252416
#2022iccs Joel Graef on GeoMine: On-the-fly geometric pattern mining in binding sites
#2022iccs In Hamburg, we have the Proteins Plus tool (proteins.plus). Includes a tool called GeoMine, for searching the PDB. It was developed to answer questions about spatial arrangement of atoms. Is this a reasonable geom for a H bond; typical distances; ... etc
http://proteins.plus
#2022iccs Geometric pattern search. Query designed, uploaded, and then searched, returning results. It's a database approach. PDBs preprocessed by NAOMI framework; Protoss for protonation; DoGSite for binding sites.
#2022iccs Textual annotations added. After, every hvy atom in the pocket is a potential search point (PSP) with props: elem, interaction type (Don, Acc), origin (protein, ligand, water..). Then interactions are precalculated (pi-pi, H-bond, etc).
#2022iccs Pocket prediction is done by placing protein in grid. For each grid point, DoG density (diff of Gaussian method, where can place objects of a certain radius), PSP counter, and contour level. After, ligands are processed; solvent exposure is calculated.
#2022iccs Clusters are then calculated. These are then enlarged using a radius of 2 angstrom; clusters are then merged into pockets followed by removal of small pockets.
#2022iccs Can search entire PDB. Can do textual, numeric and geometric searches, and these can be combined in any way.
#2022iccs The geometric query can be done using a template structure. Interaction points - every HA or ring centre, or defined via SMARTS. Can connect two points via distance constraints (with a tolerance). Interaction constraints.
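A rough illustration of what a distance constraint with a tolerance means for one candidate match. This is not GeoMine's implementation (which, per the notes, converts the query to a database search); the point labels, coordinates and the function name `satisfies` are hypothetical.

```python
import math


def satisfies(points: dict, constraints: list) -> bool:
    """Check every (name1, name2, distance, tolerance) constraint against 3D points."""
    for p, q, target, tol in constraints:
        d = math.dist(points[p], points[q])
        if abs(d - target) > tol:
            return False
    return True


# Hypothetical candidate match: point label -> (x, y, z) in angstrom.
candidate = {
    "donor": (0.0, 0.0, 0.0),
    "acceptor": (3.0, 0.0, 0.0),
    "ring": (0.0, 4.0, 0.0),
}
# Hypothetical query: donor-acceptor at 2.9 +/- 0.3, donor-ring centre at 4.5 +/- 0.3.
query = [("donor", "acceptor", 2.9, 0.3), ("donor", "ring", 4.5, 0.3)]
print(satisfies(candidate, query))
```

The candidate passes the first constraint (3.0 is within 0.3 of 2.9) but fails the second (4.0 is 0.5 away from 4.5), so the whole query is rejected.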
#2022iccs Protein filters, pocket filters, ... filters are applied. Discriminative SMARTS analysis of patterns with at least 5 atoms or at least one atom that is not CNO. Convert to SQL db query and search. Finally apply remaining SMARTS filters.
#2022iccs Showing the results now. Can use the "eye" icon to superimpose your results on the query. If you want to reduce/filter your matches, can use "Refine search" button to start a new search. History is recorded for stepping back. Statistics provided about the hits found.
#2022iccs GeoMine can be used for searching for structural similarities between binding sites of unrelated proteins. Also, drug repurposing, side effects, ...
#2022iccs Ligand Annotations for AlphaFold structures. E.g. Kinase NEK 6. Can create query to find potential ligands that bind to AF structure. Query has various protein points, and one ligand point ("ANY"). 198 matches and 116 pockets of 88 PDBs. Matched ligand is superimposed.
#2022iccs See Poster #45 also. Efficient and publicly available web tool. Simple as well as cmplx queries supported. See proteins.plus, uhh.de/naomi, smarts.plus.
http://uhh.de/naomi
http://smarts.plus
http://proteins.plus
#2022iccs Q&A: How often updated? Currently not automated, but soon will be, every two weeks.
#2022iccs Q about removing solvent molecules. There could be some remaining.
#2022iccs Q about PDB having alternate conformers being present. We use the highest probable conformers. Q about determining quality. We include all structures that are processable by our in-house tool, and exclude otherwise.
#2022iccs Q about how accurate the pocket detection algorithm is, as it does not use the ligand? We use the DoGSite approach. This has been optimised and a paper is coming out soon. Has been compared to FPocket, etc.
https://twitter.com/ConferenceNoel/status/1536249469804728320
#2022iccs Olivier Bequignon (Leiden) on Papyrus – A large scale curated dataset aimed at bioactivity predictions.
#2022iccs Integrating heterogeneous data causes problems. Bioactivity data are heterogeneous in terms of ids (cmpds, proteins, annotations), metadata (cross-refs), standardisation (structs, seqs). Curation workflows are typically not public. Cannot reapply.
#2022iccs Trade-off between expert vs intuitive systems (e.g. ChEMBL website vs ChEMBL SQL download).
#2022iccs We first designed a standardisation workflow. Gene identifiers are tricky; we standardised to UniProt accession ids. We used quality metrics for bioactivities. High/med/low (dups, censored).
#2022iccs Was the chemical structure available. Controversial step (his words): REMOVAL OF STEREOCHEMISTRY (!).
#2022iccs Why removal? ExCape and ChEMBL may have two separate cmpds that actually refer to the same compound due to some error in earlier processing.
#2022iccs We went on with standardisation. ChEMBL standardisation pipeline. SMIRKS for failing structures. Ionization (pH=14.0), canonical tautomers, InChI-fication. Group based on confidence. Aggregate (mean, median).
#2022iccs ChEMBL (19.2M), Excape (71M), some Kinase focused datasets (870K). Data aggregated on a cmpd-target base based on quality....The Papyrus dataset! Named after the Leiden Papyrus X, an aide-memoire about alchemy.
#2022iccs Showing the overlap between datasets using a diagram with circles joined by lines underneath a bar graph (I've forgotten the name for this, @rguha can tell me).
#2022iccs Looking at the activity space. T-Map, min spanning tree, coloured by activity space. Some branches only green - only from Excape; other only blue - only from ChEMBL, etc.
#2022iccs Chemical diversity using sphere exclusion diversity; diversity picker based on Roger Sayle's Leader algorithm, sample sizes of 228 molecules.
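A small sketch of the idea behind sphere exclusion picking in a leader-style single pass, assuming fingerprints are sets of on-bits compared with Tanimoto similarity. It is not the implementation used in the talk; the toy fingerprints, the thresholds and the names `tanimoto` and `leader_pick` are illustrative only.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def leader_pick(fps: list, threshold: float) -> list:
    """Return indices of picked leaders; a molecule becomes a leader only if
    its similarity to every existing leader is below the threshold."""
    leaders = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[j]) < threshold for j in leaders):
            leaders.append(i)
    return leaders


# Hypothetical toy fingerprints.
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8}),
    frozenset({7, 8, 9}),
]
print(leader_pick(toy_fps, 0.6))
```

A lower similarity threshold means larger exclusion spheres and therefore fewer picks, so the number of leaders at a fixed threshold can serve as a crude diversity measure for a sample.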
#2022iccs Shows circle plot of protein target space. Shows plot of Nactives-Ninactives versus N compounds/target. Compounds are mostly actives in one section; compounds are mostly inactives in other section. Green region has balanced data.
#2022iccs Shows focus on kinases. Monoamine receptors (MRs), ARs, CCRs, SLC6. Are same trends shown? Used random split, but quite different results with temporal split (?).
#2022iccs Papyrus is a dataset with 60M bioactivities. It builds on the complementarity of data srcs. Quality of data is annotated. It ensures consistent chemical structure standardisation. Links to original and structural data. All open data.
https://twitter.com/ConferenceNoel/status/1536265695398350853
#2022iccs Patrick Penner (Hamburg) on Improving Torsion Library Patterns with SMARTScompare
Why classify torsions? Shows picture of ring systems being bent out of plane. Is this bad? How appropriate are these torsions?
First paper on this, 2013 (Scharfer, JMC). Then Guba (2016, JCIM), now Penner (2022, JCIM).
#2022iccs CSD + PDB mined for torsion patterns, expert annotation interprets statistics and writes patterns to place peaks, torsion analyser matches patterns...
#2022iccs Back to the example. Take CN bond, match it to a hierarchical library of SMARTS. Most general at the top - just lists the two atoms involved (C-N bond). More specific subclass further down, which are specific to anilines.
#2022iccs Below this are the torsion rules, which are associated with the statistics. The first pattern does not match, the second one does. Selective matching terminates when the first selective match matches. Unselective matching continues in order to generate the stats.
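The difference between selective and unselective matching can be sketched as follows. The real library matches SMARTS patterns; here plain Python predicates on a made-up bond description stand in for them, and the rule names, their ordering and the dictionary keys are all hypothetical.

```python
def first_match(rules, bond):
    """Selective matching: return the first rule that matches, then stop."""
    for name, predicate in rules:
        if predicate(bond):
            return name
    return None


def all_matches(rules, bond):
    """Unselective matching: collect every matching rule, e.g. to gather statistics."""
    return [name for name, predicate in rules if predicate(bond)]


# Hypothetical rules, ordered most specific first; a real library uses SMARTS here.
cn_rules = [
    ("aniline C-N, ortho-substituted", lambda b: b["aromatic_c"] and b["ortho_subst"]),
    ("aniline C-N", lambda b: b["aromatic_c"]),
    ("generic C-N", lambda b: True),
]
bond = {"aromatic_c": True, "ortho_subst": False}
print(first_match(cn_rules, bond))
print(all_matches(cn_rules, bond))
```

As in the example from the talk, the first rule does not match and the second one does, so selective matching stops there. The ordering matters: a generic rule placed too early would shadow the more specific rules after it.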
#2022iccs Shows binned histogram of the angles (2089 hits in CSD, 42 hits in PDB). Much clearer distribution in the CSD. High quality PDB hits only - so many ligands excluded.
#2022iccs Shows visualisation of the pattern. Includes H bond through space interaction. Includes whether particular atom is substituted or not (based on H count).
#2022iccs Hard to manage all these patterns. That's why Schmidt (2019, JCIM) developed SMARTScompare. [Ed: I remember this presented at the last ICCS] This describes the relationship between two SMARTS patterns.
#2022iccs SMARTScompare tells whether the SMARTS patterns are in a hierarchical relationship (more and more specific). Essential for this application.
#2022iccs These corrections may seem trivial but have a huge effect on the statistics. Shows example of benzamidine correction that changes the distribution completely.
#2022iccs Human curation v important to catch these sorts of cases. Shows another example of sorting torsion rules where a more generic pattern was placed between the more specific one. [Ed: I don't think SMARTScompare was needed for this one - could just look at no of matches]
#2022iccs Bringing it back to the torsion library now and how to use it. The Torsion pattern miner is the part needed to write your own library. To use it, can use the raw XML or the torsion analyser.
#2022iccs We have a web server (torsion tooling). Can select torsion and see the distribution. Can talk about deploying it locally also. This is a long-standing collab between Roche and Hamburg.
#2022iccs The command line version is part of NAOMI ChemBio suite. Some of the work on GitHub, github.com/rareylab.
https://github.com/rareylab
#2022iccs Q about automating gen of torsion rules. Hand curation is what sets this apart from e.g. Mogul. The reason we do this is to have recognisable/understandable chemistry; would not have the same if automated.
#2022iccs Q about plans to expand to cyclic bonds. We have a macrocycle conformer generator that breaks a bond and then uses these rules.
#2022iccs Q about matching/enumerating valence in patterns. We spend time standardising the input structures. Small bit of enumerating, but no simple definition on how they do this.
#2022iccs Q/comment about how the hierarchy of SMARTS is important in other applications (e.g. UNIFAC) and could be promoted for other applications.
https://twitter.com/ConferenceNoel/status/1536272413964607488
#2022iccs Ammar Ammar (Maastricht) on PSnpBind: A database of mutated binding site protein-ligand complexes constructed using a multithreaded virtual screening workflow
#2022iccs When new drugs are being developed clinical trials are used for safety. But once on the market, other problems can be found due to genetic variation, environmental exposures, age, etc.
#2022iccs Best case no side effects. Other cases are not effective, and/or side effects.
#2022iccs Natural variants due to SNPs which might change binding affinity. Have been studied, but all performed on small and limited datasets, e.g. one protein. No big db that captures a large landscape of these mutations that might be suitable for building ML models on.
#2022iccs Goal here is to build large dataset of binding site mutated PL complexes with wide coverage, scalable, accessible/reusable, and extensible to apply to new proteins.
#2022iccs Data sources: PDBbind, UniProtKB manually reviewed Natural variants, SIFTS (mapping from Uniprot entries to PDB), ChEMBL.
#2022iccs First step is to get data from PDBbind, filtering for res < 2.5A. Linked to UniProt IDs and select missense variants. Shows visual example. SIFTS provides the mapping from seq location to the PDB residue. Can select those in binding pocket.
#2022iccs 26 human proteins and 705 unique missense mutations in the binding sites. Introduce mutations and generate the structures with FoldX, Gromacs energy min, PDB ready for docking.
#2022iccs Ligand selection. ChEMBL similarity search against the ligands from those proteins. Select ligands with Tanimoto cutoff < 0.6, OpenBabel min with MMFF94 [ed. name is not Noel O, but N.M. O'Boyle!]
#2022iccs Set overlap with that set overlap visualisation thingy again (pinging @rguha again). Different mutation types, e.g. polar-to-non-polar, charged-to-neutral, etc. How many proteins include all of these classes or just some.
#2022iccs "Docking using docker". 640074 dockings used Kubernetes. 731 structures, and 32261 ligands. Code on Github (username ammar257ammar). Can be used on other infrastructures and parallelised.
#2022iccs Looking at docking performance. Takes about 1 min per docking on average. Duration vs num of torsional angles. Clear inc relationship, up to one hour for 32 torsion angles. Allocated 12 cores, but on average only 4 cores were used. Could have reduced down the number.
#2022iccs After constructing the dataset, everything is available for download and accessible via a webapp. PSnpBind interface (psnpbind.org). Can superimpose the wild type and mutated. Can download them. At bottom has a table with the docked ligands.
https://psnpbind.org
#2022iccs Information on similarity of each ligand to the one in PDBbind. How does the binding affinity change compared to the WT? REST API provided, also downloadable.
#2022iccs Limitations: this is an in silico approach. Rigid protein docking, flexible side chain, MD. Limited no. of proteins; would be good to increase the no using more datasets.
#2022iccs Could be used as a src of data to build ML model to predict mutation effect on binding affinity (currently ongoing). Published in J Cheminf 2022, 14, 8.
#2022iccs How much do SNP contribute to drug variants? No personal info, but there have been studies out there.
https://twitter.com/ConferenceNoel/status/1536279899832451073
#2022iccs Roger Sayle (NextMove Software) on Recent Advances in Chemical Search of Ultra-large Databases
#2022iccs Ultra-large databases are beginning to stress the tools we use. Enamine REAL: 4 years ago was 160 million. Now 4.6 billion. People in industry have to deal with 10K times more cmpds. Exponential growth (XKCD comic).
#2022iccs Solution 0: ignore the problem. Our company just can't handle it, let's just do it the normal way. But economics of these dbs are hard to ignore. Significant savings compared to making in-house. Competitive business, competitive advantage.
#2022iccs Case Study: Pickett et al, ACS Med Chem Lett 2011, 2, 28. MMP-12 library. Looking in Enamine REAL now there are 673 compounds in this range available for purchase that could be used to fill in SAR around that.
#2022iccs There is an interesting step function in cheminformatics. We can process molecules that fit in RAM (up to 100s of Millions) quickly; then we hit a bottleneck, and the performance falls off a cliff edge, "disgraceful degradation". In Memory vs On Disk.
#2022iccs There's also a cold start - the time required to do the first search. The cliff is caused by LRU caching - least recently used cache. Only works if it fits in memory. As soon as it's bigger, each new block pushes the old block out; if you read it twice, no benefit.
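The cliff can be reproduced with a toy LRU cache and repeated sequential scans. This is only an illustration of the caching argument, not a model of any real database; the function name and block counts are made up.

```python
from collections import OrderedDict


def scan_hit_rate(n_blocks: int, cache_blocks: int, passes: int = 2) -> float:
    """Hit rate over the final pass of repeated sequential scans through an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for _ in range(passes):
        hits = 0  # only the last pass is reported
        for block in range(n_blocks):
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / n_blocks


print(scan_hit_rate(100, 100))  # database fits in the cache
print(scan_hit_rate(101, 100))  # database is one block too big
```

With 100 blocks and room for 100, the second scan is served entirely from the cache. With 101 blocks, each newly read block evicts exactly the block that will be needed next, so the second scan gets no hits at all.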
#2022iccs If you are going to be reading from disk a large database, might as well spread across lots of machines with small RAM.
#2022iccs Von Neumann bottleneck. Solutions: 1. Alter in-core performance, 2. Move precipice (compression, more RAM), 3. Alter gradient, 4. Alter load time, 5. Alter on-disk performance (faster disk).
#2022iccs Same cliff affects every big data problem. On to "Hardware vs software". Spend more money? 1, 2 and 5 can be improved. 4. Load time can be improved by pre-touching the database and having dedicated servers.
#2022iccs Software improvements: efficient representations; size of fp has a huge effect (i.e. 256 much faster than 1024). Decompression load. Live on-disk representations. Pruning, sublinearity and locality.
#2022iccs Computer Hardware 2022. Everything pretty much the same as 2020. Biggest change is NVME, can get 32TB vs 2TB. GPUs have ~3x bandwidth, but <10% capacity at ~ same price.
#2022iccs Memory bandwidth. For memory bound applications, neither the CPU clock speed nor the core count are particularly meaningful. It's the speed at which the memory works (DDR4 - 3.2GHz). Most important is the no of memory channels supported (hexa-channel vs monochannel)
#2022iccs Federating servers and the cloud. Round table server. Different functionality can be split over different servers. Large dbs can be sharded over multiple servers. Or combinations of the above.
#2022iccs NUMA memory architecture. Speed of accessing memory on different CPUs depending on the architecture (numactl -H). Treat a single machine as a network. Lock this process onto processor 1; lock that process onto processor 2.
#2022iccs A single PCI card loaded with 7x8TB NVME drives provides 56TB of fast storage in a regular desktop. GPUs don't move the needle. An efficient GPU implementation of sim search, but only for sufficiently small dbs (where it fits in the GB memory).
#2022iccs A large 48 GB GPU can hold 375M 1024-bit fps, which a CPU can search in less than a second. Searches that take less than a second are not the problem.
#2022iccs Talking about efficient representation. 1 billion molecules requires 34GB for 256-bit fingerprint (252GB for 2048 fp). But what is the impact on the Briem & Lessel benchmark? Performance falls off below 256 bits.
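A back-of-the-envelope check of these storage figures, counting only the raw fingerprint bits in decimal gigabytes. The 34GB and 252GB quoted in the talk differ slightly from the raw payload computed here; the notes do not say how they were accounted for.

```python
def fp_storage_gb(n_molecules: int, n_bits: int) -> float:
    """Raw fingerprint payload in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_molecules * (n_bits // 8) / 1e9


print(fp_storage_gb(375_000_000, 1024))    # 48.0
print(fp_storage_gb(1_000_000_000, 256))   # 32.0
print(fp_storage_gb(1_000_000_000, 2048))  # 256.0
```

The first line matches the 375M 1024-bit fingerprints in 48 GB mentioned earlier, and the eightfold gap between 256 and 2048 bits is why fingerprint size dominates whether a billion-molecule database fits in memory.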
#2022iccs We use a special-purpose live on-disk molecule representation. Used 4 bytes per atom, 6 bytes per bond in 2018. Now 2 per atom and 2 per bond. Working on even shorter. Currently <2x uncompressed SMILES.
#2022iccs SmallWorld index size is now at 9.98 trillion edges, but advances in representation mean it has gotten smaller over time since 2019.
#2022iccs Some of the tricks used. Multi-gram compression of SMILES. Placed in buckets based on their length, can leave out the null saving space. Multiplicable binary search.
#2022iccs Sub-linearity. Getting technical now, very technical. Locality sensitive hashing, skip buckets. Baldi bounds not particularly effective especially on ultralarge dbs.
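For context, the popcount bound referred to here says that the Tanimoto similarity of two fingerprints with a and b bits set can never exceed min(a, b) / max(a, b), so some candidates can be skipped without a full comparison. A minimal sketch, using Python integers as bit vectors; the function names and toy fingerprints are illustrative.

```python
def popcount(fp: int) -> int:
    """Number of set bits in an integer used as a bit vector."""
    return bin(fp).count("1")


def max_tanimoto(a: int, b: int) -> float:
    """Upper bound on Tanimoto from popcounts alone: min(a, b) / max(a, b)."""
    return min(a, b) / max(a, b) if max(a, b) else 1.0


def search(query: int, db: list, threshold: float) -> list:
    """Return (index, similarity) hits, skipping fingerprints the bound rules out."""
    qa = popcount(query)
    hits = []
    for i, fp in enumerate(db):
        if max_tanimoto(qa, popcount(fp)) < threshold:
            continue  # cannot reach the threshold, skip the full comparison
        union = popcount(query | fp)
        sim = popcount(query & fp) / union if union else 1.0
        if sim >= threshold:
            hits.append((i, sim))
    return hits


toy_db = [0b1111, 0b0111, 0b0001, 0b11111111]
print(search(0b1111, toy_db, 0.7))
```

In this toy case the bound prunes the last two fingerprints. The pruning only helps when popcounts vary widely relative to the threshold, which is one plausible reading of why it is described as not particularly effective on ultra-large databases.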
#2022iccs Describes partitioning of chemical space in SmallWorld index. Enamine REAL very different to WuXi GalaXi. WuXi far more diverse, but also much higher molecular weight.
Monday afternoon - Structure-Activity and Structure-Property Prediction
--------------------------------------------
https://twitter.com/ConferenceNoel/status/1536302780981329920
#2022iccs Moritz Walter (Sheffield) on Chemical feature visualization to interpret neural network models for toxicity prediction
#2022iccs Feature visualisation with chemicals. ANN/DNN model - hidden layer neurons learn repr of data suitable to solve supervised task. Aim is to find chemical features detected in neurons.
#2022iccs For example, with mutagenicity PAH neurons or aromatic nitro neurons might be activated.
#2022iccs I've developed an automatic workflow to detect which chemical features are detected in neurons. Take training cmpds with high activation and input fp bits with high weight. Use Formal Concept Analysis (FCA) to id combos of cmpds and FP bits.
#2022iccs For the formal concept analysis, the weights need to be above a certain threshold. These chemical substructures are extracted if associated with neuron activation. Based around the 1st hidden layer. Working on additional layers at the moment.
#2022iccs Fragments are stored in a hierarchical tree structure. X is substructure of Y. Tree construction follows SOHN algorithm (Hanser et al 2014).
#2022iccs Once the substructures have been extracted, how to use them to explain predictions? Use a technique called IG (integrated gradients) on hidden neurons. Determine importance of neuron for individual prediction.
#2022iccs Second step is to map the neuron importance onto structure. Find most specific matching substructure in trees. Share attribution between atoms of substructure. Neuron 0 (0.32) + neuron 1 (0.22) gives the overall result.
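One way to picture the mapping step, assuming each neuron's importance is split equally over the atoms of its matched substructure and contributions from different neurons are summed per atom. The equal split and the atom indices are assumptions for illustration; the importance values 0.32 and 0.22 are the ones from the slide.

```python
from collections import defaultdict


def atom_attributions(neuron_importance: dict, neuron_atoms: dict) -> dict:
    """Split each neuron's importance equally over its matched atoms and sum per atom."""
    totals = defaultdict(float)
    for neuron, importance in neuron_importance.items():
        atoms = neuron_atoms.get(neuron, [])
        if not atoms:
            continue  # no matching substructure for this neuron
        share = importance / len(atoms)
        for atom in atoms:
            totals[atom] += share
    return dict(totals)


importance = {0: 0.32, 1: 0.22}    # neuron -> importance for this prediction
matched = {0: [0, 1], 1: [1, 2]}   # neuron -> atom indices of matched substructure (hypothetical)
print(atom_attributions(importance, matched))
```

Atom 1 is covered by both substructures and so receives a share from each neuron, while the per-atom values still sum to the total neuron importance.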
#2022iccs Neural network model - dataset on Ames mutagenicity (8K, from multiple sources). Used Derek expert system to label cmpds (alerts). Model: 1 hidden layer (512 neurons). Used Morgan FP, radius 1. High performance on test set: ACC:0.91.
#2022iccs Used attribution AUC to evaluate. Compares atom attributions to ground truth for a single cmpd. Alert atoms versus model attributions. Red "contributing to toxic pred", blue "contrib to non-toxic pred".
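A sketch of how such an attribution AUC could be computed for one compound, treating the alert atoms as positives and using the rank interpretation of AUC (the probability that an alert atom gets a higher attribution than a non-alert atom). Whether the talk used exactly this formulation is an assumption, and the scores below are made up.

```python
def attribution_auc(scores: list, is_alert_atom: list) -> float:
    """AUC as the probability that an alert atom outscores a non-alert atom (ties count half)."""
    pos = [s for s, flag in zip(scores, is_alert_atom) if flag]
    neg = [s for s, flag in zip(scores, is_alert_atom) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one alert atom and one non-alert atom")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-atom attributions; the first two atoms belong to the alert.
print(attribution_auc([0.9, 0.8, 0.1, 0.2], [True, True, False, False]))
print(attribution_auc([0.9, 0.1, 0.8, 0.2], [True, True, False, False]))
```

A value of 1.0 means every alert atom is ranked above every other atom, and 0.5 is what random attributions would give on average.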
#2022iccs Also cases where it didn't work so well. Evaluation focussed on AUCs for TP cmpds, and also computing the average AUCs for cmpds matching a given alert.
#2022iccs Median AUC for IG input is 0.964. AUC >=0.8 is 255/306. Sometimes the hidden neurons give a better explanation. For a particular alert, AUC is around 0.9.
#2022iccs For the alerts that are most frequent, it works quite well (except for PAHs). Shows example where the alert has aromatic N, but the IG hidden result includes the entire aromatic ring - this seems reasonable, and is sensitive to how the alert was defined exactly.
#2022iccs Example where new method failed. Rare alerts make it difficult for the software to extract the right substructure.
#2022iccs Have devised method to vis chemical features learned in hidden layers. The extracted fragments can be used to interpret neural network model. Fails on rare alerts. Superior to IG on input features in many cases for larger alert structures.
https://twitter.com/ConferenceNoel/status/1536309618174984192
#2022iccs Eva Nittinger (AZ) on The Influence of Nonadditivity on Machine Learning and Deep Learning Models
#2022iccs Where does non-additivity occur? A lot of our tools are based around the assumption of linearity and additivity in the chemical space. But can have small change with activity cliff, or cmpd flipping around.
#2022iccs With two cmpds related by two MP transformations, can calculate the degree of non-additivity by building a square. If additive would expect diff to be zero.
#2022iccs Non-additivity analysis (NAA) can be applied to AZ and ChEMBL assay data. Can we predict non-additivity. GitHub code by C Kramer based on MMPA code by A Dalke.
#2022iccs Experimental uncertainty needs to be taken into a/c. Formula showing the estimate of noise across 4 measurements. 0.5 log units applied for public data, 0.2-0.3 for internal data.
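The square and its noise estimate can be written out as below. The corner labelling is an assumption (the slide's formula is not in these notes): compounds 1 to 2 and 4 to 3 differ by the same transformation. The noise term is plain error propagation for four independent measurements that each have standard deviation sigma.

```python
import math


def nonadditivity(p1: float, p2: float, p3: float, p4: float) -> float:
    """Double-transformation cycle: 1->2 and 4->3 are the same transformation.
    Returns zero when the two transformations are perfectly additive."""
    return (p2 - p1) - (p3 - p4)


def cycle_noise(sigma: float) -> float:
    """Std. dev. of the nonadditivity value if all four measurements
    carry independent error with standard deviation sigma."""
    return math.sqrt(4) * sigma


# Hypothetical pActivity values for the four corners of a square.
print(nonadditivity(6.0, 7.0, 7.5, 6.5))  # additive case
print(nonadditivity(6.0, 7.0, 9.0, 6.5))  # non-additive case
print(cycle_noise(0.5))
```

Because four measurements enter each cycle, an assay uncertainty of 0.5 log units already implies a spread of 1.0 log unit in the nonadditivity value from noise alone, which is why the uncertainty has to be taken into account before calling a cycle non-additive.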
#2022iccs NAA results. How do inhouse and public data compare? How often does it occur? How does it influence ML?
#2022iccs Shows histogram of NAA for AZ and ChEMBL; these are non-normal distribs; kurtosis 3.1 for AZ, 4.5 for ChEMBL. For in-house data 1 out of 4 tests shows strong NA. For public data only 1 out of 10 tests shows strong NA.
#2022iccs For cmpds, 9% vs 5% of the AZ/ChEMBL cmpds show NA. AZ data indicates that nonlinearity frequently occurs in assays. Should be examined carefully. Why less in ChEMBL? Diff cut-off being used. Fewer cmpds in assays; fewer matched squares. Publication bias; less -ve data.
#2022iccs Optuna framework for automatic extensive hyper parameter optimization. SVM and RF as robust baseline models. 500 trial runs with 5-fold CV. Effect of NA on ML models. The NA test set can not be predicted; the additive data can be predicted fairly well.
#2022iccs Consistent drop in performance (r2 and RMSE) across all datasets. Even when training data had NA data included.
#2022iccs Second part of talk based on MMPA results. What about exptal uncertainty for physchem props? How reliable? To estimate we use the in-house data and bin by the no of measurements available.
#2022iccs Standard deviation (SD) of assay is nonlinear. Careful examination of error estimate needed. Models for all assays can achieve an R2 of > 0.9. (based on Bob Sheridan, JCIM 2020).
#2022iccs MMPs should be the easiest changes and should be predictable. 4 datasets: all data, MMPs, additive MMPs, NA MMPs. Bunch of different methods. In total 112 models.
#2022iccs 3-fold CV on training to avoid overfitting. Then selection of best params and retraining on full dataset. For DNN-message passing network, it was a hyperparameter optimisation using Bayesian...
#2022iccs DL models gave best results in general. There is a drop in performance for NA datasets, but not so much for logD. Consistently better RMSE and R2 for additive data versus NA.
#2022iccs It's important to detect NA data. There is a signif amount of it. NA data cannot be easily predicted in ML models, even DL models.
https://twitter.com/ConferenceNoel/status/1536317518759055363
#2022iccs Chris Southan (Medicines Discovery Catapult) on Challenges of tracking SARS Cov-2 M-protease inhibitors from patents
#2022iccs M-protease has a statue erected to it in the Singapore Biopolis. Work was done on this by Hilgenfeld back in 2002-2003. Against SARS-CoV. Got to Phase 1 as an intravenous drug. Pfizer. They had grant applications to carry on, but it was never funded
#2022iccs GuideToPharm (GtoPDb) and BindingDB (BDB) have a SARS Cov-2 collaboration. GtoPDB is about stringent activity data curation of lead cmpds. BDB is about complete SAR sets from papers and patents. The two teams traded patents, papers, preprints and structures for rapidity
#2022iccs These efforts have resulted in deep coverage of med chem data directed against M-pro and other targets. Includes large, patent only, sets that can be downloaded and explored by the community. GtoPDB curated M-pro inhibitor page. 86 entries with 56 distinct ligands.
#2022iccs PubChem mappings for GtoPdb show that only half of them are in ChEMBL. Looking at BDB, there are 91 docs with > 7038 data points.
#2022iccs How done? WIPO, PubMed, biorxiv, Twitter, LinkedIn, pharma press releases, proposed INN and 'COVID Special' lists. PubChem direct monitoring. Wrangling structures: OPSIN, OSRA, sketcher or CWUs if US patent.
#2022iccs FAIR-ness is usually low except for JMC which has a csv of SMILES.
#2022iccs Via @dragon38073853 (Molecular Hunter) Chris found a recent Pfizer patent compound. Was quickly added to GtoPDB and BindingDB.
#2022iccs Grappling with the usual challenges. Since 2015, BDB has exploited USPTO xml text, CWU structures and example numbers. However, first-published SARS-Cov-2 WO PDFs are more difficult. Questionable value of low-potency reports.
#2022iccs Publication intervals between patents and papers are highly variable and papers rarely cite patents. The polyprotein has a single Uniprot entry which confounds things for PubChem, ChEMBL. GtoPDB gets around this by referencing the specific subsequence.
#2022iccs Nice example from Roche, WO2022043374. Not published yet. Lots of compounds. Well exemplified. Interesting isomeric splits. Most potent example in GtoPDB; the rest in BDB. Shionogi in PubChem - ensitrelvir. Patent on its way? Chemical suppliers are quick off the mark.
#2022iccs Desperately seeking fresh M-pro patents using WIPO patentscope. Query something like C07 AND not antibodies and NOT vaccine and not antibody and SARS-COV-2.
#2022iccs Waiting for Heptares SH-879 structure and data (paper and/or patent would do). Google patents says it's published but you can't actually find the structure. Trying to get the structure from @baoilleach on a beer mat.
#2022iccs Curating inhibitors for other targets (e.g. PL-protease and helicase) could enable cocktailing. COVID moonshot and BIH antiviral efforts will yield more info. Expect a lot more patents and publications this year.
https://twitter.com/ConferenceNoel/status/1536325041574535170
#2022iccs Valerij Talagayev (Berlin) on An innovative approach of Toll-like receptor dynamics exploitation for structure optimization through 3D pharmacophore analysis
#2022iccs Toll-like receptors are the first line of defense in the human body. They have an interesting structure. Bacteria, fungus, parasite and virus are recognised by TLR. E.g. SARS-CoV-2. Innate immune system activation leading to cytokines and type 1 interferon, interleukins
#2022iccs Two types separated by their location. Cell membrane TLRs, and intracellular TLRs. Looking at TLR8. We can modulate by creating agonists for treatment of cancer. Or antagonist for treatment of arthritis, lupus and COVID19.
#2022iccs No xtal structure at the start for antag. Used agonist structure for modelling: 3W3J PDB. Used pharmacophore model. Two hydrophobic areas, an aromatic, two HBA and a HBD. Found 22328 hits. Did two rounds of docking; fast settings (HTS) gave 4672 cmpds, then slow (GOLD).
#2022iccs Energy min with MMFF94 gave 330 cmpds. Needed to make final selection based on 1) amount of interactions, 2) physicochemical props, 3) novelty. Gave 9 cmpds, which were tested via inhib of TLR8-mediated signaling in HEK293T cells. Gave antag with IC50 of 9.2uM.
#2022iccs Next step to do sim searching with ROCS to see can find additional antags. 5 cmpds were exptally tested. IC50 of 35.5 and 20.0 [ed. they are v. similar]. How to inc potency.
#2022iccs Synthesised analogs by replacing R groups. Got IC50 to 2.6uM. Now showing synthesis scheme. After all tested, did docking on them to obtain docking pose. By this time there was a known xtal structure (5WYZ). Did MD and got dynophores (prob densities of interactions).
#2022iccs Dynophore app was created by @DominiqueSydow . A tool to analyse interactions. The bars show how often an interaction appears in MD simulation frames.
#2022iccs Dynophores give us the consistency of interactions based on probability during multiple MD simulations (replicates). They also give us the stability of interactions (appears in every frame vs not).
#2022iccs Employment of dynophore features in drug opt. Inc stab and consistency of interactions; find additional interactions to inc the potency. E.g. two cmpds that look v similar. During docking, both cmpds have same interactions...
#2022iccs ...but in dynophores consistency of 76% versus 46% for a key interaction.
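[ed. A minimal sketch of the "how often does an interaction appear across MD frames and replicates" idea. This is not the Dynophore app's API; the function names and the toy per-frame data are made up for illustration.]

```python
# Toy data: one list of booleans per MD replicate; True means the
# interaction is present in that frame. All values are invented.

def interaction_frequency(frames):
    """Fraction of frames in which the interaction is present."""
    return sum(frames) / len(frames)


def summarise(replicates):
    """Per-replicate frequencies, their mean, and their max-min spread."""
    freqs = [interaction_frequency(r) for r in replicates]
    mean = sum(freqs) / len(freqs)
    spread = max(freqs) - min(freqs)
    return freqs, mean, spread


replicates = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
freqs, mean, spread = summarise(replicates)
print(freqs, mean, spread)
```

[ed. A high mean with a small spread across replicates would correspond to a "consistent" interaction in the sense used in the talk.]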
Tuesday morning - Dealing with Biological Complexity
https://twitter.com/ConferenceNoel/status/1536597780260593664
#2022ICCS Denise Slenter (Maastricht) on A Systems Biology Workflow to Support the Diagnosis of Pyrimidine and Urea Cycle Disorders
#2022ICCS Everything is on Chemrxiv, Github, etc. We looked at two well described groups of disorders. No treatment available. Hard to diagnose. Let's use systems biology.
#2022ICCS These are inherited metabolic disorders (IMDs). Substrate accumulates and causes toxicity as the enzyme is no longer converting it to product. Neonatal heel prick screening finds 25 diseases (20 IMDs), all treatable if discovered early.
#2022ICCS But over 1500 IMDs exist. Heel prick is being updated once a year. Another way to check is via the non-invasive prenatal test (NIPT) during pregnancy. Currently in trial phase in NL. Also prenatal screening for carriers before pregnancy.
#2022ICCS Volendam and Jewish community - screening for specific disorders. Postnatal screening with Quantitative UPLC-MSMS. Focus on amino-acid panel, PUPY panel.
#2022ICCS Rare but deadly - pyrimidine and urea cycle disorders. Nucleic acid synthesis breakdown is the source for thymine, cytosine and uracil. Prevalence and mortality are unclear. Partial deficiencies exist and limited treatment options available.
#2022ICCS Urea cycle disorders. 1 in 35K. 66% of symptoms after newborn period, mortality is severe. Cycle removes toxic nitrogenous cmpds, and excretes urea. Limited treatment options available.
#2022ICCS Common denominator is carbamoyl phosphate. This complicates diagnosis. Also nonspecific clinical symptoms.
#2022ICCS The link to the phenotype from the genotype is unclear. The metabolome is the closest link, but is unreliable for partial activity or carriers. For newborn screening there are 50 FP for each TP. If we could measure the fluxes in these pathways, this would be much better
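[ed. A quick worked number for the "50 FP for each TP" remark: that ratio corresponds to a positive predictive value of about 2%. The function name is mine, not from the talk.]

```python
def positive_predictive_value(true_pos, false_pos):
    """PPV = TP / (TP + FP): the chance that a positive screen is a real case."""
    return true_pos / (true_pos + false_pos)


# 50 false positives for every true positive, as quoted in the talk.
ppv = positive_predictive_value(true_pos=1, false_pos=50)
print(f"PPV = {ppv:.1%}")
```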
#2022ICCS The aim was to design a sys bio workflow to support diagnosis. Got data for 22 patients - clinical biomarker data. PUPY panel and AA panel. Had 5K PUPY ref data, 2K AA ref data. Need to standardise the names of metabolites (e.g. many in Dutch).
#2022ICCS For each age category we had at least 1 disorder, for some disorders had only a single patient. Created pathway model with PathVisio. Created purine, pyrimidine, urea cycle and biomarker models.
#2022ICCS With Rhea, there are differences in protonation states, so the link from the data to Rhea is tricky. OMIM merges data on the gene level, but there are cases with different phenotypes that need to be kept separate.
#2022ICCS ID mapping was done manually to avoid problems with automated mapping. BridgeDb can do the mapping, but relies on others doing the correct mapping.
#2022ICCS IEMbase is a db with disorders based on HGNC gene name. But there is no machine-readable download. Needed to retype it. The biomarker changes were relative (up or down) but no link to the actual data or refs.
#2022ICCS Showing heatmap of biomarker overlap for 3 IMD types for patient 1. Indicates which diseases are linked to which observed biomarkers.
#2022ICCS Used cytoscape in an automated way (via R), and a wikipathways app. Here's an example of an easy to diagnose patient. Red color indicating high fold change. Downstream metabolites not found in high concs. Highlights blind spots in diagnosis.
#2022ICCS Another example harder to diagnose. OTCD. Shows overview of how reproducible the clinical diagnosis was compared to her workflow. Nothing new found but does quite well. Real question is how does it work on less well characterised diseases.
#2022ICCS Request for data sharing in a format that is easier to reuse. e.g. identifiers not labels. Model not picture. See Poster 4 and 30!
https://twitter.com/ConferenceNoel/status/1536606392680861697
#2022ICCS Eugene Muratov (Chapel Hill) on Modeling, Proper Validation, and Discovery of Synergistic Drug Combinations
#2022ICCS Three talks in one. Modeling, proper validation and discovery. Brief interlude on catching Marc Nicklaus spying on posters. :-)
#2022ICCS The goal of the discovery part of the talk is to emphasise the combination therapy as a treatment against COVID-19. One of first papers in that area. To inc the chance of having synergy you need to target diff stages of the viral lifecycle.
#2022ICCS The study design is shown. 76 cmpds were found after (ROBOKOP, QSAR, Chemotext) leading to 3K combinations. After more analysis 74 combinations were chosen. 16 combinations were found to be synergistic. 8 antagonistic.
#2022ICCS QSAR modeling of Mpro inhibition. In 2016 they built a model for drug-drug interactions. Use descriptor for drug-drug interactions. For mixtures, you need a diff type of validation. Muratov E Clin Pharm 2014, 1, 1005.
#2022ICCS Have now developed a validation method for mixtures of any number. "Chemotext" - a publicly available web server for ....
#2022ICCS Robokop is reasoning over biomedical objects connected by knowledge graphs. Converted it to COVID-KOP with a focus on Coronavirus.
#2022ICCS For a synergistic effect it is good to target diff stages of the viral life cycle. Possibly good also to hit viral target and host target but no-one knows much about the host targets.
#2022ICCS Describes the assays, and the testing of synergism/antagonism across 73 tested combos. Shows heptagonal polygonogram depicting the most important binary combos tested in the study. The activity (syn or antag) sometimes depended on the conc.
#2022ICCS The opportunities of synergistic antiviral action of drugs for battling SARS-Cov2 are still underestimated. We have shown that mining and models can be used for this. We ided 16 synergistic combos and 8 antags.
#2022ICCS Q about why synergy when targeting different targets rather than additive effects? That's what I would have thought but it's well studied that this is the case. Not sure why.
https://twitter.com/ConferenceNoel/status/1536613673787183104
#2022ICCS Inbal Tuvi-Arad (Open Uni of Israel) on Conformational Chirality and Protein Structure Analysis
#2022ICCS Many proteins are symmetric homomers with lots of beautiful shape. SARS-Cov2 spike protein over 24000 atoms with C3 symmetry. Symmetry has been related to inc stability, efficient oligomerisation, reduction of synthetic errors. It must be a strong driving force.
#2022ICCS Perfect symmetry is rare. Conformational similarity of matching residues. But distortion is caused by chain translation/rotation and local conf changes. We need a structural desc to quantify the level of distortion.
#2022ICCS 30 years ago the idea of a continuous symmetry measure was developed. Determine the distance of a given structure from its nearest symmetric structure. Gives a num between 0 (symmetric) and 100 (complete asymmetry).
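[ed. A toy sketch of the continuous symmetry measure for the simplest case I could write down: C2 symmetry of a 2D point set, with the atom pairing given up front. It assumes the usual normalisation S = 100 * sum|Q_k - P_k|^2 / sum|Q_k - Q_0|^2 (Q = original coords, P = nearest symmetric coords, Q_0 = centroid). The real method also searches over permutations and axis orientations, which this does not do; the function name is hypothetical.]

```python
def csm_c2_2d(points, pairs):
    """S(C2) for 2D points under a fixed pairing.

    points: list of (x, y) coordinates.
    pairs:  pairs[k] is the index of the atom that atom k is mapped onto
            by the 180 degree rotation about the centroid.
    Returns a value between 0 (perfectly C2-symmetric) and 100.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    q = [(x - cx, y - cy) for x, y in points]  # centre on the centroid

    num = 0.0
    for k, (x, y) in enumerate(q):
        px, py = q[pairs[k]]
        # Rotating the partner by 180 degrees maps (px, py) to (-px, -py).
        # Averaging with atom k gives its nearest symmetric position.
        sx, sy = (x - px) / 2, (y - py) / 2
        num += (x - sx) ** 2 + (y - sy) ** 2
    den = sum(x * x + y * y for x, y in q)
    return 100.0 * num / den


pairs = [1, 0, 3, 2]
rectangle = [(1.0, 0.5), (-1.0, -0.5), (1.0, -0.5), (-1.0, 0.5)]
distorted = [(1.5, 0.5), (-1.0, -0.5), (1.0, -0.5), (-1.0, 0.5)]
s_rect = csm_c2_2d(rectangle, pairs)
s_dist = csm_c2_2d(distorted, pairs)
print(s_rect, s_dist)
```

[ed. The rectangle is exactly C2-symmetric so it scores 0; nudging one corner gives a small positive value.]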
#2022ICCS Using this measure we can define a continuous chirality measure. Molecules that lack Sn symmetry are chiral. The nearest symmetric/achiral structure need not be a real molecule but must have the same connectivity and no of atoms.
#2022ICCS We can use this to analyse levels of symm and distortion. Reveal and follow structural trends. Website called CoSyM. csm.ouproj.org.il. Spike protein is too big but smaller proteins fine. Also have GitHub page, but code not yet all there but working on it.
#2022ICCS Shows histo of 565 symmetric homodimers taken from X-ray, with a variety of S(C2). Similar for homotrimers. Similar histogram for NMR structures but level of symmetry is an order of magnitude higher.
#2022ICCS Now that we know that proteins are not all equally symmetric, we look in more detail. Ramachandran plot. Gly is chiral, no perfectly achiral Gly. Highest chirality in alpha helix. Each Gly subunit has an approx enantiomer. The plot has a nearly 2D inversion symmetry.
#2022ICCS The chirality level on the backbone is mostly unaffected by the chirality of the side chain. The maximum level is independent of the secondary structure though alpha helix has a much higher mean. Some outliers. Investigated.
#2022ICCS There's a particular expected range for beta sheet versus alpha helix. We came up with a protein chirality spectrum. Plotted the values along the sequence. Joint point, alpha-link and beta-twist can be observed.
#2022ICCS Outliers in chiral Ramachandran plots are special transition points along the sequence.
#2022ICCS Back to symmetry of homodimers. If these are to be perfectly symmetric, the confs of the res on both chains should be the same. Plotted them against each other. The conformation of a res on one chain can be quite diff from its paired res on the other chain.
#2022ICCS Plotting the distortion along the sequence, the distortions are spread throughout the sequence. Came up with a distorting tendency scale, measuring propensity to be in the top 5% of most distorted residues. The longer sidechains are more likely to be distorted.
#2022ICCS Looking at Spike Protein of SARS Cov-2. As part of the infection process, one (or two) RBD domains move upwards, causing a loss of symmetry in the RBD domain. The rest of the protein remains highly symmetric.
#2022ICCS What is the structural change from the pt of view of symmetry? S(C3) with 3-down was 0.38. For 1-Up, value was 12.16. The peaks of distortion are in the same place for the two conformers.
#2022ICCS CSM and CCM are robust and versatile molecular descriptors. Can be applied to small molecules. Residue distorting tendency incs with the size and polarity of residues.
#2022ICCS Different tools have been shown with different applications. Prof David Avnir is the 'father' of this method.
Tuesday midmorning - Structure-based approaches
https://twitter.com/ConferenceNoel/status/1536627261478158336
#2022ICCS Marina Gorostiola González (Leiden) on Conformational Chirality and Protein Structure Analysis
#2022ICCS Sorry....on "Describing protein dynamics for proteochemometric bioactivity prediction: 3DDPDs"
#2022ICCS If you have a target with bioact data on a no of compounds, you might use QSAR for this. Can describe molecules with physicochemical descs, etc. Combine target info and ligand info to create proteochemometric model (PCM).
#2022ICCS While protein seq is very informative, it's really more than a string. It has a 3D structure, and that structure moves, and each residue has a mean fluctuation. This info is not currently captured in any descriptor...hence our descriptors.
#2022ICCS Case study: GPCRs (ed: yay!). Targeted by ~30% of approved drugs. Exist in a dynamic conformational equil. MD and FEP have been successful in GPCR drug discovery campaigns. GPCRMD is a community-driven effort to map the dynamics of all human GPCRs.
#2022ICCS Have high quality MD of around 250-500 ns for ~45.7% of GPCRs with known structure. A preliminary look at these trajectories showed many similarities but also different dynamic patterns among GPCR subfamily members (e.g. nucleotide receptors like aa1r, p2ry1).
#2022ICCS We collected data from the Papyrus set for the 26 targets for which we had MD data. Compound descriptors were calcd, protein descs. Training and test set split, XGBoost with 5-fold CV to create PCM model.
#2022ICCS 3DDPDs can be residue (rs) or protein (ps) specific. PCA applied to coordinate info for each atom, including partial charges (missed details). Data aggregated per residue. Then a second PCA of that data, which was brought forward to a multiple seq alignment (MSA).
#2022ICCS Benchmarked to determine how many frames to include, whether to use SD, etc, whether to include all atoms, or just heavy atoms, how many residues to be described in the descriptor. Found that using the full seq was better for the rs, but subfamily specific better for ps
#2022ICCS 5 PCs were used covering 95/99% of the variance. The descriptors included were based around coordinates, partial charges, and topology.
#2022ICCS The residue-specific (rs) 3DDPDs reflect the GPCR dynamic fluctuations. The 'ps' version captures global trends compared to rs. van Westen (J Cheminf 2014). 3DDPDs match best performing classical protein descriptors. 3DDPDs are complementary to classical descriptors
#2022ICCS The rs-3DDPDs can be traced back to specific protein locations, so can see importance of particular residues. It was found that 3DDPDs can inc perf in targets with less data. In summary, these descriptors capture the dynamic fluctuations of GPCRs.
#2022ICCS This work is currently limited to GPCRs, but in future it can be expanded to all proteins.
https://twitter.com/ConferenceNoel/status/1536634532543311872
#2022ICCS David LeBard (OpenEye) on Mechanism of passive membrane permeability from weighted ensemble simulations in the cloud
#2022ICCS Have developed a new model for membrane permeability that's quite a departure from the field.
#2022ICCS If you're a drug like mol, you have to cross many diff membrane barriers. If you take an oral inhibitor, it will be absorbed by small intestine, maybe into kidney cells, or blood capillary wall, or blood-brain barrier, or liver cell membrane.
#2022ICCS Small hydrophobic molecules will diffuse through a membrane. If small uncharged polar, most will pass. If large uncharged polar, then most won't get thru (e.g. glucose). Ions - no way. Units are cm/s; range from 10^-2 to 10^-4 cm/s for drug-like.
#2022ICCS How to describe mem perm. Pm (Perm coeff) from Fick's 1st law of diffusion. Donor compartment -> Acceptor compartment. Net flux is how many mols make it. Jm = Pm(Cd - Ca).
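[ed. The flux relation Jm = Pm(Cd - Ca) as a tiny worked example. The numbers are invented; with Pm in cm/s and concentrations in mol/cm^3 the flux comes out in mol/(cm^2 s).]

```python
def net_flux(p_m, c_donor, c_acceptor):
    """Net flux Jm = Pm * (Cd - Ca) across the membrane."""
    return p_m * (c_donor - c_acceptor)


# Illustrative values only: Pm = 1e-3 cm/s, Cd = 2e-6 and Ca = 5e-7 mol/cm^3.
j_m = net_flux(p_m=1e-3, c_donor=2e-6, c_acceptor=5e-7)
print(j_m)  # mol / (cm^2 s)
```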
#2022ICCS Overton's Rule (1895). Pm proportional to greasiness. 1960s gave us homogeneous solubility-diffusion. P = .... , experimentally verified over six log units. QSPR and ML models (2000s). Pm = fn of stuff (e.g. PSA, HB count, MW, LogP). Great for ranking but no insight.
#2022ICCS In vitro perm measurements exist. (1) Immobilized artificial membrane HPLC (2) Cell layer assays (Caco-2 or MDCK). Knock out efflux pump so can measure passive. (3) PAMPA - parallel artif mem perm assay. Give an estimate but lack any mechanistic info.
#2022ICCS Thermodyn-based perm from MD. Inv of Pm can be related to free E profile across memb and diffusion prof across memb. Gives you an idea where you can find a molecule and what's slowing it down. Gives some info but not enough mechanistic info for SBDD.
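[ed. A sketch of the thermodynamic route, assuming the standard inhomogeneous solubility-diffusion form 1/Pm = integral over z of exp(dG(z)/kT) / D(z). The slide did not spell out the formula, so treat this as my reading of it. The profile values below are invented and the function name is hypothetical.]

```python
import math

KT_KCAL_PER_MOL = 0.592  # approx. kT at 298 K


def permeability_isd(z, delta_g, diff, kt=KT_KCAL_PER_MOL):
    """Pm from a free energy profile and a diffusion profile.

    z:       positions across the membrane in cm (ascending)
    delta_g: free energy relative to bulk water at each z, kcal/mol
    diff:    local diffusion coefficient at each z, cm^2/s
    Uses the trapezoid rule for the resistance integral; returns cm/s.
    """
    resistance = 0.0
    for i in range(len(z) - 1):
        f0 = math.exp(delta_g[i] / kt) / diff[i]
        f1 = math.exp(delta_g[i + 1] / kt) / diff[i + 1]
        resistance += 0.5 * (f0 + f1) * (z[i + 1] - z[i])
    return 1.0 / resistance


# A 4 nm slab (4e-7 cm) sampled at three points, D = 1e-5 cm^2/s throughout.
z = [0.0, 2e-7, 4e-7]
d = [1e-5, 1e-5, 1e-5]
p_flat = permeability_isd(z, [0.0, 0.0, 0.0], d)     # no barrier
p_barrier = permeability_isd(z, [0.0, 2.0, 0.0], d)  # 2 kcal/mol barrier mid-membrane
print(p_flat, p_barrier)
```

[ed. With a flat profile this reduces to Pm = D/h; any barrier in the free energy profile raises the resistance and lowers Pm.]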
#2022ICCS Our model. A kinetic approach. JCIM 2022. There's a kinetic rate constant for D to A and vice versa. L_D is a characteristic length scale. Need weighted ensemble MD to estimate k(D->A).
#2022ICCS We are going to run MD, and every time we get a bit across the membrane we are going to replicate those winning trajs, and split their weights. Repeat this. Some of them make it to the other side. k(D->A) we get it from the average instantaneous P flux from D to A.
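[ed. A toy version of the "replicate the winning trajs and split their weights" step. This is not WESTPA code: real weighted ensemble uses bins along a progress coordinate and also merges walkers, neither of which is shown. All names and numbers are made up.]

```python
def split_walker(walker, n_copies):
    """Replace one walker by n_copies clones sharing its weight equally."""
    position, weight = walker
    return [(position, weight / n_copies) for _ in range(n_copies)]


def resample(walkers, threshold, n_copies=2):
    """Split every walker whose progress coordinate exceeds `threshold`."""
    new_walkers = []
    for walker in walkers:
        if walker[0] > threshold:
            new_walkers.extend(split_walker(walker, n_copies))
        else:
            new_walkers.append(walker)
    return new_walkers


# (progress along the membrane normal, statistical weight)
walkers = [(0.1, 0.5), (0.4, 0.25), (0.9, 0.25)]
resampled = resample(walkers, threshold=0.5)
total_weight = sum(w for _, w in resampled)
print(resampled, total_weight)
```

[ed. The point is that splitting adds sampling where progress is being made while the total probability stays at 1, so rates computed from the weights remain unbiased.]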
#2022ICCS Four WESTPA protocols were tested. Regular WE (weighted ensemble, super enhanced sampling rel to MD but need 25 ms of data - very long!). WESS (Dan Zuckerman lab, converges to equil faster but still 25ms). MAB (focus sampling on undersampled regions, 8ms). MAB+WESS.
#2022ICCS MAB+WESS. Focused sampling and faster convergence to equil. However might need multiple runs for full convergence as it's a kinetic property.
#2022ICCS Showing evaluation graph versus PAMPA and MDCK-LE exptal results. Can run them all with and without WESS. MAB/WESS was used going forward.
#2022ICCS 3 molecules were used in testing: zacopride, sotalol, tacrine. Typical Ro5 molecules. CPU vs GPU repeats indicates quite some variability. Quite a few of the exptal values have no error bars - would encourage these to be provided.
#2022ICCS Pretty reasonable start. Can be improved in future. How does WE compare to brute force MD? MFPT (mean first passage time) a physical measurement. The three molecules take 5, 52 or 559 ms. On the fastest computer (Anton3) would take 37 days, but WE in Orion would take 1 day.
#2022ICCS The top-weighted permeation pathway for tacrine is shown. The first shell of water is shown. Starts to desolvate within the membrane. Resolvates the other side. Typical picture.
#2022ICCS For zacopride. Even in the middle of the membrane it's still dragging water with it. Unexpected. Similar story for sotalol. Still has water within the membrane.
#2022ICCS Some hot-off-the-press data coming... Back to Overton's rule from 1895. Exptally verified in Orbach and Finkelstein (1980). Let's calculate the free E surface from our data across the membrane and compare the permeability estimates.
#2022ICCS Are we self-consistent? Calc Log of our Pm, vs the Calc Log (mem/water): 0.29 r2. Versus experiment: 0.65 r2.
#2022ICCS For 17 drug-like cmpds, we are within 1 log unit of experiment for 14. MAE is 0.82. r2 is 0.5. Kendall's Tau: 0.51.
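[ed. For reference, how the quoted metrics could be computed from predicted vs experimental log Pm. The four data points are invented, not the speaker's. I'm assuming r2 means squared Pearson correlation, and the tau here is the simple tau-a with no tie correction.]

```python
from itertools import combinations


def mae(pred, obs):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def pearson_r2(pred, obs):
    """Squared Pearson correlation coefficient."""
    n = len(obs)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov * cov / (var_p * var_o)


def kendall_tau(pred, obs):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(obs)), 2):
        s = (pred[i] - pred[j]) * (obs[i] - obs[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(obs) * (len(obs) - 1) / 2
    return (concordant - discordant) / n_pairs


def n_within_one_log(pred, obs):
    """Number of compounds predicted within 1 log unit of experiment."""
    return sum(abs(p - o) <= 1.0 for p, o in zip(pred, obs))


# Invented log10(Pm) values for four hypothetical compounds.
pred = [-5.0, -4.0, -6.5, -3.0]
obs = [-5.5, -4.5, -5.0, -3.5]
print(mae(pred, obs), pearson_r2(pred, obs), kendall_tau(pred, obs),
      n_within_one_log(pred, obs))
```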
#2022ICCS We have developed a kinetic model based on WE path sampling, and give Pm estimate as well as mechanistic information. Still working on the statistical analysis of that model. Probably need more sampling still.
https://twitter.com/ConferenceNoel/status/1536643009659518980
#2022ICCS Dominique Sydow (@DominiqueSydow from @soseiheptaresco) on Integrated Structural Cheminformatics Analysis Tools for Customisable Chemogenomics Driven
Kinase and GPCR Drug Design (go Dominique!)
#2022ICCS First part on Kinase structural cheminf (Volkamer Lab) and then look at GPCR-focused at Sosei Heptares.
#2022ICCS Kinase dysregulation is linked to cancer, inflammation, etc. Challenges are competition with ATP, selectivity, and IP. There is a lot of data as it is well-studied and high structural coverage.
#2022ICCS We are using the KLIFS database for binding site annotation. 85 residues aligned across all kinase structures. klifs.net. Currently > 12K structures from PDB, 312 kinases.
#2022ICCS The idea of KinFragLib is to take co-crystallised kinase ligands, fragment them and associate frags with the pockets, then recombine to make new molecules that have knowledge on existing kinase inhibitors. We defined 6 subpockets. AP is in front of hinge region.
#2022ICCS GA is gate-area. Frag algo is BRICS fragmentation, subpocket assignment by distance. Then repeat from the beginning and only cut those that connect different subpockets. Final subpocket assignment to fragments and dummy atoms.
#2022ICCS Here are some reps of the most common frags found in the diff pockets. In the AP pocket there are a lot of hinge binders (H bond donors). GA more lipophilic frags.
#2022ICCS KiSSim: Kinase Struct Sim. Shows the Manning kinase tree. Shows on-target for erlotinib, and some far away off-targets. Can we explain and predict polypharmacology?
#2022ICCS Kinase-focused pocket FP (KiSSim). Includes distances to various pockets as well as physicochemical descs. Did all against all comparison. Shows circular dendrogram. Clustering seems to be good.
#2022ICCS "Please give me the top 50 most similar to EGFR" It found the LOK and SLK kinase from original example but not GAK off-target. Did this for multiple kinase inhibitors. Compared to KLIFS interaction FPs, and SiteAlign. Sometimes KiSSim does better, sometimes other method
#2022ICCS We set up a pipeline that gives results from all of the three methods and shows them.
#2022ICCS All of the data, code, notebooks are on GitHub. KinFragLib, KiSSim and also OpenCADD-KLIFS which makes it easier to work with KLIFS data.
#2022ICCS TeachOpenCADD is a teaching platform for cheminf and structural bioinf and uses only open source packages and data. It started as "from students for students". Each lesson is part of a Jupyter Notebook.
#2022ICCS Examples shown like binding site detection, MD analysis, molecular filtering, ligand-based screening, MCS, ligand clustering, ligand-based ensemble pharmacophores.
#2022ICCS A couple of these workflows are available as Knime workflows. Available on KnimeHub. Can be used as templates for your own applications.
#2022ICCS Now moving on to GPCR-focused work. Sosei Heptares is based in Cambridge UK and Tokyo, doing SBDD for GPCRs using StaR technology.
#2022ICCS This is an exciting era of GPCR SBDD as so many structures being made available. In PDB there are 374 unique GPCR-ligand complexes for 119 GPCRs. Sosei Hep has 354 GPCR structures over 42 GPCRs. Common to find lipophilic hotspots.
#2022ICCS The StaR (stabilised receptor) technology is at the core of our SBDD platform. We do step-wise stabilisation by point mutations to lead to increased thermostability thus keeping structural integrity outside the membrane. Trapped in relevant conformation (ago or antag).
#2022ICCS Breakdown of PDB versus SH structures. About 1/3 (?) of the deposited PDB GPCR structures are from SH.
#2022ICCS Visualising the bioactive chemical space of the structural GPCRome with StaRs. For 126 GPCRs we find quite a lot of ligands with similar structures. One example is the P2Y1 receptor. Sometimes we don't have structural data, so a binding site view is useful.
#2022ICCS Example of Knime workflow using 3D-e-Chem to find residues specific to a subfamily, extract them, do an alignment with all the other GPCRs, rank them, gives the top 50 GPCRs with similar binding sites. Can use these to search again. 3D-e-Chem has many other workflows.
#2022ICCS Q from speaker on "Who has used TeachOpenCADD?" Quite a few hands went up.
https://twitter.com/ConferenceNoel/status/1536665702593093632
#2022ICCS Theresa Noonan (Berlin) on A novel antibiotic target: Identifying bacterial ribosomal assembly inhibitors via 3D
pharmacophore-based virtual screening
#2022ICCS Shows the large subunit (50S) of the bacterial ribosome. It's responsible for bacterial protein synthesis and so is a popular target e.g. by linezolid.
#2022ICCS We will focus on our novel target, bL17. Ribosomal intermeds allow targeting of 50S assembly. Parts float freely, then self-assemble and become the fully functional subunit. Cryo-EM has led to atomistic models of assembly intermeds.
#2022ICCS There are a series of rRNA folding events. R-proteins join hierarchically to influence rRNA conformation, e.g. bL17 joins early on, is instrumental in the folding process and recruits other proteins later on.
#2022ICCS Our approach was to identify a ribosomal protein and to target it before it even joins the assembly. At the start we had 36 ribosomal proteins, and had to id one that was suitable. Long process.
#2022ICCS bL17 is instrumental in r-protein recruitment and downstream assembly events (shown in Cell 2016, 167, 1610).
No eukaryotic homologs. This makes it attractive as a target. 'b' stands for bacteria-specific.
#2022ICCS Now need to id the binding site. Found a druggable binding site at the rRNA interface. Challenges are no known ligands (nor for any homologs) - need to create a 3D pharmacophore from scratch. Need to outcompete the 50S subunit for binding (lipophilic contacts).
#2022ICCS We created the 3D pharmacophore using PyRod (github: wolberlab/pyrod). It traces water molecules over the course of a MD traj, and makes a dynamic interaction field map. These can be transformed into a 3D pharmacophore (JCIM 2019, 59, 6, 2818).
#2022ICCS Hydrophobic feature, hydrophobic sphere, and various water features.
#2022ICCS >7M cmpds, virtual screening down to 3K, 186 docking. Visual inspection 20. Exptal testing with biolayer interferometry (BLI). Measures binding events via wavelength shifts. Compound 14 displays association with bL17. But only observed at high conc.
#2022ICCS Very slow off-rate, so can't determine kD easily. Full structure will be published soon but not shown for now. Had the idea to capture Arginines at the rRNA interface. Target the negatively charged rRNA phosphate backbone.
#2022ICCS We use shape comparison via ROCS to find 8 analogues of A14. Cmpds with bulkier substitutions, diff H bonding patterns. After testing, we found 3 further binders. Still slow dissociation. Going to address this with isothermal titration calorimetry (ITC).
#2022ICCS C-05 to C07 show in vitro assembly inhibition vs a control. Results in far fewer functioning subunits.
#2022ICCS C07 has a carbonyl group which is binding to Arg46, and we are starting an optimisation round starting from this. We need large cmpds to make a bigger wedge between the r-protein and the assembly, but existing cmpds are fragment-like.
#2022ICCS These cmpds however do not show any E coli lethality so we need to address gram negative cell permeability.
#2022ICCS For C07, carbonyl moiety is related to assembly inhibition activity. Structure-informed scaffold-driven search. 111 hits, next steps to filter, etc.
#2022ICCS We identified a novel antibiotic target. Found initial hit and optimized it. Hopefully in 3y I will be presenting a novel class of antibiotics.
#2022ICCS Q about how found pocket. Used 4 different programs. PyRod, fpocket (can use MD traj data), sitemap, MOE (ed:may have missed the right names...)
https://twitter.com/ConferenceNoel/status/1536672576755335170
#2022ICCS Kristina Sophie Puls on Dynamic interaction patterns enable characterization of opioid-peptide binding to
the atypical chemokine receptor 3
#2022ICCS My research focuses on GPCRs (ed: yay!). Have 7TM helices embedded in a membrane. There are ligand-induced conformational changes and associated pharmacological effects.
#2022ICCS High pharmacological relevance. ACKR3 is atypical chemokine receptor 3. Involved in neuronal and CV development, but also pain modulation - this is what I am interested in. It's a scavenger receptor for opioid peptides. Four families for opioid ligands. Dynorphins...
#2022ICCS Dynorphins > enkephalins > nociceptins >> endorphins. Order of binding.
#2022ICCS Opioid peptides bind to ORs for fight-or-flight reflex. Also bind to ACKR3 which degrades opioid peptides causing pain. Inhibitors of ACKR3 should be analgesics.
#2022ICCS Problem is that there is no structure available. Homology modelling difficult as closest structure has id < 29%. Most structures in inactive state. There's also AlphaFold2, but has the drawback that it is not suitable for GPCR research. Lacks the activation state.
#2022ICCS AF2 can only predict one conformation - we want the active state. Ligand-induced state is also important but not predicted by AF.
#2022ICCS We took an AF2 model for ACKR3. Has a prediction accuracy value - most of the structure has a high accuracy. The two termini have lower accuracy. This structure represents the active state based on the TM7 inward and TM6 outward movement.
#2022ICCS Has three key disulfide bridges. These are important for the folding and function. The loop was already there and only need slight rotameric changes. W2.60 was pointing towards memb instead of binding site [ed:missed the details here].
#2022ICCS So we were happy with this structure. Looking now at adrenorphin binding affinities to ACKR3. Made (?) mutants of adrenorphin and looked at how activity changes related to docked poses. Extensive ionic interactions. Mutation studies supported this.
#2022ICCS Problem rationalising the activity gain of the most potent mutant compared to adrenorphin. Perhaps we need a dynamic investigation. Used MD, followed by dynophore analysis. Did 4 replicates of 1 us each. Dynophores are probability density clouds.
#2022ICCS Shows that frequency of particular ionic interaction was observed. Also a stronger interaction. On average we have 2A shorter distance for more potent vs adrenorphin. Violin plot of distances.
#2022ICCS Have found plausible binding models for several opioid peptides, and rationalised affinity/act differences. Ionic interactions are crucial for ACKR3 act. R6R7-motif seems essential.
#2022ICCS Looking now at drug-like molecules acting as analgesics, e.g. Conolidine. Docked them and found similar interactions. So repeated MD analysis with Dynophore analysis. Freq of ionic interactions correlated to higher activity also compared to another small mol.
#2022ICCS Conolidine shifts during the simulation. There's a sideward shift of E5.39 which opens up a space, and then conolidine shifts, creating a longer distance to D4.60 and leading to lower interaction strength. 4.60 is important for activity.
#2022ICCS ACKR3 does not bind morphinan opioids. Why? Looked at mu-OR with BU72. Crystal structure available. D3.32 ionic interaction is crucial. If we compare to ACKR3, it's F3.32 instead, which can't make these interactions. Also F3.32, Y6.51 and Q7.39 dec the binding pocket.
#2022ICCS Main findings. Identified for first time how peptides and SMs bind to opioid receptor. Explained act and aff differences. GPCRs are fantastic targets.
https://twitter.com/ConferenceNoel/status/1536679911666208768
#2022ICCS Sarah Maskri (Münster) on Development of potent FPR1 antagonists and partial agonists based on structural modelling and
a detailed understanding of binding characteristics
#2022ICCS PhD projects about 3 targets, two ion channels and a GPCR (FPR)
#2022ICCS GPCRs have very high flexibility. There's a bimodal switch between inactive (R) and active states (R*) and several transition states. Non conserved res disrupt...
#2022ICCS FPR - N-formyl peptide receptors. Mediates immune cell response to infection - anti-inflammatory agent.
#2022ICCS Three isoforms: FPR1/2/3. Used FPR2 to build FPR1 model. Shows MSA of three of them. Seq sim is ~60%. ICLs (intracellular loops) have ~ 90% sim. Responsible for signalling. Binding site has ~71% seq sim.
#2022ICCS Built homology model. Structure refinement for loops and sidechains. From literature have WKYMVM-NH2 peptide agonist. Specificity of FPR2 > FPR1. formyl-MLF favours FPR1 over FPR2. Identified 3 AAs stabilising...
#2022ICCS FPR1 with an agonist ligand - mainly 1 closed state. Without ligand, it was closed, then moved out and back to closed.
#2022ICCS Why do the two peptides not bind FPR3? 3 res were identified as recognition key for binding. C term -ve charge was stabilised by Arg and Lys in FPR1, but not in FPR2. This explains specificity of one of the ligands.
#2022ICCS Tried to explain the mechanism of activation. Is Met aa or formylation a must-have to activate FPR1? Is there a full peptidic modulator for FPR? Is any hydrophobic group inducing an antag effect?
#2022ICCS Docked to both FPR1 and FPR2. Found insertion of the BOC group between 2 helices, blocking their movement and creating a salt bridge. We designed some peptides to test our hypotheses. Formylated versions. Tested Boc and Fmoc. Also tried a peptide residue (W).
#2022ICCS Used WKYMVM-NH2 as a probe to build agonist-binding homology model. Is the Met important or not? The formyl was still interacting. Met interacts with W254, one of the key switches. There's an additional hydrophobic interaction. Formylated 6 AA peptides are potent agos.
#2022ICCS Fmoc/Boc group break salt-bridge D106, R201, R205. Met also interacts with W254 (rotameric switch). These are partial agonists.
#2022ICCS For the FPR1 Trp partial-agonists, the W aa interacts partially with D, R and R. Sometimes breaking salt bridge. Could create a full-peptide partial agonist.
#2022ICCS Showing biological assays. Inhibition assay, competitive assay, internalisation assay.
#2022ICCS Still something odd about our protein. Realised it had high constitutive activity (activity of apo form). Tracked down microswitch important for activity. "DRY" motif (ionic lock), "NPxxY", outward displacement of TM6. DRY motif not conserved for FPR1 - might explain it
#2022ICCS Per our results, formylation is important for agonists, NOT the Met. Confirmation came via a published structure FPR1-fMLF (04/2022). Trp N-term peptide disrupts interactions - it's a full-peptide partial-agonist. Fmoc and Boc inserted between helices disrupting interac
#2022ICCS Has been patented. Filed Dec 2021. Next to extend our work on FPR2 and FPR3 and mouse models. High basal activity - find mutations to recover a normal basal activity. BMS molecule (May 2022) a potent FPR2 agonist.
Wednesday - Cheminformatics Approaches
https://twitter.com/ConferenceNoel/status/1536959424727859200
#2022ICCS Baptiste Canault (GSK) on Chemical Annotation: A new similarity score for automated design and ranking
#2022ICCS DMTA - design make test analyse cycle (cmpd design cycle). Several cycles lead to candidate cmpd. Goal is to reduce time to candidate selection. For this, we use rapid mol design cycles - these are essential to reduce lead opt times. Main tool is BRADSHAW.
#2022ICCS BRADSHAW - an automated mol design platform. Manages the discovery cycle. Start with data - molecular generator (rxn based, knowledge based, deep learning). We use the GSK model collection - reactivity filter, physchem, synthetic tract, off target, safety, DPK.
#2022ICCS Selection by MPO, filter and ranking, 100 to 1000 cmpds. Then the 'make' part, followed by testing.
#2022ICCS How do we identify molecules that are similar to the input molecules? Mol Generators quickly produce huge diversity of chemical ideations from a lead mol. How can we filter, rank and select new compounds that are structurally relevant?
#2022ICCS Given a lead mol, distinguish between generated cmpds that are space filling (exploiting) from those that are extrapolating. How far is too far?
#2022ICCS Describes similarity/dissimilarity. "Similarity like beauty is more or less in the eye of the beholder" [Maggiora]
#2022ICCS Gives examples from Lead Opt (LO). A SAR series is defined in terms of a core with various decorations. Shows examples where the dissimilarity values don't seem to rank cmpds correctly.
#2022ICCS In contrast, the new CAScore can give small scores to small changes, and penalise larger changes.
#2022ICCS Current limitations. Molecular size has a huge effect; for small molecules there will be fewer bits set, and similarity values are necessarily lower. Bits are based on local molecular shape. Occurrence count: repeated features can overlap in bits, unless counted.
#2022ICCS 35 components of CAScore. Molecule/scaffold/ring/linker/substitution. FP, topological, graph, reduced graph. Tanimoto, Dice, Tversky, SAD, MCS. Values are combined as weighted sum. Adjust local and global similarity levels. Train a RF model to learn underlying trends.
#2022ICCS It's not a FP, but it is simple, explanatory, reproducible (across different molecules) and modular.
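Note (not from the talk): the components named above (Tanimoto, Dice, Tversky, combined as a weighted sum) can be illustrated with a minimal Python sketch. This is not the CAScore implementation, which was not shown in detail; the component names and weights in the usage example are made-up assumptions, and features are treated as plain sets rather than real fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a, b):
    """Dice coefficient between two feature sets."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index; alpha=beta=1 gives Tanimoto, alpha=beta=0.5 gives Dice."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 1.0


def weighted_dissimilarity(similarities, weights):
    """Combine per-component similarities (0..1) into one dissimilarity.

    Both arguments are dicts keyed by a (hypothetical) component name.
    """
    total_weight = sum(weights[name] for name in similarities)
    combined = sum(weights[name] * similarities[name] for name in similarities)
    return 1.0 - combined / total_weight
```

For example, `weighted_dissimilarity({"molecule": 0.5, "scaffold": 1.0}, {"molecule": 1, "scaffold": 3})` weights scaffold conservation three times as heavily as whole-molecule similarity, so a changed decoration on a conserved scaffold gives a small dissimilarity.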
#2022ICCS Take a patent family (see JCIM, 2020, 60, 12, 5699) for same target. Chemical series. All cmpds of a patent family. Query cmpd is the most active. Try to separate series from entire patent dataset...?
#2022ICCS Modelling set: 43 patents. Validation set of 45 patents: only two patents where separation not possible. Conclusion: CAScore is able to differentiate chemical series.
#2022ICCS Looking into a specific example where it wasn't able to separate series. Actually, the result makes sense - it's not possible to separate them.
#2022ICCS No correlation between Td(ECFP4) dissim and CAScore. There are few cmpds with a diss between 0 and 0.2 for Td(ECFP4). Better distribution of highly similar cmpds for CAScore.
#2022ICCS Looking at cases where ECFP4 and CAScore differ. If sim via ECFP4 but dissim via CAScore, CAScore is doing better at discriminating. Moving a subst around the ring one position is highly sim via CAScore, but lower with ECFP4.
#2022ICCS CAScore is a new chemical dissim metric adapted for drug disc (LO). Homogeneous dissociation of near space (exploitation). High scaffold conservation from queries (exploration).
https://twitter.com/ConferenceNoel/status/1536966772477378560
#2022ICCS Andreas Göller (Bayer) on Conformers Everywhere: Conformer Ensembles, Conformer Energies, 3D-ADMET and Machine Learning Potentials
#2022ICCS In principle molecules are atoms and electrons fluctuating and not static. There's stereo, charge, tautomer and conformer state, and also the medium. Confs play a major role for PL interactions, pharmacophore modeling and alignment, calc of spectra, for ADMET.
#2022ICCS Not just interested in minimum confs but all energetically accessible confs.
#2022ICCS How to use 3D confs and 3D descs to make better models. A few years ago tried lots of diff ways to make 3D models. Corina/Rotate, Macromodel, XEDEX, Pipeline Pilot. 40 diff settings. PP best algorithm yields decent nos of diverse confs at acceptable cost.
#2022ICCS Set up workflow for 3D structure, and calcd descriptors. Calcd descs for min E conf: PSA 3D, globularity, rad of gyration, moment of inertia Z. Ensembles of confs: made descs from these too.
#2022ICCS Caco-2 permeation is a test case for 3D ADMET. Mol will adapt to environment during transfer thru membranes. Charge state, intramolecular H bonds, hiding polar functionality. 2 sets of 3D descriptors for water and CHCl3.
#2022ICCS The Caco-2 assay is testing for absorption across membrane. 12K data points. Only keep passive transport (< 100 nm/s) as seemed to be noise in higher values. ECFP4 descs were better than 3D descriptors in all tests (!). RMSE was > 0.5 log units - which are meaningless.
#2022ICCS What went wrong? Also tried logD as a 2nd test case. Same result: 2D descriptors better than 3D.
#2022ICCS Sereina Riniker published MDFP - machine learning from MD data to predict free E differences. So we jumped into MD. 100 drug-like mols from the Zaretzki cytochrome P450 data set (JCIM 2013, 53, 3373). Looked into benchmarking Energies.
#2022ICCS Geo opt with GFN1-xTB. Energies from a bunch of methods: MMFF94, OPLS3, ..., PBE0-D3(BJ)/def2-TZVP. This last was the benchmark. We tried gas phase neutral structures, water calcs with/out charge.
#2022ICCS Calcd for 93 mols with > 4 confs. Confs were labeled upfront. Es calcd with each method compared to benchmark E. Aggregated on method level.
#2022ICCS In terms of Es, GFN1-xTB (1s per conformer) is the best method except for PBEh-3c. In terms of ranking, OPLS3 outperforms GFN1-xTB for solvated systems. Mostly because of the solvation model for which OPLS3 is optimised.
#2022ICCS MMFF94 > AM1 but both should not be used. PBEh-3c on par with DFT but too costly. Gas phase: OPLS3 ~ GFN-xTB but better for water phase.
#2022ICCS Dataset of 7 macrocycles with a broad range of memb perm. Comparing ensemble diversities from MD with conf generators. Biovia BEST, Schrodinger prime macrocycle PMM, Rarey's group conformator CONF. Post optimised with OPLS3e. MD sims with DESMOND.
#2022ICCS 5 diff starting structures. 3 solvents. Translate torsional values into sin/cos value pairs. PCA on 32 sin/cos pairs. One global map defined by the six 16-rings. Not all torsions contribute equally.
#2022ICCS Shows PCA plot. Multiple start confs are needed for complete ensemble. High barriers. Random pre-orientation at ring closure can cause ...
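Note (not from the talk): a small Python sketch of why torsions are turned into sin/cos pairs before PCA. Raw angles are periodic, so +179 and -179 degrees look far apart numerically although they are nearly the same geometry; the sin/cos encoding removes that discontinuity. The function names are hypothetical and the PCA step itself is omitted.

```python
import math


def torsions_to_sincos(torsions_deg):
    """Encode each torsion angle (degrees) as a (sin, cos) pair, flattened."""
    features = []
    for angle in torsions_deg:
        rad = math.radians(angle)
        features.extend((math.sin(rad), math.cos(rad)))
    return features


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

With this encoding, 32 torsions give 64 features per conformer, and `euclidean(torsions_to_sincos([179]), torsions_to_sincos([-179]))` is small, whereas the raw angle difference is 358 degrees.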
#2022ICCS Maps diff strongly depending on solvents.
#2022ICCS Can we reproduce these ensembles via conf generator methods? In violet, results from MD. In yellow, conf generated by generator methods. Area that is accessible from MD is not completely covered by generator method. Overall PMM > BEST >> CONF.
#2022ICCS Solvent has little effect on the generators (post optimisation), but big effect on the MD results. 3D polar surface area does not really diff depending on solvent or charge state. So cannot be used to help with permeation.
#2022ICCS Multiple MD runs on diverse starting confs per mol needed. Still no guarantee for completeness though. MD ensembles cover signif larger space. Solvation important.
#2022ICCS Beyond QM. "!ai.qu" - have set up a project with this name. To leverage the promise of mach learning pots as a shortcut to QM. Set up a db of 100K QM-optimized drug-like cmpds. Machine-learning pots for geoms and confs.
https://twitter.com/ConferenceNoel/status/1536974780884295680
#2022ICCS Oliver Koepler (Leibniz Inf Centre) on NFDI4Chem – The National Research Data Infrastructure for Chemistry
#2022ICCS Going to talk about research data in academia. NFDI is one of the biggest research data infrastructure projects recently. Linking and enhancing existing infrastructure components by services.
#2022ICCS We need a better research data infrastructure. The challenge is to find the data you need and re-use it for your purposes. We have decentralised stored data, unconnected to other services. Non standardised metadata. Main idea not to invest in more hardware, but link ..
#2022ICCS ...existing services and data. Funding horizon of 10 years, 70M euro per year with 30 consortia. They each get 2.3M per year.
#2022ICCS The 1st round consortia include NFDI4Chem. Many others listed across all scientific disciplines. Many close to chemistry: NFDI4Cat (catalysis), material science, GHGA (genome).
#2022ICCS Good distrib across sciences. The funders also created an overarching org, the NFDI association with a directorate, and a members assembly (207 members), etc.
#2022ICCS Now about NFDI4Chem. Chemistry has lots of subdisciplines and there are related disciplines. In order to max their impact, focusing first of all on molecular data.
#2022ICCS What sort of data and meta data. Name, Formula, CAS, InChIKey. Minimum information standards. Linking to reactions, assignment of spectra, to biological activity. Not just human readable data, but machine-readable with controlled vocabs and ontologies.
#2022ICCS NFDI4Chem is a grassroots movement growing alongside the needs of its community. 3 pillars: Learned society, German pharm society. Information institutions. Scientific community represented by universities.
#2022ICCS International context. IUPAC, RDA, GoFair, EOSC (European Open Science Cloud).
#2022ICCS Research data management then and now. 1985: Cray-2. 1.9 GFlops. 2017 iPhone 8, 325 GFlops. But research data is often still just captured on paper - not changed that much.
#2022ICCS Our vision is based on digitisation, standards, community. Disclosure/publication -> reuse -> experiment des -> data collection -> processing -> analysis -> back to disclosure/publication. Data into local repositories.
#2022ICCS Strategy/approach centered around the SmartLab composed of ELN, software tools and devices/API. This is linked to repositories by software. The software is implemented using standards, legal/policies and terminology.
#2022ICCS There will also be a helpdesk, and teaching and training. List of NFDI4Chem services. Includes devel of min inf of chemical investigations (MIChI). Joining forces with publishers: RDM recommendations and author guidelines. Federation of repos. Semantification of data
#2022ICCS These are organised into Legal4Chem, Standards4Chem, Editors4Chem, Repos4Chem, Ontologies4Chem, Community4Chem (knowledge base, best practices).
#2022ICCS Diving into ELNs in more detail. Why ELN? Avoid data loss, secure data storage. Knowledge management: data are Findable, Accessible. Publication....
#2022ICCS ELNs are used for planning expts, data collection and also analysis. We want seamless data flows. There are subdomain specific modules. We use chemotion open source ELN. The infrastructure can be used with any other one (APIs are open). Chemotion can cover the whole cyc
#2022ICCS Here's a screenshot. Can draw structures, create reactions, analysis of spectral data, sample management, user rights management. J Cheminf 2017. Originally from org chem synthesis, but we now extend it to the other parts of chemistry.
#2022ICCS Can also store the research plan. Analyses with ChemSpectra for NMR, IR and UV, is under development.
#2022ICCS Can generate suppl info directly from the ELN by pushing a button or directly deposit into a data repository.
#2022ICCS Federation of repos. VibSpecDB, MassBank EU, STRENDA DB (enzyme kinetics), NOMAD, RADAR4Chem, SUPRABANK (supramolecular interactions), nmrXiv, CSD, ICSD. This is what we currently have.
#2022ICCS Currently working on overarching services. Can do a chemical search across all the repos in NFDI4Chem. Can see what's available.
#2022ICCS Metadata everywhere. FAIR Data. We capture metadata here and there in small portions, all captured, and then all available at the end, way easier for the researchers to do it as they go along. Will all be shared at the end of process when it comes to data publication.
#2022ICCS Terminology service - to bring metadata to the next level. We need ontologies. Common definitions of chemistry concepts and their relations. We did an overview study of all the available ontologies. Current lookup service based on EMBL-EBI service. Already ELN using it.
https://twitter.com/ConferenceNoel/status/1536990181575864321
#2022ICCS Hans Briem (Bayer) on Automated Ligand Design meets Synthesis Planning
#2022ICCS It's his tenth ICCS. Shows the DMTA cycle. What was changed recently? Bayer is running NovaDesign (a collab with Schrodinger), AIOLI (AI-driven Opt of Ligand). Will talk about this latter. Estimating property profiles to focus exptal efforts. Selection via MPO Score.
#2022ICCS Virtual synthesis planning. Based on work from Marwin Segler. In Bayer, we have CHAI, Chemistry using AI, a retrosynthetic route prediction - we can use in-house rxns for training, turns out to be very important (-ve data). MTO, ant-colony based multi-target optimizer.
#2022ICCS CHAI - some modifications in the algorithm that improve the whole system.
#2022ICCS When you get a lot of different routes per compound, you need to choose one. MTO takes all the diff routes that CHAI proposes, it tries to optimise to find the minimum no of synthetic steps, fewest intermediates, etc.
#2022ICCS Can run CHAI - 500 routes per cmpd, synthetic feasibility score based on route diversity/length. Overall goal is to select X automatically designed cmpds with optimal props and most synthesis-economic combination of synthesis routes.
#2022ICCS Where do we stand at the moment with all these methods? AIOLI-CHAI. Coverage: % of designed ligands for which CHAI can find routes. Distrib of synthetic feasibility scores, dependency on AIOLI structure variation methods (more on this later).
#2022ICCS If include MTO also. What's the no of required building blocks and synthetic steps?
#2022ICCS Here are the variation methods in AIOLI. "Synthesis-agnostic methods". MMP-based MedChem transformation rules. Me -> CF3. DNN Autoencoder (Winter, Chem Sci, 2019, 8016). Scaffold-Hopping (Spark). Rgroup-based enumeration.
#2022ICCS "Synthesis-aware methods" Retrosyn-based educt exchange "Synthesia". 2D sim search in virt libs and fragspaces. 3D sim search in fragspaces "Ignite".
#2022ICCS Synthesia from Rarey's group. Ignite from Cresset collab. Synthesia uses a retrosyn tree from some tool (AI-Synth, AskCOS-MIT, etc.). Exchange building blocks with another one from a library. Can only exchange building block if full tree can be made.
#2022ICCS By this, we take into account the synthesis. The other is Ignite, with Cresset. Fragment space approach in 3D. Takes 3D query (can have excluded vol, protein, etc), can superimpose R groups from fragment space, align to query, uses Spark technology.
#2022ICCS AIOLI enumerations with all seven variation methods after basic MW and unwanted substructure filtering. 98% of them are unique, only generated by one of the methods. 74K cmpds. Took top 500 of each method based on a vanilla MPO score (MetStab Score, Caco-2, dmso sol)
#2022ICCS Generated CHAI routes for the 3500 cmpds remaining, and made a synthetic feasibility score for each cmpd. For the MTO, took the top 25 cmpds with the overall least no of building blocks needed. Onto results...
#2022ICCS For how many does CHAI generate routes. Over 90% in general. Synthetic feasibility scores higher for more synthesis-aware methods. For scaffold-hopping a bunch of molecules with low or zero score.
#2022ICCS The num of building blocks higher for synthesis-agnostic methods versus synthesis-aware. Now on to a real use-case. Blinded structure 'shown'. Central ring with 3 substituents. Starting point was cmpd with IC50 2nM/200nM on isoform A/B. Want to hit both.
#2022ICCS 5K cmpds enumerated. Filtering: docking, FEP+, CHAI. 1 cmpd synthesised so far, with additional heterocycle ring system. nM efficacy on both isoforms.
#2022ICCS Q on why develop own retrosyn tool? Wanted to include real neg data, not artificial neg data. Quite some work to annotate neg data (crowd sourcing approach). Also AskCos tool had some limitations. This was 2 or 3 years ago when there were fewer tools available.
https://twitter.com/ConferenceNoel/status/1536998284534898688
#2022ICCS Alan Kerstjens (Antwerp) on De novo design of synthetically accessible molecules using an evolutionary algorithm
#2022ICCS Drug discov fitness landscape. The fitness function defines the landscape. It's discontinuous (mols are discrete), and rugged (many local minima). We need a good search strategy.
#2022ICCS One strategy is test all mols we have easy access to. Virtual screening. Advs are simple and parallel. Disadvs, blindly shooting at fitness landscape and hope you hit a jackpot. Ideally you want a guided approach to the landscape.
#2022ICCS De novo mol design. Iteratively modifying a molecule. Algo directs the search through chemical space. Advs: access to large areas of chemical space; find fitter mols; find novel mols; efficient. Disadvs: designs are hard to synthesise.
#2022ICCS This could be measured several diff ways, e.g. SAScore: composite measure based on sim to known chemistry and molecular complexity. Despite being quite simple it has proven to be very popular and correlates reasonably well with chemists intuition.
#2022ICCS "LEADD" - an evol algo for de novo drug design. Imitates known chemistry in hopes of designing synthesisable molecules. Three mechanisms: 1. Limits bond formations to those observed in ref. Fragment-based design. Favour the most freq bond/chemotypes.
#2022ICCS Fragment based on atom types. Separate ring systems from acyclic regions. Extract all subgraphs of given sizes from acyclic regions. These latter are subjected to systematic fragmentation. All the broken bonds become labelled connectors. Store frags and their freq in db.
#2022ICCS Morgan atom typing from RDKit Morgan fps, which mimic ECFP. 32-bit integer. The no. of unique values is ~15K. [Missed details here]
#2022ICCS Molecules are combinations of fragments bonded thru their connectors. Represented as a meta-graph (chromosome). Not all connectors can be connected to each other: they must be compatible. Strict: two connectors are compatible if they are mirrored.
#2022ICCS Lax: Two atom types are compatible if they have been observed bound in ref molecule. The genetic operators: insert fragments, delete fragments, substitute, translation within molecule. GA operators ensure that bonding rules are met.
#2022ICCS How to enforce bonding rules. Construct bipartite graph. Maximum Bipartite Matching Problem (MBMP). How to match up central fragment with three flanking fragments. But too slow. As a compromise, multiple set intersection. Pre calculate which fragments are compatible.
#2022ICCS With this approach, can find the intersection of the compatible fragments to all three flanking groups. This is only a subset of the ones possible, but it is fast. Depending on the number of fragments, it will use one method or the other.
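Note (not from the talk): the "multiple set intersection" shortcut can be sketched in a few lines of Python. This is only an illustration of the idea, not LEADD's code; the `compatibility` lookup table, connector labels and fragment ids are hypothetical.

```python
def compatible_candidates(flanking_connectors, compatibility):
    """Return fragment ids that are compatible with every flanking connector.

    compatibility: dict mapping a connector label to the precalculated set of
    fragment ids known to be able to bond to that connector (hypothetical).
    """
    candidate_sets = [compatibility.get(c, set()) for c in flanking_connectors]
    if not candidate_sets:
        return set()
    # A replacement central fragment must appear in every per-connector set.
    return set.intersection(*candidate_sets)
```

The intersection is cheap because the per-connector sets are computed once up front, at the cost of returning only a subset of what a full bipartite matching would allow.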
#2022ICCS Used GuacaMol benchmark suite. We took the scores of the benchmarks. Took SAScore of the designed mols as a measure of synthesisability. Does it work?
#2022ICCS Generated atoms with diff atom typing schemes. Molecules designed with Morgan atom types compared to dummy atom types (just valence rules). Using more specific atom types we are describing chemistry better and easier to replicate the chemistry in the ref mols.
#2022ICCS However, being more restrictive means we can access fewer states and this affects the optimisation scores. This is shown by the graph.
#2022ICCS Comparison to other algos. GB-GA is an atom-valence and graph-based EA for de novo drug design. Compared to LEADD the difference is not statistically signif. But better SAScore [I think?]. Compared to AiZynthFinder, ...?
#2022ICCS LEADD is more computationally efficient than GB-GA. The most compelling argument for VS is the fast transition to exptal studies.
#2022ICCS Improvements? Filter out mols with high SAScore or incorporate SAScore into scoring fn. All approaches seem to define a FeatureScore Pareto front. Opt power versus SA trade-off. Unavoidable? No. Potent and synthesisable drugs are known to exist.
#2022ICCS Algorithmic restrictions implement search space barriers. Probabilistic illusion: opt power and SA are uncorrelated objectives. The prob of fulfilling multiple uncorrelated objectives at once is low.
#2022ICCS Novelty and SA is hard. Diff from known molecules, but ....similar to known molecules. Contradiction: can't be sim and diff at the same time. How novel is novel enough. J Cheminf 2022, 14, 3. Just takes a few lines of Python code. Free software: GPL.
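Note (not from the talk): a minimal Python sketch of extracting a Pareto front for the optimisation power versus synthetic accessibility trade-off. It assumes each design is a (fitness, SAScore) pair where higher fitness is better and lower SAScore means easier to synthesise; the numbers in the usage example are invented.

```python
def pareto_front(points):
    """Return the non-dominated (fitness, sascore) pairs.

    A point is dominated if another point is at least as good on both
    objectives (fitness >=, sascore <=) and strictly better on one.
    """
    front = []
    for i, (fit_i, sa_i) in enumerate(points):
        dominated = any(
            fit_j >= fit_i and sa_j <= sa_i and (fit_j > fit_i or sa_j < sa_i)
            for j, (fit_j, sa_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((fit_i, sa_i))
    return front
```

For `[(0.9, 4.0), (0.8, 2.5), (0.7, 3.0), (0.5, 2.0)]` the design `(0.7, 3.0)` drops out because `(0.8, 2.5)` is both fitter and easier to make; the remaining three form the front.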
https://twitter.com/ConferenceNoel/status/1537005778145509377
#2022ICCS Sereina Riniker (Zurich) on Assigning Diastereomers by Comparing Experimental and Theoretical IR Spectra
#2022ICCS Group works on developing ensembles of confs and their applications. Today an application.
#2022ICCS VCD spectra allow you to assign enantiomers. IR for diastereomers. VCD - vibrational circular dichroism. Chiral mols abs diff circularly polarized IR. Enants give same signal with opposite signs.
#2022ICCS You can only assign them by doing the calculation and compare to experiment. VCD can be measured in solution. Need to do a QM calculation. Limitations: typically in gas-phase, no anharmonicities, DFT functional, etc.
#2022ICCS With one conformer, can just about assign but there's a shift. Much harder if conformational ensemble. Conformational sampling up to 250 confs. QM geo + freq calc gives spectrum and free-E estimate (Boltzmann weighted spectrum).
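Note (not from the talk): a short Python sketch of a Boltzmann-weighted spectrum over a conformer ensemble. It assumes relative free energies in kcal/mol and spectra already sampled on a common frequency grid; the function names and the temperature default are illustrative choices.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(free_energies_kcal, temperature=298.15):
    """Population of each conformer from its free energy (kcal/mol)."""
    e_min = min(free_energies_kcal)  # shift by the minimum for numerical stability
    factors = [
        math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in free_energies_kcal
    ]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_spectrum(spectra, weights):
    """Weighted sum of per-conformer spectra on a shared frequency grid."""
    n_points = len(spectra[0])
    return [
        sum(w * spectrum[i] for w, spectrum in zip(weights, spectra))
        for i in range(n_points)
    ]
```

Because the weights depend exponentially on the energies, small errors in the free-energy estimates can shift populations noticeably across a large ensemble.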
#2022ICCS Solvent effects mean that not all the peaks have the same scaling. Our idea was to use Needleman-Wunsch alignment algorithm. Dynamic programming algorithm. Need scoring function.
#2022ICCS Take the absolute intensities, include the scaling factor and how much the Gaussian peaks can shift. Before the alignment, and then after the alignment. Seems to work well for rigid molecules, but how about flexible molecules.
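Note (not from the talk): a generic Needleman-Wunsch sketch in Python applied to two peak lists, to show the dynamic programming idea. The `peak_score` function, the gap penalty and the `max_shift` window are hypothetical stand-ins; the scoring function described in the talk (with scaling factor and Gaussian peak shifts) is not reproduced here.

```python
def peak_score(p, q, max_shift=30.0):
    """Hypothetical score for pairing two peaks given as (position, intensity).

    Assumes positive intensities. Pairs further apart than max_shift are
    penalised more than leaving both peaks unmatched.
    """
    (pos_p, int_p), (pos_q, int_q) = p, q
    shift = abs(pos_p - pos_q)
    if shift > max_shift:
        return -2.0
    return (1.0 - shift / max_shift) * min(int_p, int_q) / max(int_p, int_q)


def needleman_wunsch(a, b, score=peak_score, gap=-0.5):
    """Global alignment of sequences a and b; returns (total score, matched index pairs)."""
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # pair the two peaks
                dp[i - 1][j] + gap,  # leave a[i-1] unmatched
                dp[i][j - 1] + gap,  # leave b[j-1] unmatched
            )
    # Trace back from the corner to recover which peaks were paired.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs
```

Aligning calculated peaks `[(1000, 1.0), (1200, 0.5), (1700, 0.8)]` against experimental peaks `[(1010, 1.0), (1690, 0.8)]` pairs the first and last calculated peaks with the two experimental ones and leaves the middle one unmatched. The total score then serves as a global measure of how well the two spectra agree.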
#2022ICCS Shows simvastatin. Use Boltzmann weighting. Or tune the weighting to improve the alignment.
#2022ICCS Assigning diastereomers using IR spectra. Had to adapt the scoring function for this. Much smaller differences (the diff between peaks). Start with simple one. We use the score that comes out of the NW algorithm, to get a global measure how good the alignment is
#2022ICCS Need to look at the fingerprint region, not the functional group region.
#2022ICCS Looking at nat product, mutanobactin D. Calculate NMR chemical shifts - cannot distinguish between diastereos. Based on spectrum, here are the results. Up to 200 conformers contributed. This was done blind. JACS 2021, 143, 10389.
#2022ICCS Now on to some current work. Handling of strongly overlapping peaks. Deconvolution with pseudo-Voigt bands. Recovers the underlying peaks. We are also working on improving the scoring function. First scaling by global factor, then fine-tuned alignment with IRSA algo.
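Note (not from the talk): for reference, a pseudo-Voigt band is a linear mix of a Lorentzian and a Gaussian with the same full width at half maximum. The Python sketch below only evaluates the band shape and a sum of bands; the fitting step that a deconvolution needs is not shown, and the parameter names are assumptions.

```python
import math


def pseudo_voigt(x, center, fwhm, eta, height=1.0):
    """Pseudo-Voigt band: eta * Lorentzian + (1 - eta) * Gaussian, shared FWHM."""
    u = (x - center) / (fwhm / 2.0)
    lorentzian = 1.0 / (1.0 + u * u)
    gaussian = math.exp(-math.log(2.0) * u * u)
    return height * (eta * lorentzian + (1.0 - eta) * gaussian)


def sum_of_bands(x, bands):
    """Model spectrum value at x; bands is a list of (center, fwhm, eta, height)."""
    return sum(pseudo_voigt(x, *band) for band in bands)
```

Both components equal one half at `center +/- fwhm / 2`, so the mixed band keeps the same width whatever the mixing parameter `eta` is.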
#2022ICCS New workflow. Peak selection. Spectra deconvolution step, etc. How about combine multiple spectra sources, e.g. IR plus Raman spectra. Can be done at the same time. Or IR plus VCD.
#2022ICCS Available on github.com/rinikerlab/irsa. Feedback welcome
https://twitter.com/ConferenceNoel/status/1537011990605996032
#2022ICCS Marc Nicklaus (NCI) on Tautomerism analyses in preparation of InChI V2
#2022ICCS Tautomers are isomers that can readily transform into each other thru chemical equil rxns. Prototropic taut, ring-chain taut and valence taut. These are the 3 types.
#2022ICCS Why worry about tautomers? [ed: or indeed anything?] Affects property calc by several orders of mag (e.g. pKa, logP). Tanimoto sim between tauts can be surprisingly low. ..
#2022ICCS Financial consequences. If you look at Aldrich Market Select (AMS) over 30K conflicts. Can find quite different prices for the diff tautomers. Could buy cheap and sell high.
#2022ICCS Tautomerism is widespread. About 2/3 of molecules are amenable to tautomerism.
#2022ICCS InChI was designed to be tautomerism invariant. But standard InChI only handles a limited range of taut types. Can turn on KET and 15T, but even with these on, there are many missing.
#2022ICCS In 2012, Dmitrii T proposed some breaking changes to support additional tautomers. This was the start of the InChI tautomerism committee.
#2022ICCS We have taken as a starting point the rules in CACTVS. All rules expressed as SMIRKS. Total no of rules is 119. 54 Prototropic rules. You can look at the rules at the tautomerizer rules cactus.nci.nih.gov/tautomerizer. You can see the rules and where they come from.
http://cactus.nci.nih.gov/tautomerizer
#2022ICCS We extracted these rules from the experimental literature. All available for download. 2819 tautomeric tuples comprising 5977 structures. Structurally different tuples is 1776 comprising 3884 structures. Tautobase from Thomas Sander contains 1680 unique tautomer pairs.
#2022ICCS "Tautomer conflict" - presence of two/more structs in a db that are listed as diff chemical entities but are really tauts of each other. We have done this analysis a bunch of times. In 10M molecules, 73 of the 119 rules are found to cause conflicts. Not just theoretical
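Note (not from the talk): a Python sketch of how a tautomer conflict could be flagged once each database entry has a tautomer-invariant key. How that key is produced (e.g. by a tautomer-aware identifier) is outside this sketch; the record layout, entry ids and key values are hypothetical.

```python
from collections import defaultdict


def find_tautomer_conflicts(records):
    """Group entries by tautomer-invariant key and report keys with >1 structure.

    records: iterable of (entry_id, structure, tautomer_key) tuples, where
    structure is the structure as registered and tautomer_key is assumed to
    be identical for all tautomers of one compound.
    """
    by_key = defaultdict(lambda: defaultdict(list))
    for entry_id, structure, tautomer_key in records:
        by_key[tautomer_key][structure].append(entry_id)
    # A conflict is a key under which distinct registered structures appear.
    return {
        key: dict(structures)
        for key, structures in by_key.items()
        if len(structures) > 1
    }
```

For instance, 2-hydroxypyridine (`Oc1ccccn1`) and 2-pyridone (`O=c1cccc[nH]1`) registered as separate entries would be reported together if they share a key, while two entries with the same registered structure would not count as a conflict.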
#2022ICCS 99.96% due to prototropic rules, 327 for ring-chain rules, 5 for valence taut rules.
#2022ICCS Within various databases, how many entries have the same InChIKey. About two thirds of the conflicting molecules are not handled by the current InChI even with the extra options. Can we add these new rules? InChI does not have a SMIRKS parser.
#2022ICCS Adding new tautomeric rules requires code changes in the core of InChI. We picked ~20 prototropic rules as candidates for implementation. Igor Filippov was able to add six new rules. The others could not easily be added. Will be present in next version of InChI.
#2022ICCS What have we gained with the six new rules? Looking at 90M PubChem cmpds. Turning on all eight rules, you lose 3.5%. How does InChI perform on the "Tautomer Database"? The ideal InChI should give you 1776 InChIs. The standard InChI gives 3380. Down to 2210 with all 8.
#2022ICCS A few examples we see. The tautomer InChI gives a different value than the non-standard InChI even where they both identify the tautomer. You should not mix the two. Maybe we should have a "T" in the version instead of "N". Back conversion of Tauto InChI to mol can fail
#2022ICCS Something is coming, organised by the FDA. "precisionFDA". InChI-based tautomers identification challenge for Sep-Dec 2022. Based on InChI 1.0.6 with all 8 rules available. Participants should test these algo mods on their real chemical samples with their analyt chem.
#2022ICCS We should try to connect cheminf in this field with real-world exptal data. This is important. Hope is that many will get involved. Can download the software and run locally, or upload your own molecules. Would be very useful for the whole field.
#2022ICCS What about InChI version 2? Which rules? Looking at occurrence rate in PubChem. Can be used to prioritise. Most common rule: 65%. Dmitrii's original rule: 0.02%, but he thought that was important.
#2022ICCS Where should we go from here? Some prototropic transforms were implemented but doubtful more can be in the current codebase. To add more rules, InChI needs to be rewritten, refactored or provided with a tautomer preprocessor.
#2022ICCS This might lead to version 1.07 (extend) or 1.5 (refactor) or 2.0 (rewritten).
Thursday morning - Artificial Intelligence Approaches
https://twitter.com/ConferenceNoel/status/1537322454980378627
#2022ICCS Karel Johannes van der Weg on Improved classification of protein function by a localized 3D protein descriptor and deep learning
#2022ICCS EC classification is part of this talk. A basis to classify chemical function. There's a relationship between shape and function. TopEnzyme is the db we created for this project (there's a preprint). And then TopEC is their classifier based on binding site.
#2022ICCS TopEnzyme is a framework and db for structural coverage of the functional enz space. Easy overview for the structural availability of EC numbers (Enzyme Classification). Largest collection of enz structural models classified according to EC nos. 18307, 60% of EC number
#2022ICCS Methods used: TopModel, TopScore and AlphaFold2. Homology model creation using DNN: TopModel and AlphaFold2. We use strict homolog conditions. TopScore is a model quality assessment program using DNNs. AF2 is ab initio folding.
#2022ICCS What we contributed. Most models of good quality (few low, some high). For most of the cases we have a fairly complete rep of the binding site except for ....
#2022ICCS Show graphs of quality. Disordered parts of enzs are much harder to model. Beta sheets the best, alpha helices harder. AlphaFold2 released just as they finished this work. What do we do? Let's compare.
#2022ICCS Using our TopScore method of measuring quality. AF2 is slightly better. We are generally better when the model is difficult. So AF2 is definitely missing some information.
#2022ICCS How far away is a structure from the xtal structure? Looking at the harder target, the lDDT score is quite low but AF2 really overestimates these scores. You think you have a good model, but you don't. Shows structural comparison examples...
#2022ICCS AF2 makes a mistake in the first example. In top right, with our method we were unable to model the beta fold structure. In bottom left/right, both get almost the same as the xtal structure.
#2022ICCS Focussing on the binding sites. Shows example where binding sites are very good for both, or where one got it wrong. One case where disagreed but it's a super flexible region that moves out to accommodate ligand.
#2022ICCS What can we now do with neural networks and this db? TopEC - function prediction. 3D-CNN, 3D-graph neural network, un/supervised learning. Data splits implemented - temporal split - what can we learn from previous knowledge? Fold split - can we learn from local region
#2022ICCS Why use graph objects? Rotationally invariant. Understandable and explainable. Very easy to process, e.g. can make some selection of atoms, make the graph representation. Unfortunately run into hardware limits. Used supercomputer. 800 GPUs for 4 days.
#2022ICCS How to make a 3D-aware graph neural network? SchNet (distances) and DimeNet++ (angles). Computationally v expensive. We reduced this problem to a localized 3D descriptor. Reduce input size but retain essential info. How many atoms do we need. 75-100 works okay.
#2022ICCS Hierarchical classif with TopEnz and BindingMOAD. 3D networks are better for structural classification; local repr increases performance. TopEC(dist) > TopEC(dist+angles). Better than other methods using CNN and that don't use graph NN.
#2022ICCS If you train on AF models and test on the PDB, we still obtain a very good score with our method. About 70% correct. When we do a fold-split are we doing what we think we are doing? Shows example: we are able to correctly predict function even where there are different folds
#2022ICCS Similarly for serine proteases. What are we learning in the NN? Catalytic res are usually located in regs with high struct stability. The most predictive residues seemed to be linked to structural stability.
#2022ICCS But for allosteric enzymes, the most predictive residues can be in a disordered region that is nonetheless important for function (binding a ligand).
#2022ICCS Preprint uploaded but not visible yet. Once visible, will be able to access all the data, etc. TopEnzyme URL, releases of TopEC. Future work: Distinguish multiple objects in a network. Interface recognition.
https://twitter.com/ConferenceNoel/status/1537329470977232896
#2022ICCS Morgan Thomas (@soseiheptaresco) on Augmented Hill-Climb improves language-based de novo molecule generation as benchmarked via the open source MolScore platform
#2022ICCS Two part presentation. Recurrent neural network for de novo mol gen. Augmented Hill Climb on how it improves sample efficiency. MolScore platform.
#2022ICCS A recurrent neural network is a natural language processing model that can be applied to SMILES. One-hot encoding of vocab into binary vector. RNN generates prob distrib over entire vocab and you maximise the likelihood assigned to the correct token.
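[Ed: a minimal Python sketch of the one-hot encoding step, my own illustration rather than code from the talk. It assumes character-level tokens, which is a simplification (real SMILES tokenisers keep multi-character atoms such as Cl and Br together); build_vocab and one_hot are made-up helper names.]

```python
def build_vocab(smiles_list):
    """Collect the characters seen in the SMILES strings, plus GO/END markers."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return ["<GO>"] + chars + ["<END>"]


def one_hot(token, vocab):
    """Binary vector with a single 1 at the index of the token."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec


vocab = build_vocab(["CCO", "c1ccccc1"])
# Ethanol wrapped in start/end tokens, one binary vector per token.
encoded = [one_hot(tok, vocab) for tok in ["<GO>", *"CCO", "<END>"]]
```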
#2022ICCS To generate. Give a GO token and then predict the next token. Repeat until the END token. You are sampling from a learned probability distribution. Other grammars are DeepSMILES (poster by @baoilleach) and SELFIES.
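[Ed: a toy sketch of the GO-to-END sampling loop, not the speaker's code. TOY_MODEL is a made-up lookup table that only conditions on the previous token; a real RNN conditions on the whole prefix through its hidden state. The probabilities are invented for illustration.]

```python
import random

# Hypothetical next-token probabilities, standing in for a trained RNN.
TOY_MODEL = {
    "<GO>": {"C": 1.0},
    "C": {"C": 0.4, "O": 0.3, "<END>": 0.3},
    "O": {"<END>": 1.0},
}


def sample_smiles(model, rng, max_len=20):
    """Sample one token at a time until <END> or max_len tokens."""
    token, out = "<GO>", []
    for _ in range(max_len):
        options = model[token]
        token = rng.choices(list(options), weights=list(options.values()))[0]
        if token == "<END>":
            break
        out.append(token)
    return "".join(out)


sampled = sample_smiles(TOY_MODEL, random.Random(0))
```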
#2022ICCS Common opt algos for RNNs. Train on ChEMBL and then fine-tune on ligands for a certain target. However, this requires project data being present in the first place, but you might not have this. Also might not be good for novelty into new IP spaces.
#2022ICCS Alternative is to do reinforcement learning (RL), where you have a scoring function that updates the RNN after the initial training. Goal is to maximize an objective function. Disadvantages are that the scoring function may have pitfalls that are exploited.
#2022ICCS Despite being rudimentary, this approach usually ranks 1st or 2nd on benchmarks. State-of-the-art not clearly defined, but this approach is very common.
#2022ICCS RL requires many samples. GuacaMol 160K molecules, we found 128K. Hard to use medium time scoring function like docking, CASP. Req for large compute resources.
#2022ICCS Benefits of docking over QSAR models is that ligand data not required. Better coverage of known active space. Greater chemotype diversity.
#2022ICCS The REINVENT optimisation algorithm. For each mol you apply a scoring function. Then compute the augmented log-likelihood, which is the log-likelihood that it was generated by the prior plus a weighted reward (sigma x score). You are trying to optimise the loss function.
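[Ed: a minimal sketch of the REINVENT loss for a single molecule, assuming the usual published form: squared difference between the augmented log-likelihood (prior log-likelihood + sigma x score) and the agent's log-likelihood. Function names and the numbers are my own; scores are assumed to lie in [0, 1].]

```python
def augmented_log_likelihood(prior_ll, score, sigma):
    """Prior log-likelihood shifted upwards by the weighted reward."""
    return prior_ll + sigma * score


def reinvent_loss(agent_ll, prior_ll, score, sigma):
    """Squared gap between the augmented target and the agent's log-likelihood."""
    return (augmented_log_likelihood(prior_ll, score, sigma) - agent_ll) ** 2


# With a zero score the target is just the prior, so an unchanged agent has zero loss.
loss_no_reward = reinvent_loss(agent_ll=-20.0, prior_ll=-20.0, score=0.0, sigma=60)
loss_rewarded = reinvent_loss(agent_ll=-20.0, prior_ll=-20.0, score=0.5, sigma=60)
```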
#2022ICCS Varying the sigma value should in theory help get more quickly to convergence. But in practice this is not observed.
#2022ICCS Augmented Hill-Climb. If the reward is small, sigmaR tends to 0. In this scenario it's better to learn nothing than regress back to the prior. Our solution: just ignore the low scoring molecules and take the top-scoring 50% of molecules.
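[Ed: a sketch of the top-k idea as I understood it: rank the batch by score and compute the REINVENT-style loss only on the top fraction. ahc_loss, the batch layout and the numbers are hypothetical; this is not the authors' implementation.]

```python
def ahc_loss(batch, sigma, topk=0.5):
    """Mean loss over the top-scoring fraction of the batch.

    batch: list of (agent_ll, prior_ll, score) tuples, one per molecule.
    """
    ranked = sorted(batch, key=lambda mol: mol[2], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * topk))]
    losses = [(prior + sigma * score - agent) ** 2 for agent, prior, score in kept]
    return sum(losses) / len(losses)


batch = [(-20.0, -20.0, 0.9), (-20.0, -20.0, 0.1), (-20.0, -20.0, 0.5), (-20.0, -20.0, 0.0)]
# Only the molecules scoring 0.9 and 0.5 contribute; the low scorers are ignored.
loss = ahc_loss(batch, sigma=10)
```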
#2022ICCS You get this huge benefit in sample efficiency, and also optimisation works much better. Unfortunately, this causes mode collapse - uniqueness tends to decrease.
#2022ICCS Can be fixed by applying a diversity filter, penalise generation of chemotypes already observed. When applied, this still outperforms REINVENT.
#2022ICCS Tuning sigma now has a greater effect on chemical space generalizability, in comparison to when this is done with REINVENT. This means that the value of sigma can be used for fine control over staying close to the training data or moving out of that training data space.
#2022ICCS AHC can exploit diversity filter params. The model has learned to generate molecules that cap at 0.8 (min score threshold) to avoid the diversity filter param. We did a hyperparameter search on the GuacaMol training set to tune this in the interests of time.
#2022ICCS Several observations: min score threshold has to be < 0.5. Softer gradient of penalization (e.g. linear or sigmoid penalization) - give the model time to learn how to optimise.
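[Ed: a hypothetical sketch of a diversity filter with a linear penalty, to show what a softer gradient could look like compared with zeroing the score outright. The class, its parameters and the use of a scaffold string as the bucket key are my assumptions, not MolScore's API.]

```python
from collections import Counter


class DiversityFilter:
    """Scale down scores for scaffolds that have already been seen often."""

    def __init__(self, min_score=0.4, bucket_size=10):
        self.min_score = min_score
        self.bucket_size = bucket_size
        self.counts = Counter()

    def penalise(self, scaffold, score):
        if score < self.min_score:
            return score  # below the threshold: not tracked, not penalised
        seen = self.counts[scaffold]
        self.counts[scaffold] += 1
        # Linear decay from 1 to 0 as the scaffold's bucket fills up.
        factor = max(0.0, 1.0 - seen / self.bucket_size)
        return score * factor


filt = DiversityFilter(min_score=0.4, bucket_size=2)
scores = [filt.penalise("c1ccccc1", 0.8) for _ in range(3)]
```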
#2022ICCS Now learning has stabilised over the full training period. Sample efficiency x 8.6 better. Opt ability x1.4. Observed for mu opioid, AT1 and OX1 receptor.
#2022ICCS In terms of the molecules generated, chemistry remains as reasonable as REINVENT. Only exception is OX1 receptor, a peptide receptor (molecules generated are larger). Can dec sigma if don't want this but it's not unreasonable.
#2022ICCS Shows examples of chemical structures. The chemotypes generated are very similar to REINVENT, and the substructures generated are known active substructure. Both have similar similarity to actives. Both decrease sim to known inactives.
#2022ICCS Disclaimer: docking score alone is not an adequate objective. You need a binding hypothesis and MPO. Docking score, lipophilic hotspot constrained, RA score, TPSA >=40 and rot bonds <= 6. This works much better in practice for A2a compared to docking score alone.
#2022ICCS AHC is more sample efficient than other common RL algorithms across a series of easy/medium/hard tasks (from GuacaMol?). All of the code used to generate this is available on GitHub...moving onto MolScore.
#2022ICCS MolScore is a Python package for de novo mol scoring, enabling easier eval and comparison. Born of a lack of standardisation in existing eval and the presence of a lot of toy tasks.
#2022ICCS Can do lots of things. Scoring fns, transform fns, diversity filter, applicability domain filters, performance metrics, aggreg fns. You need a config file in JSON, but this is difficult to do manually. You can instead use a Streamlit app to build this config file.
#2022ICCS You can view the molecules generated via another app. MolScore paper under prep. Augmented Hill-Climb paper under review. Preprint available. In practical terms, can run overnight on 10 CPUs compared to several days on a cluster.
#2022ICCS Future work: more complex methods for importance samples, and molscore will have more generative models.
#2022ICCS MolScore code on Github.
https://twitter.com/ConferenceNoel/status/1537337208369295361
#2022ICCS Maxime Langevin (Sanofi Aventis) on Explaining and avoiding failure modes of artificial intelligence for small molecule design
#2022ICCS "A tale of two models". Bob has created a predictive model to design novel optimised molecules that are predicted to be active. Alice asks "How does my model score your molecules?".
#2022ICCS We assume that both models have similar predictive performance. We expect Alice's model to also predict it to be active. But it is predicted to be inactive. Failure mode.
#2022ICCS Bob says prob of active is 0.8 but Alice says 0.3. Renz et al. "On failure modes..." DDT, 2019, 32, 55. Two distinct models for the same dataset and same prop - should behave in the same way.
#2022ICCS They show that as you generate molecules, the optimised score increases, but the control score stays low. Optimises on Bob's score, but not on Alice's. Worrying. If you use different random seeds on the exact same data, you still get that worrying behaviour.
#2022ICCS So does this mean that the models are exploiting specific biases in the model? Bob's scores go up, but Alice's stay low. Not what we want. Our goal was to understand and explain this.
#2022ICCS Step one to check. Do the two models even agree on the original data? Ask Bob and Alice to score the original set (separate samples from the same dataset). Already strong disagreement! Though it correlates a bit.
#2022ICCS High opt scores mechanically lead to lower control scores. This is a disagreement already present in the initial dataset. The expected impact on the control scores. Sample from the optimization scores and the expected control scores - draw tolerance interval.
#2022ICCS Given the disagreement on the initial set, the results found are within the tolerance for what's expected to be observed. We need to get them to agree on the original dataset - how to do?
#2022ICCS We change the hyperparameters to go from strong disagreement to much stronger agreement. Once done, there is still a discrepancy between Alice and Bob's scores, but it's much reduced which is a good sign. Found also when built on homogen data on ALDH1 from LIT-PCBA
#2022ICCS Minor insignif changes when building the model lead to signif changes when running on external data. "Model underspecification". Not a problem with de novo gen of molecules. Solution: check robustness, parametrisation. See our publication.
#2022ICCS Some tips. Predicted prob of being active is misleading. Not surprising to get diff scores with diff classifiers. Not true that it's a prob of being active. It's just a score used to rank cmpds. No real meaning. Mervin, Afzal JCIM 2020, 60, 4546 for scaling model scores.
#2022ICCS What about a regression model? Has actual meaning. Perhaps works better between models. In practice, you get better agreement between Alice and Bob. Limitation is that it's not always possible (only binary data), or may not work (not enough data).
#2022ICCS If you can get Alice and Bob to agree on the initial data, then they will agree on the predictions. Control models can give surprising results on generated molecules. Biases not limited to fooling pred models. Quality of generated molecules, or overexploitation of fns.
https://twitter.com/ConferenceNoel/status/1537343343637692417
#2022ICCS Pavel Polishchuk (Palacky) on Multi-Instance Learning Approach to Predictive Modeling of Molecular Properties: new or well forgotten old?
#2022ICCS Here on behalf of Timur Madzhidov. We use diff reps of molecules 0D, 1D, 2D, 3D, 4D. Info content versus ease of calc. They are dynamic.
#2022ICCS CoMFA was one way to capture this: comparative mol field analysis (Cramer 1988 JACS). Use electrostatic field descs and steric field descs. We need a conformer. For flexible mols, these approaches have limitations.
#2022ICCS Bioactive conformer. Seminal paper of Hopfinger, 4D-QSAR (2010). MD sims, average across conformers. Align-independent. Feature vector is found in quite simple way.
#2022ICCS Denis Fourches tried to assess 2D/3D/MD-QSAR methods (JCIM, ...). He found that 2D performs pretty well. Can we somehow use conformer ensembles in a smarter way than just mean, standard deviation?
#2022ICCS Cluster the conformers. Switch from descriptor space to cluster space. Matrix view is 0 or 1. 0 if no confs of a particular mol are in this cluster; 1 if any are.
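[Ed: a small sketch of the cluster-space matrix as I understood it: one row per molecule, one column per conformer cluster, 1 if any conformer of the molecule falls in that cluster. The input layout and function name are my own assumptions.]

```python
def cluster_occupancy(conformer_clusters, n_clusters):
    """Binary molecule-by-cluster matrix.

    conformer_clusters: {mol_id: [cluster index of each conformer]}
    """
    matrix = {}
    for mol, assigned in conformer_clusters.items():
        occupied = set(assigned)
        matrix[mol] = [1 if c in occupied else 0 for c in range(n_clusters)]
    return matrix


# Molecule "a" has conformers in clusters 0 and 2, molecule "b" only in cluster 1.
matrix = cluster_occupancy({"a": [0, 0, 2], "b": [1]}, n_clusters=3)
```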
#2022ICCS Pmapper: 3D pharmacophore descs. Enumerate quadruplets. Canonical signature indicating constitution and relationships. Used for clustering and building models.
#2022ICCS We observed that 2D models based on Morgan fps perform pretty well and outperform our clustering approach. But if we look at only compounds with chiral data, then 2D models are not so good. In some cases we were able to outperform 2D methods.
#2022ICCS Looking at ChEMBL dataset. Still 2D methods working better in most cases. So we need a smarter way to use conformer data.
#2022ICCS Use multi-instance learning where each mol is rep by multiple confs, multi-instance 3D-QSAR. 3D MI model. The funny thing is that this is a very old paper by Dietterich (Artificial Intelligence, 1997) - not used in chemical domain but developed in comp sci.
#2022ICCS "Solving the multiple instance problem with axis-parallel rectangles".
#2022ICCS Two quite simple approaches. Bag-wrapper: averaging of conf descs. This is what is currently done for 4D QSAR. Alternative is "Instance-wrapper". Averaging of conf predictions (not descriptors). We train the model on a matrix of all conformers.
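[Ed: a sketch of the difference between the two wrappers at prediction time, using a toy non-linear "model" so that the two give different answers. The function names are mine. As I understood it, the instance-wrapper model is also trained on every conformer (each carrying its molecule's label), which this sketch does not show.]

```python
def mean_vector(vectors):
    """Column-wise mean of a list of equal-length descriptor vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def bag_wrapper_predict(model, conformer_descs):
    """Average the conformer descriptors first, then predict once."""
    return model(mean_vector(conformer_descs))


def instance_wrapper_predict(model, conformer_descs):
    """Predict for every conformer, then average the predictions."""
    preds = [model(desc) for desc in conformer_descs]
    return sum(preds) / len(preds)


def toy_model(desc):
    """Hypothetical non-linear model: square of the first descriptor."""
    return desc[0] ** 2


conformers = [[1.0], [3.0]]
bag_pred = bag_wrapper_predict(toy_model, conformers)
instance_pred = instance_wrapper_predict(toy_model, conformers)
```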
#2022ICCS Now in the era of neural networks. Bag-Net averaging of conformation embeddings versus Instance-Net averaging of conformation scores.
#2022ICCS Bag Attention-Net - weighted averaging of conf embeddings. This allows us to identify key instances which has more benefits.
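[Ed: a minimal sketch of attention pooling over conformer embeddings: softmax the per-conformer attention scores, then take the weighted mean. In a real network the scores come from a learned layer; here they are passed in by hand, and the function name is my own.]

```python
import math


def attention_pool(embeddings, attn_scores):
    """Return (pooled embedding, softmax weights) for one molecule."""
    top = max(attn_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in attn_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(w * emb[i] for w, emb in zip(weights, embeddings)) for i in range(dim)]
    return pooled, weights


# Equal scores reduce to plain averaging; a dominant score picks out a key conformer.
pooled_equal, w_equal = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
pooled_key, w_key = attention_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

The weights are what make key instances identifiable: the conformer with the largest weight contributes most to the prediction.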
#2022ICCS MIL study - comparison of these algos over ChEMBL dataset. We found that the quite simple approach instance-wrapper outperformed the other MI approach. Quite interesting. We compared it to 2D models on Morgan FPs. We also looked at an approach based on bioact conf.
#2022ICCS Also tried other 2D reps. MI instance wrapper beat everything. Models based on a single energetically favourable conf were completely unreliable. >50% of cases instance wrapper beat the others.
#2022ICCS For some dataset which cannot be predicted by 2D models, we see a huge improvement. We are looking into the rationale behind this. Can we explain/predict for which datasets this might occur?
#2022ICCS If the average no of rot bonds is smaller, then the 3D method is better; otherwise the 2D method was more favourable (ed: a bit surprised to see this trend). The less diverse the scaffolds in a dataset, the easier for 2D.
#2022ICCS Identification of "bioactive" conformers. [Sorry: was distracted, and missed this] Taken from JCIM, 2021, 61, 4913. We compared random choice (worked quite well), compared to pose from Vina, lowest E conf, versus our MI-learning model. Our method never below random...
#2022ICCS ...and sometimes quite a bit above. The most recent work is about modelling of chiral catalysts. We tried to model enantiomeric excess (ee). We used 3D descs using RDKit (or when that failed, using OpenBabel). Pharmacophore features. Reactants and products were encoded.
#2022ICCS Built finally multiinstance learning models and predicted ee. Designed to sim diff scenarios for prediction. For training set we used 16 reactants and 24 catalysts. For valid we used 3 external test sets. Choosing from new reactants with same catalysts.
#2022ICCS Also tested on another test set with same reacts and prods, but diff catalysts. Third was where both (cat and reaction) were out; this is the hardest.
#2022ICCS MAE (mean abs error) was smallest for reaction-out set even for 2D. For more challenging cases, 3D MI approach worked best. Zahr's model worked quite well too - the ref model (4D QSAR techniques).
#2022ICCS All algos and data are at
https://github.com/cimm-kzn/3D-MIL-QSAR
https://twitter.com/ConferenceNoel/status/1537360199090155520
#2022ICCS Janosch Menke (Münster) on Neural Fingerprints: Generating Domain-specific Molecular Fingerprints Using Neural Networks.
#2022ICCS The concept of neural fps requires only an input that represents a mol and a NN that fits of the choice of rep. Needs supervised approach.
#2022ICCS The input is transformed as it passes thru the network. If the input is a mol, then the transformed input ("the activations") should also rep the molecule. If we extract the activations from hidden layers, then this can be a neural fingerprint.
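[Ed: a toy sketch of the neural fingerprint idea: run the input through the first layer of a network and keep the hidden activations as the fingerprint, then compare fingerprints for similarity search. The weights here are invented stand-ins for a trained network, and cosine similarity is my choice of metric, not necessarily the speaker's.]

```python
import math


def dense(x, weights, bias):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def neural_fingerprint(x, hidden_w, hidden_b):
    """ReLU activations of the hidden layer, used as the fingerprint."""
    return [max(0.0, v) for v in dense(x, hidden_w, hidden_b)]


def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


# Hypothetical weights standing in for a network trained on activity prediction.
HIDDEN_W = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
HIDDEN_B = [0.0, 0.0]

query = neural_fingerprint([1.0, 0.0, 1.0], HIDDEN_W, HIDDEN_B)
candidate = neural_fingerprint([0.0, 1.0, 1.0], HIDDEN_W, HIDDEN_B)
similarity = cosine_similarity(query, candidate)
```

Note that the two input bit vectors differ, yet their fingerprints point in the same direction: the network has mapped them to the same region of its learned space.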
#2022ICCS Neural fps should work better than trad fps for tasks of similarity search so long as we stay within the domain of the neural network training.
#2022ICCS Kinase-specific neural fp. Input ECFP4 or graph (GNN). Activity predictions for 160 kinases. Results. Training 80%. 10% valid and test set 10% used for sim search. Neural fps gives almost twice as much enrichment as the regular fp (published in 2021, JCIM, 61, 664).
#2022ICCS What if your kinase is not in the training set? Big drop in performance, but still better than ECFP4. Not necessarily true for the graph neural network fp - performance actually can drop.
#2022ICCS Why do GNNs perform so poorly? Older generation GNN, not as expressive and suffers more from oversmoothing. Extra conv layers overfit to the task - less general structural info is retained.
#2022ICCS This time we froze the convolution layers. Frozen layers do not receive any weight updates, hence cannot be trained. These layers cannot adapt to the prediction task. Lower risk of overfitting. GNN on the training set almost the same, but improved on kinases not in the training set.
#2022ICCS Now interested in natural products. Natural product neural FP. Menke, Massa, Koch, 2021, Comp and Struct Bio J. 19, 4593. Used COCONUT for nat prods, ZINC and ChEMBL for synthetic.
#2022ICCS Two models provide higher enrich than baseline. Only active natural products are counted as hits. Physicochemical property fp, and baseline work better than ECFP4. Correlation between ECFP4 sim and sim of baseline fp before and after training.
#2022ICCS See two clusters after training, but this doesn't happen when we include the physicochemical properties. This smoothes out the similarity landscape, but the baseline does not fully punish the clustering.
#2022ICCS Gives examples of molecules found. Natural product likeness score by Ertl. Looking at correlation between the Ertl score and the NN score.
#2022ICCS Working on ion channels at AZ. Input is ECFP4 or SMILES (RoBERTa, a transformer architecture). Ion channels have challenges. Extremely diverse class of targets and not much data out there. Get from guidetopharmacology, and ChEMBL. After cleaning it reduced drastically
#2022ICCS Really sparse matrix. Most cmpds not measured on each ion channel. Only measured on average on 2 ion channels. Off targets, like hERG, dominate the dataset. An unrealistically high amount of active cmpds; chemists do not like to make/publish inactive cmpds.
#2022ICCS Using transformers, like RoBERTa, which are trained by reconstructing SMILES. Random parts of the SMILES are hidden from the model using masking. Done randomly. The goal of the network is to predict those masked characters.
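[Ed: a simplified sketch of the masking step. It assumes character-level tokens and always substitutes a mask token; real BERT-style training also sometimes swaps in a random token or leaves the token unchanged, and uses a proper SMILES tokeniser. mask_tokens and its parameters are my own names.]

```python
import random


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="<mask>"):
    """Hide random tokens; labels hold the original token where masked, else None."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)  # the model is trained to recover this token
        else:
            masked.append(tok)
            labels.append(None)  # no loss is computed at unmasked positions
    return masked, labels


tokens = list("CC(=O)O")  # acetic acid, one character per token
masked, labels = mask_tokens(tokens, random.Random(1), mask_prob=0.5)
```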
#2022ICCS We also include a Classification Token (CLS) in front of the SMILES. We can use the hidden states (embeddings) of the CLS token to make predictions for the input SMILES. We use the loss of both the token and SMILES to guide the model at the same time, backpropagate.
#2022ICCS ChemBERTa v2 makes it easier to train it quickly because already trained to some extent. Ion channel results: 80%, 10%, 10% (train, valid, test). ECFP4 does way better than untrained transformer. Only pretrained is worse (doesn't use classification token at this stage)
#2022ICCS Various other changes don't help much. Training the MLP helps though in the sim search of data. Trained transformer is good but worse than MLP. Median performance for ECFP4 is almost identical to transformer and MLP performance. Half the targets none performed well.
#2022ICCS Compares to out-of-distribution results. ECFP4 is best (just about). Training does nothing to improve performance.
#2022ICCS Also tried Siamese networks. Two RoBERTa models with same weights. Punish the models if they don't both look the same (Cosine Loss). This makes it even worse - worst model so far. Matrix too sparse perhaps for this to work well.
#2022ICCS Too many active cmpds for specific targets, and large diffs in number of measures - might need to downsample actives (e.g. down to 10%). Overall high diversity between targets - maybe not enough info is being transferred. However, even within subfamilies fails.
#2022ICCS The choice of predictive task can steer the activations towards a desired domain and provide increased enrichment in sim search. Activity data is not always required, simple prop preds are sufficient if combined with auxiliary data.
https://twitter.com/ConferenceNoel/status/1537367648253759490
#2022ICCS Benoit Baillif (Cambridge) on Ranking generated molecule conformations using deep-learning predicted deviation to target-bound conformations
#2022ICCS Work is all about finding bioactive conformations. Important for docking or pharmacophore searching. CCDC conf gen can retrieve bioactive conf for 90% (ARMSD < 1Ang) on the platinum dataset (among 250 gen confs). Only 70% within top 10.
#2022ICCS Aim to develop model that can distinguish which is the best. Going to use SchNet atomistic neural network. Take atomic positions and atomic numbers, do a raw embedding. Add interatomic distance embeddings. Get interaction blocks. Gives processed atom embeddings.
#2022ICCS Neighborhood convolution. Give a single output value for a conformation. Trained to predict the bioactive conformation. Bioact confs were extracted from PDBbind. Remove molecules with > 50 heavy atoms. 12592 ligands and 15296 complexes.
#2022ICCS PDBbind + generated conformations 952880 conformations in total.
#2022ICCS Three diff splitting strategies were tested. 80%/10%/10%. Shuffle order 5 times (bootstrapping). Random split. Scaffold split. Protein split.
#2022ICCS Workflow: generate confs. Predict ARMSD. Rank by ARMSD. Select a fraction of these (e.g. top 10%). Use CCDC GOLD to dock with PLP scoring function. 10 poses per conformation. Rigid docking to ensure that the conf is not changed during docking.
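[Ed: a sketch of the rank-and-shortlist step only (conformer generation, ARMSD prediction and docking are not shown). The predicted ARMSD values are invented and shortlist_conformers is a hypothetical helper name.]

```python
def shortlist_conformers(pred_armsd, fraction=0.1):
    """Keep the conformers with the lowest predicted ARMSD.

    pred_armsd: {conformer_id: predicted ARMSD to the bioactive conformation}
    """
    ranked = sorted(pred_armsd, key=pred_armsd.get)  # lowest ARMSD first
    n_keep = max(1, int(len(ranked) * fraction))  # always keep at least one
    return ranked[:n_keep]


predicted = {"conf1": 2.1, "conf2": 0.7, "conf3": 1.4, "conf4": 3.0}
to_dock = shortlist_conformers(predicted, fraction=0.5)
```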
#2022ICCS DUD-E virtual screening. Goal to retrieve actives among decoys. Rank mols by PLP score of best pose. Assess enrichment of actives using BEDROC, similar to EF8% but has advantage of being weighted.
#2022ICCS Compared to a baseline. Molsize model: predict a molecular average ARMSD using only 2D descriptors. Conformation ranking baselines. Rank by ascending UFF energy, CCDC conformer generator order.
#2022ICCS The different splits have variable ARMSD pred performances. On the test set, the models seem to have R2 around 0.5 [missed details here]. BioSchNet ranking retrieves bioactive-like confs in early ranks.
#2022ICCS Difficult to apply model to molecules that are quite different than the training set. Not surprising - has been observed elsewhere. BioSchNet can accelerate virtual screening on targets seen in training. Gives example of jak2 from PDBbind.
#2022ICCS Shows nos1 (not in PDBBind) - much more difficult. Training allowed us to predict for input confs their ARMSD to their closest known bioact conf, to rank gen confs with an early enrichment of bioact-like confs, and to short-list confs for virt screening using rigiddock
https://twitter.com/ConferenceNoel/status/1537373906247507969
#2022ICCS Arndt Finkelmann (Syngenta) on Digital Chemistry at Syngenta: From academic labs to industrial applications
#2022ICCS Not so much about explicit scientific results. How we think about ML and bring these innovations into research operations. How to make it a success in the long term.
#2022ICCS We do small molecule based crop protection. Why do we need to innovate in this space? World pop growing. Output of agricultural domain is lagging behind this growth. Anthropogenic climate change making this worse. Since 1961 growth of output has slowed down by 21%.
#2022ICCS We have to do this innovation in a sustainable way as a quarter of greenhouse gases are linked to agriculture. We believe that digital technologies can help drive this innovation.
#2022ICCS Data downstream informs decisions upstream. We want to be able to balance a broader set of props in early design. Shows radar plot of 8 properties. Safety/potency/novelty/cost/etc... Right now, we do this sequentially. Find potent cmpds. Next test physicochemical props.
#2022ICCS If we had good models we could incorporate them into the design phase. Inverse design would help make this process a lot faster. Digital tech is part of digital transformation. Evolve data and software infrastructure to support horiz scaling of data-based decision-making
#2022ICCS Scout and implement new tech. Change management - if you can't get your researchers to use your software then what's the point?
#2022ICCS The importance of addressing synthesisability. "Virtual world" Start with design objectives. Manual design in parallel with generative chemistry. Next route planning and synthetic tractability. "Physical world" Reactions - compound synthesis and protocoling -> testing.
#2022ICCS Synthesisability is really important to avoid the friction between the virtual world and the physical world. Design - what should be made. Synthesis - what can be made.
#2022ICCS Design principles for storing reaction data. First one is separation of concerns - separate the design from the protocol/making part. The second one is how you model reactions. Library synthesis. High throughput reaction optimization. Batch synthesis.
#2022ICCS These issues are not solved well in the current solutions. e.g. Organometallic catalysts are currently just captured as strings. What about byproducts - we should be capturing these. Should be labelling parts of the reaction with their role. Make it universally accessible.
#2022ICCS We have a synthesis design module. A protocoling module Reaction data into reaction data lake.
#2022ICCS People started thinking about the network of organic chemistry (NOC) some time ago (Nature Chem 2009, 1, 31, Chem Sci, 2019, 10, 4640). If you have all of this information in this graph, you could use this for modelling, or analyse it to look at proposed routes.
#2022ICCS Can find similar routes with defined reaction conditions. We want to connect all reaction data to create powerful network effects. We want to use publicly available and internal data. Our own data will be highly reliable. External high quality data from patents: 20%.
#2022ICCS Working together with IBM. "The power of transfer learning!" Even if only 20% of data (patents, Pistachio) is usable, still very good. Legacy data is 60% usable.
#2022ICCS How do we do this in detail? 700K in internal ELN. Pistachio 10M. Other. Store in MongoDB data lake. "LinChemin" where we have a library to map to neo4j graph representation. Use bipartite graph where we can do our analysis. Paper in prep.
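[Ed: a plain-Python sketch of a bipartite reaction graph, where molecule nodes only connect to reaction nodes and vice versa. This is my own illustration of the data structure and does not use the LinChemin or neo4j APIs; the reaction IDs and molecule names are invented.]

```python
from collections import defaultdict


def build_bipartite(reactions):
    """Directed edges: reactant -> reaction -> product.

    reactions: {reaction_id: (list of reactants, list of products)}
    """
    graph = defaultdict(set)
    for rxn_id, (reactants, products) in reactions.items():
        for mol in reactants:
            graph[("mol", mol)].add(("rxn", rxn_id))
        for mol in products:
            graph[("rxn", rxn_id)].add(("mol", mol))
    return graph


def reachable_molecules(graph, start_mol):
    """All molecules that can be made downstream of start_mol."""
    seen, stack = set(), [("mol", start_mol)]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return {name for kind, name in seen if kind == "mol"}


reactions = {"r1": (["A", "B"], ["C"]), "r2": (["C"], ["D"])}
graph = build_bipartite(reactions)
downstream = reachable_molecules(graph, "A")
```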
#2022ICCS For protocoling. There are languages to describe protocols. Several approaches - no standard. Our implementation is drag-and-drop. Protocols as a recipe. Capture the details in a structured way. Using open standards and open source code.
#2022ICCS For all of this, there is a human component. You need to have people on board. No use otherwise. These in-house developments allow us to bring chemists on board and have a role in how they interact with it, the GUI, etc.
#2022ICCS Design internally, but development partner is EPAM. [Ed: Three names I recognise in the credits @SuperScienceGrl , @georgeisyourman and @spparel]
https://twitter.com/ConferenceNoel/status/1537381230399868930
#2022ICCS Akos Tarcsay (ChemAxon) on Translating data to predictive models
#2022ICCS Ref to "Thinking fast and slow". There's the peak, and the last moment. This is the last talk!
#2022ICCS The ML life-cycle. Collection of data. Experimenting with data, standardising, training, visualising, triage. Last part - deployment of model ("the operate" part).
#2022ICCS We have developed a process to help with this from data ingestion through preprocessing, modelling, review and prediction. Not talking today about building models, but about building the infrastructure.
#2022ICCS We have a Java ML library (SMILE). using ChemAxon standardizer and desc generation. This is the service layer. Links to DB layers. There's a REST interface. And linked to this a GUI for CompChem, a separate one for MedChem, and links to Jupyter notebooks.
#2022ICCS Effect of standardisation. Salts and solvates, and tautomerism. Simple descriptors (MW, fsp3), etc. Quote from a paper saying that the ChemAxon tool works much better than the open source alternative for these problems.
#2022ICCS SMRT dataset for tautomerization. Training: 7K cases. Comparing the results for standardised versus non-standardized. 80000 cases in the full dataset. For 15K, R2 is 0.9.
#2022ICCS Let's look at pharmacological data. The ChEMBL bioact bench set (J Cheminf 2017, 9, 45). We filtered some data points. Did temporal split for external testing. Cumulative histogram on the test set and external set. More than 50% of the cases are higher than the cutoff.
#2022ICCS The external set was quite challenging, and there's a large shift. If selecting randomly it's easier to build the models. Maybe the most recent ones were different in some way (e.g. peptides versus SMs).
#2022ICCS Is it a hard task? Original results random split. The best method from the paper was 0.59 Matthews Corr Coeff (MCC), also using temporal split. That's the baseline. We used a cutoff of activity of 7. Most cases are balanced classes but there are some extremes.
#2022ICCS Random Forest (RF) results shown. Big peak at 0 because we assigned 0 if the MCC could not be calculated at all. We used Keras, atom and bond features. Can we create a standard model with a Message-passing NN (MPNN) and compare.
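[Ed: a sketch of the Matthews correlation coefficient from confusion-matrix counts, including the convention mentioned in the talk of assigning 0 when it cannot be calculated (zero denominator, e.g. when only one class is ever predicted). The function name is mine.]

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined case, reported as 0 by convention
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=5, tn=5, fp=0, fn=0)
undefined = mcc(tp=0, tn=10, fp=0, fn=0)  # model never predicts the active class
inverted = mcc(tp=0, tn=0, fp=5, fn=5)
```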
#2022ICCS Binned MCC evaluation random selected test case. It's hard for the NN as well. Results shown for the external case (temporal split). Most of the models are poor (shouldn't be used) - RF somewhat better than NN.
#2022ICCS Introducing conformal prediction. Where you build an error model. To get an idea of prediction error.
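[Ed: a sketch of the simplest split (inductive) conformal regression, which uses absolute residuals on a held-out calibration set to size the prediction interval. The talk mentioned building an error model, which would scale the interval per compound; that normalised variant is not shown here. Names and numbers are my own.]

```python
import math


def conformal_halfwidth(calib_true, calib_pred, confidence=0.8):
    """Interval half-width from calibration residuals at the given confidence."""
    residuals = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(residuals)
    rank = math.ceil((n + 1) * confidence)  # finite-sample corrected quantile
    if rank > n:
        return float("inf")  # too few calibration points for this confidence
    return residuals[rank - 1]


def predict_interval(pred, halfwidth):
    return pred - halfwidth, pred + halfwidth


# Nine calibration compounds with absolute errors 1..9.
calib_true = [0.0] * 9
calib_pred = [float(i) for i in range(1, 10)]
halfwidth = conformal_halfwidth(calib_true, calib_pred, confidence=0.8)
interval = predict_interval(5.0, halfwidth)
```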
#2022ICCS How long does it take? Avg prediction is about 50ms per cmpd, or 72K cmpds/h. Looking at some examples where it was slow - large MW fullerenes!
#2022ICCS Haven't yet spoken about feature generation. We used a combined desc set: 3220 descs. Used Pearson correlation on validation set to rank these. It's hard to beat ECFP, but if combined with MACCS or PhysChem descriptors, you can do better.
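[Ed: a sketch of ranking descriptors by Pearson correlation with the target. Ranking by the absolute value of the correlation is my assumption (the talk did not say), and the descriptor names and values are invented.]

```python
import math


def pearson(xs, ys):
    """Pearson correlation; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_descriptors(columns, target):
    """Descriptor names sorted by |Pearson r| with the target, strongest first."""
    return sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)


columns = {"desc_a": [1, 2, 3, 4], "desc_b": [1, 1, 1, 1], "desc_c": [4, 1, 3, 2]}
target = [2, 4, 6, 8]
ranking = rank_descriptors(columns, target)
```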
#2022ICCS Combining ChemTerm with ECFP actually made things worse. We wanted to investigate. Looking at RF importance. 50% are related to protonation or partitioning. "pKa of opioid ligands........" paper by Christoph Stein et al.
#2022ICCS Protonation detour - fentanyl F-derivative: FF3. The fluorine only affects the binding thru modulating the pKa. No diff at pH 6.5. Is difference at 7.4 (?).
#2022ICCS MoleculeNet dataset used to build blood brain barrier penetration model. MCC 0.6. Ended up with 43 descriptors on this dataset. Rich prediction results.
#2022ICCS Classification use-case. PAMPA permeability. Use PubChem BioAssay dataset. Standardise. Classify into low and high cases. Create tSNE plot based on MACCS keys. Two clusters. Try to train on cluster 1 and predict on cluster 2.
#2022ICCS "Trainer engine". Using gradient-boost method. Another example is for hERG based on a dataset from Scientific Reports (2019). ChemAxon versus SVM (training on the data) versus ACD, ADMET Predictor and StarDrop. Shows results.
#2022ICCS Describes architecture of design hub connected to model building. Design Hub is service agnostic. You simply set a production flag and then the model is available straightaway to the users.
#2022ICCS "Trainer Engine" and "Design Hub". Lower the barrier to build and deploy models to users.