@fabsta
Last active July 3, 2020 10:47
AI in drug discovery

Introduction

What's the problem

Success rates

Drawing

Drawing

image source

Costs of developing a drug

Drawing

source

Drawing

image source

The opportunity

The future of drug discovery

Generative models

Drawing

There are many startup companies

Videos

Posts

DiversityNet

(top)

Papers

Generators

  • Deep Reinforcement Learning for de-novo Drug Design (Moscow, 2018)
    Paper: Link
    Notes: Data: JAK2, ChEMBL, PubChem, 14,176 (logP), 15,549 (JAK2), and 47,425 (T_m)
    Method: Property prediction models, Training for the generative model, Stack-augmented recurrent neural network
    Code: , Jak2 demo: , RecurrentQSAR Jak2 Demo:

  • Prototype-Based Compound Discovery Using Deep Generative Models (Israel, 2018)
    Paper: Arxiv Link, Journal link
    Notes: Extend VAE to allow a conditional sampling – sampling an example from the data distribution (drug-like molecules) which is closer to a given input. Data:
    Method: VAE
    Code:

  • Conditional Molecular Design with Deep Generative Models (Korea, 2018):
    Paper: Arxiv link, Journal link
    Notes:
    Data: 310K Zinc, Code: , jupyter_embedding_test.ipynb, Fork link

  • Automatic chemical design using data driven continuous representation of molecules (Harvard, 2018):
    Paper: Link
    Notes: , Code: authors, simplified

  • De Novo Design at the Edge of Chaos (Schneider lab, 20??):
    Paper: Link, sci-hub
    Notes: , Method: Review

  • Reinforced Adversarial Neural Computer for de Novo Molecular Design (Insilico, 2018):
    Paper: Link
    Notes: schematic view, Methods, RANC based on Organic
    Code:

  • Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC, Harvard, 2017):
    Paper: Link
    Notes: Method: GAN, RL. Data: 15,000 drug-like molecules from ZINC, 15,000 drug-like molecules from ChemDiv
    Code:

  • Adversarial Threshold Neural Computer for Molecular de Novo Design (Insilico, 2018):
    Paper: Link
    Notes:
    Code:

  • Generating focused molecule libraries for drug discovery with RNNs (AstraZeneca, 2018):
    Paper: Link
    Notes:
    Data: ChEMBL, Mol Representation: SMILES Methods Code:

  • Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design (Benevolent.AI, 2018):
    Paper: Link
    Notes: , Code:

  • Grammar Variational Autoencoder (Alan Turing Institute, 2017):
    Paper: Link
    Notes:
    Code:

  • Application of generative autoencoder in de novo molecular design (AstraZeneca, 2017):
    Paper: Link
    Notes: , Code:

  • Syntax-Directed Variational Autoencoder for Structured Data (Georgia Tech, 2018):
    Paper: Link
    Notes:
    Code:

  • Deep Generative Models for Molecular Science (Denmark Tech, 2018):
    Paper: Link
    Notes:
    Code:

  • The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology (Insilico, 2017):
    Paper: Link
    Notes:
    Code:

  • druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico (Insilico, 2017):
    Paper: Link
    Notes:
    Code:

  • De novo drug design with deep generative models : an empirical study (??, 2017):
    Paper: Link
    Notes: RNN generative models for stochastic optimization in the context of de novo drug design.
    Code:

  • Molecular generation with recurrent neural networks (Wildcard consulting, 2017):
    Paper: Link
    Notes: RNN with LSTM cells can generate synthesizable molecules.
    Code:

  • Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control (Google, 2017):
    Paper: Link
    Notes: Method for improving sequence generated by a RNN
    Code:

  • Molecular De Novo Design through Deep Reinforcement Learning (AstraZeneca, 2017):
    Paper: Link
    Notes: Sequence-based generative model for molecular de novo design.
    Code:

  • ChemTS: An Efficient Python Library for de novo Molecular Generation (Tokyo, 2017):
    Paper: Link
    Notes: Python library ChemTS that explores the chemical space by combining Monte Carlo tree search (MCTS) and an RNN.
    Code:

  • Generative Recurrent Networks for De Novo Drug Design (Schneider lab, 2017):
    Paper: Link
    Notes: De novo design that utilizes RNN containing LSTM cells.
    Code:

  • Molecular generative model based on conditional variational autoencoder for de novo molecular design (KAIST Korea, 2018):
    Paper: Link
    Notes: Conditional variational autoencoder (CVAE) for de novo molecular design (5 properties, Aspirin, Tamiflu).
    Code:

  • Improving Chemical Autoencoder Latent Space and Molecular De novo Generation Diversity with Heteroencoders (Wildcard consulting, 2018):
    Paper: Link
    Notes: Dataset: GDB-8 dataset.
    Code:

Retro synthesis

  • Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies:
    Paper: link

  • Learning to Plan Chemical Syntheses:
    Paper: link

  • Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models:
    Paper: link, Code: link

  • Planning chemical syntheses with deep neural networks and symbolic AI:
    Paper: link

  • Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network
    Paper: link, Code: link

Overview of papers

| Paper | Author | Year | Abstract |
|---|---|---|---|
| De Novo Design at the Edge of Chaos | Schneider lab | | Current perspective on automated molecule generation. |
| Reinforced Adversarial Neural Computer for de Novo Molecular Design | Insilico | 2018 | Reinforced Adversarial Neural Computer (RANC) |
| Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry | Harvard | 2017 | Objective-Reinforced Generative Adversarial Networks (ORGANIC) |
| Adversarial Threshold Neural Computer for Molecular de Novo Design | Insilico | 2018 | Adversarial Threshold Neural Computer (ATNC), de novo design of novel small molecules (Generative Adversarial Networks (GANs) with Reinforcement Learning) |
| Generating focused molecule libraries for drug discovery with RNNs | AstraZeneca | 2018 | RNNs can be trained as generative models for molecular structures |
| Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design | Benevolent.AI | 2018 | 19 benchmarks, apply reinforcement learning techniques for molecular design |
| Automatic chemical design using data driven continuous representation of molecules | Harvard | 2018 | Convert molecules to and from a multidimensional continuous representation. |
| Conditional Molecular Design with Deep Generative Models | | 2018 | Conditional molecular design method that facilitates generating new molecules with desired properties. |
| Grammar Variational Autoencoder | Alan Turing Institute | 2017 | VAE using parse trees to check validity. |
| Application of generative autoencoder in de novo molecular design | AstraZeneca | 2017 | Performance of various autoencoders as generators |
| Syntax-Directed Variational Autoencoder for Structured Data | Georgia Tech | 2018 | Syntax-directed variational autoencoder (SD-VAE) with on-the-fly generated guidance for constraining the decoder |
| Deep Generative Models for Molecular Science | Denmark Tech | 2018 | Review of deep generative models for predicting molecular properties. Focus on autoencoders |
| The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology | Insilico | 2017 | First application of AAE for generating novel molecular fingerprints with a defined set of parameters. |
| druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico | Insilico | 2017 | AAE and its advantages compared to VAE |
| De novo drug design with deep generative models : an empirical study | | 2017 | RNN generative models for stochastic optimization in the context of de novo drug design. |
| Molecular generation with recurrent neural networks | Wildcard consulting | 2017 | RNN with LSTM cells can generate synthesizable molecules. |
| Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control | Google | 2017 | Method for improving sequences generated by an RNN |
| Molecular De Novo Design through Deep Reinforcement Learning | AstraZeneca | 2017 | Sequence-based generative model for molecular de novo design. |
| ChemTS: An Efficient Python Library for de novo Molecular Generation | Tokyo | 2017 | Python library ChemTS that explores the chemical space by combining Monte Carlo tree search (MCTS) and an RNN |
| Generative Recurrent Networks for De Novo Drug Design | Schneider lab | 2017 | De novo design that utilizes RNN containing LSTM cells. |
| Molecular generative model based on conditional variational autoencoder for de novo molecular design | KAIST Korea | 07.2018 | Conditional variational autoencoder (CVAE) for de novo molecular design (5 properties, Aspirin, Tamiflu). |

Generative models

It’s good to start with models already tried in the literature:

For GANs, there are different flavors: Wasserstein-GAN (Facebook), Cramer-GAN (DeepMind), Optimal Transport-GAN (OpenAI), Coulomb-GAN (Linz University), although in the end they may all perform about equally (Google).

(top)

You can also find more in the Natural Language Processing literature (and apply them to SMILES):

Benchmarks

(top)

Software

DeepChem

molecule generator (wildcard consulting) (top)

Data

In most papers, data is taken from:

source

Code

Labs

Rating of labs in AI for drug discovery

(top)

Companies

101 Startups Using Artificial Intelligence in Drug Discovery

Hot areas

Generate Novel Drug Candidates

Theory

Molecule representations

From a computational perspective, a drug-like molecular structure can be represented in five ways:

  • SMILES (41) or InChI (42),
  • molecular fingerprint (43),
  • set of molecular descriptors, such as molecular weight, logP, number of heavy atoms, number of rotatable bonds, etc.,
  • graph in which atoms are nodes and links are bonds between atoms, or
  • 3D electron density map
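To make some of these options concrete, here is a minimal sketch of ethanol written as a SMILES string, as a graph, and as a small descriptor set. It uses only plain Python; the data layout and the descriptor names (`heavy_atoms`, `bonds`, `hetero_atoms`) are assumptions chosen for illustration, and a real pipeline would normally derive them with a cheminformatics toolkit.

```python
# Ethanol (CH3-CH2-OH) in three of the representations listed above.
# Plain-Python illustration; the layout is an assumption, not a standard.

smiles = "CCO"  # SMILES string, hydrogens are implicit

# Molecular graph: heavy atoms are nodes, bonds are edges with a bond order.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom_i, atom_j, bond_order)


def adjacency(atoms, bonds):
    """Build an adjacency list from the atom and bond tables."""
    adj = {i: [] for i in atoms}
    for i, j, _order in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def descriptors(atoms, bonds):
    """Compute a few toy descriptors directly from the graph."""
    return {
        "heavy_atoms": len(atoms),
        "bonds": len(bonds),
        "hetero_atoms": sum(1 for symbol in atoms.values() if symbol != "C"),
    }


adj = adjacency(atoms, bonds)
desc = descriptors(atoms, bonds)
```

The graph keeps the full topology, while the descriptor dictionary is a lossy summary: many different molecules share the same three numbers.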

Comparison

Drawing

SMILES

Short for Simplified Molecular-Input Line-Entry System: a molecule is written as a single line of text.
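Because a SMILES string is plain text, it can be fed to sequence models after tokenization. The sketch below is a simplified tokenizer and one-hot encoder, assuming that bracket atoms, `Cl`, `Br`, and two-digit ring closures are the only multi-character tokens; the function names are hypothetical and the regular expression does not cover the full SMILES grammar.

```python
import re

# Bracket atoms ([NH4+]), two-letter halogens (Cl, Br), two-digit ring
# closures (%10), then any other single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return _TOKEN.findall(smiles)


def one_hot(tokens, vocab):
    """Encode each token as a 0/1 vector over the vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [[1 if index[tok] == i else 0 for i in range(len(vocab))]
            for tok in tokens]


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(aspirin)
vocab = sorted(set(tokens))
encoded = one_hot(tokens, vocab)
```

Each row of `encoded` is one time step for an RNN or one position for an autoencoder input.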

Drawing

Disadvantages (of fingerprints, compared with SMILES strings)

  • One fingerprint can match several molecules, so there is no one-to-one mapping between molecules and fingerprints.
  • The fingerprint representation contains less information about the molecule topology than the string representation.

Grammar

Graph convolutions

(top)

Diversity metrics

Designing evaluation metrics is an important part of the challenge. These metrics assess the quality and diversity of generated samples. Here, contributions from medicinal chemists and statisticians are especially welcome.

Measures of diversity are based on distance metrics in the chemical space. Such a distance indicates how chemically close two molecules are. The most popular choice is the Tanimoto distance on Morgan fingerprints. The details of the definition matter less than the fact that these fingerprints are hand-crafted features; it is probably better to replace them with deep learning features, as suggested in the MoleculeNet benchmark.

Let’s denote:

  • T_d: the distance in the chemical space.
  • A: the set of generated molecules with desired properties. Its size is written |A|.
  • B: the training set.
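As a concrete instance of T_d, the sketch below computes a Tanimoto distance between two fingerprints, assuming each fingerprint is given as a Python set of "on" bit indices. Returning 0 for two empty fingerprints is a convention chosen here, not something the source specifies.

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance: 1 - |A and B| / |A or B|."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / union


d_partial = tanimoto_distance({1, 2, 3}, {2, 3, 4})   # 2 shared bits of 4
d_same = tanimoto_distance({1, 2, 3}, {1, 2, 3})
d_disjoint = tanimoto_distance({1, 2}, {3, 4})
```

The distance is 0 for identical fingerprints and 1 for fingerprints with no bits in common.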

Nearest neighbor diversity

This is the average distance between a generated molecule in A and its nearest neighbor in the training set B. The formula is:

$$NN(A,B)=\frac{1}{|A|}\sum_{x\in A}\min_{y\in B}T_{d}(x,y)$$
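The formula can be sketched in plain Python as follows. Fingerprints are assumed to be sets of "on" bit indices and T_d is taken to be the Tanimoto distance; both choices, and the function names, are illustrative assumptions.

```python
def tanimoto_distance(a, b):
    """1 - |a and b| / |a or b| on sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def nearest_neighbor_diversity(generated, training, dist=tanimoto_distance):
    """Average distance from each generated molecule to its closest training molecule."""
    if not generated or not training:
        raise ValueError("both sets must be non-empty")
    return sum(min(dist(x, y) for y in training) for x in generated) / len(generated)


A = [{1, 2, 3}, {7, 8}]                # generated set
B = [{1, 2, 3}, {2, 3, 4}, {8, 9}]     # training set
nn = nearest_neighbor_diversity(A, B)
```

A value near 0 means the generator mostly reproduces training molecules; larger values mean the generated molecules sit further from anything seen in training.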

Internal diversity

This is the average distance between the desired generated molecules themselves. The formula is:

$$I(A)=\frac{1}{|A|^{2}}\sum_{(x,y)\in A\times A}T_{d}(x,y)$$
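A direct translation of this formula into Python is sketched below, again assuming fingerprints stored as sets and a Tanimoto distance for T_d. The sum runs over all ordered pairs, including each molecule paired with itself, exactly as the formula is written.

```python
from itertools import product


def tanimoto_distance(a, b):
    """1 - |a and b| / |a or b| on sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def internal_diversity(generated, dist=tanimoto_distance):
    """Mean pairwise distance over A x A (self-pairs included, as in the formula)."""
    n = len(generated)
    if n == 0:
        raise ValueError("generated set must be non-empty")
    total = sum(dist(x, y) for x, y in product(generated, repeat=2))
    return total / n ** 2


A = [{1, 2}, {2, 3}, {5}]
diversity = internal_diversity(A)
```

Because the self-pairs contribute a distance of 0, the largest value this definition can reach for n molecules is (n - 1) / n rather than 1.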

Earth Mover Distance with a reference dataset

Another measure of internal diversity is to compare the set of generated samples with a reference set, which is known to be diverse a priori. For example, the ZINC dataset seems suitable. Chemists can propose alternative reference datasets.

The idea is to take a random subset of the reference set with the same size as the generated set, then to treat those two sets as two piles of sand in the chemical space and measure the energy necessary to move the first pile into the second pile (this measure is known as Earth Mover Distance in statistics, and Wasserstein metric in mathematics).
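For two sets of equal size with equal weight on every molecule, the Earth Mover Distance reduces to the cheapest one-to-one matching between the sets, divided by the set size. The sketch below finds that matching by brute force, which is only practical for a handful of molecules; a real implementation would use an assignment solver instead. The function names and the set-based Tanimoto distance are assumptions for illustration.

```python
import random
from itertools import permutations


def tanimoto_distance(a, b):
    """1 - |a and b| / |a or b| on sets of on-bit indices."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def sample_reference(reference, size, seed=0):
    """Random subset of the reference set with the same size as the generated set."""
    return random.Random(seed).sample(reference, size)


def earth_mover_distance(set_a, set_b, dist=tanimoto_distance):
    """Minimum average transport cost between two equal-size sets (brute force)."""
    n = len(set_a)
    if n == 0 or n != len(set_b):
        raise ValueError("sets must be non-empty and of equal size")
    best = min(
        sum(dist(set_a[i], set_b[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n


emd_same = earth_mover_distance([{1, 2}, {3, 4}], [{3, 4}, {1, 2}])
emd_half = earth_mover_distance([{1}, {2}], [{1}, {3}])
```

A small distance to the reference subset suggests the generated set covers the chemical space about as broadly as the reference does.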

Inception score

(OpenAI): this metric uses the Inception predictive model, a standard image classifier (a winner of the ImageNet challenge). A generative model has a high Inception score when the Inception model is very confident that each generated image belongs to a particular ImageNet category and when all categories are equally represented. A high score therefore suggests that the generative model has both high quality and diversity.
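The usual definition is the exponential of the average KL divergence between each sample's predicted class distribution and the marginal class distribution over all samples. Below is a minimal sketch that assumes the classifier outputs are already available as rows of probabilities; the function name is hypothetical and no classifier is included.

```python
import math


def inception_score(probs):
    """exp(mean_x KL(p(y|x) || p(y))) for a list of per-sample class probabilities."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    kl_sum = 0.0
    for row in probs:
        for p, m in zip(row, marginal):
            if p > 0:  # 0 * log(0) is treated as 0
                kl_sum += p * math.log(p / m)
    return math.exp(kl_sum / n)


confident_and_balanced = inception_score([[1.0, 0.0], [0.0, 1.0]])
unsure = inception_score([[0.5, 0.5], [0.5, 0.5]])
collapsed = inception_score([[1.0, 0.0], [1.0, 0.0]])
```

The score ranges from 1 (unsure predictions, or every sample in the same class) up to the number of classes (confident predictions spread evenly over all classes).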

Fréchet Inception Distance

(Linz University): it computes a distance between distributions of the training data and of the generated data. See their Fréchet ChEMBLNet distance.
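For reference, the Fréchet distance between two Gaussians fitted to feature activations is usually written as below. The symbols are chosen here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of the features of the training data, and $\mu_g, \Sigma_g$ those of the generated data.

$$d^{2}\big((\mu_r,\Sigma_r),(\mu_g,\Sigma_g)\big)=\lVert\mu_r-\mu_g\rVert_2^{2}+\mathrm{Tr}\!\left(\Sigma_r+\Sigma_g-2\,(\Sigma_r\Sigma_g)^{1/2}\right)$$

The distance is 0 when both distributions have the same mean and covariance, and it grows as either the means or the covariances drift apart.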

(top)

Architectures

RNNs

Modeling Molecules with Recurrent Neural Networks

Autoencoders

Denoising autoencoder

Neural inpainting

Variational autoencoder

Learns a distribution: the usual bottleneck vector z is replaced by two vectors:

  • mean
  • standard deviation

Loss function: reconstruction loss + KL divergence (the KL term keeps the learned distribution from drifting too far from a normal distribution)
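The two-vector bottleneck and the loss can be sketched without any deep learning framework. The code below assumes a diagonal Gaussian posterior, a standard normal prior, and squared error as the reconstruction loss (cross-entropy is another common choice); the `beta` weight and the function names are hypothetical additions for illustration.

```python
import math
import random


def reparameterize(mean, std, rng):
    """Sample z = mean + std * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def kl_to_standard_normal(mean, std):
    """KL( N(mean, std^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mean, std))


def vae_loss(x, x_reconstructed, mean, std, beta=1.0):
    """Reconstruction loss (squared error, an assumption) plus weighted KL term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    return reconstruction + beta * kl_to_standard_normal(mean, std)


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0], [1.0])
loss = vae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
```

The KL term is 0 only when the encoder outputs mean 0 and standard deviation 1, which is what pulls the latent space toward a standard normal.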

Disentangled autoencoder

Github code: Molecule generator chembl autoencoder

Notes for molecule generator autoencoder

Paper to reimplement: paper, code

Starting points: MNIST: PyTorch VAE example, another, more detailed. Convolutional autoencoder (exercise/solution). Denoising autoencoder (exercise/solution).

pytorch/VAE Graph decoders DeepChem issue, icml18-jtnn, github 2

Theory VAE explained

(top)

Notes from talks/Papers

Syntax directed variational autoencoders and other methods of drug discovery (SD-VAE)

Video: here

Deep learning for ligand-based de novo design in lead optimization: a real life case study

http://iktos.ai/successful-lead-optimization-project-in-collaboration-with-servier-presented-at-efmc2018/ Video: here Poster: here

A B C
Drawing Drawing Drawing
Drawing Drawing

Matched molecular pairs

  • Goal: associate defined structural modifications with chemical property changes, including biological activity (SAR)
  • It is argued that longer matched series are more likely to exhibit a preferred molecular transformation, while matched pairs exhibit only a small preference
Drawing

Activity Cliff

  • a large change in potency that corresponds to a small change in the molecular structure

  • high SAR information content
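One simple way to flag activity cliffs is to scan all compound pairs for high structural similarity combined with a large potency gap. The sketch below assumes fingerprints stored as sets and potency given on a log scale (for example pIC50); the thresholds `min_similarity` and `min_potency_gap`, the function names, and the example values are hypothetical.

```python
from itertools import combinations


def tanimoto_similarity(a, b):
    """|a and b| / |a or b| on sets of on-bit indices."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def activity_cliffs(compounds, min_similarity=0.7, min_potency_gap=2.0):
    """Return pairs that are structurally similar but differ strongly in potency.

    compounds: list of (name, fingerprint_set, potency) tuples.
    """
    cliffs = []
    for (n1, f1, p1), (n2, f2, p2) in combinations(compounds, 2):
        sim = tanimoto_similarity(f1, f2)
        gap = abs(p1 - p2)
        if sim >= min_similarity and gap >= min_potency_gap:
            cliffs.append((n1, n2, sim, gap))
    return cliffs


# Made-up toy data, not taken from any paper.
compounds = [
    ("a", {1, 2, 3, 4}, 8.5),
    ("b", {1, 2, 3, 4, 5}, 5.9),
    ("c", {9, 10}, 6.0),
]
cliffs = activity_cliffs(compounds)
```

Here only the pair (a, b) is reported: the two fingerprints share 4 of 5 bits, yet the potencies differ by more than two log units.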

Literature McPairs software youtube

Detailed papers (archive)

De novo design at the edge of chaos (miniperspective, Gisbert Schneider)

Drawing Drawing Drawing Drawing

(top)

RANC (Insilico)

Putin, E., et al. (2018). "Reinforced Adversarial Neural Computer for de Novo Molecular Design." Journal of chemical information and modeling.

Drawing Drawing Drawing
Drawing Drawing Drawing

(top)

ORGANIC

Sanchez-Lengeling, B., et al. (2017). Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC).

Drawing
(top)

ATNC (Insilico)

Putin, E., et al. (2018). "Adversarial Threshold Neural Computer for Molecular de Novo Design." Molecular pharmaceutics.

Drawing Drawing
Drawing Drawing
(top)
Idea: Drawing
Methods: Drawing Drawing
Results: Drawing Drawing Drawing Drawing

Generating focused molecule libraries for drug discovery with RNNs (AstraZeneca)

Generating focused molecule libraries for drug discovery with recurrent neural networks.

In _de novo_ drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target. In this work, we show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model. In order to enrich libraries with molecules active toward a given biological target, we propose to fine-tune the model with small sets of molecules, which are known to be active against that target. Against _Staphylococcus aureus_, the model reproduced 14% of 6051 hold-out test molecules that medicinal chemists designed, whereas against _Plasmodium falciparum_ (Malaria), it reproduced 28% of 1240 test molecules. When coupled with a scoring function, our model can perform the complete _de novo_ drug design cycle to generate large sets of novel molecules for drug discovery.

Drawing Drawing Drawing Drawing

(top)

19 tasks as an OpenAI Gym for molecular generation

Neil, D., et al. (2017). "Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design."

Drawing Drawing Drawing
(top)

Autoencoder for molecular design (Harvard, 2018)

Gomez-Bombarelli, R. "Automatic chemical design using data driven continuous representation of molecules."

Drawing Drawing Drawing
Drawing Drawing Drawing

(top)

Conditional molecular design (Korea, 2018)

Kang, S. and K. Cho (2018). "Conditional Molecular Design with Deep Generative Models." J Chem Inf Model.

Drawing Drawing Drawing
Drawing Drawing Drawing

Grammar variational autoencoder (Alan Turing Institute, 2017)

Matt J. Kusner, et al. (2017). "Grammar Variational Autoencoder."

Drawing Drawing Drawing

(top)

Application of generative autoencoder (AstraZeneca, 2017)

Blaschke, T., et al. (2017). "Application of generative autoencoder in de novo molecular design."

Drawing Drawing Drawing
Drawing Drawing Drawing

(top)

Syntax-directed variational autoencoder (Georgia Tech, 2018)

Dai, H., et al. (2018). "Syntax-Directed Variational Autoencoder for Structured Data."

Drawing Drawing Drawing
Drawing Drawing Drawing
Drawing Drawing

(top)

Deep Generative Models for Molecular Science (Denmark Tech, 2018)

Jorgensen, P. B., et al. (2018). "Deep Generative Models for Molecular Science." Mol Inform 37(1-2).

We review these recent advances within deep generative models for predicting molecular properties, with particular focus on models based on the probabilistic autoencoder (or variational autoencoder, VAE) approach in which the molecular structure is embedded in a latent vector space from which its properties can be predicted and its structure can be restored.

(top)

Cornucopia of meaningful leads with deep adversarial autoencoders (Insilico, 2017)

Kadurin, A., et al. (2017). "The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology." Oncotarget

Drawing Drawing

druGAN (Insilico, 2017)

Kadurin, A., et al. (2017). "druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico." Molecular pharmaceutics 14(9): 3098-3104.

Drawing Drawing
Drawing Drawing
(top)

Empirical study (2017)

De novo drug design with deep generative models : an empirical study

Molecular generators + chemplanner (Bjerrum, 2017)

Bjerrum, E. J. and R. Threlfall (2017). "Molecular generation with recurrent neural networks."

Drawing Drawing Drawing

(top)

Music and sequence generation tutor (Google, 2017)

Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control

This paper proposes a general method for improving the structure and quality of sequences generated by a recurrent neural network (RNN), while maintaining information originally learned from data, as well as sample diversity. An RNN is first pre-trained on data using maximum likelihood estimation (MLE), and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Another RNN is then trained using reinforcement learning (RL) to generate higher-quality outputs that account for domain-specific incentives while retaining proximity to the prior policy of the MLE RNN. To formalize this objective, we derive novel off-policy RL methods for RNNs from KL-control. The effectiveness of the approach is demonstrated on two applications; 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data.

Drawing Drawing

(top)

Molecular De Novo Design through Deep Reinforcement Learning (AstraZeneca, 2017)

Olivecrona, M., et al. (2017). "Molecular de-novo design through deep reinforcement learning." Journal of cheminformatics 9(1): 48.

Sequence-based generative model for molecular de novo design.

Drawing Drawing Drawing Drawing Drawing Drawing

(top)

ChemTS: An Efficient Python Library for de novo Molecular Generation (Tokyo, 2017)

ChemTS: An Efficient Python Library for de novo Molecular Generation

Automatic design of organic materials requires black-box optimization in a vast chemical space. In conventional molecular design algorithms, a molecule is built as a combination of predetermined fragments. Recently, deep neural network models such as variational auto encoders (VAEs) and recurrent neural networks (RNNs) are shown to be effective in de novo design of molecules without any predetermined fragments. This paper presents a novel python library ChemTS that explores the chemical space by combining Monte Carlo tree search (MCTS) and an RNN. In a benchmarking problem of optimizing the octanol-water partition coefficient and synthesizability, our algorithm showed superior efficiency in finding high-scoring molecules. ChemTS is available at https://github.com/tsudalab/ChemTS.

Drawing Drawing

(top)

Generative Recurrent Networks for De Novo Drug Design (Schneider lab, 2017)

Generative Recurrent Networks for De Novo Drug Design

Generative artificial intelligence models present a fresh approach to chemogenomics and _de novo_ drug design, as they provide researchers with the ability to narrow down their search of the chemical space and focus on regions of interest. We present a method for molecular _de novo_ design that utilizes generative recurrent neural networks (RNN) containing long short‐term memory (LSTM) cells. This computational model captured the syntax of molecular representation in terms of SMILES strings with close to perfect accuracy. The learned pattern probabilities can be used for _de novo_ SMILES generation. This molecular design concept eliminates the need for virtual compound library enumeration. By employing transfer learning, we fine‐tuned the RNN's predictions for specific molecular targets. This approach enables virtual compound design without requiring secondary or external activity prediction, which could introduce error or unwanted bias. The results obtained advocate this generative RNN‐LSTM system for high‐impact use cases, such as low‐data drug discovery, fragment based molecular design, and hit‐to‐lead optimization for diverse drug targets.

Drawing Drawing Drawing Drawing

Written with StackEdit.
