AI + QSAR: helping drug discovery efforts

>>> Welcome to Club QSAR

Many believe that bringing pharmaceuticals to market is a quick and easy process that requires little regulation or experimentation. This is not the case. The drug development process is long, arduous, and costly, as the figure below illustrates.

[Figure: the drug discovery process]

In this post, I will focus on the drug discovery (research & development) stage, which aims to identify the best drug candidate among many molecules that can have the desired therapeutic effect on a biological target of interest (e.g., a protein).

Drug candidates are identified through many in vitro (“in glass”) experiments that, although necessary, cost scientists a great deal of time and resources, some of which could be saved by using computational rather than experimental means.

Quantitative structure-activity relationship (QSAR) modelling is one of the main cheminformatics approaches for discovering small chemical compounds (drug candidates) that have the desired activity against a therapeutic target (usually a protein playing a vital role in the disease) while minimizing the likelihood of off-target effects, which can cause toxicity. Such predictions help prioritize drug discovery experiments, reducing work and resource costs. QSAR usually relies on ligand-based models, in which only the small molecule is modeled and the protein is ignored because of its complex structure (see Blog Post).

In its simplest form, the activities of many small molecules against a single protein are measured experimentally, and a model is then trained on the molecules' specific features (fingerprints), i.e., the counts and arrangements of atoms and functional groups within each molecule. Alternatively, the model can learn these fingerprints by deriving them from chemical structures using an auto-encoder.
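
To make the fingerprint idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the feature list, the SMILES strings, the activity values (placeholders, not real measurements), and the similarity-weighted average, which is only a stand-in for a real model such as a random forest or a DNN. Real fingerprints also encode how atoms are arranged, which this toy version does not.

```python
import re
from collections import Counter

# Hypothetical feature list: a few atom types, double bonds and rings.
FEATURES = ["C", "N", "O", "Cl", "=", "ring"]

def toy_fingerprint(smiles):
    """Count a few atom types, double bonds and rings in a SMILES string."""
    # Match two-letter halogens first so "Cl" is not read as "C" followed by "l".
    tokens = re.findall(r"Cl|Br|[A-Za-z]|=|\d", smiles)
    counts = Counter()
    for tok in tokens:
        if tok.isdigit():
            counts["ring"] += 1        # each ring closure digit appears twice
        elif len(tok) == 1 and tok.isalpha():
            counts[tok.upper()] += 1   # aromatic "c" is counted as carbon
        else:
            counts[tok] += 1
    counts["ring"] //= 2
    return [counts[f] for f in FEATURES]

def tanimoto(a, b):
    """Similarity of two count vectors: shared counts divided by total counts."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    return shared / total if total else 1.0

def predict_activity(smiles, training_set):
    """Similarity-weighted average of the known activities (stand-in model)."""
    query = toy_fingerprint(smiles)
    weights = [tanimoto(query, toy_fingerprint(s)) for s, _ in training_set]
    if sum(weights) == 0:
        # Nothing in common with any training molecule: fall back to the mean.
        return sum(y for _, y in training_set) / len(training_set)
    return sum(w * y for w, (_, y) in zip(weights, training_set)) / sum(weights)

# Placeholder activities against a single protein, for illustration only.
TRAINING_SET = [("CCO", 4.0), ("CCN", 4.5), ("c1ccccc1", 6.0), ("CC(=O)O", 3.5)]

prediction = predict_activity("CCCO", TRAINING_SET)
```

The prediction for the unseen molecule is pulled towards the activities of the training molecules whose fingerprints it most resembles, which is the intuition behind ligand-based QSAR.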

The most promising QSAR methods before deep learning were variations of random forests (RF) and support vector machines. That changed when Merck, one of the leading pharmaceutical companies, sponsored a Kaggle competition to examine which machine learning approaches provide the most effective solutions to QSAR problems. The winning entry outperformed RF using an ensemble that included Gaussian process (GP) regression, in which deep neural networks (DNN) were the primary factor (see Insight into DNN).

Multi-task DNN, for example, have improved on the single-protein approach described above by allowing compounds to be assessed against multiple proteins at once. Conceptually, this makes it possible to learn from less data per protein by exploiting the fact that molecules with similar features behave similarly across multiple proteins.
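
One practical consequence is that the training data becomes a table of molecules by proteins with many empty cells, since not every molecule has been tested against every protein. A common way to handle this (an assumption on my part, not something the competition entries are confirmed to have done) is to compute the training error only over the cells that were actually measured. The sketch below uses placeholder numbers and a hypothetical helper name, `masked_mse`.

```python
def masked_mse(predicted, measured):
    """Mean squared error over the (molecule, protein) pairs that were measured."""
    errors = [
        (p - m) ** 2
        for pred_row, meas_row in zip(predicted, measured)
        for p, m in zip(pred_row, meas_row)
        if m is not None   # skip pairs with no experimental value
    ]
    return sum(errors) / len(errors) if errors else 0.0

# Rows are molecules, columns are proteins; None marks "not measured".
# All numbers are placeholders for illustration.
measured = [
    [5.0, None, 6.0],
    [None, 4.0, None],
    [7.0, 7.5, None],
]
# What a multi-task model might output: one prediction per molecule and protein.
predicted = [
    [5.5, 3.0, 6.0],
    [4.0, 4.0, 2.0],
    [6.0, 7.5, 9.0],
]

loss = masked_mse(predicted, measured)
```

Because every molecule contributes to the error through whichever proteins it was tested on, a protein with few measurements can still benefit from what the model learned about similar molecules elsewhere in the table.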

Deep learning can also address a key limitation of both single-task and multi-task models: the proteins most in need of prediction are the hardest to predict for, because their data sets are scarce.

A promising approach is AtomNet, a deep convolutional neural network (DCNN) that directly models both the molecule and the structure of the protein, so that it can predict bioactivity for novel proteins with no experimental biological activity data. AtomNet was introduced as the first deep neural network made specifically for structure-based binding affinity prediction.

 

Insight into DNN—————————————————————————————————

[Figure: (a) a single neuron; (b) a deep neural network with several layers]

DNN are a class of deep learning algorithms built from a network of “neurons”. A neuron (a) has many inputs (the input arrows) and one output (the output arrow). Each input arrow is associated with a weight w_i. To see what weights do, suppose we trained a model to identify pedestrians in images, but the pedestrians always appeared in the centre of the image: the model would not recognize pedestrians elsewhere in the image, because each part of the image is associated with a different weight. The neuron also has an activation function, f(z), and a bias term b. A row of neurons forms a layer of the neural network, and a DNN has several layers (b), where each output neuron produces a prediction for a separate end point (e.g., an assay result).
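
The description above can be written out in a few lines of Python. This is only a sketch: the sigmoid activation, the layer sizes, and all weights and biases are assumptions chosen for illustration, and a real DNN would learn its weights from data rather than have them typed in.

```python
import math

def sigmoid(z):
    """One common choice for the activation function f(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    """No activation, used here so the outputs are plain real numbers."""
    return z

def neuron(inputs, weights, bias, activation=sigmoid):
    """Output of one neuron: f(w_1*x_1 + ... + w_n*x_n + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def layer(inputs, weight_rows, biases, activation=sigmoid):
    """A layer is a row of neurons that all read the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Two input features -> hidden layer of three neurons -> two output neurons.
x = [1.0, 0.0]
hidden = layer(x, [[0.5, -0.2], [0.0, 0.3], [-1.0, 1.0]], [0.0, 0.1, 0.5])
outputs = layer(hidden, [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]], [0.0, 0.0],
                activation=identity)
```

Here `outputs` holds two numbers, one per output neuron, which matches the idea of each output neuron predicting a separate end point.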

——————————————————————————————————————————