Predicting Drug Combinations
Introduction
Drug combinations can be remarkably effective medicines. For example, antiretroviral
therapy (ART) has dramatically reduced the HIV-related mortality rate. On their own,
the constituent drugs of ART are ineffective - HIV rapidly mutates and acquires
resistance to them. When the constituent drugs of ART are combined, any HIV particles that acquire resistance to a single drug are eliminated by the others. Presumably, multi-drug cancer therapies will also be more effective than single drugs.
One approach to finding a candidate drug or drug combination is through Connectivity Mapping. This approach measures how gene expression changes in response to a disease and then searches for drugs that cause the opposite changes in gene expression. Drugs found in this manner are predicted to reverse the gene expression changes caused by the disease, and thereby to reverse the disease itself. The Broad Institute took up this approach when it assayed how gene expression changes in response to 1309 different drugs.
This may seem like a large pool of drugs in which to search for candidates. However, if you also include all unique two-drug combinations of the 1309 assayed drugs, the pool grows to a staggering 856086 potential medicines. It is currently infeasible to assay all of these combinations, but their expression profiles can be predicted.
In this post, I show you the approach that I took to predict drug combinations for my Bioconductor package ccmap. I first implement a neural network in Python followed by a gradient boosted random forest in R. Along the way, I introduce the concepts of data augmentation and stacking. You can follow along by downloading the training data from here.
Training Data
The training data consists of all the microarray data I could find in GEO where single treatments and their combinations were assayed. In total, 148 studies with 257 treatment combinations were obtained. Only 3483 genes were measured in common across all of the studies, so a separate neural network was trained to infer the missing values (not covered here). Let’s load up the data:
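The original loading code isn’t reproduced here, so below is a minimal sketch of what it might look like, assuming the download was saved as a NumPy archive; the file name and array names (dA, dB, their variances vA and vB, and the measured combination effect dcombo) are placeholders used throughout the snippets that follow.

```python
# Minimal sketch of loading the training data (file name and array names are
# assumptions - adjust them to match the downloaded files).
import numpy as np

data = np.load('combo_training_data.npz')

dA     = data['dA']      # effect of treatment A on each gene (257 x 11525)
dB     = data['dB']      # effect of treatment B on each gene
vA     = data['vA']      # variance of each individual effect (may contain NaN)
vB     = data['vB']
dcombo = data['dcombo']  # measured effect of the combined treatment

print(dA.shape, dB.shape, dcombo.shape)
```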
A Tale of Two Models
Let me tell you the story of Simple Model and Hopeless Model.
Simple Model, like her name suggests, takes a simple approach to predicting how gene expression will change in response to a combination of two treatments. Simple Model looks at each gene one at a time and asks: “What was the effect of the two individual treatments on this gene?”. Simple Model then does something reasonable, but simple, like averaging the effect of the two individual treatments.
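As a sketch, Simple Model’s prediction for every gene is just the element-wise mean of the two individual effects (using the placeholder arrays loaded above):

```python
# Simple Model: for each gene, average the effects of the two individual
# treatments to predict the effect of the combination.
def simple_model(dA, dB):
    return (dA + dB) / 2

simple_preds = simple_model(dA, dB)
```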
Hopeless Model is much more ambitious and thinks Simple Model a bit simple. In order to predict how gene expression will change in response to a combination of two treatments, Hopeless Model looks at how the first treatment affected all 11525 genes AND how the second treatment affected all 11525 genes. By doing this, Hopeless Model thinks he can find relationships that Simple Model has no chance of discovering. Unfortunately, Hopeless Model only has 257 samples, so many of the relationships he finds hold only on those samples; they don’t generalize to data he hasn’t observed. In other words, he overfits (see here for a description of prerequisites and the various neural network parameters):
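The original network and its parameters aren’t shown here; the sketch below uses Keras, and the layer sizes, activation, and training settings are placeholder assumptions rather than the architecture actually used in ccmap.

```python
# Hopeless Model sketch: a dense network that sees both full treatment
# profiles (2 x 11525 inputs) and predicts the combination profile
# (11525 outputs). Library, layer sizes, and hyperparameters are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

n_genes = dA.shape[1]  # 11525

def build_hopeless():
    model = keras.Sequential([
        layers.Input(shape=(2 * n_genes,)),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(n_genes),          # one output per gene
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# With only 257 samples and ~23000 inputs, this model overfits badly.
X = np.hstack([dA, dB])
hopeless = build_hopeless()
hopeless.fit(X, dcombo, epochs=50, batch_size=16, validation_split=0.2)
```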
Data Augmentation
Hopeless Model, realizing his folly, figures out a clever way to improve his predictions. He reasons that the expression of each gene is affected by two separate treatments (by amounts dA and dB), and that the effect of the combined treatment should be similar irrespective of which treatment is responsible for which effect. This reasoning allows him to randomly swap dA and dB for each gene, thereby giving him access to an essentially limitless amount of training data (lines that end with #! indicate a change from the previous model):
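Below is a sketch of that augmentation as a batch generator over the placeholder arrays (the generator structure and batch size are assumptions):

```python
# Data augmentation sketch: for each gene in each batch, randomly swap the
# contributions of treatment A and treatment B before training.
def augmented_batches(dA, dB, dcombo, batch_size=16):
    n = dA.shape[0]
    while True:
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            rows = order[start:start + batch_size]
            a, b = dA[rows], dB[rows]
            swap = np.random.rand(*a.shape) < 0.5   #!
            a_aug = np.where(swap, b, a)             #!
            b_aug = np.where(swap, a, b)             #!
            yield np.hstack([a_aug, b_aug]), dcombo[rows]

hopeless = build_hopeless()
hopeless.fit(augmented_batches(dA, dB, dcombo),
             steps_per_epoch=dA.shape[0] // 16, epochs=50)
```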
Stacking
Hopeless Model is feeling a bit down about his accuracy and seeks consolation from Simple Model. While consoling her friend, Simple Model realizes that there are certain situations when she can make a better prediction by considering both Hopeless Model’s predictions AND the effect of the two individual treatments on a given gene. This news sure cheers up Hopeless Model!
This is a description of the machine learning approach called stacking (MLwave has a fantastic guide to stacking and other variations of model ensembling). One important subtlety of stacking is that in order for Simple Model to effectively learn when Hopeless Model’s predictions should be incorporated, Hopeless Model can’t have been trained on the data that he is providing predictions for. If he has, Hopeless Model’s predictions will seem strangely accurate and end up being weighted too heavily.
To get around this, we train two separate Hopeless Models. Each sees half of the data and then provides predictions for the other half (figure below). By doing this, we get Hopeless Model’s predictions for the entirety of the training data and ensure that those predictions are a fair reflection of his ability.
For our purposes, Hopeless Model’s predictions will be stacked with a gradient boosted random forest. Hopeless Model will make his predictions (figure below on left - transparent purple circles) and then pass them to the random forest (figure below on right). Each sample provided to the random forest will contain the information for only a single gene from one study. For each sample, the random forest will have access to Hopeless Model’s prediction and to the effect of the two individual treatments (both effect sizes - solid red and blue circles, and variances - feathered red and blue circles). Note that xgboost doesn’t mind missing data, so we don’t need to infer the missing variances.
Let’s remove the evaluation set and train our two Hopeless Models, each on half of the data:
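A sketch of that two-fold setup, reusing the pieces above (the split fractions, random seed, and training settings are assumptions):

```python
# Hold out an evaluation set, split the remainder in half, train one
# Hopeless Model per half, and have each predict the half it never saw.
rng = np.random.RandomState(0)
idx = rng.permutation(dA.shape[0])

n_eval = len(idx) // 5                     # ~20% held out for evaluation
eval_idx = idx[:n_eval]
half1, half2 = np.array_split(idx[n_eval:], 2)

def fit_half(rows):
    net = build_hopeless()
    net.fit(augmented_batches(dA[rows], dB[rows], dcombo[rows]),
            steps_per_epoch=max(1, len(rows) // 16), epochs=50)
    return net

net1 = fit_half(half1)
net2 = fit_half(half2)

# Each net predicts only the half it was NOT trained on.
preds = np.full_like(dcombo, np.nan)
preds[half2] = net1.predict(np.hstack([dA[half2], dB[half2]]))
preds[half1] = net2.predict(np.hstack([dA[half1], dB[half1]]))
```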
Before training the random forest stacker, we first need to reshape the training data and Hopeless Model’s predictions. We will also reapply the same shuffling and reshaping to our variances and add them to the training data for our stacker:
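A sketch of that reshaping, producing one row per (combination, gene) with the features described above; the column names are placeholders, and the random A/B swap is omitted here for brevity:

```python
import pandas as pd

# One row per (combination, gene): Hopeless Model's prediction, the two
# individual effects, and their (possibly missing) variances.
train_idx = np.concatenate([half1, half2])

stack_X = pd.DataFrame({
    'hopeless': preds[train_idx].ravel(),
    'dA': dA[train_idx].ravel(),
    'dB': dB[train_idx].ravel(),
    'vA': vA[train_idx].ravel(),   # NaNs left as-is; xgboost handles them
    'vB': vB[train_idx].ravel(),
})
stack_y = dcombo[train_idx].ravel()

# Write to disk so the stacker can be trained in R.
stack_X.to_csv('stacker_features.csv', index=False)
np.savetxt('stacker_target.csv', stack_y)
```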
Pirates Love R
What is a pirate’s favorite programming language? Rrrrrrrrr!
Neural network modules are one of the few things I prefer Python for; for most everything else, including the stacking, I use R. The final model is also implemented in R (as part of the ccmap package), so I had to transfer the trained neural networks from Python to R. Thankfully, this is relatively straightforward. To do this for net1, first extract and save the weights in Python:
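A sketch of that export, assuming the Keras-style model above; the plain-text file names are placeholders that the R snippets below reuse.

```python
# Write each dense layer's weight matrix and bias vector to CSV so they can
# be read from R. Layers without weights (e.g. dropout) are skipped.
dense_idx = 0
for layer in net1.layers:
    params = layer.get_weights()
    if not params:
        continue
    dense_idx += 1
    W, b = params
    np.savetxt(f'net1_W{dense_idx}.csv', W, delimiter=',')
    np.savetxt(f'net1_b{dense_idx}.csv', b, delimiter=',')
```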
Then load the parameters in R:
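A sketch of the corresponding R side, reading the files written above (two dense layers, matching the assumed architecture):

```r
# Read the weight matrices and bias vectors exported from Python.
W1 <- as.matrix(read.csv('net1_W1.csv', header = FALSE))   # input -> hidden
b1 <- as.numeric(read.csv('net1_b1.csv', header = FALSE)[[1]])
W2 <- as.matrix(read.csv('net1_W2.csv', header = FALSE))   # hidden -> output
b2 <- as.numeric(read.csv('net1_b2.csv', header = FALSE)[[1]])
```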
Our Hopeless Model’s prediction function is relatively straightforward as well:
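Here is a sketch of the forward pass in R, matching the assumed architecture (one ReLU hidden layer, linear output; dropout is only active during training, so it is omitted):

```r
relu <- function(x) pmax(x, 0)

# X: one row per combination; columns are the two treatment profiles (dA, dB).
predict_hopeless <- function(X, W1, b1, W2, b2) {
    h <- relu(sweep(X %*% W1, 2, b1, '+'))   # hidden layer
    sweep(h %*% W2, 2, b2, '+')              # linear output: one value per gene
}
```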
And our final stacker is trained as follows:
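A sketch with xgboost, reading the features written out earlier; the objective, learning rate, tree depth, and number of rounds are placeholder assumptions rather than the settings used in ccmap.

```r
library(xgboost)

features <- read.csv('stacker_features.csv')
target   <- scan('stacker_target.csv')

# missing = NA lets xgboost route the missing variances on its own.
dtrain <- xgb.DMatrix(data = as.matrix(features), label = target, missing = NA)

stacker <- xgb.train(params = list(objective = 'reg:squarederror',
                                   max_depth = 6, eta = 0.1),
                     data = dtrain, nrounds = 200)
```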
Evaluation
One informative way to analyse our models is to look at how well they do as a function of the absolute effect size of the combination treatment (figure below). Both models make almost perfect predictions at high absolute effect sizes. In contrast, both models struggle to decide if a gene is up- or down-regulated at small absolute effect sizes. These trends make sense: the smaller the effect size, the more likely it arose by chance and will not be reproducibly positive or negative. Perhaps unsurprisingly, it’s only at intermediate effect sizes that the stacker model has the advantage: these cases are neither trivial to predict, like those with large absolute effect sizes, nor virtually impossible to predict, like those with effect sizes very close to zero.
Another informative way to evaluate our models is to look at the distribution of error rates across the 257 treatment combinations (figure below on left). Interestingly, this perspective reveals that a small number of treatment combinations were either almost perfectly predicted (very low error rate) or predicted seemingly at random (error rate near 50%).
In addition to considering the accuracy of classifying a gene as down- or up-regulated, it is also relevant to ask whether genes were accurately ordered from the most down-regulated to the most up-regulated. The correlation between ranks, or Spearman correlation, provides just this measure. Again, both models did quite well, with a slight advantage to the stacked model (figure below on right).
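For reference, here is a sketch of how these two metrics can be computed per combination on the held-out set (the predicted_eval and measured_eval matrices are placeholders, with one row per combination and one column per gene):

```r
# Directional accuracy (sign agreement) and Spearman rank correlation for a
# single treatment combination.
eval_combo <- function(predicted, measured) {
    c(accuracy = mean(sign(predicted) == sign(measured)),
      spearman = cor(predicted, measured, method = 'spearman'))
}

results <- t(sapply(seq_len(nrow(predicted_eval)), function(i) {
    eval_combo(predicted_eval[i, ], measured_eval[i, ])
}))
```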
Summary
In this post I trained a model to predict gene expression changes in response to a drug combination based on the measured gene expression changes after treatment with the individual drugs. Compared to a model that simply averages the effects of the two individual treatments, the final model is more accurate both at classifying genes as up- or down-regulated and at ordering genes from the most down-regulated to the most up-regulated. To build this model, I employed the machine learning techniques of data augmentation and stacking. I may have also told a really bad fairy tale and made an awesome programming joke about pirates.