
Theory, Dataset, and Methods
Overview
Parameter optimization and Benchmarking
Probabilistic Matrix Factorization (PMF) is a broadly used machine learning method, which has been successfully adopted to predicting drugtarget interactions (Cobanoglu, et al., 2013). The known drugtarget interactions between N drugs and M targets (proteins) can be described using the interaction matrix R_{N×M} the elements of which are defined as:
(1)
In the PMF, drugs and targets are represented by Ddimensional vectors of latent variables (LVs), denoted as (1 £ i £ N) and (1 £ j £ M), respectively (Figure 1). The interaction matrix R_{N×M} can be modeled as the product of two matrices U^{T}_{N×D} and V_{D×M }for the N drugs and M targets respectively, and the corresponding the predicted interaction matrix R’ is
(2)
Training the PMF model is to find the best approximation for the known N×M drugtarget interaction matrix R under the giving loss function:
(3)
where is the regularization term to avoid overfitting.
Once we obtained the optimal U and V matrices, the predicted confidence score for the interaction between a drug i and a target j is calculated from the dot product of and , i.e. the product .
Figure 1. Qualitative illustration of the method for identifying drugtarget interactions. The known interactions between drugs and targets (indicated by black lines) are used to learn the latent vectors (LVs, shown adjacent to each node) that describe each drug and target. The dot product of the LVs for each pair of drug and target defines the predicted confidence score of the corresponding connection (red lines) (Cobanoglu, et al., 2013) .
QuartataWeb uses DrugBank v5.1.1 (Wishart, et al., 2018) for drugtarget interactions and STITCH (Szklarczyk, et al., 2016) database for proteinsmall molecule (or chemical) interactions, in addition to KEGG (Kanehisa, et al., 2017) Pathways and GO (Huntley, et al., 2015) for further analysis of cellular pathways and biological functions.
Table 1. The content of the data sources used in QuartataWeb
Data type and properties 
DrugBank 5.1 (Wishart, et al., 2018) 
STITCH 5 (Szklarczyk, et al., 2016) (human) 

Approved 
All 
Experimental 

# of drugs/chemicals (N) 
1,883 
5,494 
315,514 

# of targets (M) 
2,244 
2,807 
9,457 

Interaction space (NxM) 
4,225,452 
15,421,658 
2.98 billion 

known interactions (S) 
9,253 
14,983 
5,364,673 

Occupancy of R (S/[NxM]) 
0.0022 
0.0010 
0.0018 

Data type and properties 
KEGG (Kanehisa, et al., 2017) (human) 
GOA (Huntley, et al., 2015) (human) 

Molecular Function 
Biological Process 
Cellular component 

# of genes 
19,124 
2,807 

# of pathways/GO terms 
323 
2,346 
6,892 
897 

# of associations 
28,664 
11,475 
39,510 
15,160 

DrugBankapproved dataset. A total of 9,253 drugtarget interactions between 1,883 approved drugs and 2,244 human targets were extracted from DrugBank v5.1.1 released on 07032018. DrugBank is a frequently updated bioinformatics and cheminformatics resource that combines drug data with comprehensive target information. The sparsity of the approved drugtarget interaction matrix is 0.9978 with an interaction space of 4,225,452 (calculated by 1,883*2,244), or occupancy is 0.0022.
DrugBankall dataset. A total of 14,983 drugtarget interactions between 5,494 drugs and 2,807 human targets were extracted from DrugBank v5.1.1 as well. The DrugBankall dataset includes drugs from six different groups: approved, nutraceutical, illicit, withdrawn, investigational and experimental. The sparsity of this drugtarget interaction matrix is 0.9990 with an interaction space of 15,421,658.
STITCH experimental dataset. A total of 5.3 million chemicaltarget interactions between 315,514 chemicals and 9,457 human targets were extracted from STITCH v5 human experimental subset. STITCH is an extensive database that integrates chemicalprotein interactions from experiments, other databases, literature and predictions, resulting in interactions between 430,000 chemicals and 9,643,763 proteins across 2,031 genomes. We have used the subset of human proteinchemicals data supported by experimental evidence, ending up with a chemicalprotein interaction matrix with sparsity of 0.9982 and a huge interaction space of 2.98 billion.
KEGG. QuartataWeb uses the 28,664 genepathway associations between 19,124 human genes and 323 human pathways extracted from KEGG Pathway DB (Table 1). Proteins are mapped to genes following UniProt (The UniProt, 2017) annotations, and then mapped to pathways through genepathway associations. Based on the known and predicted CTIs, drugs/chemicals can then be connected to affected pathways.
GOA. A total of 2,807 target genes in our datasets were mapped to GO terms, comprising 11,475 molecular functions, 39,150 biological processes, and 15,160 cellular components (Table 1). The connection between targets and GO annotations is established through UniProt gene identifiers. Likewise, drugs/chemicals are mapped to GOA via the associated known and predicted targets, which enables users to assess the molecular functions, biological processes and cellular components potentially affected by chemicals.
Figure 2 shows the histogram of drug/chemical/target degree in each dataset. Most of the drugs or targets in DrugBank have degrees less than 40. In the DrugBankapproved set, the maximal number of targets for a given drug is 303, and the maximal number of drugs for a given target is 104. On average, there are 4.91 targets per drug and 4.12 drugs per target. In the DrugBankall dataset, the maximal number of targets per drug is 303, and the maximal number of drugs per target is 137, and the averages are 2.73 targets/drug and 5.34 drugs/target. The interactions in STITCH are much more extensive. Most chemicals or targets have degrees less than 250. Figure 2 displays the degree distribution of chemicals or targets with degrees < 250. The maximal number of targets per chemical is 4,044, and the maximal number of chemicals per target is 26,803. The averages are 16.45 targets/chemical and 541.82 chemicals/target. Yet, the median values are much smaller, 1 and 17, respectively.
Figure 2. Histogram of the degrees of drugs/chemicals and targets in each of the three examined datasets. Most of the drugs or targets in DrugBank have degrees below 40. The drugs or targets with degree over 40 are shown in the inset. Most of the chemicals or targets in STITCH have degrees less than 250, and the degree distribution of these chemicals or targets are shown in the corresponding plots. The dashed lines indicate the mean and median in each case.
Parameter optimization and Benchmarking
The latent factor models (LFMs) were trained on the DrugBankapproved, DrugBankall and STITCHexperimental datasets, separately. For training DrugBank datasets, we applied 10fold cross validation technique to select the optimal hyperparameters dimensionality (D) and optimization coefficient (λ), as well as the optimal cutoff values for confidence score. In the 10fold cross validation, the total known drugtarget interactions were evenly divided into 10 sets, and then we utilized 9 sets as training data and the remaining 1 set as testing data. We repeated it for ten times such that each subset could be tested. In the simulations, the 10fold cross validation were repeated 10 times independently and the average values were evaluated and reported.
DrugBank contains only binary inputs, either 0 (unknown interaction) or 1 (known interaction). We used a cutoff confidence score to decide whether the prediction was positive or negative, so that we could evaluate whether they were true positive (TP), false positive (FP), true negative (TN) or false negative (FN). Accordingly, any drugtarget pair with predicted confidence score above the cutoff value was considered as positives, otherwise as negative. The sensitivity (TP rate) and specificity (TN rate)
(4)
(5)
under different cutoff values were calculated to measure the performance. For DrugBankapproved dataset, the training dataset included 8,328 positive drugtarget pairs, the testing dataset included 925 positives, and 925 randomly sampled unknown drugtarget interactions were considered as negatives. For DrugBankall dataset, the training dataset included 13,485 positive drugtarget pairs, the testing set included 1,498 positives, and 1,498 randomly sampled unknown drugtarget interactions (negatives). In the parameter tuning section, ROC AUC was used to select the optimal hyperparameters the dimensionality D and the regularization parameter λ. D was scanned in the range 30100 as [30, 50, 70, 90, 110], the parameter λ was scanned from [1e6, 1e5, 0.0001, 0.001, 0.01, 0.1, 1], ROC AUC evaluation results are shown in Figure 3. For DrugBankapproved dataset, D = 50, λ=0.01 were selected as optimal parameters with ROC AUC of 0.836. The optimal cutoff value is 0.3, and the corresponding sensitivity was 0.582, the specificity was 0.992 and the precision (defined as the ratio of correctly predicted ones over all positives) was 0.987. For DrugBankall dataset, D = 50, λ = 0.0001 were selected as the optimal parameters with ROC AUC of 0.836, the optimal cutoff score is 0.2, with the sensitivity equal to 0.492, specificity equal to 0.991, and precision of 0.984.
Figure 3. ROC AUC of the 10fold cross validation for evaluating DrugBankapproved and DrugBankall. xaxis is the optimization coefficient (λ), yaxis is the dimensionality (D), and zaxis is the ROC AUC.
Table 2. The parameters used to train LFMs in QuartataWeb
Parameters 
DrugBank 5.1 
STITCH 5 (human) 

Approved 
All 
Experimental 

Latent vector dimensionality (D) 
50 
50 
100 
Regularization term (l) 
0.01 
0.0001 
0.5 
Table 3. Performance of QuartataWeb observed in crossvalidation tests
Dataset 
Crossvalidation 
Training set 
Test set 
Results 

Size^{(d)} 
Size^{(d)} 
Positive ^{(e)} 
Negative ^{(f)} 
AUC^{(g)} 
Sensitivity ^{(h)} 
Specificity ^{(i)} 
Precision/ RMSE^{(j)} 

DrugBankapproved ^{(a)} 
10fold 
8,328 
1,850 
925 
925 
0.836 ± 0.010 
0.582 
0.992 
0.987 (Precision) 

DrugBankall^{(b)} 
10fold 
13,485 
2,996 
1,498 
1,498 
0.836 ± 0.009 
0.492 
0.991 
0.984 (Precision) 

STITCHexperimental^{(c)} 
10fold 
4,828,203 
1,072,934 
536,467 
536,467 
0.870 ± 0.015 
N/A 
N/A 
0.0328 (RMSE) 

^{(a)} approved drugs and their targets in DrugBank v5; ^{(b)} all experimentally verified drugs and targets in DrugBank v5.1; ^{(c)}experimentally confirmed chemicals and proteins corresponding to human targets in STITCH v5; ^{(d)}Size: number of interactions in the training and validation data sets; ^{(e)}Positive: number of known interactions; ^{(f)}Negative: number of false interactions; ^{(g)}AUC: area under the receiver operating characteristics curve; ^{(h)}Sensitivity: TP / (TP + FN); ^{(i)}Specificity: TN / (FP + TN); ^{(j)}Precision=TP/(TP+FP); rootmeansquare error (RMSE) compared to the confidence scores reported in STITCH for known interactions. TP: true positive, FP: false positive, FN: false negative, TN: true negative. Sensitivity and Specificity results are not available for STITCH as the input training data are continuous confidence score instead of binary input. Instead, we use RMSE as metric. 

To demonstrate the ability of the method to recapitulate unknown drugtarget interactions, a benchmarking test was performed by hiding 70% of known interactions and examining the number of true positives in the top 1000 ranking predictions. As shown in Figure 4, there are 684 and 706 TPs out of top 1000 predictions in the DrugBankapproved and DrugBankall datasets, respectively, when hiding 70% of the known interactions. Similar performance was accomplished for a significantly smaller set with DrugBank v3.0 (Knox, et al., 2011) in an earlier study (Cobanoglu, et al., 2015).
Figure 4. Ability of the method to recapitulate hidden drug–target interactions. Simulations were repeated n = 44 times for each of the 1<m<1000 predictions, and the number of true positives is plotted for each run. The blue solid curves refer to the average performance obtained when hiding 70% in the DrugBank approved dataset and the DrugBank all dataset.
In addition, the top 1,000 predicted drugtarget pairs from the LFM generated for the full DrugBankapproved (1,883 drugs and 2,244 targets) and DrugBankall dataset (5,494 drugs and 2,807 targets) were examined using other CTI DBs STITCH, ChEMBL (Mendez, et al., 2019) and TTD (Li, et al., 2018), respectively. 376 of DrugBankapproved LFM ‘predicted’ pairs were actually listed among the experimentally confirmed pairs in STITCH, 255 pairs were reported in ChEMBL, and 14 pairs in TTD, ending up with 459 confirmed pairs (because of the overlaps between those confirmed in different databases). 341 of DrugBankall LFM ‘predicted’ pairs were listed in STITCH, 260 pairs in ChEMBL, and 16 pairs in TTD, ending up with 438 confirmed pairs. This demonstrated the predictive power of the current tool. The list of confirmed drugtarget pairs, along with their IDs in different databases can be found here (linked to the table). It remains to be seen if the remaining pairs are confirmed in the future.
The STITCH experimental dataset, on the other hand, is a much more extensive dataset containing chemicalprotein pairs with continuous interaction confidence scores ranging from 0 to 1. 10fold cross validation was applied to select the optimal hyperparameters dimensionality (D) and optimization coefficient (λ), and the rootmeansquare error (RMSE) between the confidence scores of the training data and testing data and those reported in STITCH database
(6)
were used as metrics to assess the performance of the method, where S’ refers to the number of training data or testing data. The size of training data was 4,291,739 and the size of testing data was 1,072,934. Parameter D was scanned from: {20, 40, 60, 80, 100, 120}, parameter λ was scanned from: {0.001, 0.01, 0.05, 0.1, 0.5, 1}. D = 100, λ = 0.5 were selected as optimal parameters. The corresponding training RMSE was 0.0093, and testing RMSE was 0.0328.
Figure 5. RMSE of the 10fold cross validation for evaluating STITCH experimental dataset. xaxis is the dimensionality (D), yaxis is the optimization coefficient (λ) and zaxis is the rootmeansquare error (RMSE).
We calculated the confidence score for each unknown drugtarget pair by the dot product of the LVs of the corresponding drug and target, for each dataset. Figure 6 shows the histograms of the known and unknown interactions (left), as well as the predicted confidence score (right) for three datasets.
Figure 6. Histogram of interaction confidence scores for each dataset. (left) Histogram of scores (or interaction probabilities) assigned to known and unknown interactions in the three datasets. Note that the values in DrugBank are binary. (right) Histograms of the confidence scores predicted by QuartataWeb for the unknown drug/chemicaltarget pairs (right). Scores range from 0 to 1, where close to 0 means low probability of interaction, close to 1 means high probability of interaction. Results are obtained with the optimal parameters selected above.
Based on drug/chemicaltarget interaction patterns, drugdrug (or chemicalchemical) similarities can be calculated by the correlation cosine between the LVs corresponding to the two drugs or two chemicals. 2D fingerprint similarities of the drugs or chemicals were also calculated with RDKit software (http://rdkit.org) and stored as references to enable comparison based on interactions patterns (predicted by Quartata) and based on structural patterns (based on RDKit). With the same idea, we calculated targettarget similarities.
CTIpatternbased similarity. Chemicalchemical or targettarget similarities based on CTI patterns are calculated by evaluating the correlation cosine between two chemical LVs (u_{i} and u_{j}) or two target LVs (v_{i} and v_{j}) as
(7)
(8)
The CTIpatternbased similarities range from 1 to 1, where 1 represents the highest similarity. Histograms of the chemicalchemical or targettarget similarities based on CTI patterns for each dataset can be found in Figure 7.
Structurebased similarity. The Tanimoto coefficient T_{ab} between the 2D fingerprints of two chemicals is calculated to provide a metric of structurebased similarity (Bajusz, et al., 2015) as
(9)
where a and b represent 2D fingerprint binary vectors, the Tanimoto coefficient ranges from 0 to 1, and 1 is the highest similarity.
Figure 7. Histogram of drugdrug/chemicalchemcial and targettarget similarities evaluated for three datasets.
Genepathway associations from the human subset in KEGG PATHWAY (Release 84.1, December 1, 2017) and the protein annotations from Uniprot Knowledgebase (KB) were stored on the QuartataWeb server. A total number of 28,664 genepathway associations included in 323 human pathways were downloaded from KEGG PATHWAY, the proteins listed in DrugBank v5.0 were mapped onto these pathways. Likewise, drugs are mapped to pathways via the associated targets. We define the known and predicted pathways for a drug as follows: the known pathways for a drug are those mapped based on the known drugtarget interactions, and the predicted pathways for a drug are those based on newly predicted drugtarget interactions.
Enrichment analysis identifies subpopulations that are statistically overrepresented in an ensemble. The hypergeometric pvalue has been widely used to investigate whether genes from predefined functional terms, e.g., pathway, are enriched in the differentially expressed genes. For example, let q be the total number of drugs that are mapped to a certain pathway A; let N and T be the total number of drugs in our background database and the number of drugs we evaluated. The number of drugs that are mapped onto pathway A in our evaluation is denoted by n. The null hypothesis of the hypergeometric test is that pathway A is irrelevant to our evaluated drugs. Given (N, T, q), we can model distribution of n under the null hypothesis, and the pvalue represents the probability of randomly observing n or more drugs mapped to pathway A from N total evaluated drugs.
(10)
where is the binomial coefficient.
For multiple testing, BenjaminiHochberg method (Benjamini and Yekutieli, 2001) is applied to correct false discovery rate (FDR). Giving T as the total number of the evaluated As, the hypergeometric pvalues are sorted from smallest to largest, and the adjusted pvalue of the ith item (p_{i}^{*}) is calculated as
(11)
The adjustment limits the FDR to a selected cutoff level alpha (e.g. adjusted pvalue < 0.05), which indicates that the fraction of false significant As from all the significant As identified is expected to be less than alpha.
The enrichment score used in the provided enrichment output bar plots is defined as
(12)
where higher enrichment score represents higher significance.
Bajusz, D., Racz, A. and Heberger, K. (2015) Why is Tanimoto index an appropriate choice for fingerprintbased similarity calculations? J Cheminform, 7:20.
Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Stat, 29, 11651188.
Cobanoglu, M.C., et al. (2013) Predicting drugtarget interactions using probabilistic matrix factorization. J Chem Inf Model, 53(12), 33993409.
Cobanoglu, M.C., et al. (2015) BalestraWeb: efficient online evaluation of drugtarget interactions. Bioinformatics, 31(1), 131133.
Huntley, R.P., et al. (2015) The GOA database: gene Ontology annotation updates for 2015. Nucleic Acids Res, 43, D1057D1063.
Kanehisa, M., et al. (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res, 45(D1), D353D361.
Knox, C., et al. (2011) DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res, 39:D1035D1041.
Li, Y.H., et al. (2018) Therapeutic target database update 2018: enriched resource for facilitating benchtoclinic research of targeted therapeutics. Nucleic Acids Res, 46(D1), D1121D1127.
Mendez, D., et al. (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res, 47(D1), D930D940.
Szklarczyk, D., et al. (2016) STITCH 5: augmenting proteinchemical interaction networks with tissue and affinity data. Nucleic Acids Res, 44(D1), D380D384.
The UniProt, C. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res, 45(D1), D158D169.
Wishart, D.S., et al. (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res, 46(D1), D1074D1082.