
DeepMol: a python-based machine and deep learning framework for drug discovery


Tersonous/DeepMol--Tersonou-Fork

 
 


DeepMol

Description

DeepMol is a Python-based machine and deep learning framework for drug discovery. It offers a variety of functionalities that enable a smoother approach to many drug discovery and chemoinformatics problems. It uses Tensorflow, Keras, Scikit-learn and DeepChem to build custom ML and DL models or make use of pre-built ones. It uses the RDKit framework to perform operations on molecular data.


Requirements

Installation

Pip

Install DeepMol via pip:

If you intend to install all the deepmol modules' dependencies:

pip install deepmol[all]

Extra modules:

pip install deepmol[preprocessing]
pip install deepmol[machine_learning]
pip install deepmol[deep_learning]

Also, you should install mol2vec and its dependencies:

pip install git+https://github.com/samoturk/mol2vec#egg=mol2vec

Manually

Alternatively, clone the repository and install the dependencies manually:

  1. Clone the repository:
git clone https://github.com/BioSystemsUM/DeepMol.git
  2. Install dependencies:
python setup.py install

Getting Started

DeepMol is built in a modular way, allowing its methods to be used for multiple tasks. It offers a complete workflow for ML and DL tasks on molecules represented as SMILES. Its modules cover standard tasks such as loading and standardizing data, computing molecular features such as molecular fingerprints, feature selection and data splitting. It also provides methods to deal with unbalanced datasets, explore data in an unsupervised way and compute feature importance as SHAP values.

The DeepMol framework is still under development, and it is currently at a pre-release version. New models and features will be added in the future.

Load a dataset from a CSV

For now, it is only possible to load data directly from CSV files. Modules to load data from different file types and sources will be implemented in the future. These include JSON, SDF and FASTA files and directly from our databases.

To load data from a CSV file, only the dataset path and the name of the molecules field are required. Optionally, you can also provide an ids field, the labels fields, the features fields, the features to keep (useful, for instance, to select only the features kept after feature selection) and the number of samples to load (by default, the entire dataset is loaded).

from deepmol.loaders.loaders import CSVLoader

# load a dataset from a CSV (required fields: dataset_path and smiles_field)
loader = CSVLoader(dataset_path='../../data/train_dataset.csv',
                   smiles_field='mols',
                   id_field='ids',
                   labels_fields=['y'],
                   features_fields=['feat_1', 'feat_2', 'feat_3', 'feat_4'],
                   shard_size=1000,
                   mode='auto')

dataset = loader.create_dataset()

# print shape of the dataset (molecules, X, y)
dataset.get_shape()

((1000,), None, (1000,))
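
To make the loader's inputs more concrete, the following is a minimal standard-library sketch of the same idea: reading a molecules column, optional ids and labels, and an optional row limit from a CSV. It is not DeepMol's implementation. The `load_csv` helper, its parameter names and the sample data are hypothetical and only illustrate the role of each field.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a file such as train_dataset.csv
SAMPLE_CSV = """ids,mols,y
m1,CCO,1
m2,c1ccccc1,0
m3,CC(=O)O,1
"""


def load_csv(handle, smiles_field, id_field=None, labels_fields=None, n_samples=None):
    """Read molecules (and optional ids/labels) from an open CSV handle."""
    mols, ids, labels = [], [], []
    for i, row in enumerate(csv.DictReader(handle)):
        # Stop early when only the first n_samples rows were requested
        if n_samples is not None and i >= n_samples:
            break
        mols.append(row[smiles_field])
        if id_field is not None:
            ids.append(row[id_field])
        if labels_fields:
            labels.append([float(row[field]) for field in labels_fields])
    return mols, ids, labels


mols, ids, labels = load_csv(io.StringIO(SAMPLE_CSV), smiles_field='mols',
                             id_field='ids', labels_fields=['y'], n_samples=2)
print(mols, ids, labels)
```

Only the molecules field is mandatory in this sketch; every other argument defaults to "not provided", which mirrors the required/optional distinction described above.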

Compound Standardization

Loaded molecules can be standardized using one of three options. The basic standardizer only performs sanitization (kekulization, valence checks, and setting aromaticity, conjugation and hybridization). A custom standardizer lets you choose which steps to perform, such as sanitization, removing isotope information, neutralizing charges, removing stereochemistry and removing smaller fragments. The third option is the ChEMBL Standardizer.

# Option 1: Basic Standardizer
standardizer = BasicStandardizer().standardize(dataset)

# Option 2: Custom Standardizer
heavy_standardisation = {
    'REMOVE_ISOTOPE': True,
    'NEUTRALISE_CHARGE': True,
    'REMOVE_STEREO': True,
    'KEEP_BIGGEST': True,
    'ADD_HYDROGEN': True,
    'KEKULIZE': False,
    'NEUTRALISE_CHARGE_LATE': True}
standardizer2 = CustomStandardizer(heavy_standardisation).standardize(dataset)

# Option 3: ChEMBL Standardizer
standardizer3 = ChEMBLStandardizer().standardize(dataset)

Compound Featurization

Several types of molecular fingerprints can be computed, including Morgan Fingerprints, MACCS Keys, Layered Fingerprints, RDK Fingerprints and AtomPair Fingerprints. Featurizers from DeepChem and molecular embeddings such as Mol2Vec can also be computed. More complex molecular embeddings, such as Seq2Seq and transformer-based ones, are in development and will be added soon.

from deepmol.compound_featurization import MorganFingerprint

# compute Morgan fingerprints for the molecules in the previously loaded dataset
MorganFingerprint(radius=2, size=1024).featurize(dataset)
# view the computed features (dataset.X)
dataset.X
# print shape of the dataset to see the difference in the X shape
dataset.get_shape()

((1000,), (1000, 1024), (1000,))

Feature Selection

For feature selection, the available methods are Low Variance Feature Selection, KBest, Percentile, Recursive Feature Elimination and selection based on importance weights.

from deepmol.feature_selection import LowVarianceFS

# Feature Selection to remove features with low variance across molecules
LowVarianceFS(0.15).select_features(dataset)

# print shape of the dataset to see difference in the X shape (fewer features)
dataset.get_shape()

((1000,), (1000, 35), (1000,))
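
As a rough illustration of what a low-variance filter does, the sketch below drops fingerprint bits whose variance across molecules does not exceed a threshold. It uses only the standard library and made-up data; the `low_variance_filter` helper is hypothetical and is not DeepMol's `LowVarianceFS`. For a binary bit that is ON in a fraction p of the molecules, the variance is p(1 - p), so a threshold of 0.15 removes bits that are almost always ON or almost always OFF.

```python
from statistics import pvariance

# Toy binary fingerprint matrix: 8 molecules x 4 bits (made-up data)
X = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]


def low_variance_filter(rows, threshold):
    """Return the kept column indices and the matrix reduced to those columns."""
    columns = list(zip(*rows))
    # Keep only columns whose population variance is above the threshold
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


keep, X_reduced = low_variance_filter(X, threshold=0.15)
print(keep)
print(len(X_reduced), len(X_reduced[0]))
```

In this toy matrix, bit 0 is constant and bit 1 is ON in a single molecule, so both fall below the threshold, while bits 2 and 3 are kept.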

Unsupervised Exploration

It is possible to do unsupervised exploration of the datasets using PCA, tSNE, KMeans and UMAP.

from deepmol.unsupervised import UMAP

ump = UMAP()
umap_df = ump.run_unsupervised(dataset)
ump.plot(umap_df.X, path='umap_output.png')

[Figure: umap_output]

Data Split

Data can be split randomly or using stratified splitters. K-fold split, train-test split and train-validation-test split can be used.

from deepmol.splitters.splitters import SingletaskStratifiedSplitter

# Data Split
splitter = SingletaskStratifiedSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset=dataset, frac_train=0.7,
                                                                             frac_valid=0.15, frac_test=0.15)
train_dataset.get_shape()

((1628,), (1628, 1024), (1628,))

valid_dataset.get_shape()

((348,), (348, 1024), (348,))

test_dataset.get_shape()

((350,), (350, 1024), (350,))
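
The idea behind a stratified train-validation-test split can be sketched in plain Python: indices are grouped by class, shuffled, and each class is cut with the same fractions so that class proportions are roughly preserved in every subset. The `stratified_split` helper and the toy labels below are hypothetical and do not reproduce the internals of `SingletaskStratifiedSplitter`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, frac_train, frac_valid, frac_test, seed=0):
    """Return train/valid/test index lists that preserve class proportions."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions must sum to 1')
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(frac_train * len(indices))
        n_valid = round(frac_valid * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        # Whatever remains of this class goes to the test set
        test += indices[n_train + n_valid:]
    return train, valid, test


# Toy labels: 80 inactive (0) and 20 active (1) molecules
y = [0] * 80 + [1] * 20
train_idx, valid_idx, test_idx = stratified_split(y, 0.7, 0.15, 0.15)
print(len(train_idx), len(valid_idx), len(test_idx))
print(Counter(y[i] for i in train_idx))
```

Splitting per class rather than over the whole dataset is what keeps the minority class represented in the validation and test sets.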

Build, train and evaluate a model

It is possible to use pre-built models from Scikit-Learn and DeepChem or to build new ones using Keras layers. Wrappers for Scikit-Learn, Keras and DeepChem are implemented, allowing models to be evaluated under a common workspace.

Scikit-Learn model example

Models can be imported from scikit-learn and wrapped using SklearnModel.

Check this jupyter notebook for a complete example!

from sklearn.ensemble import RandomForestClassifier
from deepmol.models.sklearn_models import SklearnModel

# Scikit-Learn Random Forest
rf = RandomForestClassifier()
# wrapper around scikit learn models
model = SklearnModel(model=rf)
# model training
model.fit(train_dataset)

from deepmol.metrics.metrics import Metric
from deepmol.metrics.metrics_functions import roc_auc_score

# cross validate model on the full dataset
model.cross_validate(dataset, Metric(roc_auc_score), folds=3)

[Figure: cross_validation_output]

from sklearn.metrics import precision_score, accuracy_score, confusion_matrix, classification_report

# evaluate the model using different metrics
metrics = [Metric(roc_auc_score), Metric(precision_score), Metric(accuracy_score), Metric(confusion_matrix),
           Metric(classification_report)]

# evaluate the model on the training data
print('Training Dataset: ')
train_score = model.evaluate(train_dataset, metrics)

# evaluate the model on the validation data
print('Validation Dataset: ')
valid_score = model.evaluate(valid_dataset, metrics)

# evaluate the model on the test data
print('Test Dataset: ')
test_score = model.evaluate(test_dataset, metrics)

[Figure: evaluate_output]
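
For readers less familiar with the metrics used above, the following plain-Python sketch shows how accuracy and precision follow from the confusion-matrix counts of a binary classifier. The helper names and the toy labels are hypothetical; DeepMol's `Metric` wrapper and the scikit-learn functions are not used here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive (0.0 if none predicted)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


# Toy labels and predictions (made-up data)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print((tp, tn, fp, fn), accuracy(tp, tn, fp, fn), precision(tp, fp))
```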

Keras model example

Example of how to build and wrap a keras model using the KerasModel module.

Check this jupyter notebook for a complete example!

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from deepmol.metrics.metrics import Metric

input_dim = train_dataset.X.shape[1]


def create_model(optimizer='adam', dropout=0.5, input_dim=input_dim):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=input_dim, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model


from deepmol.models.keras_models import KerasModel

model = KerasModel(create_model, epochs=5, verbose=1, optimizer='adam')

# train model
model.fit(train_dataset)

# make prediction on the test dataset with the model
model.predict(test_dataset)

# evaluate model using multiple metrics
metrics = [Metric(roc_auc_score),
           Metric(precision_score),
           Metric(accuracy_score),
           Metric(confusion_matrix),
           Metric(classification_report)]

print('Training set score:', model.evaluate(train_dataset, metrics))
print('Test set score:', model.evaluate(test_dataset, metrics))

DeepChem model example

Using DeepChem models:

Check this jupyter notebook for a complete example!

from deepmol.compound_featurization import WeaveFeat
from deepchem.models import MPNNModel
from deepmol.models.deepchem_models import DeepChemModel
from deepmol.metrics.metrics import Metric
from deepmol.splitters.splitters import SingletaskStratifiedSplitter

ds = WeaveFeat().featurize(dataset)
splitter = SingletaskStratifiedSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset=ds, frac_train=0.6, frac_valid=0.2,
                                                                             frac_test=0.2)
mpnn = MPNNModel(n_tasks=1, n_pair_feat=14, n_atom_feat=75, n_hidden=75, T=1, M=1, mode='classification')
model_mpnn = DeepChemModel(mpnn)
# Model training
model_mpnn.fit(train_dataset)
valid_preds = model_mpnn.predict(valid_dataset)
test_preds = model_mpnn.predict(test_dataset)
# Evaluation
metrics = [Metric(roc_auc_score), Metric(precision_score), Metric(accuracy_score)]
print('Training Dataset: ')
train_score = model_mpnn.evaluate(train_dataset, metrics)
print('Valid Dataset: ')
valid_score = model_mpnn.evaluate(valid_dataset, metrics)
print('Test Dataset: ')
test_score = model_mpnn.evaluate(test_dataset, metrics)    

Hyperparameter Optimization

Grid and randomized hyperparameter optimization are provided, using either cross-validation or a held-out validation set.

from deepmol.parameter_optimization.hyperparameter_optimization import (HyperparameterOptimizerValidation,
                                                                        HyperparameterOptimizerCV)

# Hyperparameter Optimization (using the above created keras model)
optimizer = HyperparameterOptimizerValidation(create_model)

params_dict = {'optimizer': ['adam', 'rmsprop'],
               'dropout': [0.2, 0.4, 0.5]}

best_model, best_hyperparams, all_results = optimizer.hyperparameter_search(params_dict, train_dataset,
                                                                            valid_dataset, Metric(roc_auc_score))

print(best_hyperparams)
print(best_model)

# Evaluate model
best_model.evaluate(test_dataset, metrics)
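
To show what a grid search over `params_dict` amounts to, the sketch below enumerates every combination of the listed values and keeps the one with the highest score. It is a standard-library illustration under stated assumptions: `expand_grid`, `grid_search` and `fake_validation_score` are hypothetical names, and the scoring function is a made-up stand-in for "train on the training set, score on the validation set".

```python
from itertools import product

params_dict = {'optimizer': ['adam', 'rmsprop'],
               'dropout': [0.2, 0.4, 0.5]}


def expand_grid(grid):
    """List every combination of hyperparameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def grid_search(grid, score_fn):
    """Return the best combination and its score (higher is better)."""
    best_params, best_score = None, float('-inf')
    for params in expand_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_score(params):
    # Hypothetical score: favors dropout near 0.4 and the 'adam' optimizer
    bonus = 0.05 if params['optimizer'] == 'adam' else 0.0
    return 0.8 - abs(params['dropout'] - 0.4) + bonus


best_params, best_score = grid_search(params_dict, fake_validation_score)
print(len(expand_grid(params_dict)), best_params)
```

With two optimizers and three dropout values, the grid contains six candidates. A randomized search would instead sample a fixed number of combinations from the same grid.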

Feature Importance (Shap Values)

The output of a machine learning model can be explained using the SHAP (SHapley Additive exPlanations) package. The features that most influenced (positively or negatively) a given prediction can be calculated and visualized in different ways:

from deepmol.feature_importance import ShapValues

shap_calc = ShapValues(test_dataset, model)
shap_calc.computePermutationShap()

[Figure: calc_shap_output]

shap_calc.plotSampleExplanation(index=1, plot_type='waterfall')

[Figure: sample_explanation_output]

shap_calc.plotFeatureExplanation(index=115)

[Figure: feature_explanation_output]
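
The notion of a feature's contribution to a single prediction can be illustrated with an exact Shapley value computation on a tiny model. The sketch below averages each feature's marginal contribution over all feature orderings, starting from a baseline input. It is a didactic example, not the SHAP package's algorithm or DeepMol's `ShapValues` class; `shapley_values` and `toy_model` are hypothetical, and the brute-force enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        # Switch features from baseline to their actual value, one at a time
        for j in order:
            current[j] = x[j]
            updated = predict(current)
            phi[j] += updated - previous
            previous = updated
    return [value / len(orders) for value in phi]


def toy_model(features):
    # Hypothetical model: linear terms plus an interaction between bits 0 and 1
    return (2.0 * features[0] + 1.0 * features[1]
            + 4.0 * features[0] * features[1] - 3.0 * features[2])


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. Positive values push the prediction up and negative values push it down.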

Draw relevant features

It is possible to plot the ON bits (or some of them) in a molecule for MACCS Keys, Morgan and RDK Fingerprints. It is also possible to draw those bits on the respective molecule. This can be combined with the SHAP values calculation to highlight the region of the molecule that most contributed to a certain prediction, for instance, the substructure that most contributed to its classification as active or inactive against a receptor.

from deepmol.utils.utils import draw_MACCS_Pattern

patt_number = 54
mol_number = 1

prediction = model.predict(test_dataset)[mol_number]
actual_value = test_dataset.y[mol_number]
print('Prediction: ', prediction)
print('Actual Value: ', actual_value)
smi = test_dataset.mols[mol_number]

draw_MACCS_Pattern(smi, patt_number)

[Figure: draw_maccs_output]

Unbalanced Datasets

Multiple methods to deal with unbalanced datasets can be used to do oversampling, under-sampling or a mixture of both (Random, SMOTE, SMOTEENN, SMOTETomek and ClusterCentroids).

from deepmol.imbalanced_learn.imbalanced_learn import SMOTEENN

train_dataset = SMOTEENN().sample(train_dataset)
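
The simplest of the listed strategies, random oversampling, can be sketched in a few lines: minority-class samples are duplicated at random until every class has as many samples as the largest one. This is a standard-library illustration with made-up data; `random_oversample` is a hypothetical helper, and it does not implement SMOTEENN, which additionally synthesizes new samples and cleans noisy ones.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class samples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Candidate samples of this class to draw duplicates from
        pool = [features for features, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Toy dataset: four inactive (0) molecules and one active (1) molecule
X = [[0, 1], [1, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 0, 0, 0, 1]

X_balanced, y_balanced = random_oversample(X, y)
print(Counter(y_balanced))
```

Resampling should be applied to the training set only, as in the DeepMol call above, so that validation and test sets keep the original class distribution.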

About Us

DeepMol is managed by a team of contributors from the BioSystems group at the Centre of Biological Engineering, University of Minho.

This research was financed by Portuguese Funds through FCT – Fundação para a Ciência e a Tecnologia.

Citing DeepMol

Manuscript under preparation.

Publications using DeepMol

Baptista D., Correia J., Pereira B., Rocha M. (2022) "A Comparison of Different Compound Representations for Drug Sensitivity Prediction". In: Rocha M., Fdez-Riverola F., Mohamad M.S., Casado-Vara R. (eds) Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021). PACBB 2021. Lecture Notes in Networks and Systems, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-86258-9_15

Baptista, Delora, Correia, João, Pereira, Bruno and Rocha, Miguel. "Evaluating molecular representations in machine learning models for drug response prediction and interpretability" Journal of Integrative Bioinformatics, vol. 19, no. 3, 2022, pp. 20220006. https://doi.org/10.1515/jib-2022-0006

J. Capela, J. Correia, V. Pereira and M. Rocha, "Development of Deep Learning approaches to predict relationships between chemical structures and sweetness," 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1-8, doi: 10.1109/IJCNN55064.2022.9891992. https://ieeexplore.ieee.org/abstract/document/9891992

Licensing

DeepMol is under BSD-2-Clause License.
