From python API¶
Get your dataset from sdf files¶
Here we extract the input features from a serie of active/inactive compounds. The available input features are:
csv: Any external information with the name of the molecules as row index and the feature names as column indexes.
MACCS: Generate MACCS fingerprints
fp: Generate Daylight fingerprints
descriptors: Generate topological descriptors
import os
import numpy as np
import modtox.ML.preprocess as Pre
import modtox.ML.postprocess as Post
import modtox.ML.model2 as model
from sklearn.model_selection import train_test_split
from sklearn.datasets.samples_generator import make_blobs
import matplotlib.pyplot as plt
folder = "tests_2/data/"
sdf_active = os.path.join(folder, "actives.sdf")
sdf_inactive = os.path.join(folder, "inactives.sdf")
pre = Pre.ProcessorSDF(csv=csv, fp=False, descriptors=False, MACCS=True, columns=None)
print("Fit and tranform for preprocessor..")
X, y = pre.fit_transform(sdf_active=sdf_active, sdf_inactive=sdf_inactive)
Use your own dataset¶
On the contrary here we use any external X, y dataset you could have
X, y = make_blobs(n_samples=100, centers=2, n_features=2,
random_state=0)
Curate dataset¶
Drop samples with all Nans (sanitize) and remove the specified features (filter). To specify the columns to remove use the columns argument on the model as:
pre = Pre.ProcessorSDF(csv=csv, fp=False, descriptors=False, MACCS=True, columns=[“Feature_1”, “Feature_2”])
print("Sanitazing...")
pre.sanitize(X, y)
print("Filtering features...")
pre.filter_features(X)
Fit model¶
You can choose between single/stack model as you want to use a stack of 5 classifiers or only one.
Model = model.GenericModel(clf='stack', tpot=True)
Model = model.GenericModel(clf='single', tpot=True)
print("Fitting model...")
Model.fit(X_train,y_train)
Predict¶
y_pred = Model.predict(X_test)
Analysis¶
pp = Post.PostProcessor('stack', x_test=Model.X_test_trans, y_true_test=Model.Y_test,
y_pred_test=Model.prediction_test, y_proba_test=Model.predictions_proba_test,
x_train=Model.X_trans, y_true_train=Model.Y)
Metrics¶
Plot the ROC and PR curve together with the confusion matrix
ROC = pp.ROC()
PR = pp.PR()
DA = pp.conf_matrix()
Feature importance¶
Plot the features importance coming from the shap values or he XGBOOST gain function
SH = pp.shap_values(debug=True)
FI = pp.feature_importance()
Uncertanties¶
Analyse the uncertanties of the model on the test samples
DA = pp.domain_analysis()
UN = pp.calculate_uncertanties()