From python API
=================

Get your dataset from sdf files
---------------------------------------

Here we extract the input features from a series of active/inactive compounds. The available input features are:

- csv: Any external information, with the molecule names as row index and the feature names as column index.
- MACCS: Generate MACCS fingerprints
- fp: Generate Daylight fingerprints
- descriptors: Generate topological descriptors

.. code-block:: python

    import os
    import numpy as np
    import modtox.ML.preprocess as Pre
    import modtox.ML.postprocess as Post
    import modtox.ML.model2 as model
    from sklearn.model_selection import train_test_split
    # make_blobs lives in sklearn.datasets; the old
    # sklearn.datasets.samples_generator module was removed in recent scikit-learn releases
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt

    folder = "tests_2/data/"
    sdf_active = os.path.join(folder, "actives.sdf")
    sdf_inactive = os.path.join(folder, "inactives.sdf")

    # NOTE: `csv` is not defined in this snippet; set it beforehand
    # (external features, see the list above)
    pre = Pre.ProcessorSDF(csv=csv, fp=False, descriptors=False, MACCS=True, columns=None)
    print("Fit and transform for preprocessor..")
    X, y = pre.fit_transform(sdf_active=sdf_active, sdf_inactive=sdf_inactive)

Use your own dataset
----------------------

Alternatively, you can use any external X, y dataset you already have.

.. code-block:: python

    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

Curate dataset
-----------------

Drop the samples whose values are all NaN (sanitize) and remove the specified features (filter).

To specify the columns to remove, pass the ``columns`` argument when building the preprocessor:
``pre = Pre.ProcessorSDF(csv=csv, fp=False, descriptors=False, MACCS=True, columns=["Feature_1", "Feature_2"])``

.. code-block:: python

    print("Sanitizing...")
    pre.sanitize(X, y)
    print("Filtering features...")
    pre.filter_features(X)

Fit model
--------------------------------------

Choose ``clf='stack'`` to use a stack of 5 classifiers, or ``clf='single'`` to use only one.

.. code-block:: python

    # Hold out a test set (scikit-learn's default split keeps 25% for testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Stack of 5 classifiers
    Model = model.GenericModel(clf='stack', tpot=True)
    # Alternatively, a single classifier:
    # Model = model.GenericModel(clf='single', tpot=True)

    print("Fitting model...")
    Model.fit(X_train, y_train)

Predict
----------

.. code-block:: python

    y_pred = Model.predict(X_test)

Analysis
-----------

.. code-block:: python

    pp = Post.PostProcessor('stack', x_test=Model.X_test_trans, y_true_test=Model.Y_test,
                            y_pred_test=Model.prediction_test, y_proba_test=Model.predictions_proba_test,
                            x_train=Model.X_trans, y_true_train=Model.Y)

Metrics
****************

Plot the ROC and PR curves together with the confusion matrix.

.. code-block:: python

    ROC = pp.ROC()
    PR = pp.PR()
    CM = pp.conf_matrix()

Feature importance
***********************

Plot the feature importance derived from the SHAP values or from the XGBoost gain function.

.. code-block:: python

    SH = pp.shap_values(debug=True)
    FI = pp.feature_importance()

Uncertainties
***************

Analyse the uncertainties of the model on the test samples.

.. code-block:: python

    DA = pp.domain_analysis()
    UN = pp.calculate_uncertanties()

Visualize
************

Visualize the dataset and the wrongly predicted samples.

.. code-block:: python

    pp.UMAP_plot()
    pp.PCA_plot()
    pp.tsne_plot()
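
The split, fit, predict and metrics steps above can also be tried out with scikit-learn alone, for example to check a custom X, y dataset before passing it to modtox. The following is a minimal sketch and not modtox's implementation: ``LogisticRegression`` is an arbitrary stand-in for ``GenericModel``, and the 30-sample test split is an assumed value chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Same toy dataset as in "Use your own dataset"
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

# Hold out 30 samples for testing (assumed value), keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, stratify=y, random_state=0
)

# Stand-in classifier (assumption): any estimator with predict_proba would do
clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard class labels
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1

# Points of the ROC and PR curves; these arrays can be passed to matplotlib
fpr, tpr, _ = roc_curve(y_test, y_proba)
precision, recall, _ = precision_recall_curve(y_test, y_proba)

# Scalar summaries of both curves, plus the confusion matrix
roc_auc = roc_auc_score(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)
cm = confusion_matrix(y_test, y_pred)

print("ROC AUC:", roc_auc)
print("Average precision:", avg_precision)
print("Confusion matrix:")
print(cm)
```

The rows of ``cm`` are the true classes and the columns are the predicted classes, so the off-diagonal entries count the wrongly predicted test samples.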