From c3ba93c5c2edcdd80985c5e219398f71777aa08a Mon Sep 17 00:00:00 2001 From: Chanin Nantasenamat <51851491+dataprofessor@users.noreply.github.com> Date: Sun, 9 Aug 2020 22:43:08 +0700 Subject: [PATCH] Add files via upload --- ...le_linear_regression_model_in_python.ipynb | 1349 +++++++++++++++++ 1 file changed, 1349 insertions(+) create mode 100644 python/How_to_build_a_simple_linear_regression_model_in_python.ipynb diff --git a/python/How_to_build_a_simple_linear_regression_model_in_python.ipynb b/python/How_to_build_a_simple_linear_regression_model_in_python.ipynb new file mode 100644 index 0000000..4049574 --- /dev/null +++ b/python/How_to_build_a_simple_linear_regression_model_in_python.ipynb @@ -0,0 +1,1349 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "How_to_build_a_simple_linear_regression_model_in_python.ipynb", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "OQi3X7TNUl5Y", + "colab_type": "text" + }, + "source": [ + "# **How to Build a Simple Linear Regression Model in Python** \n", + "\n", + "Chanin Nantasenamat\n", + "\n", + "[Data Professor YouTube channel](http://youtube.com/dataprofessor), http://youtube.com/dataprofessor \n", + "\n", + "In this Jupyter notebook, we will be building a simple linear regression model using the **Delaney Molecular Solubility** dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H661uGwCNFMC", + "colab_type": "text" + }, + "source": [ + "## **1. Retrieving the Dataset**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZkglmVcwpoXG", + "colab_type": "text" + }, + "source": [ + "### **1.1. Original dataset**\n", + "\n", + "The original [Delaney's dataset](https://pubs.acs.org/doi/10.1021/ci034243x) is available as a [Supplementary file](https://pubs.acs.org/doi/10.1021/ci034243x)$^4$. 
The full paper is entitled [ESOL:  Estimating Aqueous Solubility Directly from Molecular Structure](https://pubs.acs.org/doi/10.1021/ci034243x).$^1$" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mupO58ZfpiqE", + "colab_type": "code", + "colab": {} + }, + "source": [ + "import pandas as pd" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "5NSgd-6Mol_T", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 419 + }, + "outputId": "26d12350-2f85-4955-fdcf-c7ca8366f2b4" + }, + "source": [ + "delaney_url = 'https://raw.githubusercontent.com/dataprofessor/data/master/delaney.csv'\n", + "delaney_df = pd.read_csv(delaney_url)\n", + "delaney_df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>Compound ID</th><th>measured log(solubility:mol/L)</th><th>ESOL predicted log(solubility:mol/L)</th><th>SMILES</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>1,1,1,2-Tetrachloroethane</td><td>-2.180</td><td>-2.794</td><td>ClCC(Cl)(Cl)Cl</td></tr>\n",
+ "    <tr><th>1</th><td>1,1,1-Trichloroethane</td><td>-2.000</td><td>-2.232</td><td>CC(Cl)(Cl)Cl</td></tr>\n",
+ "    <tr><th>2</th><td>1,1,2,2-Tetrachloroethane</td><td>-1.740</td><td>-2.549</td><td>ClC(Cl)C(Cl)Cl</td></tr>\n",
+ "    <tr><th>3</th><td>1,1,2-Trichloroethane</td><td>-1.480</td><td>-1.961</td><td>ClCC(Cl)Cl</td></tr>\n",
+ "    <tr><th>4</th><td>1,1,2-Trichlorotrifluoroethane</td><td>-3.040</td><td>-3.077</td><td>FC(F)(Cl)C(F)(Cl)Cl</td></tr>\n",
+ "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "    <tr><th>1139</th><td>vamidothion</td><td>1.144</td><td>-1.446</td><td>CNC(=O)C(C)SCCSP(=O)(OC)(OC)</td></tr>\n",
+ "    <tr><th>1140</th><td>Vinclozolin</td><td>-4.925</td><td>-4.377</td><td>CC1(OC(=O)N(C1=O)c2cc(Cl)cc(Cl)c2)C=C</td></tr>\n",
+ "    <tr><th>1141</th><td>Warfarin</td><td>-3.893</td><td>-3.913</td><td>CC(=O)CC(c1ccccc1)c3c(O)c2ccccc2oc3=O</td></tr>\n",
+ "    <tr><th>1142</th><td>Xipamide</td><td>-3.790</td><td>-3.642</td><td>Cc1cccc(C)c1NC(=O)c2cc(c(Cl)cc2O)S(N)(=O)=O</td></tr>\n",
+ "    <tr><th>1143</th><td>XMC</td><td>-2.581</td><td>-2.688</td><td>CNC(=O)Oc1cc(C)cc(C)c1</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>1144 rows × 4 columns</p>\n",
+ "</div>
" + ], + "text/plain": [ + " Compound ID ... SMILES\n", + "0 1,1,1,2-Tetrachloroethane ... ClCC(Cl)(Cl)Cl\n", + "1 1,1,1-Trichloroethane ... CC(Cl)(Cl)Cl\n", + "2 1,1,2,2-Tetrachloroethane ... ClC(Cl)C(Cl)Cl\n", + "3 1,1,2-Trichloroethane ... ClCC(Cl)Cl\n", + "4 1,1,2-Trichlorotrifluoroethane ... FC(F)(Cl)C(F)(Cl)Cl\n", + "... ... ... ...\n", + "1139 vamidothion ... CNC(=O)C(C)SCCSP(=O)(OC)(OC)\n", + "1140 Vinclozolin ... CC1(OC(=O)N(C1=O)c2cc(Cl)cc(Cl)c2)C=C\n", + "1141 Warfarin ... CC(=O)CC(c1ccccc1)c3c(O)c2ccccc2oc3=O \n", + "1142 Xipamide ... Cc1cccc(C)c1NC(=O)c2cc(c(Cl)cc2O)S(N)(=O)=O\n", + "1143 XMC ... CNC(=O)Oc1cc(C)cc(C)c1\n", + "\n", + "[1144 rows x 4 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "biqQ78_hqdb6", + "colab_type": "text" + }, + "source": [ + "### **1.2. Delaney dataset with computed molecular descriptors**\n", + "\n", + "As demonstrated in a previous YouTube video [Data Science for Computational Drug Discovery using Python](https://www.youtube.com/watch?v=VXFFHHoE1wk) on the Data Professor YouTube channel, SMILES notation from the Delaney dataset was used as *input* for molecular descriptor calculation using the **rdkit** Python library. This produced the 4 molecular descriptors as used by the authors in their published research article." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "olyPX1TjQMvr", + "colab_type": "text" + }, + "source": [ + "#### **Definition of variables**\n", + "\n", + "The **Y** variable (response variable) is **LogS** (log of the aqueous solubility).\n", + "\n", + "The **X** variables are comprised of 4 molecular descriptors:\n", + "1. **cLogP** *(Octanol-water partition coefficient)*\n", + "2. **MW** *(Molecular weight)*\n", + "3. **RB** *(Number of rotatable bonds)*\n", + "4. 
**AP** *(Aromatic proportion = number of aromatic atoms / total number of heavy atoms)*" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q9kna0SWkamZ", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 419 + }, + "outputId": "3d948275-2329-4cc0-b465-0198efafb183" + }, + "source": [ + "delaney_descriptors_url = 'https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv'\n", + "delaney_descriptors_df = pd.read_csv(delaney_descriptors_url)\n", + "delaney_descriptors_df" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>MolLogP</th><th>MolWt</th><th>NumRotatableBonds</th><th>AromaticProportion</th><th>logS</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>2.59540</td><td>167.850</td><td>0.0</td><td>0.000000</td><td>-2.180</td></tr>\n",
+ "    <tr><th>1</th><td>2.37650</td><td>133.405</td><td>0.0</td><td>0.000000</td><td>-2.000</td></tr>\n",
+ "    <tr><th>2</th><td>2.59380</td><td>167.850</td><td>1.0</td><td>0.000000</td><td>-1.740</td></tr>\n",
+ "    <tr><th>3</th><td>2.02890</td><td>133.405</td><td>1.0</td><td>0.000000</td><td>-1.480</td></tr>\n",
+ "    <tr><th>4</th><td>2.91890</td><td>187.375</td><td>1.0</td><td>0.000000</td><td>-3.040</td></tr>\n",
+ "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "    <tr><th>1139</th><td>1.98820</td><td>287.343</td><td>8.0</td><td>0.000000</td><td>1.144</td></tr>\n",
+ "    <tr><th>1140</th><td>3.42130</td><td>286.114</td><td>2.0</td><td>0.333333</td><td>-4.925</td></tr>\n",
+ "    <tr><th>1141</th><td>3.60960</td><td>308.333</td><td>4.0</td><td>0.695652</td><td>-3.893</td></tr>\n",
+ "    <tr><th>1142</th><td>2.56214</td><td>354.815</td><td>3.0</td><td>0.521739</td><td>-3.790</td></tr>\n",
+ "    <tr><th>1143</th><td>2.02164</td><td>179.219</td><td>1.0</td><td>0.461538</td><td>-2.581</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>1144 rows × 5 columns</p>\n",
+ "</div>
" + ], + "text/plain": [ + " MolLogP MolWt NumRotatableBonds AromaticProportion logS\n", + "0 2.59540 167.850 0.0 0.000000 -2.180\n", + "1 2.37650 133.405 0.0 0.000000 -2.000\n", + "2 2.59380 167.850 1.0 0.000000 -1.740\n", + "3 2.02890 133.405 1.0 0.000000 -1.480\n", + "4 2.91890 187.375 1.0 0.000000 -3.040\n", + "... ... ... ... ... ...\n", + "1139 1.98820 287.343 8.0 0.000000 1.144\n", + "1140 3.42130 286.114 2.0 0.333333 -4.925\n", + "1141 3.60960 308.333 4.0 0.695652 -3.893\n", + "1142 2.56214 354.815 3.0 0.521739 -3.790\n", + "1143 2.02164 179.219 1.0 0.461538 -2.581\n", + "\n", + "[1144 rows x 5 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qQYE-jCRSmCn", + "colab_type": "text" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WRXEmm941h13", + "colab_type": "text" + }, + "source": [ + "## **2. Create X and Y variables**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JeSbdwYA1l3K", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 419 + }, + "outputId": "5e139655-ff97-4f95-947a-c6ecbc9c8861" + }, + "source": [ + "X = delaney_descriptors_df.drop('logS', axis=1)\n", + "X" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>MolLogP</th><th>MolWt</th><th>NumRotatableBonds</th><th>AromaticProportion</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>2.59540</td><td>167.850</td><td>0.0</td><td>0.000000</td></tr>\n",
+ "    <tr><th>1</th><td>2.37650</td><td>133.405</td><td>0.0</td><td>0.000000</td></tr>\n",
+ "    <tr><th>2</th><td>2.59380</td><td>167.850</td><td>1.0</td><td>0.000000</td></tr>\n",
+ "    <tr><th>3</th><td>2.02890</td><td>133.405</td><td>1.0</td><td>0.000000</td></tr>\n",
+ "    <tr><th>4</th><td>2.91890</td><td>187.375</td><td>1.0</td><td>0.000000</td></tr>\n",
+ "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "    <tr><th>1139</th><td>1.98820</td><td>287.343</td><td>8.0</td><td>0.000000</td></tr>\n",
+ "    <tr><th>1140</th><td>3.42130</td><td>286.114</td><td>2.0</td><td>0.333333</td></tr>\n",
+ "    <tr><th>1141</th><td>3.60960</td><td>308.333</td><td>4.0</td><td>0.695652</td></tr>\n",
+ "    <tr><th>1142</th><td>2.56214</td><td>354.815</td><td>3.0</td><td>0.521739</td></tr>\n",
+ "    <tr><th>1143</th><td>2.02164</td><td>179.219</td><td>1.0</td><td>0.461538</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>1144 rows × 4 columns</p>\n",
+ "</div>
" + ], + "text/plain": [ + " MolLogP MolWt NumRotatableBonds AromaticProportion\n", + "0 2.59540 167.850 0.0 0.000000\n", + "1 2.37650 133.405 0.0 0.000000\n", + "2 2.59380 167.850 1.0 0.000000\n", + "3 2.02890 133.405 1.0 0.000000\n", + "4 2.91890 187.375 1.0 0.000000\n", + "... ... ... ... ...\n", + "1139 1.98820 287.343 8.0 0.000000\n", + "1140 3.42130 286.114 2.0 0.333333\n", + "1141 3.60960 308.333 4.0 0.695652\n", + "1142 2.56214 354.815 3.0 0.521739\n", + "1143 2.02164 179.219 1.0 0.461538\n", + "\n", + "[1144 rows x 4 columns]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VGwuoNKs2ReN", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + }, + "outputId": "27773f04-d07f-4d6f-80b8-323944ddac7a" + }, + "source": [ + "Y = delaney_descriptors_df.logS\n", + "Y" + ], + "execution_count": 21, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 -2.180\n", + "1 -2.000\n", + "2 -1.740\n", + "3 -1.480\n", + "4 -3.040\n", + " ... \n", + "1139 1.144\n", + "1140 -4.925\n", + "1141 -3.893\n", + "1142 -3.790\n", + "1143 -2.581\n", + "Name: logS, Length: 1144, dtype: float64" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 21 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SzrfuUZNFg_X", + "colab_type": "text" + }, + "source": [ + "## **3. 
Data split**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "dMRn8EVjFlrT", + "colab_type": "code", + "colab": {} + }, + "source": [ + "from sklearn.model_selection import train_test_split" + ], + "execution_count": 22, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "aOIAljc1FmXb", + "colab_type": "code", + "colab": {} + }, + "source": [ + "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)" + ], + "execution_count": 23, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "39nTAc3UFUMW", + "colab_type": "text" + }, + "source": [ + "## **4. Linear Regression Model**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K0MokzGBCimk", + "colab_type": "code", + "colab": {} + }, + "source": [ + "from sklearn import linear_model\n", + "from sklearn.metrics import mean_squared_error, r2_score" + ], + "execution_count": 24, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "vkR1siPuFZ6X", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "e20472e7-0179-419f-8dc8-8710cfd008cc" + }, + "source": [ + "model = linear_model.LinearRegression()\n", + "model.fit(X_train, Y_train)" + ], + "execution_count": 25, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 25 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aG4DMzc5Rks9", + "colab_type": "text" + }, + "source": [ + "### **Prediction on the training set (X_train)**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tZr9CBGvRp1F", + "colab_type": "code", + "colab": {} + }, + "source": [ + "Y_pred_train = model.predict(X_train)" + ], + "execution_count": 26, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0x3saPCyRtJP", + "colab_type": "code", 
+ "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + }, + "outputId": "68235d8c-e66a-4acb-e4d3-83ef26e92629" + }, + "source": [ + "print('Coefficients:', model.coef_)\n", + "print('Intercept:', model.intercept_)\n", + "print('Mean squared error (MSE): %.2f'\n", + " % mean_squared_error(Y_train, Y_pred_train))\n", + "print('Coefficient of determination (R^2): %.2f'\n", + " % r2_score(Y_train, Y_pred_train))" + ], + "execution_count": 27, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Coefficients: [-0.76779153 -0.00668131 0.00654032 -0.36959403]\n", + "Intercept: 0.3108998121270652\n", + "Mean squared error (MSE): 1.01\n", + "Coefficient of determination (R^2): 0.77\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M6evZTPNRecd", + "colab_type": "text" + }, + "source": [ + "### **Prediction on the test set (X_test)**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I_eFbrlaHhPU", + "colab_type": "code", + "colab": {} + }, + "source": [ + "Y_pred_test = model.predict(X_test)" + ], + "execution_count": 28, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "TQnDfyl5HkUr", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + }, + "outputId": "32b22212-424a-458f-8f0f-57095f5903ec" + }, + "source": [ + "print('Coefficients:', model.coef_)\n", + "print('Intercept:', model.intercept_)\n", + "print('Mean squared error (MSE): %.2f'\n", + " % mean_squared_error(Y_test, Y_pred_test))\n", + "print('Coefficient of determination (R^2): %.2f'\n", + " % r2_score(Y_test, Y_pred_test))" + ], + "execution_count": 29, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Coefficients: [-0.76779153 -0.00668131 0.00654032 -0.36959403]\n", + "Intercept: 0.3108998121270652\n", + "Mean squared error (MSE): 1.00\n", + "Coefficient of determination (R^2): 0.74\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "nERFfdQBRFF5", + "colab_type": "text" + }, + "source": [ + "### **Linear Regression Equation**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j3xLiGWHFiY1", + "colab_type": "text" + }, + "source": [ + "The work of Delaney$^1$ provided the following linear regression equation:\n", + "\n", + "> LogS = 0.16 - 0.63 cLogP - 0.0062 MW + 0.066 RB - 0.74 AP\n", + "\n", + "The reproduction by Pat Walters$^2$ provided the following:\n", + "\n", + "> LogS = 0.26 - 0.74 LogP - 0.0066 MW + 0.0034 RB - 0.42 AP\n", + "\n", + "This notebook's reproduction gave the following equations:\n", + "\n", + "* Based on the Train set (values vary between runs because the train/test split is random)\n", + "> LogS = 0.30 - 0.75 LogP - 0.0066 MW - 0.0041 RB - 0.36 AP\n", + "\n", + "* Based on the Full dataset\n", + "> LogS = 0.26 - 0.74 LogP - 0.0066 MW + 0.0032 RB - 0.42 AP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FaWyYnMbWtYu", + "colab_type": "text" + }, + "source": [ + "#### **Our linear regression equation**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0TH6J9evHIIE", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "9c960b7b-d2cf-4309-d187-60fa783a9529" + }, + "source": [ + "print('LogS = %.2f %.2f LogP %.4f MW + %.4f RB %.2f AP' % (model.intercept_, model.coef_[0], model.coef_[1], model.coef_[2], model.coef_[3] ) )" + ], + "execution_count": 33, + "outputs": [ + { + "output_type": "stream", + "text": [ + "LogS = 0.31 -0.77 LogP -0.0067 MW + 0.0065 RB -0.37 AP\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VcJyUzsLSz2A", + "colab_type": "text" + }, + "source": [ + "The same equation can also be produced with the following code, which breaks the previous one-liner into several more readable lines." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "byUbJ9QqK5gA", + "colab_type": "code", + "colab": {} + }, + "source": [ + "yintercept = '%.2f' % model.intercept_\n", + "LogP = '%.2f LogP' % model.coef_[0]\n", + "MW = '%.4f MW' % model.coef_[1]\n", + "RB = '%.4f RB' % model.coef_[2]\n", + "AP = '%.2f AP' % model.coef_[3]" + ], + "execution_count": 31, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QY-9rh--S-6g", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "095cefae-a92c-466d-8fb2-67abe856494f" + }, + "source": [ + "print('LogS = ' + \n", + " ' ' + \n", + " yintercept + \n", + " ' ' + \n", + " LogP + \n", + " ' ' + \n", + " MW + \n", + " ' + ' + \n", + " RB + \n", + " ' ' + \n", + " AP)" + ], + "execution_count": 34, + "outputs": [ + { + "output_type": "stream", + "text": [ + "LogS = 0.31 -0.77 LogP -0.0067 MW + 0.0065 RB -0.37 AP\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R3lRkSOJRm1q", + "colab_type": "text" + }, + "source": [ + "#### **Use entire dataset for model training (For Comparison)**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QUye6SsIRl9T", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "d93a60d1-cfb8-4ddd-843c-b3bd42be4161" + }, + "source": [ + "full = linear_model.LinearRegression()\n", + "full.fit(X, Y)" + ], + "execution_count": 35, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 35 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6tMI8n0oR1b5", + "colab_type": "code", + "colab": {} + }, + "source": [ + "full_pred = full.predict(X)" + ], + "execution_count": 36, + "outputs": [] + }, + { + "cell_type": "code", + 
"metadata": { + "id": "7ZVD8Fg1R6zt", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + }, + "outputId": "323a7815-5a7d-49c6-cb13-ad60c7f57e69" + }, + "source": [ + "print('Coefficients:', full.coef_)\n", + "print('Intercept:', full.intercept_)\n", + "print('Mean squared error (MSE): %.2f'\n", + " % mean_squared_error(Y, full_pred))\n", + "print('Coefficient of determination (R^2): %.2f'\n", + " % r2_score(Y, full_pred))" + ], + "execution_count": 37, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Coefficients: [-0.74173609 -0.00659927 0.00320051 -0.42316387]\n", + "Intercept: 0.2565006830997194\n", + "Mean squared error (MSE): 1.01\n", + "Coefficient of determination (R^2): 0.77\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AFYYzcc1VqIo", + "colab_type": "code", + "colab": {} + }, + "source": [ + "full_yintercept = '%.2f' % full.intercept_\n", + "full_LogP = '%.2f LogP' % full.coef_[0]\n", + "full_MW = '%.4f MW' % full.coef_[1]\n", + "full_RB = '+ %.4f RB' % full.coef_[2]\n", + "full_AP = '%.2f AP' % full.coef_[3]" + ], + "execution_count": 38, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zwU4QJhhVsKb", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "7e3042f6-6447-497b-e6a8-6b58a57599e7" + }, + "source": [ + "print('LogS = ' + \n", + " ' ' + \n", + " full_yintercept + \n", + " ' ' + \n", + " full_LogP + \n", + " ' ' + \n", + " full_MW + \n", + " ' ' + \n", + " full_RB + \n", + " ' ' + \n", + " full_AP)" + ], + "execution_count": 39, + "outputs": [ + { + "output_type": "stream", + "text": [ + "LogS = 0.26 -0.74 LogP -0.0066 MW + 0.0032 RB -0.42 AP\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qp-hjUv4IWe-", + "colab_type": "text" + }, + "source": [ + "## **Scatter plot of experimental vs. 
predicted LogS**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q6bP41fKEY9O", + "colab_type": "text" + }, + "source": [ + "### **Quick check of the variable dimensions of Train and Test sets**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LA5dH5oiEUnP", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "11d5f348-d3da-4bc3-ae52-96c6d607e14f" + }, + "source": [ + "Y_train.shape, Y_pred_train.shape" + ], + "execution_count": 41, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "((915,), (915,))" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 41 + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HIu7YbbFP-7o", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "6460df10-4f85-4b2b-bf60-d555afb7dc41" + }, + "source": [ + "Y_test.shape, Y_pred_test.shape" + ], + "execution_count": 42, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "((229,), (229,))" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 42 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHqv3TlYa5qF", + "colab_type": "text" + }, + "source": [ + "### **Vertical plot**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "shQPfrHIOmRD", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 660 + }, + "outputId": "1ddf9dac-2be1-482f-8451-3e4a978fffd0" + }, + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "plt.figure(figsize=(5,11))\n", + "\n", + "# 2 row, 1 column, plot 1\n", + "plt.subplot(2, 1, 1)\n", + "plt.scatter(x=Y_train, y=Y_pred_train, c=\"#7CAE00\", alpha=0.3)\n", + "\n", + "# Add trendline\n", + "# https://stackoverflow.com/questions/26447191/how-to-add-trendline-in-python-matplotlib-dot-scatter-graphs\n", + "z = 
np.polyfit(Y_train, Y_pred_train, 1)\n", + "p = np.poly1d(z)\n", + "plt.plot(Y_train,p(Y_train),\"#F8766D\")\n", + "\n", + "plt.ylabel('Predicted LogS')\n", + "\n", + "\n", + "# 2 rows, 1 column, plot 2\n", + "plt.subplot(2, 1, 2)\n", + "plt.scatter(x=Y_test, y=Y_pred_test, c=\"#619CFF\", alpha=0.3)\n", + "\n", + "z = np.polyfit(Y_test, Y_pred_test, 1)\n", + "p = np.poly1d(z)\n", + "plt.plot(Y_test,p(Y_test),\"#F8766D\")\n", + "\n", + "plt.ylabel('Predicted LogS')\n", + "plt.xlabel('Experimental LogS')\n", + "\n", + "plt.savefig('plot_vertical_logS.png')\n", + "plt.savefig('plot_vertical_logS.pdf')\n", + "plt.show()" + ], + "execution_count": 44, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PswCQ7Yra_CW", + "colab_type": "text" + }, + "source": [ + "### **Horizontal plot**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xG7NWEscT8QO", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 334 + }, + "outputId": "c22932ab-ba03-46d6-a83f-ae0ebe0d3dca" + }, + "source": [ + "plt.figure(figsize=(11,5))\n", + "\n", + "# 1 row, 2 columns, plot 1\n", + "plt.subplot(1, 2, 1)\n", + "plt.scatter(x=Y_train, y=Y_pred_train, c=\"#7CAE00\", alpha=0.3)\n", + "\n", + "z = np.polyfit(Y_train, Y_pred_train, 1)\n", + "p = np.poly1d(z)\n", + "plt.plot(Y_train,p(Y_train),\"#F8766D\")\n", + "\n", + "plt.ylabel('Predicted LogS')\n", + "plt.xlabel('Experimental LogS')\n", + "\n", + "# 1 row, 2 columns, plot 2\n", + "plt.subplot(1, 2, 2)\n", + "plt.scatter(x=Y_test, y=Y_pred_test, c=\"#619CFF\", alpha=0.3)\n", + "\n", + "z = np.polyfit(Y_test, Y_pred_test, 1)\n", + "p = np.poly1d(z)\n", + "plt.plot(Y_test,p(Y_test),\"#F8766D\")\n", + "\n", + "plt.xlabel('Experimental LogS')\n", + "\n", + "plt.savefig('plot_horizontal_logS.png')\n", + "plt.savefig('plot_horizontal_logS.pdf')\n", + "plt.show()" + ], + "execution_count": 45, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ARiv3f1iC565", + "colab_type": "text" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwM1QHeLbxJl", + "colab_type": "text" + }, + "source": [ + "## **Reference**\n", + "\n", + "1. John S. Delaney. [ESOL:  Estimating Aqueous Solubility Directly from Molecular Structure](https://pubs.acs.org/doi/10.1021/ci034243x). ***J. Chem. Inf. Comput. Sci.*** 2004, 44, 3, 1000-1005.\n", + "\n", + "2. Pat Walters. [Predicting Aqueous Solubility - It's Harder Than It Looks](http://practicalcheminformatics.blogspot.com/2018/09/predicting-aqueous-solubility-its.html). ***Practical Cheminformatics Blog***\n", + "\n", + "3. Bharath Ramsundar, Peter Eastman, Patrick Walters, and Vijay Pande. [Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More](https://learning.oreilly.com/library/view/deep-learning-for/9781492039822/), O'Reilly, 2019.\n", + "\n", + "4. [Supplementary file](https://pubs.acs.org/doi/10.1021/ci034243x) from Delaney's ESOL:  Estimating Aqueous Solubility Directly from Molecular Structure." + ] + } + ] +} \ No newline at end of file