Machine Learning and Chemistry

html

School

Temple University *

*We aren’t endorsed by this school

Course

853

Subject

Computer Science

Date

Jan 9, 2024

Type

html

Pages

51

Uploaded by ProfFreedom5002

Report
Machine Learning and prediction Elements of Data Science In this laboratory we will use training data to predict outcomes. We will first test these ideas using our Old Faithful data again. Next we will look at data on the iris flower to classify iris' based on sepal width and length. In our culminating activity we will predict molecular acidity using data computed by Prof. Vince Voelz in the Temple Chemistry department and a graduate student, Robert Raddi. See their paper: Stacking Gaussian processes to improve pKa predictions in the SAMPL7 challenge . In [1]: Your_name = "Madlyn Anglin" Learning from training data A key concept in machine learning is using a subset of a dataset to train an algorithm to make estimates on a separate set of test data. The quality of the machine learning and algorithm can be assesed based on the accuracy of the predictions made on test data. Many times there are also parameters sometimes termed hyper-parameters which can be optimized through an iterative approach on test or validation data. In practice a dataset is randomly split into training and test sets using sampling. k nearest neighbor We will examine one machine learning algorithm in the laboratory, k nearest neighbor. Many of the concepts are applicable to the broad range of machine learning algorithms available. In [2]: ## import statements # These lines load the tests. from gofer.ok import check import numpy as np from datascience import * import pandas as pd import matplotlib %matplotlib inline import matplotlib.pyplot as plt plt.style.use('ggplot') import warnings warnings.simplefilter('ignore', UserWarning) #from IPython.display import Image from matplotlib.colors import ListedColormap from sklearn import neighbors, datasets # Fix for datascience collections Iterable import collections as collections import collections.abc as abc collections.Iterable = abc.Iterable !pip install jupyterquiz from jupyterquiz import display_quiz import json from IPython.core.display import HTML Requirement already satisfied: jupyterquiz in /opt/conda/lib/python3.10/site-packages (2.6.1) k nearest neighbor regression We will use the k nearest neighbor algorithm to make predictions of wait time in minutes
following an eruption duration ofa given number of minutes (independent variable). In [3]: faithful = Table.read_table("data/faithful.csv") faithful.scatter(0, 1, fit_line=True) Question 1 Use the datascience .split(n) Table method to split the dataset into 80% training and 20% test. The argument for .split(n) method,n, needs to be an integer. See datascience documentation In [4]: trainf, testf = faithful.split(int(faithful.num_rows * 0.8)) print(trainf.num_rows, 'training and', testf.num_rows, 'test instances.') 217 training and 55 test instances. In [5]: check('tests/q1.py') Out[5]: All tests passed! Nearest neighbor concept The training examines the chreacteristics of k nearest neighbors to the data point for which a prediction will be made. Nearness is measured using several different metrics with Euclidean distance being a common one for numerical attributes. Euclidean distance: 1-D $$ d(p,q) = \sqrt{(p-q)^{2}} $$ 2-D $$ d(p,q) = \sqrt{(p_1-q_1)^{2}+(p_2-q_2)^{2}} $$ For multiple points (rows): 2-D $$ d(p,q) = \sum{{\sqrt{((p_1-q_1)^{2}+(p_2- q_2)^{2}}}} $$
Try different attribute values in the following 2D Euclidean distance example code below to get a feel for the computation In [6]: # Example code to compute an Euclidean distance between two 2-D points d_p_q = np.sqrt(sum((make_array(2,3)-make_array(4,3))**2)) d_p_q Out[6]: 2.0 To get values from Table row as an array as is done in row_distance. Note in the faithful data case we will only consider the duration column in nearest neighbor computation but in examples below we will use a 2-D array of attributes with the iris data and a 10-D array in the chemistry and molecular acidity case. In [7]: f_array = np.array(faithful.row(0)) f_array Out[7]: array([ 3.6, 79. ]) A couple quick review questions about nearest neighbor below, select the best answer (multiple tries ok). Execute the below cell to reveal the self-check quiz. In [8]: with open("questions.json", "r") as file: questions=json.load(file) display_quiz(questions) Question 2 Define a function which is the Euclidean distance between two values. Use the last two example code cells above as inspiration. This is where we will compute the distance between two duration values. In [11]: def distance(pt1, pt2): """The distance between two points, represented as arrays.""" return np.sqrt(sum((pt1 - pt2)**2)) In [12]: check('tests/q2.py') Out[12]: All tests passed! Rest of the nearest neighbor algorithm Execute these cells to create the complete algorithm In [13]: def row_distance(row1, row2): """The distance between two rows of a table.""" return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays def distances(training, example, output): """Compute the distance from example for each row in training.""" dists = [] attributes = training.drop(output) for row in attributes.rows: dists.append(row_distance(row, example))
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
return training.with_column('Distance', dists) def closest(training, example, k, output): """Return a table of the k closest neighbors to example.""" return distances(training, example, output).sort('Distance').take(np.arange(k)) Question 3 Take an example row from the test data (testf), drop the prediction column and use the closest function to see the top 10 closest points to the target in the training data. In [14]: example_row = testf.row(1) example_row # This should display data contained in selected row in testf table. Out[14]: Row(duration=3.7669999999999999, wait=83.0) In [15]: k = 10 # Number of nearest neighbors closest(testf,example_row,k,'wait') Out[15]: duration wait Distance 5.033 77 77.9773 4.9 82 78.1082 4.817 77 78.1901 4.8 76 78.2068 4.767 78 78.2394 4.716 90 78.2898 4.7 84 78.3056 4.7 83 78.3056 4.7 88 78.3056 4.7 73 78.3056 In [16]: check('tests/q3.py') Out[16]: All tests passed! Question 4 Predict the value for this row using the defined predict_nn function below and compare to the value reported for wait in the test data. How do they compare? In [17]: def predict_nn(example):
"""Return the majority class among the k nearest neighbors.""" k = 10 return np.average(closest(trainf, example, k , 'wait').column('wait')) In [18]: predictionf = predict_nn(example_row) # This is the value predicted for wait using the average of the k nearest neighbors in the test set actual = example_row print(predictionf,actual) 85.8 Row(duration=3.7669999999999999, wait=83.0) The prediction value and the reported value for wait in the test data are somewhat close, but not perfect. In [19]: check('tests/q4.py') Out[19]: All tests passed! Question 5 Predictions Now we will make predictions for the whole data set using the apply Table method. We will then look at the root mean squared error (RMSE) for the nearest neighbor fit and a scatter plot. Try adjusting the value of k in the predict_nn function to see it's effect on the quality of fit by rerunning these cells. Are the predicted points in a perfect straight line, why or why not? In [20]: testf = testf.with_columns("predict",testf.apply(predict_nn,"duration")) nn_test_predictions = testf.column("predict") test_wait = testf.column("wait") rmse_nn = np.mean((test_wait - nn_test_predictions) ** 2) ** 0.5 print('Test set RMSE for nearest neighbor regression:', round(rmse_nn,2)) Test set RMSE for nearest neighbor regression: 6.81 In [21]: testf.scatter("duration")
The points are not in a straight line. It looks like there is a slight trend but there are still two clusters of data points. Classify iris data with machine learning Next we will take on the problem of classifying iris data into three categories, setosa, versicolor, and virginica. Here we will also learn the basics of the k nearest neighbor algorithm. The first data set we will look at consists of 50 samples from three species of Iris (Iris Setosa, Iris virginica, and Iris versicolor). Four features were measured including the length and the width of the sepals and petals, in centimeters for each observation. Iris stainglass, J.R. Smith In [22]: n_neighbors = 15 # Load iris data iris = datasets.load_iris() # We only take the first two features. iris_table = Table().with_columns("Name",iris.target,iris.feature_names[0],iris.data[:,0],iris.feature_ names[1],iris.data[:,1])
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
iris_table Out[22]: Name sepal length (cm) sepal width (cm) 0 5.1 3.5 0 4.9 3 0 4.7 3.2 0 4.6 3.1 0 5 3.6 0 5.4 3.9 0 4.6 3.4 0 5 3.4 0 4.4 2.9 0 4.9 3.1 ... (140 rows omitted) In [23]: iris.target_names Out[23]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10') Question 6 Train and test split the iris_table @ 80% as above. In [24]: train_i, test_i = iris_table.split(int(iris_table.num_rows * 0.8)) print(train_i.num_rows, 'training and', test_i.num_rows, 'test instances.') 120 training and 30 test instances. In [25]: check('tests/q6.py') Out[25]: All tests passed! Question 7 With classification we need to use training data to decide how to classify data given a set of attributes, sepal length and sepal width in this case. Create a function which returns the majority classification among three possibilities in "Name" coded as 0, 1, 2 (setosa, versicolor, and virginica respectively). The and below combines two conditionals. For example, (twos > ones) and ... In [26]: def majority(topkclasses):
twos = topkclasses.where('Name', are.equal_to(2)).num_rows ones = topkclasses.where('Name', are.equal_to(1)).num_rows zeros = topkclasses.where('Name', are.equal_to(0)).num_rows # Now test to see what the majority name for each k class if twos and ones: return 2 elif ones and zeros: return 1 else: return 0 In [27]: check('tests/q7.py') Out[27]: All tests passed! In [28]: def classify(training, new_point, k): closestk = closest(training, new_point, k,"Name") topkclasses = closestk.select('Name') return majority(topkclasses) In [29]: test_row = 21 print("Prediction: ",classify(train_i,example_row,test_row)," Actual: ",test_i.select("Name").row(test_row)) Prediction: 0 Actual: Row(Name=0) In [30]: def predict(train, test_attributes, k): pred = [] for i in np.arange(test_attributes.num_rows): pred.append(classify(train,test_attributes.row(i),k)) return pred Question 8 Make a new table called prediction which includes original columns of test Table but also includes a "predict" column. In [31]: k = 10 prediction = test_i.with_columns("predict",predict(train_i, test_i.drop('Name'), k)) prediction.show(30) Name sepal length (cm) sepal width (cm) predict 0 4.8 3.4 0 1 5.6 2.7 2 0 4.9 3.1 0 2 6 3 2 1 6.4 2.9 2
Name sepal length (cm) sepal width (cm) predict 1 6.1 3 2 0 5.5 3.5 0 1 5.9 3.2 2 1 5.7 2.6 2 0 5.4 3.4 1 1 5.6 2.5 2 2 5.9 3 2 2 7.7 2.8 2 0 4.8 3.1 0 2 6.4 2.8 2 1 5.7 2.9 2 0 5.4 3.9 0 2 6.5 3 2 1 5.5 2.4 2 0 5 3 1 2 6.7 3.3 2 0 5 3.2 0 0 4.7 3.2 0 2 6.3 2.5 2 2 7.4 2.8 2 0 5 3.3 0 1 6.9 3.1 2
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Name sepal length (cm) sepal width (cm) predict 2 7.9 3.8 2 1 5.2 2.7 2 0 4.3 3 0 In [32]: check('tests/q8.py') Out[32]: All tests passed! Plot decision outcomes for test set Question 9 Use above prediction Table to make a scatter plot of the color coded predictions based on the tweo attributes(use colors="predict" in scatter plot after specifying x and y axis based on attributes). In [34]: prediction.drop("Name").scatter('sepal length (cm)','sepal width (cm)', colors, group = 2) colors = "predict" Fancy plot showing color coded decision boundaries We can make a more informative plot by predicting on a grid of attribute values as shown below. Seaborn is an add-on to the Matplotlib plotting we have been using which provides more control of plotting. Execute (this may take a minute+) and study the below input and resulting output for your information.
In [35]: def make_colors(iris, y, cmap): colors = [] cdict = {'setosa':0, 'virginica':2, 'versicolor':1} for x in iris.target_names[y]: colors.append(cmap[cdict[x]]) return colors In [36]: import seaborn as sns # Plot the decision boundary. For that, we will assign a color to each # point in the mesh [x_min, x_max]x[y_min, y_max]. h = .1 # step size in the mesh k = 10 x_min, x_max = iris.data[:, 0].min() - 1, iris.data[:, 0].max() + 1 y_min, y_max = iris.data[:, 1].min() - 1, iris.data[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) ## Create a grid of predictions in a Table attribute_grid = Table().with_columns(iris.feature_names[0],np.c_[xx.ravel(), yy.ravel()] [:,0],iris.feature_names[1],np.c_[xx.ravel(), yy.ravel()][:,1]) Z = np.array(predict(train_i,attribute_grid,k)) # Create color maps cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue']) cmap_bold = ['darkorange', 'c', 'darkblue'] # Put the result into a color plot Z = Z.reshape(xx.shape) plt.figure(figsize=(8, 6)) plt.contourf(xx, yy, Z, cmap=cmap_light) # Plot the test points but convert to numpy arrays predictions = prediction.column('predict') attribute1 = prediction.column(1) attribute2 = prediction.column(2) plt.scatter(x=attribute1, y=attribute2, c = make_colors(iris, predictions, cmap_bold), alpha=1.0, edgecolor="black") plt.xlim(xx.min(), xx.max()) plt.ylim(yy.min(), yy.max()) plt.title("3-Class classification (k = %i')" % (k)) plt.xlabel(iris.feature_names[0]) plt.ylabel(iris.feature_names[1]) Out[36]: Text(0, 0.5, 'sepal width (cm)')
Use scikit learn Scikit learn is a standard state of the art machine learning library. For demonstration purposes execute the below commands to classify and generate a comparable output. In [37]: from sklearn.neighbors import KNeighborsRegressor from sklearn.metrics import mean_squared_error from sklearn.metrics import r2_score from sklearn.model_selection import train_test_split In [38]: clf = neighbors.KNeighborsClassifier(k) # Initiate the classifier x_train, x_test, y_train, y_test = train_test_split(iris.data[:,:2], iris.target, random_state=22) # scikit split # Now fit clf.fit(x_train, y_train) Out[38]: KNeighborsClassifier(n_neighbors=10) In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. KNeighborsClassifier
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
KNeighborsClassifier(n_neighbors=10) In [39]: import seaborn as sns # Plot the decision boundary. For that, we will assign a color to each # point in the mesh [x_min, x_max]x[y_min, y_max]. h = .1 # step size in the mesh x_min, x_max = iris.data[:, 0].min() - 1, iris.data[:, 0].max() + 1 y_min, y_max = iris.data[:, 1].min() - 1, iris.data[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Create color maps cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue']) cmap_bold = ['darkorange', 'c', 'darkblue'] # Put the result into a color plot Z = Z.reshape(xx.shape) plt.figure(figsize=(8, 6)) plt.contourf(xx, yy, Z, cmap=cmap_light) # Plot also the training points y = y_test plt.scatter(x=x_test[:, 0], y=x_test[:, 1], c = make_colors(iris, y, cmap_bold), alpha=1.0, edgecolor="black") plt.xlim(xx.min(), xx.max()) plt.ylim(yy.min(), yy.max()) plt.title("3-Class classification (k = %i')" % (k)) plt.xlabel(iris.feature_names[0]) plt.ylabel(iris.feature_names[1]) Out[39]: Text(0, 0.5, 'sepal width (cm)')
Question 10 Comment on the quality of the predictions by 1. Your nearest neighbor algorithm 2. scikit learn 3. Comparison The nearest neighbor algorithm quality is good because when looking at the graph each color lines up with its classification. The scikit learn quality is also good when compared to the data. Both are good quality but the scikit learn is more accurate than the neighbor algorithm. Molecules and predicting acidity measured by pKa Within the Jupyter notebook we can also analyze molecules and their molecular data using the library RDKit. RDKit adds the ability to visualize 2D and 3D molecular structures. We can apply many of the data science tools we have learned to molecular data as well. First we will briefly look at acid-base chemistry and how acidity is defined. pH is a measure of the acidity of a water-based (aqueous) solution. A pH of 1 is acidic, a pH of 7 is neutral and a pH of 14 is basic. Next we will use some computed atributes of a large set of molecules to train a k nearest neighbor model to predict acidity. We will use a range of attributes including the partial charges on atoms adjacent to the acidic proton, molecular weight, solvent accessible surface area (SASA), carbon-oxygen bond order, and some thermochemistry measures all of which may help predict acidity with a lower
pKa indicating a stronger (weak) acid. Acid-base and pKa background A very brief background on acid - base equilibria demonstrated for glycine. See OpenStax Chemistry for details based on interest. RDKit RDKit is a specialized library to handle the complexities of molecules within Python. In [40]: from rdkit import Chem from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions #Only needed if modifying defaults from rdkit.Chem import rdRGroupDecomposition from rdkit.Chem import rdDepictor from rdkit.Chem import PandasTools from rdkit.Chem import AllChem from rdkit.Chem import Draw from rdkit import DataStructs # Options DrawingOptions.bondLineWidth=1.8 rd =True Load detailed molecular data for 2000 molecules In [41]: url = "https://raw.githubusercontent.com/robraddi/GP-SAMPL7/main/pKaDatabase/OChem/ochem0- 2000.csv" data = Table.read_table(url) data=data.sort('N') data.show(5) SMILES CASRN RECORDID MOLECULEID EXTERNALID N NAME NAM 1 [O-]C1=C2C=C C=CC2=NC=N 1 - R1207641 M20829 - - 4- Hydroxyquinazoline - OC1=CC2=CN =CN=C2C=C1 - R1207643 M1107327 - - 6- hydroxyquinazoline - OC1=CC2=NC =NC=C2C=C1 - R1207645 M1107328 - - 7- hydroxyquinazoline - CC1=C2C=CC= C(O)C2=NC=N - R1207650 M46729 - - 8-hydroxy-4- -
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
SMILES CASRN RECORDID MOLECULEID EXTERNALID N NAME NAM 1 1 methylquinazoline [S-]c1ncc2cccc c2[n+]1 - R1207652 M1158202 - - 2- mercaptoquinazoline - ... (1995 rows omitted) Question 11. Select an amino acid Use the Table above to view data for an amino acid of your selection from the 21 amino acids which are building blocks of protiens. See web page for possible choices. Hint: use are.containing within the .where() Table method. For example below we can find compounds which contain a trimethyl group (3 CH$_3$) groups. We get 11 rows (records). In [46]: trimethyl = data.where("NAME",are.containing("trimethyl")) trimethyl Out[46]: SMILES CASRN RECORDID MOLECULEID EXTERNALID N NAME NAME. 1 INT CC1=NC (C)=C(C )C(=N1) S([O-]) (=O)=O - R1207394 M1158149 - - 2,4,5-trimethyl-6- sulphopyrimidine - Koe CC1=NC (C)=C(C )N=N1 - R1207401 M1158154 - - 3,5,6-trimethyl- 1,2,4-triazine - Koe CC1=NC ([O-])=N - R1207338 M1106856 - - 2-hydroxy-4,5,6- - Koe
SMILES CASRN RECORDID MOLECULEID EXTERNALID N NAME NAME. 1 INT C(C)=C1 C trimethylpyrimidine Cc1nc([ O-]) [n+]c(C) c1C - R1207339 M1158110 - - 2-hydroxy-4,5,6- trimethylpyrimidine - Koe CCC1=C (C)N=N C(C)=C1 C - R1207140 M1157998 - - 4-ethyl-3,5,6- trimethylpyridazine - Koe CC1=CC (C)=C(C )N=N1 - R1207154 M1158007 - - 3,4,6- trimethylpyridazine - Koe CNC(=N )N(C)C - R1206678 M1106709 - - N,N',N'- trimethylaguanidine - Koe CNN(C)C - R1206476 M1016037 - - trimethylhydrazine - Koe CC1=C( C)C(C)= NO1 - R1206399 M32590 - - Isoxazoline, 3,4,5- trimethyl- - Koe CC1(C)C (CCC1(C )C(O)=O )C(O)=O - R1203666 M6436 - - cyclopentan-1,3- dicarboxlic acid- 1,2,2-trimethyl - Koe ... (1 rows omitted) In [47]: amino = data.where("NAME",are.containing("phenylalanine")) amino Out[47]:
SMILES CASRN RECORDID MOLECULEID EXTERNALID N NAME NAME. 1 [NH2+]C (CC1=C C=CC= C1F)C(O )=O - R1204234 M1157740 - - o-Fluorophenylalanine - [NH2+]C (CC1=C C(F)=CC =C1)C(O )=O - R1204236 M1157741 - - m-Fluorophenylalanine - [NH2+]C (CC1=C C=C(F)C =C1)C(O )=O - R1204238 M1157742 - - p=fluorophenylalanine - [NH2+]C (CC1=C C=CC= C1Cl)C( O)=O - R1204240 M1157743 - - O-Chlorophenylalanine - [NH2+]C (CC1=C C(Cl)=C C=C1)C( O)=O - R1204242 M1157744 - - m-chlorophenylalanine - [NH2+]C (CC1=C C=C(O) C(O)=C1 )C(O)=O - R1204248 M1157747 - - 3,4- Dihydroxyphenylalanin e - [NH2+]C (CC1=C C=C(O) - R1204249 M1157747 - - 3,4- Dihydroxyphenylalanin e -
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
SMILES CASRN RECORDID MOLECULEID EXTERNALID N NAME NAME. 1 C(O)=C1 )C([O-]) =O NC(CC1 =CC=C( [O-])C(O )=C1)C([ O-])=O - R1204250 M12590 - - 3,4- Dihydroxyphenylalanin e - In [48]: check('tests/q11.py') Out[48]: All tests passed! Display molecular structure SMILES is a shorthand language to describe molecular structure. Execute each structure below. In [49]: Chem.MolFromSmiles("[H]-O-[H]") #Water Out[49]: In [50]: Chem.MolFromSmiles("[CH3]") #Methyl radical Out[50]:
In [51]: Chem.MolFromSmiles("C-C-O") #Ethanol Out[51]: In [52]: Chem.MolFromSmiles("[NH2+]CC(O)=O") # Glycine Out[52]: Selected amino acid 2D molecular structure Try it out for fun! Use the same above syntax and the SMILES string from your above Table to display a 2D amino acid structurefrom your selection. Even if there is no rdkit, try your hand at the SMILES molecular description. In [53]: smile_struct = '[NH2+]C(CC1=CC=CC=C1Cl)C(O)=O' Chem.MolFromSmiles(smile_struct) Out[53]: Code to create a grid of molecular images with labels Execute and study the below code In [54]: mols = [Chem.MolFromSmiles(x) for x in amino.column("SMILES") if x is not None] #Iterator mols name = amino.column("NAME") for i,m in enumerate(mols): m.SetProp("Name",name[i])
p = Draw.MolsToGridImage( [mols[x] for x in range(0,3)] , legends=[x.GetProp("Name") for x in mols],molsPerRow=3,subImgSize=(300,250), useSVG=True ) p Out[54]: Use pandas to add 2D structures to dataframe We can convert our Table to pandas then use the RDKit AddMoleculeColumnToFrame method to add structures. One row has an anomolous nitrogen atom, N, so don't be alarmed by the error presented. Occasionally the 2D images of the structures fail to appear, unfortunate but not a cause for concern either. In [55]: df = data.to_df() df Out[55]: SMILES CASRN RECORDID MOLECULEID EXTERNALID N NA 0 [O-]C1=C2C=C C=CC2=NC=N 1 - R1207641 M20829 - - 4-Hydroxyquinazo 1 OC1=CC2=CN =CN=C2C=C1 - R1207643 M1107327 - - 6-hydroxyquinazo 2 OC1=CC2=NC =NC=C2C=C1 - R1207645 M1107328 - - 7-hydroxyquinazo 3 CC1=C2C=CC =C(O)C2=NC= N1 - R1207650 M46729 - - 8-hydroxy-4-meth 4 [S-]c1ncc2cccc c2[n+]1 - R1207652 M1158202 - - 2-mercaptoquina ... ... ... ... ... ... ... ... 1995 OC(=O)COC1= CC=CC(=C1) [N+]([O-])=O 1878- 88-2 R2182408 M896 - 99 m-Nitrophenoxya
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
SMILES CASRN RECORDID MOLECULEID EXTERNALID N NA 1996 COC1=CC=CC( OC)=C1C(=O)N [C@H]1[C@H]2 SC(C)(C)[C@... 61-32-5 R1509608 M9792 - 99 6-({[2,6- bis(methyloxy)ph mino)-... 1997 CCCCCCCC\ C=C\ CCCCCCCC(=O )OC[C@@H] (CO[P@@](O) (=... - R1321912 M663928 - 99 1,2-Dioleoylphosp 1998 CCCCCCCC\ C=C\ CCCCCCCC(=O )OCC(COP(O) (=O)OCCN)OC.. . - R1321798 M659539 - 99 1,2- Dioleoylphosphat e 1999 NC1=CC=C(N) C(O)=C1N - R2172809 M2608463 - nan - 2000 rows × 32 columns In [56]: df = data.to_df() # Convert Table to pandas dataframe PandasTools.AddMoleculeColumnToFrame(df,smilesCol='SMILES',molCol='Molecule',includeFinger prints=True) col = df.pop("NAME") df.insert(0, col.name, col) # Move structure to first column col = df.pop("Molecule") df.insert(1, col.name, col) # Move structure to first column df [02:03:33] Explicit valence for atom # 11 N, 4, is greater than permitted Out[56]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
NAME Molecule SMILES CASRN REC 0 4-Hydroxyquinazoline [O-]C1=C2C=C C=CC2=NC=N 1 - R12 1 6-hydroxyquinazoline OC1=CC2=CN =CN=C2C=C1 - R12 2 7-hydroxyquinazoline OC1=CC2=NC =NC=C2C=C1 - R12 3 8-hydroxy-4-methylquinazoline CC1=C2C=CC =C(O)C2=NC= N1 - R12
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
NAME Molecule SMILES CASRN REC 4 2-mercaptoquinazoline [S-]c1ncc2cccc c2[n+]1 - R12 ... ... ... ... ... ... 1995 m-Nitrophenoxyacetic Acid OC(=O)COC1= CC=CC(=C1) [N+]([O-])=O 1878- 88-2 R21 1996 6-({[2,6- bis(methyloxy)phenyl]carbonyl}a mino)-... COC1=CC=CC( OC)=C1C(=O)N [C@H]1[C@H]2 SC(C)(C)[C@... 61-32-5 R15 1997 1,2-Dioleoylphosphatidylethano CCCCCCCC\ C=C\ CCCCCCCC(=O )OC[C@@H] (CO[P@@](O) (=... - R13
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
NAME Molecule SMILES CASRN REC 1998 1,2- Dioleoylphosphatidylethanolamin e CCCCCCCC\ C=C\ CCCCCCCC(=O )OCC(COP(O) (=O)OCCN)OC.. . - R13 1999 - NC1=CC=C(N) C(O)=C1N - R21 2000 rows × 33 columns pKa data examination Now we will look at a data set derived from the above data but with computed molecular attributes for our machine learning. This data set is computed and described by Prof. Vince Voelz in the Temple Chemistry department and a graduate student, Robert Raddi. See their paper: Stacking Gaussian processes to improve pKa predictions in the SAMPL7 challenge . In [57]: db = pd.read_pickle("data/pKaDatabaseF22.pkl") db_table = Table().from_df(db) # Datascience Table from pandas dataframe db Out[57]: deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) AM1 par cha (dep ato 0 methyclothiazide_mic ro001 methyclothiazide_mic ro000 C[N@]1[C@@H ] ([N-]c2cc(c(cc2 C[N@]1[C@@ H] (Nc2cc(c(cc2S -0.79657 -0.55
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) AM1 par cha (dep ato S1(=O)=O)S(= O)(=O)N... 1(=O)=O)S(= O)(=O)N)Cl... 1 sulpiride_micro001 sulpiride_micro000 CC[N@]1CCC[C @H]1C[N-]C(= O)c2cc(ccc2OC )S(=O)(=O)N CC[N@]1CCC[ C@H]1CNC(= O)c2cc(ccc2O C)S(=O) (=O)N -0.53903 -0.59 2 celecoxib_micro001 celecoxib_micro000 Cc1ccc(cc1)c2c c(nn2c3ccc(cc3 )S(=O)(=O) [NH-])C(... Cc1ccc(cc1)c2 cc(nn2c3ccc(c c3)S(=O) (=O)N)C(F) (F)F -1.02861 -1.31 3 metolazone_micro00 1 metolazone_micro00 0 Cc1ccccc1N2[C @@H] ([N-]c3cc(c(cc3 C2=O)S(=O) (=O)... Cc1ccccc1N2[ C@@H] (Nc3cc(c(cc3C 2=O)S(=O) (=O)N)Cl)C -0.74198 -0.55 4 polythiazide_micro00 1 polythiazide_micro00 0 C[N@]1[C@@H ] ([N-]c2cc(c(cc2 S1(=O)=O)S(= O)(=O)N... C[N@]1[C@@ H] (Nc2cc(c(cc2S 1(=O)=O)S(= O)(=O)N)Cl... -0.79841 -0.56 ... ... ... ... ... ... ... 137 sulfadimethoxine_mic ro001 sulfadimethoxine_mic ro000 COc1cc([N-]S( =O) (=O)c2ccc(N)cc 2)nc(OC)n1 N([H]) (c1nc(OC([H]) ([H]) [H])nc(OC([H] )([H])[H]... -0.86987 -0.98 138 sulfamethoxydiazine_ micro001 sulfamethoxydiazine_ micro000 COc1cnc([N-]S( =O) (=O)c2ccc(N)cc 2)nc1 N([H]) (c1nc([H])c(O C([H])([H]) [H])c([H])n1)S (=... -0.85693 -0.86
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) AM1 par cha (dep ato 139 sulfisomidine_micro0 01 sulfisomidine_micro0 00 Cc1cc([N-]S(= O) (=O)c2ccc(N)cc 2)nc(C)n1 N([H]) (c1nc(C([H]) ([H]) [H])nc(C([H]) ([H])[H])c... -0.87755 -0.90 140 sulfamethazine_micr o001 sulfamethazine_micr o000 Cc1cc(C)nc([N- ]S(=O) (=O)c2ccc(N)cc 2)n1 N([H]) (c1nc(C([H]) ([H]) [H])c([H])c(C([ H])([H])... -0.85927 -0.94 141 sulfapyridine_micro0 01 sulfapyridine_micro0 00 Nc1ccc(S(=O) (=O) [N-]c2ccccn2)c c1 N([H]) (c1nc([H])c([H ])c([H])c1[H]) S(=O) (=O)c1c... -0.88207 -0.94 3456 rows × 34 columns We can look at the structures and data on several derivatives of acetic acid by executing the code below In [58]: glycine=db_table.where("protonated microstate ID",are.containing("acetic acid")) # Select those data containing acetic acid in the name. mols = [Chem.MolFromSmiles(x) for x in glycine.column("protonated microstate smiles") if x is not None] pKa = glycine.column("pKa") for i,m in enumerate(mols): m.SetProp("Name","pKa: "+str(pKa[i])) p = Draw.MolsToGridImage( [mols[x] for x in range(0,3)] , legends=[x.GetProp("Name") for x in mols],molsPerRow=2,subImgSize=(300,250), useSVG=True ) p Out[58]: protonated microstate smiles Here we place the pKa which we will predict in the first column. Use SMILES format to display structures. Execute the below cells. In [59]: #PandasTools.AddMoleculeColumnToFrame(db,smilesCol='protonated microstate smiles',molCol='Molecule',includeFingerprints=True) PandasTools.AddMoleculeColumnToFrame(db, smilesCol='protonated microstate smiles', molCol='protonated molecule')
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
PandasTools.AddMoleculeColumnToFrame(db, smilesCol='deprotonated microstate smiles', molCol='deprotonated molecule') col = db.pop('protonated molecule') db.insert(0, col.name, col) col = db.pop('deprotonated molecule') db.insert(1, col.name, col) col = db.pop("pKa") db.insert(0, col.name, col) db=db.sort_values(by='pKa',ascending=False) db.head() Out[59]: pKa protonated molecule deprotonated molecule deprotonated microstate ID protonated microstate ID d 735 19.2 <rdkit.Chem.rdchem.Mol object at 0x7fcdf84eac70> <rdkit.Chem.rdchem.Mol object at 0x7fcdf81de1f0> 2-Methyl-2- propanol_micro 001 2-Methyl-2- propanol_micr o000 C 1323 17.6 <rdkit.Chem.rdchem.Mol object at 0x7fcdf84aeff0> <rdkit.Chem.rdchem.Mol object at 0x7fcdf815e570> 2- butanol_micro0 01 2- butanol_micro 000 C [O 832 17.1 <rdkit.Chem.rdchem.Mol object at 0x7fcdf8499770> <rdkit.Chem.rdchem.Mol object at 0x7fcdf80f4cf0> 2- Propanol_micro 001 2- Propanol_micr o000 C 1200 16.6 <rdkit.Chem.rdchem.Mol object at 0x7fcdf84c7990> <rdkit.Chem.rdchem.Mol object at 0x7fcdf81baf10> 3- methylindole_ micro001 3- methylindole_ micro000 C c 1473 16.4 <rdkit.Chem.rdchem.Mol object at 0x7fcdf840cdd0> <rdkit.Chem.rdchem.Mol object at 0x7fcdf82300b0> Mandelic acid_micro001 Mandelic acid_micro00 0 c [C [O 5 rows × 34 columns Examine a few acidity and molecular weight distribution The below code will generate histograms for acidity as measured by pKa and molar molecular weight measured in grams per mole. Execute code and examine output. In [60]: fig = plt.figure() ax = plt.subplot(2,2,1) ax1 = plt.subplot(2,2,2)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
db.sort_values('Weight', ascending=True) ax = db["Weight"].plot.hist(rot=0, figsize=(14, 4), bins=25, edgecolor='black', linewidth=1.2, ax=ax) ax.set_xlabel("molecular weight", size=16) ax.set_ylabel("", size=12) ax.axvline(x=db['Weight'].mean(), linewidth=4, color='r') ax1 = db["pKa"].plot.hist(rot=0, figsize=(14, 4), bins=25, edgecolor='black', linewidth=1.2, ax=ax1)#, subplots=True, layout=(2,2)) ax1.set_xlabel(r"$pK_{a}$", size=16) ax1.set_ylabel("", size=12) ax1.axvline(x=4, linewidth=4, color='r') ax1.axvline(x=9, linewidth=4, color='r') fig = ax1.get_figure() fig.savefig("MW_dist.pdf") Look at pKa and molecular attribute relationships Here we will plot some of the attributes to see if there is a relationship between their values and the pKa we are trying to predict. Execute these cells. In [61]: db.plot.scatter("Weight","pKa") Out[61]: <Axes: xlabel='Weight', ylabel='pKa'>
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
In [62]: db.plot.scatter("AM1BCC partial charge (prot. atom)","pKa") Out[62]: <Axes: xlabel='AM1BCC partial charge (prot. atom)', ylabel='pKa'> In [63]: db.plot.scatter("Bond Order","pKa")
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Out[63]: <Axes: xlabel='Bond Order', ylabel='pKa'> Nearest Neighbor Let's restrict our consideration to acids with 0 < pKa < 7. The reason may be evident from the distribution in the histogram of pKa's above with two peaks, one around 4 and another at 8. The negative pKa's are outliers which also will be difficult to predict. For machine learning we will also drop the 2D molecular structures. In [64]: dblow = db[db["pKa"].values<7] dblow = dblow[dblow["pKa"].values>0] List attribute columns with index In [65]: for (i, item) in enumerate(list(dblow.columns)): print(i, item) 0 pKa 1 protonated molecule 2 deprotonated molecule 3 deprotonated microstate ID 4 protonated microstate ID 5 deprotonated microstate smiles 6 protonated microstate smiles 7 AM1BCC partial charge (prot. atom) 8 AM1BCC partial charge (deprot. atom) 9 AM1BCC partial charge (prot. atoms 1 bond away) 10 AM1BCC partial charge (deprot. atoms 1 bond away) 11 AM1BCC partial charge (prot. atoms 2 bond away) 12 AM1BCC partial charge (deprot. atoms 2 bond away)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
13 Gasteiger partial charge (prot. atom) 14 Gasteiger partial charge (deprot. atom) 15 Gasteiger partial charge (prot. atoms 1 bond away) 16 Gasteiger partial charge (deprot. atoms 1 bond away) 17 Gasteiger partial charge (prot. atoms 2 bond away) 18 Gasteiger partial charge (deprot. atoms 2 bond away) 19 Extented Hückel partial charge (prot. atom) 20 Extented Hückel partial charge (deprot. atom) 21 Extented Hückel partial charge (prot. atoms 1 bond away) 22 Extented Hückel partial charge (deprot. atoms 1 bond away) 23 Extented Hückel partial charge (prot. atoms 2 bond away) 24 Extented Hückel partial charge (deprot. atoms 2 bond away) 25 ∆G_solv (kJ/mol) (prot-deprot) 26 SASA (Shrake) 27 SASA (Lee) 28 Bond Order 29 Change in Enthalpy (kJ/mol) (prot-deprot) 30 href 31 Weight 32 num ionizable groups 33 pKa source Selection of attributes/features for training and prediction We need too select the features that we will use in the training. These will include the charges computed for key atoms adjacent to the acidic proton (H+) in columns (7-12) using the AM1BCC method, ∆G_solv (kJ/mol) (prot-deprot) in column 25,solvent accessible surface area (SASA) in column 26, bond order in column 28, Change in Enthalpy (kJ/mol) (prot-deprot) in column 29, and molecular weight in column 32. These are the 11 attributes features we will use. We also keep the labels and SMILES as well as the pKa we will train on. In [66]: molecular = Table().from_df(dblow) # Now back to Table molecular = molecular.select(0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25, 26, 28, 29, 32) molecular Out[66]: pKa deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) A p c (d 6.98 Fenpropimorph_micro 001 Fenpropimorph_micro 000 C[C@@H]1CN( C[C@@H] (O1)C)C[C@H] (C)Cc2ccc(cc2) C(C)(C)C C[C@@H]1C[N H+](C[C@@H] (O1)C)C[C@H] (C)Cc2ccc(cc2) C(C)(C)C -0.68511 -0 6.95 Imidazole_micro001 Imidazole_micro000 c1cnc[n-]1 c1cnc[nH]1 -0.32082 -0
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
pKa deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) A p c (d 6.95 3-methoxy-6- mercaptopyridazine_ micro001 3-methoxy-6- mercaptopyridazine_ micro000 COc1ccc(nn1) [S-] COc1ccc(nn1)S -0.31734 -0 6.95 2-(N-(2-cyanoethyl)- N- methyl)aminopropylb enzene_micro001 2-(N-(2-cyanoethyl)- N- methyl)aminopropylb enzene_micro000 C[N@@] (CCCc1ccccc1) CCC#N C[N@@H+] (CCCc1ccccc1) CCC#N -0.6868 -0 6.95 1- Methylimidazole_micr o001 1- Methylimidazole_micr o000 Cn1ccnc1 Cn1cc[nH+]c1 -0.13447 -0 6.95 Morpholine, N-(3- ethylcarbonyl-3,3- diphenyl)propyl- _micro001 Morpholine, N-(3- ethylcarbonyl-3,3- diphenyl)propyl- _micro000 CCC(=O)C(CCN 1CCOCC1) (c2ccccc2)c3cc ccc3 CCC(=O)C(CC[ NH+]1CCOCC1) (c2ccccc2)c3cc ccc3 -0.68628 -0 6.95 2-bis(2- chloroethyl)aminopro pane_micro001 2-bis(2- chloroethyl)aminopro pane_micro000 CC(C)N(CCCl)C CCl CC(C)[NH+] (CCCl)CCCl -0.70654 -0 6.94 4- mercaptopyrimidine_ micro001 4- mercaptopyrimidine_ micro000 c1cncnc1[S-] c1cncnc1S -0.31642 -0 6.94 Barbituric acid_micro001 Barbituric acid_micro000 CC[C@]1(C(=O) NC(=O) [N-]C1=O)c2cc c(cc2)[N+](=O) [O-] CCC1(C(=O)NC (=O)NC1=O)c2 ccc(cc2)[N+] (=O)[O-] -0.59474 -0 6.92 2,3- diaminobutane_micro 001 2,3- diaminobutane_micro 000 C[C@@H] ([C@H](C) [NH-])[NH3+] C[C@@H] ([C@H](C)N) [NH3+] -0.92502 -0 ... (2035 rows omitted) Train, test split Question 12 Split the molecular Table into train and test data using 80% for training and remembering that the split must be an integer using int() function. Again we will select
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
certain rowsas attributes. In [67]: train, test = molecular.split(int(molecular.num_rows*0.8)) print(train.num_rows, 'training and', test.num_rows, 'test instances.') train.show(3) 1636 training and 409 test instances. pKa deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) AM1BCC partial charge (deprot. atom) AM p c ( at a 4.37 4-Methyl-benzoic acid_micro001 4-Methyl-benzoic acid_micro000 Cc1ccc(cc1)C( =O)[O-] Cc1ccc(cc1) C(=O)O -0.6085 -0.83399 0.6 2.45 2-iodo-4- methylthioaniline_ micro001 2-iodo-4- methylthioaniline_ micro000 CSc1ccc(c(c1)I) [NH-] CSc1ccc(c(c 1)I)N -0.87739 -0.68227 0.2 3.2 2- Fluoroaniline_micr o001 2- Fluoroaniline_micr o000 c1ccc(c(c1) [NH-])F c1ccc(c(c1)N )F -0.81357 -0.72807 0.1 ... (1633 rows omitted) In [68]: check('tests/q12.py') Out[68]: All tests passed! Our k nearest neighbors code Remember our the k nearest neighbor code from above which wewill again use here. def row_distance(row1, row2): """The distance between two rows of a table.""" return distance(np.array(row1), np.array(row2)) def distances(training, example, output): """Compute the distance from example for each row in training.""" dists = [] attributes = training.drop(output) for row in attributes.rows: dists.append(row_distance(row, example)) return training.with_column('Distance', dists) def closest(training, example, k, output): """Return a table of the k closest neighbors to example.""" return distances(training, example, output).sort('Distance').take(np.arange(k)) Test algorithm Execute these cells to define the predict_nn function for pKa, pick an example row, predict and compare. In [69]: def predict_nn(example): """Return the majority class among the k nearest neighbors.""" k = 10
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
return np.average(closest(train.drop(1,2,3,4), example, k, 'pKa').column('pKa')) Examine 1 row in test set to try to predict In [70]: test.drop(1,2,3,4).take(100) Out[70]: pKa AM1BCC partial charge (prot. atom) AM1BCC partial charge (deprot. atom) AM1BCC partial charge (prot. atoms 1 bond away) AM1BCC partial charge (deprot. atoms 1 bond away) AM1BCC partial charge (prot. atoms 2 bond away) AM1BCC partial charge (deprot. atoms 2 bond away) ∆G_sol v (kJ/mol ) (prot- deprot) SASA (Shrake ) Bond Order 5.3 -0.8744 -0.86723 0.19836 0.314295 -0.3455 -0.423077 3.61438 8.62053 0.715061 In [71]: # Look at closest in training set to test row, need to drop pKa from test k = 10 closest(train.drop(1,2,3,4), test.drop(0,1,2,3,4).row(100), k, 'pKa').select('pKa','Distance') Out[71]: pKa Distance 4.26 1.73079 6.08 2.15534 0.98 2.17089 4.75 2.20239 1.33 2.27305 4.98 2.46338 4.66 2.70047 2.49 2.70827 5.16 2.72613 5.4 2.76962 If we use test data in both cases we get exact match (Distance = 0) and no training, not machine learning but matching! In [72]: closest(test.drop(1,2,3,4), test.drop(0,1,2,3,4).row(100), k, 'pKa').select('pKa','Distance')
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Out[72]: pKa Distance 5.3 0 0.33 1.07805 4.45 2.95984 5.41 2.99021 4.7 3.09566 2.72 3.09574 2.58 3.11376 0.2 3.1951 1.4 3.3731 5.95 3.38494 Histogram of experimental acidity to be predicted Question Make a histogram of acidity measured by pKa in the training data In [74]: train.hist('pKa')
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Question 13 Prediction time Now predict the pKa of the 10 molecule in the test_nn dataset using predict. We need to drop the experimental pKa and descriptors in the first 4 columns to create and example_nn_row with the attributes for the k nearest neighbor. Discuss the quality of the fit and the name of the name of the molecule from column 1. Repeat for two more rows and discuss the prediction quality. Keep in mind that the prediction of pKa is a very challenging task for machine learning. In [75]: example_nn_row = test.drop(0,1,2,3,4).row(9) example_nn_row Out[75]: Row(AM1BCC partial charge (prot. atom)=-0.61150002394887537, AM1BCC partial charge (deprot. atom)=-0.8293499943046343, AM1BCC partial charge (prot. atoms 1 bond away)=0.65188002671030432, AM1BCC partial charge (deprot. atoms 1 bond away)=0.90706998145296458, AM1BCC partial charge (prot. atoms 2 bond away)=- 0.32791000892492861, AM1BCC partial charge (deprot. atoms 2 bond away)=- 0.46774499827907201, ∆G_solv (kJ/mol) (prot-deprot)=285.65048792778322, SASA (Shrake)=16.009556162693585, Bond Order=0.58754159243616033, Change in Enthalpy (kJ/mol) (prot-deprot)=2.8655543841587572, num ionizable groups=1.0) In [77]: example_nn_row_table = test.drop(0,1,2,3,4).take(9) # For display and verification example_nn_row_table Out[77]: AM1BCC partial charge (prot. atom) AM1BCC partial charge (deprot. atom) AM1BCC partial charge (prot. atoms 1 bond away) AM1BCC partial charge (deprot. atoms 1 bond away) AM1BCC partial charge (prot. atoms 2 bond away) AM1BCC partial charge (deprot. atoms 2 bond away) ∆G_sol v (kJ/mo l) (prot- depro t) SASA (Shrake ) Bond Order Chan in Entha (kJ/m (pro depr -0.6115 -0.82935 0.65188 0.90707 -0.32791 -0.467745 285.65 16.0096 0.587542 2.865 In [78]: predict_nn(example_nn_row) Out[78]: 3.5556000000000005 In [79]: print('Experimental pKa:', test.column('pKa').item(9)) print('Predicted pKa using nearest neighbors:', round(predict_nn(example_nn_row),2)) Experimental pKa: 4.17 Predicted pKa using nearest neighbors: 3.56 In [80]: check('tests/q13.py') Out[80]: All tests passed! Now let's plot knn prediction success Execute the next three cells In [81]: exp_pKa = make_array()
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
predict_pKA = make_array() In [82]: # This takes a while! for i in np.arange(test.num_rows): exp_pKa = np.append(exp_pKa,test.column('pKa').item(i)) example_nn_row = test.drop(0,1,2,3,4).row(i) predict_pKA = np.append(predict_pKA,predict_nn(example_nn_row) ) In [83]: plt.scatter(exp_pKa,predict_pKA) #calculate equation for regression line z = np.polyfit(exp_pKa,predict_pKA, 1) p = np.poly1d(z) #add trendline to plot plt.plot(exp_pKa, p(exp_pKa),'blue',label="{}".format(p)) # Equation of line placed in legend from label plt.xlabel("Experimental pKa") plt.ylabel("Predicted pKa") plt.legend(fontsize="small") plt.show() Conclusions on our k nearest neighbor model Question 14 Evaluate the overall quality of our machine learning prediction based on the above plot and your 3 predictions above. I think our machine learning prediction overall is good quality since there are strong correlations between the predictions on the above plot. Now we will try a few values for k to try to optimize this hyperparameter. We need a new
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
version of predict_nn that also has an argument of k. In [84]: def predict_knn(example,k): """Return the majority class among the k nearest neighbors.""" return np.average(closest(train.drop(1,2,3,4), example, k, 'pKa').column('pKa')) In [88]: for k in [5,7,10,15,20]: exp_pKa = make_array() predict_pKA = make_array() for i in np.arange(test.num_rows): exp_pKa = np.append(exp_pKa,test.column('pKa').item(i)) example_nn_row = test.drop(0,1,2,3,4).row(i) predict_pKA = np.append(predict_pKA,predict_knn(example_nn_row,k) ) plt.scatter(exp_pKa,predict_pKA) z = np.polyfit(exp_pKa,predict_pKA, 1) p = np.poly1d(z) plt.plot(exp_pKa, p(exp_pKa),'blue',label="{}".format(p)) # Equation of line placed in legend from label plt.xlabel("Experimental pKa") plt.ylabel("Predicted pKa") plt.title("k = "+str(k)) plt.legend(fontsize="small") plt.show()
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Question: Which value of k makes the best estimation? In [89]: k = 10
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
In [90]: check('tests/q14.py') Out[90]: All tests passed! Extra (advanced): Try to vary the set of parameters/attributes by using fewer attributes or where choices exist such as using Gasteiger partial charges instead of AM1BCC or removing moolecular weight or other attributes. In [91]: for (i, item) in enumerate(list(dblow.columns)): # List of attributes print(i, item) 0 pKa 1 protonated molecule 2 deprotonated molecule 3 deprotonated microstate ID 4 protonated microstate ID 5 deprotonated microstate smiles 6 protonated microstate smiles 7 AM1BCC partial charge (prot. atom) 8 AM1BCC partial charge (deprot. atom) 9 AM1BCC partial charge (prot. atoms 1 bond away) 10 AM1BCC partial charge (deprot. atoms 1 bond away) 11 AM1BCC partial charge (prot. atoms 2 bond away) 12 AM1BCC partial charge (deprot. atoms 2 bond away) 13 Gasteiger partial charge (prot. atom) 14 Gasteiger partial charge (deprot. atom) 15 Gasteiger partial charge (prot. atoms 1 bond away) 16 Gasteiger partial charge (deprot. atoms 1 bond away) 17 Gasteiger partial charge (prot. atoms 2 bond away) 18 Gasteiger partial charge (deprot. atoms 2 bond away) 19 Extented Hückel partial charge (prot. atom) 20 Extented Hückel partial charge (deprot. atom) 21 Extented Hückel partial charge (prot. atoms 1 bond away) 22 Extented Hückel partial charge (deprot. atoms 1 bond away) 23 Extented Hückel partial charge (prot. atoms 2 bond away) 24 Extented Hückel partial charge (deprot. atoms 2 bond away) 25 ∆G_solv (kJ/mol) (prot-deprot) 26 SASA (Shrake) 27 SASA (Lee) 28 Bond Order 29 Change in Enthalpy (kJ/mol) (prot-deprot) 30 href 31 Weight 32 num ionizable groups 33 pKa source Selection of attributes In [92]: molecular = Table().from_df(dblow) # Now back to Table molecular=molecular.select(0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25, 26, 28, 29, 32) # Change these molecular Out[92]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
pKa deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) A p c (d 6.98 Fenpropimorph_micro 001 Fenpropimorph_micro 000 C[C@@H]1CN( C[C@@H] (O1)C)C[C@H] (C)Cc2ccc(cc2) C(C)(C)C C[C@@H]1C[N H+](C[C@@H] (O1)C)C[C@H] (C)Cc2ccc(cc2) C(C)(C)C -0.68511 -0 6.95 Imidazole_micro001 Imidazole_micro000 c1cnc[n-]1 c1cnc[nH]1 -0.32082 -0 6.95 3-methoxy-6- mercaptopyridazine_ micro001 3-methoxy-6- mercaptopyridazine_ micro000 COc1ccc(nn1) [S-] COc1ccc(nn1)S -0.31734 -0 6.95 2-(N-(2-cyanoethyl)- N- methyl)aminopropylb enzene_micro001 2-(N-(2-cyanoethyl)- N- methyl)aminopropylb enzene_micro000 C[N@@] (CCCc1ccccc1) CCC#N C[N@@H+] (CCCc1ccccc1) CCC#N -0.6868 -0 6.95 1- Methylimidazole_micr o001 1- Methylimidazole_micr o000 Cn1ccnc1 Cn1cc[nH+]c1 -0.13447 -0 6.95 Morpholine, N-(3- ethylcarbonyl-3,3- diphenyl)propyl- _micro001 Morpholine, N-(3- ethylcarbonyl-3,3- diphenyl)propyl- _micro000 CCC(=O)C(CCN 1CCOCC1) (c2ccccc2)c3cc ccc3 CCC(=O)C(CC[ NH+]1CCOCC1) (c2ccccc2)c3cc ccc3 -0.68628 -0 6.95 2-bis(2- chloroethyl)aminopro pane_micro001 2-bis(2- chloroethyl)aminopro pane_micro000 CC(C)N(CCCl)C CCl CC(C)[NH+] (CCCl)CCCl -0.70654 -0 6.94 4- mercaptopyrimidine_ micro001 4- mercaptopyrimidine_ micro000 c1cncnc1[S-] c1cncnc1S -0.31642 -0 6.94 Barbituric acid_micro001 Barbituric acid_micro000 CC[C@]1(C(=O) NC(=O) [N-]C1=O)c2cc c(cc2)[N+](=O) [O-] CCC1(C(=O)NC (=O)NC1=O)c2 ccc(cc2)[N+] (=O)[O-] -0.59474 -0
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
pKa deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonated microstate smiles AM1BCC partial charge (prot. atom) A p c (d 6.92 2,3- diaminobutane_micro 001 2,3- diaminobutane_micro 000 C[C@@H] ([C@H](C) [NH-])[NH3+] C[C@@H] ([C@H](C)N) [NH3+] -0.92502 -0 ... (2035 rows omitted) Now test by copying appropriate code from above In [93]: exp_pKa = make_array() predict_pKA = make_array() In [94]: # This takes a while! for i in np.arange(test.num_rows): exp_pKa = np.append(exp_pKa,test.column('pKa').item(i)) example_nn_row = test.drop(0,1,2,3,4).row(i) predict_pKA = np.append(predict_pKA,predict_nn(example_nn_row) ) In [95]: # Plot sns.regplot(x=predict_pKA,y=exp_pKa) Out[95]: <Axes: > In [96]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
# Comments: Now demonstrate knn from scikit learn scikit learn is a standard and powerful machine learning library. Below is a demonstration for your information of the same machine learning task using scikit learn. Execute the below cells. In [97]: from sklearn.neighbors import KNeighborsRegressor from sklearn.metrics import mean_squared_error from sklearn.metrics import r2_score from sklearn.model_selection import train_test_split In [98]: knn = KNeighborsRegressor(n_neighbors=15, weights='distance',p=1) In [99]: X = make_array() attributes = train.drop('pKa',1,2,3,4) for i in np.arange(attributes.num_rows): X=np.append(X,np.array(attributes.row(i))) X=X.reshape(attributes.num_rows,len(attributes)) In [100]: y=train.column('pKa') y Out[100]: array([ 4.37, 2.45, 3.2 , ..., 3.14, 2.85, 3.54]) In [101]: knn.fit(X,y) Out[101]: KNeighborsRegressor(n_neighbors=15, p=1, weights='distance') In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. KNeighborsRegressor KNeighborsRegressor(n_neighbors=15, p=1, weights='distance') Now test attributes In [102]: attributes = test.drop('pKa',1,2,3,4) Xtest = make_array() for i in np.arange(attributes.num_rows): Xtest=np.append(Xtest,np.array(attributes.row(i))) Xtest=Xtest.reshape(attributes.num_rows,len(attributes)) In [103]: ytest=test.column('pKa') In [104]: y_predicted = knn.predict(Xtest) predict_nn = test.with_columns("pKa predict",y_predicted) In [105]: plt.scatter(ytest,y_predicted) #calculate equation for regression line z = np.polyfit(ytest,y_predicted, 1) p = np.poly1d(z) # Label with equation
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
print(p) #add trendline to plot plt.plot(ytest, p(ytest),'red',label="{}".format(p)) plt.legend(fontsize="small") plt.show() 0.2374 x + 2.937 Return the coefficient of determination of the prediction. The coefficient of determination $R^2$ is defined as $$ (1-\frac{u}{v}) $$ u is the residual sum of squares ((y_true - y_pred)** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2 = 0.0 $ In [106]: knn.score(Xtest, ytest) # knn score 0-1 Out[106]: 0.22954352412772705 Final fancy plotting of select molecules Question 15 Use a part of a molecular name to see if it is included in the test data and then execute the code to examine structures. Structures will be default structures if rdkit is not available. In [107]: molecular_name = 'methyl' predict_nn.where("protonated microstate ID",are.containing(molecular_name)) Out[107]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
pKa deprotonated microstate ID protonated microstate ID deprotonated microstate smiles protonate microstat smiles 3.56 b'(2-butanoyloxy-4-hydroxy- 4-oxobutyl)- trimethylazanium' ... b'(2-butanoyloxy-4-hydroxy- 4-oxobutyl)- trimethylazanium' ... CCCC(=O)O[C @H](CC(=O) [O-])C[N+](C) (C)C CCCC(=O)O @H] (CC(=O)O)C +](C)(C)C 4.61 2-amino-4- methylaniline_micro001 2-amino-4- methylaniline_micro000 Cc1ccc(c(c1)N) [NH-] Cc1ccc(c(c1 )N 3.16 3-cyano-2-methyl-benzoic acid_micro001 3-cyano-2-methyl-benzoic acid_micro000 Cc1c(cccc1C(= O)[O-])C#N Cc1c(cccc1C =O)O)C#N 3.45 2-methylthioaniline_micro001 2-methylthioaniline_micro000 CSc1ccccc1[NH -] CSc1ccccc1 3.305 methylene-bis-thioacetic acid_micro001 methylene-bis-thioacetic acid_micro000 C(C(=O)O)SCS CC(=O)[O-] C(C(=O)O)S CC(=O)O 4.84 alpha,alpha-diphenyl-adipic acid-alpha-methylhalfester_m ... alpha,alpha-diphenyl-adipic acid-alpha-methylhalfester_m ... COC(=O)C(CCC C(=O)[O-]) (c1ccccc1)c2cc ccc2 COC(=O)C(C CC(=O)O) (c1ccccc1)c cccc2 4.88 2,4-dimethyl-8- methylthioquinoline_micro00 1 2,4-dimethyl-8- methylthioquinoline_micro00 0 Cc1cc(nc2c1cc cc2SC)C Cc1cc([nH+ 2c1cccc2SC 5.36 trans-hexahydro-3,6- endomethylene-o-phthalic acid_micro001 trans-hexahydro-3,6- endomethylene-o-phthalic acid_micro000 C1C[C@H]2C[C @@H]1[C@H] ([C@H]2C(=O) [O-])C(=O)[O-] C1C[C@H]2 C@@H]1[C@ ] ([C@H]2C(= [O-])C(=O)O 3.12 2-amino-7- methylsulphonylnaphthalene _micro001 2-amino-7- methylsulphonylnaphthalene _micro000 CS(=O) (=O)c1ccc2ccc( cc2c1)[NH-] CS(=O) (=O)c1ccc2 c(cc2c1)N 2 Morpholine, N-cyanomethyl- _micro001 Morpholine, N-cyanomethyl- _micro000 C1COCCN1CC# N C1COCC[NH 1CC#N ... (78 rows omitted)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
In [108]: check('tests/q15.py') Out[108]: All tests passed! In [109]: glycine=predict_nn.where("protonated microstate ID",are.containing(molecular_name)) mols = [Chem.MolFromSmiles(x) for x in glycine.column("protonated microstate smiles") if x is not None] molde = [Chem.MolFromSmiles(x) for x in glycine.column("deprotonated microstate smiles") if x is not None] mol = [None] * 2 * glycine.num_rows mol[0::2]=mols mol[1::2]=molde label = [None] * 2 * glycine.num_rows lpred = [None] * 2 * glycine.num_rows exp = glycine.column("pKa") pred = glycine.column("pKa predict") label[0::2] = exp label[1::2] = exp lpred[0::2] = pred lpred[1::2] = pred for i,m in enumerate(mol): m.SetProp("Name","pKa: "+str(np.round(label[i],2))+" knn: "+str(np.round(lpred[i],2))) p = Draw.MolsToGridImage( [mol[x] for x in range(0,(2 * glycine.num_rows))] , legends=[x.GetProp("Name") for x in mol],molsPerRow=2,subImgSize=(300,250)) print("\t\tProtonated","\t\t\tDeprotonated") p Protonated Deprotonated Out[109]:
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
All finished... Run checks and submit .html and .ipynb files after downloading. In [110]: # For your convenience, you can run this cell to run all the tests at once! import glob from gofer.ok import check correct = 0 checks = [1,2,3,4,6,7,8,11,12,13,14,15] total = len(checks) for x in checks: print('Testing question {}: '.format(str(x))) g = check('tests/q{}.py'.format(str(x))) if g.grade == 1.0: print("Passed") correct += 1 else: print('Failed') display(g) print('Grade: {}'.format(str(correct/total))) print("Nice work ",Your_name) import time; localtime = time.asctime( time.localtime(time.time()) ) print("Submitted @ ", localtime) Testing question 1: Passed Testing question 2: Passed Testing question 3: Passed Testing question 4: Passed Testing question 6: Passed Testing question 7: Passed Testing question 8: Passed Testing question 11: Passed Testing question 12: Passed Testing question 13: Passed Testing question 14: Passed Testing question 15: Passed Grade: 1.0 Nice work Madlyn Anglin
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Submitted @ Fri Dec 1 02:16:01 2023
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help