Machine Learning and Chemistry

html

School

Temple University *

*We aren’t endorsed by this school

Course

853

Subject

Computer Science

Date

Jan 9, 2024

Type

html

Pages

Uploaded by ProfFreedom5002

Machine Learning and prediction ¶ Elements of Data Science In this laboratory we will use training data to predict outcomes. We will first test these ideas using our Old Faithful data again. Next we will look at data on the iris flower to classify iris' based on sepal width and length. In our culminating activity we will predict molecular acidity using data computed by Prof. Vince Voelz in the Temple Chemistry department and a graduate student, Robert Raddi. See their paper: Stacking Gaussian processes to improve pKa predictions in the SAMPL7 challenge . In [1]: Your_name = "Madlyn Anglin" Learning from training data ¶ A key concept in machine learning is using a subset of a dataset to train an algorithm to make estimates on a separate set of test data. The quality of the machine learning and algorithm can be assesed based on the accuracy of the predictions made on test data. Many times there are also parameters sometimes termed hyper-parameters which can be optimized through an iterative approach on test or validation data. In practice a dataset is randomly split into training and test sets using sampling. k nearest neighbor ¶ We will examine one machine learning algorithm in the laboratory, k nearest neighbor. Many of the concepts are applicable to the broad range of machine learning algorithms available. In [2]: ## import statements # These lines load the tests. from gofer.ok import check import numpy as np from datascience import * import pandas as pd import matplotlib %matplotlib inline import matplotlib.pyplot as plt plt.style.use('ggplot') import warnings warnings.simplefilter('ignore', UserWarning) #from IPython.display import Image from matplotlib.colors import ListedColormap from sklearn import neighbors, datasets # Fix for datascience collections Iterable import collections as collections import collections.abc as abc collections.Iterable = abc.Iterable !pip install jupyterquiz from jupyterquiz import display_quiz import json from IPython.core.display import HTML Requirement already satisfied: jupyterquiz in /opt/conda/lib/python3.10/site-packages (2.6.1) k nearest neighbor regression ¶ We will use the k nearest neighbor algorithm to make predictions of wait time in minutes

following an eruption duration ofa given number of minutes (independent variable). In [3]: faithful = Table.read_table("data/faithful.csv") faithful.scatter(0, 1, fit_line=True) Question 1 ¶ Use the datascience .split(n) Table method to split the dataset into 80% training and 20% test. The argument for .split(n) method,n, needs to be an integer. See datascience documentation In [4]: trainf, testf = faithful.split(int(faithful.num_rows * 0.8)) print(trainf.num_rows, 'training and', testf.num_rows, 'test instances.') 217 training and 55 test instances. In [5]: check('tests/q1.py') Out[5]: All tests passed! Nearest neighbor concept ¶ The training examines the chreacteristics of k nearest neighbors to the data point for which a prediction will be made. Nearness is measured using several different metrics with Euclidean distance being a common one for numerical attributes. Euclidean distance: 1-D $$ d(p,q) = \sqrt{(p-q)^{2}} $$ 2-D $$ d(p,q) = \sqrt{(p_1-q_1)^{2}+(p_2-q_2)^{2}} $$ For multiple points (rows): 2-D $$ d(p,q) = \sum{{\sqrt{((p_1-q_1)^{2}+(p_2- q_2)^{2}}}} $$

Try different attribute values in the following 2D Euclidean distance example code below to get a feel for the computation ¶ In [6]: # Example code to compute an Euclidean distance between two 2-D points d_p_q = np.sqrt(sum((make_array(2,3)-make_array(4,3))**2)) d_p_q Out[6]: 2.0 To get values from Table row as an array as is done in row_distance. Note in the faithful data case we will only consider the duration column in nearest neighbor computation but in examples below we will use a 2-D array of attributes with the iris data and a 10-D array in the chemistry and molecular acidity case. In [7]: f_array = np.array(faithful.row(0)) f_array Out[7]: array([ 3.6, 79. ]) A couple quick review questions about nearest neighbor below, select the best answer (multiple tries ok). Execute the below cell to reveal the self-check quiz. ¶ In [8]: with open("questions.json", "r") as file: questions=json.load(file) display_quiz(questions) Question 2 ¶ Define a function which is the Euclidean distance between two values. Use the last two example code cells above as inspiration. This is where we will compute the distance between two duration values. In [11]: def distance(pt1, pt2): """The distance between two points, represented as arrays.""" return np.sqrt(sum((pt1 - pt2)**2)) In [12]: check('tests/q2.py') Out[12]: All tests passed! Rest of the nearest neighbor algorithm ¶ Execute these cells to create the complete algorithm In [13]: def row_distance(row1, row2): """The distance between two rows of a table.""" return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays def distances(training, example, output): """Compute the distance from example for each row in training.""" dists = [] attributes = training.drop(output) for row in attributes.rows: dists.append(row_distance(row, example))

Your preview ends here