01 PYTHON notebook -- Sarah Gets a Diamond

Linear Regression in Python, "Sarah Gets a Diamond" case

This notebook provides the code for building several linear models for the "Sarah Gets a Diamond" case. It contains the following steps:
1. Load (and install, if needed) the required packages and libraries.
2. Data engineering: load the data, clean it, transform as needed, visualize it, and select evaluation metrics.
3. Build linear models: OLS, Ridge regression, Lasso regression.
4. Apply a log-transformation, which improves quality.
5. Fine-tune using visualization and feature engineering.
6. Hurray -- we are done!

0. Basic arrangements

0.1. A very brief introduction to Python

The code in this notebook is written in Python 3. We highly recommend installing the newest stable version available; at the time of writing, that is 3.8. If you are already familiar with another programming language, this overview may speed up your start: https://learnxinyminutes.com/docs/python/. The official Python tutorial is available at https://docs.python.org/3/tutorial/ and is also a good place to start if you are new to programming. The complete documentation is at https://docs.python.org/3/.

Practical programming largely comes down to understanding the basic language constructs and using your favorite search engine, of which Google (www.google.com) definitely works. Good questions and answers can usually be found on StackOverflow (www.stackoverflow.com) and StackExchange (www.stackexchange.com), as well as on Quora (www.quora.com). Relevant blogs appear on Medium (www.medium.com) and Habrahabr (www.habr.com), and useful discussions sometimes appear on Reddit (www.reddit.com). Finally, numerous online courses are available on Udemy (www.udemy.com), Coursera (www.coursera.org), and DataCamp (https://www.datacamp.com/), among others.

0.2. Getting the required libraries

In [1]:
# Importing some standard Python libraries:
import copy
import math
This notebook uses several third-party libraries. They are not installed with Python by default; however, if you install the Anaconda distribution (https://www.anaconda.com/), they come bundled with Python.
1. NumPy: the de-facto standard library for linear algebra in Python; documentation at https://numpy.org/doc/
2. Pandas: the most commonly used library for data engineering; documentation at https://pandas.pydata.org
3. Scikit-Learn: contains the majority of standard machine-learning algorithms, ready to be applied out of the box; https://scikit-learn.org/stable/
4. Statsmodels: documentation at https://www.statsmodels.org/stable/index.html
5. Matplotlib: provides basic plotting functionality in Python; https://matplotlib.org

In [2]:
# Anaconda libraries installation:
# 1. Check the conda environment and the installed packages and libraries
# import sys
# !conda env list
# !conda list
# 2. Use these if the required libraries/packages (pandas, numpy, scikit-learn, statsmodels) are not installed
# !conda install pandas
# !conda install scikit-learn
# !conda install statsmodels
# !conda install matplotlib

In [3]:
# Installation with the pip installer:
# !pip3 install numpy pandas scikit-learn statsmodels matplotlib

In [4]:
# Load pandas under the short name pd. This alias is an industry standard.
# Same for numpy, scikit-learn and statsmodels.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import sklearn as sk
import matplotlib as mp

# This allows Jupyter-inlined plots.
import matplotlib.pyplot as plt
%matplotlib inline
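A quick way to confirm that the imports above resolved to working installations is to print the library versions. This is a minimal sketch, not part of the original case code; the exact version numbers will differ from machine to machine.

import statsmodels
print("numpy       ", np.__version__)
print("pandas      ", pd.__version__)
print("statsmodels ", statsmodels.__version__)
print("scikit-learn", sk.__version__)
print("matplotlib  ", mp.__version__)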
1. Data Engineering

In [5]:
# Reading the data from the local machine.
sarah_raw_data = pd.read_csv("01 CSV data -- Sarah Gets a Diamond.csv", sep=',')
# Update the path to point to where your data is, e.g.,
# sarah_raw_data = pd.read_csv("C:\\Users\\A.OVCHINNIKOV\\01 CSV data -- Sarah Gets a Diamond.csv", sep=',')
print("Shape of Data: ", sarah_raw_data.shape)
# Display the first few rows of data.
sarah_raw_data.head()

Shape of Data:  (9142, 9)

Out[5]:
   ID  Carat Weight    Cut  Color Clarity Polish Symmetry Report   Price
0   1          1.10  Ideal      H     SI1     VG       EX    GIA  5169.0
1   2          0.83  Ideal      H     VS1     ID       ID   AGSL  3470.0
2   3          0.85  Ideal      H     SI1     EX       EX    GIA  3183.0
3   4          0.91  Ideal      E     SI1     VG       VG    GIA  4370.0
4   5          0.83  Ideal      G     SI1     EX       EX    GIA  3171.0

In [6]:
# Display the last few rows of data.
sarah_raw_data.tail()

Out[6]:
        ID  Carat Weight        Cut  Color Clarity Polish Symmetry Report  Price
9137  9138          0.96      Ideal      F     SI1     EX       EX    GIA    NaN
9138  9139          1.02  Very Good      E    VVS1     EX        G    GIA    NaN
9139  9140          1.51       Good      I     VS1      G        G    GIA    NaN
9140  9141          1.24      Ideal      H     VS2     VG       VG    GIA    NaN
9141  9142          0.79      Ideal      I     VS1     EX       EX    GIA    NaN
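Beyond head() and tail(), it can help to check the column types and basic summary statistics before any modelling. A minimal sketch using standard pandas methods (not part of the original case code):

sarah_raw_data.info()        # column dtypes and non-null counts (Price is missing in the rows to be predicted)
sarah_raw_data.describe()    # summary statistics for the numeric columns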
Below is a workaround for Google Colab. If you are running Jupyter on a local machine, this step can be skipped. The data first needs to be uploaded to the cloud and then read into a dataframe. More information can be found at https://colab.research.google.com. Manuals: https://colab.research.google.com/notebooks/intro.ipynb and https://colab.research.google.com/notebooks/io.ipynb.

In [7]:
# from google.colab import files
# uploaded = files.upload()
# import io
# sarah_raw_data = pd.read_csv(io.BytesIO(uploaded['01 CSV data -- Sarah Gets a Diamond.csv']), sep=',')

Since some of the data are categorical, a natural initial step is to convert the categorical features with one-hot encoding.

In [8]:
# Introduce dummy (binary) variables -- convert the non-numerical columns into dummies.
sarah_data_full = pd.get_dummies(data=sarah_raw_data,
                                 columns=['Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report'],
                                 drop_first=True)
sarah_data_full.head()

Out[8]:
   ID  Carat Weight   Price  Cut_Good  Cut_Ideal  Cut_Signature-Ideal  Cut_Very Good  Color_E  Color_F  ...
0   1          1.10  5169.0         0          1                    0              0        0        0
1   2          0.83  3470.0         0          1                    0              0        0        0
2   3          0.85  3183.0         0          1                    0              0        0        0
3   4          0.91  4370.0         0          1                    0              0        1        0
4   5          0.83  3171.0         0          1                    0              0        0        0

5 rows × 25 columns

In [9]:
# From the dataset description, the first 6000 rows contain prices, and the rest do not.
N_train = 6000
sarah_data = sarah_data_full[:N_train]
sarah_predict = sarah_data_full[N_train:]
print("Missing values for sarah_data:", np.any(sarah_data.isnull().values))
print("Missing values for the test:", np.any(sarah_predict.isnull().values))

Missing values for sarah_data: False
Missing values for the test: True
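The hard-coded N_train = 6000 relies on the rows being ordered as described. A more defensive variant, sketched below, derives the same split from the missingness of Price itself; the names sarah_data_alt and sarah_predict_alt are illustrative and not part of the original notebook.

priced_mask = sarah_data_full['Price'].notnull()
sarah_data_alt = sarah_data_full[priced_mask]        # rows with a known price (the modelling data)
sarah_predict_alt = sarah_data_full[~priced_mask]    # rows whose price is to be predicted
print(sarah_data_alt.shape, sarah_predict_alt.shape)  # expected: (6000, 25) and (3142, 25)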
In [10]:
# Define the target (y_sarah) to be predicted and the features (X_sarah) to be used for prediction.
# By the way, do not use this naming in industry code -- it is not very readable. We proceed this way
# for simplicity; in practice the PEP 8 guidelines should be followed to keep code readable
# (https://www.python.org/dev/peps/pep-0008/).
y_sarah = sarah_data[['Price']]
X_sarah = copy.deepcopy(sarah_data).drop(['ID', 'Price'], axis=1)
X_predict_sarah = copy.deepcopy(sarah_predict).drop(['ID', 'Price'], axis=1)
X_sarah.head()

Out[10]:
   Carat Weight  Cut_Good  Cut_Ideal  Cut_Signature-Ideal  Cut_Very Good  Color_E  Color_F  Color_G  Color_H  ...
0          1.10         0          1                    0              0        0        0        0        1
1          0.83         0          1                    0              0        0        0        0        1
2          0.85         0          1                    0              0        0        0        0        1
3          0.91         0          1                    0              0        1        0        0        0
4          0.83         0          1                    0              0        0        0        1        0

5 rows × 23 columns

1.1. Data Visualization

Let us plot carat weight versus price. A reasonable option is to use pandas' built-in plotting functionality. More information is available at https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
In [11]:
sarah_data.plot.scatter('Carat Weight', 'Price')

Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x1df9b3d41d0>

A histogram of the carat weights is presented below.

In [12]:
sarah_data['Carat Weight'].plot.hist()

Out[12]: <matplotlib.axes._subplots.AxesSubplot at 0x1df9b739cf8>

1.2. Train and Test

To check the quality of our models and perform model selection, let us prepare a split of the data into train and test sets. This can be done in two ways:
a. Take the first 5000 rows for training and the remaining 1000 for testing. This approach works when the data are uniform.
b. Take a random 5000 rows for training and the remaining 1000 for testing. This approach works in general, and we will discuss it later in the course.
In [13]:
# Take the first 5000 of the 6000 rows for training and the remaining 1000 for testing.
train_size = 5000
y_train_sarah, y_test_sarah = y_sarah[:train_size], y_sarah[train_size:]
X_train_sarah, X_test_sarah = X_sarah[:train_size], X_sarah[train_size:]

In [14]:
# The random approach uses the train_test_split function.
from sklearn.model_selection import train_test_split
X_train_sarah_rnd, X_test_sarah_rnd, y_train_sarah_rnd, y_test_sarah_rnd = train_test_split(X_sarah, y_sarah, test_size=1000, random_state=42)

Documentation for train_test_split is available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

1.3. Score function

Having several models, we need a metric to select the best one. For this case, we will use the mean absolute percentage error (MAPE), computed as below.

In [15]:
def compute_mape_score(y_test_input, y_pred_input):
    y_test_input = np.array(y_test_input).reshape(-1,)
    y_pred_input = np.array(y_pred_input).reshape(-1,)
    percent_errors = np.abs((y_test_input - y_pred_input) / y_test_input) * 100
    return np.mean(np.array(percent_errors))

SCORING_CONFIG = {'mape': compute_mape_score}
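In formula form, MAPE = (100/n) · Σ |y_i − ŷ_i| / |y_i|, averaged over the n test observations. A tiny sanity check of the function above, with values chosen purely for illustration:

compute_mape_score([100, 200], [110, 190])   # (10/100 + 10/200) / 2 * 100 = 7.5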
2. Linear Models

2.1. Ordinary Least Squares (OLS)

In [16]:
# Linear regression via the statsmodels package.
X_sarah_sm = sm.add_constant(X_sarah)  # statsmodels does not add an intercept by default, so we add it explicitly.
ols_sm = sm.OLS(y_sarah, X_sarah_sm).fit()  # Fit the linear regression (ordinary least squares).
ols_sm.summary()  # Summary of the model results.
C:\Users\ao37\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
Out[16]:
OLS Regression Results
Dep. Variable:      Price              R-squared:           0.864
Model:              OLS                Adj. R-squared:      0.863
Method:             Least Squares      F-statistic:         1645.
Date:               Thu, 16 Apr 2020   Prob (F-statistic):  0.00
Time:               14:33:57           Log-Likelihood:      -57908.
No. Observations:   6000               AIC:                 1.159e+05
Df Residuals:       5976               BIC:                 1.160e+05
Df Model:           23
Covariance Type:    nonrobust

                          coef     std err         t     P>|t|      [0.025      0.975]
const                2.642e+04    1954.801    13.517     0.000    2.26e+04    3.03e+04
Carat Weight         1.839e+04     105.110   175.005     0.000    1.82e+04    1.86e+04
Cut_Good             -322.5957     362.520    -0.890     0.374   -1033.265     388.074
Cut_Ideal             274.6927     355.384     0.773     0.440    -421.988     971.373
Cut_Signature-Ideal  1677.1908     439.716     3.814     0.000     815.189    2539.193
Cut_Very Good         -35.6357     345.517    -0.103     0.918    -712.973     641.702
Color_E             -2327.3317     200.473   -11.609     0.000   -2720.331   -1934.332
Color_F             -3078.2407     189.816   -16.217     0.000   -3450.349   -2706.132
Color_G             -4799.5517     178.263   -26.924     0.000   -5149.012   -4450.092
Color_H             -6361.2739     188.072   -33.824     0.000   -6729.963   -5992.584
Color_I             -8039.9918     192.814   -41.698     0.000   -8417.977   -7662.006
Clarity_IF          -2.709e+04    1912.086   -14.170     0.000   -3.08e+04   -2.33e+04
Clarity_SI1          -3.69e+04    1898.452   -19.436     0.000   -4.06e+04   -3.32e+04
Clarity_VS1         -3.397e+04    1899.196   -17.887     0.000   -3.77e+04   -3.02e+04
Clarity_VS2         -3.531e+04    1898.911   -18.596     0.000    -3.9e+04   -3.16e+04
Clarity_VVS1        -3.058e+04    1909.252   -16.014     0.000   -3.43e+04   -2.68e+04
Clarity_VVS2        -3.243e+04    1901.916   -17.053     0.000   -3.62e+04   -2.87e+04
Polish_G                1.4906     201.753     0.007     0.994    -394.018     396.999
Polish_ID            -584.7362     744.971    -0.785     0.433   -2045.148     875.676
Polish_VG            -156.7413     126.621    -1.238     0.216    -404.964      91.482
Symmetry_G           -430.4773     188.917    -2.279     0.023    -800.822     -60.132
Symmetry_ID           148.1049     772.420     0.192     0.848   -1366.118    1662.328
Symmetry_VG          -291.3244     133.555    -2.181     0.029    -553.141     -29.508
Report_GIA            184.3296     347.567     0.530     0.596    -497.027     865.687

Omnibus:        4389.321   Durbin-Watson:      2.040
Prob(Omnibus):  0.000      Jarque-Bera (JB):   149885.754
Skew:           3.118      Prob(JB):           0.00
Kurtosis:       26.678     Cond. No.           227.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
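Individual numbers from this summary can also be pulled out programmatically from the fitted statsmodels results object rather than read off the table. A small sketch, not in the original notebook, using ols_sm from the cell above:

# The Carat Weight coefficient (~1.84e4): each extra carat adds roughly $18,400
# to the predicted price, holding the other features fixed.
print(ols_sm.params['Carat Weight'])
print(ols_sm.conf_int().loc['Carat Weight'])   # its 95% confidence interval
print(ols_sm.pvalues['Carat Weight'])          # and its p-value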
In [17]:
# Linear regression with scikit-learn.
from sklearn.linear_model import LinearRegression

# Fit a linear regression with vector y as the dependent variable and matrix X as the independent variables.
ols_sk = LinearRegression().fit(X_sarah, y_sarah)

# Print the estimated intercept.
print("Intercept = ", ols_sk.intercept_)
# Print the model coefficients (in the order of the variables in X).
print("Model coefficients = ", ols_sk.coef_)
# Print the R-squared score.
print("R^2 =", ols_sk.score(X_sarah, y_sarah))

# Note: there is no easy way to obtain other model outputs (p-values, etc.) from sklearn,
# as these outputs are not present in other, non-regression, machine-learning models.
# From a machine-learning point of view this rarely matters: what counts is whether the model works,
# which is usually judged from cross-validation and/or the score on a test set.

Intercept =  [26423.89481315]
Model coefficients =  [[ 1.83948330e+04 -3.22595688e+02  2.74692681e+02  1.67719076e+03
  -3.56356533e+01 -2.32733175e+03 -3.07824070e+03 -4.79955175e+03
  -6.36127388e+03 -8.03999181e+03 -2.70937885e+04 -3.68981919e+04
  -3.39709797e+04 -3.53125448e+04 -3.05753935e+04 -3.24342346e+04
   1.49058530e+00 -5.84736225e+02 -1.56741252e+02 -4.30477274e+02
   1.48104862e+02 -2.91324371e+02  1.84329597e+02]]
R^2 = 0.8635908847404312
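The raw coefficient array is hard to read on its own; pairing it with the feature names makes it directly comparable to the statsmodels table. A minimal sketch (not in the original notebook):

coef_by_feature = pd.Series(ols_sk.coef_.ravel(), index=X_sarah.columns)
print(coef_by_feature.sort_values())   # e.g. Carat Weight ≈ 1.84e4, matching the statsmodels output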
We have trained our baseline OLS linear model. Let us now test its performance (on the test data, of course).

In [18]:
# Fit the model on the train set.
lr_sk = LinearRegression().fit(X_train_sarah, y_train_sarah)
# Predict prices on the test set.
y_pred_sarah_lr = lr_sk.predict(X_test_sarah)
# Negative prices are irrelevant.
y_pred_sarah_lr[y_pred_sarah_lr < 0] = 0
# Compute and output the score.
print("Linear Regression MAPE = ", compute_mape_score(y_test_sarah, y_pred_sarah_lr))

Linear Regression MAPE =  26.198235972929858

Conclusion: the baseline linear model has an error of approximately 26%.

2.2. Lasso Regression

Let us try an approach with regularization. L1-regularization is implemented in Lasso; description and documentation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

In [19]:
from sklearn.linear_model import Lasso
# Same as before: first fit, then predict. The numpy conversions are done for compatibility with the method signature.
lasso = Lasso(alpha=1, tol=1e-2, max_iter=1e7).fit(np.array(X_train_sarah), np.array(y_train_sarah).reshape(-1,))
y_pred_lasso = lasso.predict(np.array(X_test_sarah))
# Negative prices are irrelevant.
y_pred_lasso[y_pred_lasso < 0] = 0
print("Lasso MAPE = ", compute_mape_score(y_test_sarah, y_pred_lasso))

Lasso MAPE =  26.243429369438296
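For reference, with regularization strength alpha, scikit-learn's Lasso minimizes (1/(2n))·||y − Xβ||² + alpha·Σ|β_j|, an L1 penalty that can shrink some coefficients exactly to zero, while Ridge (used below) minimizes ||y − Xβ||² + alpha·Σβ_j², an L2 penalty that only shrinks them towards zero. A larger alpha means stronger shrinkage, which is why alpha is the parameter tuned by cross-validation in the next cells.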
The initial result is not better than the baseline. No problem, it sometimes happens: every model is wrong, but some of them are useful. Let us perform parameter tuning. The standard, and rather old, tool for this is cross-validation; a tool being old does not mean it is bad. Documentation for Lasso-specific cross-validation is available at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html

The limitation of this method is that it does not let the user supply a custom scoring function. No problem, we will deal with that later.

In [20]:
from sklearn.linear_model import LassoCV
# Set of values to choose from:
ALPHAS = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0]
# Initialize the cross-validation with 5 folds and fit.
lasso_cv = LassoCV(alphas=ALPHAS, max_iter=1e7, tol=1e-2, cv=5)
lasso_cv.fit(np.array(X_train_sarah), np.array(y_train_sarah).reshape(-1,))
# Now get the predictions and compute the score.
y_pred_lasso = lasso_cv.predict(X_test_sarah)
y_pred_lasso[y_pred_lasso < 0] = 0
print("Lasso MAPE = ", compute_mape_score(y_test_sarah, y_pred_lasso))

Lasso MAPE =  26.19823882856578
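LassoCV exposes which regularization strength the 5-fold cross-validation selected, which is useful for understanding why the score barely changed. A quick check (a sketch; lasso_cv is the object fitted above):

print("Selected alpha:", lasso_cv.alpha_)
print("MSE path shape:", lasso_cv.mse_path_.shape)   # (n_alphas, n_folds) grid of cross-validation errors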
2.3. Ridge Regression

Ridge regression uses another form of regularization: a weighted sum of squared coefficients. This functional form allows for faster optimization. The Scikit-Learn documentation for the approach is available at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [21]:
from sklearn.linear_model import Ridge
# Fit the model on the train set.
ridge = Ridge(alpha=0.5).fit(X_train_sarah, y_train_sarah)
# Predict prices on the test set.
y_pred_sarah_ridge = ridge.predict(X_test_sarah)
y_pred_sarah_ridge[y_pred_sarah_ridge < 0] = 0
# Compute and output the score.
print("Ridge Regression MAPE = ", compute_mape_score(y_test_sarah, y_pred_sarah_ridge))

Ridge Regression MAPE =  26.35067279831501

For a better parameter search, let us do cross-validation. Built-in cross-validation for Ridge regression is documented at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

In [22]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import make_scorer
# Values to test.
ALPHAS = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0]
# Since we have a custom score function, we need a custom scorer.
mape_scorer = make_scorer(compute_mape_score, greater_is_better=False)
# Run cross-validation with 5 folds. After that, get the predictions.
cv_ridge = RidgeCV(alphas=ALPHAS, cv=5, scoring=mape_scorer).fit(X_train_sarah, y_train_sarah)
y_pred_cv_ridge = cv_ridge.predict(X_test_sarah)
y_pred_cv_ridge[y_pred_cv_ridge < 0] = 0
# And print the metric once the predictions are obtained.
print("Ridge CV MAPE = ", compute_mape_score(y_test_sarah, y_pred_cv_ridge))

Ridge CV MAPE =  20.46076029498597

We have a significant improvement: hyper-parameter tuning sometimes helps.
3. Logarithmic Model

Let us modify the model. Intuitively, by its structure, a linear model can predict a negative price, which is hardly meaningful; therefore, we naturally want to force it to produce positive prices. One option is to predict the logarithm of the price. Since we now predict a logarithm rather than the original value, it is also natural to move the quantitative features to a logarithmic scale.

3.1. Simple Linear Regression

In [23]:
y_train_log, y_test_log = copy.deepcopy(y_train_sarah), copy.deepcopy(y_test_sarah)
X_train_log, X_test_log = copy.deepcopy(X_train_sarah), copy.deepcopy(X_test_sarah)
X_log, y_log = copy.deepcopy(X_sarah), copy.deepcopy(y_sarah)
X_predict_log = copy.deepcopy(X_predict_sarah)

# Apply the log-transform to the quantitative variables, which are price and carat weight.
for item in [y_train_log, y_test_log, X_train_log, X_test_log, X_log, y_log, X_predict_log]:
    if 'Price' in item.columns:
        item['Price'] = np.log(item['Price'])
    if 'Carat Weight' in item.columns:
        item['Carat Weight'] = np.log(item['Carat Weight'])
# Note that the nature of the data changes: it is now log-price instead of price, and log-weight instead of weight.

Note that we fit a double-log model, which means that we assume a relationship between the percentage change in carat weight and the percentage change in price.

In [24]:
lm_log = LinearRegression().fit(X_train_log, y_train_log)
y_pred_lm_log = np.exp(lm_log.predict(X_test_log))
print("Log-Log Model MAPE = ", compute_mape_score(y_pred_lm_log, y_test_sarah))

Log-Log Model MAPE =  7.9379372408370905

The result is a significant improvement compared to the 20% achieved above.
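In the double-log specification, log(Price) = β0 + β1·log(Carat Weight) + (dummy terms), the coefficient β1 is an elasticity: a 1% increase in carat weight is associated with roughly a β1% increase in price. A small sketch for extracting it from lm_log fitted above; the names carat_idx and carat_elasticity are illustrative only, and the indexing assumes the column order of X_train_log:

carat_idx = X_train_log.columns.get_loc('Carat Weight')
carat_elasticity = lm_log.coef_[0][carat_idx]   # coef_ has shape (1, n_features) because y is a one-column DataFrame
print("Estimated elasticity of price w.r.t. carat weight:", carat_elasticity)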
3.2. Ridge Regression

As above, let us try ridge regression, which helped before. Naturally, a cross-validated parameter search is better than none, so let us do it the right way.

In [25]:
from sklearn.linear_model import RidgeCV

# A new loss function is needed, as we now work with logarithms.
def log_loss_mape(y_test_log, y_pred_log):
    return compute_mape_score(np.exp(y_test_log), np.exp(y_pred_log))

# Create a scorer to meet the CV interface.
log_scorer = make_scorer(log_loss_mape, greater_is_better=False)
# Set of parameters to check.
ALPHAS = (0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0)
# Fit and predict.
cv_ridge_log = RidgeCV(alphas=ALPHAS, cv=5, scoring=log_scorer).fit(X_train_log, y_train_log)
y_pred_cv_ridge_log = cv_ridge_log.predict(X_test_log)
print("Log Ridge CV MAPE = ", compute_mape_score(y_test_sarah, np.exp(y_pred_cv_ridge_log)))

Log Ridge CV MAPE =  7.812154026736443

The error is slightly better, but not by much: the logarithmic model now scores about 7.8%.

3.3. Lasso Regression

Let us introduce here another way of doing cross-validation. The general, method-agnostic way is to use the GridSearchCV class. Documentation and description are provided at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# As before, the values we would like to check.
ALPHAS = [0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0]
# A parameter grid is a way to provide the parameters to iterate over.
param_grid = {
    'alpha': ALPHAS,
    'selection': ['cyclic', 'random']
}
# Initialize and fit.
lasso_cv = GridSearchCV(Lasso(), param_grid, cv=5, refit=True, scoring=log_scorer)
lasso_cv.fit(np.array(X_train_log), np.array(y_train_log))
# Predict and print the score.
y_pred_lasso_log = lasso_cv.predict(X_test_log)
print("Log Lasso MAPE = ", log_loss_mape(y_test_log, y_pred_lasso_log))

Log Lasso MAPE =  7.804833782344718
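GridSearchCV keeps the per-candidate cross-validation results, which can help verify that the chosen alpha does not sit at the edge of the grid. A minimal sketch (lasso_cv is the fitted search object above; mean_test_score is the negated MAPE, because the scorer was built with greater_is_better=False):

cv_results = pd.DataFrame(lasso_cv.cv_results_)
print(cv_results[['param_alpha', 'param_selection', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())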
4. Visual Analysis and Further Improvements

Let us compare the model predictions visually.

In [27]:
import matplotlib.pyplot as plt

# Plot the test prices against the best linear prediction.
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.scatter.html
plt.scatter(y_test_sarah, y_pred_sarah_ridge, color='blue')

# Axis labels.
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.ylabel.html?highlight=ylabel#matplotlib.pyplot.ylabel
plt.ylabel('Predicted price, Linear model')
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.xlabel.html?highlight=xlabel#matplotlib.pyplot.xlabel
plt.xlabel('Actual price')

# The 45-degree line indicates where there is no error.
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html?highlight=pyplot.plot#matplotlib.pyplot.plot
plt.plot([0, 70000], [0, 70000], color='black', lw=2)
plt.show

Out[27]: <function matplotlib.pyplot.show(*args, **kw)>
In [28]:
# The same plot for the log-model.
plt.scatter(np.exp(y_test_log), np.exp(y_pred_lasso_log), color='red')
plt.ylabel('Predicted price, Log-Log model')
plt.xlabel('Actual price')
plt.plot([0, 70000], [0, 70000], color='black', lw=2)
plt.show

Out[28]: <function matplotlib.pyplot.show(*args, **kw)>

Note that the predictions in the second graph lie closer to the 45-degree line. Moreover, the larger deviations at higher prices in the first figure are themselves a motivation for trying logarithmic features.

Further Model Improvements

Let us continue with some additional feature engineering. One way to do this is to add interactions, which are simply products of several features. In our example, we will use interactions of color and weight.
In [29]:
X_train_interact, X_test_interact = copy.deepcopy(X_train_log), copy.deepcopy(X_test_log)

# The loop below adds the interactions "manually".
for X_set in [X_train_interact, X_test_interact]:
    for color_name in ['Color_E', 'Color_F', 'Color_G', 'Color_H', 'Color_I']:
        X_set['Carat Weight:' + color_name] = X_set['Carat Weight'] * X_set[color_name]

lasso_cv.fit(np.array(X_train_interact), np.array(y_train_log).reshape(-1,))
y_pred_interact = lasso_cv.predict(X_test_interact)
print("Log Lasso Interact MAPE = ", log_loss_mape(y_test_log, y_pred_interact))

C:\Users\ao37\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 2.326945633950345, tolerance: 0.19979296050361048
  positive)

Log Lasso Interact MAPE =  7.500687994903503

Great news! The error improves to 7.5%.

In [30]:
print("Best CV Parameters", lasso_cv.best_params_)
print("Best Estimator: ", lasso_cv.best_estimator_)
print("Lasso Coefficients: ", lasso_cv.best_estimator_.coef_)
print("Intercept: ", lasso_cv.best_estimator_.intercept_)

Best CV Parameters {'alpha': 0.0001, 'selection': 'random'}
Best Estimator:  Lasso(alpha=0.0001, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='random', tol=0.0001, warm_start=False)
Lasso Coefficients:  [ 2.11633277  0.02941359  0.087802    0.23358381  0.05822614 -0.07068121
 -0.11374927 -0.20008998 -0.29922452 -0.43542648  0.11461715 -0.43931926
 -0.19923273 -0.27853613  0.         -0.07199608 -0.03578932  0.00658848
 -0.02242716 -0.02422842  0.         -0.01991049  0.03362982 -0.06966822
 -0.06653052 -0.11936785 -0.22548757 -0.25676175]
Intercept:  9.035093878158339

There is also a built-in way to create interactions, via PolynomialFeatures. The intuition of its usage is provided below; documentation is available at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [31]:
from sklearn.preprocessing import PolynomialFeatures
# All quadratic interactions.
poly = PolynomialFeatures(2)
# Only products of different features.
poly = PolynomialFeatures(2, interaction_only=True)
# This is how an application looks. See the documentation for a fuller understanding.
# poly.fit_transform(XYZ)
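As a sketch of how PolynomialFeatures could reproduce the carat-weight-by-color interactions built manually above: the column list and the names interaction_cols / interacted are illustrative, and on older scikit-learn versions get_feature_names() would be used in place of get_feature_names_out().

interaction_cols = ['Carat Weight', 'Color_E', 'Color_F', 'Color_G', 'Color_H', 'Color_I']
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
interacted = poly.fit_transform(X_train_log[interaction_cols])
print(interacted.shape)                               # the original columns plus their pairwise products
print(poly.get_feature_names_out(interaction_cols))   # names of the generated columns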
5. Hurray -- we are done!

To obtain predictions: (1) load the prediction data, (2) apply the same feature engineering to it, (3) feed it into the best (selected) model to obtain the predictions, and (4) write the predicted values out. Note that we have already done steps 1 and 2, since we split the data after feature engineering; this is, in fact, the preferred approach, as otherwise unmatched categories may be present.

Let us save the result produced by the currently best model. Some links to the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_stata.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html

In [32]:
best_reg = lasso_cv.best_estimator_
best_reg.fit(np.array(X_log), np.array(y_log).reshape(-1,))
predictions = np.exp(best_reg.predict(np.array(X_predict_log)))

# Save the result to a pandas DataFrame and export it to CSV.
df_result = pd.DataFrame.from_dict({'predicted_price': predictions})
df_result.to_csv('Predicted Diamond Prices_python LOG INTERACTION LASSO.csv', sep=',')

# Other export formats are also possible:
df_result.to_stata('Predicted Diamond Prices_python LOG INTERACTION LASSO.dta')
# df_result.to_excel('predicted.xlsx')  # (Additional packages needed.)
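One might also prefer to keep the diamond IDs of the prediction rows alongside the predicted prices, so the output file can be matched back to the original data. A small variation on the export above (a sketch, using the sarah_predict frame defined earlier):

df_result = pd.DataFrame({'ID': sarah_predict['ID'].values,
                          'predicted_price': predictions})
df_result.to_csv('Predicted Diamond Prices_python LOG INTERACTION LASSO.csv', sep=',', index=False)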