shurenb2_HW 6 EDA + Data Prep_student.ipynb - Colaboratory

School: William Rainey Harper College
Course: MISC
Subject: Statistics
Date: Feb 20, 2024
Type: pdf
Uploaded by elliebat

3/12/23, 10:16 PM shurenb2_HW 6 EDA + Data Prep_student.ipynb - Colaboratory https://colab.research.google.com/drive/1d8ev2LynLiXY-icoWg7KzyfUVjg1zGqq#scrollTo=tpLWPXW4NmMu&printMode=true 1/7

Homework 6: Data Mining

Overview

You have been hired as a consultant by a laptop retailing firm, SmartRetail. SmartRetail faces stiff competition in this space. In order to be profitable and survive, the company needs to predict the price points of laptops with different specifications. Your task is to predict the market price of a given laptop specification with the least amount of prediction error. In particular, the firm's sales department wants to understand how different components contribute to the Retail Price.

The Retail Price is measured in dollars. For some transactions, this value may not have been captured due to a glitch in the system. Different components and the way they are measured are listed below:

- Screen Size is in inches
- Battery Life in hours
- RAM in GB
- HD Size in GB
- Processing Speeds in GHz
- Integrated Wireless and Bundled Applications have a Yes or No value according to the customer's choice

Problem Statement

Your goals are to do the following:
1. Identify the factors that explain the variance in the price of a Laptop
2. Predict the Retail Price of a Laptop

To achieve these goals you must perform exploratory data analysis (EDA) and prepare the data for modeling. Therefore, in this notebook you will perform the following tasks:
1. Clean the data for duplicates (if any) and missing values (if any)
2. Perform EDA
3. Prepare the data for Modeling (Dummy Coding and Data Partitioning).

Once you have completed these tasks, answer the accompanying questions in the Canvas quiz. In HW 7 (on Multiple Linear Regression), you will perform the next steps, which are fitting two types of linear regression models (explanatory and predictive).

Business Goals

Coding and Analysis

Before you begin, let's upgrade seaborn.
    !pip install seaborn --upgrade

    Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
    Requirement already satisfied: seaborn in /usr/local/lib/python3.9/dist-packages (0.11.2)
    Collecting seaborn
      Downloading seaborn-0.12.2-py3-none-any.whl (293 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 293.3/293.3 KB 3.7 MB/s eta 0:00:00
    Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in /usr/local/lib/python3.9/dist-packages (from seaborn) (3.5.3)
    Requirement already satisfied: numpy!=1.24.0,>=1.17 in /usr/local/lib/python3.9/dist-packages (from seaborn) (1.22.4)
    Requirement already satisfied: pandas>=0.25 in /usr/local/lib/python3.9/dist-packages (from seaborn) (1.3.5)
    Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn
    Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seab
    Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->sea
    Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seabo
    Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->seaborn)
    Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->sea
    Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-packages (from matplotlib!=3.6.1,>=3.1->
    Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.9/dist-packages (from pandas>=0.25->seaborn) (2022.7.1)
    Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6
    Installing collected packages: seaborn

      Attempting uninstall: seaborn
        Found existing installation: seaborn 0.11.2
        Uninstalling seaborn-0.11.2:
          Successfully uninstalled seaborn-0.11.2
    Successfully installed seaborn-0.12.2

Next, import the following packages and functions using suitable aliases: pandas, numpy, matplotlib.pyplot, seaborn, and train_test_split.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    %matplotlib inline

Next, load the dataset into a dataframe named lsales_df from https://raw.githubusercontent.com/ashish-cell/BADM-211-FA21/main/Data/laptop_hw6.csv.

    lsales_df = pd.read_csv('https://raw.githubusercontent.com/ashish-cell/BADM-211-FA21/main/Data/laptop_hw6.csv')

For future reference, let's check how many rows and columns there are.

    lsales_df.shape

    (5963, 10)

STEP 1: Display the first 7 rows of the dataset. (1 pt)

Your output should look like this:

       Transaction_ID  Configuration  Screen_Size  Battery_Life  RAM  Processor_Speeds  Integrated_Wireless  HD_Size  Bundled_Appli
    0            5622             51           15             4    2               1.5                  Yes       80
    1           52758            695           17             5    4               2.0                  Yes      300
    2             977            299           15             6    1               1.5                   No       80
    3          215756            303           15             6    1               1.5                   No      300
    4          294662            394           15             6    4               1.5                   No       40
    5          150118            603           17             5    1               2.0                   No       80
    6          265452            512           17             4    2               2.0                   No      300

    # Write the code for step 1 here
    lsales_df.head(7)

Q1: Which of these conclusions can you draw based on only the output shown above?
1. There are three Screen Sizes.
2. The maximum Retail Price (across all laptops) is 550.
3. Processor Speeds vary between 2.0 and 2.4

4. None of these

STEP 2a: Show the number of duplicate rows in the dataset. (1 pt)

    # Write the code for step 2a here
    lsales_df.duplicated().sum()

    12

Q2: Which of the following is correct with respect to the output of Step 2a?
1. There are 12 duplicate columns in the dataset
2. There are 12 duplicate rows in the dataset
3. There are 5963 duplicate rows in the dataset
4. None of these

STEP 2b: Remove the duplicate rows, if any. (1 pt)
(You may want to compare the number of rows/columns before and after this step to ensure you did it correctly.)

    # Write the code for step 2b here
    lsales_df.drop_duplicates(inplace=True)
    lsales_df.shape

    (5951, 10)

STEP 3a: Show the total number of missing values for each column. (1 pt)

    # Write the code for step 3a here
    lsales_df.isnull().sum()

    Transaction_ID            0
    Configuration             0
    Screen_Size               0
    Battery_Life              0
    RAM                       0
    Processor_Speeds          0
    Integrated_Wireless       0
    HD_Size                   0
    Bundled_Applications      0
    Retail_Price            293
    dtype: int64

Q3: Which of the following is correct with respect to the output of Step 3a? (Select all that apply)
1. There is only one column with missing values
2. The count of missing values in the Retail_Price column is 293
3. The proportion of rows with missing values in any column is less than 10% of the overall dataset.
4. None of these.

STEP 3b: Drop rows with missing values. (1 pt)
(You may want to compare the number of rows/columns before and after this step to ensure you did it correctly.)

    # Write the code for step 3b here
    lsales_df.dropna(inplace=True)

STEP 4: Generate a bar chart to show the mean Retail Price by HD Size, distinguishing by Battery_Life within each HD Size category. (2 pts)

Your output should look like this:


    # Write the code for step 4 here
    sns.barplot(x=lsales_df["HD_Size"], y=lsales_df["Retail_Price"], hue=lsales_df["Battery_Life"], estimator="mean")
    plt.show()

Q4: Referring to the visualization created for Step 4, what can be concluded?
1. For any Battery_Life, the average price of a laptop with 300 GB HD is higher than that of a laptop with 120 GB.
2. The most expensive laptop in our data was the one with 300 GB HD_Size and 6 hours of Battery_Life.
3. The least expensive laptop in our data was the one with 40 GB HD_Size and 4 hours of Battery_Life.
4. None of these.

STEP 5: Generate a visualization to show the relationship between Configuration and Retail Price. (1 pt)

Your output should look like this:

    # Write the code for step 5 here
    sns.scatterplot(x=lsales_df["Configuration"], y=lsales_df["Retail_Price"])
    plt.show()
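The bar heights in the Step 4 chart are group means, which matters when reading Q4: a mean per group says nothing about which single laptop is the most or least expensive. As a rough sketch, the numbers behind such a chart can be tabulated with groupby. The frame below is invented for illustration and is not taken from laptop_hw6.csv; the names toy_df and mean_price are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration only (not the homework dataset).
toy_df = pd.DataFrame({
    "HD_Size":      [40, 40, 40, 300, 300, 300],
    "Battery_Life": [4, 4, 6, 4, 6, 6],
    "Retail_Price": [300.0, 320.0, 350.0, 500.0, 520.0, 560.0],
})

# Mean price per (HD_Size, Battery_Life) pair: the same quantity that a
# bar plot with a mean estimator draws as bar heights.
mean_price = (
    toy_df.groupby(["HD_Size", "Battery_Life"])["Retail_Price"]
    .mean()
    .unstack("Battery_Life")
)
print(mean_price)

# The single most expensive row is a separate question from group means.
most_expensive = toy_df.loc[toy_df["Retail_Price"].idxmax()]
print(most_expensive)
```

Reading the table row by row compares HD sizes within one battery life, which is the comparison the first Q4 option asks about.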

Q5: Referring to the scatterplot describing Configuration and Retail Price, choose the correct answer.
1. There appears to be a strongly negative linear relationship between these two variables.
2. Almost all of the observations have a Retail Price of less than $400.
3. The record with the highest Retail Price has a Configuration of 600.
4. The record with the highest Retail Price has a Configuration of more than 800.

STEP 6: Remove unnecessary columns. (1 pt)
The first column of the dataframe is Transaction_ID. This is not a valid predictor, so we must remove it. After doing so, display the names of the remaining columns. This step must be completed for the rest to be accurate.

Your output should look like this:

    # Write the code for step 6 here
    lsales_df = lsales_df.drop(columns=["Transaction_ID"])
    lsales_df.columns

    Index(['Configuration', 'Screen_Size', 'Battery_Life', 'RAM',
           'Processor_Speeds', 'Integrated_Wireless', 'HD_Size',
           'Bundled_Applications', 'Retail_Price'],
          dtype='object')

STEP 7: Dummy Coding (2 pts)
Perform dummy coding for the categorical variables. One-hot encode (AKA dummy code) the variables Integrated_Wireless and Bundled_Applications, and display the variable names for the entire dataset. Pay close attention to the name and number of variables. We need to do this step correctly to prepare for multiple linear regression. You can choose to overwrite the existing dataframe or save this output to a new dataframe that you will use going forward.
Your output should look like this:

    # Write the code for step 7 here
    lsales_df_dummy = pd.get_dummies(lsales_df, columns=["Integrated_Wireless", "Bundled_Applications"], drop_first=True)
    lsales_df_dummy.columns

    Index(['Configuration', 'Screen_Size', 'Battery_Life', 'RAM',
           'Processor_Speeds', 'HD_Size', 'Retail_Price',
           'Integrated_Wireless_Yes', 'Bundled_Applications_Yes'],
          dtype='object')
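To see why the column count stays at nine after dummy coding, here is a minimal sketch of drop_first=True on a toy frame (the values are invented for illustration, and the names toy_df, encoded, and full are assumptions). Each two-level Yes/No column becomes a single indicator column, with "No" as the baseline level. Keeping both indicators would make them perfectly collinear with the intercept of a linear regression.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_df = pd.DataFrame({
    "RAM": [1, 2, 4],
    "Integrated_Wireless": ["Yes", "No", "Yes"],
    "Bundled_Applications": ["No", "No", "Yes"],
})
categorical = ["Integrated_Wireless", "Bundled_Applications"]

# drop_first=True keeps one indicator per two-level column.
encoded = pd.get_dummies(toy_df, columns=categorical, drop_first=True)
print(list(encoded.columns))

# Without drop_first, each Yes/No column yields two indicators whose
# values always sum to 1 in every row.
full = pd.get_dummies(toy_df, columns=categorical)
print(list(full.columns))
```

The indicator columns are appended after the untouched columns, which is why Retail_Price ends up before Integrated_Wireless_Yes in the Step 7 output.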

STEP 8: Create predictors and outcome variable (1 pt)
Create an object named X to include all predictors. Create an object named y that holds the Retail Price variable. Print the number of rows and columns in X and y.

Your output should look like this:

    # Write the code for step 8 here
    # Note: x is built from lsales_df, not from the dummy-coded lsales_df_dummy
    # created in Step 7, so Integrated_Wireless and Bundled_Applications keep
    # their Yes/No values in the predictors shown below.
    x = lsales_df.drop(columns = ["Retail_Price"])
    y = lsales_df["Retail_Price"]
    print(x.shape)
    print(y.shape)

    (5658, 8)
    (5658,)

STEP 9: Train Test Split (2 pts)
Using the train_test_split function, split the dataset into two parts: 60% of the samples into train_X and train_y and 40% of the samples into test_X and test_y. Enter a random seed value of seven. Display the shape of the training data and test data predictors.

Your output should look like this:

    # Write the code for step 9 here
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = .4, random_state=7)
    print("Train X:", train_x.shape, "\nTest X:", test_x.shape)

    Train X: (3394, 8)
    Test X: (2264, 8)

    # Use this cell to help in answering Q6
    train_x.head()

          Configuration  Screen_Size  Battery_Life  RAM  Processor_Speeds  Integrated_Wireless  HD_Size  Bundled_Applications
    2363            169           15             5    1               2.0                   No       40                   Yes
    469              82           15             4    2               2.4                  Yes       40                    No
    5915            391           15             6    4               1.5                  Yes      300                   Yes
    4894             77           15             4    2               2.0                   No      120                   Yes
    5485            590           17             5    1               1.5                   No      120                    No

Q6: Look at the first 5 rows of the dataframe with the training predictor variables. What is the value of Configuration for the record displayed first?
1. 318
2. 27
3. 169
4. 506
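Steps 2 through 9 chain together into one preparation pipeline. The sketch below runs that chain on a small synthetic frame and builds X from the dummy-coded frame, so that Yes/No columns reach a regression model as indicator columns. The data, the helper name prepare_and_split, and its parameters are all made up for illustration; the 60/40 split and the seed of 7 are taken from Step 9.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_and_split(df, categorical, outcome="Retail_Price",
                      id_col="Transaction_ID"):
    """Hypothetical helper: clean, dummy code, and partition a frame."""
    cleaned = df.drop_duplicates().dropna()        # Steps 2b and 3b
    cleaned = cleaned.drop(columns=[id_col])       # Step 6
    encoded = pd.get_dummies(cleaned, columns=categorical,
                             drop_first=True)      # Step 7
    X = encoded.drop(columns=[outcome])            # Step 8
    y = encoded[outcome]
    # Step 9: 60% train, 40% test, fixed seed for reproducibility.
    return train_test_split(X, y, test_size=0.4, random_state=7)


# Synthetic data for illustration only (not the homework dataset).
n = 20
toy_df = pd.DataFrame({
    "Transaction_ID": np.arange(n),
    "RAM": [1, 2, 4, 2] * 5,
    "Integrated_Wireless": ["Yes", "No"] * 10,
    "Retail_Price": np.linspace(300.0, 600.0, n),
})
toy_df.loc[3, "Retail_Price"] = np.nan             # one missing price
toy_df = pd.concat([toy_df, toy_df.iloc[[0]]])     # one duplicate row

train_X, test_X, train_y, test_y = prepare_and_split(
    toy_df, categorical=["Integrated_Wireless"]
)
print("Train X:", train_X.shape, "\nTest X:", test_X.shape)
```

Because the split is driven by row count and random_state, swapping the unencoded frame for the dummy-coded one changes which columns appear in X but not which rows land in the training set.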
