Any idea where my code is messing up?

The taxis dataset contains information on taxi journeys during March 2019 in New York City. The data includes time, number of passengers, distance, taxi color, payment method, and trip locations. Use sklearn's cross_validate() function to fit a linear regression model and a k-nearest neighbors regression model with 10-fold cross-validation.

  • Create dataframe X with the feature distance.
  • Create dataframe y with the feature fare.
  • Split the data into 80% training, 10% validation and 10% testing sets, with random_state = 42.
  • Initialize a linear regression model.
  • Initialize a k-nearest neighbors regression model with k = 3.
  • Define a set of 10 cross-validation folds with random_state=42.
  • Fit the models with cross-validation to the training data, using the default performance metric.
  • For each model, print the test score for each fold, as well as the mean and standard deviation for the model.
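The 80/10/10 split is usually built from two chained calls to train_test_split, and the second call's test_size is a fraction of what remains after the first call, not of the full dataset. The following is a minimal arithmetic sketch of that point only. The helper name chained_split_fractions is hypothetical, and the sketch ignores the rounding to whole rows that train_test_split performs.

```python
def chained_split_fractions(first_test_size, second_test_size):
    """Return (train, val, test) as fractions of the FULL dataset
    after two chained splits.

    first_test_size  -- fraction of the full data held out as the test set
    second_test_size -- fraction of the REMAINING data held out as validation
    """
    test = first_test_size
    remaining = 1.0 - first_test_size
    val = remaining * second_test_size
    train = remaining - val
    return train, val, test


# Reusing 0.1 for the second split takes 10% of the remaining 90%.
naive = chained_split_fractions(0.1, 0.1)

# Rescaling by the remaining fraction makes validation 10% of the full data.
rescaled = chained_split_fractions(0.1, 0.1 / (1 - 0.1))

print("second test_size = 0.1:       train=%.2f val=%.2f test=%.2f" % naive)
print("second test_size = 0.1 / 0.9: train=%.2f val=%.2f test=%.2f" % rescaled)
```

With a second test_size of 0.1 the proportions come out near 81/9/10, while 0.1 / (1 - 0.1) gives 80/10/10.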

Ex: If the file taxis_small.csv is used, the output is:

k-nearest neighbor scores: [0.412 0.741 0.708 0.855 0.056 0.974 0.622 0.769 0.754 0.236]
Mean: 0.613
SD: 0.274
Simple linear regression scores: [ 0.603  0.584  0.651  0.956 -0.564  0.941  0.828 -0.294  0.908  0.723]
Mean: 0.534
SD: 0.502
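As a quick sanity check on how the summary lines relate to the per-fold scores, the k-nearest neighbor example values above can be recomputed with the standard library. This is only a sketch: it works from the rounded scores as printed, and the variable names are illustrative. It suggests the SD shown is the population standard deviation (the default of NumPy's std(), ddof=0), not the sample standard deviation.

```python
from statistics import fmean, pstdev, stdev

# Per-fold k-nearest neighbor scores copied from the taxis_small.csv example
knn_scores = [0.412, 0.741, 0.708, 0.855, 0.056, 0.974, 0.622, 0.769, 0.754, 0.236]

mean = fmean(knn_scores)
population_sd = pstdev(knn_scores)  # divides by n, like numpy's default std()
sample_sd = stdev(knn_scores)       # divides by n - 1

print("Mean:", round(mean, 3))
print("Population SD:", round(population_sd, 3))
print("Sample SD:", round(sample_sd, 3))
```

The population SD reproduces the 0.274 in the example, so calling .std() on the score array without extra arguments is consistent with the expected format.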


main.py:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

# Load the dataset
taxis = pd.read_csv("taxis.csv")

# Create dataframe X with the feature distance
X = taxis[['distance']]
# Create dataframe y with the feature fare
y = taxis['fare']

# Set aside 10% of instances for testing
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Split training again into 80% training and 10% validation
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.1, random_state=42)

# Initialize a linear regression model
SLRModel = LinearRegression()
# Initialize a k-nearest neighbors regression model with k = 3
knnModel = KNeighborsRegressor(n_neighbors=3)

# Define a set of 10 cross-validation folds with random_state=42
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Fit k-nearest neighbors with cross-validation to the training data
knnResults = cross_validate(knnModel, X_train, y_train, cv=kf, return_train_score=False)

# Find the test score for each fold
knnScores = knnResults['test_score']
print('k-nearest neighbor scores:', knnScores.round(3))

# Calculate descriptive statistics for k-nearest neighbor model
print('Mean:', knnScores.mean().round(3))
print('SD:', knnScores.std().round(3))

# Fit simple linear regression with cross-validation to the training data
SLRModelResults = cross_validate(SLRModel, X_train, y_train, cv=kf, return_train_score=False)

# Find the test score for each fold
SLRScores = SLRModelResults['test_score']
print('Simple linear regression scores:', SLRScores.round(3))

# Calculate descriptive statistics for simple linear regression model
print('Mean:', SLRScores.mean().round(3))
print('SD:', SLRScores.std().round(3))


Your Output:

k-nearest neighbor scores: [0.677 0.813 0.883 0.852 0.818 0.875 0.848 0.768 0.928 0.841]
Mean: 0.83
SD: 0.066
Simple linear regression scores: [0.897 0.79  0.889 0.78  0.832 0.911 0.815 0.802 0.934 0.915]
Mean: 0.857
SD: 0.055

Expected output:

k-nearest neighbor scores: [0.815 0.786 0.856 0.877 0.742 0.92  0.846 0.909 0.921 0.845]
Mean: 0.852
SD: 0.056
Simple linear regression scores: [0.804 0.784 0.853 0.907 0.757 0.9   0.919 0.919 0.931 0.844]
Mean: 0.862
SD: 0.06
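
One possible explanation for the mismatch, offered as a hypothesis and not a confirmed diagnosis: the second train_test_split call in main.py uses test_size=0.1 on the 90% left after the first split. That yields roughly 81% training and 9% validation instead of 80% and 10%. A different set of training rows means different folds and different per-fold scores. The sketch below rescales the second test_size to 0.1 / (1 - 0.1). It uses a synthetic stand-in for taxis.csv (invented distance and fare values), so its scores will not match the expected output. It also builds y as a one-column dataframe to follow the task wording, which is an assumption about what the grader expects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for taxis.csv (assumption: invented values, not real data)
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 20.0, size=1000)
fare = 2.5 + 2.7 * distance + rng.normal(0.0, 2.0, size=1000)
taxis = pd.DataFrame({"distance": distance, "fare": fare})

X = taxis[["distance"]]
y = taxis[["fare"]]

# First split: hold out 10% of all rows for testing
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Second split: validation should be 10% of ALL rows,
# which is 0.1 / 0.9 of the rows that remain
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.1 / (1 - 0.1), random_state=42
)

kf = KFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "k-nearest neighbor": KNeighborsRegressor(n_neighbors=3),
    "Simple linear regression": LinearRegression(),
}

all_scores = {}
for name, model in models.items():
    # The default metric is the estimator's own score(), R^2 for both regressors
    results = cross_validate(model, X_train, y_train, cv=kf)
    scores = results["test_score"]
    all_scores[name] = scores
    print(f"{name} scores:", scores.round(3))
    print("Mean:", scores.mean().round(3))
    print("SD:", scores.std().round(3))
```

If the real data still disagrees with the expected output after this change, the next things to compare would be the order in which the two models are printed and whether the grader loads the same CSV file.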
