Any idea where my code is messing up?
The taxis dataset contains information on taxi journeys during March 2019 in New York City. The data includes time, number of passengers, distance, taxi color, payment method, and trip locations. Use sklearn's cross_validate() function to fit a linear regression model and a k-nearest neighbors regression model with 10-fold cross-validation.
- Create dataframe X with the feature distance.
- Create dataframe y with the feature fare.
- Split the data into 80% training, 10% validation and 10% testing sets, with random_state = 42.
- Initialize a linear regression model.
- Initialize a k-nearest neighbors regression model with k = 3.
- Define a set of 10 cross-validation folds with random_state=42.
- Fit the models with cross-validation to the training data, using the default performance metric.
- For each model, print the test score for each fold, as well as the mean and standard deviation for the model.
Ex: If the file taxis_small.csv is used, the output is:
k-nearest neighbor scores: [0.412 0.741 0.708 0.855 0.056 0.974 0.622 0.769 0.754 0.236]
Mean: 0.613
SD: 0.274
Simple linear regression scores: [ 0.603 0.584 0.651 0.956 -0.564 0.941 0.828 -0.294 0.908 0.723]
Mean: 0.534
SD: 0.502
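A note on the negative fold scores in the example: when no `scoring` argument is passed, `cross_validate()` falls back to the estimator's own `score()` method, which for sklearn regressors is the coefficient of determination (R²). R² is 1 for perfect predictions, 0 for a model that always predicts the mean, and negative for a model that does worse than that. The sketch below recomputes R² by hand on made-up numbers; the helper name `r_squared` and the toy values are illustrative assumptions, not part of the assignment.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy "fares" (hypothetical values, chosen so the arithmetic is easy to follow)
fares = [1.0, 2.0, 3.0]

perfect = r_squared(fares, [1.0, 2.0, 3.0])   # predictions match exactly
baseline = r_squared(fares, [2.0, 2.0, 2.0])  # always predict the mean
poor = r_squared(fares, [3.0, 3.0, 3.0])      # worse than predicting the mean

print(perfect, baseline, poor)
```

So a fold score such as -0.564 is not an error in itself; it only means the model fit on the other nine folds predicted that fold worse than a constant would.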
main.py:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
# Load the dataset
taxis = pd.read_csv("taxis.csv")
# Create dataframe X with the feature distance
X = taxis[['distance']]
# Create dataframe y with the feature fare
y = taxis['fare']
# Set aside 10% of instances for testing
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Split training again into 80% training and 10% validation
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.1, random_state=42)
# Initialize a linear regression model
SLRModel = LinearRegression()
# Initialize a k-nearest neighbors regression model with k = 3
knnModel = KNeighborsRegressor(n_neighbors=3)
# Define a set of 10 cross-validation folds with random_state=42
kf = KFold(n_splits=10, shuffle=True, random_state=42)
# Fit k-nearest neighbors with cross-validation to the training data
knnResults = cross_validate(knnModel, X_train, y_train, cv=kf, return_train_score=False)
# Find the test score for each fold
knnScores = knnResults['test_score']
print('k-nearest neighbor scores:', knnScores.round(3))
# Calculate descriptive statistics for k-nearest neighbor model
print('Mean:', knnScores.mean().round(3))
print('SD:', knnScores.std().round(3))
# Fit simple linear regression with cross-validation to the training data
SLRModelResults = cross_validate(SLRModel, X_train, y_train, cv=kf, return_train_score=False)
# Find the test score for each fold
SLRScores = SLRModelResults['test_score']
print('Simple linear regression scores:', SLRScores.round(3))
# Calculate descriptive statistics for simple linear regression model
print('Mean:', SLRScores.mean().round(3))
print('SD:', SLRScores.std().round(3))
Your Output:
k-nearest neighbor scores: [0.677 0.813 0.883 0.852 0.818 0.875 0.848 0.768 0.928 0.841]
Mean: 0.83
SD: 0.066
Simple linear regression scores: [0.897 0.79 0.889 0.78 0.832 0.911 0.815 0.802 0.934 0.915]
Mean: 0.857
SD: 0.055
Expected output:
k-nearest neighbor scores: [0.815 0.786 0.856 0.877 0.742 0.92 0.846 0.909 0.921 0.845]
Mean: 0.852
SD: 0.056
Simple linear regression scores: [0.804 0.784 0.853 0.907 0.757 0.9 0.919 0.919 0.931 0.844]
Mean: 0.862
SD: 0.06
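One plausible cause of the mismatch is the second `train_test_split` call in main.py. After 10% is set aside for testing, `test_size=0.1` is applied to the remaining 90%, so the validation set ends up at 9% of the data and the training set at 81%, not the 10% and 80% the assignment asks for. A different training set means different cross-validation folds, which would change every fold score. The sketch below only checks the arithmetic with exact fractions; the helper `two_stage_fractions` is a hypothetical name introduced for illustration, and nothing here has been run against taxis.csv.

```python
from fractions import Fraction


def two_stage_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) as fractions of the full dataset
    when two splits are chained, the second applied to the remainder."""
    test = Fraction(test_size)
    remainder = 1 - test
    val = remainder * Fraction(val_size_of_remainder)
    train = remainder - val
    return train, val, test


tenth = Fraction(1, 10)

# As posted: test_size=0.1 in both calls
as_posted = two_stage_fractions(tenth, tenth)

# Assumed intent: the second split takes 0.1 / 0.9 of the remainder,
# which is 10% of the full dataset
intended = two_stage_fractions(tenth, tenth / (1 - tenth))

print("as posted:", [float(f) for f in as_posted])
print("intended: ", [float(f) for f in intended])
```

If this is the issue, the change in main.py would be to pass `test_size=0.1/0.9` (about 0.111) in the second `train_test_split` call and leave everything else as is. Whether that reproduces the expected output exactly depends on how the grader's reference solution was written, so treat it as the first thing to try and not a confirmed fix.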