Given that we want to evaluate the performance of 'n' different machine learning models on the same data, why would the following splitting mechanism be incorrect?

def get_splits():
    df = pd.DataFrame(...)
    rnd = np.random.rand(len(df))
    train = df[rnd < 0.8]
    valid = df[(rnd >= 0.8) & (rnd < 0.9)]
    test = df[rnd >= 0.9]
    return train, valid, test

# Model 1
from sklearn.tree import DecisionTreeClassifier
train, valid, test = get_splits()
...

# Model 2
from sklearn.linear_model import LogisticRegression
train, valid, test = get_splits()
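One way to make the comparison fair is to ensure every call to get_splits() returns the same partition, for example by seeding the random number generator (or by splitting once and reusing the result for all n models). The sketch below assumes a seed of 42 and a small made-up DataFrame purely for illustration; it is not the graded solution, just a minimal example of a repeatable split.

import numpy as np
import pandas as pd

def get_splits(df, seed=42):
    # Seed the generator so every call produces the same partition,
    # letting all n models be trained and evaluated on identical rows.
    rng = np.random.default_rng(seed)
    rnd = rng.random(len(df))
    train = df[rnd < 0.8]
    valid = df[(rnd >= 0.8) & (rnd < 0.9)]
    test = df[rnd >= 0.9]
    return train, valid, test

# Illustrative usage with a hypothetical DataFrame:
df = pd.DataFrame({"x": range(100), "y": np.random.randint(0, 2, 100)})
train, valid, test = get_splits(df)      # Model 1
train2, valid2, test2 = get_splits(df)   # Model 2 sees exactly the same rows

Alternatively, calling the original get_splits() a single time and passing the same train/valid/test to every model achieves the same goal without a seed.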