3.1. Cross-validation: evaluating estimator performance — scikit-learn 1.4.1 documentation

Source: https://scikit-learn.org/stable/modules/cross_validation.html (retrieved 3/25/24, 9:05 PM)

In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function. Here we sample a training set while holding out 40% of the iris data for testing (evaluating) a classifier:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm

>>> X, y = datasets.load_iris(return_X_y=True)
>>> X.shape, y.shape
((150, 4), (150,))

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)

>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))

>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...

When tuning hyperparameters such as C, there is still a risk of overfitting on the test set, which is why a validation set or cross-validation is needed.
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset. The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1. , 0.96..., 0.96..., 1. ])

The mean score and the standard deviation are hence given by:

>>> print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
0.98 accuracy with a standard deviation of 0.02

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

>>> from sklearn import metrics
>>> scores = cross_val_score(
...     clf, X, y, cv=5, scoring='f1_macro')
>>> scores
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])

When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin. It is also possible to use other cross-validation strategies by passing a cross-validation iterator instead, for instance:

>>> from sklearn.model_selection import ShuffleSplit
>>> n_samples = X.shape[0]
>>> cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.977..., 1. ..., 0.955..., 1. ])
Another option is to use an iterable yielding (train, test) splits as arrays of indices, for example:

>>> def custom_cv_2folds(X):
...     n = X.shape[0]
...     i = 1
...     while i <= 2:
...         idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
...         yield idx, idx
...         i += 1
...
>>> custom_cv = custom_cv_2folds(X)
>>> cross_val_score(clf, X, y, cv=custom_cv)
array([1. , 0.973...])

Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) should be learnt from a training set and applied to held-out data for prediction:

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9333...

A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.933..., 0.955..., 0.933..., 0.977...])

The cross_validate function differs from cross_val_score in two ways: it allows specifying multiple metrics for evaluation, and it returns a dict containing fit times and score times in addition to the test score. For single metric evaluation the keys are ['test_score', 'fit_time', 'score_time'], while for multiple metrics they are ['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']. return_train_score is set to False by default to save computation time; set it to True to also get training scores. Fitted estimators can additionally be retained with return_estimator=True, and the train/test indices for each split with return_indices=True.

The multiple metrics can be specified as a list of predefined scorer names:

>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, X, y, scoring=scoring)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro']
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
Or the metrics can be given as a dict mapping scorer names to predefined or custom scoring functions:

>>> from sklearn.metrics import make_scorer
>>> scoring = {'prec_macro': 'precision_macro',
...            'rec_macro': make_scorer(recall_score, average='macro')}
>>> scores = cross_validate(clf, X, y, scoring=scoring,
...                         cv=5, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro',
 'train_prec_macro', 'train_rec_macro']
>>> scores['train_rec_macro']
array([0.97..., 0.97..., 0.99..., 0.98..., 0.98...])

Here is an example of cross_validate using a single metric:

>>> scores = cross_validate(clf, X, y,
...                         scoring='precision_macro', cv=5,
...                         return_estimator=True)
>>> sorted(scores.keys())
['estimator', 'fit_time', 'score_time', 'test_score']

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once (such as KFold) can be used. Note that the result of this computation may differ slightly from results obtained using cross_val_score, as the elements are grouped in different ways.
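The cross_val_predict example itself did not survive extraction; as a sketch (written as a standalone script rather than an interactive session, using the same iris data and classifier as above):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# One out-of-fold prediction per sample, assembled from the 5 test folds.
predicted = cross_val_predict(clf, X, y, cv=5)
print(predicted.shape)  # (150,) -- same length as y

# A single score computed over these pooled predictions is not, in
# general, equal to the mean of the per-fold cross_val_score results.
print(accuracy_score(y, predicted))
```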
KFold divides all the samples into k groups of samples, called folds, of equal sizes (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for test. Example of 2-fold cross-validation on a dataset with 4 samples:

>>> import numpy as np
>>> from sklearn.model_selection import KFold

>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]

Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set. Thus, one can create the training/test sets using numpy indexing:

>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

RepeatedKFold repeats K-Fold n times, producing different splits in each repetition. Example of 2-fold K-Fold repeated 2 times:

>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> random_state = 12883823
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))
...
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

Similarly, RepeatedStratifiedKFold repeats Stratified K-Fold n times with different randomization in each repetition.

LeaveOneOut (or LOO) is a simple cross-validation: each learning set is created by taking all the samples except one, the test set being the sample left out.

>>> from sklearn.model_selection import LeaveOneOut

>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
LeavePOut is very similar to LeaveOneOut, as it creates all the possible training/test sets by removing p samples from the complete set. Unlike LeaveOneOut and KFold, the test sets will overlap for p > 1. Example of Leave-2-Out on a dataset with 4 samples:

>>> from sklearn.model_selection import LeavePOut

>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

ShuffleSplit generates a user-defined number of independent train/test dataset splits. Samples are first shuffled and then split into a pair of train and test sets. It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo random number generator. Here is a usage example:

>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(10)
>>> ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]

ShuffleSplit is thus a good alternative to KFold cross-validation that allows a finer control on the number of iterations and the proportion of samples on each side of the train/test split.
Some classification problems can exhibit a large imbalance in the distribution of the target classes. In such cases it is recommended to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit, to ensure that relative class frequencies are approximately preserved in each train and validation fold.

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set. Here is an example of stratified 3-fold cross-validation on a dataset with 50 samples from two unbalanced classes; we show the number of samples in each class and compare with KFold:

>>> from sklearn.model_selection import StratifiedKFold, KFold
>>> import numpy as np
>>> X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print('train - {} | test - {}'.format(
...         np.bincount(y[train]), np.bincount(y[test])))
train - [30  3] | test - [15  2]
train - [30  3] | test - [15  2]
train - [30  4] | test - [15  1]
>>> kf = KFold(n_splits=3)
>>> for train, test in kf.split(X, y):
...     print('train - {} | test - {}'.format(
...         np.bincount(y[train]), np.bincount(y[test])))
train - [28  5] | test - [17]
train - [28  5] | test - [17]
train - [34] | test - [11  5]

We can see that StratifiedKFold preserves the class ratios (approximately 1 / 10) in both train and test datasets, while plain KFold can leave one class entirely out of a fold.

RepeatedStratifiedKFold can be used to repeat Stratified K-Fold n times with different randomization in each repetition. StratifiedShuffleSplit is a variation of ShuffleSplit which returns stratified splits, i.e. it creates splits by preserving the same percentage for each target class as in the complete set.
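StratifiedShuffleSplit is named above without a surviving example; a minimal sketch (as a standalone script, reusing the same unbalanced 50-sample toy target) showing that each independent random split preserves the 9:1 class ratio on both sides:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.ones((50, 1))
y = np.hstack(([0] * 45, [1] * 5))

# Each of the 3 splits is drawn independently; class frequencies are
# preserved in both the train and test halves of every split.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train, test in sss.split(X, y):
    print(np.bincount(y[train]), np.bincount(y[test]))
    # train - [36 4] | test - [9 1] on every iteration
```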
If the underlying generative process yields groups of dependent samples (for example, data collected from multiple patients, with several samples per patient), the i.i.d. assumption is broken. Such a grouping can be encoded with the groups parameter so that cross-validation tests whether the model generalizes to unseen groups. GroupKFold is a variation of k-fold which ensures that the same group is not represented in both the testing and training sets; it makes it possible to detect overfitting to group-specific features. Imagine you have three subjects, each with an associated number from 1 to 3:

>>> from sklearn.model_selection import GroupKFold

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

Each subject is in a different testing fold, and the same subject is never in both testing and training. Notice that the folds do not have exactly the same size due to the imbalance in the data. Unlike KFold, GroupKFold is not affected by shuffle=True in the same way, since group membership constrains the splits.

StratifiedGroupKFold is a cross-validation scheme that combines both StratifiedKFold and GroupKFold: it tries to preserve the distribution of classes in each split while keeping each group within a single split.

>>> from sklearn.model_selection import StratifiedGroupKFold
>>> X = list(range(18))
>>> y = [1] * 6 + [0] * 12
>>> groups = [1, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6]
>>> sgkf = StratifiedGroupKFold(n_splits=3)
>>> for train, test in sgkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[ 0  2  3  4  5  6  7 10 11 15 16 17] [ 1  8  9 12 13 14]
[ 0  1  4  5  6  7  8  9 11 12 13 14] [ 2  3 10 15 16 17]
[ 1  2  3  8  9 10 12 13 14 15 16 17] [ 0  4  5  6  7 11]
LeaveOneGroupOut is a cross-validation scheme where each split holds out the samples belonging to one specific group, as provided via the groups parameter. Each training set is thus constituted by all the samples except the ones related to that group. This is the same as LeavePGroupsOut with n_groups=1, and the same as GroupKFold with n_splits equal to the number of unique labels passed to the groups parameter. For example, in the case of multiple experiments, LeaveOneGroupOut can be used to create a cross-validation based on the different experiments: we create a training set using the samples of all the experiments except one:

>>> from sklearn.model_selection import LeaveOneGroupOut

>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]

LeavePGroupsOut is similar to LeaveOneGroupOut, but removes the samples related to P groups for each training/test set. Note that the test sets will overlap for P > 1. Example of Leave-2-Group Out:

>>> from sklearn.model_selection import LeavePGroupsOut

>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
The GroupShuffleSplit iterator behaves as a combination of ShuffleSplit and LeavePGroupsOut, and generates a sequence of randomized partitions in which a subset of groups is held out for each split. This is useful when the behavior of LeavePGroupsOut is desired but the number of groups is so large that generating all possible partitions of P groups would be prohibitively expensive; GroupShuffleSplit instead provides a random sample of those splits. Here is a usage example:

>>> from sklearn.model_selection import GroupShuffleSplit

>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "a"]
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]

As with train_test_split and ShuffleSplit, it is possible to obtain a single train/test split rather than iterating over all of them, by advancing the split() generator once:

>>> import numpy as np
>>> from sklearn.model_selection import GroupShuffleSplit

>>> X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001])
>>> y = np.array(["a", "b", "b", "b", "c", "c", "c", "a"])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
>>> train_indx, test_indx = next(
...     GroupShuffleSplit(random_state=7).split(X, y, groups)
... )
>>> X_train, X_test, y_train, y_test = \
...     X[train_indx], X[test_indx], y[train_indx], y[test_indx]
>>> X_train.shape, X_test.shape
((6,), (2,))
>>> np.unique(groups[train_indx]), np.unique(groups[test_indx])
(array([1, 2, 4]), array([3]))

For some datasets, a pre-defined split of the data into training and validation folds already exists. Using PredefinedSplit it is possible to use these folds, e.g. when searching for hyperparameters: the test_fold parameter specifies, for each sample, which test-set fold it belongs to.
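The PredefinedSplit example itself did not survive extraction; a minimal sketch (standalone script): test_fold[i] gives the fold index of sample i, and -1 excludes a sample from every test set:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Sample 0 forms fold 0's test set; samples 1 and 3 form fold 1's test
# set; sample 2 (marked -1) is only ever used for training.
test_fold = np.array([0, 1, -1, 1])
ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())  # 2
for train, test in ps.split():
    print(train, test)
# [1 2 3] [0]
# [0 2] [1 3]
```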
Time series data is characterized by correlation between observations that are near in time, so classical cross-validation iterators such as KFold and ShuffleSplit, which assume the samples are independent and identically distributed, are not appropriate. TimeSeriesSplit is a variation of k-fold which returns the first k folds as the train set and the (k+1)-th fold as the test set; unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Example of 3-split time series cross-validation on a dataset with 6 samples:

>>> from sklearn.model_selection import TimeSeriesSplit

>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]

A note on shuffling: by default no shuffling occurs, including for the (stratified) k-fold cross-validation performed by passing cv=some_integer to cross_val_score or a grid search (keep in mind that train_test_split still returns a random split). The random_state parameter defaults to None, meaning the shuffling will be different every time KFold(..., shuffle=True) is iterated; however, GridSearchCV uses the same shuffling for each set of parameters validated by a single call to its fit method. To get identical results for each split, set random_state to an integer.
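The shuffling note above can be illustrated with a short sketch (standalone script): the default KFold yields deterministic contiguous folds, while shuffle=True with a fixed random_state gives randomized but reproducible folds:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)

# Without shuffle (the default, and what an integer cv uses inside
# cross_val_score), folds are contiguous blocks: [0 1], [2 3], [4 5].
for train, test in KFold(n_splits=3).split(X):
    print(train, test)

# With shuffle=True, fixing random_state makes the splits reproducible
# across calls; leaving it as None would give different folds each time.
kf = KFold(n_splits=3, shuffle=True, random_state=0)
first = [test.tolist() for _, test in kf.split(X)]
second = [test.tolist() for _, test in kf.split(X)]
print(first == second)  # True: same seed, same splits
```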
permutation_test_score offers another way to evaluate the performance of a classifier: it provides a permutation-based p-value, representing how likely the observed performance would be obtained by chance. It generates a null distribution by evaluating the model on n_permutations randomly shuffled copies of the labels (removing any dependency between features and labels), using the cross-validation strategy given by cv for each. The reported p-value is the fraction of permutations whose cross-validation score is at least as good as the score obtained on the original data. Note that permutation_test_score is computed by brute force and internally fits (n_permutations + 1) * n_cv models.
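No code survived for this final section; a hedged sketch (standalone script, reusing the iris classifier from earlier sections, with a deliberately small n_permutations since the procedure fits (n_permutations + 1) * n_cv models):

```python
from sklearn import datasets, svm
from sklearn.model_selection import permutation_test_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# score: cross-validated score on the true labels.
# perm_scores: one score per label permutation (chance-level baseline).
# pvalue: fraction of permutations scoring at least as well as `score`.
score, perm_scores, pvalue = permutation_test_score(
    clf, X, y, cv=5, n_permutations=30, random_state=0)
print(round(score, 2))
print(perm_scores.shape)  # (30,)
print(pvalue)
```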