activity-8-machine-learning-amulya

November 20, 2023

0.1 CSCE 5300: Introduction to Big Data and Data Science

0.1.1 FALL 2023

0.1.2 Activity-8

0.1.3 Submission Guidelines

1. Submit the .ipynb and a PDF (do not use Ctrl+P to generate the PDF; use an external ipynb-to-PDF converter, e.g., vertopal.com) to Canvas.
2. Perform the tasks wherever indicated.
3. Plagiarism should be less than 15%.

0.1.4 Machine Learning

Machine learning is the study of computer algorithms that improve automatically through experience.

Categories of Machine Learning
- Classification: predict a class from observations.
- Clustering: group observations into meaningful groups.
- Regression: predict a value from observations.

Algorithms in MLlib
- Clustering: k-means
- Classification: SVMs, naive Bayes, decision trees
- Regression: linear regression, logistic regression

Decision Tree

A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It is a graphical representation of a decision-making process that mimics a tree structure: each internal node represents a feature (attribute), each branch represents a decision rule, and each leaf node represents the outcome or prediction.

[1]: !pip install pyspark
Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     316.9/316.9 MB 3.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=67824ceae39af0e44018d24ec9a03144efc80d12c3d37eca2a5cf2d45059cb60
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0

[2]: # Import PySpark, the decision tree classifier, and evaluation helpers
     import pyspark
     from pyspark.sql import SparkSession
     from pyspark.ml.feature import VectorAssembler
     from pyspark.ml.classification import DecisionTreeClassifier
     from pyspark.ml.evaluation import MulticlassClassificationEvaluator
     from sklearn.model_selection import train_test_split
     from sklearn.preprocessing import StandardScaler

     # Initialize a Spark session
     spark = SparkSession.builder.appName("WineDecisionTree").getOrCreate()

[3]: # Import pandas and the Wine dataset loader from scikit-learn
     import pandas as pd
     from sklearn.datasets import load_wine

[4]: # Load the Wine dataset from scikit-learn
     wine_data = load_wine()

     # Convert the scikit-learn dataset to a pandas DataFrame
     wine_df = pd.DataFrame(data=wine_data.data, columns=wine_data.feature_names)
     wine_df['class'] = wine_data.target

     # Convert the pandas DataFrame to a Spark DataFrame
     wine_spark_df = spark.createDataFrame(wine_df)

     # Split the data into training and testing sets
     train_data, test_data = wine_spark_df.randomSplit([0.8, 0.2], seed=123)

     # Define the feature columns
     feature_columns = wine_data.feature_names

     # Create a VectorAssembler to assemble the features into a single vector column
     assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
     train_data = assembler.transform(train_data)
     test_data = assembler.transform(test_data)

[5]: # Create a Decision Tree classifier
     dt_classifier = DecisionTreeClassifier(labelCol="class", featuresCol="features")

     # Train the model
     dt_model = dt_classifier.fit(train_data)
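To connect the fitted model back to the tree structure described above (internal nodes test a feature, branches encode decision rules, leaves hold the predictions), the trained tree can be printed. This is a minimal optional sketch, not part of the original assignment, using the dt_model fitted in the previous cell:

[ ]: # Optional sketch: inspect the learned tree structure
     # depth and numNodes summarize the tree; toDebugString lists every split rule
     print("Tree depth:", dt_model.depth, "| number of nodes:", dt_model.numNodes)
     print(dt_model.toDebugString)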
[6]: # Make predictions on the test data
     predictions = dt_model.transform(test_data)

Hyperparameter Tuning

Hyperparameters are parameters that are not learned from the training data but are set prior to training a machine learning model. They control various aspects of the model's behavior and can significantly impact the model's performance. Tuning hyperparameters involves finding the best combination of hyperparameter values to optimize the model's performance on a specific task.

[7]: from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

[8]: param_grid = ParamGridBuilder() \
         .addGrid(dt_classifier.maxDepth, [9, 8, 10]) \
         .addGrid(dt_classifier.maxBins, [16, 32]) \
         .addGrid(dt_classifier.impurity, ["gini", "entropy"]) \
         .build()

[9]: evaluator = MulticlassClassificationEvaluator(labelCol="class",
         predictionCol="prediction", metricName="f1")

[10]: crossval = CrossValidator(estimator=dt_classifier,
          estimatorParamMaps=param_grid,
          evaluator=evaluator,
          numFolds=5)

[12]: cv_model = crossval.fit(train_data)
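Before extracting the best model, it can be useful to see how each grid combination scored during cross-validation. The fitted CrossValidatorModel exposes the mean metric per parameter map in avgMetrics, aligned with the grid built above. An optional sketch, assuming the param_grid and cv_model from the previous cells:

[ ]: # Optional sketch: mean cross-validated F1 for each hyperparameter combination
     # cv_model.avgMetrics is ordered to match the entries of param_grid
     for params, metric in zip(param_grid, cv_model.avgMetrics):
         settings = {p.name: v for p, v in params.items()}
         print(settings, "->", round(metric, 4))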
[13]: best_dt_model = cv_model.bestModel
      best_maxDepth = best_dt_model.getOrDefault("maxDepth")
      best_maxBins = best_dt_model.getOrDefault("maxBins")
      best_impurity = best_dt_model.getOrDefault("impurity")

[14]: best_predictions = best_dt_model.transform(test_data)

[15]: best_f1_score = evaluator.evaluate(best_predictions)

[16]: print(f"Best maxDepth: {best_maxDepth}")
      print(f"Best maxBins: {best_maxBins}")
      print(f"Best impurity: {best_impurity}")
      print(f"Best F1 Score: {best_f1_score}")

Best maxDepth: 9
Best maxBins: 32
Best impurity: gini
Best F1 Score: 0.8040507111935683

Task1: Tune the hyperparameters and find the best F1 score for the above algorithm

[17]: # Type your code here
      param_grid_new = ParamGridBuilder() \
          .addGrid(dt_classifier.maxDepth, [2, 6, 7]) \
          .addGrid(dt_classifier.maxBins, [10, 38]) \
          .addGrid(dt_classifier.impurity, ["gini", "entropy"]) \
          .build()

[19]: crossval_new = CrossValidator(estimator=dt_classifier,
          estimatorParamMaps=param_grid_new,
          evaluator=evaluator,
          numFolds=5)

[21]: cv_model_new = crossval_new.fit(train_data)

[22]: best_dt_model_new = cv_model_new.bestModel

[23]: best_predictions_new = best_dt_model_new.transform(test_data)

[24]: best_f1_score_new = evaluator.evaluate(best_predictions_new)

[25]: print(f"Best F1 Score New: {best_f1_score_new}")

Best F1 Score New: 0.886335733232285

Evaluation metrics

[26]: # Evaluate the model using a MulticlassClassificationEvaluator
      evaluator = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="accuracy")
      accuracy = evaluator.evaluate(predictions)
      print("Accuracy:", accuracy)

Accuracy: 0.7714285714285715

[27]: from pyspark.sql import functions as F

      # Calculate recall for class 1 manually
      true_positives = predictions.filter((predictions["class"] == 1) &
                                          (predictions["prediction"] == 1)).count()
      # False negatives: actual class 1 samples predicted as some other class
      false_negatives = predictions.filter((predictions["class"] == 1) &
                                           (predictions["prediction"] != 1)).count()

      recall_class_1 = true_positives / (true_positives + false_negatives)
      print("Recall for class 1:", recall_class_1)

Recall for class 1: 1.0

[28]: # Calculate precision for class 1 manually
      true_positives = predictions.filter((predictions["class"] == 1) &
                                          (predictions["prediction"] == 1)).count()
      # False positives: samples predicted as class 1 whose true class is not 1
      false_positives = predictions.filter((predictions["class"] != 1) &
                                           (predictions["prediction"] == 1)).count()

      precision_class_1 = true_positives / (true_positives + false_positives)
      print("Precision for class 1:", precision_class_1)

Precision for class 1: 0.5

[29]: f1_score_class_1 = 2 * (precision_class_1 * recall_class_1) / (precision_class_1 + recall_class_1)
      print("F1 score for class 1:", f1_score_class_1)

F1 score for class 1: 0.6666666666666666

Naive Bayes

Naive Bayes is a family of probabilistic machine learning algorithms that are based on Bayes' theorem and are particularly well-suited for classification tasks. These algorithms are called "naive" because they make the simplifying assumption that the features used for classification are conditionally independent given the class label.

[30]: # Import the Naive Bayes classifier
      from pyspark.ml.classification import NaiveBayes
      from pyspark.ml.evaluation import MulticlassClassificationEvaluator

[31]: # Create a Naive Bayes classifier
      nb_classifier = NaiveBayes(labelCol="class", featuresCol="features")

      # Train the model
      nb_model = nb_classifier.fit(train_data)
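Because the fitted model is just a set of probabilities estimated via Bayes' theorem, its learned parameters can be inspected directly. An optional minimal sketch using the nb_model trained above: pi holds the log class priors and theta the log conditional feature probabilities.

[ ]: # Optional sketch: the multinomial NB model stores its learned parameters
     # pi    = log of the class prior probabilities (one entry per class)
     # theta = log of the per-class conditional feature probabilities
     print("Log class priors:", nb_model.pi)
     print("Theta shape:", nb_model.theta.numRows, "classes x", nb_model.theta.numCols, "features")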
[32]: # Make predictions on the test data
      nb_predictions = nb_model.transform(test_data)

[33]: param_grid = ParamGridBuilder() \
          .addGrid(nb_classifier.smoothing, [0.2, 0.4, 0.6, 0.8, 0.1, 0.0]) \
          .build()

[34]: evaluator = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="f1")

      crossval = CrossValidator(estimator=nb_classifier,
          estimatorParamMaps=param_grid,
          evaluator=evaluator,
          numFolds=5)

[35]: cv_model = crossval.fit(train_data)

[36]: best_nb_model = cv_model.bestModel
      best_smoothing = best_nb_model.getOrDefault("smoothing")
      print(f"Best smoothing parameter: {best_smoothing}")

Best smoothing parameter: 0.2

Task2: For the Naive Bayes classifier using cross-validation, find the best smoothing parameter based on accuracy.

[40]: param_grid_Accuracy = ParamGridBuilder() \
          .addGrid(nb_classifier.smoothing, [0.9, 0.4, 0.5, 0.2, 0.1, 0.0]) \
          .build()

[41]: # Type your code here
      evaluator_Accuracy = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="accuracy")

      crossval_Accuracy = CrossValidator(estimator=nb_classifier,
          estimatorParamMaps=param_grid_Accuracy,
          evaluator=evaluator_Accuracy,
          numFolds=5)

[42]: cv_model_Accuracy = crossval_Accuracy.fit(train_data)

[43]: best_nb_model_Accuracy = cv_model_Accuracy.bestModel
      best_smoothing_Accuracy = best_nb_model_Accuracy.getOrDefault("smoothing")
      print(f"Best smoothing parameter: {best_smoothing_Accuracy}")

Best smoothing parameter: 0.9
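As an optional follow-up to Task2, the accuracy-tuned model could also be scored on the held-out test split, reusing the evaluator_Accuracy defined above. A minimal sketch (the variable name nb_best_predictions is illustrative, not from the original notebook):

[ ]: # Optional sketch: test-set accuracy of the NB model selected by accuracy
     nb_best_predictions = best_nb_model_Accuracy.transform(test_data)
     print("Test accuracy:", evaluator_Accuracy.evaluate(nb_best_predictions))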
Random Forest

A Random Forest is an ensemble machine learning technique that combines multiple decision trees to improve predictive accuracy and reduce the risk of overfitting. It is one of the most popular and powerful ensemble methods, used in both classification and regression tasks.

[44]: # Import the Random Forest classifier
      from pyspark.ml.classification import RandomForestClassifier
      from pyspark.ml.evaluation import MulticlassClassificationEvaluator

[45]: # Create a Random Forest classifier
      rf_classifier = RandomForestClassifier(labelCol="class", featuresCol="features")

      # Train the model
      rf_model = rf_classifier.fit(train_data)

[46]: # Make predictions on the test data
      rf_predictions = rf_model.transform(test_data)

[47]: param_grid = ParamGridBuilder() \
          .addGrid(rf_classifier.maxDepth, [5, 10, 15]) \
          .addGrid(rf_classifier.numTrees, [10, 20, 30]) \
          .addGrid(rf_classifier.featureSubsetStrategy, ["auto", "sqrt", "log2"]) \
          .build()

[48]: evaluator = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="accuracy")

      crossval = CrossValidator(estimator=rf_classifier,
          estimatorParamMaps=param_grid,
          evaluator=evaluator,
          numFolds=5)

[49]: cv_model = crossval.fit(train_data)

[50]: best_rf_model = cv_model.bestModel
      best_maxDepth = best_rf_model.getOrDefault("maxDepth")
      best_numTrees = best_rf_model.getOrDefault("numTrees")
      best_featureSubsetStrategy = best_rf_model.getOrDefault("featureSubsetStrategy")

      print(f"Best maxDepth: {best_maxDepth}")
      print(f"Best numTrees: {best_numTrees}")
      print(f"Best featureSubsetStrategy: {best_featureSubsetStrategy}")

Best maxDepth: 5
Best numTrees: 10
Best featureSubsetStrategy: auto

[51]: best_rf_predictions = best_rf_model.transform(test_data)
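Since the forest described above averages many trees, a compact way to see what the ensemble relies on is its aggregated feature importances. An optional minimal sketch using best_rf_model and the feature_columns list defined earlier:

[ ]: # Optional sketch: top features by importance, aggregated across the ensemble
     importances = best_rf_model.featureImportances.toArray()
     for name, score in sorted(zip(feature_columns, importances), key=lambda x: -x[1])[:5]:
         print(f"{name}: {score:.3f}")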
0.1.5 Task3: Perform hyperparameter tuning for the Random Forest classifier using cross-validation and find the best-performing model based on any evaluation metric.

[52]: param_grid_rf = ParamGridBuilder() \
          .addGrid(rf_classifier.maxDepth, [2, 8, 10]) \
          .addGrid(rf_classifier.numTrees, [15, 22, 25]) \
          .addGrid(rf_classifier.featureSubsetStrategy, ["auto", "sqrt", "log2"]) \
          .build()

[54]: evaluator_rf = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="accuracy")

      crossval_rf = CrossValidator(estimator=rf_classifier,
          estimatorParamMaps=param_grid_rf,
          evaluator=evaluator_rf,
          numFolds=5)

[55]: cv_model_rf = crossval_rf.fit(train_data)

[56]: best_rf_model_rf = cv_model_rf.bestModel
      best_maxDepth_rf = best_rf_model_rf.getOrDefault("maxDepth")
      best_numTrees_rf = best_rf_model_rf.getOrDefault("numTrees")
      best_featureSubsetStrategy_rf = best_rf_model_rf.getOrDefault("featureSubsetStrategy")

      print(f"Best maxDepth: {best_maxDepth_rf}")
      print(f"Best numTrees: {best_numTrees_rf}")
      print(f"Best featureSubsetStrategy: {best_featureSubsetStrategy_rf}")

Best maxDepth: 8
Best numTrees: 25
Best featureSubsetStrategy: auto

[57]: best_rf_predictions_rf = best_rf_model_rf.transform(test_data)

[58]: accuracy = evaluator_rf.evaluate(best_rf_predictions_rf)
      print("Accuracy:", accuracy)

Accuracy: 0.9428571428571428

[59]: evaluator_f1_rf = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="f1")
      f1 = evaluator_f1_rf.evaluate(best_rf_predictions_rf)
      print("f1:", f1)

f1: 0.9445238095238095
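Beyond the single accuracy and F1 numbers, a per-class breakdown makes the tuned forest's behaviour easier to read. An optional sketch of a simple confusion-matrix view built from the test predictions above:

[ ]: # Optional sketch: counts of (true class, predicted class) pairs on the test set
     best_rf_predictions_rf.groupBy("class", "prediction").count() \
         .orderBy("class", "prediction").show()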
K-Means Clustering

K-Means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping clusters. It is commonly applied in data analysis, pattern recognition, and data mining to find groups of similar data points within a larger dataset.

[ ]: # Import KMeans
     from pyspark.ml.clustering import KMeans
     from pyspark.ml.feature import VectorAssembler

[ ]: # Define the feature columns
     feature_columns = wine_data.feature_names

     # Create a VectorAssembler to assemble the features
     assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
     wine_features = assembler.transform(wine_spark_df)

     # Specify the number of clusters (k)
     # You can adjust this value based on the requirements
     k = 3

[ ]: # Create a K-Means model
     kmeans = KMeans().setK(k).setSeed(1)
     model = kmeans.fit(wine_features)

[ ]: # Add the cluster predictions to the DataFrame
     clustered_data = model.transform(wine_features)

[ ]: # Cluster centers
     cluster_centers = model.clusterCenters()
     for i, center in enumerate(cluster_centers):
         print(f"Cluster {i} Center: {center}")

Cluster 0 Center: [1.25985294e+01 2.45343137e+00 2.32186275e+00 2.06460784e+01
9.36960784e+01 2.05362745e+00 1.64754902e+00 3.95980392e-01 1.42509804e+00
4.67333332e+00 9.17843137e-01 2.39480392e+00 5.21558824e+02]
Cluster 1 Center: [1.38507407e+01 1.77851852e+00 2.48777778e+00 1.69259259e+01
1.05629630e+02 2.94148148e+00 3.13666667e+00 2.98888889e-01 2.00703704e+00
6.27518519e+00 1.10296296e+00 3.00222222e+00 1.30877778e+03]
Cluster 2 Center: [1.33691837e+01 2.40000000e+00 2.39265306e+00 1.85142857e+01
1.09081633e+02 2.44163265e+00 2.21367347e+00 3.25510204e-01 1.70673469e+00
5.18836735e+00 9.59714286e-01 2.84795918e+00 9.06346939e+02]
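Since the algorithm assigns every sample to exactly one of the k non-overlapping clusters, a quick sanity check is to count how many wines landed in each cluster. An optional minimal sketch using the clustered_data DataFrame from above:

[ ]: # Optional sketch: number of samples assigned to each of the k clusters
     clustered_data.groupBy("prediction").count().orderBy("prediction").show()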
[ ]: import seaborn as sns
     import matplotlib.pyplot as plt

     # Assuming you have already loaded the data and performed clustering
     pandas_df = clustered_data.select("features", "prediction").toPandas()

     predictions = pandas_df["prediction"]
     features = pandas_df["features"]

     data_to_plot = pd.DataFrame()
     data_to_plot["Cluster"] = predictions
     for i in range(len(feature_columns)):
         data_to_plot[feature_columns[i]] = [vector[i] for vector in features]

     # Set the style
     sns.set(style="ticks")

     # Create a pair plot
     sns.pairplot(data_to_plot, hue="Cluster")
     plt.show()
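The pair plot gives a visual impression of the grouping; for a numeric check of cluster quality (and for comparing different values of k), Spark's ClusteringEvaluator can compute a silhouette score. An optional sketch, not part of the original notebook:

[ ]: # Optional sketch: silhouette score for the k=3 clustering
     # (values close to 1 indicate well-separated clusters)
     from pyspark.ml.evaluation import ClusteringEvaluator

     silhouette = ClusteringEvaluator(featuresCol="features",
                                      predictionCol="prediction",
                                      metricName="silhouette").evaluate(clustered_data)
     print("Silhouette:", silhouette)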
0.1.6 Calculating evaluation metrics

Task 4.1: Choose any model from above and calculate the F1-score, precision, recall, and accuracy.

Task 4.2: Display the results using any visualization library.

[64]: # Type your code here
      evaluator_Task_1 = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="accuracy")
      accuracy = evaluator_Task_1.evaluate(best_rf_predictions_rf)
      print("Accuracy:", accuracy)

Accuracy: 0.9428571428571428
[65]: evaluator_Task_2 = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="f1")
      f1 = evaluator_Task_2.evaluate(best_rf_predictions_rf)
      print("F1:", f1)

F1: 0.9445238095238095

[66]: evaluator_Task_3 = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="weightedPrecision")
      precision = evaluator_Task_3.evaluate(best_rf_predictions_rf)
      print("Weighted Precision:", precision)

Weighted Precision: 0.9555555555555556

[67]: # Create a MulticlassClassificationEvaluator
      evaluator_Task_4 = MulticlassClassificationEvaluator(labelCol="class",
          predictionCol="prediction", metricName="weightedRecall")

      # Calculate the weighted recall
      recall = evaluator_Task_4.evaluate(best_rf_predictions_rf)
      print("Weighted Recall:", recall)

Weighted Recall: 0.942857142857143

[69]: import matplotlib.pyplot as plt

      # Create a list of metric names and corresponding values
      metrics = ['Accuracy', 'F1 Score', 'Weighted Precision', 'Weighted Recall']
      values = [accuracy, f1, precision, recall]

      # Create a bar chart
      plt.figure(figsize=(5, 5))
      plt.bar(metrics, values, color=['pink', 'blue', 'red', 'orange'])
      plt.title('Model Evaluation Metrics (Bar Chart)')
      plt.xlabel('Metric')
      plt.ylabel('Score')
      plt.ylim(0, 1)  # Set the y-axis limit to better visualize scores between 0 and 1
      plt.grid(axis='y')

      # Add data points to the chart
      for metric, value in zip(metrics, values):
          plt.text(metric, value, f'{value:.2f}', ha='center', va='bottom')

      plt.show()
[70]: # Stop the Spark session
      spark.stop()