Assignment: Implementing k-Nearest Neighbors (KNN) Algorithm with Normalization
Understanding KNN

The k-nearest neighbors (KNN) algorithm is a form of lazy learning: it does not build an explicit model during training. Instead, it stores the training examples and assigns a new object to the class of the closest objects in the training set (Henderi et al., 2021). How the distance is calculated is crucial and largely determines the algorithm's effectiveness; here the Euclidean distance is used to find the nearest neighbors.

Normalization Data

| Feature 1 | Feature 2 | Class |
|---|---|---|
| 5.1 | 3.5 | Setosa |
| 4.9 | 3.0 | Setosa |
| 6.0 | 3.0 | Virginica |
| 6.7 | 3.3 | Virginica |
| 5.6 | 2.8 | Setosa |
| 5.8 | 2.7 | Virginica |
| 4.7 | 3.2 | Setosa |
| 5.4 | 3.0 | Setosa |
| 5.5 | 2.3 | Setosa |
| 6.3 | 2.9 | Virginica |
| 6.2 | 2.8 | Virginica |
| 5.9 | 3.0 | Virginica |

Min-Max Scaling:

X_norm = (x - x_min) / (x_max - x_min). For Feature 1, x_min = 4.7 and x_max = 6.7, so X_norm = (x - 4.7) / (6.7 - 4.7).

Calculating X_norm for each data point in Feature 1:
X_norm = (5.1 - 4.7) / (6.7 - 4.7) = 0.2
X_norm = (4.9 - 4.7) / (6.7 - 4.7) = 0.1
X_norm = (6.0 - 4.7) / (6.7 - 4.7) = 0.65
X_norm = (6.7 - 4.7) / (6.7 - 4.7) = 1
X_norm = (5.6 - 4.7) / (6.7 - 4.7) = 0.45
X_norm = (5.8 - 4.7) / (6.7 - 4.7) = 0.55
X_norm = (4.7 - 4.7) / (6.7 - 4.7) = 0
X_norm = (5.4 - 4.7) / (6.7 - 4.7) = 0.35
X_norm = (5.5 - 4.7) / (6.7 - 4.7) = 0.4
X_norm = (6.3 - 4.7) / (6.7 - 4.7) = 0.8
X_norm = (6.2 - 4.7) / (6.7 - 4.7) = 0.75
X_norm = (5.9 - 4.7) / (6.7 - 4.7) = 0.6

Calculating X_norm for each data point in Feature 2, where x_min = 2.3 and x_max = 3.5, so X_norm = (x - 2.3) / (3.5 - 2.3):
X_norm = (3.5 - 2.3) / (3.5 - 2.3) = 1
X_norm = (3.0 - 2.3) / (3.5 - 2.3) = 0.583
X_norm = (3.0 - 2.3) / (3.5 - 2.3) = 0.583
X_norm = (3.3 - 2.3) / (3.5 - 2.3) = 0.833
X_norm = (2.8 - 2.3) / (3.5 - 2.3) = 0.417
X_norm = (2.7 - 2.3) / (3.5 - 2.3) = 0.333
X_norm = (3.2 - 2.3) / (3.5 - 2.3) = 0.75
X_norm = (3.0 - 2.3) / (3.5 - 2.3) = 0.583
X_norm = (2.3 - 2.3) / (3.5 - 2.3) = 0
X_norm = (2.9 - 2.3) / (3.5 - 2.3) = 0.5
X_norm = (2.8 - 2.3) / (3.5 - 2.3) = 0.417
X_norm = (3.0 - 2.3) / (3.5 - 2.3) = 0.583
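As an illustration, here is a minimal Python sketch of Min-Max scaling applied to the two feature columns; the helper name min_max_scale is illustrative, not part of the assignment.

```python
# Feature values from the data table above.
feature1 = [5.1, 4.9, 6.0, 6.7, 5.6, 5.8, 4.7, 5.4, 5.5, 6.3, 6.2, 5.9]
feature2 = [3.5, 3.0, 3.0, 3.3, 2.8, 2.7, 3.2, 3.0, 2.3, 2.9, 2.8, 3.0]

def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

f1_norm = min_max_scale(feature1)   # e.g. 5.1 -> 0.2, 4.7 -> 0.0, 6.7 -> 1.0
f2_norm = min_max_scale(feature2)   # e.g. 3.5 -> 1.0, 2.3 -> 0.0
print([round(v, 3) for v in f1_norm])
print([round(v, 3) for v in f2_norm])
```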
Z-score Normalization:

For Feature 1:
Mean (μ_F1) = 5.6167
Standard Deviation (σ_F1) = 0.7552
z = (x - 5.6167) / 0.7552

Calculating the z-score for each data point in Feature 1:
z = (5.1 - 5.6167) / 0.7552 = -0.6823
z = (4.9 - 5.6167) / 0.7552 = -0.9526
z = (6.0 - 5.6167) / 0.7552 = 0.5078
z = (6.7 - 5.6167) / 0.7552 = 1.4406
z = (5.6 - 5.6167) / 0.7552 = -0.1287
z = (5.8 - 5.6167) / 0.7552 = 0.1709
z = (4.7 - 5.6167) / 0.7552 = -1.1972
z = (5.4 - 5.6167) / 0.7552 = -0.4251
z = (5.5 - 5.6167) / 0.7552 = -0.2969
z = (6.3 - 5.6167) / 0.7552 = 0.7801
z = (6.2 - 5.6167) / 0.7552 = 0.6518
z = (5.9 - 5.6167) / 0.7552 = 0.3356

For Feature 2:
Mean (μ_F2) = 3.0167
Standard Deviation (σ_F2) = 0.2749
z = (x - 3.0167) / 0.2749

Calculating the z-score for each data point in Feature 2:
z = (3.5 - 3.0167) / 0.2749 = 1.7503
z = (3.0 - 3.0167) / 0.2749 = -0.6086
z = (3.0 - 3.0167) / 0.2749 = -0.6086
z = (3.3 - 3.0167) / 0.2749 = 1.1393
z = (2.8 - 3.0167) / 0.2749 = -1.2234
z = (2.7 - 3.0167) / 0.2749 = -1.4527
z = (3.2 - 3.0167) / 0.2749 = 0.9947
z = (3.0 - 3.0167) / 0.2749 = -0.6086
z = (2.3 - 3.0167) / 0.2749 = -2.9102
z = (2.9 - 3.0167) / 0.2749 = -0.8379
z = (2.8 - 3.0167) / 0.2749 = -1.2234
z = (3.0 - 3.0167) / 0.2749 = -0.6086
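A corresponding Python sketch for z-score normalization is below. It assumes the population standard deviation (statistics.pstdev); the exact numbers it prints depend on that convention and need not match the hand-worked values above to every decimal place.

```python
import statistics

def z_score(values):
    """Standardize values as z = (x - mean) / std (population std assumed here)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # use statistics.stdev for the sample version
    return [(x - mu) / sigma for x in values]

feature1 = [5.1, 4.9, 6.0, 6.7, 5.6, 5.8, 4.7, 5.4, 5.5, 6.3, 6.2, 5.9]
feature2 = [3.5, 3.0, 3.0, 3.3, 2.8, 2.7, 3.2, 3.0, 2.3, 2.9, 2.8, 3.0]
print([round(z, 4) for z in z_score(feature1)])
print([round(z, 4) for z in z_score(feature2)])
```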
Visualization

To visualize the normalized data, use a scatter plot. Plot the points of each class with a different colour or marker shape to observe the distribution of the data, as sketched below.
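A possible matplotlib sketch of that scatter plot, using the Min-Max normalized values computed above (the colour and marker choices are arbitrary):

```python
import matplotlib.pyplot as plt

# Min-Max normalized (Feature 1, Feature 2, class) tuples from the table above.
points = [
    (0.2, 1.0, "Setosa"), (0.1, 0.583, "Setosa"), (0.65, 0.583, "Virginica"),
    (1.0, 0.833, "Virginica"), (0.45, 0.417, "Setosa"), (0.55, 0.333, "Virginica"),
    (0.0, 0.75, "Setosa"), (0.35, 0.583, "Setosa"), (0.4, 0.0, "Setosa"),
    (0.8, 0.5, "Virginica"), (0.75, 0.417, "Virginica"), (0.6, 0.583, "Virginica"),
]

# One colour/marker per class so the two groups are easy to tell apart.
styles = {"Setosa": ("tab:blue", "o"), "Virginica": ("tab:orange", "s")}
for label, (color, marker) in styles.items():
    xs = [x for x, y, c in points if c == label]
    ys = [y for x, y, c in points if c == label]
    plt.scatter(xs, ys, color=color, marker=marker, label=label)

plt.xlabel("Feature 1 (normalized)")
plt.ylabel("Feature 2 (normalized)")
plt.legend()
plt.show()
```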
Manual Implementation (New Instances)

Normalization of New Instances

| Feature 1 | Feature 2 |
|---|---|
| 5.0 | 3.2 |
| 6.1 | 2.8 |
| 4.8 | 3.1 |

Distance Calculation for Normalized New Instances:

For the first new instance (5.0, 3.2):
Normalized Feature 1: (5.0 - 4.7) / (6.7 - 4.7) = 0.3 / 2 = 0.15
Normalized Feature 2: (3.2 - 2.3) / (3.5 - 2.3) = 0.9 / 1.2 = 0.75

For the second new instance (6.1, 2.8):
Normalized Feature 1: (6.1 - 4.7) / (6.7 - 4.7) = 1.4 / 2 = 0.7
Normalized Feature 2: (2.8 - 2.3) / (3.5 - 2.3) = 0.5 / 1.2 = 0.417

For the third new instance (4.8, 3.1):
Normalized Feature 1: (4.8 - 4.7) / (6.7 - 4.7) = 0.1 / 2 = 0.05
Normalized Feature 2: (3.1 - 2.3) / (3.5 - 2.3) = 0.8 / 1.2 = 0.667

Classification for k = 2:

For instance 1 (0.15, 0.75): Nearest neighbors: Instance 7 (0, 0.75) and Instance 9 (0.4, 0). Majority class: Setosa
For instance 2 (0.7, 0.417): Nearest neighbors: Instance 10 (0.8, 0.5) and Instance 12 (0.6, 0.583). Majority class: Virginica
For instance 3 (0.05, 0.667): Nearest neighbors: Instance 1 (0.2, 1) and Instance 8 (0.35, 0.583). Majority class: Setosa
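A minimal Python sketch of this manual procedure, Euclidean distance plus a majority vote over the k nearest normalized training points, is shown below. The function name knn_predict is illustrative; its internal neighbor ordering need not match the hand-worked list exactly, though the predicted classes agree for k = 2.

```python
import math
from collections import Counter

# Min-Max normalized training data: (feature1, feature2, class label).
train = [
    (0.2, 1.0, "Setosa"), (0.1, 0.583, "Setosa"), (0.65, 0.583, "Virginica"),
    (1.0, 0.833, "Virginica"), (0.45, 0.417, "Setosa"), (0.55, 0.333, "Virginica"),
    (0.0, 0.75, "Setosa"), (0.35, 0.583, "Setosa"), (0.4, 0.0, "Setosa"),
    (0.8, 0.5, "Virginica"), (0.75, 0.417, "Virginica"), (0.6, 0.583, "Virginica"),
]

def knn_predict(query, train, k):
    """Classify a normalized query point by majority vote of its k nearest neighbors."""
    nearest = sorted(train, key=lambda p: math.dist(query, (p[0], p[1])))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

# Normalized new instances from the calculation above.
for query in [(0.15, 0.75), (0.7, 0.417), (0.05, 0.667)]:
    print(query, "->", knn_predict(query, train, k=2))
```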
Classification for k = 4:

For instance 1 (0.15, 0.75): Nearest neighbors: Instances 7, 9, 11, 8. Majority class: Setosa
For instance 2 (0.7, 0.417): Nearest neighbors: Instances 10, 12, 3, 1. Majority class: Setosa (note: equal counts of Setosa and Virginica)
For instance 3 (0.05, 0.667): Nearest neighbors: Instances 1, 8, 5, 6. Majority class: Setosa

Rationale

The choice of the parameter k significantly affects the classification outcomes. A small k, such as 2, is more susceptible to outliers and noise in the dataset, which increases the variance of the classification results. A larger k, such as 4, generally yields more consistent classifications, but it can over-smooth the decision boundaries and lead to misclassification of specific instances. In this case, the balanced classifications observed with k = 4 show greater consistency on this dataset.

Performance Metrics

After training the K-Nearest Neighbors (KNN) model, it was tested on 1000 instances, producing the following confusion matrix:

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | 450 | 46 |
| Actual Positive | 34 | 470 |

Based on this confusion matrix:
True Positives (TP): 470
True Negatives (TN): 450
False Positives (FP): 46
False Negatives (FN): 34
Recall:
Recall = TP / (TP + FN) = 470 / (470 + 34) = 0.9325

Precision (Positive Predictive Value):
Precision = TP / (TP + FP) = 470 / (470 + 46) = 0.9109

Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (470 + 450) / (470 + 450 + 46 + 34) = 920 / 1000 = 0.92

F1 Score:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.9109 × 0.9325) / (0.9109 + 0.9325) ≈ 0.9216
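A short Python sketch that recomputes these metrics from the confusion-matrix counts (the factor of 2 in the F1 score is part of its standard definition):

```python
# Confusion matrix counts from the test run on 1000 instances.
TP, TN, FP, FN = 470, 450, 46, 34

recall = TP / (TP + FN)                              # ~0.9325
precision = TP / (TP + FP)                           # ~0.9109
accuracy = (TP + TN) / (TP + TN + FP + FN)           # 0.92
f1 = 2 * precision * recall / (precision + recall)   # ~0.9216

print(f"Recall={recall:.4f} Precision={precision:.4f} "
      f"Accuracy={accuracy:.4f} F1={f1:.4f}")
```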
Conclusion

The K-nearest neighbors (KNN) algorithm is one of the most basic methods in machine learning and has proven extremely useful for both classification and regression. It belongs to the lazy learning family: it simply stores the training examples and defers computation until a prediction is required. Unlike eager learning algorithms such as decision trees and neural networks, which build a model during training, KNN builds no explicit model at all (Henderi et al., 2021). In essence, KNN relies on similarity measures, chiefly distances, to categorize new cases. When a prediction is requested, KNN finds the k training points closest to the previously unseen data point according to some distance measure, and the class labels of these neighbors determine the classification: the new point is assigned to the most common class among its k neighbors.

The choice of distance metric largely determines the algorithm's performance. Common measures include the Euclidean, Manhattan and Minkowski distances, and each defines differently how the separation between data points is evaluated. For example, the Euclidean distance measures the straight-line distance between two points in feature space, while the Manhattan distance sums the absolute differences of their coordinates (Pandey & Jain, 2017). Understanding KNN also means understanding its key parameter k, the number of neighbors used for classification. The selection of k shapes the model's bias-variance tradeoff: smaller k values may absorb noise present in the data into the predictions, while larger k values average over wider regions and can over-simplify the model.

For the Iris dataset used here, consisting of twelve records with two features and two classes, KNN determines these relationships by computing distances in the feature space. To prevent one feature from dominating another, the dataset is normalized using Min-Max scaling or Z-score normalization. It is also worth investigating how different k values affect classification accuracy (Henderi et al., 2021). By varying k between two and four, we can observe the effect the number of neighbors has on model performance; achieving accurate but non-overfitted predictions requires an optimized k that balances bias and variance.

Mastering KNN therefore involves understanding that it is a lazy learner, knowing the available distance metrics, selecting an appropriate k, and appreciating the role of feature normalization in improving classification accuracy. This holistic view lays the groundwork for applying the KNN algorithm to many different classification problems. This study shows how KNN with normalization can be used to classify Iris flowers: the model performed well, demonstrating that KNN is effective for basic classification tasks, and evaluating the model and experimenting with different k values can tailor it further to specific uses.
References

Henderi, H., Wahyuningsih, T., & Rahwanto, E. (2021). Comparison of Min-Max normalization and Z-score normalization in the K-nearest neighbor (kNN) algorithm to test the accuracy of types of breast cancer. International Journal of Informatics and Information Systems, 4(1), 13-20. http://ijiis.org/index.php/IJIIS/article/view/73

Pandey, A., & Jain, A. (2017). Comparative analysis of the KNN algorithm using various normalization techniques. International Journal of Computer Network and Information Security, 9(11), 36. https://mecs-press.net/ijcnis/ijcnis-v9-n11/IJCNIS-V9-N11-4.pdf