Lab 3 Ashim Neupane MSDS-632-M50 Big Data Dr. Billy Chestnut 10/29/2023
Q 1

Entropy is a measure of disorder within a system. In binary classification, entropy measures how well a split separates the data into two classes (Ruby & Yendapalli, 2020). The entropy of a classification is calculated with the following formula:

Entropy = -p*log2(p) - q*log2(q)

where p is the probability of belonging to one class and q = 1 - p is the probability of belonging to the other. Binary entropy can take any value between 0 and 1. The minimum entropy is 0, which occurs only when the data is pure, i.e., all of it belongs to a single class (p = 0 or p = 1, with 0*log2(0) taken as 0). The maximum entropy is 1, which occurs only when the two classes are equally likely, i.e., p = q = 0.5, so the data is as mixed as possible.

Q 2

In a decision tree algorithm, attributes are picked to split the data based on which variable provides the most information gain. This splitting strategy is greedy: at each step the algorithm makes the locally best choice without guaranteeing a globally optimal tree. The first step in picking an attribute is to calculate the entropy of the dataset, which measures how mixed the class labels are. Then the attribute that provides the highest Information Gain is selected. Information Gain is the amount by which entropy decreases after the split; the higher the Information Gain, the better the split, because it leaves the data more organized. Once the attribute with the best Information Gain is chosen, the data is split into new branches, one for each subset of that attribute's values. This process is repeated until a terminal leaf is created, meaning the branch cannot usefully be split any further. At each step, the algorithm recalculates the information gain of the candidate attributes and chooses the one with the highest Information Gain.

Q 3

The probability that John has swine flu, given his positive test result, is only about 1.9%. Swine flu occurs in 1 in 5,000 people, so the prior probability of having it is 0.02%, and the test is 99% accurate. By Bayes' theorem:

Probability (of having swine flu given a positive test result) = (probability of having swine flu * accuracy of the test) / (probability of having swine flu * accuracy of the test + probability of not having swine flu * (1 - accuracy of the test)).

Plugging in the numbers: (0.0002 * 0.99) / (0.0002 * 0.99 + 0.9998 * 0.01) = 0.000198 / 0.010196 ≈ 0.0194, or about 1.9%. Although the positive result raises John's probability of having swine flu roughly a hundredfold over the prior, the disease is so rare that a single positive test is far from conclusive: most positive results come from the 1% of healthy people the test misclassifies. The doctor should be informed so that a confirmatory follow-up test can be ordered before John is treated for swine flu.

Q 4

Naive Bayes classifiers are considered computationally efficient for high-dimensional problems due to their ability to leverage extremely fast multinomial probability calculations (Lakin, 2021). Naive Bayes classifiers only use the probabilities of each class to compute their
predictions instead of optimizing weights, which makes them an effective classification choice for large datasets. Unlike many other classifiers, Naive Bayes does not require intensive computing resources or lengthy training. Instead, it estimates the class probabilities directly from the training data, which lets it produce predictions quickly without performing any iterative optimization. This is beneficial for large datasets that contain many features. Also, because Naive Bayes assumes the features are conditionally independent given the class, it can easily accommodate new features added to the dataset. This is especially useful in high-dimensional settings, since additional features can be incorporated without retraining the rest of the model.

Q 5

The team should consider using a Decision Tree classifier in this situation. Decision Tree classifiers work well on both categorical and numeric data, so they are a good choice for a dataset that mixes the two. Additionally, Decision Trees can naturally capture interactions between variables, which is useful when dealing with many correlated variables (Ding, Cao & Liu, 2019). Finally, Decision Trees are relatively easy to interpret: a data scientist can follow the logic of the tree to identify the most important variables and see how they interact to produce a prediction. Together, these properties make Decision Trees a strong choice for a classification problem with many correlated variables, many of which are categorical.

Q 6
In this case, the team should consider using logistic regression. Logistic regression is effective when there are many correlated variables because it models the outcome as a linear combination of the predictors, allowing each one to contribute to the classification outcome (Shah et al., 2020). Additionally, the model outputs the probability that a given label applies to a case, and it works well for binary classification problems. This makes logistic regression a preferred choice for tasks that require a probability output, which is exactly what this problem calls for. Therefore, logistic regression is an ideal classifier for the team to use.

Q 7

The quality of a model's predictions can be evaluated using several quantities derived from the confusion matrix: True Positives, True Negatives, False Positives, and False Negatives. In this case, True Positives are the predicted "good" instances that were actually good (671), and True Negatives are the predicted "bad" instances that were actually bad (262). False Positives are instances predicted to be "good" that turned out to be bad (38), while False Negatives are instances predicted to be "bad" that actually turned out to be good (29). The True Positive rate (also known as recall) is the number of True Positives divided by the sum of True Positives and False Negatives: 671 / (671 + 29) = 0.9586, or approximately 96%. To measure how often actual "bad" instances are misclassified as "good", we calculate the False Positive rate: the number of False Positives divided by the sum of False Positives and True Negatives, 38 / (38 + 262) = 0.1267, or approximately 13%. Finally, to measure how often actual "good" instances are missed, we calculate the False Negative rate: the number of False Negatives divided by the sum of False Negatives and True Positives, 29 / (29 + 671) = 0.0414, or approximately 4%.
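As a numerical check of the entropy and information-gain definitions discussed in Q1 and Q2, here is a minimal sketch in Python; the function names and the toy labels are illustrative, not part of the assignment:

```python
import math

def entropy(p):
    """Binary entropy: -p*log2(p) - q*log2(q), with q = 1 - p.
    By convention, 0*log2(0) is treated as 0."""
    q = 1 - p
    terms = [x * math.log2(x) for x in (p, q) if x > 0]
    return -sum(terms) + 0.0  # "+ 0.0" normalizes -0.0 to 0.0

def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted
    entropy of the two child nodes after a split.
    Labels are 0/1 lists (hypothetical example data)."""
    def node_entropy(labels):
        return entropy(sum(labels) / len(labels))
    n = len(parent)
    weighted = (len(left) / n) * node_entropy(left) \
             + (len(right) / n) * node_entropy(right)
    return node_entropy(parent) - weighted

# Entropy is 0 for a pure node and 1 for a 50/50 split (Q1).
print(entropy(1.0))   # 0.0
print(entropy(0.5))   # 1.0

# A split that separates the classes perfectly has the maximum gain (Q2).
print(information_gain([1, 1, 0, 0], [1, 1], [0, 0]))  # 1.0
```

A greedy decision-tree learner simply evaluates `information_gain` for every candidate attribute at a node and splits on the one with the highest value.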
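The Bayes' theorem arithmetic in Q3 can be reproduced in a few lines; the variable names below are my own shorthand for the quantities in the formula:

```python
# Bayes' theorem check for Q3.
p_flu = 1 / 5000          # prior: swine flu prevalence, 0.02%
accuracy = 0.99           # the test is 99% accurate

p_pos_given_flu = accuracy            # true positive rate of the test
p_pos_given_healthy = 1 - accuracy    # false positive rate of the test

# P(flu | +) = P(+|flu)P(flu) / [P(+|flu)P(flu) + P(+|healthy)P(healthy)]
numerator = p_pos_given_flu * p_flu
denominator = numerator + p_pos_given_healthy * (1 - p_flu)
p_flu_given_pos = numerator / denominator

print(round(p_flu_given_pos, 4))  # 0.0194, i.e. about 1.9%
```

The result illustrates the base-rate effect: even a 99%-accurate test yields mostly false positives when the condition affects only 1 person in 5,000.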
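The rates derived in Q7 from the confusion matrix (TP = 671, TN = 262, FP = 38, FN = 29) can be verified directly; this is a sketch, with variable names of my choosing:

```python
# Confusion-matrix rates for Q7.
tp, tn, fp, fn = 671, 262, 38, 29

tpr = tp / (tp + fn)   # true positive rate (recall)
fpr = fp / (fp + tn)   # false positive rate
fnr = fn / (fn + tp)   # false negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TPR = {tpr:.4f}")        # TPR = 0.9586
print(f"FPR = {fpr:.4f}")        # FPR = 0.1267
print(f"FNR = {fnr:.4f}")        # FNR = 0.0414
print(f"Accuracy = {accuracy:.4f}")
```

Note that TPR + FNR = 1 by construction, since both rates share the same denominator (all actually "good" instances).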
References

Ding, C., Cao, X., & Liu, C. (2019). How does the station-area built environment influence Metrorail ridership? Using gradient boosting decision trees to identify non-linear thresholds. Journal of Transport Geography, 77, 70-78.

Lakin, S. M. (2021). Modern considerations for the use of Naive Bayes in the supervised classification of genetic sequence data (Doctoral dissertation, Colorado State University).

Ruby, U., & Yendapalli, V. (2020). Binary cross entropy with deep learning technique for image classification. International Journal of Advanced Trends in Computer Science and Engineering, 9(10).

Shah, K., Patel, H., Sanghvi, D., & Shah, M. (2020). A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augmented Human Research, 5, 1-16.