MIS 655 Topic 8 DQ 2
How does having more records to base a rule on affect the conclusion (i.e., prediction)? What is the effect of more data on sampling chance in the Naïve
Bayes classifier?
More data generally leads to more predictive power. For sophisticated models such as gradient-boosted
trees and random forests, quality data and feature engineering reduce errors drastically. But
simply having more data is not automatically useful; the saying that businesses just need a lot of
data is a myth. Large amounts of data do afford simple models much more power: if you have a
trillion data points, outliers are easier to classify and the underlying distribution of the data
is clearer. If you have ten data points, that is probably not the case, and you will have to
perform more sophisticated normalization and transformation routines on the data before it is
useful. Researchers have demonstrated that massive data can lead to lower estimation variance and
hence better predictive performance. More data also increases the probability that the dataset
contains useful information, which is advantageous.
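The variance effect is easy to see in simulation. Below is a minimal Python sketch (assuming NumPy
is available; the true probability of 0.3 and the sample sizes are purely illustrative) showing
how a count-based estimate of a conditional probability, the kind of estimate Naïve Bayes relies
on, becomes more stable as the number of records grows.

import numpy as np

rng = np.random.default_rng(42)
true_p = 0.3  # hypothetical true conditional probability P(x=1 | class)

for n in [10, 100, 1_000, 10_000]:
    # Draw 200 independent samples of n records each and estimate p from counts
    estimates = [rng.binomial(n, true_p) / n for _ in range(200)]
    print(f"n={n:>6}: mean estimate={np.mean(estimates):.3f}, "
          f"spread across samples={np.std(estimates):.4f}")

The spread of the estimates shrinks roughly with the square root of the sample size, which is the
sense in which more records reduce sampling chance.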
However, not all data is helpful. A good example is the clickstream data used by e-commerce
companies, where a user's actions are monitored and analyzed. Such data includes which parts of a
page are clicked, keywords, cookie data, cursor positions, and which web page components are
visible. This is a lot of data arriving rapidly, but only a portion of it is valuable for
predicting a user's characteristics and preferences. The rest is noise.
The Naïve Bayes classifier is a simple and versatile classifier. Since its computations are cheap,
it works very efficiently on large datasets. More records also reduce sampling chance directly:
because Naïve Bayes estimates each conditional probability from simple counts, larger samples make
those estimates more stable and make it less likely that a rare feature-class combination receives
a zero or misleading probability. However, increasing the number of features in a Naïve Bayes
classifier does not always guarantee an improvement in performance. While more features can
potentially capture more information, they can also lead to overfitting, increased computational
complexity, and the inclusion of irrelevant or redundant information. It is important to carefully
consider the quality and relevance of the features being added to ensure that they contribute
positively to the classifier's performance. Regularization techniques and feature selection
methods can be used to mitigate the potential downsides of increasing the number of features.
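To make the feature point concrete, here is a minimal scikit-learn sketch (assuming scikit-learn
is installed; the synthetic dataset and feature counts are illustrative, not taken from the
sources below). It scores a Gaussian Naïve Bayes model once on all features, most of which are
pure noise, and once after simple univariate feature selection, which typically recovers the lost
accuracy.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# 10 informative features buried among 190 pure-noise features
X, y = make_classification(n_samples=2_000, n_features=200,
                           n_informative=10, n_redundant=0,
                           random_state=0)

print("All 200 features:",
      cross_val_score(GaussianNB(), X, y, cv=5).mean())

# Keep only the 10 features most associated with the label
selected = make_pipeline(SelectKBest(f_classif, k=10), GaussianNB())
print("Top 10 features: ",
      cross_val_score(selected, X, y, cv=5).mean())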
Chawla, V. (2020). Is more data always better for building analytics models? Analytics India
Magazine. https://analyticsindiamag.com/is-more-data-always-better-for-building-analytics-models/
UC Business Analytics R Programming Guide. (n.d.). Naïve Bayes classifier.
https://uc-r.github.io/naive_bayes