16. Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning.
a. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).
Answer hint: This is supervised learning, because the database includes whether the loan was approved or not.
b. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions.
Answer hint: This is unsupervised learning, because there is no apparent outcome (e.g., whether the recommendation was adopted or not).
c. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known.
d. Identifying segments of similar customers.
e. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and nonbankrupt firms.
f. Estimating the repair time required for an aircraft based on a trouble ticket.
g. Automated sorting of mail by zip code scanning.
h. Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.
17. Describe the difference in roles assumed by the validation partition and the test partition.
18. Using the concept of overfitting, explain why, when a model is fit to training data, zero error on those data is not necessarily good.
19. In fitting a model to classify prospects as purchasers or non-purchasers, a certain company drew the training data from internal data that include demographic and purchase information. Future data to be classified will be lists purchased from other sources, with demographic (but not purchase) data included. It was found that "refund issued" was a useful predictor in the training data. Why is this not an appropriate variable to include in the model?
Describe how you can normalize the data in Table 2.7.
Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the square root of the sum of the squared differences. For the first two records in Table 2.7, it is
Can normalizing the data change which two records are farthest from each other in terms of Euclidean distance?
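The distance and normalization questions above can be explored with a small sketch. The three (age, income) records below are hypothetical and are not the values from Table 2.7; z-score normalization (subtract the column mean, divide by the column standard deviation) is assumed here as one common way to normalize, and the helper names are made up for illustration.

```python
import math
from statistics import mean, stdev

# Hypothetical (age, income) records; NOT the values from Table 2.7.
records = [(25, 50_000), (60, 52_000), (30, 90_000)]

def euclidean(a, b):
    """Square root of the sum of squared differences between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def z_normalize(rows):
    """Rescale each column to mean 0 and (sample) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]

def farthest_pair(rows):
    """Return the index pair (i, j) of the two records that are farthest apart."""
    pairs = [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))]
    return max(pairs, key=lambda p: euclidean(rows[p[0]], rows[p[1]]))

print("farthest pair, raw data:       ", farthest_pair(records))
print("farthest pair, normalized data:", farthest_pair(z_normalize(records)))
```

On the raw values, income dominates the distance because its scale is thousands of times larger than that of age, so the farthest pair is decided almost entirely by income. After normalization both variables contribute on a comparable scale, and in this made-up example a different pair of records comes out farthest apart.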
Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment?
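The questions on overfitting and on comparing training with validation accuracy can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-feature data are hypothetical, two training labels are deliberately noisy, and the two "models" (a 1-nearest-neighbor memorizer and a fixed threshold rule) are stand-ins, not methods named in the exercises.

```python
# Hypothetical one-feature data as (x, label) pairs. The underlying pattern is
# "label is 1 when x > 5", but the training labels at x=3 and x=7 are noisy.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
validation = [(1.5, 0), (2.9, 0), (3.2, 0), (6.8, 1), (7.3, 1), (8.5, 1)]

def memorizer(x):
    """1-nearest-neighbor: copy the label of the closest training record."""
    nearest = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest[1]

def simple_rule(x):
    """A simpler model that ignores the noisy training labels."""
    return 1 if x > 5 else 0

def error_rate(model, data):
    """Fraction of records in data that the model misclassifies."""
    wrong = sum(1 for x, y in data if model(x) != y)
    return wrong / len(data)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name,
          "training error:", error_rate(model, train),
          "validation error:", error_rate(model, validation))
```

In this sketch the memorizer reaches zero training error because it reproduces the noise along with the pattern, and it then misclassifies four of the six validation records. The simple rule gets two of the eight training records wrong but classifies every validation record correctly, which is why performance on held-out data, rather than on the training data, is the more useful guide when choosing between models.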