Class Exercises - Aug 31

School

George Washington University

Course

4279

Subject

Industrial Engineering

Date

Feb 20, 2024

Uploaded by ProfSteelCaribou34

Class Exercises

1. Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning.

a. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).
Answer: Supervised learning, because there is a specific target (the loan decision), predicted from demographic and financial data.

b. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions.
Answer: Unsupervised learning, because this is a classic recommendation task with no specific target.

c. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known.
Answer: Supervised.

d. Identifying segments of similar customers.
Answer: Unsupervised.

e. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and nonbankrupt firms.
Answer: Supervised.

f. Estimating the repair time required for an aircraft based on a trouble ticket.
Answer: Supervised.

g. Automated sorting of mail by zip code scanning.
Answer: Supervised.

h. Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously.

Class Exercises

1. Using the concept of overfitting, explain why when a model is fit to training data, zero error with those data is not necessarily good.

2. Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment?

3. A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect would be removed?

4. Standardize (normalize) the data below, showing calculations. Confirm your results in JMP.

TABLE 2.7

Age    Income ($)
25     49,000
56     156,000
65     99,000
32     192,000
41     39,000
49     57,000

Class Exercises

Linear Regression Models using JMP and R

Data exploration and Partitioning

1. Linear Regression - West Roxbury Housing Data
   Fitting a regression model
   Partition data
   Predict error on Validation Data
   Partitioning data, and computing error on validation data in R

2. Use the Auto data set, and build models of different complexity with mpg as dependent variable, and horsepower as independent variable. Examine the error on the validation data set as the complexity of the model is increased.

3. The dataset ToyotaCorolla.jmp contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications.

a. Explore the data using the data visualization (e.g., Graph > Scatterplot Matrix and Graph > Graph Builder) capabilities of JMP. Which of the pairs among the variables seem to be correlated? (Refer to the guides and videos at jmp.com/learn, under Graphical Displays and Summaries, for basic information on how to use these platforms.)

b. We plan to analyze the data using various data mining techniques described in future chapters. Prepare the dataset for data mining techniques of supervised learning by creating partitions using the JMP Pro Make Validation Column utility (from the Cols menu). Use the following partitioning percentages: training (50%), validation (30%), and test (20%). Describe the roles that these partitions will play in modeling.
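
Two of the mechanical steps in these exercises, standardizing the Table 2.7 data and splitting 1436 records into 50% / 30% / 20% partitions, can be sketched in Python with NumPy as a cross-check on the JMP results. This is a minimal sketch, not the JMP procedure: the function names `standardize` and `make_validation_column` are hypothetical, the use of the sample standard deviation (dividing by n - 1) is an assumption, and the random assignment will not match the rows JMP happens to pick.

```python
import numpy as np

# Table 2.7 from the exercises: Age and Income ($) for six records.
age = np.array([25, 56, 65, 32, 41, 49], dtype=float)
income = np.array([49_000, 156_000, 99_000, 192_000, 39_000, 57_000], dtype=float)


def standardize(x):
    """Return z-scores: subtract the mean, divide by the standard deviation.

    ddof=1 (sample standard deviation, n - 1 in the denominator) is an
    assumption; switch to ddof=0 for the population form.
    """
    return (x - x.mean()) / x.std(ddof=1)


def make_validation_column(n_records, fractions=(0.5, 0.3, 0.2), seed=0):
    """Randomly label each record as training, validation, or test.

    Hypothetical helper for illustration; it is not JMP's
    Make Validation Column utility.
    """
    labels = np.array(["training", "validation", "test"])
    n_train = round(fractions[0] * n_records)
    n_valid = round(fractions[1] * n_records)
    n_test = n_records - n_train - n_valid  # remainder goes to the test set
    column = np.repeat(labels, [n_train, n_valid, n_test])
    rng = np.random.default_rng(seed)
    return rng.permutation(column)  # shuffle so assignment is random


age_z = standardize(age)
income_z = standardize(income)
partition = make_validation_column(1436)  # record count given for ToyotaCorolla

# Print the standardized columns side by side, then the partition sizes.
for a, i in zip(age_z, income_z):
    print(f"{a:8.3f} {i:8.3f}")
for name in ("training", "validation", "test"):
    print(name, int((partition == name).sum()))
```

After standardizing, each column has mean 0 and standard deviation 1, so Age and Income end up on a comparable scale. In the partition, the training rows are used to fit models, the validation rows to compare and tune them, and the test rows to estimate how the chosen model performs on data it has not seen.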