Statistical Learning: Homework 1 for MATH 4322 Fall 2023

Homework 1 - MATH 4322 Fall 2023
Dr. Cathy Poliak

Instructions
1. Due date: August 31, 2023, 11:59 PM
2. Answer the questions fully for full credit.
3. Scan or type your answers and submit only one file. (If you submit several files, only the most recent upload will be graded.)
4. Preferably save your file as a PDF before uploading.
5. Submit in Canvas under Homework 1.
6. These questions are from An Introduction to Statistical Learning, second edition, by James et al., chapter 2.
7. The information in the gray boxes is R code that you can use to answer the questions.

Problem 1
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

a) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
Regression problem; prediction; n = 52; p = 3

b) An online store is determining whether or not a customer will purchase additional items. This online store collected data from 1500 customers and looked at cost of initial purchase, if there was a special offer, type of item purchased, number of times the customer logged into their account, and if they purchased additional items.
Classification problem; prediction; n = 1500; p = 4 (the fifth recorded variable, whether the customer purchased additional items, is the response)

Problem 2
This is an exercise about bias, variance, and MSE. Suppose we have $n$ independent Bernoulli trials with true success probability $p$. Consider two estimators of $p$: $\hat{p}_1 = \hat{p}$, where $\hat{p}$ is the sample proportion of successes, and $\hat{p}_2 = 1/2$, a fixed constant.

a) Find the expected value and bias of each estimator.
b) Find the variance of each estimator.
c) Find the MSE of each estimator and compare them by plotting against the true $p$. Use $n = 4$. Comment on the comparison.
[Plot of MSE against p. Red: MSE_p1. Blue: MSE_p2.]
For most values of p, p1 has the smaller MSE. When p is near 1/2, p2 has the smaller MSE.

Problem 3
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
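As a check on the Problem 2(c) comparison, recall that MSE = variance + bias^2. The sample proportion is unbiased with variance p(1 - p)/n, so its MSE is p(1 - p)/n. The constant 1/2 has zero variance and bias 1/2 - p, so its MSE is (1/2 - p)^2. The sketch below draws both curves for n = 4 and computes where they cross. The function names mse_p1 and mse_p2 and the variable crossover are illustrative choices, not part of the assignment.

```r
# MSE of the sample proportion: unbiased, variance p(1 - p)/n
mse_p1 <- function(p, n) p * (1 - p) / n

# MSE of the constant estimator 1/2: zero variance, bias 1/2 - p
mse_p2 <- function(p) (0.5 - p)^2

n <- 4
p <- seq(0, 1, by = 0.01)

# Red and blue match the colours named in the answer to part (c)
plot(p, mse_p1(p, n), type = "l", col = "red", ylim = c(0, 0.25),
     xlab = "true p", ylab = "MSE")
lines(p, mse_p2(p), col = "blue")
legend("top", legend = c("MSE_p1", "MSE_p2"),
       col = c("red", "blue"), lty = 1)

# Setting p(1 - p)/n = (1/2 - p)^2 gives p = 1/2 +/- 1/(2 * sqrt(n + 1))
crossover <- (1 + c(-1, 1) / sqrt(n + 1)) / 2
round(crossover, 3)
```

Between the two crossover points the constant estimator has the smaller MSE; outside them the sample proportion does.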

Parametric methods make an assumption about the functional form of the model, while non-parametric methods make no explicit assumption about its form. A parametric approach to regression or classification reduces the problem of estimating the function f to estimating a small set of parameters, which leads to a simpler model structure. It also needs fewer observations to fit well than a non-parametric method does. The disadvantage of a parametric approach is that the assumed form may not match the true underlying function, which leads to errors in the estimate.

Problem 4
This exercise involves the Auto data set in the ISLR package. Make sure that the missing values have been removed from the data.

(a) Which of the predictors are quantitative, and which are qualitative?
Qualitative: name and origin
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, and year

(b) What is the range of each quantitative predictor? You can answer this using the summary() function.
summary(Auto$mpg): 9.00 to 46.60
summary(Auto$acceleration): 8.00 to 24.80
summary(Auto$cylinders): 3.000 to 8.000
summary(Auto$year): 70.00 to 82.00
summary(Auto$displacement): 68.0 to 455.0
summary(Auto$horsepower): 46.0 to 230.0
summary(Auto$weight): 1613 to 5140

(c) What is the mean and standard deviation of each quantitative predictor?

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
> auto.new = Auto[-c(10:85),]
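Parts (c) and (d) ask for the same summaries on two versions of the data, so a single helper can serve both. The following is a minimal sketch, assuming the ISLR package is installed. The helper name describe_quant and the object names auto and quant_vars are illustrative, not part of the assignment.

```r
library(ISLR)  # assumption: the ISLR package is installed and provides Auto

auto <- na.omit(Auto)  # drop any rows with missing values

# Quantitative predictors as identified in part (a)
quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

# One row per variable; columns are min, max, mean, and standard deviation
describe_quant <- function(df) {
  t(sapply(df[, quant_vars], function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

describe_quant(auto)               # parts (b) and (c): full data

auto.new <- auto[-c(10:85), ]      # part (d): drop observations 10 to 85
describe_quant(auto.new)
```

Calling the helper on auto and then on auto.new makes it easy to compare the two tables side by side.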


(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Automobiles with more cylinders tend to have greater displacement, weight, and horsepower. The same automobiles tend to have lower values of acceleration and lower miles per gallon (mpg). The relationships between mpg and displacement, weight, and horsepower are regular enough that these variables look useful for predicting mpg.

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Yes, some variables show positive and others negative correlations with mpg. For instance, there is a positive association between year and mpg: as the model year advances, mpg generally improves. On the other hand, there is a negative correlation between horsepower and mpg: an increase in horsepower usually corresponds to a decrease in mpg.
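The associations described in (e) and (f) can be checked with a scatterplot matrix and the correlations with mpg. This is a sketch under the assumption that the ISLR package is installed; the object names quant and cor_with_mpg are illustrative.

```r
library(ISLR)  # assumption: the ISLR package is installed and provides Auto

# Keep only the quantitative variables from Problem 4(a)
quant <- na.omit(Auto)[, c("mpg", "cylinders", "displacement", "horsepower",
                           "weight", "acceleration", "year")]

# Scatterplot matrix of every pair of quantitative variables
pairs(quant)

# Correlation of each variable with mpg; the sign shows the direction
cor_with_mpg <- cor(quant)[, "mpg"]
round(cor_with_mpg, 2)
```

Correlation only measures linear association, so the scatterplots remain the better guide where a relationship is curved.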

Problem 5
This exercise relates to the College data set, which can be found in the file College.csv attached to this homework set in Blackboard. It contains a number of variables for 777 different universities and colleges in the US. The variables are
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10% of high school class
• Top25perc : New students from top 25% of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.

a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data. You can also import this data set into RStudio by using the Import Dataset → From Text drop-down list in the Environment window.

b) Look at the data using the View() function. You should notice that the first column is just the name of each university. We will not use this column as a variable, but it may be handy to have these names for later. Try the following commands in R:
rownames(college) <- college[, 1]
college <- college[, -1]
View(college)


If you are getting an error, make sure your data frame is named with a lowercase “c”. Give a brief description of what you see in the data frame.

c) Use the summary() function to produce a numerical summary of the variables in the data set. Are there any variables that do not show a numerical summary?
No, all the variables show a numerical summary.
Type in the following in R:
college$Private <- as.factor(college$Private)

d) Use the pairs() function to produce a scatterplot matrix of the first five columns or variables of the data set. Describe any relationships you see in these plots.
pairs(college[,1:5])
There are positive correlations between Apps and Accept, Apps and Enroll, and Accept and Enroll.

e) Use the plot() function to produce a plot of Outstate versus Private. What type of plot was produced? Give a description of the relationship. Hint: Outstate is on the y-axis.
plot(college$Outstate ~ college$Private, xlab = "Private", ylab = "Outstate")
It produced side-by-side boxplots. Private schools tend to have higher out-of-state tuition than public schools.

f) Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Type in the following in R:
Elite <- rep("No", nrow(college))      #this gives a column of No's for the same number of rows
Elite[college$Top10perc > 50] <- "Yes" #changes to Yes if top 10% is greater than 50
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)  #adds Elite as a column
Use the summary() function to see how many elite universities there are.
There are 78 elite universities.
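The binning step in part f) can also be written as a single ifelse() call. The sketch below runs on a small made-up vector rather than College.csv, so the values in top10 are hypothetical and the counts it prints say nothing about the real data.

```r
# Hypothetical Top10perc values for five schools (not taken from College.csv)
top10 <- c(23, 16, 60, 89, 50)

# "Yes" when more than 50% of new students come from the top 10%, else "No";
# fixing the levels keeps the order No, Yes even if one group is empty
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

# summary() on a factor returns the count in each level
summary(elite)
```

Note that a value of exactly 50 falls in the "No" group, because the rule is strictly greater than 50.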

Problem 6
This exercise involves the Boston housing data set.

(a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. You may have to install the ISLR2 library and then load it.
library(ISLR2)
Now the data set is contained in the object Boston.
Boston
Read about the data set:
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
506 rows and 13 columns. The 506 rows are the observations (housing values in 506 suburbs of Boston). The 13 columns are the variables of the data set: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, lstat, and medv.

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston)
age (x) vs nox (y): as the proportion of owner-occupied units built before 1940 rises, the nitrogen oxides concentration tends to rise as well.
medv (x) vs lstat (y): tracts with a higher percentage of lower-status population tend to have lower median home values.


(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Yes, crim has a negative linear relationship with medv and dis, and a positive linear relationship with indus, nox, rad, and tax. For example, in the plot of crim vs dis, crime appears to be higher close to the employment centers.

(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
The crime rate ranges from 0.00632 to 88.91620; 8 of the census tracts of Boston appear to have particularly high crime rates.
The tax rate ranges from 187.0 to 711.0; 137 of the census tracts of Boston appear to have particularly high tax rates.

The pupil-teacher ratio ranges from 12.60 to 22.00; 183 of the census tracts of Boston appear to have particularly high pupil-teacher ratios.

(e) How many of the census tracts in this data set bound the Charles river?
35 of the census tracts in this data set bound the Charles River.

(f) What is the median pupil-teacher ratio among the towns in this data set?
The median pupil-teacher ratio among the towns in this data set is 19.05.

(g) Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
Two census tracts of Boston, 399 and 406, share the lowest median value of owner-occupied homes. Neither tract has the highest crime rate. Both show a low level of investment, reflected in minimal development. Neither tract is located alongside the river. Both nox values are in the upper quartile, likely due to the tracts' proximity to highways. In both, the average number of rooms per dwelling is in the lower quartile, implying smaller living spaces. Both rad values are at the maximum, indicating that these areas are on or near the highways. Both pupil-teacher ratios are also at the maximum, which suggests potential underinvestment in education resources.

(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
Among the 13 tracts where the average number of rooms per dwelling exceeds 8, the crime rate is notably low. The pupil-teacher ratios fall within a narrower band than the variable's full range. With the exception of a single tract, property tax rates are low. Additionally, the proportion of non-retail business acres per town is very low, apart from two tracts, implying a predominance of residential areas. These tracts appear to be situated away from highways. The majority of houses were constructed prior to 1940, although a few exceptions exist.
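The counts and summaries quoted in parts (e) through (h) can be reproduced with a few one-line queries. This is a sketch under the assumption that the ISLR2 package is installed; the object names river_tracts, median_ptratio, lowest_medv, over7, and over8 are illustrative.

```r
library(ISLR2)  # assumption: the ISLR2 package is installed and provides Boston

# (e) chas is 1 when the tract bounds the Charles River
river_tracts <- sum(Boston$chas == 1)

# (f) median pupil-teacher ratio
median_ptratio <- median(Boston$ptratio)

# (g) row numbers of the tracts with the lowest median home value
lowest_medv <- which(Boston$medv == min(Boston$medv))
Boston[lowest_medv, ]   # their values for the other predictors
summary(Boston)         # overall ranges and quartiles for comparison

# (h) tracts averaging more than seven and more than eight rooms per dwelling
over7 <- sum(Boston$rm > 7)
over8 <- sum(Boston$rm > 8)
summary(Boston[Boston$rm > 8, ])
```

Comparing summary(Boston[Boston$rm > 8, ]) with summary(Boston) is one way to support the comments in part (h).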
