HW2_assignment (pdf, 4 pages)
School: University of Illinois, Urbana-Champaign
Course: 480 (Computer Science)
Date: Feb 20, 2024
Uploaded by SargentCheetahPerson730
Homework 2 Assignment

This assignment contains TWO data analyses. Each question is 10 points.

Predicting House Prices

```r
# Read in the data
homes <- read.csv("homes2004.csv")

# conditional vs marginal value
par(mfrow=c(1,2))  # 1 row, 2 columns of plots
hist(homes$VALUE, col="grey", xlab="home value", main="")
plot(VALUE ~ factor(BATHS), col=rainbow(8),
     data=homes[homes$BATHS < 8,],
     xlab="number of bathrooms", ylab="home value")
```

[Figure: left panel, histogram of home value; right panel, home value by number of bathrooms]

```r
# You can try some quick plots. Do more to build your intuition!
# par(mfrow=c(1,2))
# plot(VALUE ~ STATE, data=homes,
#      col=rainbow(nlevels(homes$STATE)),
#      ylim=c(0,10^6), cex.axis=.65)
# plot(gt20dwn ~ FRSTHO, data=homes,
#      col=c(1,3), xlab="Buyer's First Home?",
#      ylab="Greater than 20% down")
```

Question 1

Regress log price onto all variables but mortgage. What is the R²? How many coefficients are used in this model, and how many are significant at 10% FDR? Re-run the regression with only the significant covariates, and compare its R² to that of the full model.

```r
library(knitr)  # library for nice R markdown output

# regress log(LPRICE) on everything except AMMORT
pricey <- glm(log(LPRICE) ~ . - AMMORT, data=homes)

# extract p-values
pvals <- summary(pricey)$coef[-1, 4]

# example: variables insignificant at alpha=0.05
names(pvals)[pvals > .05]
# you'll want to replace .05 with your FDR cutoff
# you can use the `-AMMORT' type syntax to drop variables
```

Question 2

Fit a regression for whether the buyer had more than 20 percent down (onto everything but AMMORT and LPRICE). Interpret the effects for Pennsylvania state, first-time home buyers, and the number of bathrooms. Add and describe an interaction between first-time home buyers and the number of baths.

```r
# create a variable for the downpayment being greater than 20%
homes$gt20dwn <- factor(0.2 < (homes$LPRICE - homes$AMMORT) / homes$LPRICE)
```

Question 3

Focus only on a subset of homes worth more than $100k. Train the full model from Question 1 on this subset. Predict the left-out homes using this model. What is the out-of-sample fit (i.e., R²)? Explain why you get this value.

```r
subset <- which(homes$VALUE > 100000)

# Use the code "deviance.R" to compute OOS deviance
source("deviance.R")

# Null model has just one mean parameter
ybar <- mean(log(homes$LPRICE[-subset]))
D0 <- deviance(y=log(homes$LPRICE[-subset]), pred=ybar)

# - don't forget family="binomial"!
# - use +A*B in the formula to add A interacting with B
```
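For the 10% FDR step in Question 1, one standard approach is the Benjamini-Hochberg cutoff. The sketch below is not part of the course starter code: `fdr_cut` is a hypothetical helper that returns the largest sorted p-value p(k) satisfying p(k) <= (k/N)·q, which is the BH rejection threshold at FDR level q. The p-values in the example are fabricated, not the homes2004 results.

```r
# Benjamini-Hochberg cutoff at FDR level q (hypothetical helper,
# assuming a vector of regression p-values like `pvals` above)
fdr_cut <- function(pvals, q = 0.1) {
  pvals <- sort(pvals)
  N <- length(pvals)
  k <- which(pvals <= q * (1:N) / N)   # ranks passing the BH line
  if (length(k) == 0) return(0)        # nothing is significant
  pvals[max(k)]                        # largest p-value under the line
}

# Toy example with made-up p-values:
p <- c(0.001, 0.004, 0.02, 0.03, 0.2, 0.5, 0.8)
cutoff <- fdr_cut(p, q = 0.1)  # 0.03
# significant covariates would then be names(pvals)[pvals <= cutoff]
```

You would apply `fdr_cut(pvals, q = 0.1)` to the p-values extracted from the `pricey` regression and keep the covariates at or below the returned cutoff.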
Amazon Reviews

We will use the same datasets (Review_subset.csv, word_freq.csv, and words.csv) as in Assignment 1.

```r
data <- read.table("Review_subset.csv", header=TRUE)
words <- read.table("words.csv")
words <- words[,1]
doc_word <- read.table("word_freq.csv")
names(doc_word) <- c("Review ID", "Word ID", "Times Word")
```

Question 4

We want to build a predictor of customer ratings from product reviews and product attributes. For these questions, you will fit a LASSO path of logistic regression using a binary outcome:

Y = 1 for 5 stars (1)
Y = 0 for less than 5 stars (2)

Fit a LASSO model with only product categories. The starter code prepares a sparse design matrix of 142 product categories. What is the in-sample R² for the AICc slice of the LASSO path? Why did we use standardize=FALSE?

```r
# Let's define the binary outcome:
# Y=1 if the rating was 5 stars, Y=0 otherwise
Y <- as.numeric(data$Score == 5)

# (a) Use only product category as a predictor
library(gamlr)
## Loading required package: Matrix

source("naref.R")

# Cast the product category as a factor
data$Prod_Category <- as.factor(data$Prod_Category)
class(data$Prod_Category)
## [1] "factor"

# Since product category is a factor, we want to relevel it for the LASSO.
# We want each coefficient to be an intercept for each factor level,
# rather than a contrast. Check the extra slides at the end of the lecture.
# Look inside naref.R: this function relevels the factors for us.
data$Prod_Category <- naref(data$Prod_Category)

# Create a design matrix using only products
```
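Question 4 asks for the in-sample R² at the AICc slice of the path. A minimal sketch of the deviance-based R² computation follows; the helper `dev_r2` is hypothetical (not in the starter code), and the commented lines assume the gamlr package's `AICc()` function and the `$deviance` slot of a fitted gamlr object, where the first path segment (largest lambda) is the null, intercept-only fit.

```r
# Deviance-based R^2: one minus residual deviance over null deviance
# (hypothetical helper, not part of the course starter code)
dev_r2 <- function(dev, dev0) 1 - dev / dev0

# With a fitted gamlr object such as `lasso1` from Question 4:
# seg <- which.min(AICc(lasso1))                       # AICc-selected segment
# dev_r2(lasso1$deviance[seg], lasso1$deviance[1])     # in-sample R^2

# Numeric check with fabricated deviances:
dev_r2(60, 100)  # 0.4
```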
```r
products <- data.frame(data$Prod_Category)
x_cat <- sparse.model.matrix(~., data=products)[,-1]
# Sparse matrix, storing 0's as .'s
# Remember that we removed the intercept so that each category
# is standalone, not a contrast relative to the baseline category
colnames(x_cat) <- levels(data$Prod_Category)[-1]
# name the columns of the sparse design matrix after the product categories

# Let's fit the LASSO with just the product categories
lasso1 <- gamlr(x_cat, y=Y, standardize=FALSE,
                family="binomial", lambda.min.ratio=1e-3)
```

Question 5

Fit a LASSO model with both the product categories and the review content (i.e., the frequency of occurrence of words). Use AICc to select lambda. How many words were selected as predictive of a 5-star review? Which 10 words have the most positive effect on the odds of a 5-star review? What is the interpretation of the coefficient for the word "discount"?

```r
# Fit a LASSO with all 142 product categories and 1125 words
spm <- sparseMatrix(i=doc_word[,1], j=doc_word[,2], x=doc_word[,3],
                    dimnames=list(id=1:nrow(data), words=words))
dim(spm)  # 13319 reviews using 1125 words
## [1] 13319  1125

x_cat2 <- cbind(x_cat, spm)
lasso2 <- gamlr(x_cat2, y=Y, lambda.min.ratio=1e-3, family="binomial")
```

Question 6

Continue with the model from Question 5. Run cross-validation to obtain the best lambda value, i.e., the one that minimizes OOS deviance. How many coefficients are nonzero then? How many are nonzero under the 1se rule?

```r
cv.fit <- cv.gamlr(x_cat2, y=Y, lambda.min.ratio=1e-3,
                   family="binomial", verb=TRUE)
## fold 1,2,3,4,5,done.
```
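For Questions 5 and 6, you need the most positive word coefficients and the nonzero counts under the two CV rules. The sketch below uses a hypothetical helper `top_pos` (plain R, not from the starter code); the commented lines assume gamlr's `coef()` interface, where `coef(lasso2)` selects the AICc segment by default and `coef(cv.fit, select=...)` accepts "min" and "1se". The toy coefficients at the end are fabricated.

```r
# Return the k largest positive entries of a named coefficient vector
# (hypothetical helper, not part of the course starter code)
top_pos <- function(beta, k = 10) {
  head(sort(beta[beta > 0], decreasing = TRUE), k)
}

# With the fitted objects above (assuming gamlr's coef() interface):
# B <- drop(coef(lasso2))                 # AICc selection is gamlr's default
# word_beta <- B[words]                   # keep the word columns only
# top_pos(word_beta, 10)                  # ten strongest positive odds effects
# sum(coef(cv.fit, select="min") != 0)    # nonzero at the CV-min lambda
# sum(coef(cv.fit, select="1se") != 0)    # nonzero under the 1se rule

# Toy check with fabricated coefficients:
top_pos(c(good = 1.2, bad = -0.5, great = 2.0, ok = 0.1), k = 2)
```

Because the model is logistic, each selected word coefficient is interpreted on the log-odds scale: a coefficient b multiplies the odds of a 5-star review by exp(b) per occurrence of the word, holding the other covariates fixed.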