---
title: "Lab 4"
author: "Kimberly Boswell"
date: "2024-01-31"
output: pdf_document
---

## Special Symbols

When you copy code containing certain characters like `-` and `^`, it may throw an error, so make sure to type out your code yourself. Secondly, characters like \#, \%, and \$ may also throw errors when knitted, so be sure to put a backslash in front of them. These symbols are used for other purposes in Markdown, so the backslash instructs R to print them as is.

### Mathematical Notation in Markdown

An advantage of R Markdown is that it combines a plain-text editor, an R console, and a LaTeX compiler. If you'd like to include any sort of mathematical equation in Markdown, you can use several things:

(i) a pair of dollar signs (inline)
(ii) a pair of double dollar signs (displayed on a new line)
(iii) the align environment

Say you want to write xi in subscript form; inline it would look like $x_i$ or $\alpha$, or the sum of the $x_i$'s on a new line:

$$ \sum\limits_{i=1}^n x_i $$

As you type, you should see the preview, which should let you know if your formatting is correct. If there were a series of equations, we could use the align environment. Adding an asterisk removes the numbering. If you wanted to number only one line, you could use the `\notag` command on the others. We use the ampersand for alignment:

\begin{align}
y &= mx + c \notag \\
  &= b_2 x + b_1
\end{align}

If you wanted to write a hypothesis test for whether the marginal effect of the explanatory variable of a standard linear model is 0 or not, how could you write this out?
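One way to typeset such a test (a sketch, not the lab's prescribed answer; here $\beta_2$ denotes the slope in the standard linear model $y = \beta_1 + \beta_2 x + e$):

$$ H_0: \beta_2 = 0 \quad \text{vs.} \quad H_1: \beta_2 \neq 0 $$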
## Continuation of Food Dataset Processing

Yet again, we will start by using the food dataset to illustrate, and then segue into a novel dataset for further practice.

```{r message = F}
library(POE5Rdata)
library(margins)
library(tseries)
data(food)
reg <- lm(food_exp ~ income, food)
```

## Marginal Effect

For this, we can use the margins package (if you don't have it installed, that must be done first using `install.packages("margins")`). Thereafter, you can use `library(margins)` to call it. This gives us the average marginal effect. Try it on `reg` and see:

```{r}
margins(reg)
```

Now, if we wanted to find the confidence interval for the margin:

```{r}
margins_summary(reg)
```

If we store the margins output as an object, we're able to plot it.

```{r}
m <- margins(reg)
plot(m)
```

We can also find the marginal effect at a particular point. This is especially helpful for nonlinear regression models, such as a quadratic. Here we find the marginal effect of income on food expenditure where income is \$2000 (i.e., `income = 20`):

```{r}
margins(reg, at = list(income = 20))
margins_summary(reg, at = list(income = 20), level = 0.9)
```

## Finding p-values

Recall that our test statistic is:

$$ t_{stat} = \dfrac{\text{Estimate} - \text{Parameter}}{\text{s.e.}(\text{Estimate})} $$

If we use the critical value method, we would use the `qt` function in conjunction with $\alpha$ to find our rejection region. If instead we were to use the p-value method, we would need to find the probability of obtaining that test statistic or one more extreme. If our test statistic were 2.37 at 38 degrees of freedom, what is the p-value for:

(i) a right-tailed test
(ii) a two-tailed test
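As a sketch of how these could be computed (added for illustration; `pt` is the CDF of the t distribution in base R, so the upper tail comes from `lower.tail = FALSE`):

```{r}
# p-values for a test statistic of 2.37 with 38 degrees of freedom
pt(2.37, df = 38, lower.tail = FALSE)      # (i) right-tailed test
2 * pt(2.37, df = 38, lower.tail = FALSE)  # (ii) two-tailed test
```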
## Prediction vs. Confidence Intervals for a Linear Combination of Parameters

The predicted value $\hat{y}$ is a random variable, since it depends on the sample; therefore, we can calculate a confidence interval and test hypotheses about it, provided we can determine its distribution and variance. Note that there are two types of intervals.

```{r predict}
predict(reg, data.frame(income = 20), interval = "confidence", level = 0.95)
predict(reg, data.frame(income = 20), interval = "prediction", level = 0.95)
```

What do you notice about these two outputs, both the estimate and the bounds?

```{r}
income <- seq(from = min(food$income), to = max(food$income))
ypredict <- predict(reg, newdata = data.frame(income), interval = "confidence")
yforecast <- predict(reg, newdata = data.frame(income), interval = "predict")
matplot(income, cbind(ypredict[, 1], ypredict[, 2], ypredict[, 3],
                      yforecast[, 2], yforecast[, 3]),
        type = "l", lty = c(1, 2, 2, 3, 3),
        col = c("black", "red", "red", "blue", "blue"),
        ylab = "food expenditure", xlab = "income")
points(food$income, food$food_exp)
legend("topleft", cex = 0.7,
       legend = c("E[y|x]", "lwr_pred", "upr_pred", "lwr_forcst", "upr_forcst"),
       lty = c(1, 2, 2, 3, 3),
       col = c("black", "red", "red", "blue", "blue"))
```

## Goodness of Fit

The component of the `summary` output holding the $R^2$ is `r.squared`, and we can extract it with the selection operator, i.e.

```{r rsq}
summary(reg)$r.squared
```

Remember you are able to check `attributes` to see what elements are at your disposal.

## Important Output from R for the Exam

You will need the output from the following commands for your midterm: `summary`, `anova`, `vcov`.

```{r refresher}
summary(reg)
anova(reg)
vcov(reg)
```

## Checking for Normality of Residuals and Other Diagnostics

We can either use diagrams or supplement them with a statistical test.

```{r resid}
hist(resid(reg), probability = T)
lines(density(resid(reg)), col = "blue")
plot(reg)
jarque.bera.test(resid(reg))
```
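As a supplement (not part of the original lab): the null hypothesis of the Jarque-Bera test is that the residuals are normally distributed, so a small p-value is evidence against normality. Base R's `shapiro.test` tests the same null and can serve as a cross-check:

```{r}
# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
shapiro.test(resid(reg))
```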
## Influential Observations

You may be interested in exploring outliers, high-leverage points, and/or influential observations worth removing.

- Outlier: has a large residual
- High leverage: the x value is far from its mean
- Influential observation: causes large changes to the coefficients of the model

Why are we doing this? To spot influential or anomalous observations. What methods do we use for this?

+ Observation and reasoning: if a value seems illogical, for example an age of 0 or greater than 100, or by observing the boxplot. If there is a cluster of values outside the boxplot whiskers, you may not want to eliminate them. You'll want to check the Cook's distance first, and if the outlier has a high Cook's distance (see the criteria below), then we consider eliminating it. **Make sure to mention in your final report or analysis that you removed any outliers.**

```{r eval=FALSE}
# Identifying outliers with the 1.5*IQR rule
# (template: replace `data` and `variable` with your data frame and column)
quartiles <- quantile(data$variable, probs = c(.25, .75), na.rm = FALSE)
IQR <- IQR(data$variable)
UQ <- quartiles[2] + 1.5 * IQR
LQ <- quartiles[1] - 1.5 * IQR
outliers_removed <- subset(data, data$variable > LQ & data$variable < UQ)

# OR we could also use the feature within boxplot to do this
outliers <- boxplot(data$variable, plot = FALSE)$out
outliers_removed <- data[-which(data$variable %in% outliers), ]
```
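To make the template concrete, here is a runnable sketch of the same IQR rule applied to the food data loaded earlier (an added illustration, not part of the original lab; the choice of `income` is arbitrary):

```{r}
# Apply the 1.5*IQR rule to food$income (illustration only)
q <- quantile(food$income, probs = c(.25, .75))
iqr <- IQR(food$income)
keep <- food$income > q[1] - 1.5 * iqr & food$income < q[2] + 1.5 * iqr
food_trimmed <- food[keep, ]
nrow(food) - nrow(food_trimmed)  # how many observations were flagged
```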
+ Cook's distance: takes into account both leverage and residual.

Criteria for elimination: either

a. being greater than three times the average Cook's distance, or
b. being greater than $2(k+1)/N$ (k = number of predictors; N = number of observations), or
c. being greater than $4/N$, or
d. being higher than 1

```{r eval = FALSE}
# How to find Cook's distance
library(ggplot2)
data(diamonds)
model <- lm(y ~ x, data = diamonds)

# find Cook's distance for each observation in the dataset
cooksD <- cooks.distance(model)
plot(model, which = 4)  # change to 1 to show the residual plot, for part (6)

# How to identify influential points using the different criteria
n <- nrow(diamonds)  # number of observations
p <- 1               # number of predictors in the model
influential_obs  <- as.numeric(names(cooksD)[cooksD > 2 * (p + 1) / n])
influential_obs1 <- as.numeric(names(cooksD)[cooksD > 3 * mean(cooksD)])
influential_obs2 <- as.numeric(names(cooksD)[cooksD > 1])

# New data frame with influential observations removed
outliers_removed <- diamonds[-influential_obs, ]
```

You might want to re-estimate your model in (2) after removing these observations to see the effect. If you have a lot of influential observations, this could indicate a problem with your regression model itself, so hold off on eliminating these points. For any observations that are eliminated, be sure to include your explanation for the removal.

Alternatively, we can look at the hat values (leverages) directly:

```{r}
hats <- as.data.frame(hatvalues(reg))
plot(hatvalues(reg), type = "h")
```

## Importing Files from other sources