Classwork #11-EmptyModel-TrumpVote-R

March 28, 2024

Classwork #11: Does Vegetable Eating Really Predict TrumpVote20?

[1]: # This code will load the R packages we will use
     suppressPackageStartupMessages({
       library(coursekata)
     })

     # Updated USStates data with election data
     USStates <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSEc6kO1zrL_3Jlc_cA7cMgk6E2xcIjuUbTL50y-0ENwWby36EFj1MpWZLVKud8YMTtqb1zsef_a8Ss/pub?gid=1275513973&single=true&output=csv", header = TRUE)

We’re going to revisit the hypothesis: TrumpVote20 = FiveVegetables + Other Stuff.

1.0 - Questioning the Data

We have looked at the USStates data set before and found a fairly surprising result: that states with relatively more vegetable eaters had relatively fewer votes for Trump. We’re going to try to be a little bit more skeptical today about this model of the DGP.

As a reminder, here is the list of variables in USStates.

State             Name of state
HouseholdIncome   Mean household income (in dollars)
IQ                Mean IQ score of residents
McCainVote        Percentage of votes for John McCain in 2008 Presidential election
Region            Area of the country: MW=Midwest, NE=Northeast, S=South, or W=West
Population        Number of residents (in millions)
EighthGradeMath   Average score on standardized test administered to 8th graders
HighSchool        Percentage of high school graduates
GSP               Gross State Product (dollars per capita)
FiveVegetables    Percentage of residents who eat at least five servings of fruits/vegetables per day
Smokers           Percentage of residents who smoke
PhysicalActivity  Percentage of residents who have competed in a physical activity in past month
Obese             Percentage of residents classified as obese
College           Percentage of residents with college degrees
NonWhite          Percentage of residents who are not white
HeavyDrinkers     Percentage of residents who drink heavily
TrumpVote16       Percentage of votes for Donald Trump in 2016 Presidential election
TrumpVote20       Percentage of votes for Donald Trump in 2020 Presidential election
BidenVote20       Percentage of votes for Joe Biden in 2020 Presidential election

1.1 - In order to get the FiveVegetables percentages, samples of people in all 50 states were asked this question: “Did you eat at least 5 servings of fruits and vegetables yesterday?” What would be your answer to that question?

1.2 - Could people lie? Could people try to tell the truth and still get it wrong? How would that be an example of “measurement error” (the idea that the data are “off” from what actually happened)? How could you reduce the measurement error that results from lying?

2.0 - Explaining Variation

2.1 - Take a look at TrumpVote20 = FiveVegetables + Other Stuff with a visualization.

2.2 - What are some reasons (from the data) for suspecting that FiveVegetables really does explain some of the variation in TrumpVote20? How would we write this as a word equation?

[ ]: There's a negative correlation between Trump votes and people who eat five vegetables.

2.3 - Does every state fit this pattern? What are some reasons (from the data) for suspecting that FiveVegetables DOES NOT explain some of the variation in TrumpVote20? How would we write this as a word equation?

No, not every state fits the pattern. Because some states do not follow the trend, we suspect that FiveVegetables does not explain some of the variation in TrumpVote20.

2.4 - Is it possible to have gotten this pattern of data by chance? Write a word equation that represents this possibility.

Yes, it is possible to have gotten this pattern of data by chance.
TrumpVote20 = Other Stuff

3.0 - TrumpVote20, Maybe It’s All Just Other Stuff

3.1 - If we didn’t know anything about a state or we wondered if this FiveVegetables thing was just a random fluke, what should we predict a random state’s TrumpVote20 to be?

3.2 - We could add your prediction (it could be the mean or median or any other number) into this scatterplot. For now, let’s try adding the mean. Note that just as the gf_vline() function adds a vertical line at a given x-value (xintercept), gf_hline() adds a horizontal line at a given y-value (yintercept).

[6]: TrumpVote_stats <- favstats(~ TrumpVote20, data = USStates)
     TrumpVote_stats$mean

     # Add the empty model to this scatterplot
     gf_point(TrumpVote20 ~ FiveVegetables, data = USStates, size = 3, alpha = .8) %>%
       gf_hline(yintercept = TrumpVote_stats$mean)
50.0838

3.3 - Why is the mean represented as a line? Why not a single dot?

No matter what the x value is, the prediction is always the mean, 50.08.

3.4 - Fit the empty model for TrumpVote20. Then use this model to predict each state’s percentage of Trump votes. We’ll then plot the predictions right on top of our original scatterplot.

[8]: # How do we fit the empty model?
     EmptyModel <- lm(TrumpVote20 ~ NULL, data = USStates)

     # How do we generate the predictions from it?
     USStates$Prediction <- predict(EmptyModel)

     # This will plot the predictions from the empty model
     gf_point(TrumpVote20 ~ FiveVegetables, data = USStates, size = 3, alpha = .8) %>%
       gf_point(Prediction ~ FiveVegetables, size = 3, alpha = .1, color = "orange")
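The fit-and-predict step can be checked with base R alone. The sketch below uses a small made-up data frame (`toy`, with hypothetical `veg` and `vote` columns, not the real USStates values) to show that the empty model estimates a single number, the mean, and predicts that same number for every row.

```r
# Hypothetical toy data standing in for USStates (values are made up)
toy <- data.frame(
  veg  = c(10, 14, 18, 22, 26),
  vote = c(62, 55, 58, 44, 41)
)

# Fit the empty model: no explanatory variable, only an intercept
empty_model <- lm(vote ~ NULL, data = toy)

# The single estimate (the intercept) matches the mean of the outcome
b0 <- unname(coef(empty_model))
b0
mean(toy$vote)

# Every row gets the same prediction, whatever its veg value
toy$prediction <- predict(empty_model)
toy
```

Because the prediction does not depend on `veg`, plotting it against `veg` traces out a flat row of points, which is why the empty model appears as a horizontal line.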
3.5 - Now we will start writing the word equation for the empty model as TrumpVote20 = Mean + Other Stuff (instead of the old way: TrumpVote20 = Other Stuff). In the cell below, insert our mean into the equation (i.e., replace “Mean” with the prediction of our empty model):

TrumpVote20 = Mean + Other Stuff

4.0 - Simulating a Random Data Generating Process

4.1 - Remember the gummy bear launches and NumLifts experiment? How did we “simulate” a random data generating process? Why is that a “random” process? Which R function acts like that?

4.2 - One pattern we saw in our data is that states with high FiveVegetables generally also have low TrumpVote20. But the definition of “random” includes the idea that high numbers don’t systematically go with low numbers. Instead, randomness means that high numbers could go with low, medium, OR high numbers! If we shuffled TrumpVote20 in this data frame, would we generate data that looks just like our empirical sample? With R, we don’t just have to wonder. We can actually do it. Run the code below a few times. What is it doing?

[19]: gf_point(shuffle(TrumpVote20) ~ FiveVegetables, data = USStates)
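What shuffling does to the data can be sketched without coursekata by using base R's `sample()`, which, given a vector of several values and no other arguments, returns those same values in a random order. This assumes `shuffle()` acts as a plain permutation here; the `toy` data frame and its values are made up for illustration.

```r
set.seed(11)  # arbitrary seed so the shuffle is reproducible

# Hypothetical toy data (made-up values, not the real USStates numbers)
toy <- data.frame(
  veg  = c(10, 14, 18, 22, 26),
  vote = c(62, 55, 58, 44, 41)
)

# Permute the outcome: same five values, reassigned to rows at random
toy$shuffled_vote <- sample(toy$vote)
toy

# The set of values is unchanged; only the pairing with veg is broken
identical(sort(toy$shuffled_vote), sort(toy$vote))
```

Each rerun without the seed gives a different pairing, which is why the shuffled scatterplot changes every time the cell is run.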
4.3 - Could we put the shuffle around FiveVegetables? Try it.

[17]: gf_point(TrumpVote20 ~ shuffle(FiveVegetables), data = USStates)
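Shuffling either variable breaks the pairing between the two, so either choice simulates the same kind of random process. A rough check on made-up data (the `toy` data frame is hypothetical, and base R's `sample()` is assumed as a stand-in for `shuffle()`): the observed correlation is strongly negative, while the correlations after shuffling x or shuffling y scatter around zero.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# Hypothetical toy data with a clear negative trend (values are made up)
toy <- data.frame(
  veg  = c(10, 12, 14, 16, 18, 20, 22, 24, 26, 28),
  vote = c(63, 60, 61, 55, 57, 50, 48, 49, 43, 40)
)

# Correlation in the data as observed
observed_r <- cor(toy$veg, toy$vote)

# 1000 correlations with the outcome shuffled, 1000 with the predictor shuffled
r_shuffle_y <- replicate(1000, cor(toy$veg, sample(toy$vote)))
r_shuffle_x <- replicate(1000, cor(sample(toy$veg), toy$vote))

observed_r
summary(r_shuffle_y)
summary(r_shuffle_x)
```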
4.4 - One of the scatterplots below is the empirical sample. Does it look any different from the shuffled scatterplots? What are you looking for that is different in the empirical sample?

4.5 - Do you think the likelihood of getting a pattern of data like the empirical sample from a random process is high? Low? Medium? Explain your reasoning.

I think the likelihood is low, since none of the shuffled samples looked similar to our original pattern. We never saw a pattern like our empirical one.

5.0 - Connecting the Empty Model and Shuffle

5.1 - If we shuffled TrumpVote20 or FiveVegetables in this data frame, would we estimate a different empty model? In other words, would the empty model change?

5.2 - With R, we can try it and see what happens. Explain what each line of code is doing.

[20]: # 5.3 - What’s this about?
      USStates$shuffled_Trump <- shuffle(USStates$TrumpVote20)

[15]: # 5.4 - What’s this about?
      shuffled_Trump_stats <- favstats(~ shuffled_Trump, data = USStates)
      TrumpVote_stats <- favstats(~ TrumpVote20, data = USStates)

[13]: # 5.5 - What’s this about?
      gf_point(shuffled_Trump ~ FiveVegetables, data = USStates, color = "dodgerblue") %>%
        gf_hline(yintercept = ~ mean, data = shuffled_Trump_stats)

[14]: # 5.6 - What’s this about?
      gf_point(TrumpVote20 ~ FiveVegetables, data = USStates) %>%
        gf_hline(yintercept = ~ mean, data = TrumpVote_stats)
5.7 - Why is the mean the same on both graphs? Is that just a coincidence? What if you ran the shuffle again, would it calculate a different mean? What if you shuffled FiveVegetables: would that result in a different mean for TrumpVote20?

[ ]: The mean is the same because it does not change based on the order of the numbers.

5.8 - Why is the empty model a stand-in for a DGP of randomness?

The empty model is not affected by any other variable. It is not affected by FiveVegetables. We are not making a prediction for each state based on another variable; we are just using the mean.
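The answers to 5.7 and 5.8 can be checked numerically. In this sketch (made-up `toy` data, with base R's `sample()` assumed as a stand-in for `shuffle()`), the empty model's estimate is the same before and after shuffling, whereas the slope from a model that uses the explanatory variable depends on how the rows are paired.

```r
set.seed(5)  # arbitrary seed for reproducibility

# Hypothetical toy data (made-up values, not the real USStates numbers)
toy <- data.frame(
  veg  = c(10, 14, 18, 22, 26),
  vote = c(62, 55, 58, 44, 41)
)
toy$shuffled_vote <- sample(toy$vote)

# Empty model: the estimate is the mean, which ignores row order
b0_original <- unname(coef(lm(vote ~ NULL, data = toy)))
b0_shuffled <- unname(coef(lm(shuffled_vote ~ NULL, data = toy)))
c(original = b0_original, shuffled = b0_shuffled)

# A model that uses veg depends on the pairing, so its slope can change
slope_original <- unname(coef(lm(vote ~ veg, data = toy))["veg"])
slope_shuffled <- unname(coef(lm(shuffled_vote ~ veg, data = toy))["veg"])
c(original = slope_original, shuffled = slope_shuffled)
```

Since shuffling leaves the empty model untouched, the empty model describes exactly what a purely random pairing would still preserve: the overall level of the outcome and nothing else.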