Station-2-Notebook

School: Pennsylvania State University
Course: 300
Subject: Statistics
Date: Jan 9, 2024
Pages: 15
Station 2
Allan Julian

R Notebooks

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button (sideways green arrow) within the chunk or by placing your cursor inside it and pressing Ctrl+Enter.

TASK–Run the code below:

plot(cars)

[Figure: scatterplot of dist versus speed for the cars data]

When you save the notebook, an HTML file containing the code and output will be saved alongside it. Click Preview (it may be under the Knit dropdown button), or press Ctrl+Shift+K to preview the HTML file.

TASK–create an HTML file from this notebook:

The preview shows you a rendered HTML copy of the contents of the editor. Note that Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

Packages

R is incredibly flexible and has tools for almost anything you want to do. Many of these tools are contained in packages that must be loaded into your workspace. The first time you want to use a package, you have to install it on your computer using the command install.packages("package_name") and then load it into your workspace using the command library(package_name). After the first time, you only need to call library() and you do not need to use install.packages() again.

Our class will typically use the following packages:

- mosaic
- ggformula
- Stat2Data
- Lock5Data
- tidyverse
- tinytex

TASK–Under the Packages tab in the session window, scroll through and check to see which of the packages above are not yet installed. For any that aren't, install them by clicking the Install tab and entering their names in the popup window. Note that these names are case-sensitive.

After installing the packages, you can load them for use by either checking their boxes in the Packages tab or by using the library command below for each package you want to use in the current session. Here is an example using the mosaic package.

library(mosaic)
## Registered S3 method overwritten by 'mosaic':
##   method                           from
##   fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following objects are masked from 'package:dplyr':
##
##     count, do, tally
## The following object is masked from 'package:Matrix':
##
##     mean
## The following object is masked from 'package:ggplot2':
##
##     stat
## The following objects are masked from 'package:stats':
##
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
##     max, mean, min, prod, range, sample, sum

library(tidyverse) # loads the tidyverse packages into your workspace
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v forcats   1.0.0     v stringr   1.5.0
## v lubridate 1.9.2     v tibble    3.2.1
## v purrr     1.0.2     v tidyr     1.3.0
## v readr     2.1.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x mosaic::count()  masks dplyr::count()
## x purrr::cross()   masks mosaic::cross()
## x mosaic::do()     masks dplyr::do()
## x tidyr::expand()  masks Matrix::expand()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x tidyr::pack()    masks Matrix::pack()
## x mosaic::stat()   masks ggplot2::stat()
## x mosaic::tally()  masks dplyr::tally()
## x tidyr::unpack()  masks Matrix::unpack()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

TASK–Repeat the above steps to install and load the ggformula, Stat2Data, Lock5Data, tinytex, and tidyverse packages. At this point you should have all six packages loaded and ready to use.

Use data from an R package

A great thing about R is that not only does it allow you to analyze data, but you can also access tons of datasets that come cleaned and formatted within R packages. Two of the packages you loaded above, Stat2Data and Lock5Data, are primarily datasets.

TASK–Use the code below to load the dataset diamonds, which is part of the ggplot2 package and included in tidyverse.

data(diamonds)

Inspecting the data source

Now you're ready to learn a little bit about the diamonds data set.

TASK: Edit the bullet list to add a short description in your own words describing what each function does.

- glimpse(): this function...
- head(): this function...
- names(): this function...
# Inspecting the data source
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.~
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver~
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,~
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, ~
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64~
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58~
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34~
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.~
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.~
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.~
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
##  [8] "x"       "y"       "z"

Some Data Prep

The following is a little bit of data wrangling to get the source data in shape for our purposes. You can ignore the details and just run this code for now.

# Recode & filter (no edits needed)
recoded <-                                  # make a new dataset called recoded
  diamonds %>%                              # by starting with the diamonds data
  filter(color == "D" | color == "J") %>%   # and filtering observations to keep only colors D and J
  mutate(col = as.character(color))         # tell R some specifics about how to record the variable color

Basically, we're going to do a bit of exploration of variables that impact the cost of diamonds. Even if you haven't used R before, you might be able to tell from the code that we start with the diamonds data, filter (i.e. restrict) the data set to include only the diamonds that are either color "D" or "J", create a new categorical variable called "col", and store the whole thing in a new data set called "recoded".

Statisticians should always know something about the data domain in order to be useful. Wikipedia is usually a good place to start: https://en.wikipedia.org/wiki/Diamond_color.

Exploratory Data Analysis

For the purposes of our class, it's useful to learn a model-centric approach to R.
The pseudo-code below is going to be our foundation for the rest of the class:

function(Y ~ X, data = DataSetName)

Here's a short description of each part in the pseudo-code above:

- function is an R function that dictates something you want to do with your data, e.g.
  - mean calculates the mean
  - t.test performs a two-sample t-test
  - lm fits a linear regression model
- Y is the outcome of interest (response variable)
- X is some explanatory variable, or you can use "1" as a placeholder if there is no explanatory variable
- DataSetName is the name of a data set loaded into the R environment

Always start with clear research questions. Our question for this exercise:

1. How do diamond prices compare for "D" and "J" colored diamonds?
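As a rough illustration of the function(Y ~ X, data = DataSetName) pattern described above, here is a minimal sketch that uses only base R, so it runs without any of the class packages. The data frame toy and its values are hypothetical and are not taken from the diamonds data. Because base R's mean() does not accept a formula (that is a feature the mosaic package adds), aggregate() stands in for the group-mean step here; that swap is an assumption of this sketch, not the notebook's method.

```r
# Hypothetical toy data (made up for illustration): six prices in two groups
toy <- data.frame(
  price = c(300, 450, 500, 700, 900, 1200),
  col   = c("D", "D", "D", "J", "J", "J")
)

# Group means via the formula interface: Y = price, X = col
group_means <- aggregate(price ~ col, data = toy, FUN = mean)
print(group_means)

# Two-sample t-test with the same formula (Welch's test by default)
tt <- t.test(price ~ col, data = toy)
print(tt$estimate)

# Linear regression with the same formula
fit <- lm(price ~ col, data = toy)
print(coef(fit))

# "1" as a placeholder when there is no explanatory variable:
# the fitted intercept is the overall mean of price
fit0 <- lm(price ~ 1, data = toy)
print(coef(fit0))
```

The same formula, price ~ col, is passed unchanged to each function; only the function name changes, which is the point of the model-centric approach.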
The purpose of the exploratory data analysis (EDA) is to learn as much as you can about your research question before doing any fancy statistical modeling. We basically want to try to answer the research question with EDA if possible, and then we use statistical models to formally accommodate variability in the data and calculate the uncertainty of our conclusions.

Mean price by color

Use the R code chunk to calculate the mean price by color just like the article. Summarize your observations below the code chunk. Don't forget, we did some data wrangling above and made a new data set called "recoded." Use the "recoded" data for the rest of this analysis.

mean(price ~ col, data = recoded)
##        D        J
## 3169.954 5323.818

TASK–Share your observations:

The "J" colored diamonds have a higher mean price. We do not know whether color is causing this difference; some other feature of the diamonds that happen to be "J" colored may make them pricier and pull the mean up.

Other summary statistics by color

Use the R code chunk to calculate the other summary statistics for price of each diamond color using favstats(). Summarize your observations below the code chunk.

TASK–Produce the required code:

favstats(price ~ col, data = recoded)
##   col min     Q1 median     Q3   max     mean       sd    n missing
## 1   D 357  911.0   1838 4213.5 18693 3169.954 3356.591 6775       0
## 2   J 335 1860.5   4234 7695.0 18710 5323.818 4438.187 2808       0

TASK–Share your observations:

The median price of a "J" colored diamond is greater than that of a "D" colored diamond. For both colors the mean is greater than the median, which suggests that both price distributions are skewed to the right. The prices of "J" colored diamonds also seem to be more spread out than those of "D", given that their standard deviation is greater.

2-3 Basic plots of the data

Make side-by-side boxplots along with a scatterplot using the R code shown below.
TASK–Run the code to make a boxplot and then modify the code to make a scatterplot:

# make a boxplot of price by color
gf_boxplot(price ~ col, data = recoded)

[Figure: side-by-side boxplots of price by col (D, J)]

# make a scatter plot of price versus carat
gf_point(price ~ carat, data = recoded)
[Figure: scatterplot of price versus carat]

TASK–Share your observations:

For both colors, there is a positive correlation between carat and price. The "D" colored diamonds show greater skewness, as indicated by the number of outliers in the boxplot.

Multivariable relationships

The world is often too complicated to be understood by studying one or two variables at a time. Color is certainly not the only variable that impacts the value of a diamond. You may have noticed in the Wikipedia article that color is only one of the "4 C's" that influence the value of a diamond.

Adjusting for other variables

Create a scatter plot of price vs carat (a measure of diamond weight) that colors each plotted point according to the color of the diamond it represents. Write a few sentences below the code chunk to explain what you've observed from this plot.

TASK–modify the code to produce the scatterplot:

gf_point(price ~ carat, color = ~ color, data = recoded)
[Figure: scatterplot of price versus carat, points colored by color (D, J)]

TASK–Share your observations:

There are more high-carat "J" colored diamonds than "D" ones. Despite this, some "J" diamonds with a higher carat are priced the same as some "D" diamonds with a lower carat. The "D" colored diamonds do not seem to follow a linear relationship; the pattern looks more quadratic.

Use data from an outside source

It's important to be able to download a .csv file from an outside source (typically Canvas) and then load it in R for use. There are two ways to do so:

1. Use file.choose() to interactively locate the file and tell R where it is
2. Include the file path in your R code.

Create a folder in your documents for Stat 300 data and save the .csv file to that location. Keeping all your data files in one place makes things easier: you can run your code again later without having to locate the file every time, and the file path stays the same for all your data.

For assignments in this class you will typically be provided with code that includes a file.choose() statement, which you can of course modify or leave as is.

TASK–Download the file US_states.csv from Canvas and use the code below to load the data into your environment

To knit to PDF, you'll need to change the read_csv() function to include the actual file path. Including state_path above gives you the file path after you have used the interactive function file.choose() to locate your file. Copy and paste your file path into the read_csv() function (replacing state_path) in the code chunk above. You also want to delete the line that uses the file.choose() function and the call for state_path in the code chunk. You can literally delete them or put a # symbol in front of the line to tell R not to run that line but keep it there as a note. The final code that runs should be one line like this:

state_data <- read_csv("your specific file path copied here")

Alternatively, you can often find data files you need hosted by your instructor. Here's an example with the states data:

state_data <- read_csv("https://sites.psu.edu/sar320/files/2023/08/US-_States.csv")
## Rows: 50 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (9): State, Region, TwoParents, ObamaRomney, TrumpClinton, BidenTrump, ...
## dbl (23): HouseholdIncome, EighthGradeMath, HighSchool, College, IQ, GSP, Sm...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

TASK–Use head() and glimpse() to explore the format of the data a bit

glimpse(state_data)
## Rows: 50
## Columns: 32
## $ State            <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californ~
## $ HouseholdIncome  <dbl> 43.253, 70.760, 49.774, 40.768, 61.094, 58.433, 69.46~
## $ Region           <chr> "S", "W", "W", "S", "W", "W", "NE", "NE", "S", "S", "~
## $ EighthGradeMath  <dbl> 269.2, 281.6, 279.7, 277.9, 275.9, 289.7, 285.2, 282.~
## $ HighSchool       <dbl> 84.9, 92.8, 85.6, 87.1, 84.1, 89.5, 91.0, 86.9, 87.1,~
## $ College          <dbl> 24.9, 24.7, 25.5, 22.4, 31.4, 37.0, 39.8, 31.7, 26.5,~
## $ IQ               <dbl> 95.7, 99.0, 97.4, 97.5, 95.5, 101.6, 103.1, 100.4, 98~
## $ GSP              <dbl> 32.615, 61.156, 35.195, 31.837, 46.029, 46.242, 54.92~
## $ Smokers          <dbl> 21.5, 22.6, 16.3, 25.9, 12.5, 17.7, 15.5, 19.6, 16.8,~
## $ PhysicalActivity <dbl> 45.4, 55.3, 51.9, 41.2, 56.3, 60.4, 50.9, 49.7, 50.2,~
## $ Obese            <dbl> 32.4, 28.4, 26.8, 34.6, 24.1, 21.3, 25.0, 31.1, 26.4,~
## $ HeavyDrinkers    <dbl> 4.3, 8.2, 6.3, 5.0, 6.4, 6.7, 6.3, 6.6, 7.2, 4.7, 7.6~
## $ Electoral        <dbl> 9, 3, 11, 6, 55, 9, 7, 3, 29, 16, 4, 4, 20, 11, 6, 6,~
## $ TwoParents       <chr> "Under65%", "Over65%", "Under65%", "Under65%", "Over6~
## $ StudentSpending  <dbl> 8.755, 18.175, 7.208, 9.394, 9.220, 8.647, 16.631, 13~
## $ Insured          <dbl> 78.8, 79.8, 74.7, 71.7, 79.7, 80.0, 87.7, 85.7, 70.9,~
## $ RomneyVote       <dbl> 60.54582, 54.80158, 53.65453, 60.56694, 37.12038, 46.~
## $ ObamaVote        <dbl> 38.35903, 40.81266, 44.58977, 36.87899, 60.23896, 51.~
## $ ObamaRomney      <chr> "Romney", "Romney", "Romney", "Romney", "Obama", "Oba~
## $ TrumpVote        <dbl> 62.08309, 51.28151, 48.67162, 60.57410, 31.61711, 43.~
## $ ClintonVote      <dbl> 34.35795, 36.55087, 45.12602, 33.65312, 61.72640, 48.~
## $ TrumpClinton     <chr> "Trump", "Trump", "Trump", "Trump", "Clinton", "Clint~
## $ Turnout2018      <dbl> 47.3, 54.6, 49.1, 41.4, 49.6, 63.0, 54.4, 51.4, 54.9,~
## $ Turnout2020      <dbl> 63.1, 68.8, 65.9, 56.1, 68.5, 76.4, 71.5, 70.7, 71.7,~
## $ BidenTrump       <chr> "Trump", "Trump", "Biden", "Trump", "Biden", "Biden",~
## $ cases            <dbl> 1466503, 277007, 2237208, 921796, 10978212, 1620407, ~
## $ deaths           <dbl> 20046, 1296, 30982, 11851, 94623, 13833, 11180, 3039,~
## $ Population2010   <dbl> 4.779736, 0.710231, 6.392017, 2.915918, 37.253956, 5.~
## $ Population2019   <dbl> 4.903185, 0.731545, 7.278717, 3.017804, 39.512223, 5.~
## $ EnoughVeg        <chr> "Under75%", "Over75%", "Over75%", "Under75%", "Over75~
## $ EnoughFruit      <chr> "Under60%", "Over60%", "Over60%", "Under60%", "Over60~
## $ PropWhite        <chr> "Under75%", "Under75%", "Over75%", "Over75%", "Under7~

This data set has one row for each of the states, and each variable is recorded for the entire state. For example,
the variable "HeavyDrinkers" gives the percentage of residents in a state who are heavy drinkers. According to these data, 5.9% of Pennsylvania residents are heavy drinkers.

TASK–Refer to the US States Merged Data codebook and select some variables below:

- Quantitative variable 1: HouseholdIncome
- Quantitative variable 2: IQ
- Categorical variable 1: BidenTrump
- Categorical variable 2: Region

EDA for State data

TASK–Determine how many states are in each category of a categorical variable:

tally(~ Region, data = state_data)
## Region
## MW NE  S  W
## 13 11 13 13

TASK–create a barchart for one categorical variable:

gf_bar(~ Region, data = state_data)

[Figure: bar chart of count by Region (MW, NE, S, W)]

TASK–Determine how many states are in combinations of categories for two categorical variables:

tally(Region ~ BidenTrump, data = state_data)
##       BidenTrump
## Region Biden Trump
##     MW     4     9
##     NE    11     0
##     S      2    11
##     W      8     5

TASK–create a side-by-side barchart for two categorical variables:

gf_bar(~ Region, fill = ~ BidenTrump, data = state_data, position = position_dodge())

[Figure: side-by-side bar chart of count by Region, filled by BidenTrump (Biden, Trump)]

TASK–create a histogram for one of your quantitative variables:

gf_histogram(~ IQ, data = state_data)
[Figure: histogram of IQ]

TASK–create side-by-side boxplots for one quantitative and one categorical variable:

gf_boxplot(HouseholdIncome ~ Region, data = state_data)
[Figure: side-by-side boxplots of HouseholdIncome by Region (MW, NE, S, W)]

TASK–create a scatterplot for two quantitative variables:

gf_point(HouseholdIncome ~ IQ, data = state_data)

[Figure: scatterplot of HouseholdIncome versus IQ]

TASK–create a scatterplot for two quantitative variables, using a categorical variable for color:

gf_point(HouseholdIncome ~ IQ, color = ~ BidenTrump, data = state_data)

[Figure: scatterplot of HouseholdIncome versus IQ, points colored by BidenTrump (Biden, Trump)]

library(tinytex)

TASK–share something interesting you learned from these figures. Anything you learned must be in context

States that voted for Biden seem to have not only a higher household income but also a higher IQ. There are some outliers in both groups, but generally speaking that seems to be the case.

Make sure you save your work. You will need it for your first homework assignment.