HW 1 solutions

GV900 Homework 1 Fall 2023

Homework 1: Answer Key

Question 1

Load the ggplot2 library and then the midwest dataset. Display the column names for the midwest dataset. [3 marks]

    library(ggplot2)
    data("midwest") # OR data(midwest) is also fine
    ?midwest
    names(midwest)

     [1] "PID"                  "county"               "state"
     [4] "area"                 "poptotal"             "popdensity"
     [7] "popwhite"             "popblack"             "popamerindian"
    [10] "popasian"             "popother"             "percwhite"
    [13] "percblack"            "percamerindan"        "percasian"
    [16] "percother"            "popadults"            "perchsd"
    [19] "percollege"           "percprof"             "poppovertyknown"
    [22] "percpovertyknown"     "percbelowpoverty"     "percchildbelowpovert"
    [25] "percadultpoverty"     "percelderlypoverty"   "inmetro"
    [28] "category"
    # or colnames(midwest) is also fine

Question 2

What type of variable is state? How can you tell? How many different states are there in this dataset? What is the modal state in the dataset? [2 marks]

    # create a frequency table for variable state
    table(midwest$state)
    ##
    ##  IL  IN  MI  OH  WI
    ## 102  92  83  88  72

The variable ‘state’ is a nominal/categorical variable, as there is no inherent order among the states and no numbers attached to them either. Within this dataset, there are five distinct states. The modal state, meaning the state with the highest frequency, is IL (Illinois).

Question 3

Look at the summary statistics for the variable popwhite. Just based on looking at these numbers, describe the likely distribution of the variable. That is, in your own words and in 2-3 sentences, describe the range, and the likely skewness of the variable (if any). Make sure to explain why you think the variable is/isn’t skewed a certain way based on just these summary statistics numbers. [4 marks]
    summary(midwest$popwhite)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##     416   18630   34471   81840   72968 3204947
    range(midwest$popwhite)
    ## [1]     416 3204947
    # for the range, it's also fine to use max(midwest$popwhite) - min(midwest$pop

Based on the summary statistics, we can observe that the variable has a substantial range, extending from a minimum number of 416 white people in a county to a maximum value of 3,204,947. This wide range indicates a high degree of variability in the data. The median (34,471) is notably less than the mean (81,840), suggesting a right-skewed distribution, as the mean is pulled towards the higher values by extremely high data points.

Question 4

Now, make a histogram for the popwhite variable. In doing so, make sure that you change the default binwidth, color, and axes labels, and that you give your histogram a relevant title. Also make sure that you save your histogram as an object rather than displaying it directly. If you were to describe the distribution of this variable based on the histogram, would your description change in any way from the previous answer? If yes, in what way? If no, why not? Explain why you think this variable is skewed or not, i.e., think about what the variable
measures and explain why it probably looks the way it does. (Note that this last part is not about how you can tell whether the variable is skewed or not but, rather, asks you to think about what the variable measures and why it makes sense, or doesn’t make sense, that the variable is skewed or not in the way that it is.) [6 marks]

    # create histogram for 'popwhite'
    popwhite_hist <- ggplot(data = midwest, aes(x = popwhite)) +
      geom_histogram(binwidth = 100000, color = "purple", fill = "skyblue") +
      labs(x = "Number of white people",
           y = "Frequency",
           title = "Histogram of white people in every county")
    popwhite_hist
[Figure: Histogram of white people in every county; x-axis: Number of white people, y-axis: Frequency]

Observing the histogram, I would draw the same conclusion as before: the variable is very much right-skewed. As the figure shows, most counties have up to 200,000 white people, with some having more but still under approximately 1 million. There are a few large outliers, especially the maximum value, which give the distribution a very long tail. Thus the description of the variable would be similar to before.

As for why the variable looks this way, that is likely because it is a raw count of the number of white people in a given county, without taking into account the size of the county or its overall population. In other words, geographically larger counties or densely populated counties (such as counties containing big cities) will have a higher number of people living there and therefore also a high number of white people living there. However, there won’t be very many such counties, as not every county has a huge urban center in it, so the dataset
will have a handful of such observations with the other observations having much lower numbers of white people living there.

Question 5

Make and display a density plot for the variable called percollege. In doing so, make sure you change the axes labels, add a title, and use a different color than the default settings. What kind of variable is percollege? Now, describe the variable using the density plot, i.e., think about the range, skewness, what types of counties seem to be the most common based on this figure etc? Based on just this figure, is the mean or median of this variable likely to be higher? How can you tell? Which of the two (i.e., mean or median) do you think is the more representative summary statistic for this variable, and why? [7 marks]

    # create density plot for 'percollege'
    percollege_density <- ggplot(data = midwest, aes(x = percollege)) +
      geom_density(color = "purple", fill = "skyblue") +
      labs(x = "Percent college educated",
           y = "Density",
           title = "Density of college educated (percentage)")
    percollege_density
[Figure: Density of college educated (percentage); x-axis: Percent college educated, y-axis: Density]

The variable ‘percollege’ represents the percentage of college-educated individuals in each county. Therefore, the variable is quantitative and continuous. The variable exhibits a right-skewed distribution, which we can see from the long right tail. The percentage of people in a county with college degrees ranges from approximately 5% to close to 50%. From the plot, it seems that most counties have approximately 15% of people with a college education. Given the heavy right skew, the mean is likely to be higher than the median, as it gets pulled in the direction of the tail. In this skewed scenario, the median is a more suitable summary statistic for describing the central tendency of the variable because the median is not influenced by outliers in the way the mean is.
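The point that the mean gets pulled toward the tail while the median stays put can be checked on a small example. The sketch below uses made-up percentages, not values from the midwest dataset, and the names `pct` and `pct_extreme` are hypothetical and chosen only for illustration.

```r
# Hypothetical right-skewed percentages; NOT taken from the midwest dataset.
pct <- c(12, 13, 14, 15, 15, 16, 17, 18, 20, 48)

mean_pct   <- mean(pct)    # pulled upward by the single large value
median_pct <- median(pct)  # average of the two middle values after sorting

# Make the largest value even more extreme and recompute both statistics.
pct_extreme    <- replace(pct, which.max(pct), 90)
mean_extreme   <- mean(pct_extreme)
median_extreme <- median(pct_extreme)

# The mean moves with the outlier; the median does not change.
c(mean = mean_pct, median = median_pct)
c(mean = mean_extreme, median = median_extreme)
```

In this toy vector the mean sits above the median, and stretching the largest value raises the mean further while leaving the median where it was, which is the behaviour the answer relies on.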
Question 6

For this last question, you will not use the midwest dataset.

a

Create, store and display a sequence from 0 to 1000 in increments of 20. Next, create, store and display a sequence of the same length as the first vector you have created that goes from 2000 to 4000. Finally, create, store and display a vector where you subtract the second vector from the first, i.e., vector 1 - vector 2. [3 marks]

    # Create the first vector
    seq_a <- seq(from = 0, to = 1000, by = 20)
    seq_a

     [1]    0   20   40   60   80  100  120  140  160  180  200  220  240  260  280
    [16]  300  320  340  360  380  400  420  440  460  480  500  520  540  560  580
    [31]  600  620  640  660  680  700  720  740  760  780  800  820  840  860  880
    [46]  900  920  940  960  980 1000

    # Create the second vector
    seq_b <- seq(from = 2000, to = 4000, length.out = 51)
    # It is also fine to use length.out = length(seq_a)
    seq_b

     [1] 2000 2040 2080 2120 2160 2200 2240 2280 2320 2360 2400 2440 2480 2520 2560
    [16] 2600 2640 2680 2720 2760 2800 2840 2880 2920 2960 3000 3040 3080 3120 3160
    [31] 3200 3240 3280 3320 3360 3400 3440 3480 3520 3560 3600 3640 3680 3720 3760
    [46] 3800 3840 3880 3920 3960 4000
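The value length.out = 51 follows from the first sequence: going from 0 to 1000 in steps of 20 gives 1000 / 20 + 1 = 51 elements, so a second sequence of the same length has 50 equal steps of (4000 - 2000) / 50 = 40. A minimal base-R check of that arithmetic is sketched below; the names `first_seq`, `second_seq`, `n` and `step` are chosen here for illustration and are not part of the answer key.

```r
# Rebuild the first sequence and count its elements.
first_seq <- seq(from = 0, to = 1000, by = 20)
n <- length(first_seq)                 # 1000 / 20 + 1

# A sequence of the same length from 2000 to 4000 has n - 1 equal steps.
step <- (4000 - 2000) / (n - 1)
second_seq <- seq(from = 2000, to = 4000, length.out = n)

# Confirm both vectors have the same length and the spacing is constant.
same_length  <- length(first_seq) == length(second_seq)
even_spacing <- isTRUE(all.equal(diff(second_seq), rep(step, n - 1)))

n
step
same_length
even_spacing
```

Matching lengths matter here because R recycles the shorter vector in element-wise arithmetic, which would give misleading results in the subtraction step.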
    # Subtract the second vector from the first
    seq_c <- seq_a - seq_b
    seq_c

     [1] -2000 -2020 -2040 -2060 -2080 -2100 -2120 -2140 -2160 -2180 -2200 -2220
    [13] -2240 -2260 -2280 -2300 -2320 -2340 -2360 -2380 -2400 -2420 -2440 -2460
    [25] -2480 -2500 -2520 -2540 -2560 -2580 -2600 -2620 -2640 -2660 -2680 -2700
    [37] -2720 -2740 -2760 -2780 -2800 -2820 -2840 -2860 -2880 -2900 -2920 -2940
    [49] -2960 -2980 -3000

b

Say I want to conduct a survey among 150 respondents where one of the questions I ask them is about their education level. Suggest an ordinal variable that I can create to measure this; note that the variable should have at least three categories and no more than five categories. Justify your answer, i.e., explain why your suggested measure is a reasonable one. [2 marks]

  • Less than High School (or equivalent)
  • High School (or equivalent)
  • Bachelor’s Degree
  • Master’s Degree
  • Doctorate or Professional Degree

I have suggested a five-category ordinal variable based on common education milestones around the world. Each category represents a higher education level than the previous one and ensures that all possible education levels are covered.
It also ensures no overlap between categories, meaning that no individual will be left out and no one will be counted twice. That is, for example, someone with no formal education or only a few years of it would fit into the first category as long as they have not completed a high school degree or equivalent in that country. Anyone who has completed their high school education, or equivalent, but does not have a college degree will go into the next category of ‘High School’. Similarly, those whose highest education degree is a Bachelor’s degree or equivalent will go into that category, and so on. I have also chosen to use common milestones because it makes it easier to code individuals and sort them into a clear category quickly.

c

Now, using the categories you suggested in the previous part of the question, create a hypothetical variable where you randomly sample from the suggested values making sure that you change the default probabilities in some way. You can use whatever probabilities you would like. Display the final output of this variable. [3 marks]

    # Generate the categorical variable by randomly sampling 150 values
    # from the five categories, using custom (non-default) probabilities.
    education <- sample(x = c("Less than High School",
                              "High School",
                              "Bachelor's Degree",
                              "Master's Degree",
                              "Doctorate or Professional Degree"),
                        size = 150,
                        prob = c(0.1, 0.2, 0.3, 0.3, 0.1),
                        replace = TRUE)
    # Display
    education

      [1] "Master's Degree"                  "Master's Degree"
      [3] "Less than High School"            "Master's Degree"
      [5] "High School"                      "Doctorate or Professional Degree"
      [7] "High School"                      "Doctorate or Professional Degree"
      [9] "Master's Degree"                  "Master's Degree"
     [11] "High School"                      "Doctorate or Professional Degree"
     [13] "Bachelor's Degree"                "Bachelor's Degree"
     [15] "Less than High School"            "Master's Degree"
     [17] "Bachelor's Degree"                "Master's Degree"
     [19] "Master's Degree"                  "Bachelor's Degree"
     [21] "High School"                      "Less than High School"
     [23] "Master's Degree"                  "Master's Degree"
     [25] "Master's Degree"                  "Master's Degree"
     [27] "Master's Degree"                  "Less than High School"
     [29] "Bachelor's Degree"                "Less than High School"
     [31] "Bachelor's Degree"                "Less than High School"
     [33] "High School"                      "Doctorate or Professional Degree"
     [35] "Doctorate or Professional Degree" "Master's Degree"
     [37] "High School"                      "Master's Degree"
     [39] "Doctorate or Professional Degree" "Master's Degree"
     [41] "Less than High School"            "Bachelor's Degree"
     [43] "Master's Degree"                  "Bachelor's Degree"
     [45] "High School"                      "Bachelor's Degree"
     [47] "Less than High School"            "High School"
     [49] "Bachelor's Degree"                "Bachelor's Degree"
     [51] "Master's Degree"                  "High School"
     [53] "Bachelor's Degree"                "Master's Degree"
     [55] "Master's Degree"                  "Bachelor's Degree"
     [57] "High School"                      "Master's Degree"
     [59] "High School"                      "Master's Degree"
     [61] "Master's Degree"                  "High School"
     [63] "Doctorate or Professional Degree" "Bachelor's Degree"
     [65] "Bachelor's Degree"                "High School"
     [67] "Bachelor's Degree"                "Bachelor's Degree"
     [69] "Master's Degree"                  "Less than High School"
     [71] "Bachelor's Degree"                "Bachelor's Degree"
     [73] "High School"                      "Master's Degree"
     [75] "Master's Degree"                  "High School"
     [77] "Less than High School"            "Bachelor's Degree"
     [79] "Bachelor's Degree"                "High School"
     [81] "Less than High School"            "Master's Degree"
     [83] "Bachelor's Degree"                "Bachelor's Degree"
     [85] "Master's Degree"                  "Less than High School"
     [87] "Bachelor's Degree"                "High School"
     [89] "Master's Degree"                  "Less than High School"
     [91] "Master's Degree"                  "Bachelor's Degree"
     [93] "High School"                      "Master's Degree"
     [95] "Doctorate or Professional Degree" "Bachelor's Degree"
     [97] "Bachelor's Degree"                "Doctorate or Professional Degree"
     [99] "Master's Degree"                  "Bachelor's Degree"
    [101] "Doctorate or Professional Degree" "Master's Degree"
    [103] "Master's Degree"                  "Bachelor's Degree"
    [105] "Bachelor's Degree"                "High School"
    [107] "High School"                      "Master's Degree"
    [109] "Less than High School"            "Bachelor's Degree"
    [111] "Doctorate or Professional Degree" "Master's Degree"
    [113] "Master's Degree"                  "High School"
    [115] "Bachelor's Degree"                "Bachelor's Degree"
    [117] "Bachelor's Degree"                "Bachelor's Degree"
    [119] "Master's Degree"                  "Master's Degree"
    [121] "High School"                      "Bachelor's Degree"
    [123] "High School"                      "Bachelor's Degree"
    [125] "Bachelor's Degree"                "Bachelor's Degree"
    [127] "Master's Degree"                  "Bachelor's Degree"
    [129] "Master's Degree"                  "High School"
    [131] "Bachelor's Degree"                "High School"
    [133] "Bachelor's Degree"                "Bachelor's Degree"
    [135] "Bachelor's Degree"                "Bachelor's Degree"
    [137] "Doctorate or Professional Degree" "Bachelor's Degree"
    [139] "Bachelor's Degree"                "Bachelor's Degree"
    [141] "High School"                      "Less than High School"
    [143] "Bachelor's Degree"                "Bachelor's Degree"
    [145] "Less than High School"            "Master's Degree"
    [147] "Master's Degree"                  "Less than High School"
    [149] "Master's Degree"                  "Less than High School"
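
Because the sampled values are plain character strings, R does not know that the categories are ranked. One possible follow-up, which is not part of the original answer key, is to store the variable as an ordered factor so that tables and comparisons respect the ordinal scale. The sketch below draws a fresh sample with the same probabilities. The seed and the names `edu_levels`, `education_chr` and `education_ord` are assumptions made for illustration, so the draws will not match the output shown above.

```r
# Category labels listed from lowest to highest education level.
edu_levels <- c("Less than High School",
                "High School",
                "Bachelor's Degree",
                "Master's Degree",
                "Doctorate or Professional Degree")

# Assumed seed, used only so that this sketch is reproducible.
set.seed(1)
education_chr <- sample(x = edu_levels,
                        size = 150,
                        prob = c(0.1, 0.2, 0.3, 0.3, 0.1),
                        replace = TRUE)

# Convert to an ordered factor; 'levels' fixes the ranking of the categories.
education_ord <- factor(education_chr, levels = edu_levels, ordered = TRUE)

# Frequency table, printed in rank order rather than alphabetical order.
edu_counts <- table(education_ord)
edu_counts

# Ordered factors support rank comparisons, e.g. a count of respondents
# with at least a Bachelor's degree.
at_least_bachelor <- sum(education_ord >= "Bachelor's Degree")
at_least_bachelor
```

Passing `levels` explicitly matters because `factor()` otherwise sorts the labels alphabetically, which would place "Bachelor's Degree" first and break the ordinal interpretation.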