HW 1 solutions

School

University of Texas, Rio Grande Valley

Course

AUDITING

Subject

Statistics

Date

Nov 24, 2024

Type

pdf

Pages

14

Uploaded by LieutenantResolve12096

GV900 Homework 1 Fall 2023

Homework 1: Answer Key

Question 1

Load the ggplot2 library and then the midwest dataset. Display the column names for the midwest dataset. [3 marks]

library(ggplot2)
data("midwest") # OR data(midwest) is also fine
?midwest
names(midwest)

##  [1] "PID"                  "county"               "state"
##  [4] "area"                 "poptotal"             "popdensity"
##  [7] "popwhite"             "popblack"             "popamerindian"
## [10] "popasian"             "popother"             "percwhite"
## [13] "percblack"            "percamerindan"        "percasian"
## [16] "percother"            "popadults"            "perchsd"
## [19] "percollege"           "percprof"             "poppovertyknown"
## [22] "percpovertyknown"     "percbelowpoverty"     "percchildbelowpovert"
## [25] "percadultpoverty"     "percelderlypoverty"   "inmetro"
## [28] "category"
# or colnames(midwest) is also fine

Question 2

What type of variable is state? How can you tell? How many different states are there in this dataset? What is the modal state in the dataset? [2 marks]

# create a frequency table for variable state
table(midwest$state)

##
##  IL  IN  MI  OH  WI
## 102  92  83  88  72

The variable ‘state’ is a nominal/categorical variable, as there is no inherent order among the states and no numbers attached to them either. Within this dataset, there are five distinct states. The modal state, meaning the state with the highest frequency, is IL (Illinois).

Question 3

Look at the summary statistics for the variable popwhite. Just based on looking at these numbers, describe the likely distribution of the variable. That is, in your own words and in 2-3 sentences, describe the range and the likely skewness of the variable (if any). Make sure to explain why you think the variable is/isn’t skewed a certain way based on just these summary statistics numbers. [4 marks]
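As a rough sketch of how skewness can be read from summary statistics alone, one option is to compare how far the quartiles sit from the median, alongside the usual mean-versus-median comparison. This is not part of the answer key; the variable names below (`q`, `lower_gap`, `upper_gap`, `quartile_skew`) are illustrative assumptions, and the quartile-based coefficient is only one of several possible measures.

```r
library(ggplot2)  # provides the midwest dataset
data("midwest")

# Quartiles of popwhite: 25th percentile, median, 75th percentile
q <- unname(quantile(midwest$popwhite, probs = c(0.25, 0.5, 0.75)))

# Distance from the median down to Q1 and up to Q3
lower_gap <- q[2] - q[1]
upper_gap <- q[3] - q[2]

# Quartile-based skewness: positive when the upper half is more spread out
quartile_skew <- (upper_gap - lower_gap) / (q[3] - q[1])

# Mean minus median: positive when large values pull the mean upward
mean_minus_median <- mean(midwest$popwhite) - q[2]

quartile_skew
mean_minus_median
```

If both quantities are positive, the upper half of the data is more stretched out than the lower half, which is consistent with a right-skewed distribution.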
summary(midwest$popwhite)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     416   18630   34471   81840   72968 3204947

range(midwest$popwhite)

## [1]     416 3204947

# for the range, it's also fine to use max(midwest$popwhite) - min(midwest$pop

Based on the summary statistics, we can observe that the variable has a substantial range, extending from a minimum of 416 white people in a county to a maximum of 3,204,947. This wide range indicates a high degree of variability in the data. The median (34,471) is notably less than the mean (81,840), suggesting a right-skewed distribution, as the mean is pulled towards the higher values by extremely high data points.

Question 4

Now, make a histogram for the popwhite variable. In doing so, make sure that you change the default binwidth, color, and axes labels, and that you give your histogram a relevant title. Also make sure that you save your histogram as an object rather than displaying it directly. If you were to describe the distribution of this variable based on the histogram, would your description change in any way from the previous answer? If yes, in what way? If no, why not? Explain why you think this variable is skewed or not, i.e., think about what the variable measures and explain why it probably looks the way it does. (Note that this last part is not about how you can tell whether the variable is skewed or not but, rather, asks you to think about what the variable measures and why it makes sense, or doesn’t make sense, that the variable is skewed or not in the way that it is.) [6 marks]

# create histogram for 'popwhite'
popwhite_hist <- ggplot(data = midwest, aes(x = popwhite)) +
  geom_histogram(binwidth = 100000, color = "purple", fill = "skyblue") +
  labs(x = "Number of white people",
       y = "Frequency",
       title = "Histogram of white people in every county")
popwhite_hist
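With a binwidth of 100,000, most counties fall into the first bin or two, so the shape of the bulk of the data is hard to see. As an optional sketch, not part of the answer key, the same variable can be drawn on a log10 x-axis to spread out the smaller counties. The object name `popwhite_log_hist`, the choice of `bins = 30`, and the title are assumptions made for illustration.

```r
library(ggplot2)
data("midwest")

# Same variable as above, but on a log10 x-axis.
# binwidth would be in log10 units here, so the number of bins is set instead.
popwhite_log_hist <- ggplot(data = midwest, aes(x = popwhite)) +
  geom_histogram(bins = 30, color = "purple", fill = "skyblue") +
  scale_x_log10() +
  labs(x = "Number of white people (log10 scale)",
       y = "Frequency",
       title = "Histogram of white people in every county (log scale)")

# The plot is stored as an object; print it to display
popwhite_log_hist
```

Because every county has a positive popwhite value, no observations are dropped by the log transformation. The log-scale plot is only a visual aid; the raw-scale histogram remains the one the question asks for.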
[Figure: "Histogram of white people in every county". x-axis: Number of white people; y-axis: Frequency]

Observing the histogram, I would draw the same conclusion as before: the variable is strongly right-skewed. As the figure shows, most counties have up to 200,000 white people, with some having more but still under approximately 1 million. There are a few large outliers, especially the maximum value, which give the distribution a very long tail. Thus the description of the variable would be similar to before.

As for why the variable looks this way, that is likely because it is a raw count of the number of white people in a given county, without taking into account the size of the county or its overall population. In other words, geographically larger counties or densely populated counties (such as counties containing big cities) will have a higher number of people living there and therefore also a high number of white people. However, there won’t be very many such counties, as not every county has a huge urban center in it, so the dataset