School: CUNY Queens College

Course: 205

Subject: Economics

Date: Jan 9, 2024

Unit 4 Homework: Data Distributions (42 points)

Zander Guadalupe

For this homework, we will use R built-in data. R comes with several built-in data sets related to the 50 states of the United States of America. Professor XU has combined these data sets into a single CSV file named “us_states.csv”. Below is a list of variables in this data:

name: the full state names.
abb: 2-letter abbreviations for the state names.
region: the geographic region (Northeast, South, North Central, West) that each state belongs to.
division: the geographic division (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific) that each state belongs to.
population: population estimate as of July 1, 1975.
income: per capita income (1974).
illiteracy: illiteracy (1970, percent of population).
life_exp: life expectancy in years (1969–71).
murder: murder and non-negligent manslaughter rate per 100,000 population (1976).
hs_grad: percent high-school graduates (1970).
frost: mean number of days with minimum temperature below freezing (1931–1960) in a capital or large city.
area: land area in square miles.

Exercise 1: Reviewing Variable Labels and Values (14 points)

Let’s start by taking a look at the structure of the U.S. states dataset and what’s included in it. To do this, we use the str() command.

Question 1. How many observations are there in this data set? How many variables? (2 points)

50 observations, 12 variables

Question 2. Which variables are nominal? Which variables are interval or ratio? (12 points)

Nominal: name, abb, region, division
Interval or ratio: population, income, illiteracy, life_exp, murder, hs_grad, frost, area
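
The professor's us_states.csv is not included here, but the variable list matches R's built-in state data sets (state.name, state.abb, state.region, state.division, and the state.x77 matrix). As a rough sketch, assuming the CSV was assembled from those objects with the column names listed above, a similar data frame could be built and inspected as follows. The object name us_states is our own choice, not something taken from the assignment.

```r
# Assumption: this approximates us_states.csv from R's built-in state data;
# the professor's actual file may differ in details.
us_states <- data.frame(
  name       = state.name,
  abb        = state.abb,
  region     = state.region,     # factor with 4 levels
  division   = state.division,   # factor with 9 levels
  population = state.x77[, "Population"],
  income     = state.x77[, "Income"],
  illiteracy = state.x77[, "Illiteracy"],
  life_exp   = state.x77[, "Life Exp"],
  murder     = state.x77[, "Murder"],
  hs_grad    = state.x77[, "HS Grad"],
  frost      = state.x77[, "Frost"],
  area       = state.x77[, "Area"]
)
rownames(us_states) <- NULL   # use plain row numbers instead of state names

str(us_states)   # one line per variable, with its type and first values
dim(us_states)   # 50 12: observations, then variables

# Numeric columns are the candidates for interval/ratio measurement;
# character and factor columns are the nominal ones.
names(us_states)[sapply(us_states, is.numeric)]
names(us_states)[!sapply(us_states, is.numeric)]
```

The storage type is only a hint: deciding whether a numeric column is interval or ratio still depends on whether zero is a meaningful starting point for that variable.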
Exercise 2: Percentages in Tables and Charts (8 points)

In class and in your readings, we’ve covered different measures of dispersion. The most basic is the percentage, which we can read from tables and from charts. In this exercise, we’ll go one step further to characterize the dispersion in a distribution.

Here’s a frequency table for the variable geographic division in the U.S. states dataset, followed by one that reports on percentage and a bar chart.

Question 1. What is the mode of geographic division, and what is the percentage of states that are located in this division? (4 points)

Mountain and South Atlantic, at 16% each.

Question 2. Which geographic division includes the least number of states? What is the percentage of states that are located in this division? (2 points)

Middle Atlantic, at 6%.

Question 3. How would you describe the distribution of geographic divisions in terms of dispersion? Low, medium, or high dispersion? Justify your answer. (2 points)

I would say medium, because all of the percentages fall in about the same range.

Exercise 3: Medians and Quartiles (10 points)

The summary() command provides summary statistics on continuous/numeric variables and reports on the minimum and maximum, the quartiles, and the mean and median. Here we call this command for the variable income:

We’ll pair this text output with a histogram of income so that we can visualize the shape of the distribution.

Finally, we’ll also look at a boxplot for the same variable. You can do this by changing the geom_histogram argument in the command line to geom_boxplot. Because there is only one variable to examine here, R gives us a sideways rendering of the boxplot instead of an up and down one.

Use all three pieces of information (the summary output, the histogram, and the boxplot) to answer the questions in this section.
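
The frequency table for Exercise 2 and the summary output and plots for Exercise 3 do not appear in this copy. A minimal sketch of how they could be regenerated, assuming the built-in state.division factor and the Income column of state.x77 hold the same values as us_states.csv. The ggplot2 calls are only a guess at the kind of command the exercise describes, and the bin count is arbitrary.

```r
# Assumption: built-in state data stand in for us_states.csv.
division <- state.division
income   <- state.x77[, "Income"]

# Exercise 2: counts and percentages of states per division
counts <- table(division)
pct    <- round(100 * prop.table(counts), 1)
print(cbind(count = counts, percent = pct))

modal <- names(counts)[counts == max(counts)]   # division(s) with the most states
least <- names(counts)[counts == min(counts)]   # division(s) with the fewest states
print(modal)
print(least)

# Exercise 3: minimum, quartiles, median, mean, maximum
print(summary(income))
print(diff(range(income)))   # range = maximum - minimum
print(IQR(income))           # interquartile range = Q3 - Q1

# Values a boxplot flags as outliers: beyond 1.5 * (hinge spread) from the box
print(boxplot.stats(income)$out)

# Optional plots; skipped when ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(income = income), aes(x = income))
  print(p + geom_histogram(bins = 15))   # shape of the distribution
  print(p + geom_boxplot())              # same variable as a sideways boxplot
}
```

Comparing the mean with the median in the summary() output is a quick check on skew: a mean below the median points to a longer left tail, and a mean above it to a longer right tail.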
Question 1. Describe the distribution of per capita income. Use the following values in your discussion: range, interquartile range, mean, median, skew, and outliers. (6 points, one for each correct depiction of the keyword)

Range: 3271
Interquartile range: 443
Mean: 4436
Median: 4516
The mean is below the median, so the distribution is skewed to the left. There is only one outlier, at just over 6000.

Question 2. Compare what you learn about the distribution of per capita income from the histogram and the boxplot. Which one do you find more helpful in summarizing the information and why? (4 points; either answer is acceptable as long as they back it up)

I prefer the boxplot because it is more informative and lets me draw my conclusion faster. Even though there are states pulling the distribution to the left, there is still clearly an outlier at the high end of the boxplot.

Exercise 4: Using medians and distributions to compare states in different geographic divisions (10 points)

In this next exercise, we will again use boxplots, this time to compare the per capita income for different geographic divisions.

Question 1. Use the text output on means and standard deviations. Which geographic division had the highest mean per capita income in 1974? Looking at the standard deviation of its mean, how would you describe the dispersion of its per capita income relative to other geographic divisions? (2 points)

Pacific had the highest mean per capita income (5183) in 1974. Its standard deviation (654) is very large compared with the other divisions, so the dispersion of its per capita income is high: its data points vary much more widely around the mean. The shape of the distribution is right skewed.

Question 2. Now compare the income distributions using the boxplot. Which geographic division had the lowest median per capita income in 1974? How was the dispersion of the income distribution in this geographic division compared to other divisions? (3 points)

East South Central has the lowest median. The shape of its distribution is left skewed.

Question 3. Based on the text output and the boxplot, which geographic division had the highest level of income inequality in 1974? And why? (3 points)

The South Atlantic has the highest income inequality, because it has the largest box in the boxplot.

Question 4. Judging by the distribution of per capita income, which geographic division would you choose to live in? And why? (2 points for the reasoning only.
Students can choose any division, but they only receive points for good reasoning.)

I would prefer to live in West North Central because its average income is the second largest, which is good. It also has a lower standard deviation, and its box is small, which means there is less income inequality. There are no outliers.
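
The text output and boxplot that Exercise 4 relies on do not appear in this copy. The sketch below shows one way to compute per-division means, standard deviations, medians, and interquartile ranges, again assuming the built-in state data match us_states.csv. The object name by_division is our own, and the base-R boxplot is a stand-in for whatever plotting command the course actually used.

```r
# Assumption: built-in state data stand in for us_states.csv.
income   <- state.x77[, "Income"]
division <- state.division

# One row per division, one column per statistic
by_division <- sapply(
  list(mean = mean, sd = sd, median = median, iqr = IQR),
  function(f) tapply(income, division, f)
)

# Sort from highest to lowest mean income and round for reading
sorted <- by_division[order(by_division[, "mean"], decreasing = TRUE), ]
print(round(sorted))

# Side-by-side boxplots, drawn horizontally so division names stay readable
op <- par(mar = c(4, 10, 1, 1))   # widen the left margin for the labels
boxplot(income ~ division, horizontal = TRUE, las = 1,
        xlab = "Per capita income (1974)", ylab = "")
par(op)                           # restore the previous margins
```

Sorting by the mean makes Question 1 easy to read off, while the median and iqr columns correspond to the center line and the width of each box, which is what Questions 2 and 3 ask about.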