Problem Set 1

School

University of Texas, Dallas

Course

6359

Subject

Statistics

Date

Apr 3, 2024

Uploaded by CaptainChimpanzeeMaster771

Question 1

Part a

For luxury hotels:
Mean = Sum of all values / Number of values
Therefore, Mean = (275+280+220+250+220+225)/6 = 1470/6 = 245

Standard Deviation = σ = √(Σ (xi - μ)² / N)
Where:
σ represents the (population) standard deviation.
Σ represents summation over all data points.
xi represents each data point.
μ represents the mean (average) of the dataset.
N represents the total number of data points.
Here μ = 245 and N = 6.
Therefore, Standard Deviation = √(((275-245)² + (280-245)² + (220-245)² + (250-245)² + (220-245)² + (225-245)²)/6) = √(3800/6) ≈ 25.17

For budget hotels:
Mean = Sum of all values / Number of values
Therefore, Mean = (70+70+69+65+62+75+70+70+60)/9 = 611/9 ≈ 67.89

Using the same formula, σ = √(Σ (xi - μ)² / N), with μ = 67.89 and N = 9:
Standard Deviation = √(((70-67.89)² + (70-67.89)² + (69-67.89)² + (65-67.89)² + (62-67.89)² + (75-67.89)² + (70-67.89)² + (70-67.89)² + (60-67.89)²)/9) ≈ √(174.89/9) ≈ 4.41

Part b

Luxury hotels display greater price variability than budget hotels due to two main factors:
1. Diverse Luxury Amenities: Luxury hotels offer a wide range of amenities, so prices vary with the specific amenities included. In contrast, budget hotels provide more standardized services.
2. Dynamic Influences: Pricing in luxury hotels is influenced by factors like occupancy rates and special events, causing rates to fluctuate. Budget hotels typically maintain more stable pricing structures.

Part c

R code:
    # Prices for the 6 luxury hotels followed by the 9 budget hotels
    Hotels <- data.frame(
      Price = c(275, 280, 220, 250, 220, 225, 70, 70, 69, 65, 62, 75, 70, 70, 60),
      Type = c(rep("Luxury", 6), rep("Budget", 9))
    )
    head(Hotels)
    luxury_hotels <- Hotels$Price[Hotels$Type == "Luxury"]
    budget_hotels <- Hotels$Price[Hotels$Type == "Budget"]
    # Calculate the mean and standard deviation for Luxury and Budget hotels
    # (sd() uses the sample formula, dividing by n - 1 rather than n)
    mean_luxury <- mean(luxury_hotels)
    std_dev_luxury <- sd(luxury_hotels)
    mean_budget <- mean(budget_hotels)
    std_dev_budget <- sd(budget_hotels)
Output:

Part d

R code:
    library(ggplot2)  # provides ggplot(), aes(), geom_boxplot() and labs()
    boxplot <- ggplot(Hotels, aes(x = Type, y = Price)) +
      geom_boxplot() +
      labs(
        x = "Hotel Type",
        y = "Price",
        title = "Price Comparison Between Luxury and Budget Hotels"
      )
    print(boxplot)
Output:

Question 2

Introducing a $1,000 bonus to each employee's salary does not alter the standard deviation. The standard deviation reflects the degree of dispersion among the data points in a dataset. When a fixed amount is added to every data point, the entire dataset (including its mean) shifts by the same value, so each point's deviation from the mean is unchanged. Consequently, the spread or variability of the data is unaffected by this uniform adjustment.

Question 3

Discrete Variables:

Gender
Gender is a discrete categorical variable: it represents distinct categories or groups (e.g., male and female) and is not measured on a continuous scale.

Ethnicity
Ethnicity is also a discrete categorical variable, with distinct categories (e.g., African American, Asian, Hispanic) and no continuous measurement.

Heart Rate (bpm - beats per minute)
Heart rate is a discrete numerical variable because it represents a count of heartbeats within a specific time interval. Heart rate values are typically whole numbers (integers).
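Returning to Questions 1 and 2, the hand calculation and the R functions can be checked against each other. The sketch below is an illustration only: `pop_sd` is a helper name introduced here (it is not part of base R or of the original solution), and the salary figures are made-up numbers. It shows that the Part a formula divides by N while `sd()` divides by N - 1, and that adding a constant leaves the standard deviation unchanged.

```r
# Hotel prices from Question 1
luxury <- c(275, 280, 220, 250, 220, 225)
budget <- c(70, 70, 69, 65, 62, 75, 70, 70, 60)

# Population standard deviation (divides by N), matching the Part a formula.
# pop_sd is a hypothetical helper defined only for this illustration.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

pop_sd_luxury <- pop_sd(luxury)   # about 25.17
pop_sd_budget <- pop_sd(budget)   # about 4.41

# sd() in base R divides by N - 1 (sample standard deviation)
sample_sd_luxury <- sd(luxury)    # about 27.57
sample_sd_budget <- sd(budget)    # about 4.68

# Question 2: a constant bonus shifts every value but not the spread.
# These salaries are made-up numbers, not data from the problem set.
salaries <- c(42000, 55000, 61000, 48000, 73000)
sd_before <- sd(salaries)
sd_after <- sd(salaries + 1000)
same_spread <- isTRUE(all.equal(sd_before, sd_after))
print(same_spread)
```

The gap between the two versions of the standard deviation shrinks as the number of observations grows, which is why it is most visible in small samples like these.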
Continuous Variables:

Age (years)
Age is a continuous numerical variable. It is measured on a continuous scale, so decimal values between measurements are possible, and it can vary continuously within a certain range.

Height (meters)
Height is a continuous numerical variable. Like age, it is measured on a continuous scale and can take decimal values within a certain range.

Weight (kilograms)
Weight is a continuous numerical variable. It is measured on a continuous scale and can take decimal values within a certain range.

Blood Pressure (mmHg)
Blood pressure is a continuous numerical variable. It is measured on a continuous scale with decimal values and can vary continuously within a specific range.

Question 4

Part a

The boxplots show that all three sites share a similar median value for the percentage of chemical Z, approximately 7%. What sets them apart is the variability, or spread, of the values. Sites I and III exhibit a wider spread in the percentage of chemical Z. In contrast, site II shows a much narrower spread, suggesting that most of its values cluster closely around the median.

Part b

Subpart I
This information most likely comes from Site III. While values beyond the range of the sampled data could exist, calculations from the available data give approximate minimum and maximum total percentages for the three chemicals, as demonstrated in the table below. Notably, Site III is the only site for which 20.5 falls between the sum of the minimum values and the sum of the maximum values.

Subpart II
Chemical Y appears to be the most beneficial, because the distribution of its total weight percentages is distinct across the three sites. The distributions of Chemicals X and Z, on the other hand, exhibit significant overlap.

Question 5

Part a

To inspect the nature of the given data, we use the following commands in R:
    gss2014 <- readRDS("C:/Users/prath/Downloads/gss2014 (1).rds")
    education <- gss2014$EDUC
    # na.rm = TRUE ignores missing values when computing each statistic
    mean_education <- mean(education, na.rm = TRUE)
    median_education <- median(education, na.rm = TRUE)
    sd_education <- sd(education, na.rm = TRUE)

The above commands give the following results:

In summary, these results suggest that, on average, American adults in the dataset had a reasonably high level of education (mean = 13.699 years). There is some variability in education levels, as indicated by the standard deviation (3.07128), meaning that the dataset includes individuals with a range of education levels. The median (14) describes the center of the distribution: approximately half of the individuals had 14 or fewer years of education and half had 14 or more.

Part b

In the previous part, we used education <- gss2014$EDUC to extract the years of education. Continuing the code, we can further write:
    education <- na.omit(education)
    hist(education, breaks = seq(0, max(education) + 10, by = 10),
         main = "Education Levels in 2014",
         xlab = "Education Level", ylab = "Frequency", col = "blue")
We use na.omit() to remove the missing values from the education vector (the EDUC column) before plotting.
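The missing-value handling and the choice of histogram bins in Question 5 can be tried out on a small vector. This is a minimal sketch under stated assumptions: `education_demo` holds made-up values, not the GSS data, and the 2-year bin width is just one alternative to the 10-year bins used in Part b. Passing `plot = FALSE` to `hist()` returns the bin counts without drawing anything.

```r
# Made-up years-of-education values (NOT the GSS data), with two missing entries
education_demo <- c(12, 16, NA, 14, 12, 18, 8, 13, NA, 16, 20, 12)

# Without na.rm = TRUE, the mean of a vector containing NA is NA
mean_with_na <- mean(education_demo)
mean_demo <- mean(education_demo, na.rm = TRUE)
median_demo <- median(education_demo, na.rm = TRUE)

# na.omit() drops the missing entries, so later calls need no na.rm argument
education_clean <- as.numeric(na.omit(education_demo))

# Bin counts for 10-year bins (as in Part b) and for 2-year bins,
# computed without drawing the plots
wide_bins <- hist(education_clean, breaks = seq(0, 30, by = 10), plot = FALSE)
narrow_bins <- hist(education_clean, breaks = seq(0, 20, by = 2), plot = FALSE)
print(wide_bins$counts)
print(narrow_bins$counts)
```

With 10-year bins almost every value lands in a single bar, so a narrower width may show the shape of the distribution more clearly.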
Question 6

Part a

Calculating the mean and median of INCOME:
    mean_income <- mean(gss2014$INCOME, na.rm = TRUE)
    median_income <- median(gss2014$INCOME, na.rm = TRUE)
This gives us the following result:

Therefore,
Mean income = 10.96*10000 (approximately)
Median income = 120000
Checking whether the annual income distribution is symmetric:
    hist(gss2014$INCOME, main = "Distribution of Annual Incomes in 2014",
         xlab = "Annual Income", ylab = "Frequency", col = "blue")
This gives us the following result (graph):

The histogram clearly shows that the annual income distribution is not symmetric. This agrees with the calculation above, where the mean lies below the median.

Part b

R code:
    # Range (max - min), variance and standard deviation of INCOME, ignoring missing values
    income_range <- max(gss2014$INCOME, na.rm = TRUE) - min(gss2014$INCOME, na.rm = TRUE)
    income_variance <- var(gss2014$INCOME, na.rm = TRUE)
    income_sd <- sqrt(income_variance)
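The Part a and Part b calculations can be sketched on a small vector to see how the pieces relate. The values in `income_demo` are made up for illustration and are not the GSS data. Comparing the mean with the median is only a rule of thumb for skewness, not a formal test.

```r
# Made-up income values (NOT the GSS data), with one missing entry
income_demo <- c(50, 48, 52, 12, 49, 51, NA, 30, 50, 47)

income_mean <- mean(income_demo, na.rm = TRUE)
income_median <- median(income_demo, na.rm = TRUE)

# Rule of thumb: a mean below the median hints at a longer left tail,
# a mean above the median at a longer right tail
mean_below_median <- income_mean < income_median

# range() returns c(min, max); diff() turns that pair into max - min
income_range <- diff(range(income_demo, na.rm = TRUE))

income_variance <- var(income_demo, na.rm = TRUE)
# The standard deviation is the square root of the variance, so these agree
sd_matches <- isTRUE(all.equal(sqrt(income_variance),
                               sd(income_demo, na.rm = TRUE)))
print(c(mean = income_mean, median = income_median, range = income_range))
```

Here `diff(range(...))` gives the same value as subtracting `min()` from `max()`; either form works.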
Part c

Create a scatter plot:
    plot(gss2014$EDUC, gss2014$INCOME,
         xlab = "Years of Education (EDUC)",
         ylab = "Annual Income (INCOME)",
         main = "Scatter Plot: Income vs. Education")
Result:

Calculation of correlation coefficient:
    correlation_coefficient <- cor(gss2014$EDUC, gss2014$INCOME, use = "complete.obs")

"complete.obs": When you specify use = "complete.obs", the cor() function calculates the correlation coefficient using only the pairs of observations (data points) where both variables have non-missing values. Any rows in the data frame where either INCOME or EDUC is missing are excluded from the calculation.

Since |correlation coefficient| < 0.3, the variables are said to be weakly correlated.
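The effect of use = "complete.obs" can be checked by filtering the rows by hand. The sketch below uses made-up paired values, not the GSS data, and the names `educ_demo`, `income_pairs` and `keep` are introduced only for this illustration.

```r
# Made-up paired values (NOT the GSS data); NA marks a missing response
educ_demo    <- c(12, 16, NA, 14, 18, 10, 12, 20)
income_pairs <- c(30, 52, 41, NA, 60, 28, 35, 58)

# use = "complete.obs" keeps only the rows where both variables are present
r_complete <- cor(educ_demo, income_pairs, use = "complete.obs")

# The same result, obtained by selecting the complete rows explicitly
keep <- complete.cases(educ_demo, income_pairs)
r_manual <- cor(educ_demo[keep], income_pairs[keep])
same_result <- isTRUE(all.equal(r_complete, r_manual))

# With the default use = "everything", any missing value makes the result NA
r_default <- cor(educ_demo, income_pairs)
print(c(complete = r_complete, manual = r_manual, default = r_default))
```

Six of the eight rows survive the filter here, so both correlation values are computed from those six pairs.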