ST314D F23-Data-Analysis-03

School: Oregon State University, Corvallis
Course: 314
Subject: Statistics
Date: Apr 3, 2024
Type: docx
Uploaded by lucekimb

© F2023 Intellectual Property of Kelsi Espinoza

ST314 Data Analysis #3

Question 1. (6 Points) Suppose a local Corvallis survey asks: “Do you think that adults that refuse to get the 2023’s flu vaccine should be subject to a fine?”

a. (2 points) How might only offering “yes” or “no” as possible answers impact the responses the researchers receive?

Some respondents might feel unsure or conflicted about the issue. With only two polar options, these respondents are forced to choose a side, potentially skewing the results.

b. (2 points) In general, what could be some issues with this question and how it is worded?

The question presupposes that a fine is the primary or only solution to consider. This narrows the scope of the conversation and might not capture the full range of public opinion on how to address those who don't get the flu vaccine.

c. (2 points) Suppose the survey first asked, “When was the last time that you got a flu vaccine?” How might the response change if this was asked before asking the question about fines?

Respondents would first reflect on their personal behavior. Those who have recently gotten a flu vaccine might feel more justified in, or inclined toward, supporting fines for others who haven't. Conversely, those who haven't gotten the vaccine recently might feel defensive and be less likely to support fines.

Question 2. (9 Points) Investigators from PublicOpinions.com want to explore national voter opinions on providing healthcare to all (US) Americans following the pandemic. They post, from the PublicOpinions.com Snapchat account, an advertisement asking followers the following question: “Should everyone, regardless of employment status, have access to healthcare? Visit our website at PublicOpinions.com and fill out the survey by 11:59pm (Pacific time) and we’ll post the results to Snapchat.”

a. (3 points) What population are the investigators trying to sample, and do they obtain a representative sample from this sampling scheme? Explain.

The population the investigators are trying to sample is US voters nationwide, whose opinions they want on providing healthcare to all Americans following the pandemic. However, the sampling scheme is unlikely to produce a representative sample of that population, because followers of the Snapchat account may differ from voters as a whole in age, other demographics, political leanings, and technology usage.

b. (2 points) Specifically, what type of sampling scheme has been used?

Convenience sampling.

c. (2 points) What type of sampling bias(es) may have occurred? Explain.

Demographic (undercoverage) bias: Snapchat users tend to be younger. This could result in an underrepresentation of older voters' opinions.

Non-response (voluntary response) bias: Not everyone who sees the Snapchat ad will respond. The opinions of those who choose not to respond might be systematically different from those who do.

d. (2 points) What else could be a source of issue when it comes to this scenario? Explain.

Time constraints: The survey has a deadline of 11:59 pm Pacific time. This deadline might discourage participation by people in other time zones or by those who see the ad late.

Use the following for Question 3 and Question 4:

In this section, use the R script, DA3_One_Variable_Plots_and_Summary_Stats.R, and the ST314 student survey dataset, ST314_SIS_F23.csv, to explore one categorical variable and one quantitative variable of your choice (excluding “subject” and “emails”). Download the R script and the dataset, open the R script and follow the command instructions. Check out the dataset legend to see what variables represent. Then answer the following questions:

Question 3. (5 points) Categorical Variable

a. (1 point) Choose a categorical variable to explore. Which variable did you choose? Note: “subject” is off-limits because this was my example in the R code. Choose a different categorical variable.

temp_pref

b. (2 points) Paste the table of counts and bar chart for the categorical variable of your choosing. Include color and appropriate title/labels.

c. (2 points) Briefly, describe the distribution in context. Recall, categorical variables are summarized by counts and/or percents.

The distribution of responses favors “too cold”: about 70 students chose “too cold”, roughly three times the approximately 23 who chose “too hot”.

Question 4. (15 points) Quantitative Variable

a. (1 point) Choose a quantitative variable to explore. Which variable did you choose? Note: “emails” is off limits because it was the example in the R code. Choose a different quantitative variable.

States

b. (2 points) Create a histogram of the variable. Include color and an appropriate title on your plot. Paste plot.

c. (2 points) Create a boxplot of the variable. Include color and an appropriate title on your plot. Paste plot.

d. (1 point) Which plot do you prefer to visualize the variable? Why?


I prefer the histogram. It gives a clearer view of the shape of the distribution, showing where values cluster and how far the right tail extends.

e. (2 points) Construct a table that displays: the mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, maximum, and IQR. Do not just copy and paste from R console.

Measure               Value
Mean                  11.65
Standard Deviation    6.61
Minimum               1
1st Quartile (Q1)     7
Median (Q2)           10
3rd Quartile (Q3)     14
Maximum               49
IQR                   7 (Q3 - Q1)

f. (3 points) Use the plots and summary statistics to describe the data in the context of the problem. Include the shape, center and spread in your description. State whether there are any outliers.

Shape: The data appear slightly right-skewed, with most values clustering on the left side and a few higher values extending to the right.

Center: The mean is 11.65 and the median is 10. The mean being slightly larger than the median indicates that it is pulled up by the higher values.

Spread: The data range from a minimum of 1 to a maximum of 49. The interquartile range, which captures the middle 50% of the data, stretches from 7 to 14, a spread of 7.

Outliers: Given that the IQR is 7, any data point below Q1 - 1.5 × IQR (7 - 10.5 = -3.5) or above Q3 + 1.5 × IQR (14 + 10.5 = 24.5) would be considered an outlier. Values such as 49, 31, 30, and 29 are therefore outliers. There are no low outliers, since the minimum of 1 is above -3.5.

g. (2 points) Given the shape of the data, which measure of center: the mean, the median, or either, would be more appropriate to represent the center of the data? Explain your reasoning.

Because the data are slightly right-skewed with outliers on the right side, the median would be the more appropriate measure of center. The median is more resistant to outliers and skew than the mean.
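
The 1.5 × IQR rule used in part f can be checked with a short calculation. The sketch below is written in Python rather than the course's R script, and the function name `tukey_fences` is a hypothetical helper introduced only for illustration. It uses the reported Q1 = 7 and Q3 = 14 and tests only the four values named in the answer, since the full dataset is not reproduced here.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported in the summary table for part e.
lower, upper = tukey_fences(q1=7, q3=14)

# Only the values named in the written answer; not the full dataset.
named_values = [49, 30, 31, 29]
flagged = [x for x in named_values if x < lower or x > upper]

print(f"lower fence = {lower}, upper fence = {upper}")
print(f"flagged as outliers: {flagged}")
```

With these inputs the fences come out to -3.5 and 24.5, matching the arithmetic in part f, and all four named values fall above the upper fence.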

h. (2 points) Given the shape of the data, which measure of spread: the standard deviation, the IQR, or either, would be more appropriate to represent the spread of the data? Explain your reasoning.

Given the right skew and the presence of outliers, the IQR would be the more appropriate measure of spread for this data set. The IQR measures the spread of the middle 50% of the data, making it less sensitive to outliers than the standard deviation.
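
The resistance argument in parts g and h can be illustrated with a small experiment. The sketch below is in Python rather than the course's R script. The values in `base` are hypothetical, not taken from ST314_SIS_F23.csv; a single large value (49, the reported maximum) is appended to see how each summary reacts. The helper name `summarize` is also an assumption made for this example.

```python
import statistics


def summarize(values):
    """Return mean, median, sample standard deviation, and IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }


# Hypothetical values for illustration only; not the survey data.
base = [5, 7, 8, 10, 10, 12, 14, 15]
before = summarize(base)
after = summarize(base + [49])  # append one large value

for name in ("mean", "median", "sd", "iqr"):
    print(f"{name:>6}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this toy example the median does not move at all and the IQR changes modestly, while the mean and especially the standard deviation shift noticeably, which is the behavior the answers to g and h rely on.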
