assignment-4

docx

School

Brown University

Course

2680

Subject

Statistics

Date

Feb 20, 2024

Type

docx


GPHP 2000: Assignment #4: Hypothesis Testing

Sonam Christopher

Due: December 3, 2022

Homework Policies: All answers must be in complete sentences and all graphs must be properly labeled.

For the PDF Version of this assignment: PDF
For the R Markdown Version of this assignment: RMarkdown
Turning the Homework in: Assignment #4 Link

We have some clinical data from a diabetes study. This is a bit larger than some datasets you have been using in the course so far but more realistic of what you might need. This data comes from the UCI Machine Learning Repository. I removed some of the columns to make it more manageable. For some questions below, you will be responsible for filtering out some of the data.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ purrr   0.3.5
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(purrr)
library(rsample)
library(boot)
library(bootstrap)
library(reshape)
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
library(tidyr)
load("~/Desktop/R Studio/diabetes.rda")

1. (2 points) How many observations are in this data?

## There are 101,766 observations in the data.

diabetes %>% group_size()
## [1] 101766

2. (2 points) How many patients are in this data?

## There are 71,518 patients in this dataset.

diabetes %>% group_by(patient_nbr) %>% n_groups()
## [1] 71518

3. (4 points) What is the highest number of visits for a patient? (Use dplyr tools and only print this one patient line.)

## The highest number of visits for a patient is 40. To be honest, as a nurse I have seen a lot higher than this.

diabetes %>%
  group_by(patient_nbr) %>%
  count(patient_nbr) %>%
  arrange(desc(n)) %>%
  head(n = 1)
## # A tibble: 1 × 2
## # Groups:   patient_nbr [1]
##   patient_nbr     n
##   <fct>       <int>
## 1 88785891       40

When a patient is admitted to a hospital, their admission is generally categorized according to the severity or urgency of their symptoms (e.g., “Elective”, “Emergency”, etc.). We are interested in the different types of hospital admissions for diabetic patients. We will use the data in the file diabetes data clean.csv to answer several questions related to hospital admissions. Use the appropriate statistical tests, plots, or tables to address the questions below.

4. (10 points) Does it appear that admissions type differs by gender?
– Display appropriate tables or plots
– Perform a hypothesis test.
– Interpret the results.

Men and women can perceive and respond to health and wellness differently. A simple chart of admission types by gender shows little difference in the number of admissions between men and women, except for emergency, elective, and urgent admissions. The emergency and urgent categories sound similar; I am assuming that emergency means admission through the emergency room and urgent means admission from urgent care, but both reflect that the admission was unexpected. Looking at the percentages, emergency admissions are higher among women by about 2 percentage points and elective admissions are higher among men by about 1 percentage point.

The hypotheses we are testing are:

Null: There is no difference in admission types between men and women.
Alternative: There is a difference in admission types between men and women.

The chi-squared test on this categorical data gives X-squared = 38.533 with 12 degrees of freedom. Under the null hypothesis the statistic follows a chi-squared distribution with 12 degrees of freedom, and a value this large is rare. The p-value of 0.0001256 is very small, so the differences in admission type by gender are unlikely to be due to randomness alone, but we need to look at the data and the sample design more closely. The p-value alone does not prove that the alternative hypothesis is true.
diabetes1 <- diabetes %>%
  filter(gender == "Male" | gender == "Female") %>%
  group_by(gender, admission_type) %>%
  summarise(n = n()) %>%
  mutate(freq = (n / sum(n)) * 100)
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
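Because `summarise()` leaves the result grouped by gender, the `freq` column above is a percentage within each gender. The same idea can be sketched in base R with `prop.table()`. The data frame `visits` below is made up for illustration and is not the diabetes data set.

```r
# Toy data (made up for illustration; not the diabetes data set)
visits <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  admission_type = c("Emergency", "Emergency", "Elective",
                     "Emergency", "Elective", "Elective", "Urgent")
)

# Counts of each admission type within each gender
counts <- table(visits$gender, visits$admission_type)

# Row percentages: margin = 1 divides by row totals, so each gender sums to 100
row_pct <- prop.table(counts, margin = 1) * 100
round(row_pct, 1)
```

Row percentages make the two genders comparable even when one group has more admissions overall.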

ggplot(diabetes1, aes(admission_type, y = freq, fill = gender)) +
  geom_bar(position = position_stack(), stat = "identity") +
  geom_text(aes(label = paste0(round(freq, 0), "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  scale_y_continuous(labels = function(x) paste0(x, "%"))

ggplot(diabetes1, aes(admission_type, freq, fill = gender)) +
  geom_col(position = "dodge")

ggplot(diabetes1, aes(gender, n, fill = admission_type)) +
  geom_bar(position = position_stack(), stat = "identity")
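The null and alternative hypotheses stated for question 4 correspond to a chi-squared test of independence. The following is a minimal sketch of how the pieces of such a test can be inspected in base R. The counts are made up for illustration, and the object names are assumptions, not part of the assignment.

```r
# Made-up counts for illustration only (not the diabetes data set)
counts <- matrix(
  c(520, 180, 300,
    480, 220, 300),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"),
                  admission_type = c("Emergency", "Elective", "Urgent"))
)

result <- chisq.test(counts, correct = FALSE)

result$statistic   # X-squared
result$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value

# Expected counts under independence. chisq.test() warns that the
# approximation may be incorrect when any expected count is below 5.
result$expected
min(result$expected)
```

Checking `result$expected` is a quick way to see which cells are too sparse for the chi-squared approximation; rare categories could be combined or dropped before testing.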

diabetes2 <- table(diabetes$gender, diabetes$admission_type)
chisq.test(diabetes2, correct = F)
## Warning in chisq.test(diabetes2, correct = F): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  diabetes2
## X-squared = 38.533, df = 12, p-value = 0.0001256

5. (10 points) Consider the hospital admissions of Elective and Emergency categories. Do males and females differ with respect to these?
– Display appropriate tables or plots
– Perform a hypothesis test.
– Interpret the results.

Both men and women had more emergency admissions than elective admissions. Women were admitted emergently at a slightly higher rate than men, and men came in for elective admissions at a slightly higher rate than women. For a test with 1 degree of freedom, the square root of the chi-squared statistic behaves like a standard normal value, and here it is more than 5 standard deviations from zero. Together with the very small p-value of 1.361e-08, we can rule out random error as a plausible explanation and conclude that there is a difference between the genders in the Emergency and Elective admission types.

diabetes3 <- diabetes %>%
  filter(admission_type == "Elective" | admission_type == "Emergency") %>%
  filter(gender == "Male" | gender == "Female") %>%
  droplevels()

diabetessum <- diabetes3 %>%
  group_by(gender, admission_type) %>%
  summarise(n = n()) %>%
  mutate(freq = (n / sum(n)) * 100)
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

# The y axis shows counts; the bar labels show within-gender percentages
ggplot(diabetessum, aes(gender, n, fill = admission_type)) +
  geom_bar(position = position_stack(), stat = "identity") +
  geom_text(aes(label = paste0(round(freq, 0), "%")),
            position = position_stack(vjust = 0.5), size = 3)

# The y axis shows counts; the bar labels show within-gender percentages
ggplot(diabetessum, aes(admission_type, n, fill = gender)) +
  geom_bar(position = position_stack(), stat = "identity") +
  geom_text(aes(label = paste0(round(freq, 0), "%")),
            position = position_stack(vjust = 0.5), size = 3)
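For a 2 x 2 table, as in question 5, the uncorrected chi-squared statistic equals the square of the pooled two-proportion z statistic. That is why its square root can be read on a standard normal scale. A small sketch with made-up counts (not the diabetes data) checks this numerically.

```r
# Made-up 2 x 2 counts for illustration only (not the diabetes data set)
counts <- matrix(
  c(300, 700,
    360, 640),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"),
                  admission_type = c("Elective", "Emergency"))
)

chi <- chisq.test(counts, correct = FALSE)

# Pooled two-proportion z statistic for the share of Elective admissions
n  <- rowSums(counts)
p  <- counts[, "Elective"] / n
p0 <- sum(counts[, "Elective"]) / sum(n)
z  <- (p[1] - p[2]) / sqrt(p0 * (1 - p0) * (1 / n[1] + 1 / n[2]))

unname(chi$statistic)  # X-squared
unname(z)^2            # the same value, because df = 1
```

This equivalence holds only for one degree of freedom; with larger tables the statistic has to be compared against the chi-squared distribution with the matching degrees of freedom.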

D <- table(diabetes3$gender, diabetes3$admission_type)
chisq.test(D, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  D
## X-squared = 32.243, df = 1, p-value = 1.361e-08

6. (10 points) We are interested whether patients differ in the number of medications based on whether their admission type was Elective or Emergency. Using the appropriate statistical test, determine whether or not patients coming in electively or in the case of an emergency tend to receive the same number of medications.

## From the chi-squared test, the small p-value indicates a relationship between the number of medications and the admission type. With 74 degrees of freedom, the statistic of 2441.7 is far larger than we would expect under the null hypothesis, so we reject the null hypothesis that there is no difference in the number of medications between elective and emergency admissions. The average number of medications was about 16.2, and with our confidence interval of 16.15776 to 16.29569 we can be confident that 95% of intervals constructed like this will cover the true mean number of medications.
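The answer above uses a chi-squared test. A commonly used alternative for comparing the average of a numeric variable between two groups is Welch's two-sample t-test. The sketch below shows it on simulated data. The data frame `meds` and its values are assumptions for illustration, and this is not the method used in the write-up.

```r
set.seed(1)  # simulated data for illustration; not the diabetes data set

# Hypothetical medication counts for two admission types
meds <- data.frame(
  admission_type = rep(c("Elective", "Emergency"), times = c(200, 300)),
  num_medications = c(rpois(200, lambda = 18), rpois(300, lambda = 16))
)

# Welch two-sample t-test (the t.test() default; no equal-variance assumption)
res <- t.test(num_medications ~ admission_type, data = meds)

res$estimate   # mean number of medications in each group
res$conf.int   # 95% CI for the difference in means (Elective minus Emergency)
res$p.value
```

A t-test reports the size and direction of the difference in means along with the p-value, which a chi-squared test on the cross-tabulated counts does not.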

qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964

admitmed <- diabetes %>%
  group_by(admission_type, num_medications) %>%
  filter(admission_type == "Elective" | admission_type == "Emergency") %>%
  droplevels()

NumMED <- table(admitmed$admission_type, admitmed$num_medications)
chisq.test(NumMED, correct = FALSE)
## Warning in chisq.test(NumMED, correct = FALSE): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  NumMED
## X-squared = 2441.7, df = 74, p-value < 2.2e-16

meanmed <- mean(admitmed$num_medications)
std.dev <- sd(admitmed$num_medications)
n <- length(admitmed$num_medications)
# Note: 2.26 is the t multiplier for 9 degrees of freedom; for a sample this
# large, qt(0.975, n - 1) is about 1.96, which gives a narrower interval
low <- meanmed - 2.26 * std.dev / sqrt(n)
high <- meanmed + 2.26 * std.dev / sqrt(n)
low
## [1] 16.15776
high
## [1] 16.29569

7. (8 points) What are the assumptions of the statistical test we use in 6? Are we confident that these assumptions are satisfied?

## From the chi-squared test above, the small p-value indicates a relationship between the number of medications and the admission type. The average number of medications was about 16.2, and with our confidence interval of 16.15776 to 16.29569 we can be confident that 95% of intervals constructed like this will cover the true mean number of medications. The chi-squared statistic is also very large relative to its degrees of freedom, which is strong evidence of a difference between the two groups.

– Use bootstrap confidence intervals and compare them to the hypothesis test and t-distribution confidence intervals in 6.

Our t-distribution confidence interval is 16.15776 to 16.29569, with a mean of 16.22. Bootstrapping gives a confidence interval of 16.170 to 16.283, which is very close to our t-distribution confidence interval.

admitmed1 <- diabetes %>%
  filter(admission_type == "Elective" | admission_type == "Emergency") %>%
  select(num_medications)

admitmed1 %>% summarise(mean(num_medications, na.rm = TRUE))
## # A tibble: 1 × 1
##   `mean(num_medications, na.rm = TRUE)`
##                                   <dbl>
## 1                                  16.2

bt_data <- bootstraps(admitmed1, times = 1000)
bt_data$splits[[1]]
## <Analysis/Assess/Total>
## <72859/26738/72859>

# Mean number of medications in one bootstrap resample
get_mean <- function(split) {
  split_data <- analysis(split)
  mean(split_data$num_medications, na.rm = TRUE)
}

bt_data$bt_means <- map_dbl(bt_data$splits, get_mean)
btmed_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
btmed_ci
##   2.5%  97.5% 
## 16.170 16.283

ggplot(bt_data, aes(x = bt_means)) +
  geom_line(stat = "density")
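The comparison between a percentile bootstrap interval and a t-based interval can also be sketched in base R without `rsample`. The vector `x` below is simulated for illustration and is not the diabetes data; `qt()` supplies the t multiplier for the given sample size.

```r
set.seed(42)  # simulated data for illustration; not the diabetes data set
x <- rpois(500, lambda = 16)  # hypothetical medication counts

# Percentile bootstrap: resample with replacement, recompute the mean each time
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# t-based interval for comparison; qt() gives the multiplier for n - 1 df
n <- length(x)
t_ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)

boot_ci
t_ci
```

With a large sample the two intervals are usually close, since the sampling distribution of the mean is approximately normal and the t multiplier approaches 1.96.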

Please display your code at the end of this entire question so you may receive partial credit on it.

8. (15 points) Note that the data in diabetes data has patients who had repeat visits to the hospital. We know this to be true because patient_nbr is the unique patient identifier, and there are instances of recurrence. We would like to see whether the number of medications changes over time. In particular, for those patients who had 2 (or more) visits, we want to see whether the number of medications tends to be different from the first visit to second visit. Use the appropriate statistical test to investigate this.

The t statistic is -5.1459 with 16772 degrees of freedom and a p-value of 2.692e-07. The 95% confidence interval for the mean difference is -0.5204836 to -0.2333470. The mean difference (first visit minus second visit) is -0.3769153, so patients received on average about 0.38 more medications at their second visit than at their first. Because the p-value is far below 0.05 and the confidence interval does not include 0, we reject the null hypothesis of no change: the number of medications differs between the 1st and 2nd visit, although the average change is small. If we ran more tests, for example comparing the first visit with the 6th or 10th visit, we might see more of an increase.

diabetes5 <- diabetes %>% arrange(encounter_id)

# One row per patient with 2 or more visits: medications at first and second visit
diabetes_wide <- diabetes5 %>%
  add_count(patient_nbr) %>%
  filter(n > 1) %>%
  group_by(patient_nbr) %>%
  summarise(n_medsA = num_medications[1],
            n_medsB = num_medications[2])

t.test(diabetes_wide$n_medsA, diabetes_wide$n_medsB, paired = TRUE)
## 
##  Paired t-test
## 
## data:  diabetes_wide$n_medsA and diabetes_wide$n_medsB
## t = -5.1459, df = 16772, p-value = 2.692e-07
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.5204836 -0.2333470
## sample estimates:
## mean difference 
##      -0.3769153

Note that this problem requires a fair bit of data wrangling to get the data prepared. So consider these hints:
• We can assume encounter id is a variable that uniquely identifies each hospital admission.
• We further assume that encounter IDs are only increasing, so if a given patient has two encounters, the first ID will always be less than the second ID.
• The shape of this data set is typically called “long format”; the patients have repeat measurements recorded as new rows. You will want to get these data into “wide format”. There are good examples you can find by Google-ing.
• Consider using the dplyr package. It’s not necessary, but it might help.
• The data reshaping is definitely tricky, don’t feel bad if you struggle a bit. It’s important to get practice with this, because for many of us, data cleaning and reshaping is a big part of the work we have to do prior to model fitting.
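The long-to-wide step described in the hints can be sketched in base R as follows. The toy data frame `visits` is made up for illustration; only its column names follow the assignment. It relies on the stated assumption that encounter IDs increase over time.

```r
# Toy long-format data (made up): one row per hospital encounter
visits <- data.frame(
  encounter_id    = c(101, 102, 103, 104, 105, 106, 107, 108),
  patient_nbr     = c("A", "B", "A", "C", "B", "A", "D", "D"),
  num_medications = c(10, 14, 12, 9, 15, 11, 20, 23)
)

# Order by encounter so each patient's rows run from first to last visit
visits <- visits[order(visits$encounter_id), ]

# Number each patient's visits: 1, 2, 3, ...
visits$visit <- ave(visits$encounter_id, visits$patient_nbr, FUN = seq_along)

# Keep the first two visits and spread them into one row per patient
first_two <- visits[visits$visit <= 2,
                    c("patient_nbr", "visit", "num_medications")]
wide <- reshape(first_two, idvar = "patient_nbr", timevar = "visit",
                direction = "wide")

# Drop patients with only one visit (no second measurement)
wide <- wide[!is.na(wide$num_medications.2), ]
wide

# Paired t-test on first versus second visit
t.test(wide$num_medications.1, wide$num_medications.2, paired = TRUE)
```

The paired test works on the within-patient differences, so each patient serves as their own control. Patient C, who has a single visit in the toy data, is excluded before testing.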