Hughes,Madeleine_HSCI_190_Homework6

School

Queens University

Course

HSCI 190

Subject

Statistics

Date

Feb 20, 2024

Module 06 Homework Problems

Part 1 – Consolidating Concepts (5 marks)

1. Describe when you would use a simple linear regression and when you would use a Pearson correlation coefficient. Provide your own examples of each (do not use examples from the module). (1 mark)

Both simple linear regressions and Pearson correlation coefficients explore the relationship between two variables; however, regressions allow for investigation of the influence of one variable on another, whereas correlations do not.
- I would use a simple linear regression, so long as its assumptions are met (i.e., the variables are scale, the residuals are approximately normally distributed, there are no outliers, there is a linear relationship between the two variables, and the data are homoscedastic), to explore how a given explanatory variable impacts, or predicts, a corresponding response variable. For example, how Canada’s greenhouse gas emissions (explanatory, or independent, variable) impact global temperature (response, or dependent, variable).
- I would use a Pearson correlation coefficient, so long as its assumptions are met (i.e., the variables are scale, the variables are normally distributed, the variables are paired, there are no outliers, the variables exhibit linearity, and the data are homoscedastic), to measure the strength of the relationship between linearly related variables. For example, the relationship between the test results and study time of first-year Health Sciences students at Queen’s.

2. What are residuals? How do they relate to the least squares method? (1 mark)

A residual is the vertical distance between an observed value of the response variable and the regression line, or in other words, the difference between the observed value and the predicted value. The least squares method is directly related to residuals: it chooses the values of the Y-intercept and slope that minimize the sum of squared residuals.

3. What is machine learning? Explain how machine learning relates to the correlations and regression analyses covered in Module 06. (2 marks)

Machine learning is a branch of artificial intelligence that uses data and algorithms to find patterns and relationships in data. Regression and correlation are fundamental techniques in this field; they focus on exploring statistical relationships between variables. Machine learning algorithms can leverage these concepts to predict outcomes from complex patterns in data, which can be applied in a number of fields of work.
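To make the link between residuals, least squares, and prediction concrete, the following is a minimal sketch in plain Python. The data (study hours and test scores) are made up for illustration and are not from the module or the homework datasets; the function names are likewise hypothetical.

```python
from statistics import mean

def fit_least_squares(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed value minus predicted value for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def sum_squared_residuals(x, y, intercept, slope):
    """The quantity that the least squares method minimizes."""
    return sum(r ** 2 for r in residuals(x, y, intercept, slope))

# Hypothetical data: study time (hours) and test score (%).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

b0, b1 = fit_least_squares(hours, scores)
print(f"Predicted score = {b0:.1f} + {b1:.1f}(hours)")
print("Residuals:", [round(r, 2) for r in residuals(hours, scores, b0, b1)])
```

Any other choice of intercept or slope gives a larger sum of squared residuals on the same data, which is what "least squares" refers to. Using the fitted line to predict a score for a new value of study time is the simplest form of the prediction task that machine learning algorithms generalize.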
4. What is the replication crisis? Why is it important to consider in health sciences research? (1 mark)

The replication crisis refers to the widespread inability to reproduce published scientific results. Some of these issues can be traced back to scientific methodology training. In health sciences, the replication crisis is particularly important due to its direct impact on medical practice and patient care. Replicability reinforces trust in the validity of research; if findings aren’t replicable, there’s a risk that patients might receive treatments that are ineffective, unnecessary, or potentially harmful. From an innovation perspective, a lack of replicability hinders the scientific progress that leads to new medical treatments, inventions, and a deeper understanding of disease.

Part 2 – Application Using Statistical Software (15 marks)

Scenario: A researcher is interested in exploring how the number of comorbidities a patient experiences is related to healthcare utilization, namely the number of emergency department visits. The researcher decides to do a retrospective chart review to explore the relationship between the number of comorbidities a patient experiences and the number of emergency department visits attended by the given patient (N=70).

Question #1: What is the relationship between number of comorbidities and number of emergency department visits? (5 marks)
Answer the question using the M06_Homework Data file (Dataset1) in OnQ.

A) State your hypothesis statements, significance level, and correlation coefficient interpretation guidelines you plan to use for this question (if applicable).

Null hypothesis: No correlation exists between the number of comorbidities and the number of emergency department visits.
Alternative hypothesis: A correlation exists between the number of comorbidities and the number of emergency department visits.
Significance level: α = 0.05
The correlation coefficient interpretation guidelines are derived from Evans (1996).
B) Describe the statistical analysis you chose and why you chose that test. Identify the response and explanatory variables (if applicable).

Spearman’s rank correlation was chosen for statistical analysis given that the assumptions of Pearson’s correlation were not met. As we are looking to explore the relationship between the number of comorbidities and the number of emergency department visits, rather than how the number of comorbidities impacts the number of emergency department visits, a correlation is best suited to determining the strength and direction of the relationship.

C) Describe the assumptions of this statistical test. Describe how you tested the assumptions and provide any relevant statistical output.

- The data is scale: as the observations are measurable and consist of real numbers, this assumption has been met.
- Normality: normality was tested for each variable by conducting a Shapiro-Wilk normality test; the p values were < .001 and < .001, respectively. As both values are less than the significance level of 0.05, neither variable is approximately normally distributed. Spearman’s correlation does not require variables to come from a normal distribution, so it can still be used for statistical analysis.
- The variables are paired and independent: each participant corresponds to an observation within each data set, yet the participants have no effect on one another.
- The relationship is monotonic: Spearman’s correlation requires a monotonic, rather than strictly linear, relationship between the number of comorbidities and the number of emergency department visits. This was checked by visualization; the relationship was found to be somewhat monotonic.
- Both variables exhibit homoscedasticity: equal variance throughout the plot of each variable was determined by visualization.
- Outliers were checked by visualization. None were identified.

D) Use SPSS to conduct the analysis, report your findings including the appropriate statistical values (e.g., correlation coefficient, p value, beta coefficients, and regression equation, as applicable).

The Spearman’s correlation coefficient was .506, with a p value of < .001.

E) Provide a brief summary of findings (i.e., interpretation).
To explore the relationship between the number of comorbidities and the number of emergency department visits (N=70), a Spearman’s correlation was conducted at the significance level of 0.05. The assumptions of Spearman’s correlation were checked. The data were not found to be approximately normally distributed. A two-way scatter plot was used to confirm that the relationship was monotonic and the data homoscedastic. The paired and independent variables are of scale measurement. No outliers were identified. Statistical significance was considered at p < 0.05. All analyses were conducted using IBM SPSS Version 25.
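As an aside on the method, Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, with tied values sharing the mean of their ranks. The sketch below shows that idea in plain Python. It is not the SPSS procedure used for this homework, the six data points are invented for illustration (they are not Dataset1), and it does not compute a p value.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented example: comorbidity counts and emergency department visits.
comorbidities = [0, 1, 1, 2, 3, 5]
ed_visits = [0, 1, 2, 2, 4, 9]
rho = spearman(comorbidities, ed_visits)
print(f"Spearman's r_s = {rho:.3f}")
```

Because only the ranks enter the calculation, any strictly increasing relationship gives a coefficient of 1 even when it is far from linear, which is why the test suits count data that are not normally distributed.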
The results of Spearman’s correlation suggest a moderate and statistically significant relationship between the number of comorbidities and the number of emergency department visits [r_s = .506, p < .001; Evans (1996) correlation guidelines]. Given that the p value is less than the significance level of .05, I would reject the null hypothesis and conclude that there is a correlation between the variables.

Scenario: Patients with severely inflamed gallbladders often have longer operation times and increased risk of various complications (Madni et al., 2018). The Parkland grading scale for cholecystitis was developed to identify severe gallbladder inflammation using intraoperative images. A five-tiered grading list was developed and is listed below:

Fifty intraoperative images were taken before dissection and provided to 11 surgeons who rated severity of the gallbladder using the above system. For this question, a subset of this data was used (e.g., data from two surgeons). The goal of this study was to determine how reliable this grading system was across surgeons. In clinical practice, the scale will be used by a single surgeon to identify severe gallbladder inflammation in a patient.

Source: Madni, T. D., et al. (2018). The Parkland grading scale for cholecystitis. The American Journal of Surgery, 215, 625-630.

Question #2: Is the assessment tool reliable? (5 marks)
Answer the question using the M06_Homework Data file (Dataset2) in OnQ.

A) State your hypothesis statements and correlation coefficient interpretation guidelines (if applicable).

Null hypothesis: No correlation exists between the ratings of the surgeons (i.e., the system is unreliable).
Alternative hypothesis: A correlation exists between the ratings of the surgeons (i.e., the system is reliable).
Significance level: α = 0.05
The correlation coefficient interpretation guidelines are derived from Koo & Li (2016).

B) Describe the statistical analysis you chose and why you chose that test. Identify the type of reliability you are testing.

A two-way random-effects intraclass correlation coefficient (ICC) analysis for consistency was chosen for statistical analysis. As we are looking to evaluate inter-rater reliability, an ICC allows for the determination of the variability between two or more raters measuring the same event. In this case, two surgeons are rating the severity of gallbladder inflammation.

C) Use SPSS to conduct the analysis, report your findings including the appropriate statistical values (e.g., correlation coefficient, p value, beta coefficients, and regression equation, as applicable).

The intraclass correlation coefficient was .807, with a p value of < .001.

D) Provide a brief summary of findings (i.e., interpretation).

To explore the inter-rater reliability of the Parkland grading scale for cholecystitis as applied by two surgeons to intraoperative images (N=50), an ICC based on a two-way random-effects model for consistency was computed, using a significance level of 0.05. All analyses were conducted using IBM SPSS Version 25.

The results suggest the raters had good agreement that was statistically significant [ICC = .807, p < .001; Koo & Li (2016) guidelines]. Given that the p value is less than .05, I would reject the null hypothesis and conclude that there is a relationship between the surgeons’ ratings (i.e., the system is reliable).

Scenario: There has been an abundance of research exploring amount of exercise and its physiological effects; however, there is a lack of research on sedentary time. You are given data on 50 adults living in Kingston.
The dataset consists of each adult’s average daily sedentary time (in hours, measured using an accelerometer for a week) and their HbA1c (%, measured at the end of the week). We want to see if sedentary time affects HbA1c in adults.

Question #3: Does sedentary time affect HbA1c? (5 marks)
Answer the question using the M06_Homework Data file (Dataset 3) in OnQ.

A) State your hypothesis statements and significance level.
Null hypothesis: There is no effect of sedentary time on HbA1c in adults.
Alternative hypothesis: There is an effect of sedentary time on HbA1c in adults.
Significance level: α = 0.05

B) Describe the statistical analysis you chose and why you chose that test. Identify the response and explanatory variables (if applicable).

A simple linear regression was chosen for statistical analysis. As we are looking to determine whether sedentary time has an effect on HbA1c, a regression analysis is appropriate because it explores whether the explanatory variable predicts the response variable.
Response variable: HbA1c (%, measured at the end of the week)
Explanatory variable: sedentary time (in hours, measured using an accelerometer for a week)

C) Describe the assumptions of this statistical test. Describe how you tested the assumptions and provide any relevant statistical output.

- The data is scale: as the observations are measurable and consist of real numbers, this assumption has been met.
- Residuals are approximately normally distributed: normality was checked by visualization of a normal probability plot.
- Both variables exhibit linearity: a linear relationship exists between sedentary time and HbA1c. This was checked by visualization; the relationship was found to be approximately linear.
- Both variables exhibit homoscedasticity: the data was found to be equally spread around the line of best fit. This was checked by visualization.
- Outliers were checked by visual inspection. None were identified.

D) Use SPSS to conduct the analysis, report your findings including the appropriate statistical values (e.g., correlation coefficient, p value, beta coefficients, and regression equation, as applicable).

The coefficient of determination was .154, the p value was .005, and the beta coefficients were as follows: B0 = 5.192 and B1 = 0.237. The regression equation can thus be written as:
Predicted HbA1c = 5.192 + 0.237(Sedentary time)

E) Provide a brief summary of findings (i.e., interpretation).

A simple linear regression was conducted to explore the impact of sedentary time on HbA1c in a study with adults from Kingston (N=50). No outliers were identified. A two-way scatter plot with a regression line of best fit was used to confirm linearity and homoscedasticity. Normality of the residuals was confirmed by visualization of a normal probability plot. Statistical significance was considered at p < 0.05. All analyses were conducted using IBM SPSS Version 25.

The results of the linear regression suggest that sedentary time was a statistically significant predictor of HbA1c [F(1,48) = 8.752, p = 0.005], accounting for 15.4% of the variability in HbA1c (r² = .154). The regression equation is as follows:

Predicted HbA1c = 5.192 + 0.237(Sedentary time)

I would, therefore, reject the null hypothesis and conclude that sedentary time is a statistically significant predictor of HbA1c in Kingston adults. Because the data are observational, this result supports an association between the two variables rather than proof of causation.
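The reported values can be cross-checked with a short Python sketch. The coefficients, F statistic, and degrees of freedom are taken from the results above; the 8-hour input is a hypothetical value chosen only to illustrate a prediction, and the function names are illustrative. For a regression, the coefficient of determination can be recovered from the F statistic as r² = (F × df_model) / (F × df_model + df_residual).

```python
# Values reported in the SPSS output above.
B0 = 5.192          # intercept: predicted HbA1c (%) at 0 hours of sedentary time
B1 = 0.237          # slope: change in predicted HbA1c (%) per additional hour
F_STAT = 8.752
DF_MODEL = 1
DF_RESIDUAL = 48

def predicted_hba1c(sedentary_hours):
    """Apply the fitted regression equation to a sedentary time in hours."""
    return B0 + B1 * sedentary_hours

def r_squared_from_f(f_stat, df_model, df_residual):
    """Recover the coefficient of determination from an F statistic."""
    return (f_stat * df_model) / (f_stat * df_model + df_residual)

# Hypothetical input: an adult averaging 8 sedentary hours per day.
example_hours = 8.0
print(f"Predicted HbA1c at {example_hours} h: {predicted_hba1c(example_hours):.3f}%")

r2 = r_squared_from_f(F_STAT, DF_MODEL, DF_RESIDUAL)
print(f"r squared recovered from F: {r2:.3f}")
```

The recovered r² rounds to .154, which agrees with the coefficient of determination reported above. Predictions are only meaningful within the range of sedentary times observed in the sample.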