Project1BMES

School

Harvard University

Course

MISC

Subject

Statistics

Date

Feb 20, 2024

Type

pdf

Pages

12

Uploaded by PrivateStar23629

Data Analysis Project 1
Fall 2022

Instructions

Write a MATLAB script to solve the following questions. Insert figures, tables, and answers directly into this document.

Note: You do not need to verify your model assumptions in this project.

Deliverables:
1. This document with your figures, answers, and explanations inserted.
2. A MATLAB script with the code used to answer all questions.

Background

The Framingham Heart Study (Levy, 1999) has collected cardiovascular risk factor data and long-term follow-up on almost 5000 residents of the town of Framingham, Massachusetts. Refer to the “Framingham Documentation” pdf file for detailed information on the data set. These questions are modeled after analyses in DOI:10.1017/CBO9780511575884.

Analysis 1

Perform linear regression to show the relationship between systolic blood pressure (SYSBP) and body mass index (BMI). Note that many patients were measured more than one time. Include only Period 1 measurements in your model. An exploratory analysis and assumption-checking indicated that the relationship between log(SYSBP) and log(BMI) comes closer to meeting the assumptions of a linear model than does the relationship between SYSBP and BMI.

1. Create two regression models. Regress log(SYSBP) against log(BMI) for men and women. Note that SYSBP is the variable plotted on the y-axis and BMI is on the x-axis. Remove all empty (non-number) entries of BMI from your data before creating models [you can use function ‘isnan’ to explore empty entries and remove from the dataset].
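The ‘isnan’ hint can be applied with a logical mask instead of a row-by-row loop. The sketch below is a minimal illustration on a made-up matrix. The values and the names toy, keep, clean, and looped are assumptions for the example and are not part of the Framingham data. The second column stands in for BMI. It also shows why deleting rows inside a forward loop can leave a NaN behind when two missing entries are adjacent.

```matlab
% Toy matrix: column 1 is an ID, column 2 stands in for BMI (hypothetical values).
toy = [1 22.5; 2 NaN; 3 NaN; 4 27.1; 5 NaN; 6 31.0];

% Logical mask: true for rows whose second column holds a number.
keep = ~isnan(toy(:,2));
clean = toy(keep,:);

% For comparison: deleting rows inside a forward loop shifts later rows up,
% so the second of two consecutive NaN rows is never examined.
looped = toy;
for i = 1:size(toy,1)
    if i > size(looped,1)
        break
    end
    if isnan(looped(i,2))
        looped(i,:) = [];
    end
end

fprintf('Rows kept by the mask: %d\n', size(clean,1));
fprintf('NaNs left by the loop: %d\n', sum(isnan(looped(:,2))));
```

The mask version does not depend on knowing in advance how many rows will remain, which is why it avoids hard-coded loop bounds.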
[Figures: regression plots for men and women; y-axis labels "SYSBP in Men" and "SYSBP in Women".]
Fill in the following table with your results:

Table 1. Regression coefficients for linear models.

Sex      Number of subjects n_i    β_0     β_1
Men      1944                      4.07    0.246
Women    2476                      3.60    0.397

2. Use the function ‘coefCI’ to find a 95% CI on β_1 values.

Table 2. Confidence intervals on β_1 values.

Sex      Lower Bound    Upper Bound
Men      0.199          0.292
Women    0.359          0.435

3. Use these CIs to determine if the slopes of the two best-fit lines are equal. [with a t-test?]

The confidence intervals do not overlap, so the two slopes are not the same.

4. Use your models to predict the following:
a. SYSBP of a man with BMI = 33.
121.82 mmHg
b. SYSBP of a woman with BMI = 33.
167.10 mmHg
Hint: Make sure your answers pass the straight-face test.

5. Plot both sets of data. Use the subplot command to include them both in a single figure. The left-hand subplot should contain a log-log scale plot of the data for men with two regression lines, one for men and one for women (to highlight any differences in slope). The right-hand subplot should contain a log-log scale plot of the data for women with the regression line for women. Make sure that your regression line is linear.
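Because both models are fit on the log scale, a prediction in mmHg needs log(BMI) as the input and an exponential applied to the output. The sketch below works only from the rounded numbers in Tables 1 and 2, so its results are approximate. The variable names are illustrative assumptions. It also checks the interval overlap used in question 3.

```matlab
% Rounded coefficients and interval bounds copied from Tables 1 and 2.
b0_men = 4.07;   b1_men = 0.246;
b0_women = 3.60; b1_women = 0.397;
ci_men = [0.199 0.292];
ci_women = [0.359 0.435];

% Two intervals overlap only if each lower bound is below the other's upper bound.
overlap = ci_men(1) <= ci_women(2) && ci_women(1) <= ci_men(2);

% Prediction on the original scale: log-transform BMI, apply the fitted line,
% then exponentiate (MATLAB's log is the natural logarithm).
bmi = 33;
sysbp_men = exp(b0_men + b1_men*log(bmi));
sysbp_women = exp(b0_women + b1_women*log(bmi));

fprintf('CIs overlap: %d\n', overlap);
fprintf('Men:   %.1f mmHg\n', sysbp_men);
fprintf('Women: %.1f mmHg\n', sysbp_women);
```

With these rounded inputs the back-transformed predictions come out at roughly 138 mmHg for men and 147 mmHg for women, which differ from the answers entered under question 4. Non-overlapping 95% intervals do indicate different slopes, but overlapping intervals alone would not have shown that the slopes are equal.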
Analysis 2

We will now create a multiple regression model that seeks to predict SYSBP based on additional variables included in the Framingham Heart Study: BMI, AGE, and total serum cholesterol (TOTCHOL). Include only Period 1 measurements in your model.

1. A first step is to create several linear models of various variables that are possibly related to SYSBP. We will not be using the log transform for these models. Make three separate simple regression models. Remove all empty (non-number) entries for each variable category from your data before creating models [you can use function ‘isnan’ to explore empty entries and remove from the dataset].

[Figure: log-log plots of the data for men and women; y-axis label "log(System Blood Pressure) (mmHg)".]

[Figures: simple regression plots; y-axis label "SYSBP".]
a. SYSBP against BMI: do NOT separate the data by sex.
b. SYSBP against age
c. SYSBP against serum total cholesterol

Fill in the following table:

Table 3. Results of three separate linear regression models.

Model    β_1        P-value for β_1    R^2
a        1.7885     2.6635e-111        0.108
b        1.0297     1.156e-168         0.159
c        0.10041    9.2731e-41         0.04

Each model has a small R^2 value but a far smaller p-value for β_1, so each regression is significant.

2. Note that each individual model’s R^2 value is low, indicating that no single variable is doing a good job of explaining the total variability in the data. A multivariate model may address this issue and provide better predictions. Create a multiple regression model of SYSBP against BMI, age, and serum cholesterol. Fill in the following table:

Table 4. Estimated coefficients for multiple regression model.
             Estimate    SE           tStat     pValue
Intercept    40.836      2.5988       15.713    3.6591e-54
BMI          1.4945      0.072966     20.482    4.318e-89
Age          0.87486     0.035413     24.705    3.1646e-126
SCL          0.041165    0.0068569    6.0034    2.0889e-9

3. Comment on the significance of regression for each of the three regressors.

The p-value for each individual regressor is small enough to indicate that each one is significant. With a p-value of 5.26e-260 for the multivariate regression and an R^2 value of 0.241, the overall regression is also significant.

4. Predict the value of SYSBP for an individual with a BMI of 33, an age of 55 years, and a cholesterol level of 288 mg/dL. Include a 95% prediction interval.

150.127 mmHg, with 95% interval [148.84, 151.41] mmHg

Extra Credit: Create a new multiple regression model to predict SYSBP using the Framingham data set. All regressors must be significant and your adjusted R^2 must be above 0.25 to get credit. Show the results of your model in a table similar to Table 4. Do not include diastolic blood pressure as a regressor. It is measured at the same time as SYSBP, so if you know one you know both. Diastolic BP is therefore not a valuable regressor.

Regressors: TIMEHYP (hypertension after follow-up), TIMEDTH (time of death after follow-up), and glucose. The p-values are low, so all three regressors are significant.

             Estimate     SE           tStat      pValue
Intercept    147.86       1.418        104.25     0
TIMEHYP      -0.004       8.337e-05    -48.224    0
TIMEDTH      -0.000732    0.00013      -5.63      1.91e-08
Glucose      0.054        0.011        4.80       1.643e-06

Adjusted R^2 = 0.428
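The point prediction in question 4 can be checked by hand from the Table 4 estimates. The sketch below does that, then uses a small synthetic data set (made up for illustration, unrelated to the Framingham file) to show how the interval returned by predict depends on its 'Prediction' option. It assumes the Statistics and Machine Learning Toolbox is available. The names beta, xnew, mdl, ciMean, and piNew are illustrative assumptions.

```matlab
% Point prediction from the rounded Table 4 estimates.
beta = [40.836; 1.4945; 0.87486; 0.041165];   % intercept, BMI, age, cholesterol
xnew = [1, 33, 55, 288];                       % leading 1 multiplies the intercept
yhat_table = xnew * beta;
fprintf('Point prediction: %.3f mmHg\n', yhat_table);

% Synthetic data (hypothetical values) to compare the two interval types.
rng(1);
bmi = 20 + 15*rand(300,1);
sysbp = 90 + 1.5*bmi + 18*randn(300,1);
mdl = fitlm(bmi, sysbp);

% Default ('curve'): confidence interval for the mean response at BMI = 33.
[~, ciMean] = predict(mdl, 33);

% 'observation': prediction interval for one new individual at BMI = 33.
[~, piNew] = predict(mdl, 33, 'Prediction', 'observation');

fprintf('Mean-response interval width:   %.1f\n', diff(ciMean));
fprintf('New-observation interval width: %.1f\n', diff(piNew));
```

The new-observation interval is always the wider of the two because it adds the residual scatter of an individual to the uncertainty in the fitted mean. A 95% interval only a few mmHg wide is therefore more typical of the mean-response setting.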
%% Analysis 1: log(SYSBP) against log(BMI), by sex
close all; clearvars; clc
data = xlsread('framingham.xls');

% Row ranges as used in this script: rows 1-1944 are men, rows 1945-4434 are women.
men = data(1:1944,:);
women = data(1945:4434,:);

%% Remove rows with missing BMI (column 9)
% A logical mask avoids the index shift caused by deleting rows inside a loop.
menQ1 = men(~isnan(men(:,9)),:);
womenQ1 = women(~isnan(women(:,9)),:);

%% Men: fit and plot log(SYSBP) against log(BMI)
SYSBPmen = log(menQ1(:,5));
BMImen = log(menQ1(:,9));
figure(1)
pm = fitlm(BMImen,SYSBPmen)
plot(pm);
xlabel('BMI in Men');
ylabel('SYSBP in Men');
title('BMI vs SYSBP in Men')
CIM = coefCI(pm);   % 95% confidence intervals on the coefficients

%% Women: fit and plot log(SYSBP) against log(BMI)
SYSBPwomen = log(womenQ1(:,5));
BMIwomen = log(womenQ1(:,9));
figure(2)
pw = fitlm(BMIwomen,SYSBPwomen)
plot(pw);
xlabel('BMI in Women');
ylabel('SYSBP in Women');
title('BMI vs SYSBP in Women');
CIW = coefCI(pw);

%% Q4: predictions at BMI = 33
% The models are fit on the log scale, so pass log(33) and exponentiate the result.
men33 = exp(feval(pm,log(33)));
women33 = exp(feval(pw,log(33)));

%% Q5: side-by-side plots
figure(3)
subplot(1,2,1)
hold on
l = plot(pm);
set(l,'color','b');
h = plot(pw);
delete(h(1));   % drop the women's data points, keep their regression line
title("Male BMI vs. System Blood Pressure with Female Regression Model")
ylabel("log(System Blood Pressure) (mmHg)")
xlabel("log(BMI) (kg/m^2)")
legend("Male Data Points","Male Linear Regression Model","Lower Confidence Bound","Upper Confidence Bound","Female Linear Regression Model")
hold off
subplot(1,2,2)
l2 = plot(pw);
title("Female log(BMI) vs. log(System Blood Pressure)")
ylabel("log(System Blood Pressure) (mmHg)")
xlabel("log(BMI) (kg/m^2)")

%% Analysis 2: regression on untransformed data
close all; clearvars; clc
data = xlsread('framingham.xls');

%% a. SYSBP against BMI (both sexes combined)
dataQ2 = data(~isnan(data(:,9)),:);   % drop rows with missing BMI
SYSBP = dataQ2(:,5);
BMI = dataQ2(:,9);
figure(1)
pm = fitlm(BMI,SYSBP)
plot(pm);
xlabel('BMI');
ylabel('SYSBP');
title('BMI vs SYSBP')
CIM = coefCI(pm);

%% b. SYSBP against age (column 4)
dataQ2B = data(~isnan(data(:,4)),:);
SYSBPB = dataQ2B(:,5);
Age = dataQ2B(:,4);
figure(2)
pmB = fitlm(Age,SYSBPB)
plot(pmB);
xlabel('Age');
ylabel('SYSBP');
title('Age vs SYSBP')
CIMB = coefCI(pmB);

%% c. SYSBP against serum total cholesterol (column 3)
dataQ2C = data(~isnan(data(:,3)),:);
SYSBPC = dataQ2C(:,5);
Totchol = dataQ2C(:,3);
figure(3)
pmC = fitlm(Totchol,SYSBPC)
plot(pmC);
xlabel('Total Cholesterol');
ylabel('SYSBP');
title('Total Cholesterol vs SYSBP')
CIMC = coefCI(pmC);

%% Multiple linear regression
% Keep only rows where BMI, age, and cholesterol are all present.
keepRows = ~any(isnan(data(:,[9 4 3])),2);
dataQ2_2 = data(keepRows,:);
SYSBP2 = dataQ2_2(:,5);
BMI2 = dataQ2_2(:,9);
AGE2 = dataQ2_2(:,4);
TOTCHOL2 = dataQ2_2(:,3);
array1 = [SYSBP2 BMI2 AGE2 TOTCHOL2];
tbl1 = array2table(array1,"VariableNames",["SYSBP","BMI","Age","Cholesterol"]);
lmdl = fitlm(tbl1,'SYSBP~BMI+Age+Cholesterol')
CIM2 = coefCI(lmdl);

%% Prediction: BMI of 33, age of 55 years, cholesterol of 288 mg/dL
xallnew = [33, 55, 288];
% 'Prediction','observation' requests a 95% interval for a new individual;
% the default ('curve') returns the narrower interval for the mean response.
[yhat, CI] = predict(lmdl,xallnew,'Prediction','observation')

%% Extra credit: SYSBP against TIMEHYP, TIMEDTH, and glucose
% Columns as used in this script: 39 = TIMEHYP, 38 = TIMEDTH, 13 = glucose.
EC = data(~any(isnan(data(:,[39 38 13])),2),:);
ESYS = EC(:,5);
EHYP = EC(:,39);
EDTH = EC(:,38);
EG = EC(:,13);
EC_array = [ESYS EHYP EDTH EG];
tblEC = array2table(EC_array,"VariableNames",["SYSBP","EHYP","EDTH","Glucose"]);
ECmdl = fitlm(tblEC,'SYSBP~EHYP+EDTH+Glucose')