Project1BMES
School
Harvard University
Course
MISC
Subject
Statistics
Date
Feb 20, 2024
Type
Pages
12
Uploaded by PrivateStar23629
Data Analysis Project 1
Fall 2022

Instructions

Write a MATLAB script to solve the following questions. Insert figures, tables, and answers directly into this document. Note: You do not need to verify your model assumptions in this project.

Deliverables:
1. This document with your figures, answers, and explanations inserted.
2. A MATLAB script with the code used to answer all questions.

Background

The Framingham Heart Study (Levy, 1999) has collected cardiovascular risk factor data and long-term follow-up on almost 5000 residents of the town of Framingham, Massachusetts. Refer to the “Framingham Documentation” pdf file for detailed information on the data set. These questions are modeled after analyses in DOI:10.1017/CBO9780511575884.
Analysis 1

Perform linear regression to show the relationship between systolic blood pressure (SYSBP) and body mass index (BMI). Note that many patients were measured more than one time. Include only Period 1 measurements in your model. An exploratory analysis and assumption-checking indicated that the relationship between log(SYSBP) and log(BMI) comes closer to meeting the assumptions of a linear model than does the relationship between SYSBP and BMI.

1. Create two regression models. Regress log(SYSBP) against log(BMI) for men and women. Note that SYSBP is the variable plotted on the y-axis and BMI is on the x-axis. Remove all empty (non-number) entries of BMI from your data before creating models [you can use the function ‘isnan’ to find empty entries and remove them from the dataset].
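As a rough illustration of what the log-log model in item 1 means: regressing log(SYSBP) on log(BMI) fits log(SYSBP) = β0 + β1·log(BMI), which is the same as SYSBP = exp(β0)·BMI^β1. The sketch below uses base MATLAB `polyfit` instead of `fitlm`, and the data are synthetic, noise-free values chosen for illustration (not Framingham data). All variable names are hypothetical.

```matlab
% Synthetic, noise-free power-law data (hypothetical values, not Framingham data).
bmi = [18 22 25 28 31 35 40]';
b0True = 4.0;            % intercept on the log scale (assumed for illustration)
b1True = 0.3;            % slope on the log scale (assumed for illustration)
sysbp = exp(b0True) .* bmi.^b1True;

% Fit log(SYSBP) = b0 + b1*log(BMI) by ordinary least squares.
% polyfit returns coefficients from highest to lowest power: [slope, intercept].
p = polyfit(log(bmi), log(sysbp), 1);
b1 = p(1);
b0 = p(2);

fprintf('intercept b0 = %.4f, slope b1 = %.4f\n', b0, b1);
```

Because the synthetic data follow the power law exactly, the fit returns the intercept and slope that generated them. With real data, `fitlm(log(BMI), log(SYSBP))` additionally reports standard errors and p-values.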
[Figures: regression plots with y-axes “SYSBP in Men” and “SYSBP in Women”]
Fill in the following table with your results:

Table 1. Regression coefficients for linear models.

Sex      Number of subjects (n_i)   β0     β1
Men      1944                       4.07   0.246
Women    2476                       3.60   0.397

2. Use the function ‘coefCI’ to find a 95% CI on the β1 values.

Table 2. Confidence intervals on β1 values.

Sex      Lower Bound   Upper Bound
Men      0.199         0.292
Women    0.359         0.435

3. Use these CIs to determine if the slopes of the two best-fit lines are equal. [with a t-test?]

The two 95% confidence intervals do not overlap (the upper bound for men, 0.292, is below the lower bound for women, 0.359), so the slopes are not the same.

4. Use your models to predict the following:
a. SYSBP of a man with BMI = 33: approximately 138 mmHg, from exp(4.07 + 0.246·ln 33) with the Table 1 coefficients.
b. SYSBP of a woman with BMI = 33: approximately 147 mmHg, from exp(3.60 + 0.397·ln 33) with the Table 1 coefficients.
Hint: Make sure your answers pass the straight-face test.

5. Plot both sets of data. Use the subplot command to include them both in a single figure. The left-hand subplot should contain a log-log scale plot of the data for men with two regression lines, one for men and one for women (to highlight any differences in slope). The right-hand subplot should contain a log-log scale plot of the data for women with the regression line for women. Make sure that your regression line is linear.
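The slope comparison in item 3 and the predictions in item 4 can be checked numerically from the rounded values in Tables 1 and 2. The sketch below is only an illustration, not the assignment's required method. It approximates each slope's standard error from the half-width of its 95% CI using the normal quantile 1.96 (an assumption that is reasonable for groups of about 2000 subjects), forms a z statistic for the difference in slopes, and back-transforms the log-scale predictions at BMI = 33. Variable names are hypothetical.

```matlab
% Rounded values copied from Tables 1 and 2 (order: men, women).
b0 = [4.07; 3.60];          % intercepts on the log scale
b1 = [0.246; 0.397];        % slopes on the log scale
ciLow  = [0.199; 0.359];    % 95% CI lower bounds on the slopes
ciHigh = [0.292; 0.435];    % 95% CI upper bounds on the slopes

% Approximate each standard error from the CI half-width (normal quantile 1.96 assumed).
se = (ciHigh - ciLow) ./ (2 * 1.96);

% z statistic and two-sided normal p-value for H0: the two slopes are equal.
z = (b1(2) - b1(1)) / sqrt(se(1)^2 + se(2)^2);
p = erfc(abs(z) / sqrt(2));

% Back-transform the log-scale prediction: SYSBP = exp(b0 + b1*log(BMI)).
bmiNew = 33;
sysbpPred = exp(b0 + b1 .* log(bmiNew));   % [men; women], in mmHg

fprintf('z = %.2f, p = %.2g\n', z, p);
fprintf('Predicted SYSBP at BMI 33: men %.1f mmHg, women %.1f mmHg\n', sysbpPred(1), sysbpPred(2));
```

With these rounded inputs the z statistic comes out close to 5, which agrees with the non-overlapping intervals. Note that the BMI value must be log-transformed before it enters the model and the result exponentiated afterwards, since both variables were fitted on the log scale.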
Analysis 2

We will now create a multiple regression model that seeks to predict SYSBP based on additional variables included in the Framingham Heart Study: BMI, AGE, and total serum cholesterol (TOTCHOL). Include only Period 1 measurements in your model.

1. A first step is to create several linear models of various variables that are possibly related to SYSBP. We will not be using the log transform for these models. Make three separate simple regression models. Remove all empty (non-number) entries for each variable category from your data before creating models [you can use the function ‘isnan’ to find empty entries and remove them from the dataset].

a. SYSBP against BMI (do NOT separate the data by sex)
b. SYSBP against age
c. SYSBP against serum total cholesterol

[Figures: two log-log panels with y-axis “log(System Blood Pressure) (mmHg)”; regression plots with y-axis “SYSBP”]

Fill in the following table:

Table 3. Results of three separate linear regression models.

Model   β1        P-value for β1   R^2
a       1.7885    2.6635e-111      0.108
b       1.0297    1.156e-168       0.159
c       0.10041   9.2731e-41       0.04

Each model has a small R^2 but an extremely small p-value for β1, so each regression is statistically significant even though no single variable explains much of the variability in SYSBP.

2. Note that each individual model’s R^2 value is low, indicating that no single variable is doing a good job of explaining the total variability in the data. A multivariate model may address this issue and provide better predictions. Create a multiple regression model of SYSBP against BMI, age, and serum cholesterol. Fill in the following table:

Table 4. Estimated coefficients for multiple regression model (response: SYSBP).
            Estimate   SE          tStat    pValue
Intercept   40.836     2.5988      15.713   3.6591e-54
BMI         1.4945     0.072966    20.482   4.318e-89
Age         0.87486    0.035413    24.705   3.1646e-126
SCL         0.041165   0.0068569   6.0034   2.0889e-9

3. Comment on the significance of regression for each of the three regressors.

The p-value of each individual regressor is small enough to indicate that each one is significant. In addition, the overall multivariable regression has a p-value of 5.26e-260 and an R^2 value of 0.241, so the regression as a whole is significant.

4. Predict the value of SYSBP for an individual with a BMI of 33, an age of 55 years, and a cholesterol level of 288 mg/dL. Include a 95% prediction interval.

150.1270 mmHg, with 95% interval [148.84, 151.41] mmHg as returned by predict with its default settings (a confidence interval on the mean response; the prediction interval for a single individual is wider).

Extra Credit: Create a new multiple regression model to predict SYSBP using the Framingham data set. All regressors must be significant and your adjusted R^2 must be above 0.25 to get credit. Show the results of your model in a table similar to Table 4. Do not include diastolic blood pressure as a regressor. It is measured at the same time as SYSBP, so if you know one you know both. Diastolic BP is therefore not a valuable regressor.

Regressors: hypertension after follow-up (TIMEHYP), time of death after follow-up (TIMEDTH), and glucose. All p-values were low, so the regressors are significant.

            Estimate    SE          tStat     pValue
Intercept   147.86      1.418       104.25    0
TIMEHYP     -0.004      8.337e-05   -48.224   0
TIMEDTH     -0.000732   0.00013     -5.63     1.91e-08
Glucose     0.054       0.011       4.80      1.643e-06

Adjusted R^2 = 0.428
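The point prediction in item 4 of Analysis 2 can be reproduced by hand from the Table 4 coefficients, without the fitted model object. A minimal sketch, with hypothetical variable names:

```matlab
% Coefficients copied from Table 4 (order: intercept, BMI, age, cholesterol).
beta = [40.836; 1.4945; 0.87486; 0.041165];

% Predictor values from item 4, with a leading 1 for the intercept.
xNew = [1, 33, 55, 288];

% The point prediction is the inner product of the predictor row and the coefficients.
sysbpHat = xNew * beta;

fprintf('Predicted SYSBP = %.2f mmHg\n', sysbpHat);
```

This gives about 150.13 mmHg, matching the reported point estimate. The interval around it cannot be reproduced from Table 4 alone, because it also depends on the residual variance and the coefficient covariance of the fitted model.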
%% Analysis 1
close all; clearvars; clc
data = xlsread('framingham.xls');
men = data(1:1944,:);
women = data(1945:4434,:);

%% Remove rows with missing BMI (column 9)
% Logical indexing keeps only the rows whose BMI is a number.
menQ1 = men(~isnan(men(:,9)),:);
womenQ1 = women(~isnan(women(:,9)),:);

%% Plot men SYSBP and BMI (both log-transformed)
SYSBPmen = log(menQ1(:,5));
BMImen = log(menQ1(:,9));
figure(1)
pm = fitlm(BMImen,SYSBPmen)
plot(pm);
xlabel('BMI in Men');
ylabel('SYSBP in Men');
title('BMI vs SYSBP in Men')
CIM = coefCI(pm);

%% Plot women SYSBP and BMI (both log-transformed)
SYSBPwomen = log(womenQ1(:,5));
BMIwomen = log(womenQ1(:,9));
figure(2)
pw = fitlm(BMIwomen,SYSBPwomen)
plot(pw);
xlabel('BMI in Women');
ylabel('SYSBP in Women');
title('BMI vs SYSBP in Women');
CIW = coefCI(pw);

%% Q4: predicted SYSBP at BMI = 33
% The models are fitted on the log scale: take the log of the input and exp of the output.
men33 = exp(feval(pm,log(33)));
women33 = exp(feval(pw,log(33)));

%% Q5 plot
figure(3)
subplot(1,2,1)
hold on
l = plot(pm);
set(l,'color','b');
h = plot(pw);
delete(h(1));   % remove the female data points, keeping the female regression line
title("Male BMI vs. System Blood Pressure with Female Regression Model")
ylabel("log(System Blood Pressure) (mmHg)")
xlabel("log(BMI) (kg/m^2)")
legend("Male Data Points","Male Linear Regression Model","Lower Confidence Bound","Upper Confidence Bound","Female Linear Regression Model")
hold off
subplot(1,2,2)
l2 = plot(pw);
title("Female log(BMI) vs. log(System Blood Pressure)")
ylabel("log(System Blood Pressure) (mmHg)")
xlabel("log(BMI) (kg/m^2)")

%% Analysis 2
close all; clearvars; clc
data = xlsread('framingham.xls');

%% a. SYSBP against BMI, not separated by sex
dataQ2 = data(~isnan(data(:,9)),:);   % drop rows with missing BMI (column 9)
SYSBP = dataQ2(:,5);
BMI = dataQ2(:,9);
figure(1)
pm = fitlm(BMI,SYSBP)
plot(pm);
xlabel('BMI');
ylabel('SYSBP');
title('BMI vs SYSBP')
CIM = coefCI(pm);

%% b. SYSBP against age
dataQ2B = data(~isnan(data(:,4)),:);   % drop rows with missing age (column 4)
SYSBPB = dataQ2B(:,5);
Age = dataQ2B(:,4);
figure(2)
pmB = fitlm(Age,SYSBPB)
plot(pmB);
xlabel('Age');
ylabel('SYSBP');
title('Age vs SYSBP')
CIMB = coefCI(pmB);

%% c. SYSBP against serum total cholesterol
dataQ2C = data(~isnan(data(:,3)),:);   % drop rows with missing cholesterol (column 3)
SYSBPC = dataQ2C(:,5);
Totchol = dataQ2C(:,3);
figure(3)
pmC = fitlm(Totchol,SYSBPC)
plot(pmC);
xlabel('Total Cholesterol');
ylabel('SYSBP');
title('Total Cholesterol vs SYSBP')
CIMC = coefCI(pmC);

%% Multiple linear regression
% Keep rows where cholesterol (3), age (4), and BMI (9) are all present.
keep = ~any(isnan(data(:,[3 4 9])),2);
dataQ2_2 = data(keep,:);
SYSBP2 = dataQ2_2(:,5);
BMI2 = dataQ2_2(:,9);
AGE2 = dataQ2_2(:,4);
TOTCHOL2 = dataQ2_2(:,3);
array1 = [SYSBP2 BMI2 AGE2 TOTCHOL2];
tbl1 = array2table(array1,"VariableNames",["SYSBP","BMI","Age","Cholesterol"]);
lmdl = fitlm(tbl1,'SYSBP~BMI+Age+Cholesterol')
CIM2 = coefCI(lmdl);

%% Q4: BMI of 33, age of 55 years, cholesterol level of 288 mg/dL, with 95% intervals
xallnew = [33, 55, 288];
feval(lmdl,33,55,288)
[yhat, CI] = predict(lmdl,xallnew);                           % default: CI on the mean response
[~, PI] = predict(lmdl,xallnew,'Prediction','observation');   % prediction interval for one individual

%% Extra credit: SYSBP against TIMEHYP (column 39), TIMEDTH (column 38), and glucose (column 13)
keepEC = ~any(isnan(data(:,[13 38 39])),2);
EC = data(keepEC,:);
ESYS = EC(:,5);
EHYP = EC(:,39);
EDTH = EC(:,38);
EG = EC(:,13);
EC_array = [ESYS EHYP EDTH EG];
tblEC = array2table(EC_array,"VariableNames",["SYSBP","EHYP","EDTH","Glucose"]);
ECmdl = fitlm(tblEC,'SYSBP~EHYP+EDTH+Glucose')
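Removing rows with missing values, which the script does before every fit, can be done in one step with logical indexing. The sketch below runs on a small made-up matrix, so the values and variable names are hypothetical. It includes two consecutive NaN rows, because that is the case a counting loop that deletes rows in place tends to miss: after a deletion the next row shifts into the current index and is never checked.

```matlab
% Small made-up matrix: column 1 is an ID, columns 2 and 3 are measurements.
X = [1 120 25.1;
     2 NaN 27.3;
     3 NaN 22.0;
     4 135 NaN;
     5 142 30.4];

% A row is kept only if none of its measurement columns is NaN.
keep = ~any(isnan(X(:,[2 3])), 2);
Xclean = X(keep,:);

disp(Xclean)
```

Only the rows with IDs 1 and 5 remain. The same pattern applies to the Framingham matrix by listing the relevant column indices inside the brackets.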