SAS Assignment #1

Framingham Heart Study: Data Preparation
Industry Applied Activity

Purpose
This activity focuses on preparing data from the Framingham Heart Study for future statistical analyses, as well as exploring the data through descriptive statistics.

SAS Software
This activity can be performed using any SAS programming environment, including SAS Studio in SAS OnDemand for Academics.

Industry Alignment
This activity aligns with the healthcare industry. It uses data from a clinical study conducted to identify characteristics contributing to cardiovascular disease.
Table of Contents
- Framingham Heart Study: Data Preparation
  - Purpose
  - SAS Software
  - Industry Alignment
- Activity Notes and Requirements
  - Learning Objectives
  - Estimated Completion Time
  - Experience Level
  - Prerequisite Knowledge
    - Software
    - Content Knowledge
  - Additional Notes
- Data Source
  - Introduction
  - Description of Variables
- Framingham Heart Study: Data Preparation Activity
  - Part 1: Understanding the Variables
  - Part 2: Creating New Variables and Subsetting the Data
- Appendix
  - Appendix A: Access Software
  - Appendix B: Helpful Documentation
  - Appendix C: Recommended Learning
Activity Notes and Requirements

Learning Objectives
This activity provides practice with skills such as:
- Implementing data changes and manipulations
- Preparing data for future possible statistical analyses
- Exploring data through descriptive statistics, including:
  - Understanding variables and their values within the data
  - Recognizing the need for changes in the data

Estimated Completion Time
This activity will take students approximately 3 hours to complete.

Experience Level
To complete this activity, students should have the following levels of experience:
- Intermediate skill in SAS programming
- Beginner skill in statistics

Prerequisite Knowledge

Software
Students should have experience with the following:
- Foundations of programming with the SAS DATA step, including using functions and if/then/else conditional statements
- SAS descriptive procedures such as PROC PRINT, PROC CONTENTS, PROC FREQ, PROC MEANS, and PROC UNIVARIATE

Content Knowledge
Students should have experience/knowledge with the following concepts:
- Descriptive statistics such as mean, median, counts, and percentages
- Conditional if/then/else logic

Additional Notes
This activity pairs well with the following activities that you will complete:
- Framingham Heart Study: Descriptive Analysis, Industry Applied Activity
- Framingham Heart Study: Statistical Analysis, Industry Applied Activity

Data Source

Introduction
This activity uses the HEART dataset in the SASHELP library. To access the SASHELP library in SAS, select View > Explorer. In the Explorer window, select Libraries > Sashelp. The data came from the landmark Framingham Heart Study (https://framinghamheartstudy.org/). The purpose of the Framingham Heart Study was to identify characteristics contributing to
cardiovascular disease. Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.

The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first visit of data collection for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014, almost 7 decades! The complete Framingham Heart Study data consists of hundreds of datasets taken over time at 32 biennial exams and has led to over 3,000 (wow!) published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.

Description of Variables
The variables used for this exercise are:

Variable         Description
Status           Alive or dead
DeathCause       Cause of death
AgeCHDdiag       Age at which CHD was diagnosed
Sex              Male or female
AgeAtStart       Age at entry into the Framingham Heart Study
Height           Height in inches
Weight           Weight in pounds
Diastolic        Diastolic blood pressure
Systolic         Systolic blood pressure
MRW              Metropolitan Relative Weight
Smoking          Number of packs of cigarettes smoked per week
AgeAtDeath       Age at death
Cholesterol      Total cholesterol
Chol_Status      Total cholesterol categorized into groups
BP_Status        Diastolic and systolic blood pressure categorized into groups
Weight_Status    Height and weight categorized into groups
Smoking_Status   Number of packs of cigarettes smoked per week categorized into groups
Framingham Heart Study: Data Preparation Activity

This activity consists of two parts. Part one outlines how to explore the data to understand the variables for analysis. Part two outlines how to prepare the data for future analyses by creating new variables and subsetting the data.

Part 1: Understanding the Variables

Deciding on an appropriate path for analysis often requires many steps. An important first step is exploring and examining the data. An initial exploratory data analysis provides understanding of the meaning of study variables and can provide crucial clues into data preparations needed before analyzing the data.

1. Open and examine the SASHELP.HEART dataset and its variables. Familiarize yourself with the context and meanings behind the variables and their values.

a. How many observations are in the dataset?
There are a total of 5,209 observations in the SASHELP.HEART dataset.

b. How many variables are in the dataset? How many are numeric? How many are character?
There are 17 total variables in the dataset. Ten of the variables are numeric; the remaining seven are character.

Exploring the assigned values of character variables can demonstrate patterns and inherent orderings. The default ordering of levels in SAS is alphabetical order. The levels of many character variables have an inherent ordering of magnitude. For example, non-smokers smoke less than light smokers, who smoke less than moderate smokers.

2. Tabulate the levels of the character variables in the SASHELP.HEART dataset. For each of the character variables:

a. What data values or levels are observed for each?
The Status variable has two values: Alive (3218) and Dead (1991). The DeathCause variable includes five values: Cancer (539), Cerebral Vascular Disease (378), Coronary Heart Disease (605), Other (357), and Unknown (112), with a blank data value indicating individuals who are currently alive. The Sex variable has two values: Female (2873) and Male (2336). Chol_Status has three values: Borderline (1861), Desirable (1405), and High (1791). Similarly, BP_Status has three values: High (2267), Normal (2143), and Optimal (799). Under the Weight_Status variable, there are three levels: Normal (1472), Overweight (3550), and Underweight (181). Lastly, the Smoking_Status variable has five values: Heavy (16-25) (1046), Light (1-5) (579), Moderate (6-15) (576), Non-smoker (2501), and Very Heavy (> 25) (471).

b. Which variables have an inherent ordering of magnitude? Does alphabetical order of the levels correspond to ordering levels by magnitude for any of these character variables?
Cholesterol Status, Weight Status, Blood Pressure Status, and Smoking Status display an inherent magnitude ordering. This ordering is determined by the magnitude of the underlying attribute for each status, such as the number of cigarettes smoked for Smoking Status. The alphabetical order of the levels does not align with the magnitude ordering for any of these variables. For instance, Weight Status is listed as Normal, Overweight, and Underweight in alphabetical order, which does not reflect the correct magnitude ordering.
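For reference, a minimal sketch of SAS code that could produce the observation and variable counts in question 1 and the tabulations in question 2. This is one possible approach, not necessarily the exact code used to generate the answers above:

/* Report observation count, variable count, and variable types */
proc contents data=sashelp.heart;
run;

/* Tabulate the levels of each character variable */
proc freq data=sashelp.heart;
    tables Status DeathCause Sex Chol_Status BP_Status
           Weight_Status Smoking_Status;
run;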
Examining the values of numeric variables can provide insights into their magnitude, spread, and symmetry. Variables with a symmetric distribution will have roughly equal mean and median, so they can be summarized with either statistic. Variables with substantially different mean and median values indicate a non-symmetric distribution; such variables may be better summarized with a median. Additionally, some numeric variables may have few unique values and so could be better summarized as categorical variables.

3. Generate descriptive statistics and histograms for the numeric variables in the SASHELP.HEART dataset.

a. What is the minimum, maximum, median, and mean of each variable?
- Age CHD Diagnosed: Min 33, Max 90, Median 63, Mean 63.30
- Age at Start: Min 28.5, Max 61.5, Median 43, Mean 44.07
- Height: Min 51.5, Max 75.5, Median 64.5, Mean 64.81
- Weight: Min 70, Max 290, Median 150, Mean 153.09
- Diastolic: Min 50, Max 160, Median 84, Mean 85.36
- Systolic: Min 84, Max 292, Median 132, Mean 136.91
- Metropolitan Relative Weight: Min 68, Max 260, Median 118, Mean 119.96
- Smoking: Min 0, Max 60, Median 1, Mean 9.37
- Age at Death: Min 36, Max 93, Median 71, Mean 70.54
- Cholesterol: Min 100, Max 540, Median 223, Mean 227.42

b. Do the mean and median seem substantially different for any of the variables?
The mean and median of the Smoking variable are substantially different compared to the other variables: the median is 1 while the mean is 9.37. The means and medians of the other variables are not that different from each other.

c. Does Smoking seem to be better suited to be analyzed as a categorical variable or a continuous variable?
The Smoking variable is better suited to be analyzed as a categorical variable.
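A sketch of one way to generate these statistics and histograms, assuming PROC MEANS for the summary statistics and PROC UNIVARIATE for the histograms:

/* Minimum, maximum, median, and mean for each numeric variable */
proc means data=sashelp.heart min max median mean maxdec=2;
    var AgeCHDdiag AgeAtStart Height Weight Diastolic Systolic
        MRW Smoking AgeAtDeath Cholesterol;
run;

/* Histograms for the same variables */
proc univariate data=sashelp.heart noprint;
    var AgeCHDdiag AgeAtStart Height Weight Diastolic Systolic
        MRW Smoking AgeAtDeath Cholesterol;
    histogram;
run;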
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
The SASHELP.HEART dataset contains several categorical variables whose levels were originally created from values of continuous variables in the dataset. Understanding the relationships between related continuous and categorical predictors in a dataset can inform choices of predictors in later statistical analyses.

4. Explore the variables Weight_Status, Smoking_Status, Chol_Status, and BP_Status as follows:

a. Variables Weight_Status, MRW, and Weight:

i. What are the ranges (minimum and maximum) of variables MRW and Weight for each level of Weight_Status?
For Normal weight status, Weight ranges from 92 to 186 and MRW from 91 to 109. For Overweight status, Weight ranges from 104 to 300 and MRW from 110 to 268. For Underweight status, both MRW and Weight have a minimum of 67; the maximum is 90 for MRW and 150 for Weight.

ii. Are the ranges of MRW for levels of Weight_Status overlapping?
The ranges of MRW for levels of Weight_Status are not overlapping.

iii. Are the ranges of Weight for levels of Weight_Status overlapping?
Yes, the ranges of Weight for levels of Weight_Status overlap; for example, the Normal range of 92-186 overlaps the Overweight range of 104-300.

iv. Using your answers to the previous two questions, when this dataset was created which values, MRW or Weight, were used to create the levels for Weight_Status?
MRW values were used to create the levels for Weight_Status, since the MRW ranges are non-overlapping across the levels while the Weight ranges overlap.

b. Variables Smoking_Status and Smoking:

i. Which values of Smoking are categorized as Smoking_Status=Non-smoker? Light? Moderate? Heavy? Very Heavy?
Non-smoker corresponds to 0. Light smokers are 1-5, Moderate smokers are 6-15, and Heavy smokers are 16-25. Lastly, Very Heavy smokers have a value greater than 25.
ii. Are any values of Smoking categorized into more than one level of Smoking_Status?
No, each value of Smoking is categorized into only one level of Smoking_Status.

c. Variables Chol_Status and Cholesterol:

i. What are the ranges (minimum and maximum) of Cholesterol for each level of Chol_Status?
For the Borderline level, the minimum is 200 and the maximum is 239. For the Desirable level, the minimum is 96 and the maximum is 199. For the High level, the minimum is 240 and the maximum is 568.

ii. Are the ranges of Cholesterol for levels of Chol_Status overlapping?
No, the ranges of Cholesterol for levels of Chol_Status are not overlapping.

d. Variable BP_Status:

i. What are the ranges (minimum and maximum) of Diastolic and Systolic for each level of BP_Status?
High blood pressure status: Diastolic minimum is 52, maximum is 160; Systolic minimum is 112, maximum is 300. Normal blood pressure status: Diastolic minimum is 54, maximum is 88; Systolic minimum is 101, maximum is 140. Optimal blood pressure status: Diastolic minimum is 50, maximum is 78; Systolic minimum is 82, maximum is 118.

ii. Are the ranges of Diastolic for levels of BP_Status overlapping?
Yes, the ranges of Diastolic for levels of BP_Status overlap; for example, the Normal range of 54-88 falls within the High range of 52-160.

iii. Are the ranges of Systolic for levels of BP_Status overlapping?
Yes, the ranges of Systolic for levels of BP_Status overlap.

iv. Normal levels of blood pressure are usually defined as under 120 for systolic blood pressure and under 80 for diastolic blood pressure. Based on your answers to the previous questions, are one or both of systolic and diastolic blood pressure required to be high for the individual to be categorized as BP_Status=High?
Only one of the two measures needs to be high. The High group includes individuals with Diastolic as low as 52 and Systolic as low as 112, both of which fall within the normal ranges, so an individual is categorized as BP_Status=High when either systolic or diastolic blood pressure is high; both are not required to be high.
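A sketch of how these ranges can be obtained, using PROC MEANS with a CLASS statement to group each continuous variable by its related categorical status variable:

/* Ranges of MRW and Weight within each level of Weight_Status */
proc means data=sashelp.heart min max;
    class Weight_Status;
    var MRW Weight;
run;

/* Ranges of Smoking within each level of Smoking_Status */
proc means data=sashelp.heart min max;
    class Smoking_Status;
    var Smoking;
run;

/* Ranges of Cholesterol within each level of Chol_Status */
proc means data=sashelp.heart min max;
    class Chol_Status;
    var Cholesterol;
run;

/* Ranges of Diastolic and Systolic within each level of BP_Status */
proc means data=sashelp.heart min max;
    class BP_Status;
    var Diastolic Systolic;
run;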
Exploring patterns of missingness in a dataset gives insight into data collection procedures for the study generating the dataset and may also indicate data entry or data collection errors.

5. Examine missing data in the SASHELP.HEART dataset.

a. Which variables have no missing data?
The Status, Sex, BP_Status, AgeAtStart, Diastolic, and Systolic variables have no missing data.

b. Which variables have missing data?
The Cause of Death, Cholesterol Status, Weight Status, Smoking Status, Age CHD Diagnosed, Height, Weight, MRW, Smoking, Age at Death, and Cholesterol variables have missing data.

c. For each variable with missing data, what percent of the data is missing?
- Cause of Death: 3,218 missing values (61.78% of 5,209 observations)
- Cholesterol Status: 152 missing values (2.92%)
- Weight, Height, and MRW: 6 missing values each (0.12% each)
- Smoking Status: 36 missing values (0.69%)
- Age CHD Diagnosed: 3,760 missing values (72.18%)
- Age at Death: 3,218 missing values (61.78%)
- Cholesterol: 152 missing values (2.92%)

d. Using what you currently know about the dataset, given the definition of the variable(s) or given values of other variables in the dataset, which variable(s) have patterns of missingness that could be expected?
The Cause of Death and Age at Death variables have patterns of missingness that could be expected, since these values do not exist for people who are still alive.

6. Examine patterns of missingness on certain groups of variables as follows:

a. If MRW is non-missing, are both Height and Weight always non-missing?
No. Even when MRW is non-missing, Height can be missing (about 0.08% of the time).

b. If Weight_Status is non-missing, are both Height and Weight always non-missing?
No. Even when Weight_Status is non-missing, Height can be missing (about 0.08% of the time).

c. If Smoking is non-missing, is Smoking_Status always non-missing, and vice versa?
Yes, if Smoking is non-missing, Smoking_Status is always non-missing, and vice versa.

d. If Cholesterol is non-missing, is Chol_Status always non-missing, and vice versa?
Yes, if Cholesterol is non-missing, Chol_Status is always non-missing, and vice versa.

e. Analyze DeathCause and AgeAtDeath grouped by Status.

i. Are DeathCause and AgeAtDeath ever missing when Status=Dead?
No, DeathCause and AgeAtDeath are never missing when Status=Dead.
ii. Are DeathCause and AgeAtDeath ever non-missing when Status=Alive?
No, DeathCause and AgeAtDeath are never non-missing when Status=Alive.

f. Analyze AgeCHDdiag grouped by DeathCause. Is AgeCHDdiag ever missing when DeathCause=Coronary Heart Disease?
No, AgeCHDdiag is never missing when DeathCause=Coronary Heart Disease.

Missing values can also impact later statistical analyses. SAS statistical procedures perform what is called a complete case analysis, which is to say that analyses will exclude any observation with a missing value for any variable involved in the analysis. Such exclusions can substantially decrease the number of observations in a dataset that are used in a later statistical analysis.

7. Tabulate the percent of observations in the SASHELP.HEART dataset that have non-missing values for all the predictor variables that you will use in later analyses: AgeAtStart, BP_Status, Chol_Status, Cholesterol, Diastolic, Height, MRW, Sex, Smoking, Smoking_Status, Systolic, Weight, and Weight_Status. Does the SASHELP.HEART dataset seem to have a high amount of missing data for any of these predictors?
None of these predictors has an especially high amount of missing data. Cholesterol and Chol_Status have the highest counts, at 152 missing values each (about 2.92%), and Smoking, Smoking_Status, Height, MRW, and Weight each have a small number of missing values. Overall, fewer than 5% of observations are missing a value for at least one of these predictors.
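A sketch of code that could summarize the missingness explored in questions 5 through 7. The complete-case flag at the end, along with the WORK.HEART_COMPLETE dataset and the variable named complete, is a hypothetical way to tabulate the percentage of observations with no missing data on the predictors listed in question 7:

/* Count of missing values for every numeric variable */
proc means data=sashelp.heart n nmiss;
run;

/* Include missing levels when tabulating the character variables */
proc freq data=sashelp.heart;
    tables Status DeathCause Sex Chol_Status BP_Status
           Weight_Status Smoking_Status / missing;
run;

/* Example of examining missingness by group (question 6e): AgeAtDeath within Status */
proc means data=sashelp.heart n nmiss;
    class Status;
    var AgeAtDeath;
run;

/* Flag observations that are complete on all predictors of interest (question 7) */
data work.heart_complete;   /* hypothetical temporary dataset name */
    set sashelp.heart;
    complete = (cmiss(of AgeAtStart BP_Status Chol_Status Cholesterol Diastolic
                         Height MRW Sex Smoking Smoking_Status Systolic
                         Weight Weight_Status) = 0);
run;

proc freq data=work.heart_complete;
    tables complete;
run;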
Part 2: Creating New Variables and Subsetting the Data

An important next step after exploring a dataset is to create any new variables needed for later analyses. The primary outcome of the Framingham Heart Study is whether a patient developed coronary heart disease. Interestingly, this variable is not included in the SASHELP.HEART dataset.

1. Use the information in the variable AgeCHDdiag to create a variable describing whether a patient developed coronary heart disease. Specifically, if AgeCHDdiag is non-missing, then the individual had coronary heart disease, and if AgeCHDdiag is missing, the individual did not have coronary heart disease.

a. Create a new numeric variable named CHD.
b. Store this new variable in a temporary dataset named WORK.HEART1.
c. Code this variable so that CHD=1 if AgeCHDdiag takes a value from 0 to 999 and CHD=0 otherwise.

After creating any new variable, make sure to check your work.

2. Generate descriptive statistics for the variable AgeCHDdiag grouped by CHD.

a. Is CHD a numeric variable?
Yes, CHD is a numeric variable.

b. When CHD=1, is AgeCHDdiag always non-missing?
Yes, when CHD=1, AgeCHDdiag is always non-missing.

c. When CHD=0, is AgeCHDdiag always missing?
Yes, when CHD=0, AgeCHDdiag is always missing.
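A minimal sketch of the DATA step for question 1 and the grouped check for question 2, assuming the dataset and variable names given in the instructions:

/* Create CHD: 1 if a CHD diagnosis age is recorded, 0 otherwise */
data work.heart1;
    set sashelp.heart;
    if 0 <= AgeCHDdiag <= 999 then CHD = 1;
    else CHD = 0;   /* includes observations where AgeCHDdiag is missing */
run;

/* Check: AgeCHDdiag should be non-missing only when CHD = 1 */
proc means data=work.heart1 n nmiss min max;
    class CHD;
    var AgeCHDdiag;
run;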
Let's now turn to creating new predictor variables. Statistical analyses can determine which variables collected in the Framingham Heart Study are predictive of development of coronary heart disease. To facilitate comparison of levels of categorical predictors, levels of categorical predictors must be recoded so that alphabetical order of the levels also corresponds to ordering the levels by magnitude. This is desirable since statistical procedures use the alphabetically last level as a reference level by default. Recoding is also useful so that levels appear in a logical order in plots.

3. Recode categorical variables in the SASHELP.HEART dataset as follows (a sketch of one possible recoding step appears after this section):

a. Use WORK.HEART1 as the input dataset.
b. Create an output dataset named WORK.HEART2.
c. Create a new variable Chol_StatusNew by recoding Chol_Status as follows:
   High = 1 High
   Borderline = 2 Borderline
   Desirable = 3 Desirable
d. Create a new variable Sex_New by recoding Sex as follows:
   Male = 1 Male
   Female = 2 Female
e. Create a new variable Weight_StatusNew by recoding Weight_Status as follows:
   Overweight = 1 Overweight
   Normal = 2 Normal
   Underweight = 3 Underweight
f. Create a new variable Smoking_StatusNew by recoding Smoking_Status as follows:
   Very Heavy (> 25) = 1 Very Heavy
   Heavy (16-25) = 2 Heavy
   Moderate (6-15) = 3 Moderate
   Light (1-5) = 4 Light
   Non-smoker = 5 Non-smoker
g. Tabulate each of your new variables as follows to check your work:
   i. Tabulate the levels of each of the four new variables over all observations.
   ii. Tabulate the levels of Chol_StatusNew grouped by Chol_Status.
   iii. Tabulate the levels of Sex_New grouped by Sex.
   iv. Tabulate the levels of Weight_StatusNew grouped by Weight_Status.
   v. Tabulate the levels of Smoking_StatusNew grouped by Smoking_Status.
   vi. Do you see the expected ordering of levels within each variable (in part i) as well as the expected combinations of levels of recoded and original variables (in parts ii-v)?
   No, the only variable that shows the expected ordering of levels is Smoking_Status. The other variables are ordered only alphabetically.

We have now finished creating new variables. In Part 1, question 7, you tabulated the amount of missing data for the set of predictor variables of interest in the SASHELP.HEART dataset. From this, you noticed that only a small percentage (<5%) of observations in the SASHELP.HEART dataset have missing data for any of these variables. Ideally, statistical analyses for the SASHELP.HEART dataset should be performed only on observations with no missing data for all of these predictors. This ensures that all analyses, regardless of the predictors included, use the same number of observations. Given that the amount of missing data is small, analyses can simply exclude any observation with missing data on at least one of the predictors of interest. Other strategies such as single or multiple imputation could be employed, but those are beyond the scope of this exercise.
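As referenced above, a sketch of one possible recoding DATA step for question 3, together with the check tabulations from part g. The character values tested in the IF conditions are assumed to match the level labels reported in Part 1, question 2a; any level that does not match exactly would be left blank in the new variable:

data work.heart2;
    set work.heart1;
    length Chol_StatusNew Sex_New Weight_StatusNew Smoking_StatusNew $ 20;

    /* Prefix each level with a number so alphabetical order matches magnitude order */
    if Chol_Status = 'High' then Chol_StatusNew = '1 High';
    else if Chol_Status = 'Borderline' then Chol_StatusNew = '2 Borderline';
    else if Chol_Status = 'Desirable' then Chol_StatusNew = '3 Desirable';

    if Sex = 'Male' then Sex_New = '1 Male';
    else if Sex = 'Female' then Sex_New = '2 Female';

    if Weight_Status = 'Overweight' then Weight_StatusNew = '1 Overweight';
    else if Weight_Status = 'Normal' then Weight_StatusNew = '2 Normal';
    else if Weight_Status = 'Underweight' then Weight_StatusNew = '3 Underweight';

    if Smoking_Status = 'Very Heavy (> 25)' then Smoking_StatusNew = '1 Very Heavy';
    else if Smoking_Status = 'Heavy (16-25)' then Smoking_StatusNew = '2 Heavy';
    else if Smoking_Status = 'Moderate (6-15)' then Smoking_StatusNew = '3 Moderate';
    else if Smoking_Status = 'Light (1-5)' then Smoking_StatusNew = '4 Light';
    else if Smoking_Status = 'Non-smoker' then Smoking_StatusNew = '5 Non-smoker';
run;

/* Part g checks: one-way tables of the new variables and crosstabs against the originals */
proc freq data=work.heart2;
    tables Chol_StatusNew Sex_New Weight_StatusNew Smoking_StatusNew
           Chol_Status*Chol_StatusNew Sex*Sex_New
           Weight_Status*Weight_StatusNew Smoking_Status*Smoking_StatusNew / missing;
run;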
4. Create a new permanent dataset that can be used for later statistical analyses.

a. Use WORK.HEART2 as the input dataset.
b. Create a library named HEARTLIB.
c. Create an output dataset named HEARTLIB.MYHEART that contains only those observations that have non-missing values for the variables below:
   AgeAtStart, BP_Status, Chol_StatusNew, Cholesterol, Diastolic, Height, MRW, Sex_New, Smoking, Smoking_StatusNew, Systolic, Weight, Weight_Status, and Weight_Status2.
   This dataset should have 5,039 observations.
d. Check your work for the dataset HEARTLIB.MYHEART by tabulating values of character variables and generating descriptive statistics for numeric variables. Do you see any missing values in any of the tabulations or statistics generated?
Yes, there are still missing values for AgeCHDdiag, AgeAtDeath, and DeathCause.
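A sketch of the library assignment and subsetting step for question 4. The folder path in the LIBNAME statement is a placeholder for a real location in your environment, and the variable list follows the one given above; Weight_Status2 does not appear elsewhere in this activity, so the sketch assumes it refers to the recoded Weight_StatusNew variable created in Part 2, question 3:

/* Assign a permanent library (replace the path with a real folder) */
libname heartlib '/home/your-user-id/heart';

/* Keep only observations with no missing values for the analysis variables */
data heartlib.myheart;
    set work.heart2;
    if cmiss(of AgeAtStart BP_Status Chol_StatusNew Cholesterol Diastolic
                Height MRW Sex_New Smoking Smoking_StatusNew Systolic
                Weight Weight_Status Weight_StatusNew) = 0;
run;

/* Check the result: tabulate character variables and summarize numeric variables */
proc freq data=heartlib.myheart;
    tables Status DeathCause Sex Chol_Status BP_Status Weight_Status
           Smoking_Status Chol_StatusNew Sex_New Weight_StatusNew
           Smoking_StatusNew / missing;
run;

proc means data=heartlib.myheart n nmiss min max mean;
run;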
Congratulations, you have completed data preparation for the Framingham Heart Study dataset! A next step in exploring relationships between coronary heart disease and predictors of interest is to perform additional descriptive analyses by creating logit plots. The related Framingham Heart Study: Descriptive Analysis, Industry Applied Activity provides practice in generating logit plots. Following this, logistic regression models can be fit to formalize the statistical relationships between coronary heart disease and predictors of interest. The related Framingham Heart Study: Statistical Analysis, Industry Applied Activity provides practice in fitting these logistic regression models. These activities can be found in the Academic Hub.

Appendix

Appendix A: Access Software
SAS OnDemand for Academics (ODA) is a free, full suite of cloud-based software that supports the analytics life cycle: from data, to discovery, to deployment. Students can use SAS OnDemand for Academics to get access to SAS Studio for free. Click here to access ODA.
Note: You need to have an established SAS profile linked to an academic affiliation. If you don't have a SAS Profile, click here to set one up. Check out the Frequently Asked Questions for more support.

Appendix B: Helpful Documentation
Below are helpful links to documentation regarding the procedures used in the activity:
- The CONTENTS procedure
- The PRINT procedure
- The MEANS procedure
- The FREQ procedure
- The UNIVARIATE procedure
- Base SAS Procedures Guide
- DATA Step Statements: Reference

Appendix C: Recommended Learning
The SAS Global Academic Program offers free e-learning courses for students to learn SAS through the Student Skill Builder. The following e-learning courses and paths are recommended to help with this activity:
- SAS Programming 1: Essentials
- SAS Programming 2: Data Manipulation Techniques
- Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression