Week 2: Exercise & solution
Fatema Johara
2023-09-25

Exercise

The North Carolina State Center for Health Statistics and Howard W. Odum Institute for Research in Social Science at the University of North Carolina at Chapel Hill (A-20) make publicly available birth and infant death data for all children born in the state of North Carolina. Records on birth data go back to 1968. This comprehensive data set for the births in 2001 contains 120,300 records. The data represents a random sample of 800 of those births and selected variables. The variables are as follows:

Variable Label   Description
PLURALITY        Number of children born of the pregnancy
SEX              Sex of child (1 = male, 2 = female)
MAGE             Age of mother (years)
WEEKS            Completed weeks of gestation (weeks)
MARITAL          Marital status (1 = married, 2 = not married)
RACEMOM          Race of mother (0 = other non-White, 1 = White, 2 = Black, 3 = American Indian, 4 = Chinese, 5 = Japanese, 6 = Hawaiian, 7 = Filipino, 8 = Other Asian or Pacific Islander)
HISPMOM          Mother of Hispanic origin (C = Cuban, M = Mexican, N = Non-Hispanic, O = other and unknown Hispanic, P = Puerto Rican, S = Central or South American, U = not classifiable)
GAINED           Weight gained during pregnancy (pounds)
SMOKE            0 = mother did not smoke during pregnancy, 1 = mother did smoke during pregnancy
DRINK            0 = mother did not consume alcohol during pregnancy, 1 = mother did consume alcohol during pregnancy
TOUNCES          Weight of child (ounces)
TGRAMS           Weight of child (grams)
LOW              0 = infant was not low birth weight, 1 = infant was low birth weight
PREMIE           0 = infant was not premature, 1 = infant was premature (premature defined at 36 weeks or sooner)

Tasks

Please download the data “LDS_C02_NCBIRTH800.csv” from Quercus and read that in R.

1. Find the mean and median of the weight of infant by gender, in grams
2. Make a scatter plot of weight gained during pregnancy vs infant weight, in grams
3. Construct side-by-side box-and-whisker plots for the variable of TOUNCES for women who admitted to smoking and women who did not admit to smoking. Do you see a difference in birth weight in the two groups? Which group has more variability?
4. Subset the data for married mothers only, save it to both CSV and RData file
Solution

# Setting up my working directory
setwd("D:/UofT/TA/Fall_2022_Biostatistics I/Tutorials/Tutorial 2")

# Libraries that might be needed
library(openxlsx) # 'openxlsx' package is used to read and write Excel data files
library(ggplot2)  # 'ggplot2' package is used for graphical illustration

# Reading a csv data file
dat <- read.csv("LDS_C02_NCBIRTH800.csv")

# How do we know the size and structure of the data?
dim(dat)
## [1] 800 14
str(dat)
## 'data.frame': 800 obs. of 14 variables:
##  $ plural : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex    : int 1 2 1 1 1 1 2 2 2 2 ...
##  $ mage   : int 32 32 27 27 25 28 25 15 37 21 ...
##  $ weeks  : int 40 37 39 39 39 43 39 42 41 39 ...
##  $ marital: int 1 1 1 1 1 1 1 2 1 1 ...
##  $ racemom: int 1 1 1 1 1 1 1 1 8 1 ...
##  $ hispmom: chr "N" "N" "N" "N" ...
##  $ gained : int 38 34 12 15 32 32 75 25 31 28 ...
##  $ smoke  : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ drink  : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ tounces: int 111 116 138 136 121 117 143 113 139 120 ...
##  $ tgrams : num 3147 3289 3912 3856 3430 ...
##  $ low    : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ premie : int 0 0 0 0 0 0 0 0 0 0 ...

# Number of rows and columns of the data
nrow(dat)
## [1] 800
ncol(dat)
## [1] 14

1. Please find the mean and median of the weight of infant by gender!
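Before summarising by group, it can help to check for missing values, because mean() and median() return NA when any input is missing unless na.rm = TRUE is set. The sketch below uses a small made-up data frame (toy, with invented values, not the course data) to show the check and the effect of na.rm.

```r
# Toy data standing in for the birth records (values are invented)
toy <- data.frame(
  sex    = c(1, 1, 2, 2, 2),
  tgrams = c(3200, NA, 3100, 2900, 3500)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# With a missing value the plain mean is NA; na.rm = TRUE drops it first
mean_plain <- mean(toy$tgrams)
mean_clean <- mean(toy$tgrams, na.rm = TRUE)
print(mean_plain)
print(mean_clean)

# Group-wise mean and median that skip missing values
by_sex_mean   <- tapply(toy$tgrams, toy$sex, mean,   na.rm = TRUE)
by_sex_median <- tapply(toy$tgrams, toy$sex, median, na.rm = TRUE)
print(by_sex_mean)
print(by_sex_median)
```

Extra arguments after the function name in tapply() are passed on to that function, which is how na.rm = TRUE reaches mean() and median() here.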
# We can get mean and median from summary statistics!
summary(dat$tgrams[dat$sex == 1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   453.6  3033.4  3373.7  3340.8  3798.9  4791.1
summary(dat$tgrams[dat$sex == 2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   340.2  2976.8  3288.6  3253.8  3628.8  4706.1

# Summary statistics of weight by gender
tapply(dat$tgrams, dat$sex, summary)
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   453.6  3033.4  3373.7  3340.8  3798.9  4791.1
##
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   340.2  2976.8  3288.6  3253.8  3628.8  4706.1

# However, mean and median functions are available
mean(dat$tgrams[dat$sex == 1])
## [1] 3340.824
mean(dat$tgrams[dat$sex == 2])
## [1] 3253.793
median(dat$tgrams[dat$sex == 1])
## [1] 3373.65
median(dat$tgrams[dat$sex == 2])
## [1] 3288.6

# Mean and median of weight by gender
tapply(dat$tgrams, dat$sex, mean)
##        1        2
## 3340.824 3253.793
tapply(dat$tgrams, dat$sex, median)
##       1       2
## 3373.65 3288.60

2. Make a scatter plot of weight gained during pregnancy vs infant weight

# Scatter plot of weight gained during pregnancy vs infant weight
plot(dat$gained, dat$tgrams)

[Figure: scatter plot of dat$tgrams (y-axis) against dat$gained (x-axis)]

# We can use the ggplot function from the package "ggplot2" to make a nicer-looking plot
ggplot(data = dat, aes(x = gained, y = tgrams)) +
  geom_point(col = "blue")
## Warning: Removed 23 rows containing missing values (`geom_point()`).
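The warning above reports how many rows ggplot2 dropped because gained or tgrams was missing. One way to confirm such a count is complete.cases(). The sketch below uses invented values in a toy data frame (not the course data), so its numbers are illustrative only.

```r
# Toy data with a few missing values (invented, not the course data)
toy <- data.frame(
  gained = c(30, NA, 25, 40, NA),
  tgrams = c(3200, 3100, NA, 3500, 2800)
)

# TRUE where both plotted variables are present
ok <- complete.cases(toy[, c("gained", "tgrams")])

# Number of rows a scatter plot would have to drop
n_dropped <- sum(!ok)
print(n_dropped)

# Keep only the rows that can be plotted
toy_complete <- toy[ok, ]
print(toy_complete)
```

On the real data, the same call with dat in place of toy should give a count that matches the number in the warning.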
[Figure: scatter plot of tgrams (y-axis) against gained (x-axis), blue points]

We can also construct a scatter plot of weight gained during pregnancy vs infant weight by marital status.

# Recoding marital status [1 = Married, 2 = Not married]
dat$marital_cat <- ifelse(dat$marital == 1, "Married", "Not Married")
ggplot(data = dat, aes(x = gained, y = tgrams)) +
  geom_point(col = "purple") +
  facet_wrap(~marital) # the function 'facet_wrap' gives two separate scatter plots
## Warning: Removed 23 rows containing missing values (`geom_point()`).
[Figure: scatter plots of tgrams (y-axis) against gained (x-axis) in two panels labelled 1 and 2]

The variable “marital” takes two values, 1 and 2. It is helpful to label them so that the plot itself shows what 1 and 2 mean.

# Recoding marital status [1 = Married, 2 = Not married]
dat$marital_cat <- ifelse(dat$marital == 1, "Married", "Not Married")
ggplot(data = dat, aes(x = gained, y = tgrams)) +
  geom_point(col = "purple") +
  facet_wrap(~marital_cat)
## Warning: Removed 23 rows containing missing values (`geom_point()`).
[Figure: scatter plots of tgrams (y-axis) against gained (x-axis) in two panels labelled Married and Not Married]

3. Construct side-by-side box-and-whisker plots for the variable of TOUNCES for women who admitted to smoking and women who did not admit to smoking. Do you see a difference in birth weight in the two groups? Which group has more variability?

# Side-by-side box-and-whisker plots of TOUNCES by smoking status
boxplot(dat$tounces ~ dat$smoke)
[Figure: box plots of dat$tounces (y-axis) by dat$smoke, groups labelled 0 and 1]

There are two boxplots for the two groups, but the labels 0 and 1 do not say which group is which. Therefore, we recode the variable smoke and then create the boxplot again.

class(dat$smoke)
## [1] "integer"
dat$smoke_cat <- ifelse(dat$smoke == 0, "didn't smoke", "did smoke")
boxplot(dat$tounces ~ dat$smoke_cat)
[Figure: box plots of dat$tounces (y-axis) by dat$smoke_cat, groups labelled did smoke and didn't smoke]

ggplot(data = dat, aes(x = smoke_cat, y = tounces, fill = smoke_cat)) +
  geom_boxplot()
[Figure: box plots of tounces (y-axis) by smoke_cat with three groups: did smoke, didn't smoke, NA]

We can see three boxplots here, one of which is for the NA category. We can check how many values are missing in the smoke variable. The output below shows two missing values for smoke, which means we don't know the smoking status of two women. For now we exclude them and construct the plot again.

table(is.na(dat$smoke))
##
## FALSE  TRUE
##   798     2
dat_excluded <- dat[!is.na(dat$smoke), ]
dim(dat_excluded) # Now we have 798 rows, since we excluded 2 from the entire dataset.
## [1] 798 16
ggplot(data = dat_excluded, aes(x = smoke_cat, y = tounces, fill = smoke_cat)) +
  geom_boxplot() +
  labs(x = "Smoking status", y = "Weight of child (ounces)", fill = "Smoking status")
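The box plots compare the two groups visually. To put numbers on the spread when answering which group has more variability, one option is to compute the standard deviation and interquartile range per group. The sketch below uses invented weights in a toy data frame (not the course data), so its results say nothing about the real comparison.

```r
# Toy data: birth weight in ounces by smoking group (invented values)
toy <- data.frame(
  smoke_cat = c("did smoke", "did smoke", "did smoke",
                "didn't smoke", "didn't smoke", "didn't smoke"),
  tounces   = c(100, 110, 120, 110, 120, 150)
)

# Standard deviation and interquartile range within each group
sd_by_group  <- tapply(toy$tounces, toy$smoke_cat, sd)
iqr_by_group <- tapply(toy$tounces, toy$smoke_cat, IQR)

print(sd_by_group)
print(iqr_by_group)
```

The interquartile range corresponds to the height of each box, so it is the numeric counterpart of what the plot shows. The standard deviation is more sensitive to the extreme points drawn beyond the whiskers.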
[Figure: box plots of weight of child (ounces) by smoking status, groups did smoke and didn't smoke]

4. Subset the data for married mothers only, save it to both CSV and RData file

We will subset the data for married mothers only. The variable 'marital' takes two categories: 1 for married and 2 for not married. To select married mothers, we simply condition on "marital == 1" in the entire data.

dat_married <- dat[dat$marital == 1, ]
nrow(dat_married) # number of rows in the subset for married mothers
## [1] 537
table(dat$marital) # table gives the counts of married and not married mothers
##
##   1   2
## 537 263
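Logical subsetting of the form dat[dat$marital == 1, ] returns a row of NAs for every record whose marital status is missing. The table above gives 537 + 263 = 800, so no status is missing here, but which() or subset() guard against that case in general. A sketch on invented data (toy is not the course data):

```r
# Toy data with one missing marital status (invented values)
toy <- data.frame(marital = c(1, 2, NA, 1), mage = c(30, 22, 27, 35))

# Plain logical indexing: the NA comparison produces an all-NA row
with_na_row <- toy[toy$marital == 1, ]
print(with_na_row)

# which() keeps only the positions where the condition is TRUE
married_which <- toy[which(toy$marital == 1), ]

# subset() also drops rows where the condition evaluates to NA
married_subset <- subset(toy, marital == 1)
print(married_subset)
```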
How do we know the subsetting was done correctly? We can count the number of married mothers and compare it with the number of rows in the subset. If the two counts match, the subsetting was done correctly.

# Saving the subset
write.csv(dat_married, "dat_married.csv")   # saving data in .csv format
write.xlsx(dat_married, "dat_married.xlsx") # saving data in .xlsx format
save(dat_married, file = "dat_married.RData")
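To check that the saved files can be read back, a round trip like the one below is one option. It writes a toy data frame (invented values, not the course data) to temporary files rather than the working directory. Setting row.names = FALSE is an optional choice that stops write.csv() from adding a column of row numbers.

```r
# Toy stand-in for dat_married (invented values)
toy_married <- data.frame(marital = c(1, 1, 1), tgrams = c(3200, 3100, 3500))

# Temporary file paths so nothing in the working directory is overwritten
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

write.csv(toy_married, csv_path, row.names = FALSE)
save(toy_married, file = rdata_path)

# Read the CSV back into a new data frame
from_csv <- read.csv(csv_path)

# load() restores objects under their saved names; a fresh environment
# keeps the restored copy separate from the original object
env <- new.env()
load(rdata_path, envir = env)
from_rdata <- env$toy_married

print(dim(from_csv))
print(identical(from_rdata, toy_married))
```

An .RData file keeps the object exactly as it was in R, including column types, while a CSV is re-parsed on reading, so types can differ slightly after the round trip.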