E7_Describe_answers_v3
School
Tulane University *
Course
4
Subject
Mathematics
Date
Apr 3, 2024
E.7. Descriptive analysis of analytic population
Overview: In this exercise, we will conduct descriptive analysis of study variables within the analytic sample, and within strata, for data quality. Descriptive analysis will provide information about extreme or outlying values, distributions of variables, and cell counts. The video clip demonstrated this process for food secure girls. In this exercise, we will apply this process to food secure boys.
Objectives:
Learn and practice how to calculate descriptive statistics for continuous and categorical (or binary) variables.
Learn and practice how to ascertain adequacy of cell size (counts) and variability needed to assess the association of interest.
In your course project, you will apply these approaches for all the variables in your analysis.
This exercise will use the full merged data file: NHANES0708_all, for Stata or for SAS
Before completing this exercise:
View the following video clip: E7_Describe
Read the data cleaning paper by Van den Broeck et al. See the Weekly Course Materials page for the link.
Refer to the Kohn et al. article for certain questions below.
Optional: skim glossary of statistical data editing
Survey versus standard analysis. At this stage, our primary goal is to become familiar with our data, so that we can assess feasibility and make preliminary decisions about how the study measures will be characterized and what type of regression model is most appropriate. The results of this type of analysis often appear as a preliminary data table in a grant or in a report to the research team. Therefore, we use survey commands when possible. For tabulation at this stage, the primary goal is to assess unweighted cell sizes; weighted percentages will be calculated in the next exercises and can be reported in your Table 1. However, keep in mind that the most appropriate approach may vary depending on the research context.
Part I. Descriptive Assessments: Univariable Analyses
1.
Examine exposure: food assistance
a.
Among food secure and food insecure boys, how many receive food assistance? How many do not receive food assistance?
Food security (counts, unweighted)
Exposure                          Food secure    Food insecure
Receive food assistance           349            216
Do not receive food assistance    90             22
b.
The E7 video indicates that a “very general” rule of thumb for minimum cell size is 20 observations. As explained in the NHANES Analytic Guidelines (see footnote 1), a denominator of at least 30 participants (unweighted) is needed in order to produce statistically reliable estimates of proportions (expressed as a percentage). Do these recommendations about minimum cell size seem generally compatible to you? Please provide a short explanation of your reasoning.
These recommendations are generally compatible. Both the video narrator and the NHANES documentation point out that minimum cell sizes are approximate and apply under some, but not all, conditions. Both sources of guidance emphasize that larger cell sizes may be needed depending on other conditions, such as when a variable is measured with error, when there are several confounders of an association for which adjustment is needed, or when an outcome is rare (or very common).
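To make the link between denominator size and reliability concrete, the sketch below compares the approximate 95% margin of error of an estimated percentage across the two rule-of-thumb denominators and the four cell counts from question 1.a. It is only an illustration: the prevalence `p` is a hypothetical value chosen for the example (it is not taken from the data), and the formula assumes simple random sampling, so it ignores the NHANES design effects that would widen these margins further.

```r
# Unweighted denominators: the two rules of thumb plus the four cells from question 1.a
n <- c(rule_20 = 20, rule_30 = 30, insecure_no_fa = 22,
       secure_no_fa = 90, insecure_fa = 216, secure_fa = 349)

# Hypothetical prevalence, chosen only for illustration (not estimated from the data)
p <- 0.15

# Standard error of a proportion under simple random sampling (no design effect)
se <- sqrt(p * (1 - p) / n)

# Approximate 95% margin of error, in percentage points
moe_pct <- round(100 * qnorm(0.975) * se, 1)
print(data.frame(n = n, moe_pct = moe_pct))
```

The margin of error shrinks with the square root of the denominator, so the cell with 22 participants yields a much wider margin than the cells with 216 or 349 participants at the same prevalence.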
c.
Recall that Kohn et al. estimated percentages of boys and girls with high waist circumference or who were overweight or obese according to food assistance status (Table 1). Referring to the counts in question 1.a above, which cells seem to have sufficiently large counts to reliably estimate prevalence of the different weight status variables (e.g., high waist circumference) and which may not?
The cells corresponding to food secure boys receiving food assistance, food secure boys not receiving food assistance, and food insecure boys receiving food assistance seem to have large enough counts for reliable estimation of prevalence of the body size measures. However, the cell for food insecure boys who do not receive food assistance has only 22 participants, which may not be large enough for producing statistically reliable estimates.
d.
The next three questions develop our thinking about cell sizes within strata and the impact on the statistical precision for measures of association, a key point emphasized in the E.7 video.
i.
Recall that the authors of the Kohn et al. article estimated odds ratios as the measure of association between food assistance status and categorical body size measures. Referring to the table for question 1.a: which exposure category of food assistance status would be used as the referent for the odds ratio?
The category “do not receive food assistance” would be used as the referent for the odds ratio.
ii.
From the table for question 1.a: what is the cell size (count of observations) in the category of food assistance status used as the referent for the odds ratio in the stratum of food insecure boys?
The cell size is 22.
Footnote 1: As reference, see the NHANES: Analytic Guidelines, 2011-2014 and 2015-2016, Section 3.3.1, page 35. Note that this publication recommends that an even larger denominator is needed for reliable estimates of proportions that are extreme, either very rare or very common.
iii.
Now refer to page 158, paragraph 2 in the right column of Kohn et al. The authors report that “Among low income, food-insecure youth, food assistance participation was not associated with … high waist circumference or categorical weight status for any specification of food assistance in the fully adjusted models… (data not shown).” Based on your answers to part d.i. and d.ii. above and guidance offered in the E.7 video, what is a plausible explanation for the lack of association between food assistance participation and high waist circumference or the categorical body size measures among food insecure boys?
A plausible explanation for the lack of association between food assistance participation and the categorical body size measures among food insecure boys is that the sample size was too small to produce statistically reliable (precise) odds ratio estimates. There were only 22 observations in the referent category and just a fraction of these would have had the body size outcome of interest.
For the remaining items in Part I below, limit the analyses to food secure boys.
2.
Examine outcome: BMI z-score
a.
Calculate the following descriptive statistics. Please be aware of the footnotes to the table below. When using decimals, report them to the nearest hundredth.
Statistic                   Value
Number                      439
Mean*                       0.60
SE*                         0.09
Min                         -3.39
Max                         3.33
Median**                    0.56
Skewness (p-value)**,†      0.07
Kurtosis (p-value)**,†      0.02
*From Stata svy command; SAS users use design variables as you learned in E.6.
**Stata users: Include aweight
†SAS users: skip skewness and kurtosis
R users: may be able to get skewness and kurtosis coefficients but not p-values. Peer-reviewers: please do not mark this problem wrong if the student is using R.
b.
Create a histogram and box plot. Paste below. Reduce the size to about 2”x2” if possible. In your project analysis, you should also use other normality plots (e.g., Q-Q) as appropriate for your study. Survey commands for these plots are not available, so they are run as described in your Biostatistics courses. R users can use svyhist() and svyboxplot() to include survey weights in this question. If so, their plots will look slightly different from the Stata plots.
[Figure: histogram of bmiz (y-axis: Density, 0 to .4; x-axis: bmiz, -4 to 4) and box plot of bmiz (axis: -4 to 4)]
c.
Are there any extreme values? If yes, do they appear to be of concern? That is, might the values be erroneous, or are they plausible? (Hint: check the component height and weight values using variables in the dataset.)
There are extreme values (those below -3.0), but they are biologically plausible. Inspection of the component height and weight values for these observations does not reveal any clearly incorrect data.
d.
What do you conclude about the distribution of this outcome variable? Does it seem appropriate to model the BMI z-score as is for linear regression analysis or would it be advisable to perform some type of transformation?
The distribution is not perfectly normal, but it does not have notable skew. All values are contiguous (no discontinuities). Thus, it seems appropriate to model the BMI z-score without transformation.
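For readers who want to see what the skewness and kurtosis coefficients measure, the following base R sketch computes the moment-based versions by hand. It is unweighted, produces coefficients only (no p-values), and runs on simulated stand-in data: the sample size and mean echo the table above, but the standard deviation and the helper names `skewness()` and `excess_kurtosis()` are assumptions made for this illustration, not part of the course code.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment scaled by variance^2, minus 3
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(1)
z <- rnorm(439, mean = 0.6, sd = 1.1)  # simulated stand-in for a roughly symmetric z-score
right_skewed <- exp(z)                 # the same values on a strongly right-skewed scale

# Coefficients near 0 suggest symmetry and normal-like tails
round(c(skew_z = skewness(z), kurt_z = excess_kurtosis(z),
        skew_exp = skewness(right_skewed)), 2)
```

A coefficient close to zero, as expected for the simulated z-scores, supports modeling a variable without transformation, whereas a large positive skewness like that of the exponentiated values would prompt a closer look.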
3.
Examine distributions of a few other independent variables.
a.
Age (ridageyr): calculate descriptive statistics. Please be aware of the footnotes to the table. When using decimals, report them to the nearest hundredth.
Statistic      Value
Number         439
Mean*          9.58
SE*            0.27
Min            4
Max            17
Median**       9
Skewness†      0.07
Kurtosis†      <0.00
*From Stata svy command; SAS users use design variables as you learned in E.6.
**Stata users: Include aweight
†SAS users: skip skewness and kurtosis
R users: may be able to get skewness and kurtosis coefficients but not p-values. Peer-reviewers: please do not mark this problem wrong if the student is using R.
b.
Create a histogram and box plot. Paste below. Reduce the size to about 2”x2” if possible.
[Figure: histogram of age (y-axis: Density, 0 to .2; x-axis: Age at Screening Adjudicated - Recode, 5 to 20) and box plot of age (axis: Age at Screening Adjudicated - Recode, 5 to 20)]
i. What do you conclude about the distribution of the age variable? Are there any concerns about using this variable in regression analysis?
The distribution of age departs somewhat from normality. However, there are no discontinuities in the distribution or any evidence of sparse data. There do not appear to be issues of concern for including age in regression analyses.
c.
Categorical: calculate frequencies.
Variable                     Category       Count (unweighted)
FPL (fpl_2cat)               0-100%         209
                             101-200%       230
Health Insurance (hinsur)    Any private    99
                             Public only    254
                             Other          86
i. Do the cell counts for these variables seem to be of sufficient size for analysis?
Yes, the cell counts for both of these variables are of sufficient size for analysis.
Part II: Descriptive Assessments: Bivariable Analyses
For the questions in this section, limit the analyses to food secure boys.
1.
Cross-tabulate categorical independent variables with a dependent categorical variable, high waist circumference.
Outcome (counts, unweighted)
Independent variables                  ≤ recommended waist circumference    > recommended waist circumference
Exposure: Food Assistance
  Yes                                  289                                  60
  No                                   74                                   16
Potential Confounding Variable: Health Insurance (hinsur)
  Any private                          82                                   17
  Public only                          213                                  41
  Other                                68                                   18
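As a rough illustration of how small cells translate into imprecise odds ratios, the sketch below takes the unweighted counts from the food assistance rows of the cross-tabulation above and computes a crude odds ratio with a Woolf-type (log-based) 95% confidence interval. This is not the analysis reported by Kohn et al.: it ignores the survey design and all covariate adjustment, and the variable names are chosen here only for illustration.

```r
# Unweighted counts for food secure boys, from the food assistance rows above
a  <- 60   # receives food assistance, high waist circumference
b  <- 289  # receives food assistance, at or below recommended waist circumference
c0 <- 16   # no food assistance (referent), high waist circumference
d  <- 74   # no food assistance (referent), at or below recommended

# Crude odds ratio with "no food assistance" as the referent
or <- (a * d) / (b * c0)

# Woolf standard error of log(OR): the smallest cell contributes the largest term
se_log_or <- sqrt(1 / a + 1 / b + 1 / c0 + 1 / d)

# 95% confidence interval on the odds ratio scale
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)
round(c(or = or, lower = ci[1], upper = ci[2]), 2)
```

Of the four reciprocal terms in the standard error, 1/16 is by far the largest, so the cell with 16 boys drives most of the width of the interval, which spans roughly 0.5 to 1.8 here.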
a. Are the cell counts for this categorical outcome variable of sufficient size for analysis? Do these cross-tabulations raise any concerns for logistic regression analysis with regard to statistical precision of odds ratio estimates?
The cross-tabulation demonstrates that some cell counts for the category of waist circumference larger than recommended (high waist circumference) may not be of sufficient size for analysis. Among food secure boys not receiving food assistance and among those with ‘Other’ health insurance type, fewer than 20 have high waist circumference. These counts raise concern about the statistical precision of the associations involving this outcome variable.
2.
Calculate 10th and 90th percentile values of continuous independent variables within each category of the dependent variable (high waist circumference).
Outcome (counts, unweighted)
                         ≤ recommended waist circumference      > recommended waist circumference
Independent variables    10th percentile    90th percentile     10th percentile    90th percentile
Age (ridageyr)           4                  15                  5                  14
Stata users: Do not include aweights in the percentile calculations
a. Do the distributions of age in the two categories of waist circumference raise any concerns for regression analysis?
No, the distributions of age in the two categories of waist circumference do not raise any concerns for the regression analysis. The values of age at the 10th and 90th percentiles are similar between the two categories of the outcome variable.
3.
Cross-tabulate example categorical independent variables with the exposure variable, food assistance.
Exposure (counts, unweighted)
Independent categorical variables    Does not receive food assistance    Receives food assistance
Federal Poverty Level (fpl_2cat)
  0-100%                             15                                  194
  101-200%                           75                                  155
Health Insurance (hinsur)
  Any private                        39                                  60
  Public only                        22                                  232
  Other                              29                                  57
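The cell-size screening applied to this table can be sketched in a few lines of base R. The counts are copied from the table above; the helper `flag_cells()` is a hypothetical function written for this illustration, and the thresholds of 20 and 30 are the rules of thumb from the E7 video and the NHANES Analytic Guidelines discussed in Part I.

```r
# Unweighted counts from the exposure cross-tabulation above
counts <- matrix(
  c(15, 194,
    75, 155,
    39,  60,
    22, 232,
    29,  57),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("FPL 0-100%", "FPL 101-200%", "Any private", "Public only", "Other"),
    c("No food assistance", "Food assistance")
  )
)

# Hypothetical helper: list every cell whose count falls below a threshold
flag_cells <- function(tab, threshold) {
  idx <- which(tab < threshold, arr.ind = TRUE)
  data.frame(row = rownames(tab)[idx[, "row"]],
             column = colnames(tab)[idx[, "col"]],
             count = tab[idx],
             row.names = NULL)
}

flag_cells(counts, 20)  # cells below the video's "very general" rule of thumb
flag_cells(counts, 30)  # cells below the NHANES denominator guideline
```

All flagged cells fall in the "No food assistance" column, which is the referent category for the odds ratios, so sparse data there affects every comparison.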
a. Do these cross-tabulations raise any concerns for regression analysis?
The cross-tabulation demonstrates that most of the cell counts are of sufficient size for analysis. However, some cell sizes may not be sufficient for analysis. Among the boys who do not receive food assistance, there are fewer than 20 in the lowest FPL category and fewer than 30 with public or other health insurance.
4.
Calculate 10th and 90th percentile values of age as a continuous independent variable among each category of the exposure variable, food assistance.
Exposure
                         Does not receive food assistance       Receives food assistance
                         10th percentile    90th percentile     10th percentile    90th percentile
Age                      4                  15                  4                  14
Stata users: Do not include aweights in the percentile calculations
a. Do the distributions of age in the two categories of exposure raise any concerns for regression analysis?
No, the distributions of age in the two categories of the exposure variable do not raise any concerns for the regression analysis. The values of age at the 10th and 90th percentiles are nearly identical between the two food assistance categories.
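For R users, unweighted 10th and 90th percentiles within each exposure category can be obtained with base R alone. The sketch below uses simulated stand-in data (the `age` and `food_assist` vectors are generated here and are not NHANES values), and `type = 2` is chosen on the assumption that it is the quantile definition closest to the Stata default; other definitions can differ slightly for small groups.

```r
set.seed(7)

# Simulated stand-in data: ages 4 to 17 and an exposure indicator (not NHANES values)
age <- sample(4:17, size = 439, replace = TRUE)
food_assist <- sample(c("No", "Yes"), size = 439, replace = TRUE,
                      prob = c(0.2, 0.8))

# Unweighted 10th and 90th percentiles of age within each exposure category
pct <- t(sapply(split(age, food_assist), quantile,
                probs = c(0.10, 0.90), type = 2))
pct
```

Each row of `pct` is one exposure category, so the two rows can be compared directly to judge whether the age ranges overlap.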
APPENDIX
Part 1. Question 2.a and 2.b
Stata Sample code
*count, min/max values are not sensitive to survey design effects
su bmiz if subpop2==1 & male==1 & foodinsec==0
*mean, SE requires svy function
svy,subpop(if subpop2==1 & male==1 & foodinsec==0): mean bmiz
*percentiles: can use aweight for weighted percentiles
*http://www.stata.com/support/faqs/statistics/percentiles-for-survey-data/
tabstat bmiz [aweight=wtmec2yr] if subpop2==1 & male==1 ///
    & foodinsec==0, stat(p50 p10 p90) col(stat)
*skewness, kurtosis
*does not accept pweight; this code provides approximate skewness and kurtosis for exploratory purposes
sktest bmiz [aweight=wtmec2yr] if subpop2==1 & male==1 & foodinsec==0
*these are not weighted
hist bmiz if subpop2==1 & male==1 & foodinsec==0,norm
graph box bmiz if subpop2==1 & male==1 & foodinsec==0
SAS Sample Code (also given in E7_DescriptiveAnalyses_VideoCode.sas)
/* SAS Specific Notes:
- Like in exercise 6, we will not be able to specify which output we want in SAS survey procedures. Make sure you select the results that correspond to your variables of interest. */
/* Using survey procedure to get the number of observations, min + max, mean, SE, and percentiles in one step */
proc surveymeans data = ex7 nobs min mean max percentile=(50, 10, 90); *specify which output you need from the survey procedure;
    weight wtmec2yr;
    cluster sdmvpsu;
    strata sdmvstra;
    domain subpop2*male*foodinsec; *As in exercise 6, include all the domain variables here in surveymeans;
    var bmiz; *Continuous variable of interest;
run;
/* Using proc univariate to look at normality */
/* SAS Specific Notes:
- The skewness/kurtosis output represents the values, not the test values
- Instead of sktest, SAS users can use 4 tests for non-normality: https://www.stat.purdue.edu/~tqin/system101/method/QQplot_sas.htm
- In this analysis, where we have < 2,000 values, use the Shapiro-Wilk test */
/* Note that this is not a survey procedure, means will not be accurate */
proc univariate data = ex7 normal; *specify that we are testing for normal distribution;
    var bmiz;
    qqplot bmiz / normal(mu=est sigma=est color=red l=1); *code to test for normality and to generate qqplots (not shown in video);
    histogram / normal; *code to create histogram;
    where (subpop2 = 1 and male = 1 and foodinsec = 0); *male = 1 selects boys, matching the Stata code above (the video code used male = 0 for girls);
    *weight wtmec2yr;
    *commented out because no weight variables when looking at normality tests in SAS;
run;
Refer to the E7 VideoCode file for creation of the box plot. There are two options that you can try: proc boxplot and proc sgplot.
R Sample code
2a.
**N, min, and max do not need to be survey weighted**
```{r}
library(pastecs)
# all you need here is nbr.val, min, & max
# (bare variable names assume the data frame has been attached)
round(stat.desc(bmiz[subpop2==1 & male=='boys' & foodinsec==0], desc=FALSE), 2)
```
```{r}
# mean and SE from the survey design object
svymean(~bmiz, design=subset(subpop, male=="boys" & foodinsec==0))
#getting the median (choosing quantiles = 0.5 for median only)
svyquantile(~bmiz, design=subset(subpop, male=="boys" & foodinsec==0), quantiles=0.5)
#skew and kurtosis
##I do not know how to apply survey weights to skew or kurtosis in R for p-value, this is the closest I can figure out
library(DescTools)
Skew(bmiz[subpop2==1 & male=='boys' & foodinsec==0], weights = wtmec2yr[subpop2==1 & male=='boys' & foodinsec==0])
Kurt(bmiz[subpop2==1 & male=='boys' & foodinsec==0], weights = wtmec2yr[subpop2==1 & male=='boys' & foodinsec==0])
```
2b. Create a histogram and box plot. In R, you can use the survey weights to do this. R outputs will look slightly different from Stata outputs because of this.
```{r}
# survey-weighted histogram and box plot for food secure boys
svyhist(~bmiz, design=subset(subpop, male=="boys" & foodinsec==0), main="BMI, Survey Weighted", col="blue")
svyboxplot(bmiz~1, design=subset(subpop, male=="boys" & foodinsec==0), main="BMI, Survey Weighted", all.outliers=TRUE)
```