E7_Describe_answers_v3

School: Tulane University
Subject: Mathematics
Date: Apr 3, 2024
Uploaded by: vivianecahen

E.7. Descriptive analysis of analytic population

Overview: In this exercise, we will conduct descriptive analysis of study variables within the analytic sample, and within strata, for data quality. Descriptive analysis will provide information about extreme or outlying values, distributions of variables, and cell counts. The video clip demonstrated this process for food secure girls. In this exercise, we will apply this process to food secure boys.

Objectives:
• Learn and practice how to calculate descriptive statistics for continuous and categorical (or binary) variables.
• Learn and practice how to ascertain adequacy of cell size (counts) and variability needed to assess the association of interest.

In your course project, you will apply these approaches for all the variables in your analysis.

This exercise will use the full merged data file, NHANES0708_all, for Stata or for SAS.

Before completing this exercise:
• View the following video clip: E7_Describe
• Read the data cleaning paper by Van den Broeck et al.
  o See the Weekly Course Materials page for the link.
• Refer to the Kohn et al. article for certain questions below.
• Optional: skim the glossary of statistical data editing.

Survey versus standard analysis. At this stage, our primary goal is to become familiar with our data, so that we can assess feasibility and make preliminary decisions about how the study measures will be characterized and what type of regression model is most appropriate. The results of this type of analysis often appear as a preliminary data table in a grant or a report to the research team. Therefore, we use survey commands when possible. For tabulation at this stage, the primary goal is to assess unweighted cell sizes; weighted percentages will be calculated in the next exercises and can be reported in your Table 1. However, keep in mind that the most appropriate approach may vary depending on the research context.

Part I. Descriptive Assessments: Univariable Analyses

1. Examine exposure: food assistance

a. Among food secure and food insecure boys, how many receive food assistance? How many do not receive food assistance?

                                 Food security (counts, unweighted)
Exposure                         Food secure   Food insecure
Receive food assistance          349           216
Do not receive food assistance   90            22

b. The E7 video indicates that a “very general” rule of thumb for minimum cell size is 20 observations. As explained in the NHANES Analytic Guidelines,¹ a denominator of at least 30 participants (unweighted) is needed in order to produce statistically reliable estimates of proportions (expressed as a percentage). Do these recommendations about minimum cell size seem generally compatible to you? Please provide a short explanation of your reasoning.

These recommendations are generally compatible. Both the video narrator and the NHANES documentation point out that minimum cell sizes are approximate and apply under some, but not all, conditions. Both sources of guidance emphasize that larger cell sizes may be needed depending on other conditions, such as when a variable is measured with error, when there are several confounders of an association for which adjustment is needed, or when an outcome is rare (or very common).

c. Recall that Kohn et al. estimated percentages of boys and girls with high waist circumference or who were overweight or obese according to food assistance status (Table 1). Referring to the counts in question 1.a above, which cells seem to have sufficiently large counts to reliably estimate prevalence of the different weight status variables (e.g., high waist circumference) and which may not?

The cells corresponding to food secure boys receiving food assistance, food secure boys not receiving food assistance, and food insecure boys receiving food assistance seem to have large enough counts for reliable estimation of prevalence of the body size measures. However, the cell for food insecure boys who do not receive food assistance has only 22 participants, which may not be large enough for producing statistically reliable estimates.

d. The next three questions develop our thinking about cell sizes within strata and the impact on the statistical precision for measures of association, a key point emphasized in the E.7 video.

i. Recall that the authors of the Kohn et al. article estimated odds ratios as the measure of association between food assistance status and categorical body size measures. Referring to the table for question 1.a: which exposure category of food assistance status would be used as the referent for the odds ratio?

The category “do not receive food assistance” would be used as the referent for the odds ratio.

ii. From the table for question 1.a: what is the cell size (count of observations) in the category of food assistance status used as the referent for the odds ratio in the stratum of food insecure boys?

The cell size is 22.

iii. Now refer to page 158, paragraph 2 in the right column of Kohn et al. The authors report that “Among low income, food-insecure youth, food assistance participation was not associated with … high waist circumference or categorical weight status for any specification of food assistance in the fully adjusted models… (data not shown).” Based on your answers to parts d.i. and d.ii. above and guidance offered in the E.7 video, what is a plausible explanation for the lack of association between food assistance participation and high waist circumference or the categorical body size measures among food insecure boys?

A plausible explanation for the lack of association between food assistance participation and the categorical body size measures among food insecure boys is that the sample size was too small to produce statistically reliable (precise) odds ratio estimates. There were only 22 observations in the referent category, and just a fraction of these would have had the body size outcome of interest.

¹ As reference, see the NHANES: Analytic Guidelines, 2011-2014 and 2015-2016, Section 3.3.1, page 35. Note that this publication recommends that an even larger denominator is needed for reliable estimates of proportions that are extreme, either very rare or very common.

For the remaining items in Part I below, limit the analyses to food secure boys.

2. Examine outcome: BMI z-score

a. Calculate the following descriptive statistics. Please be aware of the footnotes to the table below. When using decimals, report them to the nearest hundredth.

Number                    439
Mean*                     0.60
SE*                       0.09
Min                       -3.39
Max                       3.33
Median**                  0.56
Skewness (p-value)**,†    0.07
Kurtosis (p-value)**,†    0.02

*From Stata svy command; SAS users use design variables as you learned in E.6.
**Stata users: Include aweight
†SAS users: skip skewness and kurtosis
R users: may be able to get Skewness and Kurtosis coefficients but not p-values. Peer-reviewers: please do not mark this problem wrong if the student is using R.

b. Create a histogram and box plot. Paste below. Reduce the size to about 2”x2” if possible. In your project analysis, you should also use other normality plots (e.g., Q-Q) as appropriate for your study. Survey commands for these plots are not available, so they are run as described in your Biostatistics courses. R users can use svyhist() and svyboxplot() to include survey weights in this question. If so, their plots will look slightly different from the Stata plots.

[Figure: histogram (density) and box plot of bmiz]

c. Are there any extreme values? If yes, do they appear to be of concern? That is, might the values be erroneous, or are they plausible? (Hint: check the component height and weight values using variables in the dataset.)

There are extreme values (those below -3.0), but they are biologically plausible. Inspection of the component height and weight values for these observations does not reveal any clearly incorrect data.

d. What do you conclude about the distribution of this outcome variable? Does it seem appropriate to model the BMI z-score as is for linear regression analysis, or would it be advisable to perform some type of transformation?

The distribution is not perfectly normal, but it does not have notable skew. All values are contiguous (no discontinuities). Thus, modeling the BMI z-score without transformation seems appropriate.

3. Examine distributions of a few other independent variables.

a. Age (ridageyr): calculate descriptive statistics. Please be aware of the footnotes to the table. When using decimals, report them to the nearest hundredth.

Number      439
Mean*       9.58
SE*         0.27
Min         4
Max         17
Median**    9
Skewness    0.07
Kurtosis    <0.00

*From Stata svy command; SAS users use design variables as you learned in E.6.
**Stata users: Include aweight
†SAS users: skip skewness and kurtosis
R users: may be able to get Skewness and Kurtosis coefficients but not p-values. Peer-reviewers: please do not mark this problem wrong if the student is using R.

b. Create a histogram and box plot. Paste below. Reduce the size to about 2”x2” if possible.

[Figure: histogram (density) and box plot of Age at Screening Adjudicated - Recode]

i. What do you conclude about the distribution of the age variable? Are there any concerns about using this variable in regression analysis?

The distribution of age departs somewhat from normality. However, there are no discontinuities in the distribution or any evidence of sparse data. There do not appear to be issues of concern for including age in regression analyses.

c. Categorical: calculate frequencies.

                             Count (unweighted)
FPL (fpl_2cat)
  0-100%                     209
  101-200%                   230
Health Insurance (hinsur)
  Any private                99
  Public only                254
  Other                      86

i. Do the cell counts for these variables seem to be of sufficient size for analysis?

Yes, the cell counts for both of these variables are of sufficient size for analysis.

Part II: Descriptive Assessments: Bivariable Analyses

For the questions in this section, limit the analyses to food secure boys.

1. Cross-tabulate categorical independent variables with a dependent categorical variable, high waist circumference.

                               Outcome (counts, unweighted)
Independent variables          ≤ recommended waist circumference   > recommended waist circumference
Exposure: Food Assistance
  Yes                          289                                 60
  No                           74                                  16
Potential Confounding Variable: Health Insurance (hinsur)
  Any private                  82                                  17
  Public only                  213                                 41
  Other                        68                                  18

a. Are the cell counts for this categorical outcome variable of sufficient size for analysis? Do these cross-tabulations raise any concerns for logistic regression analysis with regard to statistical precision of odds ratio estimates?

The cross-tabulation demonstrates that some cell counts for the category of waist circumference larger than recommended (high waist circumference) may not be of sufficient size for analysis. Among food secure boys not receiving food assistance and among those with ‘Other’ health insurance type, fewer than 20 have high waist circumference. These counts raise concern about the statistical precision of the associations involving this outcome variable.

2. Calculate 10th and 90th percentile values of continuous independent variables within each category of the dependent variable (high waist circumference).

                         Outcome (counts, unweighted)
                         ≤ recommended waist circumference    > recommended waist circumference
Independent variables    10th percentile   90th percentile    10th percentile   90th percentile
Age (ridageyr)           4                 15                 5                 14

Stata users: Do not include aweights in the percentile calculations

a. Do the distributions of age in the two categories of waist circumference raise any concerns for regression analysis?

No, the distributions of age in the two categories of waist circumference do not raise any concerns for the regression analysis. The values of age at the 10th and 90th percentiles are similar in the two categories of the outcome variable.

3. Cross-tabulate example categorical independent variables with the exposure variable, food assistance.

                                     Exposure (counts, unweighted)
Independent categorical variables    Does not receive food assistance   Receives food assistance
Federal Poverty Level (fpl_2cat)
  0-100%                             15                                 194
  101-200%                           75                                 155
Health Insurance (hinsur)
  Any private                        39                                 60
  Public only                        22                                 232
  Other                              29                                 57

a. Do these cross-tabulations raise any concerns for regression analysis?

The cross-tabulation demonstrates that most of the cell counts are of sufficient size for analysis. However, some cell sizes may not be sufficient for analysis. Among the boys who do not receive food assistance, there are fewer than 20 in the lowest FPL category and fewer than 30 with public or other health insurance.

4. Calculate 10th and 90th percentile values of age as a continuous independent variable among each category of the exposure variable, food assistance.

             Exposure
             Does not receive food assistance     Receives food assistance
             10th percentile   90th percentile    10th percentile   90th percentile
Age          4                 15                 4                 14

Stata users: Do not include aweights in the percentile calculations

a. Do the distributions of age in the two categories of exposure raise any concerns for regression analysis?

No, the distributions of age in the two categories of the exposure variable do not raise any concerns for the regression analysis. The values of age at the 10th and 90th percentiles are nearly identical in the two food assistance categories.

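The precision concern raised in Part I, question 1.d and Part II, question 1.a can be made concrete with a small calculation. The following is a minimal Python sketch (not part of the course code, which uses Stata, SAS, or R) that computes a crude odds ratio and a Wald 95% confidence interval from the unweighted counts for food secure boys in Part II, question 1. It assumes simple random sampling: it ignores the NHANES weights, strata, PSUs, and all confounders, so it is not the fully adjusted model reported by Kohn et al. The function name crude_odds_ratio is hypothetical.

```python
import math
from statistics import NormalDist


def crude_odds_ratio(exp_cases, exp_noncases, ref_cases, ref_noncases, level=0.95):
    """Crude odds ratio with a Wald (Woolf) confidence interval.

    Assumption: simple random sampling; survey weights and design are ignored.
    """
    odds_ratio = (exp_cases * ref_noncases) / (exp_noncases * ref_cases)
    # Standard error of log(OR): each cell contributes 1/count, so the
    # smallest cells dominate the width of the interval.
    se_log_or = math.sqrt(
        1 / exp_cases + 1 / exp_noncases + 1 / ref_cases + 1 / ref_noncases
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Unweighted counts for food secure boys (Part II, question 1):
# food assistance yes: 60 with high waist circumference, 289 without;
# food assistance no (referent): 16 with high waist circumference, 74 without.
or_hat, ci_low, ci_high = crude_odds_ratio(60, 289, 16, 74)
print(f"OR = {or_hat:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

With these counts the interval is wide (roughly 0.5 to 1.8), and the 1/16 term from the smallest cell accounts for most of the variance of the log odds ratio. A stratum with only 22 boys in the referent category would be less precise still.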
APPENDIX

Part 1. Question 2.a and 2.b

Stata Sample code

*count, min/max values are not sensitive to survey design effects
su bmiz if subpop2==1 & male==1 & foodinsec==0

*mean, SE requires svy function
svy, subpop(if subpop2==1 & male==1 & foodinsec==0): mean bmiz

*percentiles: can use aweight for weighted percentiles
*http://www.stata.com/support/faqs/statistics/percentiles-for-survey-data/
tabstat bmiz [aweight=wtmec2yr] if subpop2==1 & male==1 ///
    & foodinsec==0, stat(p50 p10 p90) col(stat)

*skewness, kurtosis
*does not accept pweight; this code provides approximate skewness and kurtosis for exploratory purposes
sktest bmiz [aweight=wtmec2yr] if subpop2==1 & male==1 & foodinsec==0

*these are not weighted
hist bmiz if subpop2==1 & male==1 & foodinsec==0, norm
graph box bmiz if subpop2==1 & male==1 & foodinsec==0

SAS Sample Code (also given in E7_DescriptiveAnalyses_VideoCode.sas)

/* SAS Specific Notes:
   - Like in exercise 6, we will not be able to specify which output we want in SAS survey procedures.
     Make sure you select the results that correspond to your variables of interest. */

/* Using survey procedure to get the number of observations, min + max, mean, SE, and percentiles in one step */
proc surveymeans data=ex7 nobs min mean stderr max percentile=(50, 10, 90); *specify which output you need from the survey procedure;
    weight wtmec2yr;
    cluster sdmvpsu;
    strata sdmvstra;
    domain subpop2*male*foodinsec; *As in exercise 6, include all the domain variables here in surveymeans;
    var bmiz; *Continuous variable of interest;
run;

/* Using proc univariate to look at normality */
/* SAS Specific Notes:
   - The skewness/kurtosis output represents the values, not the test values
   - Instead of sktest, SAS users can use four tests for non-normality:
     https://www.stat.purdue.edu/~tqin/system101/method/QQplot_sas.htm
   - In this analysis, where we have < 2,000 values, use the Shapiro-Wilk test */
/* Note that this is not a survey procedure, means will not be accurate */
proc univariate data=ex7 normal; *specify that we are testing for normal distribution;
    var bmiz;
    qqplot bmiz / normal(mu=est sigma=est color=red l=1); *code to test for normality and to generate qqplots (not shown in video);
    histogram / normal; *code to create histogram;
    where (subpop2 = 1 and male = 1 and foodinsec = 0); *male = 1 selects boys, matching the Stata code above;
    *weight wtmec2yr; *commented out because no weight variables when looking at normality tests in SAS;
run;

Refer to the E7 VideoCode file for creation of the box plot. There are two options that you can try: proc boxplot and proc sgplot.

R Sample code

2a. **N, min, and max do not need to be survey weighted**
```{r}
library(pastecs)
round(stat.desc(bmiz[subpop2==1 & male=='boys' & foodinsec==0], desc=FALSE), 2)
# all you need here is nbr.val, min, & max
```
```{r}
library(survey)
svymean(~bmiz, design=subset(subpop, male=="boys" & foodinsec==0))
#getting the median (choosing quantile = 0.5 for median only)
svyquantile(~bmiz, design=subset(subpop, male=="boys" & foodinsec==0), quantiles=0.5)
#skew and kurtosis
##I do not know how to apply survey weights to skew or kurtosis in R for p-value, this is the closest I can figure out
library(DescTools)
Skew(bmiz[subpop2==1 & male=='boys' & foodinsec==0],
     weights = wtmec2yr[subpop2==1 & male=='boys' & foodinsec==0])
Kurt(bmiz[subpop2==1 & male=='boys' & foodinsec==0],
     weights = wtmec2yr[subpop2==1 & male=='boys' & foodinsec==0])
```
2b. Create a histogram and box plot. In R, you can use the survey weights to do this. R outputs will look slightly different from Stata because of this.
```{r}
svyhist(~bmiz, design=subset(subpop, male=="boys" & foodinsec==0),
        main="BMI, Survey Weighted", col="blue")
svyboxplot(bmiz~1, design=subset(subpop, male=="boys" & foodinsec==0),
           main="BMI, Survey Weighted", all.outliers=TRUE)
```
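
The sample code above relies on sktest (Stata) and Skew()/Kurt() (R) for approximate weighted skewness and kurtosis. As a rough illustration of what a weighted, moment-based version of these coefficients looks like, here is a minimal Python sketch. It is an assumption-laden simplification: the weights are treated like analytic weights, the survey design (strata and PSUs) is ignored, no p-values are produced, and the results need not match the small-sample corrections that Stata or DescTools may apply. The function name weighted_skew_kurtosis and the toy values are hypothetical, not NHANES data.

```python
def weighted_skew_kurtosis(values, weights):
    """Moment-based skewness and kurtosis using normalized weights.

    Assumption: weights act as analytic weights; the survey design
    (strata, PSUs) is ignored and no p-values are produced.
    """
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total

    def central_moment(k):
        # Weighted k-th central moment about the weighted mean.
        return sum(w * (x - mean) ** k for x, w in zip(values, weights)) / total

    m2 = central_moment(2)
    skewness = central_moment(3) / m2 ** 1.5
    kurtosis = central_moment(4) / m2 ** 2  # equals 3 for a normal distribution
    return skewness, kurtosis


# Toy, made-up values (not NHANES data), symmetric around 0.
toy_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
toy_weights = [1.0, 2.0, 4.0, 2.0, 1.0]
skew, kurt = weighted_skew_kurtosis(toy_values, toy_weights)
print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
```

Because the toy values and weights are symmetric, the skewness is zero. In practice the inputs would be the bmiz values and wtmec2yr weights for the stratum of interest.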