Homework Week 2 ANSWERS_SS UPDATED_20230124

Applied Data Analysis
Week 2 Homework: Preparing a Dataset
Due January 30, 2019 by 5pm

Instructions: Complete the following questions for the dataset labeled “HINTS for Homework 2 ORIGINAL” in Canvas. Be sure to create a do file to keep track of your coding.

1. What is the first step to data cleaning (0.5 pts)?
   Open the raw Excel data.
   a. Did you encounter any problems with Step 1 (0.5 pts)?
      No, the data look good, so we can move on.

2. Open HINTS dataset within Stata.
   a. What is the study design? (0.5 pts)
      Cross-sectional
   b. Given the study design, what is the unit of analysis? (0.5 pts)
      The individual participant (person_id)

3. Check for duplicate unit of analysis.
      sort person_id
      by person_id: generate duplicate = _n
      tab duplicate, m
      list person_id if duplicate == 2
   a. What study participants, if any, have duplicates? (0.33 pts)
      36, 719, 802
   b. Fix the duplicate ID problem. Defend your analytic decisions. (0.33 pts)
      edit if person_id == 36 | person_id == 719 | person_id == 802
      drop if duplicate == 2 & person_id == 36
      drop if duplicate == 2 & person_id == 719
      drop if duplicate == 2 & person_id == 802
      All duplicates were dropped because they appeared to be duplicated records rather than a data entry problem.
   c. What is your new sample size? (0.33 pts)
      3630

4. One key demographic variable in the dataset is age, so it’s important that this variable is entered correctly into the dataset.
   a. What type of variable is age in the dataset (string vs. numeric)? (0.33 pts)
      String
   b. Take the necessary steps to convert to numerical. Show your syntax (0.33 pts)
      gen age_new = age
      replace age_new = "." if age_new == "missing"
      destring age_new, replace
   c. What is the mean and standard deviation of age in this study sample? (0.33 pts)
      Mean = 53.86; SD = 16.5
      sum age_new, detail
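
The duplicate check in question 3 and the age conversion in question 4 can also be sketched with Stata's built-in `duplicates`, `isid`, and `real()` tools. This is only an illustrative alternative, not the graded answer. It assumes the raw HINTS data have just been loaded (before the `duplicate` counter is created), and `age_num` is a hypothetical variable name.

```stata
* Sketch only: assumes the raw HINTS data are in memory and that
* person_id and age exist as in the answers above.

* Summarize and list observations that share a person_id
duplicates report person_id
duplicates list person_id

* Drop only rows that are identical on every variable
duplicates drop

* Confirm person_id now identifies each row uniquely.
* isid stops with an error if any ID is still repeated, which would
* mean the repeated rows differ and need a manual decision.
isid person_id

* Show the age entries that are not numbers (for example "missing")
tabulate age if missing(real(age)), missing

* real() returns missing for any non-numeric string
* (age_num is a hypothetical name)
generate age_num = real(age)
summarize age_num, detail
```

Dropping only fully identical rows puts the "duplicated records rather than a data entry problem" judgment to the test: if `isid` still fails afterwards, the repeated IDs carry different data and should be inspected by hand.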

5. Create a new variable that combines race and ethnicity. (0.5 pts)
      gen race_eth = .
      replace race_eth = 1 if race_cat == "White"
      replace race_eth = 2 if race_cat == "Black"
      replace race_eth = 4 if race_cat == "Other"
      replace race_eth = 3 if hisp_cat2 == "Hispanic"
      label variable race_eth "Race Ethnicity label"
      label define race_eth_label 1 "Non-Hispanic White" 2 "Non-Hispanic Black" 3 "Hispanic" 4 "Non-Hispanic Other"
      label values race_eth race_eth_label
      tab race_eth, m
   a. What is the distribution of this new variable? (0.5 pts)

6. For the Tan et al. paper, one of the main outcomes is “awareness of e-cigarettes”. Create our first main outcome variable from the variable electciglessharm in the dataset. Create the following dichotomous variable: (0.33 pts)
      [0] = “I’ve never heard of electronic cigarettes”
      [1] = “Much less harmful” or “less harmful” or “just as harmful” or “more harmful” or “much more harmful”
      gen aware_ecig = .
      replace aware_ecig = 0 if electciglessharm == "i've never heard of electronic cigarettes"
      replace aware_ecig = 1 if electciglessharm == "just as harmful" | ///
          electciglessharm == "less harmful" | ///
          electciglessharm == "more harmful" | ///
          electciglessharm == "much less harmful" | ///
          electciglessharm == "much more harmful"
   a. How many study participants had never heard of electronic cigarettes before? (0.33 pts)
      878
   b. Label the new variable as you see fit. (0.33 pts)
      label variable aware_ecig "Main outcome-awareness"
      label define aware_ecig1 0 "Not aware" 1 "Aware"
      label values aware_ecig aware_ecig1

7. The second main outcome of Tan et al. is “perceived harmfulness of e-cigarettes”. Create our second main outcome using, again, the variable electciglessharm. Create the following dichotomous variable:
      [0] = “just as harmful” or “more harmful” or “much more harmful”
      [1] = “much less harmful” or “less harmful”
      gen ecigharm = .
      replace ecigharm = 0 if electciglessharm == "just as harmful" | ///
          electciglessharm == "more harmful" | ///
          electciglessharm == "much more harmful"
      replace ecigharm = 1 if electciglessharm == "much less harmful" | ///
          electciglessharm == "less harmful"
      tab ecigharm, m
   a. How many study participants are included in this new variable? (0.33 pts)
      2609
   b. Why doesn’t this new variable include the entire sample of 3630? (0.33 pts)
      Some had missing values, and some had responses that were not applicable to our research question (i.e., those who have never heard of e-cigarettes).
   c. Label the new variable as you see fit. (0.33 pts)
      label variable ecigharm "Second Main outcome-ecig harm"
      label define ecigharm1 0 "More harmful" 1 "Less harmful"
      label values ecigharm ecigharm1
      tab ecigharm, m

8. In Tan et al., the main exposure is “smoking status”. Explore our main exposure variable “smoking status” labeled smokestat in the dataset and create the following categorical variable: (1 pt)
      0 = Never
      1 = Former
      2 = Current
      gen smokestat_new = .
      replace smokestat_new = 0 if smokestat == "never"
      replace smokestat_new = 1 if smokestat == "former"
      replace smokestat_new = 2 if smokestat == "current"
      label variable smokestat_new "Exposure-smoking_status"
      label define smokestat_new1 0 "Never" 1 "Former" 2 "Current"
      label values smokestat_new smokestat_new1
      tab smokestat_new, m
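
The string-to-numeric recoding in question 8 could alternatively be sketched with `encode`, which maps string values to numeric codes through a value label defined in advance. This is a hedged alternative, not the answer key's method. It assumes smokestat holds only the lowercase strings "never", "former", and "current" (plus empty strings); `smoke_lbl` and `smokestat_enc` are hypothetical names.

```stata
* Sketch only: assumes smokestat is a string variable whose non-empty
* values are exactly "never", "former", and "current".

* Define the code-to-text mapping first (smoke_lbl is a hypothetical name)
label define smoke_lbl 0 "never" 1 "former" 2 "current"

* encode reuses the mapping above; noextend makes it stop with an error
* if smokestat contains a value that is not already in smoke_lbl
encode smokestat, generate(smokestat_enc) label(smoke_lbl) noextend

* Cross-check the original strings against the new numeric codes
tabulate smokestat smokestat_enc, missing
```

With `noextend`, an unexpected category (such as a misspelling) halts the command instead of quietly receiving a new code, which the chain of `replace` statements would leave as missing without warning.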

9. There is a sister dataset labeled “HINTS for Homework 2 ORIGINAL DATASET2”. As you will notice, this second dataset has census region for each individual.
   a. Combine your two datasets. (0.5 pts)
      merge 1:1 id using [data location]
   b. Did you append or merge the datasets? And why? (0.5 pts)
      Merge, because the second dataset contains the same IDs with new data (census region) for each individual.

10. Feel free to clean the remainder of the variables in the dataset. We will be using this dataset (and your do file) for the remainder of the course. If you choose to clean now, this will save you time in the long run!

11. Develop a codebook for your clean dataset in a separate Word document. (1 pt)
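
The merge in question 9 can be checked with the `_merge` variable that Stata creates automatically. The sketch below is illustrative only. It assumes the key is named person_id in both files (the answer above writes `id`, so adjust to whichever name the data actually use), and "dataset2.dta" is a placeholder for the second dataset saved in Stata format.

```stata
* Sketch only: assumes the cleaned first dataset is in memory,
* the key is person_id in both files, and "dataset2.dta" is a
* placeholder file name for the second dataset.

* The key must be unique in both files for a 1:1 merge
isid person_id

merge 1:1 person_id using "dataset2.dta"

* _merge: 1 = only in memory, 2 = only in the using file, 3 = matched
tabulate _merge

* Stop with an error if any participant failed to match
assert _merge == 3
drop _merge

* One possible starting point for the codebook in question 11
codebook, compact
```

If the `assert` fails, the `tabulate _merge` output shows whether the unmatched records come from the first or the second dataset, which is worth resolving before the merged file is reused in later weeks.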