GPHP2000 Stata Lab: Exploratory Data Analysis

An ungraded lab for learning Stata commands

In this lab we will review commands that:

1. Help you explore your dataset and understand the variables: codebook, describe, list
2. Help you review variables: tabulate (tab), table
3. Help you manage the dataset: label, generate (gen), replace
4. Help you visualize the data: histogram, graph bar, graph box

Learning Objectives: At the end of this lab, students will be able to:

1. Navigate Stata’s user interface
2. Open a dataset and view information about the dataset in Stata
3. Generate and label new variables in Stata
4. Generate histograms, bar charts, and boxplots in Stata

For this lab, we will be using Stata with the `dds.discr` dataset. This dataset contains variables describing consumers of the California Department of Developmental Services (DDS), with a notable focus on expenditures across different ethnic groups. This is the dataset analyzed in the case study for Module 1, and you should have the case study open to follow along.
Note: This lab was created using Stata SE (Standard Edition) version 18. If you already have an earlier version of Stata installed, it is not necessary to upgrade. However, this version of Stata is available from software.brown.edu, and it does have prettier graphics.

Part 1: Understanding the Dataset

Load the dataset into the Stata workspace.

```
use "dds.discr.dta", clear
```

Depending on where you have saved the data file, you may need to specify the full file path for your data file.

```
use "C:\Users\cryst\Documents\Teaching\GPHP2000\Data\dds.discr.dta"
```

Describing the Variables

Understand the structure, variables, and types of data present in your dataset.

```
describe
```

From the data description, you can see how each variable is named in the Stata workspace (variable name), the data type of the variable (storage type), any labels on the variable values (value label; e.g., category names for a categorical variable), and a human-readable description of the variable (variable label) if one has been provided. When looking at the storage type, `str` indicates a string (non-numeric) variable. For numeric variables, there are different storage types depending on the type of number (e.g., `int/float/long/double`).

This dataset does not have any variable labels. Suppose we want to label the `expenditures` variable as "Annual expenditure per individual (USD)".

```
label variable expenditures "Annual expenditure per individual (USD)"
```

Describe your data again to see this updated label. While the change has been made in the Stata workspace, you need to explicitly save your data if you want the changes written to file.

Viewing the Initial Rows of the Data

It is always a good idea to get a sense of what the data in the workspace looks like. Observe the first five rows of the dataset to get an initial sense of the data.
```
list in 1/5
```

Now you can see the numeric values of variables like `age` and `expenditures`, and the categories of categorical variables like `agecohort` and `gender`.

Creating a Categorical Variable

Every statistical software package has rules about what counts as a valid variable name. Stata variable names cannot contain a ‘.’, so the `age.cohort` variable that came with this dataset appears in the workspace as `agecohort`, and you might run into problems using its value labels. As an exercise, let’s delete the `agecohort` variable from the workspace. We will then go through the steps of creating a new categorical variable from the `age` variable, categorizing the continuous age data into discrete, named groups. We will use the age categories used in the case study.

Let’s start by getting some summary statistics for the continuous `age` variable.

```
codebook age
```

You can see that ages range from 0-95, with an average age of 22.8 years. And fortunately, there are no missing values!

Now, let’s delete the existing `agecohort` variable and create a new variable we will call `age_cohort`.

```
drop agecohort
* Start with all values missing, then fill in one cohort at a time
generate age_cohort = .
replace age_cohort = 1 if age >= 0 & age <= 5
replace age_cohort = 2 if age >= 6 & age <= 12
replace age_cohort = 3 if age >= 13 & age <= 17
replace age_cohort = 4 if age >= 18 & age <= 21
replace age_cohort = 5 if age >= 22 & age <= 50
replace age_cohort = 6 if age > 50
```

Let’s look at the variable we have created.

```
tab age_cohort
```

At this point, the age categories are just identified as 1-6. We want to label these categories to make it clear this is an ordinal categorical variable. We will first define a value label named `agelabel`, and then we will assign it to our categorical variable `age_cohort`.

```
label define agelabel 1 "0-5" 2 "6-12" 3 "13-17" 4 "18-21" 5 "22-50" 6 "51+"
label values age_cohort agelabel
```

You can now repeat the steps of using `describe`, `codebook` and `tab` to see the changes we’ve made. Remember, you must save the data if you want these changes written to file.

Part 2: Univariate Analysis

Analyzing Expenditures

Following along with the case study, let’s look at the variable `expenditures`. Use the `codebook` command to get the 25th, 50th, and 75th percentiles of the variable and verify you see the same numbers reported in the case study.

Now let’s create a histogram for expenditure values. You are welcome to explore the Graphics dropdown menu in Stata, but programmatic plotting is used here.

```
histogram expenditures, frequency
```

Because we added a label to the `expenditures` variable earlier, we get an informative axis title on our plot by default. Plots in Stata can be saved to file or copied and pasted into a document.

Analyzing Age

Generate a histogram for `age` and a bar graph for `age_cohort`. We have not made variable labels for these variables, so I am going to add titles to the plots manually.

```
histogram age, frequency title("Age (years)")
graph bar (count), over(age_cohort) title("Age (cohort)")
```

Analyzing Ethnicity

Generate a bar graph for `ethnicity`. The ethnicity labels are long, so I am going to rotate them 45 degrees to fit them on the plot better.

```
graph bar (count), over(ethnicity, label(angle(45)))
```
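The reminders above that data changes and plots must be saved explicitly can be sketched as follows. This is a minimal sketch rather than part of the original lab: the file names `dds_discr_labeled.dta` and `ethnicity_bar.png` are hypothetical, and both files are written to Stata's current working directory unless you give a full path.

```stata
* Save the modified dataset under a new (hypothetical) name,
* so the original data file is left untouched
save "dds_discr_labeled.dta", replace

* Export the most recently drawn graph to a PNG file (hypothetical name)
graph export "ethnicity_bar.png", replace
```

Saving under a new name is a cautious choice: it keeps the original file available if you want to rerun the lab from the start.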
Analyzing Gender

Use what you have learned thus far to explore the variable `gender`.

Part 3: Bivariate Analysis

Expenditures and Age

Create boxplots to explore how expenditures change with age.

```
graph box expenditures, over(age_cohort) title("Expenditures by Age Cohort")
```

REFLECTIONS: Interpret the boxplots: discuss the variation, median, and any potential outliers across the age cohorts. Why do you think older individuals tend to have higher expenditures?

Expenditures and Ethnicity

Explore the relationship between `expenditures` and `ethnicity`. First, create a table that has mean expenditures for each ethnicity.

```
table ethnicity, statistic(mean expenditures)
```

Now, use what you have learned earlier to create side-by-side boxplots of expenditures by ethnicity.

REFLECTIONS: Analyze the boxplot: do certain ethnic groups receive noticeably different median expenditures? Considering the sample size of each ethnic group, do these findings convincingly suggest disparities, or should the sample size influence our interpretation?

Expenditures and Gender

Use what you have learned to explore the relationship between expenditures and gender.
REFLECTIONS: Analyze the boxplot: is there a notable difference in expenditures between genders? Is gender a considerable factor in expenditure allocation?

Part 4: Investigating Potential Discrimination

Comparing Expenditures between Major Ethnic Groups

Given that Hispanic and White non-Hispanic are the predominant groups, let's narrow our analysis to these two. We are going to create a flag variable named `hisp_white` that will have the value 1 if ethnicity is `Hispanic` or `White not Hispanic` and 0 otherwise. After creating this new variable, try listing the first few rows of the dataset again. We will then filter to the rows where the flag is 1 and create side-by-side boxplots for just these two ethnicities.

```
* Stata's logical OR is a single |
gen hisp_white = (ethnicity == "Hispanic" | ethnicity == "White not Hispanic")
graph box expenditures if hisp_white == 1, over(ethnicity) title("Expenditures: Hispanic vs White Non-Hispanic")
```

REFLECTIONS: Examine the boxplot: is there an observable difference in expenditures between Hispanic and White non-Hispanic groups? Does this visual evidence support the initial claim of expenditure disparity between these groups?

Age Distribution Among Hispanic and White Non-Hispanic Consumers

Try the following commands. What do they reproduce from the case study?

```
graph bar (count) if hisp_white==1, over(age_cohort) over(ethnicity) title("Age Distribution by Ethnicity") ytitle("Frequency")
tabulate age_cohort ethnicity if hisp_white==1, col freq
```

Examine Expenditures Across Age Cohorts and Ethnicity

Calculate mean expenditures across age cohorts and ethnicity. Here I have formatted the averages to have no decimal places using the `nformat` option.
```
table age_cohort ethnicity if hisp_white==1, statistic(mean expenditures) nformat(%9.0f)
```
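As a possible follow-up (a sketch, not part of the original lab), it may help to view the overall means for the same two groups next to the age-stratified table. The command below reuses only commands and options already shown in this lab, and assumes the `hisp_white` flag created in Part 4 is still in the workspace.

```stata
* Overall mean expenditures for the two groups, ignoring age cohort
table ethnicity if hisp_white==1, statistic(mean expenditures) nformat(%9.0f)
```

Comparing this output with the table by age cohort is one way to revisit the reflection questions in Part 4.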