RScript (docx, 17 pages)
School: Northeastern University
Course: 6010
Subject: Statistics
Date: Apr 3, 2024
Manish Kumar Yadav
Module 1 – R Practice
ALY 6010: Probability Theory and Introductory Statistics
Dr. A. Narayan
February 29, 2024
Introduction

This report analyzes a dataset of demographic and socioeconomic information about individuals. The attributes include age, work class, education level, marital status, occupation, race, sex, and native country. The objective is to characterize the population represented in the dataset and to explore potential relationships between variables.

The analysis examines the distribution of each attribute, identifies patterns or trends, and investigates relationships between variables. Statistical summaries and data visualizations are used to extract insights that can inform decision-making in fields such as social sciences, economics, and public policy.

Variables of Interest

1. Age: The dataset covers individuals across a wide range of ages, from youth to senior citizens.
2. Education: Educational attainment ranges from below high school to advanced degrees.
3. Occupation: Individuals work in a variety of occupations.
4. Income: The income bracket of each individual, recorded as either <=50K or >50K.
5. WorkClass: The work classification, including private sector, self-employment, and government employment.
6. Sex: The recorded sex of each individual.
7. HoursPerWeek: The number of hours worked per week.
8. MaritalStatus: The marital status of each individual (married, single, divorced, etc.).
9. NativeCountry: The native country of each individual, indicating geographic origin.
Data Cleaning

1. Column Renaming: Assigned descriptive column names when reading the file by passing col.names = column_names to read.csv().
2. Whitespace Removal: Applied mutate_all(trimws) to remove leading and trailing whitespace from all columns, so that values are formatted consistently across columns.
3. Handling Missing Values: Passed na.strings = " ?" to read.csv() so that every "?" entry is read as NA (Not Available) and can be removed uniformly in the next step.
4. Removing Rows with NA Values: Applied na.omit() to remove rows containing NA values, so that the analysis is conducted on complete records only.
5. Removing Unnecessary Columns: Used select(-capital_loss, -capital_gain) to drop the columns "capital_gain" and "capital_loss", which were not needed for this analysis.
6. Data Type Conversion: Converted the columns "Age", "final_weight", "edu_num", and "weekly_hour" to numeric using mutate(across(...)) to facilitate summary statistics and correlation analysis.

Initial Analysis

The initial analysis involved creating frequency tables, which show how categories are distributed within key variables such as "Marital Status," "Occupation," "Sex," "Work Class," and "Race." This preliminary examination set the stage for the cross-tabulations and visualizations that follow.
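The cleaning steps listed above can be sketched on a small inline sample. This is a minimal illustration, not the report's own pipeline: it uses base R only, the four sample rows are invented for the example, and it swaps mutate_all(trimws) for the strip.white argument of read.csv(), which trims whitespace at read time so that "?" can be matched without a leading space. The column names are the ones defined in the report's script.

```r
# Column names as defined in the report's script
column_names <- c("Age", "workclass", "final_weight", "edu_level", "edu_num",
                  "marital_status", "occupation", "relationship", "race", "sex",
                  "capital_gain", "capital_loss", "weekly_hour", "native_country",
                  "income")

# Hypothetical rows in the adult.data layout (values invented for illustration)
sample_text <- paste(
  "25, Private, 100000, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K",
  "41, ?, 120000, Bachelors, 13, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 45, United-States, >50K",
  "33, Self-emp-inc, 90000, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 50, ?, >50K",
  "52, Local-gov, 150000, Doctorate, 16, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 38, United-States, >50K",
  sep = "\n"
)

# Read with names, trim whitespace, and treat "?" as NA in a single step
raw <- read.csv(text = sample_text, header = FALSE, col.names = column_names,
                strip.white = TRUE, na.strings = "?", stringsAsFactors = FALSE)

# Drop incomplete rows, then drop the two capital columns
clean <- na.omit(raw)
clean <- clean[, !(names(clean) %in% c("capital_gain", "capital_loss"))]

str(clean)
```

Because whitespace is stripped while reading, the numeric columns are parsed as numbers directly and no later as.numeric conversion is needed in this variant.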
Frequency Tables

1. Frequency table for marital status
2. Frequency table for occupation
3. Frequency table for sex
4. Frequency table for work class
5. Frequency table for race

Cross-Tabulations

Cross-tabulation was performed to analyze the relationships between columns:

1. Cross table between education and income
2. Cross table between occupation and sex
3. Cross table between work class, sex, and income
4. Cross table between marital status and income
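To make a cross table easier to read, counts can be turned into row proportions. The sketch below is an illustration only: it rebuilds three rows of the education-by-income table from counts that appear in the CrossTable output in this report, and the object names counts and row_share are assumptions introduced for the example.

```r
# Counts for three education levels, copied from the report's CrossTable output
counts <- matrix(
  c(2918, 2126,
      95,  280,
    8223, 1617),
  nrow = 3, byrow = TRUE,
  dimnames = list(edu_level = c("Bachelors", "Doctorate", "HS-grad"),
                  income = c("<=50K", ">50K"))
)

# margin = 1 divides each count by its row total, giving the share of each
# income bracket within an education level
row_share <- prop.table(counts, margin = 1)

print(round(row_share, 3))
```

The rounded shares correspond to the "N / Row Total" lines of the CrossTable output, for example 0.747 for Doctorate holders in the >50K bracket.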
Results

• Frequency tables provided insights into the distribution of race, occupation, work class, marital status, and sex.
• Cross-tabulations revealed the relationships and joint frequencies between:
1. Education and income
2. Occupation and sex
3. Work class, sex, and income
4. Marital status and income

Data Visualization

The visualizations include histograms of the age counts and the education-level counts, a violin plot of age by income bracket, and a stacked bar chart of income by age and work class.
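The CrossTable output in this report lists a "Chi-square contribution" for every cell. As a worked check, the sketch below recomputes that value for one cell (Doctorate, >50K) from the counts printed in the table, using the usual formula (observed - expected)^2 / expected, where the expected count under independence is row total * column total / table total. The helper name chi_contribution is hypothetical and introduced only for this example.

```r
# Hypothetical helper: chi-square contribution of a single cell
chi_contribution <- function(observed, row_total, col_total, n_total) {
  expected <- row_total * col_total / n_total  # expected count under independence
  (observed - expected)^2 / expected
}

# Counts taken from the report's CrossTable output (Doctorate row, >50K column)
contribution <- chi_contribution(observed = 280, row_total = 375,
                                 col_total = 7508, n_total = 30162)

print(round(contribution, 2))
```

The result agrees with the 373.233 reported for this cell up to rounding. A large contribution such as this one marks a cell whose count is far from what independence between education and income would predict.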
# ............................ Module 1 .................................... #

# Load necessary libraries
library(dplyr)      # For data manipulation
library(tidyverse)  # For data manipulation and visualization
library(janitor)    # For data cleaning
library(ggplot2)    # For plotting
library(gmodels)    # For frequency tables, cross-tabulations, and association tests between categorical variables
library(GGally)     # For additional plotting functions
library(corrr)      # For correlation analysis
library(rmarkdown)  # For creating a word/pdf/html file

# Define column names for data clarity
column_names <- c("Age", "workclass", "final_weight", "edu_level", "edu_num",
                  "marital_status", "occupation", "relationship", "race", "sex",
                  "capital_gain", "capital_loss", "weekly_hour", "native_country",
                  "income")

# Data cleaning and reading CSV file with specified column names
df <- read.csv("adult.data", header = FALSE, col.names = column_names,
               na.strings = " ?") %>%
  mutate_all(trimws) %>%                    # Remove leading/trailing whitespaces
  na.omit() %>%                             # Remove rows with NA values
  select(-capital_loss, -capital_gain) %>%  # Remove capital gain and capital loss from data frame
  mutate(across(c("Age", "final_weight", "edu_num", "weekly_hour"), as.numeric)) %>%
  mutate(edu_level = factor(edu_level))     # Convert "edu_level" column to a factor

# Create Frequency tables
freq_ocup <- as.data.frame(table(df$occupation))
freq_sex <- as.data.frame(table(df$sex))
freq_workclass <- as.data.frame(table(df$workclass))
freq_marital_status <- as.data.frame(table(df$marital_status))
freq_race <- as.data.frame(table(df$race))

# Create Cross-tabulations
ocup_sex <- as.data.frame(table(df$occupation, df$sex))
maritalstatus_income <- as.data.frame(table(df$marital_status, df$income))
edu_income <- as.data.frame(table(df$edu_level, df$income))
workclass_sex_income <- as.data.frame(table(df$workclass, df$sex, df$income))

# Display Cross-table using gmodels
CrossTable(df$edu_level, df$income)

##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  30162
##
##
##              | df$income
## df$edu_level |     <=50K |      >50K | Row Total |
## -------------|-----------|-----------|-----------|
##         10th |       761 |        59 |       820 |
##              |    34.193 |   103.170 |           |
##              |     0.928 |     0.072 |     0.027 |
##              |     0.034 |     0.008 |           |
##              |     0.025 |     0.002 |           |
## -------------|-----------|-----------|-----------|
##         11th |       989 |        59 |      1048 |
##              |    51.773 |   156.215 |           |
##              |     0.944 |     0.056 |     0.035 |
##              |     0.044 |     0.008 |           |
##              |     0.033 |     0.002 |           |
## -------------|-----------|-----------|-----------|
##         12th |       348 |        29 |       377 |
##              |    14.849 |    44.805 |           |
##              |     0.923 |     0.077 |     0.012 |
##              |     0.015 |     0.004 |           |
##              |     0.012 |     0.001 |           |
## -------------|-----------|-----------|-----------|
##      1st-4th |       145 |         6 |       151 |
##              |     8.798 |    26.545 |           |
##              |     0.960 |     0.040 |     0.005 |
##              |     0.006 |     0.001 |           |
##              |     0.005 |     0.000 |           |
## -------------|-----------|-----------|-----------|
##      5th-6th |       276 |        12 |       288 |
##              |    16.471 |    49.698 |           |
##              |     0.958 |     0.042 |     0.010 |
##              |     0.012 |     0.002 |           |
##              |     0.009 |     0.000 |           |
## -------------|-----------|-----------|-----------|
##      7th-8th |       522 |        35 |       557 |
##              |    25.680 |    77.485 |           |
##              |     0.937 |     0.063 |     0.018 |
##              |     0.023 |     0.005 |           |
##              |     0.017 |     0.001 |           |
## -------------|-----------|-----------|-----------|
##          9th |       430 |        25 |       455 |
##              |    22.794 |    68.778 |           |
##              |     0.945 |     0.055 |     0.015 |
##              |     0.019 |     0.003 |           |
##              |     0.014 |     0.001 |           |
## -------------|-----------|-----------|-----------|
##   Assoc-acdm |       752 |       256 |      1008 |
##              |     0.034 |     0.103 |           |
##              |     0.746 |     0.254 |     0.033 |
##              |     0.033 |     0.034 |           |
##              |     0.025 |     0.008 |           |
## -------------|-----------|-----------|-----------|
##    Assoc-voc |       963 |       344 |      1307 |
##              |     0.355 |     1.070 |           |
##              |     0.737 |     0.263 |     0.043 |
##              |     0.043 |     0.046 |           |
##              |     0.032 |     0.011 |           |
## -------------|-----------|-----------|-----------|
##    Bachelors |      2918 |      2126 |      5044 |
##              |   199.992 |   603.439 |           |
##              |     0.579 |     0.421 |     0.167 |
##              |     0.129 |     0.283 |           |
##              |     0.097 |     0.070 |           |
## -------------|-----------|-----------|-----------|
##    Doctorate |        95 |       280 |       375 |
##              |   123.697 |   373.233 |           |
##              |     0.253 |     0.747 |     0.012 |
##              |     0.004 |     0.037 |           |
##              |     0.003 |     0.009 |           |
## -------------|-----------|-----------|-----------|
##      HS-grad |      8223 |      1617 |      9840 |
##              |    93.752 |   282.880 |           |
##              |     0.836 |     0.164 |     0.326 |
##              |     0.363 |     0.215 |           |
##              |     0.273 |     0.054 |           |
## -------------|-----------|-----------|-----------|
##      Masters |       709 |       918 |      1627 |
##              |   215.361 |   649.813 |           |
##              |     0.436 |     0.564 |     0.054 |
##              |     0.031 |     0.122 |           |
##              |     0.024 |     0.030 |           |
## -------------|-----------|-----------|-----------|
##    Preschool |        45 |         0 |        45 |
##              |     3.712 |    11.202 |           |
##              |     1.000 |     0.000 |     0.001 |
##              |     0.002 |     0.000 |           |
##              |     0.001 |     0.000 |           |
## -------------|-----------|-----------|-----------|
##  Prof-school |       136 |       406 |       542 |
##              |   180.519 |   544.684 |           |
##              |     0.251 |     0.749 |     0.018 |
##              |     0.006 |     0.054 |           |
##              |     0.005 |     0.013 |           |
## -------------|-----------|-----------|-----------|
## Some-college |      5342 |      1336 |      6678 |
##              |    21.228 |    64.052 |           |
##              |     0.800 |     0.200 |     0.221 |
##              |     0.236 |     0.178 |           |
##              |     0.177 |     0.044 |           |
## -------------|-----------|-----------|-----------|
## Column Total |     22654 |      7508 |     30162 |
##              |     0.751 |     0.249 |           |
## -------------|-----------|-----------|-----------|
##
##

# Histogram of Age using ggplot2
age_hist <- ggplot(df, aes(x = Age)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age", x = "Age", y = "Frequency") +
  theme_minimal()

# Histogram of Education Level using ggplot2
edu_level_hist <- ggplot(df, aes(x = fct_infreq(edu_level))) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Education Level", x = "Education Level", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
edu_level_hist
# Save ggplot histograms as images
ggsave("age_histogram.png", plot = age_hist, width = 8, height = 6)
ggsave("edu_level_histogram.png", plot = edu_level_hist, width = 8, height = 6)

# Summary statistics for each column in the data frame (mean, median, etc.)
summary_stats <- summary(df)

# Correlation Analysis
correlation_matrix <- cor(df[sapply(df, is.numeric)])

# Violin Plot of Age by Income
violin_plot_age_income <- ggplot(df, aes(x = income, y = Age, fill = income)) +
  geom_violin() +
  labs(title = "Violin Plot of Age by Income", x = "Income", y = "Age") +
  scale_fill_manual(values = c("lightblue", "lightgreen")) +  # Customizing fill colors
  theme_minimal()
# Aggregate data to get counts of each combination of age, workclass, and income
agg_data <- df %>%
  group_by(Age, workclass, income) %>%
  summarise(count = n()) %>%
  ungroup()

## `summarise()` has grouped output by 'Age', 'workclass'. You can override using the `.groups` argument.

# Create a stacked bar chart
stacked_bar <- ggplot(agg_data, aes(x = Age, y = count, fill = income)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ workclass) +
  labs(title = "Distribution of Income Levels by Age and Workclass",
       x = "Age", y = "Count", fill = "Income") +
  theme_minimal()

# Print the stacked bar chart
print(stacked_bar)

Insight: The stacked bar chart shows that the Private work class accounts for the largest counts. The self-employed groups show the lowest income, falling mostly in the <=50K bracket. The chart gives a visual representation of how income varies with age across the work classes.

# Print summary statistics
print(summary_stats)

##       Age         workclass          final_weight          edu_level       edu_num      marital_status
##  Min.   :17.00   Length:30162       Min.   :  13769   HS-grad     :9840   Min.   : 1.00   Length:30162
##  1st Qu.:28.00   Class :character   1st Qu.: 117627   Some-college:6678   1st Qu.: 9.00   Class :character
##  Median :37.00   Mode  :character   Median : 178425   Bachelors   :5044   Median :10.00   Mode  :character
##  Mean   :38.44                      Mean   : 189794   Masters     :1627   Mean   :10.12
##  3rd Qu.:47.00                      3rd Qu.: 237629   Assoc-voc   :1307   3rd Qu.:13.00
##  Max.   :90.00                      Max.   :1484705   11th        :1048   Max.   :16.00
##                                                       (Other)     :4618
##   occupation        relationship           race               sex             weekly_hour
##  Length:30162       Length:30162       Length:30162       Length:30162       Min.   : 1.00
##  Class :character   Class :character   Class :character   Class :character   1st Qu.:40.00
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :40.00
##                                                                              Mean   :40.93
##                                                                              3rd Qu.:45.00
##                                                                              Max.   :99.00
##
##  native_country        income
##  Length:30162       Length:30162
##  Class :character   Class :character
##  Mode  :character   Mode  :character
##

# Print correlation matrix
print(correlation_matrix)
##                      Age final_weight     edu_num weekly_hour
## Age           1.00000000  -0.07651084  0.04352609  0.10159876
## final_weight -0.07651084   1.00000000 -0.04499174 -0.02288575
## edu_num       0.04352609  -0.04499174  1.00000000  0.15252207
## weekly_hour   0.10159876  -0.02288575  0.15252207  1.00000000

# Print the plots
print(age_hist)

Insight: The histogram displays the distribution of ages among individuals in the dataset. Ages range from 17 to 90, and half of the individuals are between 28 and 47 (the first and third quartiles in the summary statistics). The distribution is not uniform: the frequency declines steadily in the older age groups, so elderly individuals make up a small proportion of the dataset. The mean age (38.44) slightly exceeds the median (37), which is consistent with this longer tail toward older ages.

print(edu_level_hist)
Insight: The histogram presents the distribution of education levels among individuals in the dataset. The largest group consists of high school graduates (9,840 of 30,162, about 33%), followed by those with some college education (6,678, about 22%) and a bachelor's degree (5,044, about 17%). The counts drop sharply for advanced degrees: 1,627 individuals hold a master's degree and 375 hold a doctorate. Only a small segment of the dataset has pursued education beyond the college level.

print(violin_plot_age_income)
Insight: The violin plot shows how age is distributed within each of the two income brackets, "<=50K" (lower income) and ">50K" (higher income), which allows the age distributions of the two groups to be compared. The broader sections of a violin denote higher densities of individuals at those ages. Individuals aged approximately 30 or younger fall predominantly into the <=50K bracket. In contrast, the violin for those earning more than 50K is widest at older ages, around 50, indicating a denser concentration in that age group. Individuals earning <=50K are dispersed across a wider range of ages. The >50K group is also the smaller of the two (7,508 of 30,162 individuals, about 25%, according to the cross table); this comes from the counts rather than the plot, because with the default settings of geom_violin() each violin is scaled to the same area, so its width reflects density within a group and not the group's size.

Conclusion

The analysis provides insights into the demographic makeup and income distribution of the individuals in the dataset. It summarizes the dataset's structure and the relationships among key variables. These findings serve as a foundation for subsequent analyses or decisions concerning income, education, and occupation.
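The reading of the violin plot could be backed with numbers by summarizing age within each income bracket. The sketch below is only an illustration under stated assumptions: the data frame toy and its eight rows are invented for the example (they are not taken from the dataset), and only the column names Age and income follow the report's script.

```r
# Hypothetical toy data (invented values), using the report's column names
toy <- data.frame(
  Age = c(22, 25, 31, 38, 45, 47, 52, 58),
  income = c("<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", ">50K"),
  stringsAsFactors = FALSE
)

# Median age within each income bracket
age_by_income <- aggregate(Age ~ income, data = toy, FUN = median)

# Share of individuals in each income bracket
group_share <- prop.table(table(toy$income))

print(age_by_income)
print(group_share)
```

Applied to the report's df instead of toy, the same two calls would give the median age per bracket and the relative size of each bracket, which a violin plot alone does not show.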