RScript

School

Northeastern University

Course

6010

Subject

Statistics

Date

Apr 3, 2024

Manish Kumar Yadav
Module 1 – R Practice
ALY 6010: Probability Theory and Introductory Statistics
Dr. A. Narayan
February 29, 2024

Introduction

This report delves into an extensive dataset containing demographic and socioeconomic information about individuals. It encompasses a wide range of attributes including age, work class, education level, marital status, occupation, race, sex, and native country. The primary objective of this analysis is to uncover insights into the characteristics of the population represented in the dataset and to explore potential correlations between different variables. The analysis involves examining the distribution of various attributes within the dataset, identifying patterns or trends, and investigating relationships between different variables. By conducting this comprehensive analysis, we aim to gain a deeper understanding of the demographic and socioeconomic landscape depicted in the data. Furthermore, through statistical analysis and data visualization techniques, we seek to extract meaningful insights that can inform decision-making processes across diverse fields such as social sciences, economics, and public policy.

Variables of Interest

1. Age: The dataset encompasses individuals across a wide spectrum of age groups, spanning from youth to senior citizens.
2. Education: A diverse array of educational attainment levels is captured, ranging from completion of high school to attainment of advanced degrees.
3. Occupation: Individuals are involved in a multitude of occupations, reflecting varied professional engagements.
4. Income: Of paramount significance, this variable indicates the financial earnings of individuals.
5. WorkClass: The dataset features a diversity of work classifications, including private sector, self-employment, and governmental employment.
6. Sex: Gender information of individuals is included in the dataset.
7. HoursPerWeek: The number of hours dedicated to work per week is documented.
8. MaritalStatus: Individuals' marital statuses, encompassing categories such as married, single, divorced, etc., are detailed.
9. NativeCountry: The native countries of individuals are documented, providing insights into geographic origins.

Data Cleaning

1. Column Renaming: Assigned clearer column names when reading the data by passing col.names = column_names.
2. Whitespace Removal: Used mutate_all(trimws) to remove leading and trailing whitespace from all columns. This keeps the formatting of data consistent across columns and supports accurate analysis.
3. Handling Missing Values: Passed na.strings = " ?" when reading the data so that every "?" entry is read as NA (Not Available). Treating these entries uniformly makes them straightforward to remove in the next step.
4. Removing Rows with NA Values: Applied na.omit() to remove rows containing NA values. This step ensures that the analysis is conducted on complete data without missing values.
5. Removing Unnecessary Columns: Used select(-capital_loss, -capital_gain) to remove the columns "capital_gain" and "capital_loss" from the dataset. These columns were deemed unnecessary for the analysis and were excluded to streamline the dataset.
6. Data Type Conversion: Converted specific columns to the numeric data type using mutate(across(...)). The columns "Age", "final_weight", "edu_num", and "weekly_hour" were converted to numeric.

Initial Analysis

To gain insights into the dataset, the initial analysis involved the creation of frequency tables. These tables show how the categories are distributed across important variables such as "Education," "Occupation," "Work Class," "Marital status," "Age," and "Income." This preliminary examination set the stage for further in-depth analysis and interpretation of the data.
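The cleaning steps and a first frequency table can be sketched in R roughly as follows. This is a minimal illustration under stated assumptions, not the report's actual script: only "Age", "final_weight", "edu_num", "weekly_hour", "capital_gain", and "capital_loss" are column names given in the report, so the remaining names are hypothetical, and the four input rows are made up so the example runs without the original file.

```r
library(dplyr)

# Column names: the six mentioned in the report plus hypothetical names
# (work_class, education, marital_status, ...) assumed for illustration.
column_names <- c("Age", "work_class", "final_weight", "education", "edu_num",
                  "marital_status", "occupation", "relationship", "race", "sex",
                  "capital_gain", "capital_loss", "weekly_hour",
                  "native_country", "income")

# Made-up rows standing in for the real file; fields are separated by ", "
# and missing values are written as "?".
raw_lines <- c(
  "34, Private, 150000, Bachelors, 13, Married, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K",
  "52, Self-emp, 98000, HS-grad, 9, Divorced, Craft-repair, Unmarried, Black, Female, 0, 0, 45, United-States, <=50K",
  "27, ?, 120500, HS-grad, 9, Never-married, ?, Own-child, White, Female, 0, 0, 30, ?, <=50K",
  "45, Local-gov, 110250, Bachelors, 13, Married, Adm-clerical, Wife, Asian, Female, 1500, 0, 38, India, >50K"
)

adult <- read.csv(text = raw_lines, header = FALSE,
                  col.names = column_names,   # step 1: readable column names
                  na.strings = " ?",          # step 3: read " ?" as NA
                  stringsAsFactors = FALSE) %>%
  mutate_all(trimws) %>%                      # step 2: trim whitespace (all columns become character)
  na.omit() %>%                               # step 4: drop rows containing NA
  select(-capital_loss, -capital_gain) %>%    # step 5: drop unused columns
  mutate(across(c(Age, final_weight, edu_num, weekly_hour),
                as.numeric))                  # step 6: convert back to numeric

# Initial analysis: counts and proportions per education level.
edu_freq <- table(adult$education)
edu_prop <- prop.table(edu_freq)
print(edu_freq)
```

Because trimws() returns character vectors, every column is text after step 2, which is why the numeric conversion in step 6 is needed. The same table() call can be applied to the other categorical columns to build the remaining frequency tables.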
