DAT430 Module One Assignment

Module One Journal Assignment
Tiffany Rudman Quinn
Southern New Hampshire University
Dr. Arash Kamari
October 29, 2023

Aggregation
Data aggregation is the process of combining two or more objects into one, such as combining data sets that contain similar data. In our project scenarios, this could mean combining data sets based on the employees’ reasons for leaving the organization. Data aggregation is used to store and present data in a summary format and supports different kinds of analysis, such as analyzing patterns in human behavior. Data aggregation could be used in project one, since we are looking to analyze the attrition rate of employees. It could also be used in project two if we present a visual of the reasons employees are leaving, such as a pie chart showing what percentage of the population is leaving for each specified reason.

Sampling
Sampling is using a subset of data from a larger population. It is used to reduce the size of a dataset while keeping important data (2023). Data sampling allows data analysts to work with a small amount of data to run analytical models while still producing accurate results (Yasar & Biscobing, 2023). Sampling is more cost effective and less time consuming than analyzing the full population, and it can still produce results we can be confident in. In project one, sampling can be used to analyze the population of employees who have left the organization. Project two can use sampling to help predict the probability of attrition based on current and historical reasons for employees leaving.

Dimensionality Reduction
Dimensionality reduction is a process that reduces the number of dimensions (attributes) in a data set while retaining important information. This is done to reduce the complexity of a model, improve the performance of a learning algorithm, or make it easier to visualize the data (2023). Dimensionality reduction can be used in project two as we are tasked with providing visuals of
our analysis. We also need to provide predictive modeling based on selected predictors in project two, and dimensionality reduction will aid in this.

Feature Subset Selection
This approach to data preprocessing, according to geeksforgeeks.com, is the most important activity. It selects the subset of attributes that are most meaningful, which can give better results than using the full data set. Feature subset selection also helps to reduce the dimensionality of a dataset, ensuring that we only keep the attributes that are important to our analysis. Feature subset selection can be used in both projects, as it will minimize irrelevant data points and keep only those that will help to answer the business question we are trying to solve.

Feature Creation
Feature creation is a process in feature engineering that generates new features in the data. It can significantly improve machine learning models. Feature creation can be domain specific, data driven (based on patterns in the data), or synthetic (combining existing features). It is used to improve model performance, increase robustness, improve the interpretability of the model, and increase the flexibility of the model to handle different types of data. Both projects may use feature creation as we try to understand the data points in project one and provide predictions and analysis of the features we determine in project two.

Discretization and Binarization
Discretization divides data into discrete intervals with minimal loss of information. Binarization converts the data into binary form, which makes it easier to process and analyze. Discretization helps to determine if the data fits the business’ problem statement. In projects one and two, we
could use discretization and binarization to group employees into bins based on age. We could also use the age bins in a visual in project two to display, by age, how many employees in each bin left for a group of specified reasons. We can analyze our attributes better with discretization and binarization.

Variable Transformation
Variable transformation is a way to make the data work better for your model. It can be applied to numerical or categorical values. Variable transformation is used so that the data is easy to understand, is clear of errors and inconsistencies, and improves the performance of data mining algorithms. It is also used to improve data quality, increase data security, and improve data analysis. Project one will benefit from variable transformation as we move data from a database program to an Excel file. In project two, we will need to provide visuals that may rely on variables transformed within the code of the database program.

The approach that I think will fit both projects is variable transformation. This approach will allow us to ensure that the data is clean and accurate. It will help remove errors and inconsistencies in the data. We can use the variables that will be needed to address our business problem statement. Using the variable transformation approach, we will be able to make the data work better for our models, especially when looking to produce visuals that show predictions of the attrition rate and the reasons why employees leave.
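
As a minimal sketch of how the discretization, binarization, and aggregation approaches described above could work together on attrition data, the Python below groups employee ages into bins, flags whether each employee left, and summarizes the attrition rate per age bin. The employee records, the bin edges, and the names `age_bin`, `left_flag`, and `attrition_by_bin` are all hypothetical and are not taken from the project data.

```python
from collections import Counter

# Hypothetical records: (age, reason for leaving, or None if still employed).
employees = [
    (24, "compensation"),
    (31, None),
    (38, "relocation"),
    (45, "compensation"),
    (52, None),
    (59, "retirement"),
    (27, "career growth"),
    (41, None),
]

# Assumed bin edges; each bin covers lower <= age < upper.
AGE_BINS = [(18, 30), (30, 40), (40, 50), (50, 66)]


def age_bin(age):
    """Discretization: map a numeric age to a labeled interval."""
    for lower, upper in AGE_BINS:
        if lower <= age < upper:
            return f"{lower}-{upper - 1}"
    return "other"


def left_flag(reason):
    """Binarization: 1 if the employee left, 0 if still employed."""
    return 0 if reason is None else 1


# Aggregation: total employees and number of leavers in each age bin.
headcount = Counter(age_bin(age) for age, _ in employees)
leavers = Counter(
    age_bin(age) for age, reason in employees if left_flag(reason) == 1
)

# Attrition rate per bin = leavers / headcount.
attrition_by_bin = {
    label: leavers[label] / headcount[label] for label in sorted(headcount)
}

for label, rate in attrition_by_bin.items():
    print(f"{label}: {rate:.0%} attrition")
```

A summary like this could feed a bar or pie chart in project two. The bin edges are a design choice; wider or narrower bins would change how the attrition pattern appears.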

References
Data Preprocessing in Data Mining. (2023, May 6). GeeksforGeeks.
Data Transformation in Data Mining. (2019, February 3). GeeksforGeeks.
Feature Subset Selection Process. (2023, May 18). GeeksforGeeks.
Gupta, R. (2019, December 6). An Introduction to Discretization Techniques for Data Scientists. Towards Data Science.
Introduction to Dimensionality Reduction. (2023, May 6). GeeksforGeeks.
Yasar, K., & Biscobing, J. (2023, May). What is data sampling? TechTarget.