DAT 430 MOD1 Journal (1)
School: Southern New Hampshire University
Course: 430
Subject: Industrial Engineering
Date: Dec 6, 2023
Nathan Cumbo
DAT 430
November 1, 2023
Module 1 Journal
In this class, we are exploring seven approaches to preprocessing data and considering
whether each might be useful for the course projects: aggregation, data sampling,
dimensionality reduction, feature subset selection, feature creation, discretization and
binarization, and variable transformation. As an analyst, the questions to ask are: What is
the target data source for Projects One and Two? Which of these preprocessing approaches are
in scope for meeting the needs of these projects?
Data aggregation is the process of collecting data and presenting it in summary form so
that it can be analyzed statistically and used by company executives to make informed
decisions about marketing strategy, pricing, operations, and more. Companies often use
data aggregation to improve marketing and sales. Data aggregation is relevant to Project
One, where our goal is to analyze HR attrition data and present it visually to show why
employees are leaving their jobs at the current rate. Aggregation applies when we collect
the attrition information and summarize it, including the causes of attrition and any
recurring patterns found in the data.
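As a rough illustration of aggregation, the sketch below rolls row-level employee records up into one attrition rate per department. The records, the field names `department` and `left`, and the helper `attrition_rate_by` are all made up for this example; the actual project data may be organized differently.

```python
from collections import defaultdict

# Hypothetical employee records; the field names are assumptions for illustration.
records = [
    {"department": "Sales", "left": True},
    {"department": "Sales", "left": False},
    {"department": "Sales", "left": True},
    {"department": "R&D", "left": False},
    {"department": "R&D", "left": False},
    {"department": "R&D", "left": True},
    {"department": "R&D", "left": False},
]


def attrition_rate_by(records, key):
    """Aggregate row-level records into one attrition rate per group."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        leavers[row[key]] += int(row["left"])
    return {group: leavers[group] / totals[group] for group in totals}


rates = attrition_rate_by(records, "department")
for department, rate in rates.items():
    print(f"{department}: {rate:.0%} attrition")
```

Seven individual rows become two summary numbers, which is the form a chart or an executive summary would normally use.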
In data analytics, data sampling is the practice of analyzing a small subset drawn from a
larger data set, identifying patterns and trends in that subset, and generalizing the
findings to the complete data set. The main benefit of data sampling is that it saves
time: analysts can quickly produce accurate findings about the larger population without
processing every record (Egnyte, 2022). Data sampling likely won’t be necessary for the
projects in this class.
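A minimal sketch of simple random sampling, assuming a synthetic population of employee tenures (the values are generated, not real data): it draws 5% of the records and compares the sample mean with the population mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: tenure in years for 10,000 employees (synthetic values).
population = [rng.uniform(0, 20) for _ in range(10_000)]

# Simple random sample without replacement: 500 records, 5% of the population.
sample = rng.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should land close together, which is the sense in which findings from the sample carry over to the full data set. How close depends on the sample size and on how varied the data is.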
Dimensionality reduction is the process of transforming high-dimensional data into
low-dimensional data while keeping as much of the meaningful information as possible. This
makes data with many dimensions easier to work with by reducing or removing many of those
dimensions. The technique is common in fields that produce high-dimensional raw data, such
as speech recognition, language dialect relations, signal processing, neuroinformatics, and
bioinformatics. Closely related to dimensionality reduction are feature creation and
feature subset selection. Feature subset selection is itself a way of reducing
dimensionality: it keeps a subset of the original variables rather than combining them into
new ones. For Projects One and Two, I believe that both dimensionality reduction and
feature subset selection will play roles in the decision-making process. Project One calls
for the analyst to sort through HR attrition data and create metrics that help explain why
employees are leaving their jobs. There are many reasons for leaving a job: relocation,
pay, family emergencies, dislike of coworkers or bosses, benefits, and so on.
Dimensionality reduction will let us narrow this data to the factors that matter most and
draw more precise conclusions.
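One very simple form of feature subset selection is a variance filter: a column that has the same value for every employee cannot help explain who leaves, so it can be dropped. The sketch below assumes three made-up columns and a hypothetical helper `drop_constant_features`. It is only one of many possible selection methods, and it is not a projection technique such as principal component analysis.

```python
import statistics

# Hypothetical columns from an HR table; names and values are illustrative only.
features = {
    "monthly_income": [3200, 4100, 5800, 2900, 7600],
    "years_at_company": [1, 4, 9, 2, 12],
    "standard_hours": [80, 80, 80, 80, 80],  # identical for everyone
}


def drop_constant_features(features, min_variance=0.0):
    """Keep only the columns whose population variance exceeds min_variance."""
    return {
        name: values
        for name, values in features.items()
        if statistics.pvariance(values) > min_variance
    }


selected = drop_constant_features(features)
print(sorted(selected))  # ['monthly_income', 'years_at_company']
```

With the default threshold of zero, only columns that never vary are removed. A higher threshold would need the columns to be on comparable scales first, since variance depends on the units of measurement.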
Discretization is the process of regrouping continuous or fine-grained values into a
smaller number of categories. A good example is classifying ages into groups. If we were
given the ages of 50 participants, then without discretization we would have to work with
each participant’s exact age. With discretization, we can instead assign each participant
to a labeled group such as ‘Infant’, ‘Young’, ‘Adult’, or ‘Senior’. This helps when working
with a large pool of data by reducing the workload while keeping data loss to a minimum.
While dimensionality reduction will likely be present, I don’t think discretization will be
a factor in the class project.
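The age example can be sketched in a few lines of Python. The four labels come from the paragraph above, but the cut points (2, 18, and 65 years) are my own assumption, since no boundaries are specified. The last line shows binarization, the two-category special case.

```python
from bisect import bisect_right

# Assumed cut points in years; the labels are given above but the boundaries are not.
BOUNDARIES = [2, 18, 65]
LABELS = ["Infant", "Young", "Adult", "Senior"]


def age_group(age):
    """Map an exact age to a labeled bin: <2, 2-17, 18-64, 65 and over."""
    return LABELS[bisect_right(BOUNDARIES, age)]


ages = [1, 7, 17, 18, 42, 64, 65, 80]
groups = [age_group(a) for a in ages]
print(list(zip(ages, groups)))

# Binarization: collapse the same variable to a single 0/1 flag.
is_senior = [int(a >= 65) for a in ages]
print(is_senior)
```

Eight distinct ages collapse into four labels. The exact ages are lost, which is the small amount of information traded away for simpler data.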
Finally, variable transformation is a way to make the data work better for our model.
There are two types of variable transformation: numeric and categorical. In both cases,
the result is a numeric variable: a numeric transformation converts one numeric variable
into another, and a categorical transformation converts a categorical variable into a
numeric one. This may be necessary, at least in Project One, as we will be looking at
categories and characteristics related to attrition rates. It may also apply to numeric
variables such as employee counts, the length of time (in days, months, or years) an
employee worked before quitting, and hours worked per week.
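Both kinds of transformation can be sketched with made-up values. The log transform and the one-hot encoding below are common examples I chose for illustration, not necessarily the transformations the projects will require.

```python
import math

# Hypothetical values for illustration.
monthly_income = [2500, 4000, 12000, 60000]
departments = ["Sales", "R&D", "Sales", "HR"]

# Numeric -> numeric: a log transform compresses a long right tail.
log_income = [math.log10(x) for x in monthly_income]

# Categorical -> numeric: one-hot encoding, one 0/1 column per category.
categories = sorted(set(departments))
one_hot = [[int(d == c) for c in categories] for d in departments]

print([round(v, 2) for v in log_income])
print(categories)  # ['HR', 'R&D', 'Sales']
print(one_hot)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

After the transformation every column is numeric, so a model that only accepts numbers can use the department information as well as the income.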
Of these approaches, I think the best fits for Projects One and Two are data aggregation
and dimensionality reduction. Dimensionality reduction will come into play as we uncover
more and more possible reasons behind HR attrition. Data aggregation comes into play both
before and after the analysis. Part of data aggregation is putting the data into summary
form for visualization and presentation. This project will certainly call for a visual
representation of our findings, as well as descriptions of their importance and relevance.
References:
Racickas, L. (2023, July 26). Data aggregation: Definition, benefits, and examples. Coresignal. https://coresignal.com/blog/data-aggregation/#:~:text=There%20are%20two%20primary%20types,over%20a%20given%20time%20period
Data sampling. (2022, April 19). Egnyte. https://www.egnyte.com/guides/life-sciences/data-sampling#:~:text=With%20data%20sampling%2C%20researchers%2C%20data,findings%20from%20a%20statistical%20population
Google For Developers. (2016, November 15). A.I. experiments: Visualizing High-Dimensional Space. YouTube. https://www.youtube.com/watch?v=wvsE8jm1GzE
Wikimedia Foundation. (2023, October 28). Dimensionality reduction. Wikipedia. https://en.wikipedia.org/wiki/Dimensionality_reduction
Discretization in data mining - javatpoint. (n.d.). www.javatpoint.com. https://www.javatpoint.com/discretization-in-data-mining#:~:text=ADVERTISEMENT-,Data%20discretization%20is%20a%20method%20of%20converting%20attributes%20values%20of,discrete%20attributes%20into%20binary%20attributes
DEI, M. (2020, May 1). Catalog of variable transformations to make your model work better. Medium. https://towardsdatascience.com/catalog-of-variable-transformations-to-make-your-model-works-better-7b506bf80b97#:~:text=Variable%20transformation%20is%20a%20way,variable%20to%20another%20numeric%20variable