Project #3
MIDFIELD, Once Again
Author: Neil J. Hatfield
Published: November 8, 2023
Modified: November 8, 2023
For your third project, you’ll continue to work with the publicly available
data from MIDFIELD (Multiple-Institution Database for Investigating
Engineering Longitudinal Development). You can learn more about the
MIDFIELD project from their website: https://midfield.online/
Getting the Data
You will need to install the {midfielddata} package in order to complete
this project. This package is not located on CRAN. You’ll need to use the
following command to get the package:
install.packages(
  "midfielddata",
  repos = "https://MIDFIELDR.github.io/drat/",
  type = "source"
)
The version of the package that I explored is version 0.2.0. You can learn
more about the data in this package from their package page:
https://midfieldr.github.io/midfielddata/
There is a companion package, {midfieldr}, that could provide you with
some additional tools for working with the MIDFIELD data.
The Two Parts of Project 3
Project 3’s return to MIDFIELD has two parts: an application of the CART
algorithm and a comparison of modeling approaches.
Part 1: Applying CART
For the first part of Project #3, you will explore the same research
question as in Project #2:
What predictors/factors impact the probability that a student will
graduate with a degree?
However, instead of using (binary) logistic regression, you’ll need to apply
CART to build a decision tree. Not only should you build an example
decision tree, but you should also apply an ensemble method, either
bagging or random forest, to improve the underlying model. You will
need to ensure that you provide an interpretation of one tree and discuss
the fit of your model.
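As a rough illustration of the workflow described above, the following sketch fits a single classification tree and then bags trees by hand. It is only a sketch under stated assumptions: the data are simulated, the column names (gpa, transfer, hours, graduated) are hypothetical stand-ins rather than actual MIDFIELD fields, and the hand-rolled bootstrap loop is one possible way to bag trees, not a required method.

```r
# Sketch only: simulated data stand in for the MIDFIELD tables.
library(rpart)  # recommended package that ships with R

set.seed(380)
n <- 600

# Hypothetical predictors; these are NOT actual MIDFIELD column names.
students <- data.frame(
  gpa      = round(runif(n, 1.5, 4.0), 2),
  transfer = factor(sample(c("no", "yes"), n, replace = TRUE)),
  hours    = sample(6:18, n, replace = TRUE)
)
grad_prob <- plogis(-6 + 1.8 * students$gpa + 0.05 * students$hours)
students$graduated <- factor(
  ifelse(runif(n) < grad_prob, "yes", "no"),
  levels = c("no", "yes")
)

# Hold back rows that no model sees during fitting.
test_rows <- sample(n, size = 150)
train <- students[-test_rows, ]
test  <- students[test_rows, ]

# A single classification tree (CART) for interpretation.
single_tree <- rpart(graduated ~ ., data = train, method = "class")
tree_pred   <- predict(single_tree, newdata = test, type = "class")

# Bagging: fit one tree per bootstrap resample, then take a majority vote.
n_trees <- 50
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(graduated ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})
bag_pred <- factor(
  ifelse(rowMeans(votes == "yes") > 0.5, "yes", "no"),
  levels = c("no", "yes")
)

# Misclassification rates on the held-back rows.
tree_error <- mean(tree_pred != test$graduated)
bag_error  <- mean(bag_pred != test$graduated)
print(single_tree)
print(c(single_tree = tree_error, bagged = bag_error))
```

Printing single_tree lists the splits, which is a starting point for interpreting one tree; the two error rates give a first look at fit.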
Part 2: Comparing Modeling Approaches
For the second part of Project #3, you will need to compare your resulting
tree (singular and/or ensemble) with your final (binary) logistic
regression model from Project #2. In this section of your report, you’ll
want to be sure that you:
1. Discuss the similarities and differences between the two approaches,
2. Have pre-planned so that you have a reserved data set to which you can
apply both methods for direct comparison,
3. Compare the results of the two approaches when applied to said reserved
data, and
4. Synthesize the results from the two approaches into a final set of
recommendations for answering the central research question.
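One way the reserved-data comparison in the list above might look is sketched here. The data are simulated, the column names are hypothetical, and the 0.5 cutoff for turning logistic probabilities into classes is an assumption you may want to revisit for your own models.

```r
# Sketch only: compare a logistic regression and a tree on reserved data.
library(rpart)

set.seed(2023)
n <- 500

# Hypothetical columns; not actual MIDFIELD field names.
dat <- data.frame(
  gpa   = runif(n, 1.5, 4.0),
  hours = sample(6:18, n, replace = TRUE)
)
dat$graduated <- factor(
  ifelse(runif(n) < plogis(-6 + 1.8 * dat$gpa + 0.05 * dat$hours), "yes", "no"),
  levels = c("no", "yes")
)

# Set the reserved data aside BEFORE fitting either model.
holdout  <- sample(n, size = 125)
train    <- dat[-holdout, ]
reserved <- dat[holdout, ]

logit_fit <- glm(graduated ~ gpa + hours, data = train, family = binomial)
tree_fit  <- rpart(graduated ~ gpa + hours, data = train, method = "class")

# Predicted classes on the same reserved rows (0.5 cutoff is an assumption).
logit_prob <- predict(logit_fit, newdata = reserved, type = "response")
logit_pred <- factor(ifelse(logit_prob > 0.5, "yes", "no"),
                     levels = c("no", "yes"))
tree_pred  <- predict(tree_fit, newdata = reserved, type = "class")

# Confusion matrices and misclassification rates for a direct comparison.
confusion_logit <- table(observed = reserved$graduated, predicted = logit_pred)
confusion_tree  <- table(observed = reserved$graduated, predicted = tree_pred)
misclass <- c(
  logistic = mean(logit_pred != reserved$graduated),
  tree     = mean(tree_pred != reserved$graduated)
)
print(confusion_logit)
print(confusion_tree)
print(misclass)
```

Because both models are scored on identical rows that neither saw during fitting, differences in the two tables reflect the approaches rather than the data split.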
Project Format
You will need to prepare and submit a typed report that includes all
necessary narrative text, visualizations, values of statistics, and end
material (i.e., references, appendices).
Your report should
1. Have a coherent structure that assists the reader.
2. Build upon your prior explorations with the MIDFIELD data.
You do not need to include all of what you submitted from Project #1,
but you should use that project as a launch pad for your explorations
here.
3. Address the Central Research Question.
4. Provide a clear explanation of your decision process, complete with
evidence.
5. Be submitted as a knitted PDF or Word Document. (HTML files saved
as PDFs will be returned ungraded.)
How you choose to carry out your project is up to you. Here’s what you’ll
need to submit to the appropriate submission portal:
A knitted/rendered document, either a Word document or PDF.
There should be a coherent structure
There should be a Code Appendix
The code should show evidence of being reproducible
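As a small illustration of what evidence of reproducibility can look like in a Code Appendix, the sketch below fixes the random seed before any random step and records the session details. The seed value is arbitrary, and this is only one possible practice, not a stated requirement.

```r
# Sketch only: fixing the seed makes random steps repeatable.
set.seed(380)                 # arbitrary seed value
first_draw <- sample(100, 5)

set.seed(380)                 # same seed, so the same draw comes back
second_draw <- sample(100, 5)

print(identical(first_draw, second_draw))

# Record the R version and loaded packages alongside your results.
print(sessionInfo())
```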
If you use code written by another individual and/or generated for you by a
large language model/generative AI, you MUST flag and document all such
instances.
You may work with other students in the class, but each person is
responsible for submitting their own report. This means you can do things
such as help each other out with coding issues, bounce ideas off
each other, run interpretations by each other, and brainstorm ideas
for further analyses. What you can’t do includes submitting a single report
as a team or submitting reports that are functionally identical.
Targeted Learning Outcomes
The following learning outcomes will be assessed via your project
submission.
SRT.1: The student will be able to determine which underlying
perspective (e.g., Exploratory Data Analysis) they or someone else is
working from for a particular analysis.
SRT.2: The student will be able to differentiate between the goals of
prediction and inference.
Tech.2: The student will learn to use technology to create data
visualizations.
Comm.4: The student will learn to meaningfully discuss data
visualizations to support others in their learning about the current
context.
Tech.3: The student will learn to use technology to perform
calculations on data sets.
Comm.5: The student will learn to interpret the values of statistics
(both descriptive/incisive and inferential) within the current context.
Algo.1: The student will be able to explain how an algorithm works
(e.g., linear regression, logistic regression, k-nearest neighbors
regression, ridge regression, LASSO regression, k-nearest neighbors
classification, classification/decision trees, random forest, k-means
clustering) so that an individual can decide whether it would be
appropriate in a given context.
Algo.2: The student will be able to compare and critique different
algorithms/modeling approaches.
SRT.5: The student will be able to describe various methods for
assessing the closeness of predictions to the observed data.
Algo.5: The student will learn to evaluate implemented models
through a variety of tools (e.g., MSE, RMSE, Misclassification Rate,
Confusion Matrix, Gini Index, ROCs, AUC).
Algo.6: The student will be able to apply various techniques meant to
improve a model (e.g., subset selection methods,
shrinkage/regularization, tuning, cross validation).
Tech.4: The student will learn to use technology to implement
different types of algorithms (e.g., regression methods, k-nearest
neighbors, regression and classification trees, clustering) to gain
insight about an underlying sample, population, or phenomenon.
DW.1: The student will demonstrate a workable knowledge base from
Stat184 that functions as a basis for Stat380.
DW.3: The student will create/modify data frames using sub-setting
and other transformational functions to assist in data analysis.
Comm.1: The student will learn to generate materials (e.g.,
presentations, posters, reports) that tell a coherent story,
incorporate visualizations and statistics, and provide a basis for
making informed decisions.
Tech.1: The student will learn to use technology to his/her advantage
when engaged in data analysis.
Tech.5: The student will learn to use technology to analyze data to
answer research questions.
Comm.6: The student will learn to produce insights, grounded in the
present context, based upon their analytical work using various
statistical models.