Lecture Workbook Part 1 - Understanding the Problem and the Data
xlsx
keyboard_arrow_up
School
Kennesaw State University *
*We aren’t endorsed by this school
Course
NOT SURE
Subject
Statistics
Date
Feb 20, 2024
Type
xlsx
Pages
15
Uploaded by MajorDeer2792
The Problem: The Student Success Center and the Advising Department at Data University want to better understand the factors that determine a student’s GPA upon graduation. They do a great job of helping a student decide exactly what mix of classes they should take in each semester to satisfy their degree requirements, but they would like to be able to advise students on what lifestyle habits lead to student success. Of course, they tell students the generalized statements of “study hard”, “go to class”, “balance your school and social life”, etc. However, it would be helpful to know what specific habits have the greatest impact on a student’s GPA and be able to provide them with more targeted recommendations. It would also be helpful to be able to predict the expected GPA of a particular student based on their current habits.
Step 1: The first step in any data science or analytics problem is to understand the problem. This step often requires asking lots of questions and really getting to the root of the problem. We are limited to the information we have above, however it does provide us with a relatively good understanding of the problem.
1. What are the pain points? What will they do once they have the specific habit?
Who is the client?
What so they need to achompolish their goal? What do they need?
What do they want to do?
What relationships do the varbiles have to the GPA?
What relationship do they have with each other?
to predict the expected GPA of a particular student based on their current habits.
2. What sort of outcome are you or the client looking for? What habits contribute towards GPA
Wants to provide better advice
What defines success and what sort of analysis would add value?
Improvement in overall GPA
GPA Average imporved by .2
To be completeed by 12 weeks
Step 2:
Next, we need to use the information provided to formulate a problem statement that can be solved with data. The key components of a good problem statement are that it is clear, concise, and measurable.
problem statement are that it is clear, concise, and measurable. 1. Clear meaning that it is easily understood and not ambiguous. 2. Concise
meaning
that is it no longer than it absolutely needs to be... it is straight to the point (*this problem statement should only be 1-2 sentences). 3. Measurable meaning that it can be measured and is actionable. Sample Problem Statement: “We will perform an explanatory analysis to determine the most impactful habits that lead to student success and build a model to predict a student’s expected GPA, at graduation, based on student habits, at the time of advisement.” “We will perform an explanatory analysis to determine the most impactful habits that lead to student success and build a model to predict a student’s expected GPA, at graduation, based on student habits, at the time of advisement.” + Timeframe
+ Improve average GPA by .2
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Collecting the Data: After we have a good understanding of the problem and have developed a well-defined problem statement, we need to collect the data. If you are working for or consulting with a company, you will figure out what private data they have available for you to use. After that you may want to brainstorm and search for any publicly available data that might be helpful in your analysis. That said, there may be no public data that is relevant to your specific problem, but it is important to consider. After you have a pass at the analysis using the data that is readily available, you may need to collect new data and different data on the observational units to create a more robust and meaningful analysis. For our problem, the advising department has provided us with a small private data set to get started. It contains observations on 133 recent graduate students from Data University that were enrolled as full-time students. The variables include: 1. Student ID: A deidentified student ID number that cannot be linked to the student for ethical reasons 2. Sex: The student’s biological sex 3. Sleep: The average number of hours the student sleeps per night 4. Alcohol: The average number of alcoholic beverages a student consumes per week 5. Exercise: The average number of hours a student exercises per week 6. TV: The average number of hours of TV a student watches per week 7. Study: The average number of hours a student studies per week 8. Seat: The general location a student typically sits in the classroom (1-Front, 2-Middle, 3-Back) 9. GPA: The students grade point average upon graduation Defining the Dependent (Target/Response) Variable: After we obtain the data, we need to define our target variable of interest. (The target is generally determine before the data collection process, but may happen after seeing what data is readily available). Other names for the target variable are the “response variable” or the “dependent variable.” Target, Response and Dependent can be used interchangeably.
What is the variable we are trying to better understand? If we are building an explanatory model, what variable are we seeking to better explain? If we are building a predictive model, what variable are we seeking to predict? 1. Which variable from the list above would be our target? GPA
Listing the Independent (Explanatory/Predictor) Variables: The other variables in the data set can be referred to as “explanatory variables” if we are using them to explain how the target responds to changes in them, “predictor variables” if we are using them to predict the target, or “independent variables.” 2. List the explanatory or predictor variables in the dataset: (Hint: There are only 7 independent analysis variables, because one of the variables above would not make sense to analyze): The objective of the first part of the analysis is to understand the variables we have and how (or if) the independent variables are related to the dependent variable. Exploring the Dataframe Exercise
: Look at the worksheet “0 – Raw Data” to answer the questions
1. How many variables are in the data set? This should correspond to the number of columns 2. How many people (observational units) did the study collect data on? This should correspond to the number of rows in our dataset. 3. What data point is found in cell D17? What does it represent?
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Student ID
Sex
Sleep
Alcohol
Exercise
TV
Study
Seat
GPA
001
Female
7
0
3
120
29
2
3.17
002
Male
6
5
2
35
31
3
3.31
003
Male
6
10
30
48
1
3.59
004
Female
6
0
2
30
28
2
3.47
005
Female
9.5
0
1
30
48
1
3.54
006
Female
6.5
0
1.5
28
17
3
2.91
007
Male
5
20
25
28
3
2.78
008
Male
4
45
3
25
11
3
2.00
009
Female
8
0
10
21
40
1
3.55
010
Male
8
1
3
20
13
3
2.67
011
Male
7
14
5
20
9
3
2.19
012
Female
7
0
0
20
31
1
3.49
013
Male
6.5
30
4
20
26
3
2.75
014
Male
9
0
5
20
57
1
3.74
015
Male
6
0
10
20
37
1
3.03
016
Female
7
0
5
18
26
2
3.32
017
Female
7
14
16
59
1
3.74
018
Female
7
2
2
16
50
1
3.80
019
Male
6
30
6
15
20
3
2.56
020
Female
9
0
2
15
41
1
3.62
021
Female
8
3
2
15
29
2
3.14
022
Male
6
2
1
15
25
2
3.50
023
Male
6
0
2
15
16
3
2.93
024
Male
7.5
12
4
15
26
3
2.94
025
Female
7
2
5
15
36
1
3.71
026
Male
8
0
10
15
40
2
3.32
027
Female
4.5
11
14
26
3
2.90
028
Male
9
18
3
14
31
3
3.10
029
Male
7
6
3
14
34
2
3.24
030
Male
7
0
2
14
44
1
3.55
031
Male
9
10
1.5
14
32
3
3.51
032
Male
5.5
0
14
24
3
2.93
033
Female
3.5
12
10
13
11
3
1.85
034
Male
5
2
6
12
40
2
3.10
035
Female
7
1
5
11
54
1
3.74
036
Male
4.5
20
4
11
18
3
2.90
037
Female
5
3
0
11
12
3
2.57
038
Female
3
15
4
10
7
3
1.70
039
Male
5.5
20
8
10
6
3
2.67
040
Male
8
2
3
10
30
2
3.21
041
Female
9
0
3
10
26
2
3.30
042
Male
2
5
1
10
15
3
2.00
043
Female
8
0
2.5
10
27
1
3.05
044
Female
9
1
5
10
33
2
3.15
045
Male
9
10
7
10
29
2
3.10
046
Female
5.5
3
10
10
26
2
3.22
047
Male
8
15
10
10
42
3
3.53
048
Male
5
6
10
19
3
2.68
049
Male
8
0
5
10
32
2
3.47
050
Male
7
0
2
10
31
3
3.03
051
Female
7
1
7
10
59
1
3.75
052
Male
7
20
15
9
25
2
3.39
053
Male
7
10
7
9
27
2
3.47
054
Female
3.5
18
4.5
9
18
3
2.44
055
Male
6
5
10
8
46
2
3.52
056
Male
26
0
8
8
40
1
3.51
057
Female
8
0
6
8
48
2
3.60
058
Female
6
3
4
8
35
2
3.19
059
Female
7
5
4.5
8
36
2
3.50
060
Female
8
0
2
8
29
2
3.46
061
Female
8
4
8
38
2
3.54
062
Female
5
2
2
8
27
3
3.06
063
Male
6
0
6
8
17
3
2.96
064
Male
6
3
1.5
8
24
3
2.82
065
Male
8
15
25
8
40
3
3.49
066
Female
8
4
1.5
7
34
2
3.18
067
Female
4.5
1
5.5
6
26
3
2.83
068
Female
7
2
10
6
38
1
3.72
069
Female
8.5
4
1
6
26
2
3.11
070
Male
5
20
10
6
30
3
2.90
071
Female
6.5
0
2
6
47
1
3.76
072
Female
9
0
2
6
49
1
3.71
073
Female
9.5
2
0
6
57
1
4.00
074
Female
7
6
4
5.5
47
1
3.62
075
Female
6
0
6
5
35
1
3.69
076
Female
7
0
2
5
46
1
3.64
077
Female
8.5
10
0.5
5
38
2
3.44
078
Female
7
3
5
5
32
2
3.21
079
Female
8
5
4.5
5
45
1
3.90
080
Male
4
2
2
5
29
3
2.93
081
Female
5
4
5
5
29
33
2.93
082
Male
10
0
2.5
5
31
2
3.44
083
Female
7.5
4
4
4
48
1
3.83
084
Male
6
10
3
4
40
3.23
085
Male
6.5
4
9
4
29
3
2.88
086
Female
6
1
4
4
37
2
3.50
087
Female
7
2
6
4
40
3
3.05
088
Female
8.5
0
2
4
35
1
3.59
089
Male
8
10
6
4
22
3
3.03
090
Female
9
1
2
4
19
3
3.00
091
Female
6.5
0
3
4
31
2
3.38
092
Female
8.5
2
3
4
39
1
3.36
093
Female
5.5
5
0
4
22
3
2.90
094
Female
8
0
2.5
3.5
59
1
3.93
095
Female
6.5
9
5.5
3.5
39
1
3.69
096
Male
8
7
8
3.5
45
2
3.61
097
Female
6.5
1
3.5
3.5
59
1
3.73
098
Mle
7
2
3
3.5
26
2
3.46
099
Female
9
2
7
3
36
2
3.29
100
Female
7
0
2
3
26
1
3.12
101
Female
7
5
2
3
42
1
3.71
102
Male
5.5
3
2
3
25
3
3.10
103
Female
5
5
6.5
2.5
19
3
2.86
104
Male
6
0
11
2
30
1
3.32
105
Female
6.5
1
30
2
38
1
3.68
106
Female
8
0
3
2
28
2
3.45
107
Female
8
0
1
2
36
1
3.66
108
Male
5
35
20
2
20
3
2.00
109
Male
7.5
3
6
2
32
2
3.17
110
Male
7.5
0
21.5
2
35
2
3.24
111
Male
8
7
7
2
29
2
3.41
112
Male
9
0
2
2
38
2
3.50
113
Female
8
2
2
2
47
1
3.78
114
Male
6
0
0
2
25
1
3.20
115
Male
7
3
2
2
35
2
3.56
116
Male
7
3
4
2
27
1
3.00
117
Female
9
3
5
2
33
2
3.44
118
Male
8.5
0
5
1.5
28
1
3.51
119
Female
8
0
0
1.5
44
1
3.59
120
Female
8
0
4.5
1
45
1
3.53
121
Female
7
3
2.5
1
50
2
3.53
122
Female
7
1
3
1
50
1
3.75
123
Femal
6.5
5
3
1
37
2
3.60
124
Female
6
0
7
1
17
3
2.90
125
Female
7
10
4
1
9
3
2.66
126
Male
5
5
3
1
30
3
2.94
127
Male
6.5
0
8
1
35
2
3.20
128
Male
8
1
3
1
27
2
3.48
129
Female
8
4
2
1
38
1
3.72
130
Male
5
3
3
1
26
2
3.00
131
Male
6.5
10
0
0
37
3
3.36
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
132
Male
7.5
0
4
0
30
1
3.10
133
Female
8.5
3
2
0
53
1
3.80
Getting to Know the Data Exercise: Often when we receive a dataset there is a data dictionary that organizes the characteristics of each variable, but this is not always the case. When we do not have a data dictionary, we need to try our best to understand the variables we have.
Create a neat table in Excel that organizes our data dictionary
1. Click on the worksheet tab titled “1 – Data Info” 2. Let’s give our table a title. In cell A1, type “Table 1: Data Dictionary for Student Survey at Data University” 3. Now, we will create the column names for the information that will be organized in our table. A. In cell A2, type “Variable Name” B. In cell B2, type “Description” C. In cell C2, type “General Type” D. In cell D2, type “Specific Type” E. In cell E2, type “Measurement Units” 4. Next, we need to fill in the table A. In column A, beginning in cell A3, type the names of the variables one after the other. Be sure to clean up any spelling and capitalization errors that may be present in the raw data. B. In column B, type a brief description of each variable next to the corresponding variable name. The key is to be succinct in your description, while still conveying the information presented by the variable. C. In column C, fill in the general type of each variable. The two general classification types are: i. “Categorical or Qualitative” - These variables categorize the observational units into distinct groups or qualities, hence the name. They can be words, symbols, or coded values that appear as numbers but are treated as categories.
ii. “Numerical or Quantitative” – These variables are measuring or quantifying something about each observational unit. They always appear as numbers and typically have some unit of measure (centimeters, seconds, miles, number of, etc.). We can perform meaningful mathematical summarizations or transformations on these variable types (e.g., averages or means, medians, standard deviations, sums, differences, products, etc.) D. In column D, fill in the specific variable type. This is a more granular classification. There are two quantitative types (discrete and continuous). There are 3 categorical types (nominal, ordinal, and identifiers). i. Quantitative Variables 1. Discrete data is reported as whole number values. Based on the way it is measured, it cannot take on a value between two whole numbers
1. Discrete data is reported as whole number values. Based on the way it is measured, it cannot take on a value between two whole numbers (I.e., an interval). They are generally counts of something (e.g., # of points scored in a football game, # of children in a household, etc.) 2. Continuous data can be reported as whole numbers, but it can also take on a value between two whole numbers (giving it a decimal component). These are generally thought of as being on some sort of measurement scale (e.g., the weight of a car, the time it takes to complete a task, the driving distance for a delivery, etc.) ii. Categorical Variables (a.k.a. Qualitative Variables) 1. Nominal data simply names or labels a characteristic about something. There is no order to the categories, and it cannot be measured (e.g., types of flowers, major field of study, names of cities, etc.) 2. Ordinal data orders the data in some way. One category objectively comes before another in a natural order. (e.g., places in a race – 1 st, 2nd, 3rd, etc., levels of education – high school diploma, associate’s degree, bachelor’s degree, master’s degree, Ph.D., etc.) 3. Identifiers are a special case nominal variables that actually do the opposite of group or categorize observational units. The purpose of an identifier is to identify one thing and one thing only. It is a unique ID for each observational unit (e.g., Social Security Number, Patient ID Number, Full Street Address, etc.). We often use these when merging datasets together to ensure the observations are matched with the correct observational unit. When copying, pasting, splitting, and merging data, always make sure the ID variable is with each dataset. E. In column E, fill in the measurement units for quantitative variables, and list the category names/values for categorical variables (e.g., quantitative - minutes, meters, points… categorical - XS, S, M, L, XL) F. Now that we have entered all the information, we need to clean up our table and make it look nice. i. First, we will center the title. Click in cell A1 and drag the cursor across to cell E1. After the cells are highlighted, go up to the “Alignment” Panel located in the top tool ribbon of the “Home” tab and click “Merge and Center” ii. Next, we are going to widen each column to fit the contents of each cell. Click on the column header “A” and drag the cursor to column header “E.” After the columns are highlighted, bring your cursor to the right edge of column header E until your cursor switches to a vertical line with 2 arrows pointing to the left and right. Now, double click and your columns should widen, so you can see all the information contained in each cell. iii. Now let's center the column titles in our table. Click in cell A2 and drag the cursor across to cell E2. After the cells are highlighted, go up to the “Alignment” Panel located in the top tool ribbon of the “Home” tab and click the “Center” icon ( ) iv. Lastly, we will add some borders and shading. 1. Highlight the table by clicking on the table title and dragging down to row 12. Once the table is highlighted, go up to the “Font” Panel located in the top tool ribbon of the “Home” tab and click the “Borders” dropdown ( ). Select “All Borders.” 2. Now, we will shade the Main Title and Column Title rows to set them apart. Click on the main title of the table. Once the title is highlighted,
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
2. Now, we will shade the Main Title and Column Title rows to set them apart. Click on the main title of the table. Once the title is highlighted, go up to the “Font” Panel located in the top tool ribbon of the “Home” tab and click on the paint bucket. Select a medium shade of gray that will still allow you to clearly see the title. 3. Last, select the cells for the column title and use the paint bucket to shade those cells a lighter shade of gray. v. Congrats! You just created your first table in Excel and learned some useful formatting tools to put in your toolbox. Your final product should look something like this:
Table 1: Data Dicti
Variable Name
Description
Student ID
Unique Student Number
Sex
Biological gender of student
Sleep
Average Hours of sleep per night
Alcohol
Average Alcoholic Beverage consumed per week
Exercise
Average Hours of exercise per week
TV
Average Hours of TV watched per week
Study
Average hours of studing per week
Seat
Typical Seat Location in classroom
GPA
Grade point average of student
tionary for Student Survey at Data University
General Type
Specific Type
Measurement Units
Categorical
Identifier
None
Categorical
Nominal
Male or Female
Quantitative
Continuous
Hours
Quantitative
Discrete
Drinks
Quantitative
Continuous
Hours
Quantitative
Continuous
Hours
Quantitative
Continuous
Hours
Categorical
Ordinal
1-Front, 2-Middle, 3-Back
Quantitative
Continous
Points
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Related Documents
Recommended textbooks for you

Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill

Holt Mcdougal Larson Pre-algebra: Student Edition...
Algebra
ISBN:9780547587776
Author:HOLT MCDOUGAL
Publisher:HOLT MCDOUGAL

Big Ideas Math A Bridge To Success Algebra 1: Stu...
Algebra
ISBN:9781680331141
Author:HOUGHTON MIFFLIN HARCOURT
Publisher:Houghton Mifflin Harcourt
Recommended textbooks for you
- Glencoe Algebra 1, Student Edition, 9780079039897...AlgebraISBN:9780079039897Author:CarterPublisher:McGraw HillHolt Mcdougal Larson Pre-algebra: Student Edition...AlgebraISBN:9780547587776Author:HOLT MCDOUGALPublisher:HOLT MCDOUGALBig Ideas Math A Bridge To Success Algebra 1: Stu...AlgebraISBN:9781680331141Author:HOUGHTON MIFFLIN HARCOURTPublisher:Houghton Mifflin Harcourt

Glencoe Algebra 1, Student Edition, 9780079039897...
Algebra
ISBN:9780079039897
Author:Carter
Publisher:McGraw Hill

Holt Mcdougal Larson Pre-algebra: Student Edition...
Algebra
ISBN:9780547587776
Author:HOLT MCDOUGAL
Publisher:HOLT MCDOUGAL

Big Ideas Math A Bridge To Success Algebra 1: Stu...
Algebra
ISBN:9781680331141
Author:HOUGHTON MIFFLIN HARCOURT
Publisher:Houghton Mifflin Harcourt