Lecture Workbook Part 1 - Understanding the Problem and the Data

xlsx

School

Kennesaw State University *

*We aren’t endorsed by this school

Course

NOT SURE

Subject

Statistics

Date

Feb 20, 2024

Type

xlsx

Pages

15

Uploaded by MajorDeer2792

Report
The Problem: The Student Success Center and the Advising Department at Data University want to better understand the factors that determine a student’s GPA upon graduation. They do a great job of helping a student decide exactly what mix of classes they should take in each semester to satisfy their degree requirements, but they would like to be able to advise students on what lifestyle habits lead to student success. Of course, they tell students the generalized statements of “study hard”, “go to class”, “balance your school and social life”, etc. However, it would be helpful to know what specific habits have the greatest impact on a student’s GPA and be able to provide them with more targeted recommendations. It would also be helpful to be able to predict the expected GPA of a particular student based on their current habits. Step 1: The first step in any data science or analytics problem is to understand the problem. This step often requires asking lots of questions and really getting to the root of the problem. We are limited to the information we have above, however it does provide us with a relatively good understanding of the problem. 1. What are the pain points? What will they do once they have the specific habit? Who is the client? What so they need to achompolish their goal? What do they need? What do they want to do? What relationships do the varbiles have to the GPA? What relationship do they have with each other? to predict the expected GPA of a particular student based on their current habits. 2. What sort of outcome are you or the client looking for? What habits contribute towards GPA Wants to provide better advice What defines success and what sort of analysis would add value? Improvement in overall GPA GPA Average imporved by .2 To be completeed by 12 weeks Step 2: Next, we need to use the information provided to formulate a problem statement that can be solved with data. The key components of a good problem statement are that it is clear, concise, and measurable.
problem statement are that it is clear, concise, and measurable. 1. Clear meaning that it is easily understood and not ambiguous. 2. Concise meaning that is it no longer than it absolutely needs to be... it is straight to the point (*this problem statement should only be 1-2 sentences). 3. Measurable meaning that it can be measured and is actionable. Sample Problem Statement: “We will perform an explanatory analysis to determine the most impactful habits that lead to student success and build a model to predict a student’s expected GPA, at graduation, based on student habits, at the time of advisement.” “We will perform an explanatory analysis to determine the most impactful habits that lead to student success and build a model to predict a student’s expected GPA, at graduation, based on student habits, at the time of advisement.” + Timeframe + Improve average GPA by .2
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Collecting the Data: After we have a good understanding of the problem and have developed a well-defined problem statement, we need to collect the data. If you are working for or consulting with a company, you will figure out what private data they have available for you to use. After that you may want to brainstorm and search for any publicly available data that might be helpful in your analysis. That said, there may be no public data that is relevant to your specific problem, but it is important to consider. After you have a pass at the analysis using the data that is readily available, you may need to collect new data and different data on the observational units to create a more robust and meaningful analysis. For our problem, the advising department has provided us with a small private data set to get started. It contains observations on 133 recent graduate students from Data University that were enrolled as full-time students. The variables include: 1. Student ID: A deidentified student ID number that cannot be linked to the student for ethical reasons 2. Sex: The student’s biological sex 3. Sleep: The average number of hours the student sleeps per night 4. Alcohol: The average number of alcoholic beverages a student consumes per week 5. Exercise: The average number of hours a student exercises per week 6. TV: The average number of hours of TV a student watches per week 7. Study: The average number of hours a student studies per week 8. Seat: The general location a student typically sits in the classroom (1-Front, 2-Middle, 3-Back) 9. GPA: The students grade point average upon graduation Defining the Dependent (Target/Response) Variable: After we obtain the data, we need to define our target variable of interest. (The target is generally determine before the data collection process, but may happen after seeing what data is readily available). Other names for the target variable are the “response variable” or the “dependent variable.” Target, Response and Dependent can be used interchangeably. What is the variable we are trying to better understand? If we are building an explanatory model, what variable are we seeking to better explain? If we are building a predictive model, what variable are we seeking to predict? 1. Which variable from the list above would be our target? GPA
Listing the Independent (Explanatory/Predictor) Variables: The other variables in the data set can be referred to as “explanatory variables” if we are using them to explain how the target responds to changes in them, “predictor variables” if we are using them to predict the target, or “independent variables.” 2. List the explanatory or predictor variables in the dataset: (Hint: There are only 7 independent analysis variables, because one of the variables above would not make sense to analyze): The objective of the first part of the analysis is to understand the variables we have and how (or if) the independent variables are related to the dependent variable. Exploring the Dataframe Exercise : Look at the worksheet “0 – Raw Data” to answer the questions 1. How many variables are in the data set? This should correspond to the number of columns 2. How many people (observational units) did the study collect data on? This should correspond to the number of rows in our dataset. 3. What data point is found in cell D17? What does it represent?
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Student ID Sex Sleep Alcohol Exercise TV Study Seat GPA 001 Female 7 0 3 120 29 2 3.17 002 Male 6 5 2 35 31 3 3.31 003 Male 6 10 30 48 1 3.59 004 Female 6 0 2 30 28 2 3.47 005 Female 9.5 0 1 30 48 1 3.54 006 Female 6.5 0 1.5 28 17 3 2.91 007 Male 5 20 25 28 3 2.78 008 Male 4 45 3 25 11 3 2.00 009 Female 8 0 10 21 40 1 3.55 010 Male 8 1 3 20 13 3 2.67 011 Male 7 14 5 20 9 3 2.19 012 Female 7 0 0 20 31 1 3.49 013 Male 6.5 30 4 20 26 3 2.75 014 Male 9 0 5 20 57 1 3.74 015 Male 6 0 10 20 37 1 3.03 016 Female 7 0 5 18 26 2 3.32 017 Female 7 14 16 59 1 3.74 018 Female 7 2 2 16 50 1 3.80 019 Male 6 30 6 15 20 3 2.56 020 Female 9 0 2 15 41 1 3.62 021 Female 8 3 2 15 29 2 3.14 022 Male 6 2 1 15 25 2 3.50 023 Male 6 0 2 15 16 3 2.93 024 Male 7.5 12 4 15 26 3 2.94 025 Female 7 2 5 15 36 1 3.71 026 Male 8 0 10 15 40 2 3.32 027 Female 4.5 11 14 26 3 2.90 028 Male 9 18 3 14 31 3 3.10 029 Male 7 6 3 14 34 2 3.24 030 Male 7 0 2 14 44 1 3.55 031 Male 9 10 1.5 14 32 3 3.51 032 Male 5.5 0 14 24 3 2.93 033 Female 3.5 12 10 13 11 3 1.85 034 Male 5 2 6 12 40 2 3.10 035 Female 7 1 5 11 54 1 3.74 036 Male 4.5 20 4 11 18 3 2.90 037 Female 5 3 0 11 12 3 2.57 038 Female 3 15 4 10 7 3 1.70 039 Male 5.5 20 8 10 6 3 2.67 040 Male 8 2 3 10 30 2 3.21 041 Female 9 0 3 10 26 2 3.30 042 Male 2 5 1 10 15 3 2.00 043 Female 8 0 2.5 10 27 1 3.05
044 Female 9 1 5 10 33 2 3.15 045 Male 9 10 7 10 29 2 3.10 046 Female 5.5 3 10 10 26 2 3.22 047 Male 8 15 10 10 42 3 3.53 048 Male 5 6 10 19 3 2.68 049 Male 8 0 5 10 32 2 3.47 050 Male 7 0 2 10 31 3 3.03 051 Female 7 1 7 10 59 1 3.75 052 Male 7 20 15 9 25 2 3.39 053 Male 7 10 7 9 27 2 3.47 054 Female 3.5 18 4.5 9 18 3 2.44 055 Male 6 5 10 8 46 2 3.52 056 Male 26 0 8 8 40 1 3.51 057 Female 8 0 6 8 48 2 3.60 058 Female 6 3 4 8 35 2 3.19 059 Female 7 5 4.5 8 36 2 3.50 060 Female 8 0 2 8 29 2 3.46 061 Female 8 4 8 38 2 3.54 062 Female 5 2 2 8 27 3 3.06 063 Male 6 0 6 8 17 3 2.96 064 Male 6 3 1.5 8 24 3 2.82 065 Male 8 15 25 8 40 3 3.49 066 Female 8 4 1.5 7 34 2 3.18 067 Female 4.5 1 5.5 6 26 3 2.83 068 Female 7 2 10 6 38 1 3.72 069 Female 8.5 4 1 6 26 2 3.11 070 Male 5 20 10 6 30 3 2.90 071 Female 6.5 0 2 6 47 1 3.76 072 Female 9 0 2 6 49 1 3.71 073 Female 9.5 2 0 6 57 1 4.00 074 Female 7 6 4 5.5 47 1 3.62 075 Female 6 0 6 5 35 1 3.69 076 Female 7 0 2 5 46 1 3.64 077 Female 8.5 10 0.5 5 38 2 3.44 078 Female 7 3 5 5 32 2 3.21 079 Female 8 5 4.5 5 45 1 3.90 080 Male 4 2 2 5 29 3 2.93 081 Female 5 4 5 5 29 33 2.93 082 Male 10 0 2.5 5 31 2 3.44 083 Female 7.5 4 4 4 48 1 3.83 084 Male 6 10 3 4 40 3.23 085 Male 6.5 4 9 4 29 3 2.88 086 Female 6 1 4 4 37 2 3.50 087 Female 7 2 6 4 40 3 3.05
088 Female 8.5 0 2 4 35 1 3.59 089 Male 8 10 6 4 22 3 3.03 090 Female 9 1 2 4 19 3 3.00 091 Female 6.5 0 3 4 31 2 3.38 092 Female 8.5 2 3 4 39 1 3.36 093 Female 5.5 5 0 4 22 3 2.90 094 Female 8 0 2.5 3.5 59 1 3.93 095 Female 6.5 9 5.5 3.5 39 1 3.69 096 Male 8 7 8 3.5 45 2 3.61 097 Female 6.5 1 3.5 3.5 59 1 3.73 098 Mle 7 2 3 3.5 26 2 3.46 099 Female 9 2 7 3 36 2 3.29 100 Female 7 0 2 3 26 1 3.12 101 Female 7 5 2 3 42 1 3.71 102 Male 5.5 3 2 3 25 3 3.10 103 Female 5 5 6.5 2.5 19 3 2.86 104 Male 6 0 11 2 30 1 3.32 105 Female 6.5 1 30 2 38 1 3.68 106 Female 8 0 3 2 28 2 3.45 107 Female 8 0 1 2 36 1 3.66 108 Male 5 35 20 2 20 3 2.00 109 Male 7.5 3 6 2 32 2 3.17 110 Male 7.5 0 21.5 2 35 2 3.24 111 Male 8 7 7 2 29 2 3.41 112 Male 9 0 2 2 38 2 3.50 113 Female 8 2 2 2 47 1 3.78 114 Male 6 0 0 2 25 1 3.20 115 Male 7 3 2 2 35 2 3.56 116 Male 7 3 4 2 27 1 3.00 117 Female 9 3 5 2 33 2 3.44 118 Male 8.5 0 5 1.5 28 1 3.51 119 Female 8 0 0 1.5 44 1 3.59 120 Female 8 0 4.5 1 45 1 3.53 121 Female 7 3 2.5 1 50 2 3.53 122 Female 7 1 3 1 50 1 3.75 123 Femal 6.5 5 3 1 37 2 3.60 124 Female 6 0 7 1 17 3 2.90 125 Female 7 10 4 1 9 3 2.66 126 Male 5 5 3 1 30 3 2.94 127 Male 6.5 0 8 1 35 2 3.20 128 Male 8 1 3 1 27 2 3.48 129 Female 8 4 2 1 38 1 3.72 130 Male 5 3 3 1 26 2 3.00 131 Male 6.5 10 0 0 37 3 3.36
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
132 Male 7.5 0 4 0 30 1 3.10 133 Female 8.5 3 2 0 53 1 3.80
Getting to Know the Data Exercise: Often when we receive a dataset there is a data dictionary that organizes the characteristics of each variable, but this is not always the case. When we do not have a data dictionary, we need to try our best to understand the variables we have. Create a neat table in Excel that organizes our data dictionary 1. Click on the worksheet tab titled “1 – Data Info” 2. Let’s give our table a title. In cell A1, type “Table 1: Data Dictionary for Student Survey at Data University” 3. Now, we will create the column names for the information that will be organized in our table. A. In cell A2, type “Variable Name” B. In cell B2, type “Description” C. In cell C2, type “General Type” D. In cell D2, type “Specific Type” E. In cell E2, type “Measurement Units” 4. Next, we need to fill in the table A. In column A, beginning in cell A3, type the names of the variables one after the other. Be sure to clean up any spelling and capitalization errors that may be present in the raw data. B. In column B, type a brief description of each variable next to the corresponding variable name. The key is to be succinct in your description, while still conveying the information presented by the variable. C. In column C, fill in the general type of each variable. The two general classification types are: i. “Categorical or Qualitative” - These variables categorize the observational units into distinct groups or qualities, hence the name. They can be words, symbols, or coded values that appear as numbers but are treated as categories. ii. “Numerical or Quantitative” – These variables are measuring or quantifying something about each observational unit. They always appear as numbers and typically have some unit of measure (centimeters, seconds, miles, number of, etc.). We can perform meaningful mathematical summarizations or transformations on these variable types (e.g., averages or means, medians, standard deviations, sums, differences, products, etc.) D. In column D, fill in the specific variable type. This is a more granular classification. There are two quantitative types (discrete and continuous). There are 3 categorical types (nominal, ordinal, and identifiers). i. Quantitative Variables 1. Discrete data is reported as whole number values. Based on the way it is measured, it cannot take on a value between two whole numbers
1. Discrete data is reported as whole number values. Based on the way it is measured, it cannot take on a value between two whole numbers (I.e., an interval). They are generally counts of something (e.g., # of points scored in a football game, # of children in a household, etc.) 2. Continuous data can be reported as whole numbers, but it can also take on a value between two whole numbers (giving it a decimal component). These are generally thought of as being on some sort of measurement scale (e.g., the weight of a car, the time it takes to complete a task, the driving distance for a delivery, etc.) ii. Categorical Variables (a.k.a. Qualitative Variables) 1. Nominal data simply names or labels a characteristic about something. There is no order to the categories, and it cannot be measured (e.g., types of flowers, major field of study, names of cities, etc.) 2. Ordinal data orders the data in some way. One category objectively comes before another in a natural order. (e.g., places in a race – 1 st, 2nd, 3rd, etc., levels of education – high school diploma, associate’s degree, bachelor’s degree, master’s degree, Ph.D., etc.) 3. Identifiers are a special case nominal variables that actually do the opposite of group or categorize observational units. The purpose of an identifier is to identify one thing and one thing only. It is a unique ID for each observational unit (e.g., Social Security Number, Patient ID Number, Full Street Address, etc.). We often use these when merging datasets together to ensure the observations are matched with the correct observational unit. When copying, pasting, splitting, and merging data, always make sure the ID variable is with each dataset. E. In column E, fill in the measurement units for quantitative variables, and list the category names/values for categorical variables (e.g., quantitative - minutes, meters, points… categorical - XS, S, M, L, XL) F. Now that we have entered all the information, we need to clean up our table and make it look nice. i. First, we will center the title. Click in cell A1 and drag the cursor across to cell E1. After the cells are highlighted, go up to the “Alignment” Panel located in the top tool ribbon of the “Home” tab and click “Merge and Center” ii. Next, we are going to widen each column to fit the contents of each cell. Click on the column header “A” and drag the cursor to column header “E.” After the columns are highlighted, bring your cursor to the right edge of column header E until your cursor switches to a vertical line with 2 arrows pointing to the left and right. Now, double click and your columns should widen, so you can see all the information contained in each cell. iii. Now let's center the column titles in our table. Click in cell A2 and drag the cursor across to cell E2. After the cells are highlighted, go up to the “Alignment” Panel located in the top tool ribbon of the “Home” tab and click the “Center” icon ( ) iv. Lastly, we will add some borders and shading. 1. Highlight the table by clicking on the table title and dragging down to row 12. Once the table is highlighted, go up to the “Font” Panel located in the top tool ribbon of the “Home” tab and click the “Borders” dropdown ( ). Select “All Borders.” 2. Now, we will shade the Main Title and Column Title rows to set them apart. Click on the main title of the table. Once the title is highlighted,
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
2. Now, we will shade the Main Title and Column Title rows to set them apart. Click on the main title of the table. Once the title is highlighted, go up to the “Font” Panel located in the top tool ribbon of the “Home” tab and click on the paint bucket. Select a medium shade of gray that will still allow you to clearly see the title. 3. Last, select the cells for the column title and use the paint bucket to shade those cells a lighter shade of gray. v. Congrats! You just created your first table in Excel and learned some useful formatting tools to put in your toolbox. Your final product should look something like this:
Table 1: Data Dicti Variable Name Description Student ID Unique Student Number Sex Biological gender of student Sleep Average Hours of sleep per night Alcohol Average Alcoholic Beverage consumed per week Exercise Average Hours of exercise per week TV Average Hours of TV watched per week Study Average hours of studing per week Seat Typical Seat Location in classroom GPA Grade point average of student
tionary for Student Survey at Data University General Type Specific Type Measurement Units Categorical Identifier None Categorical Nominal Male or Female Quantitative Continuous Hours Quantitative Discrete Drinks Quantitative Continuous Hours Quantitative Continuous Hours Quantitative Continuous Hours Categorical Ordinal 1-Front, 2-Middle, 3-Back Quantitative Continous Points
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help