Highland_HW1_Answers_Train

School

Boston College

Course

021

Subject

Statistics

Date

Feb 20, 2024

Uploaded by AdmiralArtJackal44

Highland_HW1_Data Analytics_Train

Q1 a. What are the types of variable (quantitative / qualitative) (nominal / ordinal / interval / ratio) for PassengerId and Age?

A1 a.

Categorical or nominal

A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, a binary variable (such as a yes/no question) is a categorical variable having two categories (yes or no), and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.), and again, there is no agreed way to order these from highest to lowest. A purely nominal variable is one that simply allows you to assign categories, but you cannot clearly order the categories. If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.

Ordinal

An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the categories. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.

Now consider a variable like educational experience (with values such as elementary school graduate, high school graduate, some college and college graduate). These also can be ordered as elementary school, high school, some college, and college graduate. Even though we can order these from lowest to highest, the spacing between the values may not be the same across the levels of the variable.
Say we assign scores 1, 2, 3 and 4 to these four levels of educational experience, and we compare the difference in educational experience between categories one and two with the difference between categories two and three, or the difference between categories three and four. The difference between categories one and two (elementary and high school) is probably much bigger than the difference between categories two and three (high school and some college). In this example, we can order people by level of educational experience, but the size of the difference between categories is inconsistent (because the spacing between categories one and two is bigger than the spacing between categories two and three). If these categories were equally spaced, then the variable would be an interval variable.
Interval (also called numerical)

An interval variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000. The second person makes $5,000 more than the first person and $5,000 less than the third person, and the size of these intervals is the same. If there were two other people who make $90,000 and $95,000, the size of the interval between these two people is also the same ($5,000).

Q1 b. Which variable has the most missing observations?

Age has the most missing values in the train.csv dataset.

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode.

The median should be imputed because Age, SibSp, and Parch are interval and ratio variables.

Q3. Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.

> summary(train$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.42   22.00   28.00   29.36   35.00   80.00
> summary(train$SibSp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   0.523   1.000   8.000
> summary(train$Parch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  0.3816  0.0000  6.0000

    female male
  0     81  468
  1    233  109

, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
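
The missing-value count from Q1 b and the median imputation from Q2 can be sketched in base R as follows. This is only an illustration under stated assumptions: the small data frame below is made up and stands in for train.csv, and impute_median is a hypothetical helper name, not something from the assignment. Base summary() is used instead of psych::describe() so the sketch runs without extra packages.

```r
# Toy stand-in for train.csv; these values are invented for illustration only.
train <- data.frame(
  Age   = c(22, 38, NA, 35, NA, 54),
  SibSp = c(1, 1, 0, NA, 0, 0),
  Parch = c(0, 0, 0, 0, NA, 2)
)

# Count missing values per column to see which variable is affected most.
missing_counts <- colSums(is.na(train))
print(missing_counts)

# Hypothetical helper: replace NAs with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the imputation to each of the three columns named in Q2.
for (col in c("Age", "SibSp", "Parch")) {
  train[[col]] <- impute_median(train[[col]])
}

# Descriptive statistics after imputation.
print(summary(train$Age))
print(summary(train$SibSp))
print(summary(train$Parch))
```

With the real data, the same steps would start from something like train <- read.csv("train.csv"), assuming the file sits in the working directory. Note that median imputation pulls the imputed column toward its center, which is worth keeping in mind when commenting on the summary statistics.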