DISC2_Soln

School: University of Wisconsin, Madison
Course: 324
Subject: Mechanical Engineering
Date: Dec 6, 2023

Discussion 2: Numeric and Graphical Summaries

Put Your Name Here

Warmup

Consider the Week 2 Quiz Question 6 with your neighbor. Choose the answers you think make sense and explain how you know. What improvement could be made to these graphs to make this question easier to answer?

Comparison of Two Sets of Data

We will be looking at the salaries of 1000 recently graduated engineers from two schools: Regional Technical Institute and State Polytechnic. We want to compare the distribution of salaries according to various factors.

1. Load the data into your environment by reading in the CSV file (Engineering_Undergraduates_sample_10 to the variable engineers.

   a. Download the CSV file into the same folder as this Rmd file. Drag and drop the file directly into the folder, since opening the .csv file in other programs such as Numbers can cause issues.

   b. Set the folder that holds both Discussion 2 files as your working directory: navigate to that folder in the Files pane of RStudio, then select "Session > Set Working Directory > To Files Pane Location" from the top RStudio menu.

   c. Run the code below to define the variable engineers. The engineers variable is a data frame that has 1000 observations of 6 variables. Confirm that it shows up in the Environment tab of RStudio. Click on the blue arrow next to the engineers name to see some information about the 6 variables it contains.

        engineers <- read.csv("Engineering_Undergraduates_sample_1000.csv", header = TRUE)

   d. View the data frame to see what the data looks like. Run View(engineers) in the console rather than in a code chunk, as View() can create issues when you knit the document. Or click on the table icon in the Environment tab. (This will run View(engineers) in the console for you.)

2. We will be focusing on the variables Salary.K., job, and School.

   Salary.K.: the starting annual salary of the accepted job offer, in thousands of dollars.
   job: yes if the graduate got a starting job offer and no otherwise.
   School: a categorical variable recording the school the student graduated from.

   a. Run the following code to see how R has identified the variables in engineers. Identify whether any of the 3 we are interested in are saved incorrectly.

        str(engineers)
        ## 'data.frame': 1000 obs. of 6 variables:
        ## $ degree   : chr "civil" "physics" "civil" "biomedical" ...
        ## $ GPA      : num 2.53 3.67 3.58 3.45 2.71 3.59 2.7 3.67 3.47 3.7 ...
        ## $ School   : chr "Regional Technical Institute" "Regional Technical Institute" "State Polytechnic"
        ## $ offers   : int 3 2 1 3 2 2 2 1 1 1 ...
        ## $ Salary.K.: num 77.7 99.8 79.4 80.5 73.4 79.8 67.6 83.7 76.4 83.2 ...
        ## $ job      : chr "yes" "yes" "yes" "yes" ...

   b. Run the following code to resave School and job as categorical vectors in the engineers data frame. Notice that the $ after the data frame name engineers pulls up a list of all the columns defined in engineers, and that the as.factor() function changes the variable type to categorical. Reference the Environment tab or rerun str(engineers) to confirm School and job have been updated correctly.

        engineers$School <- as.factor(engineers$School)
        engineers$job <- as.factor(engineers$job)
        str(engineers)
        ## 'data.frame': 1000 obs. of 6 variables:
        ## $ degree   : chr "civil" "physics" "civil" "biomedical" ...
        ## $ GPA      : num 2.53 3.67 3.58 3.45 2.71 3.59 2.7 3.67 3.47 3.7 ...
        ## $ School   : Factor w/ 2 levels "Regional Technical Institute",..: 1 1 2 1 2 1 1 1 2 2 ...
        ## $ offers   : int 3 2 1 3 2 2 2 1 1 1 ...
        ## $ Salary.K.: num 77.7 99.8 79.4 80.5 73.4 79.8 67.6 83.7 76.4 83.2 ...
        ## $ job      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...

3. First, let's look only at graduates who got a job.

        # Create a subset dataset where job == "yes"
        engineers.jobs <- subset(engineers, job == "yes")

4. We will focus on the variable Salary.K. compared across the two schools.

   a. Create side-by-side boxplots and comparative histograms of the variable Salary.K. between graduates where School == "Regional Technical Institute" and "State Polytechnic".

      (i) Save off two data frames, RTI and SP, to hold the data for the graduates from the two schools.

        # First make a data frame RTI for just those graduates who went to Regional Technical Institute
        RTI <- subset(engineers.jobs, School == "Regional Technical Institute")
        # Then make a data frame SP for just the graduates of State Polytechnic
        SP <- subset(engineers.jobs, School == "State Polytechnic")
        # After the above two steps, look at your Environment tab to see what the variables have stored in them.
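As an aside on step (i): the same grouping can be done in one call with split(), which returns one vector per factor level. The sketch below uses a small made-up data frame rather than the course CSV; the names toy, toy.jobs, toy.RTI, toy.SP, and by.school, and all of the values, are invented for illustration.

```r
# Toy stand-in for the engineers data frame (values are made up)
toy <- data.frame(
  School    = c("Regional Technical Institute", "State Polytechnic",
                "Regional Technical Institute", "State Polytechnic",
                "State Polytechnic"),
  Salary.K. = c(70.5, 82.0, 75.0, 79.5, 88.0),
  job       = c("yes", "yes", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Store the two grouping columns as factors, as in step 2b
toy$School <- as.factor(toy$School)
toy$job <- as.factor(toy$job)

# Keep only graduates with a job, as in step 3
toy.jobs <- subset(toy, job == "yes")

# Worksheet approach: one subset() call per school
toy.RTI <- subset(toy.jobs, School == "Regional Technical Institute")
toy.SP <- subset(toy.jobs, School == "State Polytechnic")

# Alternative: split() builds a named list with one salary vector per level
by.school <- split(toy.jobs$Salary.K., toy.jobs$School)
names(by.school)
by.school[["State Polytechnic"]]
```

With only two schools, the two subset() calls are just as readable; split() may be more convenient when there are many groups.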

      (ii) Then save off the two vectors of Salary.K. for those two data frames into salary.RTI and salary.SP.

        # Define salary.RTI to be the salary values for graduates from RTI
        salary.RTI <- RTI$Salary.K.
        # Define salary.SP to be the salary values for graduates from SP
        salary.SP <- SP$Salary.K.
        # After the above two steps, look at your Environment tab to see what the variables have stored in them.

      (iii) Update the following boxplot code to include labels that show which data is which. You'll need to change eval = TRUE for the code to run when you knit.

        boxplot(salary.RTI, salary.SP, horizontal = TRUE, names = c("RTI", "SP"),
                main = "Salary Comparison", xlab = "Salary (thousands $)")

      [Figure: horizontal boxplots for RTI and SP, titled "Salary Comparison"; x-axis: Salary (thousands $), 65 to 100]

        # Or, using the original engineers data frame:
        boxplot(Salary.K. ~ School, data = engineers.jobs, horizontal = TRUE,
                main = "Salary Comparison", ylab = "", xlab = "Salary (thousands $)")
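The vector form and the formula form of boxplot() should summarize the same numbers; only the group labels differ. A minimal sketch of that check, on a made-up data frame (toy, by.vector, and by.formula are hypothetical names, and the values are invented), uses plot = FALSE so that boxplot() returns its summary statistics without drawing anything:

```r
# Toy data: short school labels and made-up salaries
toy <- data.frame(
  School    = factor(c("RTI", "SP", "RTI", "SP", "RTI", "SP", "RTI", "SP")),
  Salary.K. = c(70, 74, 72, 78, 75, 80, 81, 90)
)

# Vector form: one argument per group, labels supplied by hand
by.vector <- boxplot(toy$Salary.K.[toy$School == "RTI"],
                     toy$Salary.K.[toy$School == "SP"],
                     names = c("RTI", "SP"), plot = FALSE)

# Formula form: groups and labels come from the factor levels
by.formula <- boxplot(Salary.K. ~ School, data = toy, plot = FALSE)

# Each column of $stats holds the whisker ends, hinges, and median for one group
by.vector$stats
by.formula$stats
by.formula$names
```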


      [Figure: horizontal boxplots of Salary.K. by School, titled "Salary Comparison"; x-axis: Salary (thousands $), 65 to 100]

      (iv) Update the following frequency and relative frequency histogram code so that both histograms have x-axis classes from 50 to 100 with a width of 5. Also, choose a more useful y-axis for both graphs. Again, you'll need to change eval = TRUE. Why is it important that the x and y axes are consistent across the two histograms? (Skip relative frequency if short on time.)

        # This makes two rows and one column for graphs
        par(mfrow = c(2, 1), mar = c(4, 4, 1.5, 1.5))
        # Frequency histograms are OK since the sample sizes are similar
        hRTI <- hist(salary.RTI, breaks = seq(50, 100, 5), ylim = c(0, 200), plot = TRUE)
        hSN <- hist(salary.SP, breaks = seq(50, 100, 5), ylim = c(0, 200), plot = TRUE)

      [Figure: frequency histograms "Histogram of salary.RTI" and "Histogram of salary.SP"; x-axis: salary, 50 to 100; y-axis: Frequency, 0 to 200]

        # Relative frequency histogram code takes more messing around
        hRTI <- hist(salary.RTI, breaks = seq(50, 100, 5), plot = FALSE)
        hRTI$counts <- hRTI$counts / length(salary.RTI)
        plot(hRTI, ylab = "Relative Frequency", main = "Graduates of RTI",
             xlab = "Salary (thousands)", ylim = c(0, 1))
        hSN <- hist(salary.SP, breaks = seq(50, 100, 5), plot = FALSE)
        hSN$counts <- hSN$counts / length(salary.SP)
        plot(hSN, ylab = "Relative Frequency", ylim = c(0, 1), main = "Graduates of SP",
             xlab = "Salary (thousands)", axes = TRUE)
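The rescaling step in the relative frequency code (dividing the counts by the sample size) can be checked on a small vector. This is a sketch with made-up salaries; x, h, and rel.freq are hypothetical names. It also shows that the density hist() computes is not the same thing as relative frequency here: with classes of width 5, each density value is the relative frequency divided by 5.

```r
# Made-up salaries in thousands of dollars, for illustration only
x <- c(62, 68, 71, 73, 74, 76, 77, 79, 83, 91)

# Same class boundaries as the worksheet: 50 to 100 in steps of 5
h <- hist(x, breaks = seq(50, 100, 5), plot = FALSE)

# Each relative frequency is the proportion of observations in that class
rel.freq <- h$counts / length(x)

sum(h$counts)   # every value falls inside 50 to 100, so this equals length(x)
sum(rel.freq)   # the proportions add up to 1

# Density is relative frequency divided by class width (5 here)
all.equal(h$density * 5, rel.freq)
```

Because relative frequencies are proportions, two groups of different sizes can share the same y-axis scale, which is what makes the comparison between schools fair.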

      [Figure: relative frequency histograms "Graduates of RTI" and "Graduates of SP"; x-axis: Salary (thousands), 50 to 100; y-axis: Relative Frequency, 0 to 1]

        # This resets to one row and one column for graphing
        par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))

   b. Compare the center, variability, and shape of the two groups' data using the graphs and numeric summaries.

        mean(salary.SP)
        ## [1] 77.84685
        mean(salary.RTI)
        ## [1] 76.83511
        median(salary.SP)
        ## [1] 76.9
        median(salary.RTI)
        ## [1] 75.85
        IQR(salary.SP, type = 2)
        ## [1] 8
        IQR(salary.RTI, type = 2)
        ## [1] 6.4
        sd(salary.SP)
        ## [1] 5.567901


        sd(salary.RTI)
        ## [1] 5.973521

      The mean and median salaries are both slightly higher for SP graduates. The IQR is also larger for SP (8 vs. 6.4), since the SP data are more spread out between the median and Q3. RTI has a slightly higher standard deviation (5.97 vs. 5.57) despite its smaller IQR: its middle 50% is more tightly packed, but it has more outlying values. Both distributions look slightly right skewed, which is consistent with each mean being a little larger than its median.
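Two points in this comparison can be illustrated on a small made-up vector (x and y below are hypothetical, not course data). First, the type argument changes how quartiles are computed, so IQR(type = 2) can differ from R's default (type = 7). Second, a single outlying value raises the standard deviation while leaving the IQR unchanged, which is one way a group can have the higher sd and the lower IQR at the same time.

```r
# Made-up values, for illustration only
x <- c(70, 72, 74, 75, 76, 77, 79, 82)

# type = 2 averages the two neighboring order statistics at a tie point;
# type = 7 (R's default) interpolates linearly between them
quantile(x, c(0.25, 0.75), type = 2)   # Q1 = 73, Q3 = 78
quantile(x, c(0.25, 0.75), type = 7)   # Q1 = 73.5, Q3 = 77.5
IQR(x, type = 2)                       # 5
IQR(x, type = 7)                       # 4

# Replace the largest value with an extreme one
y <- replace(x, length(x), 120)

IQR(y, type = 2)      # still 5: the middle 50% has not moved
median(y)             # unchanged from median(x)
sd(x)
sd(y)                 # larger than sd(x) because of the one outlying value
mean(y) > median(y)   # the outlier pulls the mean above the median
```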