Lazar_ Lab_5

docx

School

Southern New Hampshire University

Course

510

Subject

Statistics

Date

Feb 20, 2024

Type

docx

Pages

11

Uploaded by nlazar734

Data Science and Big Data Analytics Lab 05 Guide
Copyright Copyright © 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. EMC2, EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC Snap, EMC SourceOne, EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor, ApplicationXtender, ArchiveXtender, Atmos, Authentica, Authentic Problems, Automated Resource Manager, AutoStart, AutoSwap, AVALONidm, Avamar, Captiva, Catalog Solution, C-Clip, Celerra, Celerra Replicator, Centera, CenterStage, CentraStar, ClaimPack, ClaimsEditor, CLARiiON, ClientPak, Codebook Correlation Technology, Common Information Model, Configuration Intelligence, Configuresoft, Connectrix, CopyCross, CopyPoint, Dantz, DatabaseXtender, Direct Matrix Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum, elnput, E-Lab, EmailXaminer, EmailXtender, Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony, Global File Virtualization, Graphic Visualization, Greenplum, HighRoad, HomeBase, InfoMover, Infoscape, Infra, InputAccel, InputAccel Express, Invista, Ionix, ISIS, Max Retriever, MediaStor, MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale, PixTools, Powerlink, PowerPath, PowerSnap, QuickScan, Rainfinity, RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the RSA logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, Smarts, 
SnapImage, SnapSure, SnapView, SRDF, StorageScope, SupportMate, SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX, TimeFinder, UltraFlex, UltraPoint, UltraScale, Unisphere, VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture, Virtual Provisioning, VisualSAN, VisualSRM, Voyence, VPLEX, VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo, and where information lives, are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners. © Copyright 2012 EMC Corporation. All rights reserved. Published in the USA. Revision Date: June 2012 Revision Number: MR-1CP-DSBDA .1.2.3
Lab Exercise 4: Basic Statistics, Visualization, and Hypothesis Tests

Purpose: This lab introduces you to the analysis of data using the R statistical package within the Data Science and Big Data Analytics environment. After completing the tasks in this lab you should be able to:
  • Perform summary (descriptive) statistics on the data sets
  • Create basic visualizations using R to support both investigation and exploration of the data
  • Create plot visualizations of the data using a graphics package
  • Test a hypothesis about the data

Tasks: Tasks you will complete in this lab include:
  • Reload data sets into the R statistical package
  • Perform summary statistics on the data
  • Remove outliers from the data
  • Plot the data using R
  • Plot the data using lattice and ggplot
  • Test a hypothesis about the data

References: References used in this lab are located in your Student Resource Guide Appendix. See the Appendix for:
  • R Commands – Quick Reference
  • Surviving LINUX – Quick Reference
Part 1 – Basic Statistics and Visualization Using R

Workflow Overview

1. Prepare working environment for the Lab and load data files
2. Obtain summary statistics for Household Income and visualize data
3. Obtain summary statistics for number of rooms and visualize data
4. Remove Outliers
5. Stratify Variable – Household Income and plot the results
6. Plot Histogram and Distributions
7. Compute Correlation between income and number of rooms
8. Create a Boxplot – Distribution of income as a factor of number of rooms
9. Exit R
LAB Instructions

Step 1: Prepare working environment for the Lab and load data files

First, get the files you created earlier into the R environment. For this lab, you will use RStudio to prepare your file for manipulation.

1. Load lab01.txt and lab02.txt into Big Sheets.
2. Set the working directory to the directory where you have stored your data. For example, my data is stored at "M:/Users/<your_user_name>/Desktop/DAT510/". In the console window, type the following (this uses my directory as an example; change the path to yours):

    setwd("M:/Users/g.britton/Desktop/DAT510")

3. Download the file from the Module 5 Lab area inside of your learning environment (Module5RLab2.r).
4. In the script window, open the script called "Module5Lab2.R" (click "File", then "Open File", and select the file "Module5Lab2.R").

Start R and Read the Data Set Back Into Your Workspace:

NOTE: You will need to change the path listed in the file to the path where you saved your files. For example, the lab uses the path "~/LAB01", while I saved my files at "M:/Users/g.britton/Desktop/DAT510/". Every time I see "~/LAB01", I change that reference to my path.

5. Execute the following commands from the script window:

    options(digits=3)
    options(width=68)
    ls()
    ## Read the data set back into the workspace
    load(file="Labs.Rdata")
    ls()
    rm(lab2)
    ds <- lab1
    colnames(ds) <- c("income", "rooms")

Step 2: Obtain summary statistics for Household Income and visualize data

1. Execute the following commands from the script window:

    summary(ds$income)
    range(ds$income)
    sd(ds$income)
    var(ds$income)
    plot(density(ds$income))  # right skewed (long tail toward high incomes)

2. What is the mean? 67200
3. What is the median? 50300
4. What is the standard deviation? 68178

Step 3: Obtain summary statistics for Number of rooms and visualize data

Execute the following commands from the script window:

    summary(ds$rooms)
    range(ds$rooms)
    sd(ds$rooms)
    plot(as.factor(ds$rooms))

What is the mean? 5.63
What is the median? 6.00
What is the standard deviation? 1.99

Step 4: Remove Outliers
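The two techniques this step relies on, a trimmed mean and a range filter with subset(), can be tried first on simulated data. The sketch below is only an illustration: the lognormal parameters, the seed, and the variable names (income, kept) are assumptions and are not taken from the lab's census file. Note that mean(..., trim=) leaves the data untouched and only changes how the average is computed, whereas subset() actually drops rows.

```r
# Simulated right-skewed "income" values (hypothetical data, not the lab's file)
set.seed(42)
income <- rlnorm(1000, meanlog = 10.8, sdlog = 0.9)

# Plain mean versus a mean computed after dropping the top and bottom 10%
plain_mean   <- mean(income)
trimmed_mean <- mean(income, trim = 0.10)

# Range filter in the same style as the lab's subset() call
kept <- subset(data.frame(income = income),
               income >= 10000 & income < 1000000)

# Compare the three location estimates and count the rows that survive
print(c(plain = plain_mean, trimmed = trimmed_mean, median = median(income)))
print(nrow(kept))
```

For right-skewed data like this, the trimmed mean typically lands between the median and the plain mean, because trimming removes more mass from the long upper tail than from the lower end.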
In a previous lab, you recorded the range of income. You observed that the minimum household income is 4, and the maximum is 1,620,560.

1. Does this make sense to you? Why? *
   Yes, it does make sense. The distribution plot is heavily right-skewed: most of the values sit at the low end, and a long tail stretches out to the far-off maximum.
2. What happens if you throw out the top and bottom 10%? Execute the following line from the script window:

    (m <- mean(ds$income, trim=0.10))

   m = 55347.497
3. How does this compare to the previous mean of this variable?
   The trimmed mean is lower than the previous mean.
4. Execute the following commands from the script window:

    ds <- subset(ds, ds$income >= 10000 & ds$income < 1000000)
    summary(ds)
    quantile(ds$income, seq(from=0, to=1, length=11))

5. How do these values vary from the values in the original data set?
   They vary because the minimum and maximum values are now closer to the mean.
6. Do they make more sense? Yes.
7. Which data set would you prefer to use?
   I would prefer the second data set, since it has fewer outliers than the first.

* We might consider the high and low values as outliers and get rid of them. On the other hand, as we will discover, income is best described by a lognormal distribution, and hence these values lie at the extreme ends, around ±3 standard deviations from the mean.

Step 5: Stratify Variable – Household Income and plot the results

Stratify using breaks that fall close to U.S. guidelines for Poverty, Median Income, Wealth, and Rich (> $250k per year).

1. Execute the following code (listed under the comment heading "step 5" in the script file):

    breaks <- c(0, 23000, 52000, 82000, 250000, 999999)
    labels <- c("Poverty", "LowerMid", "UpperMid", "Wealthy", "Rich")
    wealth <- cut(ds$income, breaks, labels)
    # add wealth as a column to ds
    ds <- cbind(ds, wealth)
    # show the first few lines
    head(ds)

2. Continue to execute the remaining part of the code in Step 5:

    wt <- table(wealth)
    percent <- wt/sum(wt)*100
    wt <- rbind(wt, percent)
    wt
    plot(wt)

3. Take another look, this time at the relationship between wealth and number of rooms. Execute the following lines:

    # take another look -- wealth by rooms
    nt <- table(wealth, ds$rooms)
    print(nt)
    plot(nt)  # nice mosaic plot

4. Execute this code from the script file. These lines remove the variables wealth, breaks, and labels, and then save the data set and the two tables to a file named "Census.Rdata".

    rm(wealth, breaks, labels)
    save(ds, wt, nt, file="Census.Rdata")

Step 6: Plot Histogram and Distributions

Problem: How do you represent income given the range of values?

1. Select and execute the code under "Step 6 Histograms and distributions" in the script file.

    library(MASS)
    with(ds, {
      hist(income, main="Distribution of Household Income", freq=FALSE)
      lines(density(income), lty=2, lwd=2)  # line type (lty) 2 is dashed
      xvals = seq(from=min(income), to=max(income), length=100)
      param = fitdistr(income, "lognormal")
      lines(xvals, dlnorm(xvals, meanlog=param$estimate[1],
                          sdlog=param$estimate[2]), col="blue")
    })

2. Now try the same thing with log10(income):

    logincome = log10(ds$income)
    hist(logincome, main="Distribution of Household Income", freq=FALSE)
    # line type (lty) 2 is a dashed line
    lines(density(logincome), lty=2, lwd=2)
    xvals = seq(from=min(logincome), to=max(logincome), length=100)
    param = fitdistr(logincome, "normal")
    lines(xvals, dnorm(xvals, param$estimate[1], param$estimate[2]),
          lwd=2, col="blue")

Step 7: Compute Correlation between income and number of rooms

1. You need to consider your hypothesis. Your hypothesis is that the number of rooms in a house is predicted by household income (the rich can buy bigger houses), e.g. lm(rooms ~ income). Therefore, the null hypothesis is that there is no correlation between income and number of rooms. The alternative hypothesis is that there is a correlation between income and the number of rooms.
5. Execute the following code (listed after the comment line "Step7" in the script file):

    with(ds, cor(income, rooms))
    with(ds, cor(log(income), rooms))  # this will give a better correlation
6. For comparison, correlate rooms with a completely unrelated variable:

    n = length(ds$income)
    with(ds, cor(runif(n), rooms))

Step 8: Create a Boxplot – Distribution of income as a factor of number of rooms

1. Select and execute the code (listed after the comment line "Step 8") in the script window.
2. Plot the distribution of income as a factor of the number of rooms. The argument log="y" plots income on a log scale. We will suppress the outlier points and let the whiskers cover the full range of the data.

    boxplot(income ~ as.factor(rooms), data=ds, range=0, outline=FALSE, log="y",
            xlab="# rooms", ylab="Income")

3. Plot the number of rooms as a function of wealth level.

    # we'll keep the outlier points in this one
    boxplot(rooms ~ wealth, data=ds, main="Room by Wealth",
            xlab="Category", ylab="# rooms")

Step 9: Exit R

1. Type the following command into the RStudio command window:

    q()

2. R will ask you if you want to save your workspace. Answer "no."

End of Lab Exercise
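As a follow-up to Step 7: cor() returns only the correlation coefficient, so on its own it does not test the null hypothesis of no correlation. One common way to do that in base R is cor.test(), which reports a p-value and a confidence interval. The sketch below runs on simulated data; the data frame sim, the seed, and the assumed link between rooms and log income are all hypothetical stand-ins for the lab's ds and are not part of the lab script.

```r
# Simulated data (hypothetical; stands in for the lab's ds data frame)
set.seed(1)
n <- 500
income <- rlnorm(n, meanlog = 10.8, sdlog = 0.8)
# Assumed relationship for illustration: rooms grows with log10(income) plus noise
rooms <- pmax(1, round(2 * log10(income) - 4 + rnorm(n, sd = 1.5)))
sim <- data.frame(income = income, rooms = rooms)

# Pearson correlation tests; the null hypothesis is a true correlation of 0
raw_test <- with(sim, cor.test(income, rooms))
log_test <- with(sim, cor.test(log(income), rooms))

# Baseline: an unrelated uniform variable, mirroring the lab's runif() comparison
noise_test <- with(sim, cor.test(runif(n), rooms))

# Estimated correlation and p-value for each test
print(c(raw = unname(raw_test$estimate), p = raw_test$p.value))
print(c(log = unname(log_test$estimate), p = log_test$p.value))
print(c(noise = unname(noise_test$estimate), p = noise_test$p.value))
```

A small p-value for the income tests would be evidence against the null hypothesis, while the unrelated variable gives a sense of how large a correlation can arise by chance alone. To try this on the lab data, replace sim with ds.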