Data Science and Big Data Analytics
Lab 05 Guide
Copyright
Copyright © 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
EMC2, EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC Snap, EMC SourceOne, EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor, ApplicationXtender, ArchiveXtender, Atmos, Authentica, Authentic Problems, Automated Resource Manager, AutoStart, AutoSwap, AVALONidm, Avamar, Captiva, Catalog Solution, C-Clip, Celerra, Celerra Replicator, Centera, CenterStage, CentraStar, ClaimPack, ClaimsEditor, CLARiiON, ClientPak, Codebook Correlation Technology, Common Information Model, Configuration Intelligence, Configuresoft, Connectrix, CopyCross, CopyPoint, Dantz, DatabaseXtender, Direct Matrix Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum, elnput, E-Lab, EmailXaminer, EmailXtender, Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony, Global File Virtualization, Graphic Visualization, Greenplum, HighRoad, HomeBase, InfoMover, Infoscape, Infra, InputAccel, InputAccel Express, Invista, Ionix, ISIS, Max Retriever, MediaStor, MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale, PixTools, Powerlink, PowerPath, PowerSnap, QuickScan, Rainfinity, RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the RSA logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, Smarts, SnapImage, SnapSure, SnapView, SRDF, StorageScope, SupportMate, SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX, TimeFinder, UltraFlex, UltraPoint, UltraScale, Unisphere, VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture, Virtual Provisioning, VisualSAN, VisualSRM, Voyence, VPLEX, VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo, and where information lives, are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners.
© Copyright 2012 EMC Corporation. All rights reserved. Published in the USA.
Revision Date: June 2012
Revision Number: MR-1CP-DSBDA .1.2.3
Lab Exercise 4: Basic Statistics, Visualization, and Hypothesis Tests
Purpose:
This lab introduces you to the analysis of data using the R statistical package within the Data Science and Big Data Analytics environment. After completing the tasks in this lab you should be able to:
Perform summary (descriptive) statistics on the data sets
Create basic visualizations using R to support both investigation and exploration of the data
Create plot visualizations of the data using a graphics package
Test a hypothesis about the data
Tasks:
Tasks you will complete in this lab include:
Reload data sets into the R statistical package
Perform summary statistics on the data
Remove outliers from the data
Plot the data using R
Plot the data using lattice and ggplot
Test a hypothesis about the data
References:
References used in this lab are located in your Student Resource Guide Appendix. See the Appendix for:
R Commands – Quick Reference
Surviving LINUX – Quick Reference
Part 1 – Basic Statistics and Visualization Using R
Workflow Overview
1. Prepare working environment for the Lab and load data files
2. Obtain summary statistics for Household Income and visualize data
3. Obtain summary statistics for number of rooms and visualize data
4. Remove Outliers
5. Stratify Variable – Household Income and plot the results
6. Plot Histogram and Distributions
7. Compute Correlation between income and number of rooms
8. Create a Boxplot – Distribution of income as a factor of number of rooms
9. Exit R
LAB Instructions
Step
Action
1
Prepare working environment for the Lab and load data files
First, get the files you will use into the R environment. For this lab, you will use RStudio to create the file you will manipulate.
1) Load lab01.txt and lab02.txt into Big Sheets.
2) Set the working directory to the directory where your data is stored. For example, my data is stored at "M:/Users/<your_user_name>/Desktop/DAT510/". In the console window, type the following (my directory is only an example; change the path to yours):
setwd("M:/Users/g.britton/Desktop/DAT510")
2.
Download the file (Module5RLab2.r) from the Module 5 Lab area of your learning environment.
3.
In the script window, open the script called "Module5Lab2.R" (click "File", then "Open File", and select "Module5Lab2.R").
Start R and Read the Data Set Back Into Your Workspace:
NOTE: You will need to change the path listed in the file to the path where you saved your files. For example, the lab uses the path "~/LAB01", but I saved my files at "M:/Users/g.britton/Desktop/DAT510/". Every time I see "~/LAB01", I change that reference to my path.
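One way to avoid editing every "~/LAB01" reference by hand is to store the path once in a variable and build file names from it. This is a minimal sketch and not part of the lab script: `lab_dir` and `lab_path()` are hypothetical names introduced here, and a temporary directory stands in for your own folder.

```r
# Hypothetical helper: keep the lab directory in one place.
# tempdir() stands in for your own folder,
# e.g. "M:/Users/<your_user_name>/Desktop/DAT510".
lab_dir <- tempdir()

# Build a full path to a file inside the lab directory.
lab_path <- function(filename) file.path(lab_dir, filename)

# Write a tiny stand-in file and read it back through the helper.
write.csv(data.frame(income = c(40000, 55000), rooms = c(4, 6)),
          lab_path("example.csv"), row.names = FALSE)
example <- read.csv(lab_path("example.csv"))
print(example)
```

With this pattern, only the single assignment to `lab_dir` needs to change when the files move.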
4.
Execute the following commands from the script window:
options(digits=3)
options(width=68)
ls()
## load(file="Labs.Rdata")
ls()
rm(lab2)
ds <- lab1
colnames(ds) <- c("income", "rooms")
2
Obtain summary statistics for Household Income and visualize data:
1.
Execute the following commands from the script window:
summary(ds$income)
range(ds$income)
sd(ds$income)
var(ds$income)
plot(density(ds$income)) # right skewed (long right tail)
2.
What is the mean? 67200
3.
What is the median? 50300
4.
What is the standard deviation? 68178
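A mean well above the median, as in the answers above (67200 versus 50300), is a sign of a long right tail. The sketch below illustrates this on simulated data; the lognormal parameters are arbitrary assumptions and are not estimated from the lab data.

```r
set.seed(1)
# Simulated stand-in for household income (assumed parameters, not the lab data)
sim_income <- rlnorm(10000, meanlog = 10.8, sdlog = 0.8)

sim_mean <- mean(sim_income)
sim_median <- median(sim_income)

# In a right-skewed sample the large values pull the mean above the median
cat("mean:", round(sim_mean), " median:", round(sim_median), "\n")
cat("mean > median:", sim_mean > sim_median, "\n")
```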
3
Obtain summary statistics for Number of rooms and visualize data:
Execute the following commands from the script window:
summary(ds$rooms)
range(ds$rooms)
sd(ds$rooms)
plot(as.factor(ds$rooms))
What is the mean? 5.63
What is the median? 6.00
What is the standard deviation? 1.99
4
Remove Outliers:
In a previous lab, you recorded the range of income. You observed that the minimum household income is 4, and the maximum is 1,620,560.
1.
Does this make sense to you? Why? *
Yes, it makes sense: the density plot is strongly right-skewed, with most households at lower incomes and a long tail reaching toward the maximum on the right.
2.
What happens if you throw out the top and bottom 10%? Execute the following line from the script window
(m <- mean(ds$income, trim=0.10) )
m= 55347.497
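To see what trim=0.10 does, the sketch below reproduces a trimmed mean by hand on a small made-up vector (the values are illustrative only): sort the data, drop floor(n * trim) observations from each end, and average the rest.

```r
# Made-up values with one very small and one very large observation
x <- c(4, 21000, 35000, 42000, 50000, 58000, 66000, 90000, 150000, 1620560)

trim <- 0.10
n <- length(x)
k <- floor(n * trim)            # number of values dropped from each end

sorted <- sort(x)
manual <- mean(sorted[(k + 1):(n - k)])
builtin <- mean(x, trim = trim)

cat("untrimmed mean:", mean(x), "\n")
cat("trimmed mean (manual):", manual, "\n")
cat("trimmed mean (built-in):", builtin, "\n")
```

The two trimmed values agree, and both are far less sensitive to the single extreme value than the untrimmed mean.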
3.
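How does this compare to the previous mean of this variable?
The trimmed mean (55,347) is lower than the previous mean (67,200), because the very large incomes that were trimmed from the top had pulled the untrimmed mean upward.
4.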
Execute the following commands from the script window:
ds <- subset(ds, ds$income >= 10000 & ds$income < 1000000)
summary(ds)
quantile(ds$income, seq(from=0, to=1, length=11))
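The call seq(from=0, to=1, length=11) produces the probabilities 0, 0.1, ..., 1, so quantile() returns the minimum, the nine deciles, and the maximum. A small sketch on the integers 0 to 100, chosen only because the result is easy to check by eye:

```r
probs <- seq(from = 0, to = 1, length.out = 11)
print(probs)

# On 0:100 the deciles fall on multiples of 10
deciles <- quantile(0:100, probs)
print(deciles)
```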
5.
How do these values vary from the values in the original data set? They vary because the minimum and maximum values are now closer to the mean.
6.
Do they make more sense? Yes
7.
Which data set would you prefer to use? I would prefer the second data set since it has fewer outliers than the first.
*We might consider the highest and lowest values to be outliers and remove them. On the other hand, as we will discover, income is best described by a lognormal distribution, and hence these values lie at the extreme ends, ±3 standard deviations from the mean.
5
Stratify Variable – Household Income and plot the results:
Stratify income using breaks close to the U.S. guidelines for Poverty, Median Income, Wealth, and Rich (> $250k per year).
1.
Execute the following code (listed under comment heading “step 5” in the script file):
breaks <- c(0, 23000, 52000, 82000, 250000, 999999)
labels <- c("Poverty", "LowerMid", "UpperMid", "Wealthy", "Rich")
wealth <- cut(ds$income, breaks, labels)
# add wealth as a column to ds
ds <- cbind(ds, wealth)
# show the first few lines
head(ds)
2.
Continue to execute the remaining part of the code in Step 5
wt <- table(wealth)
percent <- wt/sum(wt)*100
wt <- rbind(wt, percent)
wt
plot(wt)
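By default cut() uses intervals that are open on the left and closed on the right, so an income of exactly 23000 falls in "Poverty" and 23001 falls in "LowerMid". The sketch below applies the same breaks and labels to a few made-up incomes; the names with the `demo_` prefix are hypothetical, chosen to avoid overwriting the lab variables.

```r
demo_breaks <- c(0, 23000, 52000, 82000, 250000, 999999)
demo_labels <- c("Poverty", "LowerMid", "UpperMid", "Wealthy", "Rich")

# Made-up incomes, including one exactly on a break
demo_income <- c(15000, 23000, 23001, 60000, 300000)
demo_wealth <- cut(demo_income, demo_breaks, demo_labels)

print(data.frame(income = demo_income, wealth = demo_wealth))

# Counts and percentages per category, as in the table() step above
demo_counts <- table(demo_wealth)
print(rbind(count = demo_counts, percent = demo_counts / sum(demo_counts) * 100))
```

Any income above the last break (999999) would be assigned NA, which is worth checking for when choosing the upper limit.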
3.
Take another look at the data, this time at the relationship between wealth and number of rooms. Execute the following lines:
# take another look -- wealth by rooms
nt <- table(wealth, ds$rooms)
print(nt)
plot(nt) # nice mosaic plot
4.
Execute this code from the script file. These lines remove the variables wealth, breaks, and labels, and then save ds, wt, and nt to a file named "Census.Rdata".
rm(wealth,breaks,labels)
save(ds, wt, nt, file="Census.Rdata")
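save() writes the named objects to a file, and load() restores them under the same names. A minimal round-trip sketch, using a temporary file and a made-up data frame rather than Census.Rdata (`demo_ds` and `demo_file` are hypothetical names):

```r
# Made-up object; demo_file is a temporary path, not the lab's Census.Rdata
demo_ds <- data.frame(income = c(30000, 75000), rooms = c(4, 7))
demo_file <- tempfile(fileext = ".Rdata")

save(demo_ds, file = demo_file)
rm(demo_ds)                      # remove it from the workspace

restored <- load(demo_file)      # load() returns the names of the restored objects
print(restored)
print(demo_ds)
```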
6
Plot Histogram and Distributions:
Problem: How do you represent income given the range of values?
1.
Select and execute the code under "Step 6 Histograms and distributions" in the script file.
library(MASS)
with(ds, {
  hist(income, main="Distribution of Household Income", freq=FALSE)
  lines(density(income), lty=2, lwd=2)
  # line type (lty) 2 is dashed
  xvals = seq(from=min(income), to=max(income), length=100)
  param = fitdistr(income, "lognormal")
  lines(xvals, dlnorm(xvals, meanlog=param$estimate[1], sdlog=param$estimate[2]), col="blue")
})
2.
Now try the same thing with log10(income)
logincome = log10(ds$income)
hist(logincome, main="Distribution of Household Income", freq=FALSE)
# line type lty(2) is a dashed line
lines(density(logincome), lty=2, lwd=2)
xvals = seq(from=min(logincome), to=max(logincome), length=100)
param = fitdistr(logincome, "normal")
lines(xvals, dnorm(xvals, param$estimate[1], param$estimate[2]), lwd=2, col="blue")
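Fitting a lognormal distribution to income and fitting a normal distribution to the log of income give the same maximum-likelihood estimates when the natural log is used. The sketch below checks this on simulated data (the parameters are arbitrary assumptions, not lab values); with log10, as in the step above, the estimates differ only by the constant factor log(10).

```r
library(MASS)

set.seed(42)
# Simulated stand-in for income (assumed parameters, not the lab data)
sim_income <- rlnorm(5000, meanlog = 10.8, sdlog = 0.8)

fit_lognormal <- fitdistr(sim_income, "lognormal")
fit_normal_ln <- fitdistr(log(sim_income), "normal")
fit_normal_log10 <- fitdistr(log10(sim_income), "normal")

print(fit_lognormal$estimate)               # meanlog, sdlog
print(fit_normal_ln$estimate)               # mean, sd of log(income)
print(fit_normal_log10$estimate * log(10))  # rescaled back to natural-log units
```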
7
Compute Correlation between income and number of rooms:
1.
You need to consider your hypothesis.
Your hypothesis is that the number of rooms in a house is predicted by household income (the rich can buy bigger houses), e.g. lm(rooms ~ income).
Null hypothesis: there is no correlation between income and the number of rooms.
Alternate hypothesis: there is a correlation between income and the number of rooms.
5.
Execute the following code (listed after the comment line "Step 7" in the script file).
with(ds, cor(income, rooms))
with(ds, cor(log(income), rooms)) # this will give a better correlation
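cor() returns only the correlation coefficient. To test the null hypothesis of no correlation, base R provides cor.test(), which also reports a p-value. The lab script itself does not call it; this is a sketch on simulated data in which rooms is constructed to depend on log income, so the relationship is an assumption built into the example.

```r
set.seed(7)
n <- 2000
# Simulated stand-ins (assumed relationship, not the lab data)
sim_income <- rlnorm(n, meanlog = 10.8, sdlog = 0.8)
sim_rooms <- round(pmax(1, 2 * log10(sim_income) - 4 + rnorm(n, sd = 1.5)))

r_raw <- cor(sim_income, sim_rooms)
r_log <- cor(log(sim_income), sim_rooms)
cat("cor with income:", round(r_raw, 3),
    " cor with log(income):", round(r_log, 3), "\n")

# Pearson test of H0: the correlation is zero
test_related <- cor.test(log(sim_income), sim_rooms)
# Same test against an unrelated uniform variable, for comparison
test_unrelated <- cor.test(runif(n), sim_rooms)

cat("p-value (log income vs rooms):", test_related$p.value, "\n")
cat("p-value (random vs rooms):", test_unrelated$p.value, "\n")
```

A small p-value for the first test is evidence against the null hypothesis; the second test shows what the result looks like for a variable that has no relationship to rooms.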
6.
For comparison, correlate rooms with a completely unrelated variable.
n = length(ds$income)
with(ds, cor(runif(n), rooms))
8
Create a Boxplot - Distribution of income as a factor of number of rooms:
1.
Select and execute the code (listed after the comment line "Step 8") in the script window.
2.
Plot the distribution of income as a factor of the number of rooms. The argument log="y" plots income on a log scale. We will suppress the outlier points and let the whiskers cover the full range of the data.
boxplot(income ~ as.factor(rooms), data=ds, range=0, outline=FALSE, log="y",
xlab="# rooms", ylab="Income")
3.
Plot the # of rooms as a function of wealth level.
boxplot(rooms ~ wealth, data = ds,
main="Room by Wealth", xlab="Category", ylab="# rooms")
# we’ll keep the outlier points in this one
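Setting range=0 makes the whiskers extend to the minimum and maximum, so no points are classified as outliers. The sketch below checks this on a small made-up vector, using plot=FALSE so that boxplot() returns its statistics instead of drawing.

```r
# Made-up values with one extreme observation
x <- c(12, 15, 18, 20, 22, 25, 30, 250)

default_stats <- boxplot(x, plot = FALSE)              # default range = 1.5
full_range_stats <- boxplot(x, range = 0, plot = FALSE)

# With the default, 250 is flagged as an outlier
print(default_stats$out)
# With range = 0, the whiskers reach min(x) and max(x) and nothing is flagged
print(full_range_stats$stats[c(1, 5), 1])
print(full_range_stats$out)
```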
9
Exit R:
1.
Type the following command into the RStudio command window: q()
2.
R will ask you if you want to save your workspace. Answer "no."
End of Lab Exercise