Key_Lab 3

School

Drexel University *

Course

410

Subject

Statistics

Date

Feb 20, 2024

Pages

9

10/17/23, 9:36 PM Key_Lab 3 file:///Users/zacharykey/Key_Lab-3-.html 1/9

Key_Lab 3
2023-10-11

# Calculating Statistics

## Question 1

cod_data=read.csv("/Users/zacharykey/Downloads/coddata.csv")
#Mean
sum(cod_data$X6)/49
## [1] 11.85714
#Median, 1st, and 3rd quartiles
sort(cod_data$X6)
##  [1]  4  5  5  5  5  5  6  6  6  6  7  7  7  7  8  8  8  8  8  8  9  9  9  9  9
## [26]  9 10 10 10 11 11 11 12 12 12 12 12 12 14 15 15 16 18 19 20 21 21 32 72
#Mode
COD=table(cod_data)
COD
## X6
##  4  5  6  7  8  9 10 11 12 14 15 16 18 19 20 21 32 72
##  1  5  4  4  6  6  3  3  6  1  2  1  1  1  1  2  1  1
#Range
max(cod_data$X6)-min(cod_data$X6)
## [1] 68
#Standard Deviation
MeanDiff=cod_data$X6 - 11.85714
SquareDiff=MeanDiff^2
SumSquareDiff=sum(SquareDiff)
AgesVar=SumSquareDiff/(48)
sqrt(AgesVar)
## [1] 10.27335

I computed each statistic by hand and checked my answers with the built-in R functions. For the median, I used the sort function to lay the data out from least to greatest, then counted to the middle value (the 25th of 49), which is 9. I followed the same process for the 1st and 3rd quartiles: I split the data into a lower half and an upper half around the median and found the middle of each half. The 1st quartile is 7 and the 3rd quartile is 12. For the mode, I used the method from Lab 2 and put the data into a frequency table, which shows that 8, 9, and 12 are tied as the most frequent values, each occurring six times.

## Question 2

set1=read.csv("/Users/zacharykey/Downloads/dataset_1.csv")
hist(set1$Data,breaks=200,main="Histogram of DataSet 1",xlab="Number",xlim=c(0,350))
#Mean
mean(set1$Data)
## [1] 7.2
#Median
median(set1$Data)
## [1] 3
#Interquartile Range
IQR(set1$Data)
## [1] 2
#Range
314-0
## [1] 314
#Standard Dev
sd(set1$Data)
## [1] 35.93651

The mean is 7.2 and the median is 3. The median is the better measure of center for this data set because most of the values are small, but a single outlier at 314 raises the mean significantly.

The same outlier explains the large difference between the standard deviation (35.94) and the interquartile range (2). The interquartile range measures the spread of the middle half of the data: it is the 3rd quartile minus the 1st quartile. It depends on where the values fall in the ordered data, not on their sum, so the one outlier does not affect it. The standard deviation measures how dispersed the data are in relation to the mean. Calculating it requires summing the squared deviations from the mean, so the outlier's large deviation inflates the result.

## Question 3

set2=read.csv("/Users/zacharykey/Downloads/dataset_2.csv")
hist(set2$Data,breaks=10,main="Histogram of DataSet 2",xlab="Numbers")
#Mean
mean(set2$Data)
## [1] 3.067568
#Median
median(set2$Data)
## [1] 3
#Range
max(set2$Data)-min(set2$Data)
## [1] 8
#Interquartile Range
IQR(set2$Data)
## [1] 2
#Standard Dev
sd(set2$Data)
## [1] 1.537932

Data set 2 has no extreme outliers, so the mean is the better measure of the center of the data. The mean summarizes all of the data, while the median only picks out the central value.

## Question 4

set3=read.csv("/Users/zacharykey/Downloads/dataset_3.csv")
hist(set3$Data,breaks=6,main="Histogram of DataSet 3",xlab="Numbers")
#Mean
mean(set3$Data)
## [1] 5.769231
#Median
median(set3$Data)
## [1] 1
#Range
max(set3$Data)-min(set3$Data)
## [1] 17
#Interquartile Range
IQR(set3$Data)
## [1] 9.5
#Standard Dev
sd(set3$Data)
## [1] 5.569945

Data set 3 does not necessarily have an outlier, but a single value (1) is repeated many times. The mean best represents the center of this data because it takes all of the data into account, giving a value of about 5.8. Because the number 1 is repeated so often, the median comes out to 1, which does not represent the center of the data because it does not reflect the higher numeric values.

## Question 5

fish=read.csv("/Users/zacharykey/Downloads/bluegill.csv")
#Population Mean
mean(fish$Length)
## [1] 120.0338
#Standard Dev
sd(fish$Length)
## [1] 41.94847
#Sample Mean and SD of 10 fish
fish_10=sample(fish$Length,size=10,replace=FALSE)
mean(fish_10)
## [1] 134.2
sd(fish_10)
## [1] 34.49573
#Sample Mean and SD of 100 fish
fish_100=sample(fish$Length,size=100,replace=FALSE)
mean(fish_100)
## [1] 125.81
sd(fish_100)
## [1] 37.26609
#Sample Mean and SD of 500 fish
fish_500=sample(fish$Length,size=500,replace=FALSE)
mean(fish_500)
## [1] 120.622
sd(fish_500)
## [1] 41.83649

As the sample size gets bigger, the sample mean and standard deviation get closer to the population values, because a larger sample includes more of the population. The population mean is 120.03, and the sample mean moves toward it as the sample size increases: 134.2, then 125.81, then 120.62. The same applies to the standard deviation: the sample values of 34.50, 37.27, and 41.84 approach the population value of 41.95.

# Boxplots

## Question 1

boxplot(cod_data$X6,main="Boxplot of Cod Weights",ylab="Weight (kg)")
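The numbers behind a boxplot of these weights can be checked directly. The sketch below is an illustration and not part of the original lab: it rebuilds the 49 cod weights from the frequency table printed under Calculating Statistics, Question 1, and passes them to `boxplot.stats`, the base R function that returns the box, whisker, and outlier values a boxplot is drawn from. The names `values`, `counts`, `cod`, and `bp` are assumptions introduced here.

```r
# Rebuild the 49 cod weights from the frequency table in
# Calculating Statistics, Question 1 (each value, then how often it occurs)
values <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 18, 19, 20, 21, 32, 72)
counts <- c(1, 5, 4, 4, 6, 6, 3, 3, 6, 1, 2, 1, 1, 1, 1, 2, 1, 1)
cod <- rep(values, times = counts)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp <- boxplot.stats(cod)
print(bp$stats)

# Weights more than 1.5 box lengths beyond the box, drawn as separate points
print(bp$out)
```

For these data the box runs from 7 to 12 with the median at 9, which matches the quartiles found by hand. The whiskers end at 4 and 19, and the weights 20, 21, 21, 32, and 72 are flagged as outliers.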
The advantage of a histogram over a boxplot is that it shows the frequency of every value of the variable, so the shape of the distribution is clear. The disadvantage is that a histogram does not give the summary that a boxplot does: it does not mark the minimum or maximum values or highlight the outliers in the data. The boxplot also shows the median and the interquartile range.

## Question 2

Mosq=read.csv("/Users/zacharykey/Downloads/Mosquitofish.csv")
split_gender=split(Mosq,Mosq$Gender)
Male=split_gender$M
Female=split_gender$F
Male_data=Male$FishLength
Female_data=Female$FishLength
boxplot(Male_data,Female_data,
        col=c('green','blue'),
        names=c('Male Mosquito Fish','Female Mosquito Fish'),
        ylab='Length',
        main="Comparison of Mosquito Fish Length by Gender")
The boxplots show that female mosquito fish tend to be longer than male mosquito fish. There is some overlap, however: the outliers among the male fish are about the same length as the median female fish. The taller female box also shows that the female lengths are more spread out than the male lengths. There are more outliers in the female data as well, as seen by the white dots above the top whisker. Within each box, the male lengths are concentrated toward the lower end and the female lengths toward the higher end, as shown by where the black line (the median) divides each box.
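A comparison like this can also be backed with numbers. The sketch below is a minimal illustration that assumes a data frame with the same `Gender` and `FishLength` columns used in the lab code. The lengths are hypothetical values made up for the example, not the contents of Mosquitofish.csv, and the names `mosq_demo`, `group_median`, `group_iqr`, and `group_outliers` are introduced here.

```r
# Hypothetical lengths for illustration only; NOT the values in Mosquitofish.csv
mosq_demo <- data.frame(
  Gender     = rep(c("M", "F"), each = 8),
  FishLength = c(20, 21, 21, 22, 22, 23, 24, 31,   # males
                 28, 30, 33, 35, 36, 38, 40, 52)   # females
)

# Median and interquartile range of length within each gender
group_median <- tapply(mosq_demo$FishLength, mosq_demo$Gender, median)
group_iqr    <- tapply(mosq_demo$FishLength, mosq_demo$Gender, IQR)

# Points that boxplot() would draw beyond the whiskers, per gender
group_outliers <- lapply(split(mosq_demo$FishLength, mosq_demo$Gender),
                         function(x) boxplot.stats(x)$out)

print(group_median)
print(group_iqr)
print(group_outliers)
```

Running the same three summaries on the real `Mosq` data frame would put numbers on the median difference, the difference in spread, and the outlier counts read from the plot.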