HW1

School

University of California, Davis

Course

206

Subject

Statistics

Date

Apr 3, 2024

Practice set is due on Monday, Oct 2, 2023 at 1:10PM

Part 1: Create and manipulate a matrix

Q1. Use the help in R and Google to look up the set.seed() function. Explain in your own words what the set.seed() function does.

The set.seed() function fixes the starting point of R's random number generator, so the same "random" values are generated each time the code is run. Without it, each run generally produces different values.

Q2. Create a dataframe called “df”. Note: there are annotations included in the R Script Template for HW1 to help you interpret this code. After creating the object, find the sum of all the values in df.

set.seed(1522) # replace 1522 with your favorite (any) number
df <- data.frame(yield=rnorm(100, mean=100, sd=5), temp=rnorm(100, mean=10, sd=1))

Q3. Find the mean of the yield.

99.92474

Q4. Find the 8th value of yield.

103.372

Part 2: Import and examine the wine data set

Download the data file called “winedata.csv” from Canvas. This data contains the results of a chemical analysis of 3 cultivars of wine from the same region of Italy. In total, 13 chemicals were measured for each sample. The data set contains multiple samples for each cultivar. It was modified from a data set that is freely available through the UCI Machine Learning Repository.

After downloading the file, read it into R with the read.csv function and store it as the object “Wine”.

Wine <- read.csv("winedata.csv", header=TRUE) # Import the wine data
View(Wine) # look at Wine using View()

Q5. What are the dimensions of the wine data set?

178 rows and 14 columns

Q6. What type of object is the wine data set?

Data frame

Q7. Provide the name of one column in the wine data that contains:
a) categorical variable
b) quantitative variable as integers
c) quantitative variable as numeric

a. Cultivar
b. Mg / Proline
c. Flav / Hue / OD

Part 3: Identify and fix common coding errors

Examine and run the following three chunks of code. Each one has a different common coding error. Fix the error and provide the corrected code in your R script. Write a brief description of the error in your written answer document.

Note: you will need to complete Parts 1 and 2 before doing these questions as we use objects from those parts of the homework in these questions.

Read this blog post about common errors in R for help: https://warin.ca/posts/rcourse-howto-interpretcommonerrors/

Q8.
DF[1:5, 1:2] # subset rows 1 through 5 and columns 1 through 2 in the dataframe you created earlier

R is case-sensitive, so it cannot find an object named DF. The data frame created earlier is named df.

Correct code:
df[1:5, 1:2]

Q9.
vector.a <- c(0.2, 0.5, 0.8, 0.01, 0.03) # create a vector
vector.b <- c(12, 7, 5, 4, 14) # create a second vector
dataframe <- rbind(vector.a vector.b) # join two vectors together to create a data frame
The error that needs to be fixed is in the third line of code.

The "unexpected symbol" error usually points to a missing punctuation mark, in this case a comma (,) between the two arguments.

Correct code:
dataframe <- rbind(vector.a, vector.b)

Q10.
color5 <- Wine[Wine$Color >= 5] # subset data to contain only observations where Color is greater than or equal to 5

The comma after the row condition is missing. Without it, R treats the logical vector as a column selector instead of a row filter, which typically fails with an "undefined columns selected" error.

Correct code:
color5 <- Wine[Wine$Color >= 5, ]

Part 4: Calculate basic summary statistics

Q11. Write code to obtain the mean for all the columns in the wine data that contain quantitative variables (either numeric or integer). Which variable (aka column) has the highest mean?

colMeans(Wine[ , 2:14])
lapply(Wine[ , 2:14], mean)
Proline

Q12. Write code to obtain the standard deviation for all the columns that contain quantitative variables (either numeric or integer). Which variable (aka column) has the lowest standard deviation?

sapply(Wine[ , 2:14], sd)
lapply(Wine[ , 2:14], sd)

NonFlav Phenols

Part 5: Subset dataset to retain specific columns

Reduce the Wine data to contain fewer variables. You need to retain the following columns: Cultivar, Alcohol, Ash, Color, Flav, Mg, and OD.

Wine[c(1:178), c(1,2,4,5,6,9,11)]

Q13. Use a function to display the top rows of your new data frame. Paste a screenshot of this output in your written answer document.

Part 6: Examine correlations between variables

Load the following packages: ggplot2 and GGally. Use the ggpairs() function to create a scatter plot matrix for all of the quantitative variables in the reduced data set you created in Part 5.

Q14. Paste a screenshot of your scatter plot matrix in your written answer document.
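The column subset in Part 5 relies on numeric positions, which silently pick the wrong columns if the file's column order differs from what was assumed. A minimal sketch of selecting by name instead is shown here. It uses a small made-up data frame (wine_demo, with invented values) in place of the real Wine object, and the names wine_demo, keep, and wine_small are hypothetical.

```r
# Toy stand-in for the Wine data frame; all values are made up for illustration.
wine_demo <- data.frame(
  Cultivar = c("Barolo", "Grignolino", "Barbera"),
  Alcohol  = c(14.2, 12.4, 13.2),
  Hue      = c(1.04, 1.05, 0.70),
  Ash      = c(2.4, 2.2, 2.4),
  Color    = c(5.6, 3.1, 7.4),
  Flav     = c(3.1, 2.1, 0.8),
  Mg       = c(127L, 94L, 99L),
  OD       = c(3.9, 2.8, 1.7),
  Proline  = c(1065L, 520L, 630L)
)

# Columns to retain, listed by name rather than by position.
keep <- c("Cultivar", "Alcohol", "Ash", "Color", "Flav", "Mg", "OD")

# Leaving the row index empty keeps every row.
wine_small <- wine_demo[, keep]

# Display the top rows of the reduced data frame (as asked in Q13).
head(wine_small)
```

Assigning the result to a name, as done here with wine_small, also makes the reduced data available for the later parts.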
Q15. Which 3 pairs of variables are most strongly correlated? Consider the absolute value of the correlations to include both positive and negative relationships.

Flavor, Color, and Alcohol

Part 7: Examine data for each of the 3 wine cultivars

Q16. Write code to obtain the sample size for each of the three cultivars. List the values you obtain in your written answer document.

Barolo 59
Grignolino 71
Barbera 48

Q17. Follow the example from the lecture to create a second scatter plot matrix with the data. Your plot should include the following:
• Data for each cultivar shown in a different color
• Scatter plots for the relationships between each pair of quantitative variables
• Box plots and density plots
• Correlations with the overall correlation value and the correlation for each of the 3 cultivars
Paste a screenshot of your plot in your written answer document.
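One possible way to get the per-cultivar sample sizes and a colored scatter plot matrix is sketched here. This is not the lecture's example, which is not shown in this document. The data frame wine_demo is a made-up stand-in: only the group sizes are taken from the Q16 answer, and the Flav and OD values are randomly generated. The plotting step runs only if GGally and ggplot2 are installed.

```r
set.seed(1)  # arbitrary seed so the made-up values are reproducible

# Made-up stand-in for the reduced wine data; only the group sizes come from Q16.
wine_demo <- data.frame(
  Cultivar = factor(rep(c("Barolo", "Grignolino", "Barbera"),
                        times = c(59, 71, 48))),
  Flav     = rnorm(178, mean = 2, sd = 1),
  OD       = rnorm(178, mean = 2.6, sd = 0.7)
)

# Sample size per cultivar: table() counts the rows in each level.
counts <- table(wine_demo$Cultivar)
print(counts)

# Scatter plot matrix with one color per cultivar, built only when the
# plotting packages are available.
if (requireNamespace("GGally", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  p <- GGally::ggpairs(wine_demo, mapping = ggplot2::aes(colour = Cultivar))
  # print(p) would draw the plot
}
```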
Q18. Write code to find the mean value by cultivar for each of the 6 chemicals in the reduced wine data set. Examine these means as well as the plots (histogram, box, and density plots) in the scatterplot matrix to identify the following:
a) A variable where the values for each of the cultivars appear different: Color
b) A variable where the values for each of the cultivars are mostly overlapping: Mg and Ash

Part 8: More closely examine relationship between Flav and OD

Q19. Look at the second scatter plot matrix and find the values for the relationship between Flav and OD. Describe how the correlations for each cultivar compare to the overall correlation value. Is the relationship consistent across cultivars or does one cultivar primarily drive the positive correlation between Flav and OD?

The correlation between Flav and OD differs by cultivar. For Barbera and Barolo the correlation is negative, while for Grignolino it is positive. The overall correlation between Flav and OD across all cultivars is 0.787, even though two of the three within-cultivar correlations are negative. The positive overall correlation is driven by the cultivar Grignolino, which has a Flav and OD correlation of 0.580.
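The pattern described in Q19, where the overall correlation is positive although correlations within groups are negative, can be reproduced with a small constructed example. The sketch below uses invented numbers in a hypothetical data frame called demo, not the wine data. It also shows one way to compute means by group with aggregate(), as asked in Q18.

```r
# Constructed toy data: within each cultivar OD falls as Flav rises,
# but cultivars with higher Flav also have higher OD on average.
demo <- data.frame(
  Cultivar = rep(c("Barbera", "Barolo", "Grignolino"), each = 3),
  Flav     = c(1, 2, 3,  11, 12, 13,  21, 22, 23),
  OD       = c(3, 2, 1,  13, 12, 11,  23, 22, 21)
)

# Mean of each variable by cultivar.
means <- aggregate(cbind(Flav, OD) ~ Cultivar, data = demo, FUN = mean)
print(means)

# Correlation across all rows, ignoring cultivar.
overall <- cor(demo$Flav, demo$OD)

# Correlation computed separately within each cultivar.
by_group <- sapply(split(demo, demo$Cultivar),
                   function(g) cor(g$Flav, g$OD))

print(overall)
print(by_group)
```

Here the differences between the group means dominate the overall correlation, so it is positive even though every within-group correlation is negative. This is why the per-cultivar values in the scatter plot matrix are worth checking alongside the overall value.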