Entity Academy Lesson 5 Statistical Plots



University of South Florida






Jan 9, 2024





Uploaded by ssaintclair

Sindy Saintclair
Thursday, December 16, 2021
Lesson 5 – Statistical Plots

Learning Objectives and Questions / Notes and Answers

Installing and Using Packages
Argument Names

What is ggplot2?
R has many different systems for plotting data; the two most popular are R Base Graphics and a system called ggplot2. R Base Graphics is built into R, so there is no need to do anything to make its capabilities available. ggplot2 is a package that you must install. ggplot2 is part of a collection of R packages called the tidyverse. You can find more info about the tidyverse on its website.

The Very Basics
    ggplot(dataFrame, aes(x = column, y = column)) + geom_plotType()
- ggplot – always starts with ggplot
- dataFrame – must match the name of your data frame!
- aes – code; stands for "aesthetics"
- x – bottom axis variable
- y – left axis variable
- geom_plotType() – additional arguments can be added; options include boxplot, line, smooth, point, path, and more!

Changing a Variable Type for a Plot
    ggplot(dataFrame, aes(x = type(column), y = column)) + geom_plotType()
- x = type – options include factor, date, and more! (for example, factor(column))
- (column) – column name in parentheses
- (column)) – additional closing parenthesis, which closes aes()

Changing the Scaling
    ggplot(dataFrame, aes(x = type(column), y = column)) + geom_plotType() + scale_y_log10()
- scale – code
- _y_ – select x or y for the axis
- log10() – the transformation; log10 compresses large values strongly, sqrt does so more gently, and many more are available!

Labels
    ggplot(dataFrame, aes(x = column, y = column)) +
      geom_plotType() +
      xlab("xLabel") +
      ylab("yLabel") +
      ggtitle("GraphTitle")
- Label text must be in parentheses and quotes.

Plotting with Color
    ggplot(dataFrame, aes(x = column, y = column)) +
      geom_plotType(aes(color = column)) +
      xlab("xLabel") +
      ylab("yLabel") +
      ggtitle("GraphTitle")
- color – argument name; use within aes() when mapping a column
- column – additional factor variable; colors are assigned automatically

Plotting with Size
    ggplot(dataFrame, aes(x = column, y = column)) +
      geom_plotType(aes(size = column)) +
      xlab("xLabel") +
      ylab("yLabel") +
      ggtitle("GraphTitle")
- size – argument name; use within aes()
- column – additional factor variable; sizes are assigned automatically

Plotting with Transparency
    ggplot(dataFrame, aes(x = column, y = column)) +
      geom_plotType(aes(alpha = column)) +
      xlab("xLabel") +
      ylab("yLabel") +
      ggtitle("GraphTitle")
- alpha – argument name; use within aes()
- column – additional factor variable; transparencies are assigned automatically

Plotting with Multiple Graphs
- From the "gridExtra" package
    grid.arrange(plot1, plot2, ncol = 1)
- plot1, plot2 – named ggplots already created
- ncol – argument name; the number indicates vertical or horizontal layout (ncol = 1 stacks the plots vertically, ncol = 2 places them side by side)

Installing ggplot2
The first thing to do in RStudio is to install the ggplot2 package. Click on the Packages tab.
Click on the Install button. This will bring up a dialog box. Type ggplot2 into the "Packages" field, then click on the Install button. It will take a few minutes to install ggplot2 and the packages needed to support it.

Another way to install packages is to use the function install.packages() and place the name of the package, in quotes, in the parentheses. For instance, to install ggplot2 without using the RStudio controls, you would do the following:

    install.packages("ggplot2")

The end result is the same no matter which method you employ.

The last thing to do to make ggplot2 available for use is to
type the following command in the Console pane:

    library("ggplot2")

If your Packages tab is still visible, you should be able to find ggplot2 in the list of packages, and the box by it should be checked.

Installing Other Packages
You'll need many other packages throughout this course, and the best thing to do is download all of them now! This way, you won't end up with any weird versioning issues. Please make sure you run this code in R. (The datasets package ships with R, so it does not need to be installed. Package names are case-sensitive, so gridExtra needs its capital E.)

    install.packages("ggplot2")
    install.packages("readxl")
    install.packages("dplyr")
    install.packages("PerformanceAnalytics")
    install.packages("corrplot")
    install.packages("gapminder")
    install.packages("gridExtra")
    install.packages("Ecdat")
    install.packages("corpcor")
    install.packages("GPArotation")
    install.packages("psych")
    install.packages("IDPmisc")
    install.packages("lattice")
    install.packages("treetop")
    install.packages("scales")
    install.packages("rcompanion")
    install.packages("gmodels")
    install.packages("car")
    install.packages("caret")
    install.packages("gvlma")
    install.packages("predictmeans")
    install.packages("magrittr")
    install.packages("tidyr")
    install.packages("lmtest")
    install.packages("popbio")
    install.packages("e1071")
    install.packages("data.table")
    install.packages("effects")
    install.packages("multcomp")
    install.packages("mvnormtest")

Histograms
As learned previously, a histogram provides a visual representation of the distribution of values in a sample. Because it is silly to graph each value by itself, the data get broken into bins. Binning takes the range of the data and divides it into equal segments, and each data point falls into exactly one of these segments, or bins. Typically, the number of values that fall into each bin is counted, and this count is used to create a bar graph in which the height of each bar is proportional to the number of values that fall into that bin.

Suppose I want to make a histogram of the height data I used previously. I will first create the vector of heights and assign it to the variable height:

    height <- c(171, 192, 183, 177, 154, 176)

You then create a data frame from height:

    height_df <- data.frame(height)

You create a histogram with these commands:

    h <- ggplot(height_df, aes(x = height))
    h + geom_histogram()

The aes() function is where you specify your variables. You want the variable height from the data frame height_df. These commands create a histogram with the default settings.

Adding in Bins
The commands above also create a warning: "stat_bin() using bins = 30. Pick better value with binwidth." The default histogram has a vertical bar for each height value. Visually, this is not very satisfying. You can make a much more informative histogram by setting the width of each bin to 10. You accomplish this with the binwidth= argument.

    h + geom_histogram(binwidth = 10)

As you can see, the histogram is just a special kind of bar chart. The horizontal variable is the height in centimeters. Each bar in the chart is associated with a lower value and an upper value on the horizontal axis. For example, the bar of height three in the middle of the chart has a lower value of 175 and an upper value of 185. The height of each bar is the number of data values that fall between its lower and upper values. If you look at the height data, there are three values that
fall between 175 and 185. These lower and upper values define the bin for each bar; the height of the bar represents the number of data values that fall into the bin. There is no bar (or you can think of it as a bar that has zero height) corresponding to the bin for values between 155 and 165; that is because there are no heights between 155 and 165. The vertical axis is labeled count to indicate that the vertical height of the bars is the number of values that fall into the corresponding bins.

Adding a Title and Labels
You can improve this histogram by giving it a title of "Histogram of Heights" and adding a label of "Height (in cm)" to the horizontal axis. You can do this with the following command:

    h + geom_histogram(binwidth = 10) +
      ggtitle("Histogram of Heights") +
      xlab("Height (in cm)")

Relative Frequencies
Sometimes, when plotting a histogram, you want the height of the bars to show the relative frequency, or the fraction of the total number of values that fall into each bin. This requires a rather complicated command:

    h + geom_histogram(binwidth = 10, aes(y = ..count../sum(..count..))) +
      ggtitle("Histogram of Heights") +
      xlab("Height (in cm)") +
      ylab("Relative frequency")

Adding aes(y = ..count../sum(..count..)) as an argument to geom_histogram() is the code that changes the counts to relative frequencies. Adding ylab("Relative frequency") gives a reasonable label for the vertical axis. In the resulting histogram, the vertical axis is the relative frequency. (ggplot2 3.4.0 and later flag the ..count.. notation as deprecated in favor of after_stat(count); the command still runs but prints a warning.)

You can change the color of the bars and the color of the lines outlining the bars by adding arguments to geom_histogram(); the color of the bars is specified with the fill= argument, while the color of the lines is specified with the color= argument. You can create a histogram with these colors:

    h + geom_histogram(binwidth = 10, fill = "goldenrod", color = "deepskyblue4") +
      ggtitle("Histogram of Heights") +
      xlab("Height (in cm)")
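
The bin counts and relative frequencies described in this section can be checked by hand. The following is a minimal sketch, assuming the same six heights and ggplot2 3.4.0 or later. The names breaks, bin_counts, rel_freq, and p are illustrative and do not come from the lesson, and the bin edges are an assumption chosen to match the 175 to 185 bar described above. The plot uses after_stat(), the current replacement for the ..count.. notation.

```r
library(ggplot2)

# Same six heights used earlier in the lesson
height <- c(171, 192, 183, 177, 154, 176)
height_df <- data.frame(height)

# Assumed bin edges: width 10, with edges at 145, 155, ..., 195
breaks <- seq(145, 195, by = 10)

# Count the heights in each bin, then convert the counts to fractions
bin_counts <- table(cut(height, breaks = breaks))
rel_freq <- bin_counts / sum(bin_counts)
print(bin_counts)
print(rel_freq)

# Relative-frequency histogram written with after_stat()
p <- ggplot(height_df, aes(x = height)) +
  geom_histogram(binwidth = 10,
                 aes(y = after_stat(count / sum(count)))) +
  ggtitle("Histogram of Heights") +
  xlab("Height (in cm)") +
  ylab("Relative frequency")
print(p)
```

The table of counts should show three values in the (175,185] bin and none in (155,165], which agrees with the bars described above; the relative frequencies always sum to 1.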
Eruptions Histogram
You will now look at a larger data set and how it can be interpreted in a histogram. You will use the eruption times for Old Faithful. faithful is a data frame with two columns; one is labeled eruptions. You can create the histogram with the following commands:

    faithful_histogram <- ggplot(faithful, aes(x = eruptions))
    faithful_histogram + geom_histogram()

The resulting histogram has 30 bins (the default number of bins for the geom_histogram() function). You could use the binwidth= argument to change the number of bins. However, there is another way that gives even more control over the bin boundaries. I will create a vector of bin boundaries (sometimes called breaks), and pass this vector as the breaks argument to geom_histogram(). In the
following, you create bins with a width of 0.2:

    faithful_histogram + geom_histogram(breaks = seq(1.4, 5.2, by = 0.2))

Although the general shape doesn't look much different, you will notice that the counts in the second histogram go up much higher than the counts in the first histogram, because each bin is wider and so collects more values.

Box Plots
A box plot is similar to a histogram in that it summarizes data values in a visual format. It is reasonably straightforward to create a box plot using ggplot2. In this section, I will use the cars dataset that is supplied as part of R. It includes speed measurements (in miles per hour) and stopping distances (in feet) for cars measured in the 1920s. The cars dataset is in a data frame format. If you want to see the first six rows, you can use the head() function:

    head(cars)
      speed dist
    1     4    2
    2     4   10
    3     7    4
    4     7   22
    5     8   16
    6     9   10

You can use the following commands to create a box plot that represents the stopping distance. For a single box plot, set x= to empty quotes and put the variable you want to plot under y=.

    d <- ggplot(cars, aes(x = "", y = dist))
    d + geom_boxplot() + xlab("")

The box plot is created from the following values:
- Minimum: the smallest value in the vector
- 1st Quartile: the value below which one quarter of the values lie
- Median: the middle value in the vector; ½ of the values are larger and ½ of the values are smaller
- 3rd Quartile: the value below which ¾ of the values lie
- Maximum: the largest value in the vector

Remember, you can use the summary() function to compute all of these values (plus the mean). It works on data frames just as well as vectors, but you will need to specify the variable in the data frame you are interested in by using the dollar sign $ after the dataset name. So cars$dist specifies that you want a summary of the dist variable from the cars dataset.

    summary(cars$dist)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       2.00   26.00   36.00   42.98   56.00  120.00

In the lesson's figure, the plot is labeled with these values just for ease of interpretation; R will not normally label them.

Outliers
There is one value in this box plot that is an outlier. How do you know if a value is an outlier? Two ways:
1. Visually. Anything outside the "whiskers" of the plot (the lines at the end) is an outlier.
2. Mathematically. Anything more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile is an outlier.

If you were to inspect the box plot for outliers, you would notice that the point labeled "outlier" is in fact beyond the end of the top whisker. You could also calculate the IQR (3rd quartile minus 1st quartile), which gives 56 - 26 = 30 for the stopping distance data. Multiply 30 by 1.5:

    30 * 1.5 = 45

Then add this value to your third quartile to find the cutoff for upper outliers:

    45 + 56 = 101

And subtract it from the first quartile to find the cutoff for lower outliers:

    26 - 45 = -19

So anything below -19 or above 101 would be considered
outliers in your data. On the box plot, you can easily see that the point labeled as an outlier is higher than 101.

This procedure of using 1.5*IQR beyond the "box" part of the box plot is pretty standard, but it is not a hard and fast rule. Some statistical software packages use different criteria to determine whether a point is an outlier, but many packages use this rule of thumb.

The Importance of Outliers
In a small dataset, one or two outliers can really make a difference! Outliers can shift the mean of a dataset tremendously. A few high outliers could raise the mean; a few low outliers may pull it down. Consider the cars dataset: with the outlier in, the mean is 42.98. When the outlier, 120, is removed, the mean drops to 41.41. While that may not be a lot in this case, adding just one or two more upper outliers can produce a mean much higher than is representative of the data. Because outlier data is so improbable, it can bias your results. Outliers always need to be noted, but for some analyses, you may also remove them.

Normal Probability Plots
Determining whether data are normally distributed is important for many different statistics, since normality is an assumption that they require. If data are approximately normally distributed, the points in their normal probability plot will lie approximately in a straight line. On the other hand, if they are not normally distributed, the plot will not be straight. As a guideline, use the "fat pencil test": if a fat pencil placed over the graph will cover all or virtually all of the data points in the normal probability plot, then the distribution can be assumed normal.

Using ggplot2, it is straightforward to create a normal probability plot. You will use the eruption times from the faithful dataset to demonstrate this.
The following command creates the normal probability plot:

    ggplot(faithful, aes(sample = eruptions)) + geom_qq()

The name of the variable goes after sample=, and geom_qq() sets the type of plot.
The command geom_qq() is what creates this plot; normal probability plots are often called QQ (quantile-quantile) plots instead.

When you look at this plot, it is clear that the data do not fall on a straight line (it is not even close), so the eruption times in the faithful data frame do not come from a normal distribution. You can also see this by looking at the histogram of eruption times you created previously. That histogram does not have the same shape as a normal distribution: a normal distribution has one "peak" (that beautiful bell-shaped curve), while this histogram has two peaks. Distributions with two peaks are called bimodal. This is further evidence that the eruption times do not come from a normal distribution. If I were working with the faithful data, a glance at the histogram would make it obvious that the data are not normally distributed.
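
A reference line can make the "straight line" judgment easier. The sketch below is one possible approach and is not part of the lesson: it adds geom_qq_line(), a ggplot2 layer that by default draws a line through the first and third quartiles of the sample, to the same plot. The object name qq_faithful is illustrative.

```r
library(ggplot2)

# Normal probability (QQ) plot of eruption times with a reference line.
# Points that bend away from the line suggest non-normal data.
qq_faithful <- ggplot(faithful, aes(sample = eruptions)) +
  geom_qq() +
  geom_qq_line()
print(qq_faithful)
```

For the eruption times, the points should curve away from the line in an S-like pattern, which matches the two-peaked histogram.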
A Normally Distributed Example
You can create a normal probability plot from the speed of light data in the morley data set. The following command creates the normal probability plot:

    ggplot(morley, aes(sample = Speed)) + geom_qq()

In the resulting plot, the data fall more or less in a straight line, which indicates that these data have a distribution that is approximately normal. In your mind's eye, place a fat pencil over the data. Does your pencil cover all or very nearly all of the data? If your answer is "yes," you can assume that the data are close enough to being normally distributed to treat them as such.

Summary
Screening and cleaning your data is one of the most important steps a data scientist can take to ensure high-quality data. Histograms, box plots, and normal probability plots are all great visuals to use for your own internal visualization purposes, so you can get a handle on the normality and number of outliers in your data. Armed with that knowledge, you will be better able to make judgments about which statistics to use and whether the data you have is accurate. ggplot2 is a very useful package for visualizations in R that you will use over and over again, so getting familiar with the syntax of ggplot2 will speed you on your way to bigger and better things!
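
As a recap of the outlier rule from the box plot section, the 1.5 * IQR calculation can be reproduced in base R. This is a minimal sketch using the built-in cars data; the variable names (stopping_dist, q1, q3, iqr, lower_fence, upper_fence, outliers) are illustrative rather than taken from the lesson.

```r
# Stopping distances from the built-in cars dataset
stopping_dist <- cars$dist

# Quartiles and interquartile range
q1 <- unname(quantile(stopping_dist, 0.25))
q3 <- unname(quantile(stopping_dist, 0.75))
iqr <- q3 - q1

# Cutoffs ("fences") 1.5 * IQR beyond the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- stopping_dist[stopping_dist < lower_fence |
                          stopping_dist > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)

# Effect of the flagged values on the mean
print(mean(stopping_dist))
print(mean(stopping_dist[!stopping_dist %in% outliers]))
```

The fences come out to -19 and 101, the only flagged value is 120, and the mean moves from about 42.98 to about 41.41 once that value is excluded, matching the figures worked out by hand in the lesson.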