Entity Academy Lesson 5 Statistical Plots
Sindy Saintclair
Thursday, December 16, 2021
Lesson 5 – Statistical Plots
Learning Objectives and Questions
Notes and Answers
Installing and Using Packages
Argument Names
What is ggplot2?
R has many different systems for creating plots of data; the two most popular are R Base Graphics and a system called ggplot2. R Base Graphics is built into R; there’s no need to do anything to make these capabilities available. ggplot2 is a package that you must install. ggplot2 is part of a collection of R packages called the tidyverse. You can find more info about the tidyverse on the tidyverse website (tidyverse.org).
The Very Basics
ggplot(dataFrame, aes(x = column, y = column)) + geom_plotType()
- ggplot – always starts with ggplot
- dataFrame – must match the name of your data frame
- aes – stands for “aesthetics”
- x – bottom axis variable
- y – left axis variable
- geom_plotType() – the type of plot; additional arguments can be added inside the parentheses. Options include boxplot, line, smooth, point, path, and more!
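As a concrete instance of the template above, here is a minimal sketch that fills in the placeholders with the built-in cars data frame (columns speed and dist) and geom_point(). The choice of dataset and plot type is an assumption made for illustration; any data frame and geom would do.

```r
library(ggplot2)

# dataFrame -> cars, x -> speed, y -> dist, geom_plotType -> geom_point
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()

print(p)  # draws the scatter plot
```

Storing the plot in a variable (p here, a name chosen for this example) makes it easy to add more layers later with +.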
Changing a Variable Type for a Plot
ggplot(dataFrame, aes(x = type(column), y = column)) + geom_plotType()
- x = type – the conversion to apply; options include factor(), as.Date(), and more!
- (column) – column name in parentheses
- (column)) – additional close parenthesis, which closes aes()
Changing the Scaling
ggplot(dataFrame, aes(x = type(column), y = column)) + geom_plotType() + scale_y_log10()
- scale – the word that starts every scaling function
- _y_ – select x or y for the axis
- log10() – the transformation to apply; options include log10 and sqrt (scale_y_log10(), scale_y_sqrt()), and many more!
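The two templates above (changing a variable type and changing the scaling) can be combined in one plot. The sketch below uses the built-in mtcars data frame, which is an assumption made for this example and is not part of the lesson's own data.

```r
library(ggplot2)

# cyl is stored as a number in mtcars; factor() makes ggplot2 treat it
# as a category, so there is one box per cylinder count
p_log <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  scale_y_log10()  # log10 spacing on the y axis

print(p_log)
```

Swapping scale_y_log10() for scale_y_sqrt() applies a square-root scale instead.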
Labels
ggplot(dataFrame, aes(x = column, y = column)) +
geom_plotType() +
xlab("xLabel") +
ylab("yLabel") +
ggtitle("GraphTitle")
Label text must be in parentheses and quotes.
Plotting with Color
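ggplot(dataFrame, aes(x = column, y = column)) +
geom_plotType(aes(color = column)) +
xlab("xLabel") +
ylab("yLabel") +
ggtitle("GraphTitle")
- color – argument name; use within aes() when the color comes from a column
- column – an additional factor variable; colors are assigned automatically, one per level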
Plotting with Size
ggplot(dataFrame, aes(x = column, y = column)) +
geom_plotType(aes(size = column)) +
xlab("xLabel") +
ylab("yLabel") +
ggtitle("GraphTitle")
- size – argument name; use within aes()
- column – an additional variable; sizes are assigned automatically
Plotting with Transparency
ggplot(dataFrame, aes(x = column, y = column)) +
geom_plotType(aes(alpha = column)) +
xlab("xLabel") +
ylab("yLabel") +
ggtitle("GraphTitle")
- alpha – argument name; use within aes()
- column – an additional variable; transparencies are assigned automatically
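The color, size, and transparency templates can all be used in a single geom. The following is a sketch only, using the built-in mtcars data frame; the dataset, the column choices, and the label text are assumptions made for illustration. Size and alpha are mapped to numeric columns here, which ggplot2 handles without complaint.

```r
library(ggplot2)

p_aes <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl),  # one color per cylinder count
                 size = hp,            # bigger points for more horsepower
                 alpha = qsec)) +      # transparency varies with qsec
  xlab("Weight (1000 lbs)") +
  ylab("Miles per gallon") +
  ggtitle("Fuel economy by weight")

print(p_aes)
```

ggplot2 adds a legend for each mapped aesthetic automatically.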
Plotting with Multiple Graphs
- From the “gridExtra” package
- grid.arrange(plot1, plot2, ncol = 1)
- plot1, plot2 – named ggplots already created
- ncol – argument name; the number indicates vertical or horizontal layout (ncol = 1 stacks the plots vertically)
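A minimal sketch of grid.arrange() in use, assuming the gridExtra package is installed. The two plots are built from the built-in cars data frame, and the names plot1, plot2, and stacked are chosen for this example.

```r
library(ggplot2)
library(gridExtra)

plot1 <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
plot2 <- ggplot(cars, aes(x = "", y = dist)) + geom_boxplot()

# ncol = 1 stacks the plots; ncol = 2 would place them side by side
stacked <- grid.arrange(plot1, plot2, ncol = 1)
```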
Installing ggplot2
The first thing to do in RStudio is to install the ggplot2 package. Click on the Packages tab. Click on the Install button. This will bring up a dialog box that looks like this:
Type ggplot2 into the “Packages” field, then click on the Install button. It will take a few minutes to install ggplot2 and the other packages needed to support it.
Another way to install packages is to use the function install.packages() and place the name of the package you are trying to install in the parentheses. For instance, to install ggplot2 without the use of the RStudio controls, you would do the following:
install.packages("ggplot2")
The end result is the same, however, no matter which method you employ.
The last thing to do to make ggplot2 available for use is to
type the following command in the Console pane:
library("ggplot2")
If your Packages tab is still visible, you should be able to find ggplot2 in the list of packages, and the box by it should be checked.
Installing Other Packages
You’ll need many other packages throughout this course,
and the best thing to do is download all of them now! This
way, you won’t end up with any weird versioning issues.
Please make sure you run this code in R:
install.packages("ggplot2")
install.packages("datasets")
install.packages("readxl")
install.packages("dplyr")
install.packages("PerformanceAnalytics")
install.packages("corrplot")
install.packages("gapminder")
install.packages("gridExtra")
install.packages("Ecdat")
install.packages("corpcor")
install.packages("GPArotation")
install.packages("psych")
install.packages("IDPmisc")
install.packages("lattice")
install.packages("treetop")
install.packages("scales")
install.packages("rcompanion")
install.packages("gmodels")
install.packages("car")
install.packages("caret")
install.packages("gvlma")
install.packages("predictmeans")
install.packages("magrittr")
install.packages("tidyr")
install.packages("lmtest")
install.packages("popbio")
install.packages("e1071")
install.packages("data.table")
install.packages("effects")
install.packages("multcomp")
install.packages("mvnormtest")
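If you rerun the installation code later, it can be convenient to install only what is missing. The sketch below shows one way to do that; missing_packages() is a hypothetical helper written for this example, not a function from R or from the course materials.

```r
# Return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Demonstrated with two packages that ship with R; swap in the course list
to_install <- missing_packages(c("stats", "utils"))
if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

Passing the full list of course packages instead of c("stats", "utils") would install only the ones not already present.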
Histograms
As learned previously, a histogram provides a visual representation of the distribution of values in a sample. Because it is silly to graph each value by itself, the data get broken into bins. Binning takes the range of the data and divides it into equal segments; each data point then falls into exactly one of these segments, or bins. Typically, the number of values that fall into each bin is counted, and this count is used to create a bar graph in which the height of each bar is proportional to the number of values that fall into that bin.
Suppose I want to make a histogram of the height data I used previously. I will first create the vector of heights and assign it to the variable height:
height <- c(171, 192, 183, 177, 154, 176)
You then create a data frame from height:
height_df <- data.frame(height)
You create a histogram with these commands:
h <- ggplot(height_df, aes(x = height))
h + geom_histogram()
The aes() function is where you specify your variables. You want the variable height, from the data frame height_df. These commands create a histogram that looks like this:
Adding in Bins
This also creates a warning: “stat_bin() using bins = 30. Pick better value with binwidth.” This histogram has a vertical bar for each height value. Visually, this is not very satisfying. You can make a much more informative histogram by setting the width of each bin to 10. You accomplish this with the binwidth= argument.
h + geom_histogram(binwidth = 10)
As you can see, the histogram is just a special kind of bar chart. The horizontal variable is the height in centimeters. Each bar in the chart is associated with a lower value and an upper value on the horizontal axis.
For example, the bar of height three in the middle of the chart has a lower value of 175 and an upper value of 185. The height of each bar is the number of data values that fall between its lower and upper values.
If you look at the height data, there are three values that
fall between 175 and 185. These lower and upper values define the bin for each bar; the height of the bar represents the number of data values that fall into the bin.
There is no bar (or you can think of it as a bar that has zero height) corresponding to the bin for values between 155 and 165; that is because there are no heights that are between 155 and 165.
The vertical axis is labeled count to indicate that the vertical height of the bars is the number of values that fall into the corresponding bins.
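The bin counts described above can be checked without drawing anything. The sketch below uses base R only; the bin edges (145, 155, ..., 195) are an assumption chosen to match the bins the text describes, such as 155 to 165 and 175 to 185.

```r
height <- c(171, 192, 183, 177, 154, 176)

# Bin edges 145, 155, ..., 195: each bin is 10 cm wide
edges <- seq(145, 195, by = 10)

# cut() assigns each height to its bin; table() counts heights per bin
bin_counts <- table(cut(height, breaks = edges))
print(bin_counts)
```

The bin from 175 to 185 holds three values (176, 177, and 183) and the bin from 155 to 165 holds none, which matches the bars in the histogram.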
Adding a Title and Labels
You can improve this histogram by giving it a title of “Histogram of Heights” and adding a label to the horizontal axis of “Height (in cm)”. You can do this with the following command:
h + geom_histogram(binwidth = 10) +
ggtitle("Histogram of Heights") +
xlab("Height (in cm)")
This results in the histogram below:
Relative Frequencies
Sometimes, when plotting a histogram, you want the height of the bars to be related to the relative frequency, or the fraction of the total number of values that fall into each bin. This requires a rather complicated command:
h + geom_histogram(binwidth = 10, aes(y = ..count../sum(..count..))) +
ggtitle("Histogram of Heights") +
xlab("Height (in cm)") +
ylab("Relative frequency")
Adding aes(y = ..count../sum(..count..)) as an argument to geom_histogram() is the code that changes the counts to relative frequency. Adding ylab("Relative frequency") gives a reasonable label for the vertical axis.
These commands give the following histogram. The vertical axis is the relative frequency.
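In newer releases of ggplot2 the ..count.. notation is deprecated (since version 3.4.0) in favor of after_stat(), which has been available since version 3.3.0. The sketch below draws the same relative-frequency histogram with the newer notation, assuming such a version is installed; the name rel_hist is chosen for this example.

```r
library(ggplot2)

height_df <- data.frame(height = c(171, 192, 183, 177, 154, 176))

# after_stat(count) refers to the per-bin count computed by the histogram
rel_hist <- ggplot(height_df, aes(x = height)) +
  geom_histogram(binwidth = 10,
                 aes(y = after_stat(count / sum(count)))) +
  ggtitle("Histogram of Heights") +
  xlab("Height (in cm)") +
  ylab("Relative frequency")

print(rel_hist)
```

Because each bar is a count divided by the total, the bar heights add up to 1.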
You can change the color of the bars and the color of the lines outlining the bars by adding arguments to geom_histogram(); the color of the bars is specified by adding a fill= argument, while the color of the lines is specified by adding a color= argument. You can create a histogram with these colors:
h + geom_histogram(binwidth = 10, fill = "goldenrod", color = "deepskyblue4") +
ggtitle("Histogram of Heights") +
xlab("Height (in cm)")
Eruptions Histogram
You will now look at a larger data set and how it can be interpreted in a histogram. You will use the eruption times for Old Faithful. faithful is a data frame with two columns; one is labeled eruptions. You can create the histogram with the following commands:
faithful_histogram <- ggplot(faithful, aes(x = eruptions))
faithful_histogram + geom_histogram()
This gives the following histogram:
This histogram has 30 bins (the default number of bins for the geom_histogram() function). You could use the binwidth= argument to change the number of bins. However, there is another way that gives even more control over the bin boundaries. I will create a vector of bin boundaries (sometimes called breaks), and pass this vector as the breaks argument to geom_histogram(). In the
following, you create bins with a width of 0.2:
faithful_histogram + geom_histogram(breaks = seq(1.4, 5.2, by = 0.2))
Although the general shape doesn’t look much different, you will notice that the counts in the second histogram go up much higher than the counts in the first histogram. The bins are now wider, so each one collects more eruptions.
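To see how many bins the break points define and how many eruptions land in each, the same breaks can be passed to base R's hist() with plotting turned off. This is a sketch for checking the numbers, not part of the ggplot2 workflow; the names breaks and binned are chosen for this example.

```r
breaks <- seq(1.4, 5.2, by = 0.2)
length(breaks) - 1  # number of bins defined by the break points

# hist(plot = FALSE) returns the per-bin counts without drawing anything
binned <- hist(faithful$eruptions, breaks = breaks, plot = FALSE)
print(binned$counts)
```

Twenty break points define 19 bins, compared with the default of 30, so the same eruptions are spread over fewer bars.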
Box Plots
A box plot is similar to a histogram in that it summarizes data values in a visual format.
It is reasonably straightforward to create a box plot using ggplot2. In this section, I will use the cars dataset that is supplied as part of R. It includes speed measurements (in miles per hour) and stopping distance (in feet) for cars measured in the 1920s. The cars dataset is in a data frame format. If you want to see the first six rows, you can use the head() function:
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
You can use the following commands to create a box plot that represents the stopping distance. A single box plot has no x variable, so x= is set to empty quotes, and the variable you want to plot goes under y=.
d <- ggplot(cars, aes(x = "", y = dist))
d + geom_boxplot() + xlab("")
This produces the following box plot.
The box plot is created from the following values, which summary() also reports:
- Minimum: the smallest value in the vector
- 1st Quartile: the value below which one quarter of the values lie
- Median: the middle value in the vector: ½ of the values are larger and ½ of the values are smaller
- 3rd Quartile: the value below which ¾ of the values lie
- Maximum: the largest value in the vector
Remember, you can use the summary() function to compute all of these values (plus the mean). It works on data frames just as well as vectors, but you will need to specify the variable in the data frame you are interested in using the dollar sign $ after the dataset name. So cars$dist specifies that you want a summary of the dist variable from the cars dataset.
summary(cars$dist)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   26.00   36.00   42.98   56.00  120.00
The plot below is labeled with these values just for ease of interpretation; R will not normally label them.
Outliers
There is one value in this box plot that is an outlier. How do you know if a value is an outlier? Two ways:
1. Visually. Anything outside the “whiskers” of the plot (the lines at the ends) is an outlier.
2. Mathematically. Anything more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile is an outlier.
If you were to inspect the plot above for outliers, you would notice that the point labeled “outlier” is in fact beyond the end of the top whisker. You could also calculate the IQR (3rd quartile minus 1st quartile) to come up with an IQR value of 30 for the stopping distance data. Multiply 30 by 1.5:
30 * 1.5 = 45
Then add this value to your third quartile to find the cutoff for upper outliers:
45 + 56 = 101
And subtract it from the first quartile to find the cutoff for lower outliers:
26 - 45 = -19
So anything below -19 or above 101 would be considered
outliers in your data. On the chart above, you can easily see that the point labeled an outlier is higher than 101.
This procedure of using 1.5*IQR beyond the “box” part of the box plot is pretty standard, but it is not a hard and fast rule. There are statistical software packages out there that will use different criteria to determine whether a point is an outlier or not, but many packages use this rule of thumb.
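The 1.5*IQR arithmetic above can be reproduced in a few lines of base R. This is a minimal sketch; the variable names (q1, q3, lower_cutoff, upper_cutoff, outliers) are chosen for this example.

```r
dist <- cars$dist

q1 <- quantile(dist, 0.25, names = FALSE)  # first quartile
q3 <- quantile(dist, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1

lower_cutoff <- q1 - 1.5 * iqr
upper_cutoff <- q3 + 1.5 * iqr

# Values beyond either cutoff are flagged as outliers
outliers <- dist[dist < lower_cutoff | dist > upper_cutoff]
print(c(lower = lower_cutoff, upper = upper_cutoff))
print(outliers)
```

For the stopping distances this gives cutoffs of -19 and 101, and the only value flagged is 120.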
The Importance of Outliers
In a small dataset, one or two outliers can really make a difference! Outliers can shift the mean of a dataset tremendously. A few high outliers could raise the mean; a few low outliers may pull it down. Consider the cars dataset: with the outlier in, the mean is 42.98. When the outlier of 120 is removed, the mean drops to 41.41. While that may not be a lot in this case, adding just one or two more upper outliers can produce a mean much higher than it should be. Because outlier values are so improbable, they can bias your results. Outliers always need to be noted, and for some analyses you may also remove them.
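The effect described above is easy to verify. The sketch below computes the mean of the stopping distances with and without the outlier; the median is included as a comparison, which is an addition made for this example and not part of the lesson text.

```r
dist <- cars$dist
trimmed <- dist[dist != 120]  # drop the single outlier

mean_all <- mean(dist)
mean_trimmed <- mean(trimmed)
print(round(c(with_outlier = mean_all, without_outlier = mean_trimmed), 2))

# The median is far less sensitive to a single extreme value
print(c(with_outlier = median(dist), without_outlier = median(trimmed)))
```

The means come out as 42.98 and 41.41, while the median is 36 in both cases.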
Normal Probability Plots
Determining whether data are normally distributed is important for many different statistics, since it is an assumption that must be met.
If data are approximately normally distributed, the points in their normal probability plot will lie approximately in a straight line. On the other hand, if they are not normally distributed, the plot will not be straight. As a guideline, use the “fat pencil test”: if a fat pencil placed over the graph will cover all or virtually all of the data points in the normal probability plot, then the distribution can be assumed normal.
Using ggplot2, it is straightforward to create a normal probability plot. You will use the eruption times from the faithful dataset to demonstrate this. The following command creates the normal probability plot:
ggplot(faithful, aes(sample = eruptions)) + geom_qq()
The name of the variable goes after sample=, and geom_qq() sets the type of plot.
*The command geom_qq() is what creates this plot; normal probability plots are also often called QQ (quantile-quantile) plots, which is where the name comes from.
When you look at this plot, it is clear that the data do not fall on a straight line (it is not even close), so the eruption times in the faithful data frame do not come from a normal distribution. You can also see this by looking at the histogram of eruption times you created previously; this histogram is repeated below:
This histogram does not have the same shape as a normal distribution; a normal distribution has one “peak” (that beautiful bell-shaped curve), while this histogram has two peaks. Distributions with two peaks are called bimodal. This is further evidence that the eruption times do not come from a normal distribution.
If I were working with the faithful data, a glance at the histogram would make it obvious that the data are not normally distributed.
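Judging straightness is easier with a reference line on the plot. ggplot2 provides geom_qq_line() for this; the sketch below adds it to the eruption-times plot. Adding the line is a suggestion made here, not a step from the lesson, and the name qq_faithful is chosen for this example.

```r
library(ggplot2)

qq_faithful <- ggplot(faithful, aes(sample = eruptions)) +
  geom_qq() +
  geom_qq_line()  # reference line through the first and third quartiles

print(qq_faithful)
```

Points that bend away from the line, as the eruption times do, indicate a departure from normality.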
A Normally Distributed Example
You can create a normal probability plot from the speed of light data in the morley data set. The following command creates the normal probability plot:
ggplot(morley, aes(sample = Speed)) + geom_qq()
This creates the following normal probability plot:
The data fall more or less in a straight line, so this indicates that these data have a distribution that is approximately normal. In your mind’s eye, place a fat pencil over the data. Does your pencil cover all or very nearly all of the data? If your answer is “yes,” you can assume that the data are close enough to being normally distributed to treat them as such.
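The fat pencil test is a visual judgment, but a rough numeric companion is the correlation between the data and the normal quantiles they are plotted against: the closer the points are to a straight line, the closer that correlation is to 1. The sketch below is an illustration only, not a formal normality test, and qq_correlation() is a hypothetical helper written for this example.

```r
# Correlation between the data and the matching normal quantiles;
# values near 1 mean the QQ points lie close to a straight line
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # qq$x: normal quantiles, qq$y: data
  cor(qq$x, qq$y)
}

r_morley <- qq_correlation(morley$Speed)
r_faithful <- qq_correlation(faithful$eruptions)
print(c(morley = r_morley, faithful = r_faithful))
```

The morley speeds should score closer to 1 than the bimodal eruption times, in line with what the two plots show.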
Summary
Screening and cleaning your data is one of the most important steps a data scientist can take to ensure high quality data. Histograms, box plots, and normal probability plots are all great visuals to use for your own internal visualization purposes, so you can get a handle on the normality and number of outliers in your data. Armed with that knowledge, you will be better able to make judgments about which statistics to use and whether the data you have are accurate.
ggplot2 is a very useful package for visualizations in R that you will use over and over again, so getting familiar with the syntax of ggplot2 will speed you on your way to bigger and better things!