School: Texas Tech University
Course: 1330
Subject: Computer Science
Date: Dec 6, 2023
Pages: 1
Uploaded by juanchozulu31
Laboratory 15: Matplotlib for Jam!
Juans-MacBook-Pro.local
juanandreszuluqga
/Users/juanandreszuluqga/anaconda3/bin/python
3.11.4 (main, Jul  5 2023, 09:00:44) [Clang 14.0.6 ]
sys.version_info(major=3, minor=11, micro=4, releaselevel='final', serial=0)
Full name: Juan Zuluaga
R#: 11830028
Title of the notebook: Lab 15
Date: 11/13/2023
Matplotlib and Visual Display of Data
This lesson introduces the matplotlib external module package and examines how to construct line charts, scatter plots, bar charts, and histograms using methods in matplotlib and pandas.
The theory of histograms will appear in later lessons; here we only show how to construct one using matplotlib.
About `matplotlib`
Quoting from:
https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that "axes" here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis).
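As a minimal sketch of the state tracking described in the quote above (not part of the original lab; the data points and the title are made up for illustration), the cell below plots without ever naming an axes object and then asks pyplot what it has been tracking. The Agg backend is selected only so the sketch runs without a display; inside a Jupyter notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside Jupyter
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 3))   # this figure becomes the "current figure"
plt.plot([0, 1, 2], [0, 1, 4])     # pyplot creates the "current axes" and draws on it
plt.title("implicit state demo")   # decorates the same current axes

ax = plt.gca()                     # ask pyplot which axes it is tracking
same_figure = plt.gcf() is fig     # the current figure is the one created above
n_lines = len(ax.get_lines())      # one line was drawn on the current axes
title_text = ax.get_title()

print(same_figure, n_lines, title_text)
plt.close(fig)                     # release the figure when done
```

None of the plotting calls received a figure or axes argument; pyplot routed each of them to the current figure and axes on its own.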
Background
Data are not always numerical. Data can be music (audio files), places on a map (georeferenced attribute files), or images (various image files, e.g. .png, .jpeg).
Data can also be categorical, described by categories into which you can place individuals:
The individuals are cartons of ice-cream, and the category is the flavor in the carton
The individuals are professional basketball players, and the category is the player's team.
Bar Graphs
Bar charts (graphs) are good display tools to graphically represent categorical information. The bars are evenly spaced and of constant width. The height/length of each bar is proportional to the relative frequency of the corresponding category.
Relative frequency is the ratio of the number of things in the category to the number of things in the whole collection.
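The definition of relative frequency can be checked with a few lines of plain Python. This is a small sketch using the ice cream carton counts that appear in the notebook's code cells; the variable names are illustrative.

```python
# Carton counts per flavor, as used in the notebook's ice cream example
ice_cream = {'Chocolate': 16, 'Strawberry': 5, 'Vanilla': 9}

total = sum(ice_cream.values())  # size of the whole collection

# Relative frequency = count in the category / count in the whole collection
relative_frequency = {flavor: count / total for flavor, count in ice_cream.items()}

for flavor, share in relative_frequency.items():
    print(f"{flavor}: {share:.3f}")
```

The relative frequencies always sum to 1, so a bar chart of them has the same shape as a bar chart of the raw counts; only the scale of the vertical axis changes.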
The example below uses matplotlib to create a bar plot for the ice cream analogy; it is adapted from an example at https://www.geeksforgeeks.org/bar-plot-in-matplotlib/
Let's tidy up the script so it is more understandable: a small change in the import statement makes the script simpler for humans to read. The bar colors are also changed, just 'cause!
Using pandas, we can build bar charts a bit more easily.
       Flavor  Number of Cartons
0   Chocolate                 16
1  Strawberry                  5
2     Vanilla                  9
<Axes: xlabel='Flavor'>
<Axes: xlabel='Flavor'>
Example- Language Bars!
Consider the data set "data" defined as
data = {'C':20, 'C++':15, 'Java':30, 'Python':35}
which lists student count by programming language in some school.
Produce a bar chart of number of students in each language, where language is the classification, and student count is the variable.
Plot it as a horizontal bar chart:
Line Charts
A line chart or line plot or line graph or curve chart is a type of chart which displays information as a series of data points called 'markers' connected by straight line segments.
It is a basic type of chart common in many fields. It is similar to a scatter plot (below) except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments.
A line chart is often used to visualize a trend in data over intervals of time (a time series), so the line is often drawn chronologically.
The x-axis spacing is sometimes tricky, so line charts can unintentionally deceive; be careful that a line chart is the appropriate chart for your application.
Example- Speed vs Time
Consider the experimental data below
Elapsed Time (s)    Speed (m/s)
0                   0
1.0                 3
2.0                 7
3.0                 12
4.0                 20
5.0                 30
6.0                 45.6
Show the relationship between time and speed. Does the relationship indicate acceleration? How much?
From examination of the plot, estimate the speed at time t = 5.0 (eyeball estimate)
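One way to put a number on "how much" is to divide the change in speed by the change in time over each interval. The sketch below does this with the tabulated data; it is one possible approach, not necessarily the one the lab intends, and the variable names are illustrative.

```python
# Data from the Speed vs Time table
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # elapsed time (s)
speed = [0, 3, 7, 12, 20, 30, 45.6]        # speed (m/s)

# Acceleration over each interval: change in speed divided by change in time
accel = [
    (speed[i + 1] - speed[i]) / (time[i + 1] - time[i])
    for i in range(len(time) - 1)
]

# Average acceleration over the whole record
mean_accel = (speed[-1] - speed[0]) / (time[-1] - time[0])

print("interval accelerations (m/s^2):", ", ".join(f"{a:.1f}" for a in accel))
print(f"average acceleration (m/s^2): {mean_accel:.1f}")
```

The interval values increase from one interval to the next, so the acceleration is not constant. The average over the full record is about 7.6 m/s^2, the same value used as the slope in the notebook's linear model cell.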
Example- Add a linear fit
Using the same series from the Speed vs Time example, plot the speed vs time (speed on y-axis, time on x-axis) using a line plot. Plot a second line based on the linear model $y = m x + b$, where $b = 0$ and $m = 7.6$.
Example- Find a better fit
Using trial and error, try to improve the 'fit' of the model by adjusting the values of the slope $m$ and intercept $b$.
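To judge whether one trial is better than another, it helps to have a number that measures the misfit. The sketch below is an assumption-laden illustration rather than the lab's method: it scores a line by its sum of squared errors (SSE), and then swaps trial and error for the standard least-squares formulas to show what the best straight line would be. The two trial pairs are the slope and intercept values used in the notebook's code cells.

```python
# Data from the Speed vs Time table
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
speed = [0, 3, 7, 12, 20, 30, 45.6]

def sum_squared_error(slope, intercept):
    """Sum of squared differences between observed speed and the line."""
    return sum((v - (slope * t + intercept)) ** 2 for t, v in zip(time, speed))

# Ordinary least-squares slope and intercept for a straight line
n = len(time)
t_mean = sum(time) / n
v_mean = sum(speed) / n
s_tv = sum((t - t_mean) * (v - v_mean) for t, v in zip(time, speed))
s_tt = sum((t - t_mean) ** 2 for t in time)
best_slope = s_tv / s_tt
best_intercept = v_mean - best_slope * t_mean

trials = [
    ("trial 1", 7.6, 0.0),        # values from the first model cell
    ("trial 2", 7.6, -8.0),       # values from the second model cell
    ("least squares", best_slope, best_intercept),
]
for label, m, b in trials:
    print(f"{label}: m = {m:.3f}, b = {b:.3f}, SSE = {sum_squared_error(m, b):.1f}")
```

Lowering the intercept from 0 to -8 reduces the SSE from roughly 377 to roughly 153, and the least-squares line lowers it further to roughly 122. No straight line fits these data well, because the speed curves upward over time.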
Scatter Plots
A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two
variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining
the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
A scatter plot can be used either when one continuous variable is under the control of the experimenter and the other depends on it or when both continuous variables are independent. If a parameter
exists that is systematically incremented and/or decremented by the other, it is called the control parameter or independent variable and is customarily plotted along the horizontal axis. The measured or
dependent variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis and a scatter plot will illustrate only the degree of correlation
(not causation) between two variables.
A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, weight and height, weight would be on y axis and height would be on the x axis.
Correlations may be positive (rising), negative (falling), or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If
the pattern of dots slopes from upper left to lower right, it indicates a negative correlation.
A line of best fit (alternatively called 'trendline') can be drawn in order to study the relationship between the variables. An equation for the correlation between the variables can be determined by established
best-fit procedures. For a linear correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct solution in a finite time. No universal best-fit procedure is guaranteed to
generate a solution for arbitrary relationships. A scatter plot is also very useful when we wish to see how two comparable data sets agree and to show nonlinear relationships between variables.
Furthermore, if the data are represented by a mixture model of simple relationships, these relationships will be visually evident as superimposed patterns.
Scatter charts can be built in the form of bubble, marker, or/and line charts.
Much of the above is verbatim/adapted from:
https://en.wikipedia.org/wiki/Scatter_plot
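The positive and negative correlations described above can be quantified with the Pearson correlation coefficient. The sketch below computes it in plain Python; the height values are hypothetical numbers invented for illustration and are not taken from the galton_subset.csv file used in the lab.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    s_yy = sum((b - y_mean) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical heights in inches, made up for this sketch
parent = [64, 66, 68, 70, 72]
child = [66, 68, 69, 68, 69]

r = pearson_r(parent, child)
print(f"r = {r:.3f}")  # positive: the dots would slope from lower left to upper right
```

A value near +1 corresponds to a rising pattern of dots, a value near -1 to a falling pattern, and a value near 0 to no linear pattern.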
Example- Examine the dataset with heights of fathers, mothers and sons
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[13], line 1
----> 1 df = pd.read_csv('galton_subset.csv')
      2 df['child'] = df['son'] ; df.drop('son', axis=1, inplace=True) # rename son to child - got to imagine there are some daughters
      3 df.head()

File ~/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds)
    602 _validate_names(kwds.get("names", None))
    604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
    607 if chunksize or iterator:
    608     return parser

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
   1439     self.options["has_index_names"] = kwds["has_index_names"]
   1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1735, in TextFileReader._make_engine(self, f, engine)
   1733     if "b" not in mode:
   1734         mode += "b"
-> 1735 self.handles = get_handle(
   1736     f,
   1737     mode,
   1738     encoding=self.options.get("encoding", None),
   1739     compression=self.options.get("compression", None),
   1740     memory_map=self.options.get("memory_map", False),
   1741     is_text=is_text,
   1742     errors=self.options.get("encoding_errors", "strict"),
   1743     storage_options=self.options.get("storage_options", None),
   1744 )
   1745 assert self.handles is not None
   1746 f = self.handles.handle

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/common.py:856, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    851 elif isinstance(handle, str):
    852     # Check whether the filename is to be opened in binary mode.
    853     # Binary mode does not support 'encoding' and 'newline'.
    854     if ioargs.encoding and "b" not in ioargs.mode:
    855         # Encoding
--> 856         handle = open(
    857             handle,
    858             ioargs.mode,
    859             encoding=ioargs.encoding,
    860             errors=errors,
    861             newline="",
    862         )
    863     else:
    864         # Binary mode
    865         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'galton_subset.csv'
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 myfamily = plt.figure(figsize = (10, 10)) # build a square drawing canvass from figure class
----> 2 plt.scatter(son, dad, c='red') # basic scatter plot
      3 plt.show()

NameError: name 'son' is not defined
<Figure size 1000x1000 with 0 Axes>
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 3
      1 # Looks lousy, needs some labels
      2 myfamily = plt.figure(figsize = (10, 10)) # build a square drawing canvass from figure class
----> 3 plt.scatter(son, dad, c='red', label='Father') # one plot series
      4 plt.scatter(son, mom, c='blue', label='Mother') # two plot series
      5 plt.xlabel("Child's height")

NameError: name 'son' is not defined
<Figure size 1000x1000 with 0 Axes>
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/lib/python3.11/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'child'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[16], line 2
      1 # Repeat in pandas - The dataframe already is built
----> 2 df.plot.scatter(x="child", y="father")

File ~/anaconda3/lib/python3.11/site-packages/pandas/plotting/_core.py:1697, in PlotAccessor.scatter(self, x, y, s, c, **kwargs)
   1614 def scatter(self, x, y, s=None, c=None, **kwargs) -> PlotAccessor:
   1615     """
   1616     Create a scatter plot with varying marker point size and color.
   1617
   (...)
   1695     ...                       colormap='viridis')
   1696     """
-> 1697     return self(kind="scatter", x=x, y=y, s=s, c=c, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/pandas/plotting/_core.py:945, in PlotAccessor.__call__(self, *args, **kwargs)
    943 if kind in self._dataframe_kinds:
    944     if isinstance(data, ABCDataFrame):
--> 945         return plot_backend.plot(data, x=x, y=y, kind=kind, **kwargs)
    946     else:
    947         raise ValueError(f"plot kind {kind} can only be used for data frames")

File ~/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/__init__.py:71, in plot(data, kind, **kwargs)
     69         kwargs["ax"] = getattr(ax, "left_ax", ax)
     70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
     72 plot_obj.draw()
     73 return plot_obj.result

File ~/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py:452, in MPLPlot.generate(self)
    450 self._compute_plot_data()
    451 self._setup_subplots()
--> 452 self._make_plot()
    453 self._add_table()
    454 self._make_legend()

File ~/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py:1260, in ScatterPlot._make_plot(self)
   1257 else:
   1258     label = None
   1259 scatter = ax.scatter(
-> 1260     data[x].values,
   1261     data[y].values,
   1262     c=c_values,
   1263     label=label,
   1264     cmap=cmap,
   1265     norm=norm,
   1266     **self.kwds,
   1267 )
   1268 if cb:
   1269     cbar_label = c if c_is_column else ""

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 'child'
Histograms
Quoting from
https://en.wikipedia.org/wiki/Histogram
"A histogram is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson.[1] To construct a histogram, the first step is to "bin" (or "bucket") the range of values—
that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
The bins (intervals) must be adjacent, and are often (but not required to be) of equal size.
If the bins are of equal size, a rectangle is erected over the bin with height proportional to the frequency—the number of cases in each bin. A histogram may also be normalized to display "relative" frequencies.
It then shows the proportion of cases that fall into each of several categories, with the sum of the heights equaling 1.
However, bins need not be of equal width; in that case, the erected rectangle is defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the frequency but frequency
density—the number of cases per unit of the variable on the horizontal axis. Examples of variable bin width are displayed on Census bureau data below.
As the adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
Histograms give a rough sense of the density of the underlying distribution of the data, and often for density estimation: estimating the probability density function of the underlying variable. The total area of a
histogram used for probability density is always normalized to 1. If the length of the intervals on the x-axis are all 1, then a histogram is identical to a relative frequency plot.
A histogram can be thought of as a simplistic kernel density estimation, which uses a kernel to smooth frequencies over the bins. This yields a smoother probability density function, which will in general more
accurately reflect distribution of the underlying variable. The density estimate could be plotted as an alternative to the histogram, and is usually drawn as a curve rather than a set of boxes. Histograms are
nevertheless preferred in applications, when their statistical properties need to be modeled. The correlated variation of a kernel density estimate is very difficult to describe mathematically, while it is simple for a
histogram where each bin varies independently.
An alternative to kernel density estimation is the average shifted histogram, which is fast to compute and gives a smooth curve estimate of the density without using kernels.
The histogram is one of the seven basic tools of quality control.
Histograms are sometimes confused with bar charts. A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables. Some authors
recommend that bar charts have gaps between the rectangles to clarify the distinction."
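The binning step described in the quotation can be carried out by hand. The sketch below is an illustration with made-up values and an assumed range of 1 to 5 split into four equal-width bins; it counts how many values fall into each bin and then normalizes the counts to relative frequencies. Placing a value equal to the upper limit into the last bin is a convention chosen here (it matches how NumPy's histogram function treats the final bin).

```python
# Hypothetical measurements, invented for this sketch
values = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 4.6, 5.0]

low, high = 1.0, 5.0           # assumed range of the data
n_bins = 4                     # assumed number of equal-width bins
width = (high - low) / n_bins  # each bin spans one unit here

counts = [0] * n_bins
for v in values:
    index = int((v - low) / width)  # which consecutive interval v falls in
    index = min(index, n_bins - 1)  # a value equal to `high` goes in the last bin
    counts[index] += 1

# Normalized ("relative") frequencies: the heights sum to 1
relative = [c / len(values) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    left = low + i * width
    print(f"[{left:.1f}, {left + width:.1f}): count = {c}, relative = {r:.1f}")
```

Because the bins are adjacent and cover the whole range, every value is counted exactly once, which is why the relative frequencies add up to 1.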
Example- Explore the "top_movies" dataset and draw histograms for Gross and Year.
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[17], line 3
      1 import pandas as pd
----> 3 df = pd.read_csv('top_movies.csv')
      4 df.head()

File ~/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:605, in _read(filepath_or_buffer, kwds)
    602 _validate_names(kwds.get("names", None))
    604 # Create the parser.
--> 605 parser = TextFileReader(filepath_or_buffer, **kwds)
    607 if chunksize or iterator:
    608     return parser

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1442, in TextFileReader.__init__(self, f, engine, **kwds)
   1439     self.options["has_index_names"] = kwds["has_index_names"]
   1441 self.handles: IOHandles | None = None
-> 1442 self._engine = self._make_engine(f, self.engine)

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1735, in TextFileReader._make_engine(self, f, engine)
   1733     if "b" not in mode:
   1734         mode += "b"
-> 1735 self.handles = get_handle(
   1736     f,
   1737     mode,
   1738     encoding=self.options.get("encoding", None),
   1739     compression=self.options.get("compression", None),
   1740     memory_map=self.options.get("memory_map", False),
   1741     is_text=is_text,
   1742     errors=self.options.get("encoding_errors", "strict"),
   1743     storage_options=self.options.get("storage_options", None),
   1744 )
   1745 assert self.handles is not None
   1746 f = self.handles.handle

File ~/anaconda3/lib/python3.11/site-packages/pandas/io/common.py:856, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    851 elif isinstance(handle, str):
    852     # Check whether the filename is to be opened in binary mode.
    853     # Binary mode does not support 'encoding' and 'newline'.
    854     if ioargs.encoding and "b" not in ioargs.mode:
    855         # Encoding
--> 856         handle = open(
    857             handle,
    858             ioargs.mode,
    859             encoding=ioargs.encoding,
    860             errors=errors,
    861             newline="",
    862         )
    863     else:
    864         # Binary mode
    865         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'top_movies.csv'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[18], line 1
----> 1 df[["Gross"]].hist()

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
   3811     if is_iterator(key):
   3812         key = list(key)
-> 3813     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3815 # take() does not accept boolean indexers
   3816 if getattr(indexer, "dtype", None) == bool:

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
   6067 else:
   6068     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
   6072 keyarr = self.take(indexer)
   6073 if isinstance(key, Index):
   6074     # GH 42790 - Preserve name from an Index

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
   6128     if use_interval_msg:
   6129         key = list(key)
-> 6130     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6133 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['Gross'], dtype='object')] are in the [columns]"
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[19], line 1
----> 1 df[["Year"]].hist()

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
   3811     if is_iterator(key):
   3812         key = list(key)
-> 3813     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3815 # take() does not accept boolean indexers
   3816 if getattr(indexer, "dtype", None) == bool:

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
   6067 else:
   6068     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
   6072 keyarr = self.take(indexer)
   6073 if isinstance(key, Index):
   6074     # GH 42790 - Preserve name from an Index

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
   6128     if use_interval_msg:
   6129         key = list(key)
-> 6130     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6133 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['Year'], dtype='object')] are in the [columns]"
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 df[["Gross"]].hist(bins=100)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:3813, in DataFrame.__getitem__(self, key)
   3811     if is_iterator(key):
   3812         key = list(key)
-> 3813     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3815 # take() does not accept boolean indexers
   3816 if getattr(indexer, "dtype", None) == bool:

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
   6067 else:
   6068     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
   6072 keyarr = self.take(indexer)
   6073 if isinstance(key, Index):
   6074     # GH 42790 - Preserve name from an Index

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
   6128     if use_interval_msg:
   6129         key = list(key)
-> 6130     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6133 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['Gross'], dtype='object')] are in the [columns]"
This is a Matplotlib Cheat Sheet
Here are some of the resources used for creating this notebook:
"Discrete distribution as horizontal bar chart", available at https://matplotlib.org/stable/gallery/lines_bars_and_markers/horizontal_barchart_distribution.html
"Bar Plot in Matplotlib", available at https://www.geeksforgeeks.org/bar-plot-in-matplotlib/
Here are some great reads on this topic:
"Python | Introduction to Matplotlib", available at https://www.geeksforgeeks.org/python-introduction-matplotlib/
"Visualization with Matplotlib", available at https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html
"Introduction to Matplotlib — Data Visualization in Python" by Ehi Aigiomawu, available at https://heartbeat.fritz.ai/introduction-to-matplotlib-data-visualization-in-python-d9143287ae39
"Python Plotting With Matplotlib (Guide)" by Brad Solomon, available at https://realpython.com/python-matplotlib-guide/
Here are some great videos on these topics:
"Matplotlib Tutorial (Part 1): Creating and Customizing Our First Plots" by Corey Schafer, available at https://www.youtube.com/watch?v=UO98lJQ3QGI
"Intro to Data Analysis / Visualization with Python, Matplotlib and Pandas | Matplotlib Tutorial" by CS Dojo, available at https://www.youtube.com/watch?v=a9UrKTVEeZA
"Intro to Data Visualization in Python with Matplotlib! (line graph, bar chart, title, labels, size)" by Keith Galli, available at https://www.youtube.com/watch?v=DAQNHzOcO5A
Exercise: Bins, Bins, Bins!
Selecting the number of bins is an important decision when working with histograms. Are there any rules or recommendations for choosing the
number or width of bins? What happens if we use too many or too few bins?
* Make sure to cite any resources that you may use.
Explain here...
Too Few Bins: A histogram with insufficient detail may conceal underlying patterns in the data if there are too few bins used. It could simplify the distribution too much and cause crucial
information to be lost.
Too Many Bins: Using too many bins can result in a "noisy" histogram that makes it difficult to see the distribution's overall shape. Overinterpretation may result, making it more difficult to spot important trends
or patterns.
In conclusion, choosing the right number of bins for a histogram requires striking a balance between preserving key details in your data and offering an understandable, instructive visualization. When making
this choice, you should take into account the type of data you have, how it is distributed, and the objectives of your analysis.
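For reference, a few commonly quoted rules of thumb turn the sample size n into a suggested number of bins. The sketch below implements three of them as they are usually stated (Sturges' formula, the square-root choice, and the Rice rule); the function names are illustrative, and these rules are starting points to adjust by eye rather than definitive answers.

```python
import math

def sturges_bins(n):
    """Sturges' formula: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def square_root_bins(n):
    """Square-root choice: ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))

def rice_bins(n):
    """Rice rule: ceil(2 * n^(1/3))."""
    return math.ceil(2 * n ** (1 / 3))

# Compare the suggestions for a few sample sizes
for n in (30, 200, 5000):
    print(n, sturges_bins(n), square_root_bins(n), rice_bins(n))
```

The rules disagree with one another, especially for large samples, which is consistent with the conclusion above: the bin count is a judgment call that balances detail against noise.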
In [1]:
# Preamble script block to identify host, user, and kernel
import sys
! hostname
! whoami
print(sys.executable)
print(sys.version)
print(sys.version_info)
In [2]:
ice_cream = {'Chocolate':16, 'Strawberry':5, 'Vanilla':9} # build a data model
import matplotlib.pyplot # the python plotting library
flavors = list(ice_cream.keys()) # make a list object based on flavors
cartons = list(ice_cream.values()) # make a list object based on carton count -- assumes 1:1 association!
myfigure = matplotlib.pyplot.figure(figsize = (10,5)) # generate an object from the figure class, set aspect ratio
# Build the plot
matplotlib.pyplot.bar(flavors, cartons, color ='maroon', width = 0.8)
matplotlib.pyplot.xlabel("Flavors")
matplotlib.pyplot.ylabel("No. of Cartons in Stock")
matplotlib.pyplot.title("Current Ice Cream in Storage")
matplotlib.pyplot.show()
In [3]:
ice_cream = {'Chocolate':16, 'Strawberry':5, 'Vanilla':9} # build a data model
import matplotlib.pyplot as plt # the python plotting library
flavors = list(ice_cream.keys()) # make a list object based on flavors
cartons = list(ice_cream.values()) # make a list object based on carton count -- assumes 1:1 association!
myfigure = plt.figure(figsize = (10,5)) # generate an object from the figure class, set aspect ratio
# Build the plot
plt.bar(flavors, cartons, color ='orange', width = 0.8)
plt.xlabel("Flavors")
plt.ylabel("No. of Cartons in Stock")
plt.title("Current Ice Cream in Storage")
plt.show()
In [4]:
import pandas as pd
my_data = {
    "Flavor": ['Chocolate', 'Strawberry', 'Vanilla'],
    "Number of Cartons": [16, 5, 9]
}
df = pd.DataFrame(my_data)
df.head()
Out[4]:
In [5]:
df.plot.bar(x='Flavor', y='Number of Cartons', color='magenta')
Out[5]:
In [6]:
df.plot.bar(x='Flavor', y='Number of Cartons', color="red", rot=0) # rotate the category labels (rot=0 keeps them horizontal)
Out[6]:
In [7]:
# Code and run your solution here
import numpy as np
import matplotlib.pyplot as plt
# creating the dataset
data = {'C':20, 'C++':15, 'Java':30, 'Python':35}
courses = list(data.keys())
values = list(data.values())
fig = plt.figure(figsize = (10, 5))
# creating the bar plot
plt.bar(courses, values, color ='maroon', width = 0.4)
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()
In [8]:
# Code and run your solution here
# creating the dataset
data = {'C':20, 'C++':15, 'Java':30, 'Python':35}
courses = list(data.keys())
values = list(data.values())
fig = plt.figure(figsize = (10, 5))
# creating the horizontal bar plot
plt.barh(courses, values, color ='maroon', height = 0.4)
# with horizontal bars the counts run along x and the categories along y
plt.xlabel("No. of students enrolled")
plt.ylabel("Courses offered")
plt.title("Students enrolled in different courses")
plt.show()
In [9]:
# Create two lists; time and speed.
time = [0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
speed = [0, 3, 7, 12, 20, 30, 45.6]
In [10]:
# Create a line chart of speed on y axis and time on x axis
mydata = plt.figure(figsize=(10, 5))  # build a 10 x 5 drawing canvas from figure class
plt.plot(time, speed, c='red', marker='v', linewidth=1)  # basic line plot
plt.title("Speed over time")
plt.show()
In [11]:
# Code and run your solution here:
def ymodel(xmodel, slope, intercept):
    # straight-line model: y = slope * x + intercept
    y = slope * xmodel + intercept
    return y
yseries = []
slope = 7.6
intercept = 0.0
for i in range(0, len(time)):
    yseries.append(ymodel(time[i], slope, intercept))
# Plot the data (markers joined by a thin line) and the linear model
mydata = plt.figure(figsize=(10, 5))  # build a 10 x 5 drawing canvas from figure class
plt.plot(time, speed, c='red', marker='^', linewidth=0.5)  # basic line plot
plt.plot(time, yseries, c='blue')  # model line
plt.show()
In [12]:
# Code and run your solution here:
yseries = []
slope = 7.6
intercept = -8.0
for i in range(0, len(time)):
    yseries.append(ymodel(time[i], slope, intercept))
# Create a markers only line chart
mydata = plt.figure(figsize=(10, 5))  # build a 10 x 5 drawing canvas from figure class
plt.plot(time, speed, c='red', marker='^', linewidth=0)  # basic scatter plot
plt.plot(time, yseries, c='blue')  # model line
plt.show()
In [13]:
df = pd.read_csv('galton_subset.csv')
# rename son to child - got to imagine there are some daughters
df['child'] = df['son']
df.drop('son', axis=1, inplace=True)
df.head()
In [ ]:
# build some lists
dad = df['father']
mom = df['mother']
son = df['child']
In [14]:
myfamily = plt.figure(figsize=(10, 10))  # build a square drawing canvas from figure class
plt.scatter(son, dad, c='red')  # basic scatter plot
plt.show()
In [15]:
# Looks lousy, needs some labels
myfamily = plt.figure(figsize=(10, 10))  # build a square drawing canvas from figure class
plt.scatter(son, dad, c='red', label='Father')  # one plot series
plt.scatter(son, mom, c='blue', label='Mother')  # two plot series
plt.xlabel("Child's height")
plt.ylabel("Parents' height")
plt.legend()
plt.show()  # render the two plots
In [16]:
# Repeat in pandas - The dataframe already is built
df.plot.scatter(x="child", y="father")
In [ ]:
ax = df.plot.scatter(x="child", y="father", c="red", label='Father')
df.plot.scatter(x="child", y="mother", c="blue", label='Mother', ax=ax)
ax.set_xlabel("Child's height")
ax.set_ylabel("Parents' Height")
In [17]:
import pandas as pd
df = pd.read_csv('top_movies.csv')
df.head()
In [18]:
df[["Gross"]].hist()
In [19]:
df[["Year"]].hist()
In [20]:
df[["Gross"]].hist(bins=100)
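The `bins` argument sets how many equal-width intervals the data range is cut into before counting. The sketch below imitates that counting step in plain Python to show how the same values look with few or many bins. The function `bin_counts` and the `gross_millions` values are hypothetical (they are not taken from `top_movies.csv`), and the function assumes the values are not all identical.

```python
def bin_counts(values, bins):
    """Hypothetical helper: count values in `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # the largest value belongs to the last bin rather than a new one
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Made-up, right-skewed values for illustration only
gross_millions = [12, 15, 18, 22, 25, 31, 40, 55, 90, 210]

coarse = bin_counts(gross_millions, 2)
fine = bin_counts(gross_millions, 5)
print(coarse)  # [9, 1]
print(fine)    # [7, 2, 0, 0, 1]
```

With more bins the single large value separates from the rest and empty intervals appear, which is the same effect seen when `bins=100` is passed to `hist()` on skewed data.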