Lab_Assignment_2 - 2
DS200: Introduction to Data Sciences
Lab Assignment 2: Loops, Tables, Visualizations (2 points)
First, let's import the Python modules needed for this assignment. Remember that you need to import the modules again every time you restart your kernel, runtime, or session.
In [1]:
from datascience import *
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np
Part 1: Random Choice (Chapter 9)
NumPy has a function `np.random.choice(...)` that can be used to pick one item at random from a given array. It is equally likely to pick any of the items in the array. Here is an example. Imagine that one day, when you get home after a long day, you see a hot bowl of nachos waiting on the dining table! Let's say that whenever you take a nacho from the bowl, it will have only cheese, only salsa, both cheese and salsa, or neither cheese nor salsa (a sad tortilla chip indeed). Let's simulate taking nachos from the bowl at random using `np.random.choice(...)`.
Run the cell below three times, and observe how the results may differ between these
runs.
In [2]:
nachos = make_array('cheese', 'salsa', 'both', 'neither')
np.random.choice(nachos)
Out[2]:
'cheese'
In [3]:
np.random.choice(nachos)
Out[3]:
'cheese'
In [4]:
np.random.choice(nachos)
Out[4]:
'both'
To repeat this process multiple times, pass in an int `n` as the second argument. By default, `np.random.choice` samples with replacement and returns an array of items.
Run the next cell to see an example of sampling with replacement 10 times from the `nachos` array.
In [5]:
np.random.choice(nachos, 10)
Out[5]:
array(['salsa', 'cheese', 'salsa', 'neither', 'salsa', 'cheese', 'both',
'both', 'neither', 'cheese'],
dtype='<U7')
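As a small sketch of the two sampling modes, the cell below fixes a seed so the draws are repeatable and contrasts the default (with replacement) against `replace=False`. It uses plain `np.array` instead of `make_array` so that it runs with NumPy alone; the seed value and the variable names are illustrative assumptions, not part of the lab.

```python
import numpy as np

nachos = np.array(['cheese', 'salsa', 'both', 'neither'])

# Fixing the seed makes the "random" draws repeatable (the seed value is arbitrary).
np.random.seed(0)

# Default behaviour: sampling with replacement, so items can repeat.
with_replacement = np.random.choice(nachos, 10)

# replace=False draws each item at most once, so the size cannot exceed len(nachos).
without_replacement = np.random.choice(nachos, 4, replace=False)

print(with_replacement)
print(without_replacement)
```

Without replacement, the four draws are always a reordering of the four nacho types.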
Next, let's use `np.random.choice` to simulate one roll of a fair die. The following code cell simulates rolling a die once and records the number of spots in a variable `x`. You can run it multiple times to see how variable the results are.
In [6]:
x = np.random.choice(np.arange(1, 7))
x
Out[6]:
6
Problem 1: Rolling a Fair Die 10 Times (0.25 points)
Write an expression that rolls a die 10 times and returns the results in an array.
In [7]:
# write code for Problem 1 in this cell
x = np.random.choice(np.arange(1, 7), 10)
x
Out[7]:
array([6, 3, 1, 5, 5, 3, 4, 6, 2, 5])
Part 2: Python Loops (Chapter 9.2)
Iteration
It is often the case in programming – especially when dealing with randomness – that
we want to repeat a process multiple times. For example, let's consider the game of
betting on one roll of a die with the following rules:
• If the die shows 1 or 2 spots, my net gain is -1 dollar.
• If the die shows 3 or 4 spots, my net gain is 0 dollars.
• If the die shows 5 or 6 spots, my net gain is 1 dollar.
The function `bet_on_one_roll` takes no arguments. Each time it is called, it simulates one roll of a fair die and returns the net gain in dollars.
In [8]:
def bet_on_one_roll():
    """Returns my net gain on one bet"""
    # roll a die once and record the number of spots
    x = np.random.choice(np.arange(1, 7))
    if x <= 2:
        return -1
    elif x <= 4:
        return 0
    elif x <= 6:
        return 1
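To check the betting rules without any randomness, the sketch below separates the payoff rule from the roll. `net_gain` is a hypothetical helper name introduced here for illustration; the lab itself only defines `bet_on_one_roll`.

```python
import numpy as np

def net_gain(spots):
    """Map the number of spots on one roll to the net gain in dollars."""
    if spots <= 2:
        return -1
    elif spots <= 4:
        return 0
    else:
        return 1

def bet_on_one_roll():
    """Roll a fair die once and return the net gain."""
    return net_gain(np.random.choice(np.arange(1, 7)))

# Each payoff covers two of the six faces, so the three outcomes are equally likely.
gains = [net_gain(spots) for spots in range(1, 7)]
print(gains)
```

Keeping the rule in its own function makes it possible to test every face of the die directly instead of waiting for it to come up at random.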
Playing this game once is easy:
In [9]:
bet_on_one_roll()
Out[9]:
-1
To get a sense of how variable the results are, we have to play the game over and over
again. We could run the cell repeatedly, but that's tedious, and if we wanted to do it a
thousand times or a million times, forget it.
A more automated solution is to use a `for` statement to loop over the contents of a sequence. This is called iteration. A `for` statement begins with the word `for`, followed by a name we want to give each item in the sequence, followed by the word `in`, and ending with an expression that evaluates to a sequence. The indented body of the `for` statement is executed once for each item in that sequence. The code cell below gives an example.
In [10]:
for animal in make_array('cat', 'dog', 'rabbit'):
    print(animal)
cat
dog
rabbit
It is helpful to write code that exactly replicates a `for` statement, without using the `for` statement. This is called unrolling the loop.
A `for` statement simply replicates the code inside it, but before each iteration, it assigns a new value from the given sequence to the name we chose. For example, here is an unrolled version of the loop above.
In [11]:
animal = make_array('cat', 'dog', 'rabbit').item(0)
print(animal)
animal = make_array('cat', 'dog', 'rabbit').item(1)
print(animal)
animal = make_array('cat', 'dog', 'rabbit').item(2)
print(animal)
cat
dog
rabbit
Notice that the name `animal` is arbitrary, just like any name we assign with `=`.
Here we use a `for` statement in a more realistic way: we print the results of betting five times on the die as described earlier. This is called simulating the results of five bets. We use the word simulating to remind ourselves that we are not physically rolling dice and exchanging money but using Python to mimic the process.
To repeat a process `n` times, it is common to use the sequence `np.arange(n)` in the `for` statement. It is also common to use a very short name for each item. In our code we will use the name `i` to remind ourselves that it refers to an item.
In [12]:
for i in np.arange(5):
    print(bet_on_one_roll())
1
1
-1
1
0
In this case, we simply perform exactly the same (random) action several times, so the code in the body of our `for` statement does not actually refer to `i`.
The iteration variable `i` can be used in the indented body of a loop. The code cell below gives an example.
In [13]:
nums = np.arange(5)
sum_nums = 0
for i in nums:
    sum_nums = sum_nums + i
print('Sum of the first four positive integers is: ' + str(sum_nums))
Sum of the first four positive integers is: 10
Problem 2A: Iterating over a Custom Array (0.25 points)
Create an array that contains the items "Apple", "Banana", "Kiwi", and "Orange". Write a `for` loop to iterate over the items in the array and print them one by one.
In [14]:
# write code for Problem 2A in this cell
fruits = make_array('Apple', 'Banana', 'Kiwi', 'Orange')
for item in fruits:
    print(item)
Apple
Banana
Kiwi
Orange
Augmenting Arrays
While the `for` statement above does simulate the results of five bets, the results are simply printed and are not in a form that we can use for computation. An array of results would be more useful. Thus a typical use of a `for` statement is to create an array of results, by augmenting the array each time.
The `append` function in NumPy helps us do this. The call `np.append(array_name, value)` evaluates to a new array that is `array_name` augmented by `value`. When you use `append`, keep in mind that all the entries of an array must have the same type.
In [15]:
pets = make_array('Cat', 'Dog')
np.append(pets, 'Another Pet')
Out[15]:
array(['Cat', 'Dog', 'Another Pet'],
dtype='<U11')
This keeps the array `pets` unchanged:
In [16]:
pets
Out[16]:
array(['Cat', 'Dog'],
dtype='<U3')
But often while using `for` loops it is convenient to update an array when augmenting it. Because `np.append` returns a new array rather than changing the original, this is done by assigning the augmented array to the same name as the original.
In [17]:
pets = np.append(pets, 'Another Pet')
pets
Out[17]:
array(['Cat', 'Dog', 'Another Pet'],
dtype='<U11')
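A short sketch of this behaviour follows, using plain `np.array` in place of `make_array` so that it runs with NumPy alone (an assumption made only for self-containment). It also shows that appending an integer to an empty array yields a floating-point array, which is why results collected this way print as `1.` rather than `1`.

```python
import numpy as np

pets = np.array(['Cat', 'Dog'])
longer = np.append(pets, 'Another Pet')

# np.append returns a new array; the original still has two items.
print(len(pets), len(longer))

# Assigning the result back to the same name is what "updates" pets.
pets = np.append(pets, 'Another Pet')
print(len(pets))

# An empty array is floating point by default, so appended integers become floats.
outcomes = np.append(np.array([]), 1)
print(outcomes.dtype)
```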
Problem 2B: Creating a New Array by Augmenting (0.25 points)
Use `np.append` to create an array with the letters A through E, by adding the letters one by one to the array, starting from an empty array.
In [18]:
# write code for Problem 2B in this cell
alphabets = make_array()
alphabets = np.append(alphabets, 'A')
alphabets = np.append(alphabets, 'B')
alphabets = np.append(alphabets, 'C')
alphabets = np.append(alphabets, 'D')
alphabets = np.append(alphabets, 'E')
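The five repeated statements above can also be written as a loop. The following is one possible sketch, not the required answer; the name `letters` is illustrative, and the empty array is created with `dtype=str` here so that the result does not depend on how NumPy converts an empty floating-point array to strings.

```python
import numpy as np

# Start from an empty array of strings.
letters = np.array([], dtype=str)

# Append one letter per iteration, reassigning the name each time.
for letter in ['A', 'B', 'C', 'D', 'E']:
    letters = np.append(letters, letter)

print(letters)
```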
Example: Betting on 5 Rolls
We can now simulate five bets on the die and collect the results in an array that we will call the collection array. We will start out by creating an empty array for this, and then append the outcome of each bet. Notice that the body of the `for` loop contains two statements. Both statements are executed for each item in the given sequence.
In [19]:
outcomes = make_array()
for i in np.arange(5):
    outcome_of_bet = bet_on_one_roll()
    outcomes = np.append(outcomes, outcome_of_bet)
outcomes
Out[19]:
array([ 0.,  1.,  1.,  1.,  1.])
As shown in the example above, the indented body of a `for` statement can contain multiple statements, and each of the statements is executed for each item in the given sequence.
By capturing the results in an array, we have given ourselves the ability to use array functions to do computations. For example, we can use `np.count_nonzero` to count the number of times money changed hands.
In [20]:
np.count_nonzero(outcomes)
Out[20]:
4
Betting on 300 Rolls
Iteration is a powerful technique. For example, we can see the variation in the results of 300 bets by running `bet_on_one_roll` 300 times instead of five.
In [21]:
outcomes = make_array()
for i in np.arange(300):
    outcome_of_bet = bet_on_one_roll()
    outcomes = np.append(outcomes, outcome_of_bet)
outcomes
Out[21]:
array([ 1.,  0.,  0.,  1.,  1.,  1., -1.,  0., -1.,  1., -1.,  1.,  1.,
        1., -1.,  1.,  1.,  1.,  0.,  0.,  1., -1., -1.,  1.,  0.,  0.,
        0., -1.,  0.,  1.,  0., -1., -1.,  1.,  0.,  0.,  1.,  0., -1.,
        0.,  0., -1.,  1.,  1.,  0.,  1., -1., -1.,  1.,  1., -1., -1.,
        0.,  0.,  1., -1.,  1.,  1.,  1.,  0., -1.,  1.,  1., -1.,  0.,
        1.,  1., -1.,  0.,  1.,  0., -1.,  0.,  0.,  0.,  1.,  0.,  1.,
        0.,  1.,  0.,  1., -1., -1., -1.,  0., -1.,  1.,  1., -1., -1.,
       -1.,  0.,  1.,  0., -1., -1., -1.,  1.,  0.,  1.,  1.,  0., -1.,
       -1., -1.,  1.,  0.,  1., -1., -1.,  1., -1., -1., -1.,  0.,  1.,
       -1., -1.,  1.,  1., -1., -1., -1.,  0.,  1., -1., -1., -1.,  1.,
        1.,  0.,  1.,  0.,  1.,  1.,  1., -1.,  1.,  0.,  1., -1.,  0.,
        0.,  1., -1.,  0.,  1.,  0.,  0.,  1.,  1., -1.,  0.,  1.,  1.,
        0.,  1., -1.,  1., -1.,  0., -1., -1., -1., -1., -1., -1., -1.,
        1., -1., -1.,  1., -1.,  1.,  1., -1.,  1.,  0.,  1.,  1.,  1.,
        0., -1.,  0.,  1., -1., -1., -1.,  0.,  0.,  1.,  0.,  0.,  0.,
        0., -1.,  1., -1.,  1., -1.,  0., -1.,  1.,  0.,  1.,  0.,  1.,
       -1.,  0., -1.,  0.,  0.,  1.,  1., -1., -1., -1.,  0., -1., -1.,
        0., -1.,  0.,  1., -1.,  1., -1.,  1.,  0.,  1., -1.,  1.,  1.,
        0.,  1.,  0.,  0.,  1., -1., -1.,  1.,  0., -1.,  1.,  0.,  1.,
        1.,  0.,  0.,  0.,  0.,  1.,  0., -1.,  0.,  1.,  1.,  1., -1.,
        0.,  1.,  1., -1.,  0.,  1.,  0.,  1.,  0.,  1., -1.,  0.,  1.,
        1.,  1., -1., -1.,  0.,  1.,  1.,  0.,  0., -1.,  0.,  1.,  0.,
       -1.,  0., -1.,  1.,  1., -1.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,
        1.])
The array `outcomes` now contains the results of all 300 bets.
In [22]:
len(outcomes)
Out[22]:
300
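Calling `np.append` inside a loop copies the whole array on every iteration, which is fine for 300 bets but slows down as the number of repetitions grows. As an alternative sketch (not the approach this lab asks for), the same simulation can be done in one step by drawing all the rolls at once and applying the betting rule to the whole array. The seed and the use of `np.where` are choices made here for illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only so the run is repeatable

# Roll the die 300 times in one call instead of looping.
rolls = np.random.choice(np.arange(1, 7), 300)

# Same betting rule as bet_on_one_roll, applied to the whole array at once:
# 1-2 spots -> -1, 3-4 spots -> 0, 5-6 spots -> +1.
outcomes = np.where(rolls <= 2, -1, np.where(rolls <= 4, 0, 1))

print(len(outcomes))
print(np.count_nonzero(outcomes) / len(outcomes))
```

Unlike the loop version, this array holds integers rather than floats, because it never starts from an empty floating-point array.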
In [23]:
for i in np.arange(5):
    print(i)
0
1
2
3
4
Problem 2C: Probability of Non-Zero Gain (0.25 points)
Write an expression that uses the 300 simulated outcomes to estimate the probability
that our net gain for a bet is not zero.
In [24]:
# write code for Problem 2C in this cell
outcomes = make_array()
for i in np.arange(300):
    outcome_of_bet = bet_on_one_roll()
    outcomes = np.append(outcomes, outcome_of_bet)
probability = np.count_nonzero(outcomes) / 300
print(probability)
0.6533333333333333
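The simulated estimate can be compared with the exact probability. Under the betting rules, four of the six equally likely faces (1, 2, 5, and 6) give a non-zero gain, so the exact value is 4/6 = 2/3. The sketch below works this out with the standard-library `fractions` module; the variable names are illustrative.

```python
from fractions import Fraction

# Faces with a non-zero net gain under the betting rules: 1, 2 (lose) and 5, 6 (win).
nonzero_faces = [face for face in range(1, 7) if face <= 2 or face >= 5]

exact = Fraction(len(nonzero_faces), 6)
estimate = 0.6533333333333333  # value printed by the simulation above

print(exact, float(exact))
print(abs(estimate - float(exact)))
```

A gap of this size is expected with only 300 bets; a different run of the simulation would give a slightly different estimate.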
Part 3: Selecting Rows from Table (Chapter 6.2)
In Lab Assignment 1, we practiced the Table method `take`, which takes a specific set of rows from a table. Its argument is a row index or an array of indices, and it creates a new table consisting of only those rows.
For example, if we wanted just the first row of `movies`, we could use `take` as follows.
In [25]:
movies = Table.read_table('IMDB_movies.csv')
movies.take(0)
Out[25]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | ge
Black and White | nan | 15 | 30 | nan | 94 | Art Carney | 491 | nan | Com Fam
... (remaining columns truncated)
This is a new table with just the single row that we specified.
We could also get the fourth, fifth, and sixth rows by specifying a range of indices as
the argument.
In [26]:
movies.take(np.arange(3, 6))
Out[26]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross
Black and White | Akira Kurosawa | 153 | 202 | 0 | 4 | Minoru Chiaki | 304 | 269061
Black and White | Aleksey German | 121 | 177 | 23 | 0 | Yuriy Tsurilo | 4 | nan
Black and White | Alex Garland | 489 | 108 | 232 | 123 | Sonoya Mizuno | 149 | 2.5441e+07
... (remaining columns truncated)
If we want a table of the five movies with the highest IMDB scores, we can first sort the table by `imdb_score` and then `take` the first five rows:
In [27]:
movies.sort('imdb_score', descending=True).take(np.arange(5))
Out[27]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gro
Color | John Blanchard | nan | 65 | 0 | 176 | Andrea Martin | 770 | nan
Color | Frank Darabont | 199 | 142 | 0 | 461 | Jeffrey DeMunn | 11000 | 2.8341 07
Color | Francis Ford Coppola | 208 | 175 | 0 | 3000 | Marlon Brando | 14000 | 1.3482 08
nan | John Stockwell | 2 | 90 | 134 | 354 | T.J. Storm | 260000 | nan
Color | nan | 53 | 55 | nan | 2 | Olaf Lubaszenko | 20 | 44709
... (last column and remaining columns truncated)
Let's create another, cleaner table with only two columns, `movie_title` and `imdb_score`.
In [28]:
title_score = movies.select('movie_title','imdb_score')
title_score
Out[28]:
movie_title | imdb_score
The Honeymooners | 8.7
Bewitched | 7.6
McHale's Navy | 7.5
Seven Samurai | 8.7
Hard to Be a God | 6.7
Ex Machina | 7.7
Nebraska | 7.8
Rebecca | 8.2
Psycho | 8.5
Sands of Iwo Jima | 7.2
... (4987 rows omitted)
Rows Corresponding to a Specified Feature
More often, we will want to access data in a set of rows that have a certain feature, but whose indices we do not know ahead of time. For example, we might want data on all the movies with `imdb_score` above 8.0, but we do not want to spend time counting rows in the sorted table.
The method `where` does the job for us. Its output is a table with the same columns as the original but only the rows where the feature occurs.
The first argument of `where` is the label of the column that contains the information about whether or not a row has the feature we want. If the feature is "imdb_score above 8.0", the column is `imdb_score`.
The second argument of `where` is a way of specifying the feature. A couple of examples will make the general method of specification easier to understand.
In the first example, we extract the data for all those movies with `imdb_score` above 8.0.
In [29]:
above8 = title_score.where('imdb_score', are.above(8.0))
above8
Out[29]:
movie_title | imdb_score
The Honeymooners | 8.7
Seven Samurai | 8.7
Rebecca | 8.2
Psycho | 8.5
Solaris | 8.1
Some Like It Hot | 8.3
The Apartment | 8.3
Ordet | 8.1
Modern Times | 8.6
Memento | 8.5
... (239 rows omitted)
The use of the argument `are.above(8.0)` ensured that each selected row had a value of `imdb_score` that was greater than 8.0. Let's check how many rows are in the new table `above8`.
In [30]:
above8.num_rows
Out[30]:
249
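One way to picture what `where` does with a predicate is as a True/False test applied to every row. The sketch below imitates this with plain NumPy arrays, using the first ten rows of `title_score` shown earlier in the lab. It illustrates the idea only and is not how the `datascience` library is implemented; the names `titles`, `scores`, and `mask` are introduced here.

```python
import numpy as np

# First ten rows of title_score as displayed earlier in the lab.
titles = np.array(['The Honeymooners', 'Bewitched', "McHale's Navy", 'Seven Samurai',
                   'Hard to Be a God', 'Ex Machina', 'Nebraska', 'Rebecca',
                   'Psycho', 'Sands of Iwo Jima'])
scores = np.array([8.7, 7.6, 7.5, 8.7, 6.7, 7.7, 7.8, 8.2, 8.5, 7.2])

# A predicate such as are.above(8.0) can be pictured as a True/False test per row.
mask = scores > 8.0

# Keeping only the rows where the test is True mirrors what where returns.
print(titles[mask])
print(np.count_nonzero(mask))
```

The four titles kept here are the same four that open the `above8` table.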
What if we want to know all the movies with an `imdb_score` equal to 8.1? For the answer, we have to access the rows where the value of `imdb_score` is equal to 8.1.
In [31]:
title_score.where('imdb_score', are.equal_to(8.1))
Out[31]:
movie_title | imdb_score
Solaris | 8.1
Ordet | 8.1
In the Shadow of the Moon | 8.1
Sin City | 8.1
High Noon | 8.1
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Kill Bill: Vol. 1 | 8.1
Gandhi | 8.1
The Wizard of Oz | 8.1
... (58 rows omitted)
To ensure that all rows are shown, not just the first 10 rows, we can use `.show()` at the end of the line.
In [32]:
title_score.where('imdb_score', are.equal_to(8.1)).show()
movie_title | imdb_score
Solaris | 8.1
Ordet | 8.1
In the Shadow of the Moon | 8.1
Sin City | 8.1
High Noon | 8.1
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Kill Bill: Vol. 1 | 8.1
Gandhi | 8.1
The Wizard of Oz | 8.1
The Best Years of Our Lives | 8.1
The Missing | 8.1
Lilyhammer | 8.1
Strangers with Candy | 8.1
Animal Kingdom | 8.1
The Sea Inside | 8.1
Amores Perros | 8.1
The Revenant | 8.1
The Conformist | 8.1
A Christmas Story | 8.1
Million Dollar Baby | 8.1
Gone Girl | 8.1
Prisoners | 8.1
No Country for Old Men | 8.1
Mad Max: Fury Road | 8.1
Butch Cassidy and the Sundance Kid | 8.1
Pirates of the Caribbean: The Curse of the Black Pearl | 8.1
Groundhog Day | 8.1
The Terminator | 8.1
Guardians of the Galaxy | 8.1
Tae Guk Gi: The Brotherhood of War | 8.1
The Square | 8.1
Rocky | 8.1
Elite Squad | 8.1
Destiny | 8.1
The Avengers | 8.1
Akira | 8.1
Touching the Void | 8.1
Hachi: A Dog's Tale | 8.1
The Sixth Sense | 8.1
Shutter Island | 8.1
Woodstock | 8.1
The Imitation Game | 8.1
Del 1 - Män som hatar kvinnor | 8.1
Platoon | 8.1
The Bourne Ultimatum | 8.1
There Will Be Blood | 8.1
Monsters, Inc. | 8.1
The Truman Show | 8.1
Cat on a Hot Tin Roof | 8.1
Donnie Darko | 8.1
Before Sunrise | 8.1
The Martian | 8.1
The Princess Bride | 8.1
Stand by Me | 8.1
Rush | 8.1
Archaeology of a Woman | 8.1
Network | 8.1
Barry Lyndon | 8.1
12 Years a Slave | 8.1
Jurassic Park | 8.1
The Help | 8.1
Hotel Rwanda | 8.1
The Celebration | 8.1
Deadpool | 8.1
Spotlight | 8.1
The Grand Budapest Hotel | 8.1
Annie Hall | 8.1
It is so common to ask for the rows for which some column is equal to some value that the `are.equal_to` call is optional. Instead, the `where` method can be called with only a column name and a value to achieve the same effect.
In [33]:
# equivalent to title_score.where('imdb_score', are.equal_to(8.1))
title_score.where('imdb_score', 8.1)
Out[33]:
movie_title | imdb_score
Solaris | 8.1
Ordet | 8.1
In the Shadow of the Moon | 8.1
Sin City | 8.1
High Noon | 8.1
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Kill Bill: Vol. 1 | 8.1
Gandhi | 8.1
The Wizard of Oz | 8.1
... (58 rows omitted)
Multiple Features
You can access rows that have multiple specified features by using `where` repeatedly. For example, here is a way to extract all the movies with `imdb_score` 8.1 and `budget` above $1 million.
In [34]:
movies.where('imdb_score', 8.1).where('budget', are.above(1000000)).select('movie_title', 'imdb_score', 'budget')
Out[34]:
movie_title | imdb_score | budget
In the Shadow of the Moon | 8.1 | 2e+06
Sin City | 8.1 | 4e+07
The Man Who Shot Liberty Valance | 8.1 | 3.2e+06
Kill Bill: Vol. 1 | 8.1 | 3e+07
Gandhi | 8.1 | 2.2e+07
The Wizard of Oz | 8.1 | 2.8e+06
The Best Years of Our Lives | 8.1 | 2.1e+06
Lilyhammer | 8.1 | 3.4e+07
The Sea Inside | 8.1 | 1e+07
Amores Perros | 8.1 | 2e+06
... (44 rows omitted)
General Form for Selecting Rows from a Table
By now, you may have realized that the general way to create a new table by selecting rows with a given feature is to use `where` and `are` with the appropriate condition:
original_table_name.where(column_label_string, are.condition)
In [35]:
title_score.where('imdb_score', are.between(8.0, 8.5))
Out[35]:
movie_title | imdb_score
Rebecca | 8.2
Solaris | 8.1
Some Like It Hot | 8.3
The Apartment | 8.3
The Lost Weekend | 8
Ordet | 8.1
Intolerance: Love's Struggle Throughout the Ages | 8
The Elephant Man | 8.2
In the Shadow of the Moon | 8.1
On the Waterfront | 8.2
... (241 rows omitted)
As elsewhere in Python, the range specified by `are.between` includes the left end but not the right.
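The half-open convention can be sketched with plain NumPy comparisons. The scores below are illustrative values chosen to sit on and around the two endpoints; they are not taken from the data set.

```python
import numpy as np

scores = np.array([7.9, 8.0, 8.2, 8.5, 8.7])  # illustrative values

# Half-open interval [8.0, 8.5): the left end is kept, the right end is dropped.
in_range = (scores >= 8.0) & (scores < 8.5)
print(scores[in_range])

# np.arange follows the same convention: 6 is not included.
print(np.arange(3, 6))
```

This matches the output above, where movies with a score of exactly 8 appear but none with 8.5 do.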
If we specify a condition that is not satisfied by any row, we get a table with column
labels but no rows.
In [36]:
movies.where('imdb_score', are.equal_to(-1.0))
Out[36]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | gen
... (remaining columns truncated)
We can also specify the negation of any condition by using `not_` before the condition name:
Predicate | Description
are.not_equal_to(Z) | Not equal to Z
are.not_above(x) | Not above x
... and so on. The usual rules of logic apply. For example, "not above x" is the same as "below or equal to x".
The use of `are.containing` can help save some typing. For example, you can specify just part of a movie title, `Man`, instead of `The Elephant Man`:
In [37]:
title_score.where('movie_title', are.containing('Man')).show()
movie_title | imdb_score
The Elephant Man | 8.2
Tin Can Man | 6.7
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Dead Man's Shoes | 7.7
Anger Management | 6.7
Last Man Standing | 7.4
Dead Man on Campus | 6
Repo Man | 6.9
She's the Man | 6.4
A Most Wanted Man | 6.8
Man on a Ledge | 6.6
Rain Man | 8
Man of the Year | 6.2
Trust the Man | 5.7
The Lawnmower Man | 5.4
The Family Man | 6.7
Solitary Man | 6.4
Brick Mansions | 5.7
Bicentennial Man | 6.8
Winnie Mandela | 6
Insomnia Manica | 5.8
No Man's Land: The Rise of Reeker | 4.9
Manito | 7
A Serious Man | 7
A Man Apart | 6.1
Maniac | 6.1
A Man for All Seasons | 7.9
The Man from Snowy River | 7.3
The Weather Man | 6.6
Pirates of the Caribbean: Dead Man's Chest | 7.3
The Man with the Golden Gun | 6.8
The Man from U.N.C.L.E. | 7.3
Man on Wire | 7.8
Harvard Man | 4.9
Austin Powers: International Man of Mystery | 7
Juwanna Mann | 4.5
Memoirs of an Invisible Man | 5.9
The Glimmer Man | 5.3
I Love You, Man | 7.1
Captain Corelli's Mandolin | 5.9
Medicine Man | 6
The Man Who Knew Too Little | 6.6
Iron Man 2 | 7
Iron Man | 7.9
The Manchurian Candidate | 6.6
All the Boys Love Mandy Lane | 5.6
The Railway Man | 7.1
Mandela: Long Walk to Freedom | 7.1
Delivery Man | 6.4
One Man's Hero | 6.2
Manderlay | 7.4
The Man | 5.5
The Best Man | 6.7
The Best Man Holiday | 6.7
The Amazing Spider-Man 2 | 6.7
The Amazing Spider-Man | 7
The Perfect Man | 5.5
Man on the Moon | 7.4
Dead Man Down | 6.5
The Running Man | 6.6
Hollow Man | 5.7
Renaissance Man | 6.1
Yes Man | 6.8
Ant-Man | 7.4
The Man with the Iron Fists | 5.4
The Man in the Iron Mask | 6.4
The Ladies Man | 5.1
The Man from Earth | 8
Friday the 13th Part VIII: Jason Takes Manhattan | 4.5
The Haunted Mansion | 4.9
The November Man | 6.3
Cinderella Man | 8
Spider-Man | 7.3
Spider-Man 2 | 7.3
Spider-Man 3 | 6.2
Spider-Man 3 | 6.2
Iron Man 3 | 7.2
The Extra Man | 5.9
Harley Davidson and the Marlboro Man | 6
Inside Man | 7.6
Holy Man | 4.9
Man of the House | 5.4
Scott Walker: 30 Century Man | 7.3
Dead Man Walking | 7.6
Think Like a Man Too | 5.7
Think Like a Man | 6.6
A Single Man | 7.6
Man on Fire | 7.7
Maid in Manhattan | 5.1
Ip Man 3 | 7.2
Man of Steel | 7.2
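The output above includes titles such as Anger Management and Brick Mansions, which suggests that `are.containing` matches any substring rather than whole words, and that the match is case-sensitive. The sketch below illustrates the difference with ordinary Python strings. It assumes that behaviour rather than quoting the library's implementation, and the last title in the list is an extra added here for illustration.

```python
titles = ['The Elephant Man', 'Anger Management', 'Brick Mansions',
          'Juwanna Mann', 'Batman Begins']  # last title is an illustrative extra

# Substring test: case-sensitive, and not limited to whole words.
substring_matches = [t for t in titles if 'Man' in t]

# Whole-word test: split on spaces and compare words exactly.
word_matches = [t for t in titles if 'Man' in t.split()]

print(substring_matches)
print(word_matches)
```

If only whole-word matches are wanted, the substring result has to be narrowed further, for example with a word-level test like the second one.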
Problem 3A: Movies with Multiple Features (0.25 points)
Write a statement to create a list of movies with `imdb_score` below 5.0, `budget` above $2 million, and with title containing the word `Man`.
In [38]:
# write code for Problem 3A in this cell
(movies.where('imdb_score', are.below(5.0))
       .where('budget', are.above(2000000))
       .where('movie_title', are.containing('Man'))
       .select('movie_title', 'imdb_score', 'budget')
       .show())
movie_title | imdb_score | budget
Harvard Man | 4.9 | 5.5e+06
Juwanna Mann | 4.5 | 1.56e+07
Friday the 13th Part VIII: Jason Takes Manhattan | 4.5 | 5e+06
The Haunted Mansion | 4.9 | 9e+07
Holy Man | 4.9 | 6e+07
Problem 3B: Conditions with Negation `are.not_` (0.25 points)
Write an expression that returns movies with `imdb_score` not below 9.0.
In [39]:
# write code for Problem 3B in this cell
title_score.where('imdb_score', are.not_below(9.0)).show()
movie_title | imdb_score
Kickboxer: Vengeance | 9.1
Dekalog | 9.1
Fargo | 9
The Dark Knight | 9
The Godfather: Part II | 9
The Godfather | 9.2
The Shawshank Redemption | 9.3
Towering Inferno | 9.5
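Negated predicates can be pictured as logical complements. The sketch below uses plain NumPy comparisons on a few illustrative scores (not taken from the data set) to show that "not below 9.0" keeps a score of exactly 9.0, which is why movies such as Fargo appear in the output above.

```python
import numpy as np

scores = np.array([8.9, 9.0, 9.1])  # illustrative values
x = 9.0

# "not below x" is the complement of "below x", i.e. "above or equal to x".
not_below = ~(scores < x)
print(not_below)

# "not above x" is the same as "below or equal to x".
print(np.array_equal(~(scores > x), scores <= x))
```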
Part 4: Visualization (Chapter 7)
Tables are a powerful way of organizing and visualizing data. However, large tables of
numbers can be difficult to interpret, no matter how organized they are. Sometimes it
is much easier to interpret graphs than numbers.
In this section we will develop some of the fundamental graphical methods of data analysis. Our source of data is an `actors` table from the Internet Movie Database.
Scatter Plots and Line Graphs
The table `actors` contains data on Hollywood actors, both male and female. The columns are:
Column | Contents
Actor | Name of actor
Total Gross | Total gross domestic box office receipt, in millions of dollars, of all of the actor's movies
Number of Movies | The number of movies the actor has been in
Average per Movie | Total gross divided by number of movies
#1 Movie | The highest grossing movie the actor has been in
Gross | Gross domestic box office receipt, in millions of dollars, of the actor's #1 Movie
In [40]:
actors = Table.read_table('actors.csv')
actors
Out[40]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross
Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9
Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415
Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4
Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2
Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3
Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3
Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9
Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4
... (40 rows omitted)
In the calculation of the gross receipt, the data tabulators did not include movies
where an actor had a cameo role or a speaking role that did not involve much screen
time.
The table has 50 rows, corresponding to the 50 top grossing actors. The table is already sorted by `Total Gross`, so it is easy to see that Harrison Ford is the highest grossing actor. At the time when the table was created, his movies had brought in more money at the domestic box office than the movies of any other actor in the table.
Terminology. A variable is a formal name for what we have been calling a "feature" or "attribute", such as 'number of movies.' The term variable emphasizes the point that a feature can have different values for different individuals. For example, the numbers of movies that actors have been in vary across all the actors.
Variables that have numerical values and can be measured numerically, such as 'number of movies' or 'average gross receipts per movie', are called quantitative or numerical variables.
Scatter Plots
A scatter plot displays the relation between two numerical variables. The Table method `scatter` draws a scatter plot consisting of one point for each row of the table. Its first argument is the label of the column to be plotted on the horizontal axis, and its second argument is the label of the column on the vertical axis.
In [41]:
actors.scatter('Number of Movies', 'Total Gross')
The plot contains 50 points, one point for each actor in the table. You can see that it
slopes upwards, in general. The more movies an actor has been in, the more the total
gross of all of those movies – in general.
Formally, we say that the plot shows an association between the variables, and that the association is positive: high values of one variable tend to be associated with high values of the other, and low values of one with low values of the other, in general.
Of course there is some variability. Some actors have high numbers of movies but
middling total gross receipts. Others have middling numbers of movies but high
receipts. That the association is positive is simply a statement about the broad general
trend.
Now that we have explored how the number of movies is related to the total gross receipt, let's turn our attention to how it is related to the average gross receipt per movie.
In [42]:
actors.scatter('Number of Movies', 'Average per Movie')
This is a markedly different picture and shows a negative association. In general, the more movies an actor has been in, the less the average receipt per movie.
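The direction of an association can also be summarized numerically. As a rough sketch, the cell below computes the Pearson correlation coefficient with `np.corrcoef` for the ten actors displayed in the table preview (Harrison Ford through Scarlett Johansson). This covers only those ten rows, not the full 50-row table, so the value is illustrative only.

```python
import numpy as np

# The ten rows of the actors table displayed in the preview.
number_of_movies = np.array([41, 69, 61, 44, 53, 38, 36, 45, 58, 37])
average_per_movie = np.array([118.8, 69.2, 73.3, 98.7, 74.5,
                              100.3, 99.6, 74.9, 57.8, 90.3])

# Pearson correlation coefficient; a negative value indicates a negative association.
r = np.corrcoef(number_of_movies, average_per_movie)[0, 1]
print(round(r, 2))
```

A coefficient below zero agrees with the downward slope seen in the scatter plot.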
Also, one of the points is quite high and off to the left of the plot. It corresponds to one actor who has a low number of movies and high average per movie. This point is an outlier. It lies outside the general range of the data. Indeed, it is quite far from all the other points in the plot.
We will examine the negative association further by looking at points at the right and
left ends of the plot.
For the right end, let's zoom in on the main body of the plot by just looking at the
portion that doesn't have the outlier.
In [43]:
no_outlier = actors.where('Number of Movies', are.above(10))
no_outlier.scatter('Number of Movies', 'Average per Movie')
The negative association is still clearly visible. Let's identify the actors corresponding
to the points that lie on the right hand side of the plot where the number of movies is
large:
In [44]:
actors.where('Number of Movies', are.above(60))
Out[44]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9
Robert DeNiro | 3081.3 | 79 | 39 | Meet the Fockers | 279.3
Liam Neeson | 2942.7 | 63 | 46.7 | The Phantom Menace | 474.5
Next, let us take a look at the outlier.
In [45]:
actors.where('Number of Movies', are.below(10))
Out[45]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross
Anthony Daniels | 3162.9 | 7 | 451.8 | Star Wars: The Force Awakens | 936.7
Problem 4A: Scatter Plot (0.25 points)
Create a scatter plot to look at the association between `Gross` and `Total Gross` of actors. Analyze the plot and comment on the association between these two variables.
In [46]:
# write code for Problem 4A in this cell
actors.scatter('Gross', 'Total Gross')
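df = actors.to_df()
# keep only the numeric columns so that corr does not fail on the text columns
df.select_dtypes(include='number').corr(method='pearson', min_periods=1)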
Out[46]:
 | Total Gross | Number of Movies | Average per Movie | Gross
Total Gross | 1.000000 | 0.474609 | 0.014250 | 0.385570
Number of Movies | 0.474609 | 1.000000 | -0.627345 | -0.158148
Average per Movie | 0.014250 | -0.627345 | 1.000000 | 0.474866
Gross | 0.385570 | -0.158148 | 0.474866 | 1.000000
The plot contains 50 points, one for each actor in the table. The overall trend is
upward, but there are a few outliers at the far right of the plot.
Formally, we say that the plot shows an association between the variables, and that
the association is weakly positive (the Pearson correlation in the output above is
about 0.39): high values of one variable tend to be associated with high values of the
other, and low values of one with low values of the other, in general.
In context, if an actor has a high Gross, the actor's Total Gross also tends to be
high. The outliers mentioned above may be actors whose single highest-grossing movie
did unusually well, which gives them a high Gross without a correspondingly high
Total Gross.
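To make the correlation figure less of a black box, here is a minimal sketch of how a Pearson correlation can be computed by hand: convert each variable to standard units, then take the mean of the products. The function names standard_units and pearson_r are illustrative. The data are only the five actor rows displayed earlier in this lab, so the printed value will not match the 50-actor figure in the correlation matrix.

```python
import math


def standard_units(values):
    """Convert values to standard units: (value - mean) / population SD."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def pearson_r(xs, ys):
    """Pearson correlation: the mean of the products of standard units."""
    xs_su, ys_su = standard_units(xs), standard_units(ys)
    return sum(x * y for x, y in zip(xs_su, ys_su)) / len(xs)


# Gross and Total Gross for the five actors shown in the earlier outputs
# (Jackson, Freeman, DeNiro, Neeson, Daniels). This is not the full table.
gross = [623.4, 534.9, 279.3, 474.5, 936.7]
total_gross = [4772.8, 4468.3, 3081.3, 2942.7, 3162.9]

print(pearson_r(gross, total_gross))
```

The result is always between -1 and 1, and it does not depend on which variable is treated as x and which as y.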
Line Plots
Line plots, sometimes known as line graphs, are among the most common
visualizations. They are often used to study chronological trends and patterns.
The table movies_by_year contains data on movies produced by U.S. studios in each
of the years 1980 through 2015. The columns are:
Column           | Content
Year             | Year
Total Gross      | Total domestic box office gross, in millions of dollars, of all movies released
Number of Movies | Number of movies released
#1 Movie         | Highest grossing movie
In [47]:
movies_by_year = Table.read_table('movies_by_year.csv')
movies_by_year
Out[47]:
Year | Total Gross | Number of Movies | #1 Movie
2015 | 11128.5     | 702              | Star Wars: The Force Awakens
2014 | 10360.8     | 702              | American Sniper
2013 | 10923.6     | 688              | Catching Fire
2012 | 10837.4     | 667              | The Avengers
2011 | 10174.3     | 602              | Harry Potter / Deathly Hallows (P2)
2010 | 10565.6     | 536              | Toy Story 3
2009 | 10595.5     | 521              | Avatar
2008 | 9630.7      | 608              | The Dark Knight
2007 | 9663.8      | 631              | Spider-Man 3
2006 | 9209.5      | 608              | Dead Man's Chest
... (26 rows omitted)
The Table method plot produces a line plot. Its two arguments are the same as those
for scatter: first the column on the horizontal axis, then the column on the vertical.
Here is a line plot of the number of movies released each year over the years 1980
through 2015.
In [48]:
movies_by_year.plot('Year', 'Number of Movies')
Let's focus on more recent years. In keeping with the theme of movies, the table of
rows corresponding to the years 2000 through 2015 has been assigned to the name
century_21.
In [49]:
century_21 = movies_by_year.where('Year', are.above(1999))
In [50]:
century_21.plot('Year', 'Number of Movies')
Problem 4B: Line Plot (0.25 points)
Write an expression that produces a line plot of two variables, Year and Total Gross,
for all movies released between 2000 and 2015.
In [85]:
# write code for Problem 4B in this cell
movies_year = Table.read_table('movies_by_year.csv')
# are.between(2000, 2016) keeps years >= 2000 and < 2016, i.e. 2000 through 2015.
between_00_15 = movies_year.where('Year', are.between(2000, 2016))
between_00_15.plot('Year', 'Total Gross')
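The upper bound of 2016 in the answer above is deliberate: in the datascience library, are.between(x, y) keeps values greater than or equal to x and strictly less than y, while are.above(x) keeps values strictly greater than x. The sketch below imitates those two predicates in plain Python so the boundary behavior can be checked without the CSV file; the functions between and above are hypothetical stand-ins, not the library's implementation.

```python
# Years covered by movies_by_year, per the description above: 1980 through 2015.
years = list(range(1980, 2016))


def between(low, high):
    """Hypothetical stand-in for are.between: low <= value < high."""
    return lambda value: low <= value < high


def above(threshold):
    """Hypothetical stand-in for are.above: value > threshold."""
    return lambda value: value > threshold


selected = [y for y in years if between(2000, 2016)(y)]
same_rows = [y for y in years if above(1999)(y)]
too_short = [y for y in years if between(2000, 2015)(y)]

print(selected[0], selected[-1], len(selected))  # 2000 2015 16
print(selected == same_rows)                     # True
print(too_short[-1])                             # 2014
```

So are.between(2000, 2016) selects the same years as are.above(1999) on this table, whereas are.between(2000, 2015) would silently drop 2015.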