Lab_Assignment_2 - 2

School

Pennsylvania State University

Course

200

Subject

Computer Science

Date

Dec 6, 2023

Uploaded by GeneralSummer13484

DS200: Introduction to Data Sciences
Lab Assignment 2: Loops, Tables, Visualizations (2 points)

First, let's import the Python modules needed for this assignment. Remember that you need to import the modules again every time you restart your kernel, runtime, or session.

In [1]:
    from datascience import *
    import matplotlib
    matplotlib.use('Agg')
    %matplotlib inline
    import matplotlib.pyplot as plots
    plots.style.use('fivethirtyeight')
    import numpy as np

Part 1: Random Choice (Chapter 9)

NumPy has a function np.random.choice(...) that can be used to pick one item at random from a given array. It is equally likely to pick any of the items in the array. Here is an example.

Imagine that one day, when you get home after a long day, you see a hot bowl of nachos waiting on the dining table! Let's say that whenever you take a nacho from the bowl, it will have only cheese, only salsa, both cheese and salsa, or neither cheese nor salsa (a sad tortilla chip indeed).

Let's simulate taking nachos from the bowl at random using the function np.random.choice(...). Run the cell below three times, and observe how the results may differ between these runs.

In [2]:
    nachos = make_array('cheese', 'salsa', 'both', 'neither')
    np.random.choice(nachos)
Out[2]: 'cheese'

In [3]:
    np.random.choice(nachos)
Out[3]: 'cheese'

In [4]:
    np.random.choice(nachos)
Out[4]: 'both'

To repeat this process multiple times, pass in an int n as the second argument. By default, np.random.choice samples with replacement and returns an array of items. Run the next cell to see an example of sampling with replacement 10 times from the nachos array.

In [5]:
    np.random.choice(nachos, 10)
Out[5]: array(['salsa', 'cheese', 'salsa', 'neither', 'salsa', 'cheese', 'both', 'both', 'neither', 'cheese'], dtype='<U7')
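As a small sketch of the difference between sampling with and without replacement: the snippet below uses np.array in place of make_array so that it runs with NumPy alone, and it fixes a seed so that repeated runs match. Both of those choices are assumptions made for illustration, not part of the lab.

```python
import numpy as np

# np.array stands in for datascience's make_array (assumption: NumPy only).
nachos = np.array(['cheese', 'salsa', 'both', 'neither'])

np.random.seed(0)  # fixed seed (assumption) so repeated runs give the same draws

# With replacement (the default): items may repeat, so any sample size works.
with_replacement = np.random.choice(nachos, 10)

# Without replacement: each item appears at most once, so size <= len(nachos).
without_replacement = np.random.choice(nachos, 4, replace=False)

print(with_replacement)
print(without_replacement)
```

Drawing all four items without replacement simply returns the nachos in a shuffled order, whereas ten draws with replacement must repeat at least one item.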
Next, let's use np.random.choice to simulate one roll of a fair die. The following code cell simulates rolling a die once and records the number of spots on the die in a variable x. You can run it multiple times to see how variable the results are.

In [6]:
    x = np.random.choice(np.arange(1, 7))
    x
Out[6]: 6

Problem 1: Rolling a Fair Die 10 Times (0.25 points)

Write an expression that rolls a die 10 times and returns the results in an array.

In [7]:
    # write code for Problem 1 in this cell
    x = np.random.choice(np.arange(1, 7), 10)
    x
Out[7]: array([6, 3, 1, 5, 5, 3, 4, 6, 2, 5])

Part 2: Python Loops (Chapter 9.2)

Iteration

It is often the case in programming, especially when dealing with randomness, that we want to repeat a process multiple times. For example, let's consider the game of betting on one roll of a die with the following rules:

If the die shows 1 or 2 spots, my net gain is -1 dollar.
If the die shows 3 or 4 spots, my net gain is 0 dollars.
If the die shows 5 or 6 spots, my net gain is 1 dollar.

The function bet_on_one_roll takes no argument. Each time it is called, it simulates one roll of a fair die and returns the net gain in dollars.

In [8]:
    def bet_on_one_roll():
        """Returns my net gain on one bet"""
        x = np.random.choice(np.arange(1, 7))  # roll a die once and record the number of spots
        if x <= 2:
            return -1
        elif x <= 4:
            return 0
        elif x <= 6:
            return 1

Playing this game once is easy:

In [9]:
    bet_on_one_roll()
Out[9]: -1

To get a sense of how variable the results are, we have to play the game over and over again. We could run the cell repeatedly, but that's tedious, and if we wanted to do it a thousand times or a million times, forget it.

A more automated solution is to use a for statement to loop over the contents of a sequence. This is called iteration.
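The betting rules can be checked without any randomness by separating the rule from the roll. The helper net_gain below is hypothetical (it is not part of the lab); it applies the same three-way rule to a given number of spots, so every face of the die can be inspected at once.

```python
import numpy as np

def net_gain(spots):
    """Net gain in dollars for a given number of spots (hypothetical helper)."""
    if spots <= 2:
        return -1
    elif spots <= 4:
        return 0
    else:
        return 1

# Apply the rule to each of the six faces of the die.
gains = [net_gain(face) for face in np.arange(1, 7)]
print(gains)  # [-1, -1, 0, 0, 1, 1]
```

Each of the three gains is produced by exactly two faces, so on a fair die the three outcomes are equally likely and the gains sum to zero across the six faces.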
A for statement begins with the word for, followed by a name we want to give each item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence. The code cell below gives an example.

In [10]:
    for animal in make_array('cat', 'dog', 'rabbit'):
        print(animal)
cat
dog
rabbit

It is helpful to write code that exactly replicates a for statement, without using the for statement. This is called unrolling the loop. A for statement simply replicates the code inside it, but before each iteration, it assigns a new value from the given sequence to the name we chose. For example, here is an unrolled version of the loop above.

In [11]:
    animal = make_array('cat', 'dog', 'rabbit').item(0)
    print(animal)
    animal = make_array('cat', 'dog', 'rabbit').item(1)
    print(animal)
    animal = make_array('cat', 'dog', 'rabbit').item(2)
    print(animal)
cat
dog
rabbit

Notice that the name animal is arbitrary, just like any name we assign with =.

Here we use a for statement in a more realistic way: we print the results of betting five times on the die as described earlier. This is called simulating the results of five bets. We use the word simulating to remind ourselves that we are not physically rolling dice and exchanging money but using Python to mimic the process.

To repeat a process n times, it is common to use the sequence np.arange(n) in the for statement. It is also common to use a very short name for each item. In our code we will use the name i to remind ourselves that it refers to an item.

In [12]:
    for i in np.arange(5):
        print(bet_on_one_roll())
1
1
-1
1
0

In this case, we simply perform exactly the same (random) action several times, so the code in the body of our for statement does not actually refer to i.

The iteration variable i can be used in the indented body of a loop. The code cell below gives an example.

In [13]:
    nums = np.arange(5)
    sum_nums = 0
    for i in nums:
        sum_nums = sum_nums + i
    print('Sum of the first four positive integers is: ' + str(sum_nums))
Sum of the first four positive integers is: 10

Problem 2A: Iterating over a Custom Array (0.25 points)

Create an array that contains the items "Apple", "Banana", "Kiwi", and "Orange". Write a for loop to iterate over the items in the array and print them one by one.

In [14]:
    # write code for Problem 2A in this cell
    fruits = make_array('Apple', 'Banana', 'Kiwi', 'Orange')
    for item in fruits:
        print(item)
Apple
Banana
Kiwi
Orange

Augmenting Arrays

While the for statement above does simulate the results of five bets, the results are simply printed and are not in a form that we can use for computation. An array of results would be more useful. Thus a typical use of a for statement is to create an array of results, by augmenting the array each time.

The append method in NumPy helps us do this. The call np.append(array_name, value) evaluates to a new array that is array_name augmented by value. When you use append, keep in mind that all the entries of an array must have the same type.

In [15]:
    pets = make_array('Cat', 'Dog')
    np.append(pets, 'Another Pet')
Out[15]: array(['Cat', 'Dog', 'Another Pet'], dtype='<U11')

This keeps the array pets unchanged:

In [16]:
    pets
Out[16]: array(['Cat', 'Dog'], dtype='<U3')

But often while using for loops it will be convenient to mutate an array (that is, change it) when augmenting it. This is done by assigning the augmented array to the same name as the original.

In [17]:
    pets = np.append(pets, 'Another Pet')
    pets
Out[17]: array(['Cat', 'Dog', 'Another Pet'], dtype='<U11')

Problem 2B: Creating a New Array by Augmenting (0.25 points)

Use np.append to create an array with letters A through E, by adding the letters one by one to the array, starting from an empty array.

In [18]:
    # write code for Problem 2B in this cell
    alphabets = make_array()
    alphabets = np.append(alphabets, 'A')
    alphabets = np.append(alphabets, 'B')
    alphabets = np.append(alphabets, 'C')
    alphabets = np.append(alphabets, 'D')
    alphabets = np.append(alphabets, 'E')

Example: Betting on 5 Rolls

We can now simulate five bets on the die and collect the results in an array that we will call the collection array. We will start out by creating an empty array for this, and then append the outcome of each bet. Notice that the body of the for loop contains two statements. Both statements are executed for each item in the given sequence.

In [19]:
    outcomes = make_array()
    for i in np.arange(5):
        outcome_of_bet = bet_on_one_roll()
        outcomes = np.append(outcomes, outcome_of_bet)
    outcomes
Out[19]: array([ 0., 1., 1., 1., 1.])

As shown in the example above, the indented body of a for statement can contain multiple statements, and each of the statements is executed for each item in the given sequence.

By capturing the results in an array, we have given ourselves the ability to use array methods to do computations. For example, we can use np.count_nonzero to count the number of times money changed hands.

In [20]:
    np.count_nonzero(outcomes)
Out[20]: 4

Betting on 300 Rolls

Iteration is a powerful technique. For example, we can see the variation in the results of 300 bets by running bet_on_one_roll 300 times instead of five.
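As an aside, the same 300 bets can also be simulated without a loop. The sketch below is an alternative under stated assumptions, not the approach the lab asks for: it rolls all 300 dice in one np.random.choice call and maps spots to gains with np.where. The seed and the names rolls and gains are chosen here for illustration.

```python
import numpy as np

np.random.seed(1)  # fixed seed (assumption) so the run is repeatable

# Roll 300 dice at once instead of calling a function 300 times.
rolls = np.random.choice(np.arange(1, 7), 300)

# Map spots to net gain: 1-2 -> -1, 3-4 -> 0, 5-6 -> 1.
gains = np.where(rolls <= 2, -1, np.where(rolls <= 4, 0, 1))

# Proportion of bets in which money changed hands.
proportion_nonzero = np.count_nonzero(gains) / len(gains)
print(len(gains), proportion_nonzero)
```

Since four of the six faces give a non-zero gain, the proportion should land near 2/3 for a sample of this size, although any single run will vary.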
In [21]:
    outcomes = make_array()
    for i in np.arange(300):
        outcome_of_bet = bet_on_one_roll()
        outcomes = np.append(outcomes, outcome_of_bet)
    outcomes
Out[21]: array([ 1., 0., 0., 1., 1., 1., -1., 0., -1., 1., -1., 1., 1., 1., -1., 1., 1., 1., 0., 0., 1., -1., -1., 1., 0., 0., 0., -1., 0., 1., 0., -1., -1., 1., 0., 0., 1., 0., -1., 0., 0., -1., 1., 1., 0., 1., -1., -1., 1., 1., -1., -1., 0., 0., 1., -1., 1., 1., 1., 0., -1., 1., 1., -1., 0., 1., 1., -1., 0., 1., 0., -1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., -1., -1., -1., 0., -1., 1., 1., -1., -1., -1., 0., 1., 0., -1., -1., -1., 1., 0., 1., 1., 0., -1., -1., -1., 1., 0., 1., -1., -1., 1., -1., -1., -1., 0., 1., -1., -1., 1., 1., -1., -1., -1., 0., 1., -1., -1., -1., 1.,
1., 0., 1., 0., 1., 1., 1., -1., 1., 0., 1., -1., 0., 0., 1., -1., 0., 1., 0., 0., 1., 1., -1., 0., 1., 1., 0., 1., -1., 1., -1., 0., -1., -1., -1., -1., -1., -1., -1., 1., -1., -1., 1., -1., 1., 1., -1., 1., 0., 1., 1., 1., 0., -1., 0., 1., -1., -1., -1., 0., 0., 1., 0., 0., 0., 0., -1., 1., -1., 1., -1., 0., -1., 1., 0., 1., 0., 1., -1., 0., -1., 0., 0., 1., 1., -1., -1., -1., 0., -1., -1., 0., -1., 0., 1., -1., 1., -1., 1., 0., 1., -1., 1., 1., 0., 1., 0., 0., 1., -1., -1., 1., 0., -1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0., -1., 0., 1., 1., 1., -1., 0., 1., 1., -1., 0., 1., 0., 1., 0., 1., -1., 0., 1., 1., 1., -1., -1., 0., 1., 1., 0., 0., -1., 0., 1., 0., -1., 0., -1., 1., 1., -1., 0., 0., 1., 0., 1., 0., 1., 1.])

The array outcomes now contains the results of all 300 bets.

In [22]:
    len(outcomes)
Out[22]: 300

In [23]:
    for i in np.arange(5):
        print(i)
0
1
2
3
4

Problem 2C: Probability of Non-Zero Gain (0.25 points)

Write an expression that uses the 300 simulated outcomes to estimate the probability that our net gain for a bet is not zero.

In [24]:
    # write code for Problem 2C in this cell
    outcomes = make_array()
    for i in np.arange(300):
        outcome_of_bet = bet_on_one_roll()
        outcomes = np.append(outcomes, outcome_of_bet)
    probability = np.count_nonzero(outcomes) / 300
    print(probability)
0.6533333333333333

Part 3: Selecting Rows from Table (Chapter 6.2)

In Lab Assignment 1, we practiced the Table method take, which takes a specific set of rows from a table. Its argument is a row index or an array of indices, and it creates a new table consisting of only those rows.

For example, if we wanted just the first row of movies, we could use take as follows.

In [25]:
    movies = Table.read_table('IMDB_movies.csv')
    movies.take(0)
Out[25]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | ge...
Black and White | nan | 15 | 30 | nan | 94 | Art Carney | 491 | nan | Com Fam

This is a new table with just the single row that we specified.

We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument.

In [26]:
    movies.take(np.arange(3, 6))
Out[26]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross
Black and White | Akira Kurosawa | 153 | 202 | 0 | 4 | Minoru Chiaki | 304 | 269061
Black and White | Aleksey German | 121 | 177 | 23 | 0 | Yuriy Tsurilo | 4 | nan
Black and White | Alex Garland | 489 | 108 | 232 | 123 | Sonoya Mizuno | 149 | 2.5441e+07

If we want a table of the five movies with the highest IMDB scores, we can first sort the table by imdb_score and then take the first five rows:

In [27]:
    movies.sort('imdb_score', descending=True).take(np.arange(5))
Out[27]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gro...
Color | John Blanchard | nan | 65 | 0 | 176 | Andrea Martin | 770 | nan
Color | Frank Darabont | 199 | 142 | 0 | 461 | Jeffrey DeMunn | 11000 | 2.8341e+07
Color | Francis Ford Coppola | 208 | 175 | 0 | 3000 | Marlon Brando | 14000 | 1.3482e+08
nan | John Stockwell | 2 | 90 | 134 | 354 | T.J. Storm | 260000 | nan
Color | nan | 53 | 55 | nan | 2 | Olaf Lubaszenko | 20 | 44709

Let's create another, cleaner table with only two columns, movie_title and imdb_score.

In [28]:
    title_score = movies.select('movie_title', 'imdb_score')
    title_score
Out[28]:
movie_title | imdb_score
The Honeymooners | 8.7
Bewitched | 7.6
McHale's Navy | 7.5
Seven Samurai | 8.7
Hard to Be a God | 6.7
Ex Machina | 7.7
Nebraska | 7.8
Rebecca | 8.2
Psycho | 8.5
Sands of Iwo Jima | 7.2
... (4987 rows omitted)

Rows Corresponding to a Specified Feature

More often, we will want to access data in a set of rows that have a certain feature, but whose indices we do not know ahead of time. For example, we might want data on all the movies with imdb_score above 8.0, but we do not want to spend time counting rows in the sorted table.

The method where does the job for us. Its output is a table with the same columns as the original but only the rows where the feature occurs.

The first argument of where is the label of the column that contains the information about whether or not a row has the feature we want. If the feature is "imdb_score above 8.0", the column is imdb_score.

The second argument of where is a way of specifying the feature. A couple of examples will make the general method of specification easier to understand.

In the first example, we extract the data for all those movies with imdb_score above 8.0.

In [29]:
    above8 = title_score.where('imdb_score', are.above(8.0))
    above8
Out[29]:
movie_title | imdb_score
The Honeymooners | 8.7
Seven Samurai | 8.7
Rebecca | 8.2
Psycho | 8.5
Solaris | 8.1
Some Like It Hot | 8.3
The Apartment | 8.3
Ordet | 8.1
Modern Times | 8.6
Memento | 8.5
... (239 rows omitted)

The use of the argument are.above(8.0) ensured that each selected row had a value of imdb_score that was greater than 8.0.

Let's check how many rows are in the new table above8.

In [30]:
    above8.num_rows
Out[30]: 249

What if we want to know all the movies with imdb_score equal to 8.1? For the answer, we have to access the rows where the value of imdb_score is equal to 8.1.

In [31]:
    title_score.where('imdb_score', are.equal_to(8.1))
Out[31]:
movie_title | imdb_score
Solaris | 8.1
Ordet | 8.1
In the Shadow of the Moon | 8.1
Sin City | 8.1
High Noon | 8.1
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Kill Bill: Vol. 1 | 8.1
Gandhi | 8.1
The Wizard of Oz | 8.1
... (58 rows omitted)

To ensure that all rows are shown, not just the first 10 rows, we can use .show() at the end of the line.

In [32]:
    title_score.where('imdb_score', are.equal_to(8.1)).show()
movie_title | imdb_score
Solaris | 8.1
Ordet | 8.1
In the Shadow of the Moon | 8.1
Sin City | 8.1
High Noon | 8.1
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Kill Bill: Vol. 1 | 8.1
Gandhi | 8.1
The Wizard of Oz | 8.1
The Best Years of Our Lives | 8.1
The Missing | 8.1
Lilyhammer | 8.1
Strangers with Candy | 8.1
Animal Kingdom | 8.1
The Sea Inside | 8.1
Amores Perros | 8.1
The Revenant | 8.1
The Conformist | 8.1
A Christmas Story | 8.1
Million Dollar Baby | 8.1
Gone Girl | 8.1
Prisoners | 8.1
No Country for Old Men | 8.1
Mad Max: Fury Road | 8.1
Butch Cassidy and the Sundance Kid | 8.1
Pirates of the Caribbean: The Curse of the Black Pearl | 8.1
Groundhog Day | 8.1
The Terminator | 8.1
Guardians of the Galaxy | 8.1
Tae Guk Gi: The Brotherhood of War | 8.1
The Square | 8.1
Rocky | 8.1
Elite Squad | 8.1
Destiny | 8.1
The Avengers | 8.1
Akira | 8.1
Touching the Void | 8.1
Hachi: A Dog's Tale | 8.1
The Sixth Sense | 8.1
Shutter Island | 8.1
Woodstock | 8.1
The Imitation Game | 8.1
Del 1 - Män som hatar kvinnor | 8.1
Platoon | 8.1
The Bourne Ultimatum | 8.1
There Will Be Blood | 8.1
Monsters, Inc. | 8.1
The Truman Show | 8.1
Cat on a Hot Tin Roof | 8.1
Donnie Darko | 8.1
Before Sunrise | 8.1
The Martian | 8.1
The Princess Bride | 8.1
Stand by Me | 8.1
Rush | 8.1
Archaeology of a Woman | 8.1
Network | 8.1
Barry Lyndon | 8.1
12 Years a Slave | 8.1
Jurassic Park | 8.1
The Help | 8.1
Hotel Rwanda | 8.1
The Celebration | 8.1
Deadpool | 8.1
Spotlight | 8.1
The Grand Budapest Hotel | 8.1
Annie Hall | 8.1

It is so common to ask for the rows for which some column is equal to some value that the are.equal_to call is optional. Instead, the where method can be called with only a column name and a value to achieve the same effect.

In [33]:
    title_score.where('imdb_score', 8.1)  # equivalent to title_score.where('imdb_score', are.equal_to(8.1))
Out[33]:
movie_title | imdb_score
Solaris | 8.1
Ordet | 8.1
In the Shadow of the Moon | 8.1
Sin City | 8.1
High Noon | 8.1
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Kill Bill: Vol. 1 | 8.1
Gandhi | 8.1
The Wizard of Oz | 8.1
... (58 rows omitted)

Multiple Features

You can access rows that have multiple specified features by using where repeatedly. For example, here is a way to extract all the movies with imdb_score 8.1 and budget above $\$1$ million.
In [34]:
    movies.where('imdb_score', 8.1).where('budget', are.above(1000000)).select('movie_title', 'imdb_score', 'budget')
Out[34]:
movie_title | imdb_score | budget
In the Shadow of the Moon | 8.1 | 2e+06
Sin City | 8.1 | 4e+07
The Man Who Shot Liberty Valance | 8.1 | 3.2e+06
Kill Bill: Vol. 1 | 8.1 | 3e+07
Gandhi | 8.1 | 2.2e+07
The Wizard of Oz | 8.1 | 2.8e+06
The Best Years of Our Lives | 8.1 | 2.1e+06
Lilyhammer | 8.1 | 3.4e+07
The Sea Inside | 8.1 | 1e+07
Amores Perros | 8.1 | 2e+06
... (44 rows omitted)

General Form for Selecting Rows from a Table

By now, you may have realized that the general way to create a new table by selecting rows with a given feature is to use where and are with the appropriate condition:

    original_table_name.where(column_label_string, are.condition)

In [35]:
    title_score.where('imdb_score', are.between(8.0, 8.5))
Out[35]:
movie_title | imdb_score
Rebecca | 8.2
Solaris | 8.1
Some Like It Hot | 8.3
The Apartment | 8.3
The Lost Weekend | 8
Ordet | 8.1
Intolerance: Love's Struggle Throughout the Ages | 8
The Elephant Man | 8.2
In the Shadow of the Moon | 8.1
On the Waterfront | 8.2
... (241 rows omitted)

As elsewhere in Python, the range between includes the left end but not the right.

If we specify a condition that is not satisfied by any row, we get a table with column labels but no rows.

In [36]:
    movies.where('imdb_score', are.equal_to(-1.0))
Out[36]:
color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | gen...

We can also specify the negation of any condition, by using .not_ before the condition:

Predicate | Description
are.not_equal_to(Z) | Not equal to Z
are.not_above(x) | Not above x

... and so on. The usual rules of logic apply. For example, "not above x" is the same as "below or equal to x".

The use of are.containing can help save some typing. For example, you can specify just part of a movie title, Man, instead of The Elephant Man:

In [37]:
    title_score.where('movie_title', are.containing('Man')).show()
movie_title | imdb_score
The Elephant Man | 8.2
Tin Can Man | 6.7
The Man Who Shot Liberty Valance | 8.1
Nothing But a Man | 8.1
Dead Man's Shoes | 7.7
Anger Management | 6.7
Last Man Standing | 7.4
Dead Man on Campus | 6
Repo Man | 6.9
She's the Man | 6.4
A Most Wanted Man | 6.8
Man on a Ledge | 6.6
Rain Man | 8
Man of the Year | 6.2
Trust the Man | 5.7
The Lawnmower Man | 5.4
The Family Man | 6.7
Solitary Man | 6.4
Brick Mansions | 5.7
Bicentennial Man | 6.8
Winnie Mandela | 6
Insomnia Manica | 5.8
No Man's Land: The Rise of Reeker | 4.9
Manito | 7
A Serious Man | 7
A Man Apart | 6.1
Maniac | 6.1
A Man for All Seasons | 7.9
The Man from Snowy River | 7.3
The Weather Man | 6.6
Pirates of the Caribbean: Dead Man's Chest | 7.3
The Man with the Golden Gun | 6.8
The Man from U.N.C.L.E. | 7.3
Man on Wire | 7.8
Harvard Man | 4.9
Austin Powers: International Man of Mystery | 7
Juwanna Mann | 4.5
Memoirs of an Invisible Man | 5.9
The Glimmer Man | 5.3
I Love You, Man | 7.1
Captain Corelli's Mandolin | 5.9
Medicine Man | 6
The Man Who Knew Too Little | 6.6
Iron Man 2 | 7
Iron Man | 7.9
The Manchurian Candidate | 6.6
All the Boys Love Mandy Lane | 5.6
The Railway Man | 7.1
Mandela: Long Walk to Freedom | 7.1
Delivery Man | 6.4
One Man's Hero | 6.2
Manderlay | 7.4
The Man | 5.5
The Best Man | 6.7
The Best Man Holiday | 6.7
The Amazing Spider-Man 2 | 6.7
The Amazing Spider-Man | 7
The Perfect Man | 5.5
Man on the Moon | 7.4
Dead Man Down | 6.5
The Running Man | 6.6
Hollow Man | 5.7
Renaissance Man | 6.1
Yes Man | 6.8
Ant-Man | 7.4
The Man with the Iron Fists | 5.4
The Man in the Iron Mask | 6.4
The Ladies Man | 5.1
The Man from Earth | 8
Friday the 13th Part VIII: Jason Takes Manhattan | 4.5
The Haunted Mansion | 4.9
The November Man | 6.3
Cinderella Man | 8
Spider-Man | 7.3
Spider-Man 2 | 7.3
Spider-Man 3 | 6.2
Spider-Man 3 | 6.2
Iron Man 3 | 7.2
The Extra Man | 5.9
Harley Davidson and the Marlboro Man | 6
Inside Man | 7.6
Holy Man | 4.9
Man of the House | 5.4
Scott Walker: 30 Century Man | 7.3
Dead Man Walking | 7.6
Think Like a Man Too | 5.7
Think Like a Man | 6.6
A Single Man | 7.6
Man on Fire | 7.7
Maid in Manhattan | 5.1
Ip Man 3 | 7.2
Man of Steel | 7.2

Problem 3A: Movies with Multiple Features (0.25 points)

Write a statement to create a list of movies with imdb_score below 5.0, budget above $\$2$ million, and with title containing the word Man.

In [38]:
    # write code for Problem 3A in this cell
    movies.where('imdb_score', are.below(5.0)).where('budget', are.above(2000000)).where('movie_title', are.containing('Man')).select('movie_title', 'imdb_score', 'budget').show()
movie_title | imdb_score | budget
Harvard Man | 4.9 | 5.5e+06
Juwanna Mann | 4.5 | 1.56e+07
Friday the 13th Part VIII: Jason Takes Manhattan | 4.5 | 5e+06
The Haunted Mansion | 4.9 | 9e+07
Holy Man | 4.9 | 6e+07

Problem 3B: Conditions with Negation are.not_ (0.25 points)

Write an expression that returns movies with imdb_score not below 9.0.

In [39]:
    # write code for Problem 3B in this cell
    title_score.where('imdb_score', are.not_below(9.0)).show()
movie_title | imdb_score
Kickboxer: Vengeance | 9.1
Dekalog | 9.1
Fargo | 9
The Dark Knight | 9
The Godfather: Part II | 9
The Godfather | 9.2
The Shawshank Redemption | 9.3
Towering Inferno | 9.5

Part 4: Visualization (Chapter 7)

Tables are a powerful way of organizing and visualizing data. However, large tables of numbers can be difficult to interpret, no matter how organized they are. Sometimes it is much easier to interpret graphs than numbers.

In this section we will develop some of the fundamental graphical methods of data analysis. Our source of data is an actors table from the Internet Movie Database.

Scatter Plots and Line Graphs

The table actors contains data on Hollywood actors, both male and female. The columns are:
Column | Contents
Actor | Name of actor
Total Gross | Total gross domestic box office receipt, in millions of dollars, of all of the actor's movies
Number of Movies | The number of movies the actor has been in
Average per Movie | Total gross divided by number of movies
#1 Movie | The highest grossing movie the actor has been in
Gross | Gross domestic box office receipt, in millions of dollars, of the actor's #1 Movie

In [40]:
    actors = Table.read_table('actors.csv')
    actors
Out[40]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross
Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9
Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415
Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4
Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2
Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3
Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3
Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9
Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4
... (40 rows omitted)

In the calculation of the gross receipt, the data tabulators did not include movies where an actor had a cameo role or a speaking role that did not involve much screen time.

The table has 50 rows, corresponding to the 50 top grossing actors. The table is already sorted by Total Gross, so it is easy to see that Harrison Ford is the highest grossing actor. At the time when the table was created, his movies had brought in more money at the domestic box office than the movies of any other actor in the table.

Terminology. A variable is a formal name for what we have been calling a "feature" or "attribute", such as 'number of movies.' The term variable emphasizes the point that a feature can have different values for different individuals. For example, the numbers of movies that actors have been in vary across all the actors.

Variables that have numerical values and can be measured numerically, such as 'number of movies' or 'average gross receipts per movie', are called quantitative or numerical variables.

Scatter Plots

A scatter plot displays the relation between two numerical variables. The Table method scatter draws a scatter plot consisting of one point for each row of the table. Its first argument is the label of the column to be plotted on the horizontal axis, and its second argument is the label of the column on the vertical.

In [41]:
    actors.scatter('Number of Movies', 'Total Gross')

The plot contains 50 points, one point for each actor in the table. You can see that it slopes upwards, in general. In general, the more movies an actor has been in, the higher the total gross of all of those movies.

Formally, we say that the plot shows an association between the variables, and that the association is positive: high values of one variable tend to be associated with high values of the other, and low values of one with low values of the other, in general.

Of course there is some variability. Some actors have high numbers of movies but middling total gross receipts. Others have middling numbers of movies but high receipts. That the association is positive is simply a statement about the broad general trend.

Now that we have explored how the number of movies is related to the total gross receipt, let's turn our attention to how it is related to the average gross receipt per movie.

In [42]:
    actors.scatter('Number of Movies', 'Average per Movie')

This is a markedly different picture and shows a negative association. In general, the more movies an actor has been in, the lower the average receipt per movie.

Also, one of the points is quite high and off to the left of the plot. It corresponds to one actor who has a low number of movies and a high average per movie. This point is an outlier. It lies outside the general range of the data. Indeed, it is quite far from all the other points in the plot.

We will examine the negative association further by looking at points at the right and left ends of the plot.

For the right end, let's zoom in on the main body of the plot by just looking at the portion that doesn't have the outlier.

In [43]:
    no_outlier = actors.where('Number of Movies', are.above(10))
    no_outlier.scatter('Number of Movies', 'Average per Movie')
The negative association is still clearly visible. Let's identify the actors corresponding to the points that lie on the right-hand side of the plot, where the number of movies is large:

In [44]:
    actors.where('Number of Movies', are.above(60))
Out[44]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross
Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4
Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9
Robert DeNiro | 3081.3 | 79 | 39 | Meet the Fockers | 279.3
Liam Neeson | 2942.7 | 63 | 46.7 | The Phantom Menace | 474.5

Next, let us take a look at the outlier.

In [45]:
    actors.where('Number of Movies', are.below(10))
Out[45]:
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross
Anthony Daniels | 3162.9 | 7 | 451.8 | Star Wars: The Force Awakens | 936.7
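The column description defines Average per Movie as total gross divided by number of movies. As a quick arithmetic check, the sketch below recomputes that quotient for the five actors shown in the two tables above; the values are copied from those rows, and the variable names are chosen here for illustration.

```python
# (actor, total gross in $M, number of movies, listed average per movie)
rows = [
    ('Samuel L. Jackson', 4772.8, 69, 69.2),
    ('Morgan Freeman', 4468.3, 61, 73.3),
    ('Robert DeNiro', 3081.3, 79, 39.0),
    ('Liam Neeson', 2942.7, 63, 46.7),
    ('Anthony Daniels', 3162.9, 7, 451.8),
]

for actor, total, count, listed in rows:
    average = total / count  # total gross divided by number of movies
    print(f'{actor}: {average:.1f} (listed {listed})')
```

The recomputed values agree with the listed ones to one decimal place. Anthony Daniels stands out because a total comparable to the other actors' is divided by only 7 movies.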
Problem 4A: Scatter Plot (0.25 points)

Create a scatter plot to look at the association between Gross and Total Gross of actors. Analyze the plot and comment on the association between these two variables.

In [46]:
    # write code for Problem 4A in this cell
    actors.scatter('Gross', 'Total Gross')
    df = actors.to_df()
    df.corr(method='pearson', min_periods=1)
Out[46]:
 | Total Gross | Number of Movies | Average per Movie | Gross
Total Gross | 1.000000 | 0.474609 | 0.014250 | 0.385570
Number of Movies | 0.474609 | 1.000000 | -0.627345 | -0.158148
Average per Movie | 0.014250 | -0.627345 | 1.000000 | 0.474866
Gross | 0.385570 | -0.158148 | 0.474866 | 1.000000

The plot contains 50 points, one point for each actor in the table. The trend is mostly upward, but there are some outliers at the far right of the plot. Formally, we say that the plot shows an association between the variables, and that the association is slightly positive: high values of one variable tend to be associated with high values of the other, and low values of one with low values of the other, in general. In context, if an actor's #1 movie has a high gross, the actor's total gross also tends to be high. The outliers mentioned above may be caused by actors who had one unusually high-grossing movie: such actors have a high Gross but a relatively low Total Gross.
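The matrix above reports Pearson correlation coefficients, for example 0.385570 between Gross and Total Gross. As a minimal sketch of what such a coefficient measures, the snippet below computes it by hand on a small set of hypothetical values (not taken from actors.csv): convert both variables to standard units, then average the products. It compares the result with NumPy's np.corrcoef.

```python
import numpy as np

# Hypothetical values for illustration only, not taken from actors.csv.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Convert each variable to standard units (mean 0, standard deviation 1).
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()

# The correlation coefficient is the average product of standard units.
r = np.mean(x_su * y_su)

print(r)
print(np.corrcoef(x, y)[0, 1])
```

For these values the coefficient works out to 0.8, a fairly strong positive association. A value such as 0.39 points in the same direction but describes a much looser cloud of points.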
Line Plots

Line plots, sometimes known as line graphs, are among the most common visualizations. They are often used to study chronological trends and patterns.

The table movies_by_year contains data on movies produced by U.S. studios in each of the years 1980 through 2015. The columns are:

Column | Content
Year | Year
Total Gross | Total domestic box office gross, in millions of dollars, of all movies released
Number of Movies | Number of movies released
#1 Movie | Highest grossing movie

In [47]:
    movies_by_year = Table.read_table('movies_by_year.csv')
    movies_by_year
Out[47]:
Year | Total Gross | Number of Movies | #1 Movie
2015 | 11128.5 | 702 | Star Wars: The Force Awakens
2014 | 10360.8 | 702 | American Sniper
2013 | 10923.6 | 688 | Catching Fire
2012 | 10837.4 | 667 | The Avengers
2011 | 10174.3 | 602 | Harry Potter / Deathly Hallows (P2)
2010 | 10565.6 | 536 | Toy Story 3
2009 | 10595.5 | 521 | Avatar
2008 | 9630.7 | 608 | The Dark Knight
2007 | 9663.8 | 631 | Spider-Man 3
2006 | 9209.5 | 608 | Dead Man's Chest
... (26 rows omitted)

The Table method plot produces a line plot. Its two arguments are the same as those for scatter: first the column on the horizontal axis, then the column on the vertical. Here is a line plot of the number of movies released each year over the years 1980 through 2015.
In [48]:
    movies_by_year.plot('Year', 'Number of Movies')

Let's focus on more recent years. In keeping with the theme of movies, the table of rows corresponding to the years 2000 through 2015 has been assigned to the name century_21.

In [49]:
    century_21 = movies_by_year.where('Year', are.above(1999))

In [50]:
    century_21.plot('Year', 'Number of Movies')
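To see how many rows the condition are.above(1999) keeps, the sketch below applies the same comparison to a plain NumPy array of years. It assumes the table has exactly one row per year from 1980 through 2015, which is consistent with the 36 rows shown above (10 displayed plus 26 omitted); the names years and recent are chosen here for illustration.

```python
import numpy as np

# One entry per year, 1980 through 2015 (assumption: one table row per year).
years = np.arange(1980, 2016)

# Boolean mask with the same condition as are.above(1999).
recent = years[years > 1999]

print(len(years), len(recent), recent[0], recent[-1])
```

Under that assumption the filter keeps 16 of the 36 rows, covering 2000 through 2015.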
Problem 4B: Line Plot (0.25 points)

Write an expression that produces a line plot of the two variables Year and Total Gross for all movies released between 2000 and 2015.

In [85]:
    # write code for Problem 4B in this cell
    movies_year = Table.read_table('movies_by_year.csv')
    between_00_15 = movies_year.where('Year', are.between(2000, 2016))
    between_00_15.plot('Year', 'Total Gross')
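The upper bound of 2016 in the cell above follows from the half-open range described earlier: between includes the left end but not the right. The sketch below mimics that behaviour with two hypothetical helpers written in plain Python. They are stand-ins for illustration, not the datascience library's implementation.

```python
def between(low, high):
    """Hypothetical stand-in for are.between: true when low <= value < high."""
    return lambda value: low <= value < high

def negate(predicate):
    """Hypothetical stand-in for the not_ variants: flips a predicate."""
    return lambda value: not predicate(value)

years = list(range(1998, 2018))  # illustrative years, 1998 through 2017

in_range = between(2000, 2016)
selected = [year for year in years if in_range(year)]
excluded = [year for year in years if negate(in_range)(year)]

print(selected[0], selected[-1], len(selected))
print(excluded)
```

Here selected runs from 2000 through 2015, so 2016 itself is left out, and the negated predicate keeps only 1998, 1999, 2016, and 2017.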