copy_of_lab03_spotify

py

School

University of Michigan

Course

206

Subject

Computer Science

Date

Feb 20, 2024

Type

py

Pages

10

Uploaded by erzas088

# -*- coding: utf-8 -*-
"""Copy of lab03_spotify.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/18ur4iO68n8VS8z0yY53_BiUUb7pjeig6

# Unordered Collections, Tuples and Spotify Songs

## Unordered Collections

We saw *ordered* collections when we introduced strings and lists.
"""

string = "this is a collection of characters"
string[8:20]

## lists can store arbitrary kinds of data
lst = [True, ["a", "nested", "list"], 12345]
len(lst)

"""**Unordered** collections are those where we don't use order to find items, but use some other method of retrieving values.

### Sets

The first unordered collection we introduce is a set.
"""

dord = {"Department", "of", "Redundancy", "Department"}
dord

len(dord)

"""Only one copy of each value can be included in a set. It is useful for creating collections that include only unique elements.

Create a set that uses the three pieces of each of these two phone numbers (i.e., 734 is a value that should appear in the set):

* 734-123-4568
* 734-613-2640
"""

{734, 123, 4568, 613, 2640}

"""<details>

```
{734, 123, 4568, 734, 613, 2640}
```

</details>

We often want to compare sets and see where they are the same and where they are different.
"""
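"""A minimal sketch of the same kinds of comparisons written with operators instead of methods. The sets `left` and `right` are made-up examples, not data from this lab; `&`, `-`, and `^` are the operator forms of `.intersection()`, `.difference()`, and `.symmetric_difference()`.

```python
# Hypothetical example sets (not part of the lab data)
left = {"a", "b", "c"}
right = {"b", "c", "d"}

both = left & right        # same as left.intersection(right)
only_left = left - right   # same as left.difference(right)
only_one = left ^ right    # same as left.symmetric_difference(right)

# sorted() gives a stable display order, since sets are unordered
print(sorted(both))
print(sorted(only_left))
print(sorted(only_one))
```

The method forms accept any iterable as their argument, while the operators require both sides to be sets.
"""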
states1 = {"MI", "AZ", "FL", "DE", "OR"}
states2 = {"FL", "MI", "MN", "AZ", "AK"}
states1.intersection(states2)

"""Or where they differ:"""

states1.difference(states2)

states2.difference(states1)

states1.symmetric_difference(states2)

"""What course might Alice recommend to Bob? (Do it with Python!)"""

alice = {"Stats 206", "Stats 306", "Econ 101", "EECS 183"}
bob = {"EECS 183", "Stats 206", "Math 241", "Econ 101"}
alice.difference(bob)

"""<details>

```
alice.difference(bob)
```

</details>

We've seen the use of `+` to join ordered collections before. For sets we use `|`.

Between Bob and Alice, how many unique classes have they taken in total?
"""

classes = bob | alice
len(classes)

"""<details>

```
len(alice | bob)
```

</details>

### Dictionaries

Dictionaries connect *keys* and *values*. For dictionaries, all keys must be distinct, but the values can be duplicated. We specify them like this:
"""

number_of_legs = {"dog": 4, "human": 2, "centipede": 100, "slug": 0, "cow": 4}

"""or"""

number_of_legs = {
    "dog": 4,
    "human": 2,
    "centipede": 100,
    "slug": 0,
    "cow": 4
}

"""As with ordered collections, we retrieve using square brackets `[]`."""

number_of_legs["centipede"]

"""Dictionaries are "mutable" or "changeable", meaning we can add values."""

number_of_legs["pirate"] = 1
number_of_legs

"""Add a new entry to the `number_of_legs` dictionary for ants. Use a comparison to prove that ants have more legs than cows."""

number_of_legs["ants"] = 6
number_of_legs["ants"] > number_of_legs["cow"]

"""<details>

```
number_of_legs["ant"] = 6
number_of_legs["ant"] > number_of_legs["cow"]
```

</details>

Occasionally it is helpful to get the *set* of keys from a dictionary with the `.keys()` method. Show the set of things for which we have the number of legs.
"""

number_of_legs.keys()

"""<details>

```
number_of_legs.keys()
```

</details>

Likewise, as you probably guessed, we can get the values with `.values()`. Output just the values, without the keys. Then call the `set()` function on the result to show the *unique* set of leg values.
"""

print(number_of_legs.values())
set(number_of_legs.values())

"""<details>

```
print(number_of_legs.values())
set(number_of_legs.values())
```

</details>

### Tuples

Tuples are fixed-length lists in Python. Typically, they are small (2 to 5 items). They are written with parentheses:
"""

(3, 2, 1)

"""It's not uncommon to encounter a function that returns a tuple, where each item in the tuple has a given value."""

def first_last(w):
    return (w[0], w[-1])

first_last("hello")

"""This is convenient because we can easily assign items in a tuple to variable names using *multiple assignment*:"""

a, b = first_last("elephant")
print(a)
print(b)

"""Write a function that finds the minimum and maximum value of a collection as a tuple. Find the difference of these two values for this collection:"""

nums = [9, -1, 105, 44, -23, 2001, 4]

def min_and_max(lst):
    ## sorted() returns a new list, so the caller's list keeps its original order
    ordered = sorted(lst)
    return (ordered[0], ordered[-1])

print(min_and_max(nums))

"""<details>

```
def minmax(lst):
    return (min(lst), max(lst))

mn, mx = minmax(nums)
mx - mn
```

</details>

### Dictionaries of Lists

For many data analysis problems, we would wish to study a **population**, but
instead, for reasons of feasibility or possibility, must study a **sample**, a portion of the population. We think of a sample as composed of **units** (we often write $n$ for the number of units we have). For each unit we measure several things (we will write $k$ to represent the number of things we measure). To organize this data, we create a table with $n$ rows and $k$ columns (an $n \times k$ table).

As we discussed in class, our mental model of tables in Python and Pandas will be "dictionaries of lists (each of the same type, all of the same length)". Here is a dictionary of lists that might represent the results of a vaccine trial.
"""

vaccine = {
    "age" : [38, 22, 56, 41, 29, 29],
    "got_vaccine" : [True, True, False, False, True, False],
    "got_sick" : [False, True, True, True, False, True]
}

"""Use this table to find out the following (using Python):

* The age of the youngest person in the study.
* How many people got the vaccine? (See hint)
"""

## hint remember:
[True + True, True + False, False + False]

## min() does not reorder the "age" list, so each age stays aligned with its row
print(min(vaccine["age"]))
print(sum(vaccine["got_vaccine"]))

"""<details>

```
print(min(vaccine["age"]))
print(sum(vaccine["got_vaccine"]))
```

</details>

### DataFrame and Series

While we can ask a lot of questions using this technique, we can't easily ask questions like, "How many people who got the vaccine got sick?" For this we need more advanced tools.
"""

import pandas as pd

vdf = pd.DataFrame(vaccine)
vdf

"""To find out how many people who got the vaccine also got sick we can use a few
different techniques."""

### pull out just the people who got the vaccine from the "got_sick" column
vdf["got_sick"][vdf["got_vaccine"]].sum()

### use the logical & ("and") operator to combine got_vaccine and got_sick
(vdf["got_vaccine"] & vdf["got_sick"]).sum()

"""Use Python to find how many people above the age of 30 got sick."""

(vdf["got_sick"] & (vdf["age"] > 30)).sum()

"""<details>

```
vdf["got_sick"][vdf["age"] > 30].sum()
```

</details>

## Spotify Data

Let's use these techniques on some real data! [Spotify provides an API](https://developer.spotify.com/documentation/web-api/) (application programming interface) that allows retrieval of data related to songs, albums, users and several other things. This database of songs was retrieved in 2019 from the popular songs that week.
"""

from google.colab import drive
drive.mount('/content/gdrive')

spotify = pd.read_csv("/content/gdrive/MyDrive/Stats 206 Winter 2024/data/spotify.csv")
type(spotify)

"""Usually the first thing we want to do with any data set is ask how many rows (units) and columns (variables) are contained in the data set. Use the `.shape` and `.columns` attributes to retrieve this information."""

print(spotify.shape)
spotify.columns

"""<details>

```
print(spotify.shape)
spotify.columns
```

</details>

We often have to use the number of rows and columns later. A useful trick is to save these values into variables. Use the *multiple assignment* we learned about earlier to save the `n` rows and `k` columns. Use these to compute the total number
of *cells* in the table.
"""

n, k = spotify.shape
print(n * k)

"""<details>

```
n, k = spotify.shape
n * k
```

</details>

A common mistake is using the `.size` attribute of the table in place of the `.shape` attribute. This sounds like it should contain the number of rows and columns, but it actually tells us how many cells are in the table. Verify (using `==`) that this gives the same answer as what we just computed.
"""

spotify.size == n * k

"""<details>

```
spotify.size == n * k
```

</details>

Recall that the table is a *dictionary of lists*, so we can access columns using `table_name["column_name"]`. Use the `spotify` table to access the `"tempo"` column. What is the tempo of the slowest song (in beats per minute)? (Hint: you can use either the `min` function or the `.min()` method.)
"""

spotify["tempo"].min()

"""<details>

```
spotify["tempo"].min()
```

</details>

Are slower songs more or less popular than faster songs? Create a column in the `spotify` table called `slow` that is `True` if the song has a tempo less than 120 beats per minute, and is `False` otherwise.
"""

spotify["slow"] = spotify["tempo"] < 120

"""<details>
```
spotify["slow"] = spotify["tempo"] < 120
```

</details>

How many slow songs are there?
"""

spotify["slow"].sum()

"""<details>

```
spotify["slow"].sum()
```

</details>

We can use the slow column to create a "fast" column that is the *negation* of the slow column. Here is a brief demonstration:
"""

tf = pd.Series([True, False, False, True])
~tf

"""Create a fast column that is the negation of the slow column."""

spotify["fast"] = ~spotify["slow"]

"""<details>

```
spotify["fast"] = ~ spotify["slow"]
```

</details>

Using the fast and slow columns to *index* the `"track.popularity"` column, find the value of the most popular fast song and the most popular slow song. Which had the more popular song?
"""

print(spotify["track.popularity"][spotify["fast"]].max())
print(spotify["track.popularity"][spotify["slow"]].max())

"""<details>

```
print(spotify["track.popularity"][spotify["slow"]].max())
print(spotify["track.popularity"][spotify["fast"]].max())
## slow songs had the most popular song
```

</details>

We used implicit list comprehensions to create the fast and slow columns, but remember we can also do things like arithmetic operations.
Create a new `"duration_min"` column that turns the `"duration_ms"` (milliseconds) into minutes by dividing by 1000, then dividing by 60.
"""

spotify["duration_min"] = (spotify["duration_ms"] / 1000) / 60

"""<details>

```
spotify["duration_min"] = spotify["duration_ms"] / 1000 / 60
```

</details>

What is the duration in minutes of the longest slow song?
"""

spotify["duration_min"][spotify["slow"]].max()

"""<details>

```
spotify["duration_min"][spotify["slow"]].max()
```

</details>

We can also combine columns using arithmetic. Recall that `"tempo"` is measured in "beats per minute." If we multiply `"tempo"` by `"duration_min"`, we should get (approximately) the total number of beats in the song. Create a new column called `"beats"` to hold this information. What was the maximum number of beats in a song?
"""

spotify["beats"] = spotify["tempo"] * spotify["duration_min"]
print(spotify["beats"].max())

"""<details>

```
spotify["beats"] = spotify["duration_min"] * spotify["tempo"]
spotify["beats"].max()
```

</details>

We might wonder if the song with the most beats is also the longest song. The `.argmax()` method tells us the position where the maximum value is found. Use `.argmax()` to see if the longest song also has the most beats. What would we conclude about the song with the most beats?
"""

## compare a position with a position: use .argmax() on both columns
spotify["duration_min"].argmax() == spotify["beats"].argmax()

"""<details>

```
spotify["beats"].argmax() == spotify["duration_min"].argmax()
## conclusion (if False): the song with the most beats must be faster than the longest song
```

</details>
"""
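"""A small sketch of the position comparison that `.argmax()` makes possible. The table `toy` and its values are invented for illustration and are not taken from the Spotify data.

```python
import pandas as pd

# Hypothetical three-song table (values invented for illustration)
toy = pd.DataFrame({
    "duration_min": [3.0, 5.0, 4.0],
    "tempo": [100.0, 70.0, 150.0],
})
toy["beats"] = toy["tempo"] * toy["duration_min"]

longest_pos = toy["duration_min"].argmax()  # position of the longest song
most_beats_pos = toy["beats"].argmax()      # position of the song with the most beats

print(longest_pos, most_beats_pos)
print(longest_pos == most_beats_pos)
```

In this toy table the two positions differ, so the song with the most beats is shorter than the longest song and must have a higher tempo. Comparing `.max()` with `.argmax()` would instead compare a duration with a row position, which does not answer the question.
"""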