Lab04

html

School

Temple University

Course

1013

Subject

Computer Science

Date

Dec 6, 2023

Type

html

Pages

19

Uploaded by samzahroun

Lab 04 Functions and Visualization

Elements of Data Science

Welcome to lab 4! This week, we will focus on functions and visualization. Functions are described in Chapter 8 of the Inferential Thinking text. Visualization is covered in Chapter 7.

First, set up the tests and imports by running the cell below.

In [121]:
# Enter your name as a string
name = "Sam Zahroun"

In [122]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# This line loads the tests.
from gofer.ok import check

Let's explore the most recent COVID data from the New York Times. This data is updated and stored at GitHub: https://github.com/nytimes/covid-19-data

US rolling average: https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv
US States rolling average: https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us-states.csv

In [123]:
COVID_data = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv'
COVID = Table.read_table(COVID_data)
COVID = COVID.set_format(0, DateFormatter(format='%Y-%m-%d'))

If the above read does not work, we can use the data handling package pandas, as first discussed in the introduction to Lab 03. It can be run by removing the comment markers, #, in front of the lines below.

In [124]:
import pandas as pd
data_db = pd.read_csv(COVID_data)   # Read data with pandas
COVID = Table.from_df(data_db)      # Create datascience Table object
COVID = COVID.set_format(0, DateFormatter(format='%Y-%m-%d'))

In [125]:
COVID.sort("date", descending=False)  # Display earliest dates first

Out[125]:
date        geoid  cases  cases_avg  cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k
2020-01-21  USA    1      0.14       0                   0       0           0
2020-01-22  USA    0      0.14       0                   0       0           0
2020-01-23  USA    0      0.14       0                   0       0           0
2020-01-24  USA    1      0.29       0                   0       0           0
2020-01-25  USA    1      0.43       0                   0       0           0
2020-01-26  USA    2      0.71       0                   0       0           0
2020-01-27  USA    0      0.71       0                   0       0           0
2020-01-28  USA    0      0.71       0                   0       0           0
2020-01-29  USA    0      0.71       0                   0       0           0
2020-01-30  USA    1      0.86       0                   0       0           0
... (1148 rows omitted)

Use where to select data from November - December 2021

Here are the possible arguments for the where Table method:

Predicate              Example                     Result
are.equal_to           are.equal_to(50)            Find rows with values equal to 50
are.not_equal_to       are.not_equal_to(50)        Find rows with values not equal to 50
are.above              are.above(50)               Find rows with values above (and not equal to) 50
are.above_or_equal_to  are.above_or_equal_to(50)   Find rows with values above 50 or equal to 50
are.below              are.below(50)               Find rows with values below 50
are.between            are.between(2, 10)          Find rows with values above or equal to 2 and below 10
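To make the behavior of these predicates concrete, here is a minimal plain-Python sketch of predicate-style filtering. The helper names above, between, and where_rows are hypothetical stand-ins written for this illustration; they are not the datascience library's implementation. Note that between keeps the lower bound but excludes the upper bound.

```python
# Illustrative stand-ins for predicate-style filtering (hypothetical helpers,
# not the datascience library itself).

def above(y):
    """Return a function that is True for values strictly greater than y."""
    return lambda x: x > y

def between(y, z):
    """Return a function that is True when y <= x < z (z itself is excluded)."""
    return lambda x: y <= x < z

def where_rows(values, predicate):
    """Keep only the values for which the predicate returns True."""
    return [v for v in values if predicate(v)]

values = [1, 2, 5, 10, 50, 51]
print(where_rows(values, between(2, 10)))  # [2, 5]
print(where_rows(values, above(50)))       # [51]
```

In this sketch the value 10 is dropped by between(2, 10) and the value 50 is dropped by above(50), which mirrors the wording in the table.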
In [126]:
COVID.where("deaths", are.between(0, 1))

Out[126]:
date        geoid  cases  cases_avg  cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k
2020-01-21  USA    1      0.14       0                   0       0           0
2020-01-22  USA    0      0.14       0                   0       0           0
2020-01-23  USA    0      0.14       0                   0       0           0
2020-01-24  USA    1      0.29       0                   0       0           0
2020-01-25  USA    1      0.43       0                   0       0           0
2020-01-26  USA    2      0.71       0                   0       0           0
2020-01-27  USA    0      0.71       0                   0       0           0
2020-01-28  USA    0      0.71       0                   0       0           0
2020-01-29  USA    0      0.71       0                   0       0           0
2020-01-30  USA    1      0.86       0                   0       0           0
... (41 rows omitted)

Dates produce an error, as you will see in the next cell. Below we will see the steps needed to work with dates.

In [127]:
# Dates produce an error, below we will see the steps needed to work with dates
COVID.where("date", are.between("11/01/2021", "12/31/2021"))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[127], line 2
      1 # Dates produce an error, below we will see the steps needed to work with dates
----> 2 COVID.where("date", are.between("11/01/2021", "12/31/2021"))

File /opt/conda/lib/python3.10/site-packages/datascience/tables.py:1415, in
Table.where(self, column_or_label, value_or_predicate, other)
   1413 else:
   1414 predicate = value_or_predicate
-> 1415 column = [predicate(x) for x in column]
   1416 return self.take(np.nonzero(column)[0])

File /opt/conda/lib/python3.10/site-packages/datascience/tables.py:1415, in <listcomp>(.0)
   1413 else:
   1414 predicate = value_or_predicate
-> 1415 column = [predicate(x) for x in column]
   1416 return self.take(np.nonzero(column)[0])

File /opt/conda/lib/python3.10/site-packages/datascience/predicates.py:225, in _combinable.__call__(self, x)
    224 def __call__(self, x):
--> 225 return self.f(x)

File /opt/conda/lib/python3.10/site-packages/datascience/predicates.py:143, in are.between.<locals>.<lambda>(x)
    140 @staticmethod
    141 def between(y, z):
    142 """Greater than or equal to y and less than z."""
--> 143 return _combinable(lambda x: (y <= x < z) or _equal_or_float_equal(x, y))

TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'str'

Dates and times in Tables

One thing that is more complicated than we would like, but is a common need for a data scientist, is encoding date and time. Python's time module, like Unix-like operating systems such as Linux, uses January 1, 1970 as a reference point, or epoch, and counts the number of seconds since midnight at the start of 1970 to do time computations. The number of seconds from January 1, 1970 until now is given by time.time() after importing the time module:

In [128]:
import time  # Python time functions
from time import strptime
time.time()  # Seconds since common epoch

Out[128]: 1695775133.869719

We can also use a string containing the year, month, and day with the strptime function. Why the "p" in the name strptime? It stands for "parse." To parse a sentence is to break it into its grammatical parts. Here the function strptime parses a date string into year, month, and day so that a time can be calculated.
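As a small illustration of the parsing step described above, the sketch below shows the fields strptime extracts from a date string and then converts them to seconds since the epoch. It uses calendar.timegm, which is not part of the lab; it is an assumption chosen here because it always interprets the parsed fields as UTC, whereas time.mktime interprets them in the computer's local time zone.

```python
import calendar
from time import strptime

# Parse the string into its parts (year, month, day, ...)
parsed = strptime('2020-01-21', '%Y-%m-%d')
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)  # 2020 1 21

# Seconds from 1970-01-01 00:00:00 UTC to 2020-01-21 00:00:00 UTC
seconds_utc = calendar.timegm(parsed)
print(seconds_utc)  # 1579564800
```

On a machine whose clock is set to UTC, time.mktime gives the same number; elsewhere the two results can differ by the local UTC offset.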
In [129]:
time0 = strptime('2020-01-21', '%Y-%m-%d')
time.mktime(time0)

Out[129]: 1579564800.0

We can also use the data type of a NumPy array to convert the number of seconds since midnight at the start of 1970 into legible dates.

In [130]:
np.array([0, 1579564800], dtype='datetime64[s]')  # See above for seconds resulting from string '2020-01-21'

Out[130]: array(['1970-01-01T00:00:00', '2020-01-21T00:00:00'], dtype='datetime64[s]')

Now we can recast example as a datetime64 object.

In [131]:
example = np.array([0, 1579564800])
print('Original example: ', example)
example = example.astype('datetime64[s]')
print('example as datetime64 object: ', example)

Original example: [ 0 1579564800]
example as datetime64 object: ['1970-01-01T00:00:00' '2020-01-21T00:00:00']

Question 1: Determine the number of seconds between January 21, 2020 (considered the start of the COVID pandemic in the US) and December 31, 2021 (both at midnight). Remember doing this in Lab02 for seconds over a period of years. Use two methods:

A) Multiplying 60 seconds * 60 minutes * 24 hours * ... this is difftimeA
B) Using strptime and time.mktime() ... this is difftimeB

In [132]:
difftimeA = 60 * 60 * 24 * (365 + 366 - 21)  # Compute through multiplication: 60 seconds * 60 minutes * 24 hours * ...
time1a = strptime('2020-01-21', '%Y-%m-%d')
time1 = time.mktime(time1a)
time2a = strptime('2021-12-31', '%Y-%m-%d')
time2 = time.mktime(time2a)
difftimeB = time2 - time1
print(difftimeB, difftimeA)

61344000.0 61344000

In [133]:
check('tests/q1a.py')

Out[133]: All tests passed!

Question 2: The date in the COVID table evaluates to a time in seconds since the epoch, as evaluated above. Now define a subset of the data to examine trends during all of November and December of 2021.

In [134]:
time1 = time.mktime(strptime('2021-11-01', '%Y-%m-%d'))  # Seconds since epoch
time2 = time.mktime(strptime('2021-12-31', '%Y-%m-%d'))  # Seconds since epoch
Late2021 = COVID.where(0, are.between(time1, time2))
Late2021

Out[134]:
date        geoid  cases   cases_avg  cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k
2021-11-01  USA    124570  73390.2    22.12               1151    1347.29     0.41
2021-11-02  USA    76943   71604.6    21.58               1516    1337.15     0.4
2021-11-03  USA    84844   71166.6    21.45               1880    1328.74     0.4
2021-11-04  USA    83087   70636      21.29               1123    1295.66     0.39
2021-11-05  USA    91768   71224.9    21.47               2321    1238.27     0.37
2021-11-06  USA    31939   71185.9    21.45               402     1228.55     0.37
2021-11-07  USA    22085   71786.1    21.63               132     1221.68     0.37
2021-11-08  USA    126461  74097.1    22.33               1221    1208.3      0.36
2021-11-09  USA    83651   75324.4    22.7                1716    1224.58     0.37
2021-11-10  USA    98390   76860.2    23.16               1636    1213.69     0.37
... (50 rows omitted)

In [135]:
check('tests/q2a.py')

Out[135]: All tests passed!

Plot

If we attempt to plot using the 'date' column, we get the seconds from the epoch (January 1, 1970). This does not work well, so we will address this below.

In [136]:
Late2021.plot('date', 'cases_avg_per_100k')
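One detail of the Question 2 filter is worth illustrating: are.between(y, z) keeps values with y <= x < z, so the end value itself is excluded. The standard-library sketch below counts how many days fall in the range for two choices of end point. It is an illustration under stated assumptions: the daily timestamps are generated rather than read from the COVID table, the helper utc_seconds and all variable names are hypothetical, and calendar.timegm (UTC) stands in for time.mktime.

```python
import calendar
from time import strptime

DAY = 24 * 60 * 60  # seconds in one day

def utc_seconds(date_string):
    """Seconds since the epoch for midnight UTC on the given YYYY-MM-DD date."""
    return calendar.timegm(strptime(date_string, '%Y-%m-%d'))

# One timestamp per day, 2021-10-25 through 2022-01-05 (stand-in for a date column)
first = utc_seconds('2021-10-25')
last = utc_seconds('2022-01-05')
days = list(range(first, last + DAY, DAY))

start = utc_seconds('2021-11-01')
end_dec31 = utc_seconds('2021-12-31')
end_jan01 = utc_seconds('2022-01-01')

# Half-open range: start <= t < end
excluding_dec31 = [t for t in days if start <= t < end_dec31]
including_dec31 = [t for t in days if start <= t < end_jan01]

print(len(excluding_dec31))  # 60
print(len(including_dec31))  # 61
```

The 60-day count matches the Late2021 table shown above (10 displayed rows plus 50 omitted), which suggests that December 31 is not part of that subset. Using January 1 of the following year as the end point would include it.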
Plot with Plotly

Plotly is an alternative to matplotlib that has become a mainstay for interactive data visualization, but it can become very slow when datasets have thousands of points. The Plotly figure can be saved by clicking on the camera icon in the upper right corner of the figure.

In [137]:
import plotly.express as px
series1 = Late2021.column('cases_avg_per_100k')
date = Late2021.column('date')
fig = px.line(x=date, y=[series1])
fig.show()

Converting the date column to a new data type, datetime64[s], gives the plot correctly formatted dates.

In [138]:
import plotly.express as px
series1 = Late2021.column('cases_avg_per_100k')
date = Late2021.column('date').astype('datetime64[s]')  # Retype the column, which is a NumPy array
fig = px.line(x=date, y=[series1])
fig.show()

Histogram

A histogram is produced by appending .hist('column name') to a table.

In [139]:
Late2021.hist('deaths')

We can also access summary statistics for the datascience table.

In [140]:
Late2021.stats()

Out[140]:
statistic  date         geoid  cases       cases_avg    cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k
min        1.63572e+09  USA    22085       70636        21.29               79      917.71      0.28
max        1.64082e+09  USA    585055      344970       103.97              3331    1382.4      0.42
median     1.63827e+09         115608      94628.2      28.52               1383.5  1223.13     0.37
sum        9.82964e+10         8.3087e+06  7.09819e+06  2139.24             77398   72478.9     21.88

Question 3

Construct a histogram and stats for November - December 2020 and compare them to those from November - December 2021 in a markdown cell below the histogram and statistics.

In [141]:
time1 = time.mktime(strptime('2020-11-01', '%Y-%m-%d'))  # Seconds since epoch
time2 = time.mktime(strptime('2020-12-31', '%Y-%m-%d'))  # Seconds since epoch
Late2020 = COVID.where(0, are.between(time1, time2))
Late2020

Out[141]:
date        geoid  cases   cases_avg  cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k
2020-11-01  USA    74195   82648.8    24.91               428     820.8       0.25
2020-11-02  USA    94006   85497.1    25.77               540     824.6       0.25
2020-11-03  USA    92416   88094.1    26.55               1130    844.56      0.25
2020-11-04  USA    108078  91858.4    27.68               1616    869.25      0.26
2020-11-05  USA    121349  96266.1    29.01               1108    888.73      0.27
2020-11-06  USA    132823  100960     30.43               1248    930.9       0.28
2020-11-07  USA    125931  106991     32.24               1007    955.62      0.29
2020-11-08  USA    103323  111334     33.55               464     949.33      0.29
2020-11-09  USA    130444  116338     35.06               745     979.68      0.3
2020-11-10  USA    139684  123090     37.1                1464    1030.93     0.31
... (50 rows omitted)

In [142]:
check('tests/q3a.py')

Out[142]: All tests passed!

In [143]:
Late2020.hist('deaths')
Late2020.stats()

Out[143]:
statistic  date         geoid  cases        cases_avg    cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k
min        1.60419e+09  USA    74195        82648.8      24.91               428     820.8       0.25
max        1.60929e+09  USA    280016       218502       65.85               3808    2714.25     0.82
median     1.60674e+09         181043       174187       52.495              1611.5  1633.35     0.49
sum        9.64043e+10         1.05848e+07  1.02044e+07  3075.39             112076  106619      32.15
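To see where numbers like those in the stats tables come from, here is a minimal sketch using Python's statistics module on the ten Late2021 'deaths' values displayed earlier (November 1-10, 2021). These are only the first ten rows, so the results differ from the full-table statistics; the variable name is illustrative.

```python
import statistics

# 'deaths' for 2021-11-01 through 2021-11-10, copied from the Late2021 table
deaths_first_ten = [1151, 1516, 1880, 1123, 2321, 402, 132, 1221, 1716, 1636]

print(min(deaths_first_ten))                # 132
print(max(deaths_first_ten))                # 2321
print(statistics.median(deaths_first_ten))  # 1368.5
print(sum(deaths_first_ten))                # 13098
```

The median is less affected by a few unusually large or small days than the maximum or the sum, which makes it a convenient single number when comparing two periods.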
Your comparison in this markdown cell (double click to edit)

In 2020, the number of deaths on any given day was most likely to fall in the range 1250-1450 or 2900-3100, and the peak reaches as high as 0.082%. In contrast, in 2021 the highest peak is 0.063%, for the range between 0 and 400, and we see a consistent percent per unit for the ranges between 1000 and 1900. This essentially shows that there were more deaths on a given day in 2020 than in 2021, which makes sense since that was the height of the pandemic. Although the histograms can be a little hard to understand, I think the median as represented in the statistics highlights the point clearly.

Plotting with dates

Getting a good x-axis with dates can also be tricky. This is particularly complicated when the time is defined as seconds since the common epoch of January 1, 1970 at midnight. The key is first to convert the array containing the date data to a datetime64[s] format, and then to use some special Matplotlib codes to get the best date formatting, as illustrated below:

In [144]:
# Input data to plot
date = Late2021.column('date').astype('datetime64[s]')  # Need to convert to a datetime64[s] object
deaths = Late2021.column('deaths')
## DATE PLOTTING CODE TEMPLATE TO COPY ##
import matplotlib.dates as mdates
loc = mdates.AutoDateLocator()  # Fancy function for dates
fmt = mdates.AutoDateFormatter(loc)
plt.gca().xaxis.set_major_formatter(fmt)
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))  # Control format with codes (replaces the formatter set on the previous line)
## END: DATE PLOTTING CODE TEMPLATE TO COPY ##
# Now plot
plt.plot(date, deaths)
plt.gcf().autofmt_xdate()

Question 4

Now use the same plotting template (copy from above) and modify it to plot your Late2020 data. In the markdown cell below, describe the differences in the line graphs between 2020 and 2021.

In [145]:
# Plot of November - December 2020 COVID data
dates = Late2020.column('date').astype('datetime64[s]')
deaths = Late2020.column('deaths')
## DATE PLOTTING CODE TEMPLATE TO COPY ##
import matplotlib.dates as mdates
loc = mdates.AutoDateLocator()  # Fancy function for dates
fmt = mdates.AutoDateFormatter(loc)
plt.gca().xaxis.set_major_formatter(fmt)
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))  # Control format with codes (replaces the formatter set on the previous line)
## END: DATE PLOTTING CODE TEMPLATE TO COPY ##
# Now plot
plt.plot(dates, deaths)
plt.gcf().autofmt_xdate()
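The format string passed to DateFormatter follows the same conventions as Python's strftime. As a rough illustration outside of Matplotlib, the sketch below converts an epoch time to a date and formats it with a few codes. The timestamp is midnight UTC on 2021-11-01, the variable names are illustrative, and the month abbreviation assumes an English (or default C) locale.

```python
from datetime import datetime, timezone

seconds = 1635724800  # 2021-11-01 00:00:00 UTC, in seconds since the epoch
moment = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(moment.strftime('%Y-%b'))     # 2021-Nov
print(moment.strftime('%Y-%m-%d'))  # 2021-11-01
print(moment.strftime('%b %d'))     # Nov 01
```

Swapping '%Y-%b' for another format string in the plotting template changes the tick labels in the same way.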
Your comparison in this markdown cell (double click to edit)

In 2021, it is easy to see that on most days the number of deaths stays fairly consistent at around 2000, with a couple of outliers reaching as high as 3000. In 2020, however, from the beginning of November to the end of December, the number of deaths rises steadily from 1500 to around 3500 over the two-month period.

In [146]:
check('tests/q4a.py')

Out[146]: All tests passed!

2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100. For example, the value of to_percentage(.5) should be the number 50. (No percent sign.)

A function definition has a few parts.

def

It always starts with def (short for define):

    def

Name

Next comes the name of the function. Let's call our function to_percentage.

    def to_percentage

Signature

Next comes something called the signature of the function. This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code. to_percentage should take one argument, and we'll call that argument proportion since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

Documentation

Functions can do complicated things, so you should write an explanation of what your function does. For small functions, this is less important, but it's a good habit to learn from the start. Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""

Body

Now we start writing code that runs when the function is called. This is called the body of the function. We can write anything we could write anywhere else. First let's give a name to the number we multiply a proportion by to get a percentage.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

return

The special instruction return in a function's body tells Python to make the value of the function call equal to whatever comes right after return. We want the value of to_percentage(.5) to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor

Question 5. Define to_percentage in the cell below. Call your function to convert the proportion .2 to a percentage. Name that percentage twenty_percent.

In [147]:
def to_percentage(proportion):
    """ Converts decimal to percentage. """
    factor = 100
    return proportion * factor

twenty_percent = to_percentage(.2)
twenty_percent

Out[147]: 20.0

In [148]:
check('tests/q5a.py')

Out[148]: All tests passed!

Question 6. Now define another function which takes the ratio of two numbers and then uses the to_percentage function above to convert it into a percentage. One issue is that when the denominator is zero, the result can be "not a number" (nan in Python). This can be changed to a zero as a placeholder with one of the two little tricks shown below, either of which can be incorporated as two lines of your code.
In [149]:
# First approach to deal with dividing by zero
from math import nan
z = nan
print("First: ", z)
# Use this part in your function
if z != z:  # if conditional statement
    z = 0
# Up to here
print("Now: ", z)

First: nan
Now: 0

A second approach uses Python's try: and except:. The except: block executes if the try: block fails due to an exception, such as a division by zero.

In [150]:
# Second approach: use this part in your function with z as the ratio
try:
    z = 1/0
except:
    z = 0
print("Now: ", z)

Now: 0

In [151]:
# Now your function...
def ratio(x1, x2):
    """ Computes a ratio of x1 to x2 as a percentage """
    try:
        z = x1 / x2
    except ZeroDivisionError:  # zero denominator with plain Python numbers
        z = 0
    if z != z:  # nan check (e.g. 0/0 with NumPy values): use 0 as a placeholder
        z = 0
    return to_percentage(z)

One_To_Five = ratio(1, 5)
One_To_Five

Out[151]: 20.0

In [152]:
check('tests/q6a.py')

Out[152]: All tests passed!

COVID cases leading to bad outcomes

Now we will apply the function to our COVID data. Here we will need to use the with_columns method of a Table object to add the result of applying the ratio function with two columns as arguments. These columns will be deaths and cases. The percentage returned by the function will create a new column. See Inferential Thinking 8.1.1 for inspiration: https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html

Question 7. Now apply your function to create a new column, deathrate. Examine the histogram for deathrate. Now plot the trend for deathrate for the entire time period of the dataset. Remember the special codes from above to define the x ('date') and y ('deathrate') data to plot. Discuss the results in the markdown cell below.

In [153]:
COVID = COVID.with_columns("deathrate", COVID.apply(ratio, "deaths_avg", "cases_avg")).sort("deathrate")
# Check that there are no nan...
y = nan
if y != y:
    y = 0
COVID

Out[153]:
date        geoid  cases  cases_avg  cases_avg_per_100k  deaths  deaths_avg  deaths_avg_per_100k  deathrate
2020-01-21  USA    1      0.14       0                   0       0           0                    0
2020-01-22  USA    0      0.14       0                   0       0           0                    0
2020-01-23  USA    0      0.14       0                   0       0           0                    0
2020-01-24  USA    1      0.29       0                   0       0           0                    0
2020-01-25  USA    1      0.43       0                   0       0           0                    0
2020-01-26  USA    2      0.71       0                   0       0           0                    0
2020-01-27  USA    0      0.71       0                   0       0           0                    0
2020-01-28  USA    0      0.71       0                   0       0           0                    0
2020-01-29  USA    0      0.71       0                   0       0           0                    0
2020-01-30  USA    1      0.86       0                   0       0           0                    0
... (1148 rows omitted)

In [154]:
# Histogram
COVID.hist("deathrate")
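The with_columns and apply step above calls ratio once per row, pairing each value of deaths_avg with the matching value of cases_avg. A plain-Python sketch of that idea follows. The lists and the helper name safe_ratio_percent are hypothetical illustrations rather than the datascience implementation; three of the value pairs are taken from rows shown in the tables above, and the last pair is made up to exercise the zero-denominator guard.

```python
def to_percentage(proportion):
    """Converts a proportion to a percentage."""
    return proportion * 100

def safe_ratio_percent(numerator, denominator):
    """Ratio as a percentage, with 0 as a placeholder when it is undefined."""
    try:
        z = numerator / denominator
    except ZeroDivisionError:
        z = 0
    if z != z:  # nan is the only value that is not equal to itself
        z = 0
    return to_percentage(z)

# (deaths_avg, cases_avg) pairs: 2020-01-21, 2020-11-01, 2021-11-01,
# plus one made-up pair with a zero denominator
deaths_avg = [0, 820.8, 1347.29, 5]
cases_avg = [0.14, 82648.8, 73390.2, 0]

# Apply the function row by row, as apply does with two column labels
deathrate = [safe_ratio_percent(d, c) for d, c in zip(deaths_avg, cases_avg)]
print([round(r, 2) for r in deathrate])
```

The two November rows come out near 1% and 1.8%, and the zero-denominator pair falls back to the placeholder 0 instead of raising an error.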
In [155]:
# Plot
# Be sure to re-sort data by date, the plot connects subsequent data points
COVID = COVID.sort("date")
# Input data to plot
dates = COVID.column('date').astype('datetime64[s]')
deathrate = COVID.column('deathrate')
# mdates does the trick!
## USE DATE PLOTTING CODE TEMPLATE HERE ###
import matplotlib.dates as mdates
loc = mdates.AutoDateLocator()  # Fancy function for dates
fmt = mdates.AutoDateFormatter(loc)
plt.gca().xaxis.set_major_formatter(fmt)
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
plt.plot(dates, deathrate)
plt.gcf().autofmt_xdate()
Your discussion of results from question 7 in this markdown cell (double click to edit)

The plot clearly represents the death rate from the beginning of January 2020 to May of 2023. There is a huge mountain in the plot starting around March of 2020, peaking around May, and dying off around July or August, followed by a couple more hills and then a quite significant increase in the death rate in April of 2020, but one that is only half of that in 2020.

In [156]:
check('tests/q7a.py')

Out[156]: All tests passed!

Congratulations, you're done with lab 4! Be sure to:

run all the tests and verify that they all pass (the next cell has a shortcut for that),
Save and Checkpoint from the File menu,
run the last two cells for partial grading.

Comments and markdown will be graded separately.

In [157]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
for x in range(1, 8):
    print('Testing question {}: '.format(str(x)))
    g = check('tests/q{}a.py'.format(str(x)))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)
print('Grade: {}'.format(str(correct/7)))

Testing question 1:
Passed
Testing question 2:
Passed
Testing question 3:
Passed
Testing question 4:
Passed
Testing question 5:
Passed
Testing question 6:
Passed
Testing question 7:
Passed
Grade: 1.0

In [158]:
print("Nice work ", name)
import time
localtime = time.asctime(time.localtime(time.time()))
print("Submitted @ ", localtime)

Nice work Sam Zahroun
Submitted @ Wed Sep 27 01:16:23 2023

In [ ]: