In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("project2.ipynb")
Project 2: Climate Change—Temperatures
and Precipitation
¶
In this project, you will investigate data on climate change, or the long-term shifts in temperatures and weather patterns!
Logistics
¶
Deadline.
This project is due at 11:00pm PT on Friday, 11/10
. You can receive 5 bonus points for submitting the project by 11:00pm PT on Thursday, 11/9.
Projects will be accepted up to 2 days (48 hours) late. Projects submitted fewer than 24 hours after the deadline will receive 2/3 credit, and projects submitted between 24 and 48 hours after the deadline will receive 1/3 credit. We will not accept
any projects that are submitted 48 hours or
more after the deadline. It's much
better to be early than late, so we recommend starting early.
Please note that there may not be any office hours on 11/10 due to Veterans Day.
Checkpoint.
For full credit on the checkpoint, you must complete the questions up to the checkpoint, pass all public
autograder tests
for those sections, and submit to the Gradescope Project 2 Checkpoint assignment by 11:00pm PT on Friday, 11/3
. The checkpoint is worth 5% of your entire project grade
. After you've submitted the checkpoint, you may still change your project answers before the final project deadline - only your final submission, to the "Project 2" assignment, will be graded for correctness
(including questions from before the checkpoint). You will have some lab time to work on these questions, but we recommend that you start the project before lab and leave time to finish the checkpoint afterward.
Partners.
You may work with one other partner; your partner must be from your assigned lab section. Only one
partner should submit the project notebook to Gradescope. If both partners submit, you will be docked 10% of your project grade. On Gradescope, the person who submits should also designate their partner so that both of you receive credit.
Once you submit, click into your submission, and there will be an option to Add Group Member in the top right corner. You may also reference this walkthrough video
on how to add partners on Gradescope.
Rules.
Don't share your code with anybody but your partner. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for
exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve
a similar problem.
Support.
You are not alone! Come to office hours, post on Ed, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Ed post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or tutor for help. You can find contact
information for the staff on the course website
.
Tests.
The tests that are given are not comprehensive
and passing the tests for a question does not
mean that
you answered the question correctly. Tests usually only check that your table has the correct column labels. However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work! You might want to create your own checks along the way to see if your answers make sense. Additionally, before you submit, make sure that none of your cells take a very long time to run (several minutes).
Free Response Questions:
Make sure that you put the answers to the written questions in the indicated cell we provide. Every free response question should include an explanation
that adequately answers the question. Your written work will be uploaded to Gradescope automatically after the project deadline; there is no action required on your part for this.
Advice.
Develop your answers incrementally. To perform a complicated task, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you
expect. You can add any additional names or functions you want to the provided cells. Make sure that you are using distinct and meaningful variable names throughout the notebook. Along that line, DO NOT
reuse the variable names that we use when we grade your answers.
You never
have to use just one line in this project or any others. Use intermediate variables and multiple lines as much as you would like!
All of the concepts necessary for this project are found in the textbook. If you are stuck on a particular problem,
reading through the relevant textbook section will often help to clarify concepts.
To get started, load datascience
, numpy
, and matplotlib
. Make sure to also run the first cell of this notebook to load otter
.
In [2]:
# Run this cell to set up the notebook, but please don't change it.
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')
import warnings
warnings.simplefilter('ignore')
Part 1: Temperatures
¶
In the following analysis, we will investigate one of the 21st century's most prominent issues: climate change. While the details of climate science are beyond the scope of this course, we can start to learn about climate change just by analyzing public records of different cities' temperature and precipitation over time.
We will analyze a collection of historical daily temperature and precipitation measurements from weather stations in 210 U.S. cities. The dataset was compiled by Yuchuan Lai and David Dzombak [1]; a description of the data from the original authors and the data itself is available here
.
[1] Lai, Yuchuan; Dzombak, David (2019): Compiled historical daily temperature and precipitation data for selected 210 U.S. cities. Carnegie Mellon University. Dataset.
Part 1, Section 1: Cities
¶
Run the following cell to load information about the cities
and preview the first few rows.
In [3]:
cities = Table.read_table('city_info.csv', index_col=0)
cities.show(3)
Name      ID           Lat      Lon       Stn.Name           Stn.stDate  Stn.edDate
Lander    USW00024021  42.8153  -108.726  LANDER WBO         1892-01-01  1946-05-28
Lander    USW00024021  42.8153  -108.726  LANDER HUNT FIELD  1946-05-29  2021-12-31
Cheyenne  USW00024018  41.1519  -104.806  CHEYENNE WBO       1871-01-01  1935-08-31
... (458 rows omitted)
The cities table has one row per weather station and the following columns:

1. "Name": The name of the US city
2. "ID": The unique identifier for the US city
3. "Lat": The latitude of the US city (measured in degrees of latitude)
4. "Lon": The longitude of the US city (measured in degrees of longitude)
5. "Stn.Name": The name of the weather station at which the data were collected
6. "Stn.stDate": A string representing the date of the first recording at that particular station
7. "Stn.edDate": A string representing the date of the last recording at that particular station

The data lists the weather stations at which temperature and precipitation data were collected. Note that although some cities have multiple weather stations, only one is collecting data for that city at any given point in time. Thus, we are able to just focus on the cities themselves.
Question 1.1.1:
In the cell below, produce a scatter plot that plots the latitude and longitude of every city in the cities
table so that the result places northern cities at the top and western cities at the left.
Note
: It's okay to plot the same point multiple times!
Hint
: Lines of latitude run horizontally and measure distance north or south of the equator. Lines of longitude run vertically and measure distance east or west of the prime meridian.
In [4]:
cities.scatter("Lon", "Lat")
These cities are all within the continental U.S., and so the general shape of the U.S. should be visible in your plot. The shape will appear distorted compared to most maps for two reasons: the scatter plot is square even though the U.S. is wider than it is tall, and this scatter plot is an equirectangular projection
of the spherical Earth. A geographical map of the same data uses the common Pseudo-Mercator projection
.
Note:
If this visualization doesn't load for you, please view a version of it online here
.
In [ ]:
# Just run this cell
Marker.map_table(cities.select('Lat', 'Lon', 'Name').relabeled('Name', 'labels'))
Question 1.1.2
Does it appear that these city locations are sampled uniformly at random from all the locations in the U.S.? Why or why not?
No, the city locations do not appear to be sampled uniformly at random from all locations in the U.S. The points are dense along the east and west coasts, which have many cities, while the Midwest and interior are less densely covered, which makes sense since they have fewer cities and more open space. The sample looks representative of where U.S. cities are located rather than of all locations in the country.
Question 1.1.3:
Assign num_unique_cities
to the number of unique cities that appear in the cities
table.
In [6]:
num_unique_cities = cities.group("Name").num_rows
# Do not change this line
print(f"There are {num_unique_cities} unique cities that appear within our dataset.")
There are 210 unique cities that appear within our dataset.
In [7]:
grader.check("q1_1_3")
Out[7]:
q1_1_3
passed!
In order to investigate further, it will be helpful to determine what region of the United States each city was located in: Northeast, Northwest, Southeast, or Southwest. For our purposes, we will be using the following geographical boundaries:
1. A station is located in the "Northeast" region if its latitude is greater than or equal to 40 degrees and its longitude is greater than or equal to -100 degrees.
2. A station is located in the "Northwest" region if its latitude is greater than or equal to 40 degrees and its longitude is less than -100 degrees.
3. A station is located in the "Southeast" region if its latitude is below 40 degrees and its longitude is greater than or equal to -100 degrees.
4. A station is located in the "Southwest" region if its latitude is below 40 degrees and its longitude is less than -100 degrees.

Question 1.1.4: Define the coordinates_to_region function below. It should take in two arguments, a city's latitude (lat) and longitude (lon) coordinates, and output a string representing the region it is located in.
In [8]:
def coordinates_to_region(lat, lon):
    if lat >= 40 and lon >= -100:
        return "Northeast"
    elif lat >= 40 and lon < -100:
        return "Northwest"
    elif lat < 40 and lon >= -100:
        return "Southeast"
    elif lat < 40 and lon < -100:
        return "Southwest"
In [9]:
grader.check("q1_1_4")
Out[9]:
q1_1_4
passed!
Question 1.1.5
: Add a new column in cities
labeled Region
that contains the region in which the city is located. For full credit, you must use the coordinates_to_region
function you defined rather than reimplementing its logic.
In [11]:
regions_array = cities.apply(coordinates_to_region, "Lat", "Lon")
cities = cities.with_column("Region", regions_array)
cities.show(5)
Name      ID           Lat      Lon       Stn.Name                 Stn.stDate  Stn.edDate  Region
Lander    USW00024021  42.8153  -108.726  LANDER WBO               1892-01-01  1946-05-28  Northwest
Lander    USW00024021  42.8153  -108.726  LANDER HUNT FIELD        1946-05-29  2021-12-31  Northwest
Cheyenne  USW00024018  41.1519  -104.806  CHEYENNE WBO             1871-01-01  1935-08-31  Northwest
Cheyenne  USW00024018  41.1519  -104.806  CHEYENNE MUNICIPAL ARPT  1935-09-01  2021-12-31  Northwest
Wausau    USW00014897  44.9258  -89.6256  Wausau Record Herald     1896-01-01  1941-12-31  Northeast
... (456 rows omitted)
In [12]:
grader.check("q1_1_5")
Out[12]:
q1_1_5
passed!
To confirm that you've defined your coordinates_to_region
function correctly and successfully added the Region
column to the cities
table, run the following cell. Each region should have a different color in the result.
In [14]:
# Just run this cell
cities.scatter("Lon", "Lat", group="Region")
Challenge Question 1.1.6 (OPTIONAL, ungraded)
: Create a new table called cities_nearest
. It should contain the same columns as the cities
table and an additional column called "Nearest"
that contains the name of the nearest city
that is in a different region from the city described by the row.
To approximate the distance between two cities, take the square root of the sum of the squared difference between their latitudes and the squared difference between their longitudes. Don't use a for
statement; instead, use the apply
method and array arithmetic.
Hint
: We have defined a distance
function for you, which can be called on numbers lat0
and lon0
and arrays lat1
and lon1
.
In [15]:
def distance(lat0, lon0, lat1, lon1):
    "Approximate the distance between point (lat0, lon0) and (lat1, lon1) pairs in the arrays."
    return np.sqrt((lat0 - lat1) * (lat0 - lat1) + (lon0 - lon1) * (lon0 - lon1))
...
cities_nearest = ...
# Note: remove the comment(#) on the next line if you choose to do this question
#cities_nearest.show(5)
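If you attempt this optional question, here is one possible sketch (not the official solution). The helper name nearest_other_region is our own, and the code assumes the Region column from Question 1.1.5 has already been added to cities; it applies the provided distance function to every station in a different region and keeps the name with the smallest distance.

In [ ]:
# A possible sketch for the optional question (helper name is ours, not part of the project).
# Assumes the "Region" column from Question 1.1.5 has been added to the cities table.
def nearest_other_region(lat, lon, region):
    """Name of the closest station whose region differs from the given region."""
    others = cities.where("Region", are.not_equal_to(region))
    dists = distance(lat, lon, others.column("Lat"), others.column("Lon"))
    return others.column("Name").item(int(np.argmin(dists)))

# cities_nearest = cities.with_column(
#     "Nearest", cities.apply(nearest_other_region, "Lat", "Lon", "Region"))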
Part 1, Section 2: Welcome to Phoenix, Arizona
¶
Each city has a different CSV file full of daily temperature and precipitation measurements. The file for Phoenix, Arizona is included with this project as phoenix.csv
. The files for other cities can be downloaded here
by matching them to the ID of the city in the cities
table.
Since Phoenix is located on the upper edge of the Sonoran Desert, it has some impressive temperatures.
Run the following cell to load in the phoenix
table. It has one row per day and the following columns:
1. "Date": The date (a string) representing the date of the recording in YYYY-MM-DD format
2. "tmax": The maximum temperature for the day (°F)
3. "tmin": The minimum temperature for the day (°F)
4. "prcp": The recorded precipitation for the day (inches)

In [16]:
phoenix = Table.read_table("phoenix.csv", index_col=0)
phoenix.show(3)
Date        tmax  tmin  prcp
1896-01-01  66    30    0
1896-01-02  64    30    0
1896-01-03  68    30    0
... (46018 rows omitted)
Question 1.2.1:
Assign the variable largest_2010_range_date
to the date of the largest temperature range
in Phoenix, Arizona for any day between January 1st, 2010 and December 31st, 2010. To get started, use the variable phoenix_with_ranges_2010
to filter the phoenix table to days in 2010 with an additional column corresponding to the temperature range for that day.
Your answer should be a string in the "YYYY-MM-DD" format.
Feel free to use as many lines as you need. A temperature range is calculated as the difference between the max and min temperatures for the day.
Hint
: To limit the values in a column to only those that contain
a certain string, pick the right are.
predicate from the Python Reference Sheet
.
Note:
Do not
re-assign the phoenix
variable; please use the phoenix_with_ranges_2010
variable instead.
In [17]:
phoenix_with_ranges_2010 = phoenix.where("Date", are.containing("2010"))
phoenix_with_ranges_2010 = phoenix_with_ranges_2010.with_column("abs diff", abs(phoenix_with_ranges_2010.column("tmax") - phoenix_with_ranges_2010.column("tmin"))).sort("abs diff", descending = True)
largest_2010_range_date = phoenix_with_ranges_2010.column("Date").item(0)
largest_2010_range_date
Out[17]:
'2010-06-24'
In [18]:
grader.check("q1_2_1")
Out[18]:
q1_2_1
passed!
We can look back to our phoenix
table to check the temperature readings for our largest_2010_range_date
to see if anything special is going on. Run the cell below to find the row of the phoenix
table that corresponds to the date we found above.
In [19]:
# Just run this cell
phoenix.where("Date", largest_2010_range_date)
Out[19]:
Date        tmax  tmin  prcp
2010-06-24  113   79    0

ZOO WEE MAMA! Look at the maximum temperature for that day. That's hot.
The function extract_year_from_date
takes a date string in the YYYY-MM-DD
format and returns an integer representing the year
. The function extract_month_from_date
takes a date string and returns a string describing the month. Run this cell, but you do not need to understand how this code works or edit it.
In [20]:
# Just run this cell
import calendar
def extract_year_from_date(date):
    """Returns an integer corresponding to the year of the input string's date."""
    return int(date[:4])

def extract_month_from_date(date):
    "Return an abbreviation of the name of the month for a string's date."
    month = date[5:7]
    return f'{month} ({calendar.month_abbr[int(date[5:7])]})'
# Example
print('2022-04-01 has year', extract_year_from_date('2022-04-01'),
'and month', extract_month_from_date('2022-04-01'))
2022-04-01 has year 2022 and month 04 (Apr)
Question 1.2.2:
Add two new columns called Year
and Month
to the phoenix
table that contain the year as an integer
and the month as a string
(such as "04 (Apr)"
) for each day, respectively.
Note
: The functions above may be helpful!
In [20]:
years_array = phoenix.apply(extract_year_from_date, "Date")
months_array = phoenix.apply(extract_month_from_date, "Date")
phoenix = phoenix.with_columns("Year", years_array, "Month", months_array)
phoenix.show(5)
Date        tmax  tmin  prcp  Year  Month
1896-01-01  66    30    0     1896  01 (Jan)
1896-01-02  64    30    0     1896  01 (Jan)
1896-01-03  68    30    0     1896  01 (Jan)
1896-01-04  69    34    0     1896  01 (Jan)
1896-01-05  70    46    0     1896  01 (Jan)
... (46016 rows omitted)
In [21]:
grader.check("q1_2_2")
Out[21]:
q1_2_2
passed!
Question 1.2.3:
Using the phoenix
table, create an overlaid line plot of the average maximum temperature
and average minimum temperature
for each year between 1900 and 2020 (inclusive).
Hint:
To draw a line plot with more than one line, call plot
on the column label of the x-axis values and all other columns will be treated as y-axis values.
In [22]:
yearly_means = phoenix.where("Year", are.between_or_equal_to(1900, 2020)).group("Year", np.mean)
yearly_means.select("Year", "tmax mean", "tmin mean").plot("Year")
Question 1.2.4:
Although still hotly debated (pun intended), many climate scientists agree that the effects of climate change began to surface in the early 1960s as a result of elevated levels of greenhouse gas emissions. How does the graph you produced in Question 1.2.3 support the claim that modern-day global warming began in the early 1960s?
The graph from the previous question supports the claim that modern-day global warming began in the early 1960s: starting around the early 1960s, the annual mean minimum temperature begins a steady rise that continues through the 1980s, and the annual mean maximum temperature also begins a slower climb. These upward trends in both series coincide with the period when elevated greenhouse gas emissions are thought to have begun affecting the climate, which is consistent with the claim.
Averaging temperatures across an entire year can obscure some effects of climate change. For example, if summers get hotter but winters get colder, the annual average may not change much. Let's investigate how average monthly
maximum temperatures have changed over time in Phoenix.
Question 1.2.5:
Create a monthly_increases table with one row per month and the following four columns in order:

1. "Month": The month (such as "02 (Feb)")
2. "Past": The average max temperature in that month from 1900-1960 (inclusive)
3. "Present": The average max temperature in that month from 2019-2021 (inclusive)
4. "Increase": The difference between the present and past average max temperatures in that month

First, make a copy of the phoenix table and add a new column containing the corresponding period for each row. The period refers to whether the year is in the "Past", "Present", or "Other" category. You may find the period function helpful to see which years correspond to each period. Then, use this new table to construct monthly_increases. Feel free to use as many lines as you need.
Hint
: What table method can we use to get each unique value
as its own column?
Note
: Please do not
re-assign the phoenix
variable!
In [23]:
def period(year):
    "Output if a year is in the Past, Present, or Other."
    if 1900 <= year <= 1960:
        return "Past"
    elif 2019 <= year <= 2021:
        return "Present"
    else:
        return "Other"
phoenix_copy = phoenix.with_column("Period", phoenix.apply(period, "Year"))
monthly_increases = phoenix_copy.pivot("Period", "Month", "tmax", np.mean).drop("Other")
monthly_increases = monthly_increases.with_column("Increase", monthly_increases.column("Present") - monthly_increases.column("Past"))
monthly_increases.show()
Month     Past     Present  Increase
01 (Jan)  65.0164  67.8312  2.81479
02 (Feb)  68.8485  69.1859  0.337362
03 (Mar)  74.6499  75.9796  1.32965
04 (Apr)  82.6421  88.4     5.75792
05 (May)  91.4299  94.571   3.14104
06 (Jun)  101.166  105.734  4.56832
07 (Jul)  103.599  107.245  3.64654
08 (Aug)  101.416  107.384  5.96769
09 (Sep)  97.6874  101.238  3.55035
10 (Oct)  86.798   90.1667  3.36868
11 (Nov)  74.6273  80.5178  5.89046
12 (Dec)  65.9064  67.4548  1.54844

In [24]:
grader.check("q1_2_5")
Out[24]:
q1_2_5
passed!
February in Phoenix
¶
The "Past"
column values are averaged over many decades, and so they are reliable estimates of the average high temperatures in those months before the effects of modern climate change. However, the "Present"
column is based on only three years of observations. February, the shortest month, has the fewest total observations: only 85 days. Run the following cell to see this.
In [25]:
# Just run this cell
feb_present = phoenix.where('Year', are.between_or_equal_to(2019, 2021)).where('Month', '02 (Feb)')
feb_present.num_rows
Out[25]:
85
Look back to your monthly_increases
table. Compared to the other months, the increase for the month of February is quite small; the February difference is very close to zero. Run the following cell to print out our observed difference.
In [26]:
# Just run this cell
print(f"February Difference: {monthly_increases.row(1).item('Increase')}")
February Difference: 0.3373623297258632
Perhaps that small difference is somehow due to chance! To investigate this idea requires a thought experiment.
We can observe all of the February maximum temperatures from 2019 to 2021 (the present period), so we have access to the census; there's no random sampling involved. But, we can imagine that if more years pass with the same present-day climate, there would be different but similar maximum temperatures in future February days. From the data we observe, we can try to estimate the average maximum February temperature
in this imaginary collection of all future February days that would occur in our modern climate, assuming the climate doesn't change any further and many years pass.
We can also imagine that the maximum temperature each day is like a random draw from a distribution of temperatures for that month
. Treating actual observations of natural events as if they were each randomly
sampled from some unknown distribution is a simplifying assumption. These temperatures were not actually sampled at random—instead they occurred due to the complex interactions of the Earth's climate—but treating them as if they were random abstracts away the details of this naturally occurring process and allows us to carry out statistical inference. Conclusions are only as valid as the assumptions upon which they rest, but in this case thinking of daily temperatures as random samples from some unknown climate distribution seems at least plausible.
If we assume that the actual temperatures were drawn at random from some large population of possible February days
in our modern climate, then we can not only estimate the population average of this distribution, but also quantify our uncertainty about that estimate using a confidence interval.
We will just compute the lower bound of this confidence interval.
We intend to compare our confidence interval to the historical average (i.e., the "Past"
value in our monthly_increases
table). In all months, the sample average we will consider (i.e. the "Present"
value in our monthly_increases
table) is larger
than the historical average. Since we are essentially interested in seeing if the average February temperatures have warmed
since the past, we are only concerned with the lower bound of the confidence interval being higher than the average max temperature of the past. As a result, we know in advance that the upper bound of the confidence interval will be larger as well, and there is no need to compute the upper bound explicitly. (But you can if you wish!)
Question 1.2.6.
Complete the implementation of the function ci_lower
, which takes a one-column table t
containing sample observations and a confidence level
percentage such as 95 or 99. It returns the lower bound of a confidence interval for the population mean constructed using 5,000 bootstrap resamples.
After defining ci_lower
, we have provided a line of code that calls ci_lower
on the present-day February max temperatures to output the lower bound of a 99% confidence interval for the average of daily max temperatures in February. The result should be around 67 degrees.
In [27]:
def ci_lower(t, level):
    """Compute a lower bound of a level% confidence interval of the average of the
    population for which column 0 of Table t contains a sample.
    """
    stats = make_array()
    for k in np.arange(5000):
        stat = np.mean(t.sample(with_replacement=True).column(0))
        stats = np.append(stats, stat)
    lower = (100 - level) / 2
    return percentile(lower, stats)

# Call ci_lower on the max temperatures in present-day February to find
# the lower bound of a 99% confidence interval.
feb_present_ci = ci_lower(feb_present.select('tmax'), 99)
feb_present_ci
Out[27]:
66.870588235294122
In [28]:
grader.check("q1_2_6")
Out[28]:
q1_2_6
passed!
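The discussion above notes that the upper bound of the interval is not needed for this comparison, but that you can compute it if you wish. As a rough sketch using the same bootstrap recipe as ci_lower (the name ci_upper is our own and is not used anywhere else in the project), the upper bound simply reads off the other tail of the resampled means.

In [ ]:
# Optional sketch (not required by the project): an upper bound that mirrors ci_lower.
def ci_upper(t, level):
    """Compute an upper bound of a level% confidence interval of the average of the
    population for which column 0 of Table t contains a sample.
    """
    stats = make_array()
    for k in np.arange(5000):
        stat = np.mean(t.sample(with_replacement=True).column(0))
        stats = np.append(stats, stat)
    upper = 100 - (100 - level) / 2   # e.g., the 99.5th percentile for a 99% interval
    return percentile(upper, stats)

# For example: ci_upper(feb_present.select('tmax'), 99)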
Question 1.2.7
The lower bound of the feb_present_ci
99% confidence interval is below the observed past February average maximum temperature of 68.8485 (from the monthly_increases
table). What conclusion can you draw about the effect of climate change on February maximum temperatures in Phoenix from this information? Use a 1% p-value cutoff.
Note
: If you're stuck on this question, re-reading the paragraphs under the February
heading (particularly the first few) may be helpful.
From this we conclude that the small observed increase in the February average maximum temperature could be due to chance. The lower bound of the 99% confidence interval (about 66.9°F) is below the past February average of 68.8485°F, so at the 1% p-value cutoff we cannot conclude that present-day February maximum temperatures are higher than the historical average; we fail to reject the null hypothesis that February maximum temperatures have not warmed.
All Months
¶
Question 1.2.8.
Repeat the process of comparing the lower bound of a 99% confidence interval
to the past average
for each month. For each month, print out the name of the month (e.g., 02 (Feb)
), the observed past average, and the lower bound of a confidence interval for the present average.
Use the provided call to print
in order to format the result as one line per month.
Hint
: Your code should follow the same format as our code from above (i.e. the February
section).
In [32]:
comparisons = make_array()
months = monthly_increases.column("Month")
for month in months:
    past_average = monthly_increases.where("Month", month).column("Past").item(0)
    present_observations = phoenix_copy.where("Month", month).where("Period", "Present").select("tmax")
    present_lower_bound = ci_lower(present_observations, 99)
    # Do not change the code below this line
    below = past_average < present_lower_bound
    if below:
        comparison = '**below**'
    else:
        comparison = '*above*'
    comparisons = np.append(comparisons, comparison)
    print('For', month, 'the past avg', round(past_average, 1), 'is', comparison, 'the lower bound',
          round(present_lower_bound, 1), 'of the 99% CI of the present avg. \n')
For 01 (Jan) the past avg 65.0 is **below** the lower bound 66.3 of the 99% CI of the present avg.
For 02 (Feb) the past avg 68.8 is *above* the lower bound 67.0 of the 99% CI of the present avg.
For 03 (Mar) the past avg 74.6 is *above* the lower bound 74.1 of the 99% CI of the present avg.
For 04 (Apr) the past avg 82.6 is **below** the lower bound 86.4 of the 99% CI of the present avg.
For 05 (May) the past avg 91.4 is **below** the lower bound 92.4 of the 99% CI of the present avg.
For 06 (Jun) the past avg 101.2 is **below** the lower bound 104.3 of the 99% CI of the present avg.
For 07 (Jul) the past avg 103.6 is **below** the lower bound 105.4 of the 99% CI of the present avg.
For 08 (Aug) the past avg 101.4 is **below** the lower bound 105.7 of the 99% CI of the present avg.
For 09 (Sep) the past avg 97.7 is **below** the lower bound 99.2 of the 99% CI of the present avg.
For 10 (Oct) the past avg 86.8 is **below** the lower bound 87.8 of the 99% CI of the present avg.
For 11 (Nov) the past avg 74.6 is **below** the lower bound 78.1 of the 99% CI of the present avg.
For 12 (Dec) the past avg 65.9 is *above* the lower bound 65.8 of the 99% CI of the present avg.
In [31]:
grader.check("q1_2_8")
Out[31]:
q1_2_8
passed!
Question 1.2.9.
Summarize your findings. After comparing the past average to the 99% confidence interval's lower bound for each month, what conclusions can we make about the monthly average maximum temperature in historical (1900-1960) vs. modern (2019-2021) times in the twelve months? In other words, what null hypothesis should you consider, and for which months would you reject or fail to reject the null hypothesis? Use
a 1% p-value cutoff.
Hint
: Do you notice any seasonal patterns?
Our findings show that for 9 of the 12 months the results support the alternative hypothesis that present-day maximum temperatures are warmer than in the past. The null hypothesis for each month is that the modern (2019-2021) average maximum temperature is no higher than the historical (1900-1960) average, with any observed difference due to chance. For the spring, summer, and fall months, the past average falls below the lower bound of the 99% confidence interval for the present average, so at the 1% cutoff we reject the null hypothesis for those months. For February, March, and December, which are winter or early-spring months, the past average lies above the lower bound, so we fail to reject the null hypothesis; for those months we cannot conclude a significant difference between past and modern average maximum temperatures. This suggests a seasonal pattern in which the warming signal is weakest in the cooler months.
Checkpoint (due Friday, 11/3 by 11:00pm PT)
¶
Congrats on reaching the checkpoint! Bella
is proud of you!
Run the following cells and submit to the Project 2 Checkpoint
Gradescope assignment.
To earn full credit for this checkpoint, you must pass all the public autograder tests above this cell. The cell below will re-run all of the autograder tests for Part 1 to double check your work.
In [30]:
checkpoint_tests = ["q1_1_3", "q1_1_4", "q1_1_5", "q1_2_1", "q1_2_2", "q1_2_5", "q1_2_6", "q1_2_8"]
for test in checkpoint_tests:
    display(grader.check(test))
q1_1_3 passed!
q1_1_4 passed! ✨
q1_1_5 passed!
q1_2_1 passed!
q1_2_2 passed! ✨
q1_2_5 passed!
q1_2_6 passed!
q1_2_8 passed!
Submission
¶
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before
exporting!
Reminders:

• If you worked on Project 2 with a partner, please remember to add your partner to your Gradescope submission. If you resubmit, make sure to re-add your partner, as Gradescope does not save any partner information.
• Make sure to wait until the autograder finishes running to ensure that your submission was processed properly and that you submitted to the correct assignment.

In [33]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)
Your submission has been exported. Click here
to download the zip file.
In [34]:
# Run this cell to set up the notebook, but please don't change it.
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')
import warnings
warnings.simplefilter('ignore')
Part 2: Drought
¶
According to the United States Environmental Protection Agency
, "Large portions of the Southwest have experienced drought conditions since weekly Drought Monitor records began in 2000. For extended periods from 2002 to 2005 and from 2012 to 2020, nearly the entire region was abnormally dry or even drier."
Assessing the impact of drought is challenging with just city-level data because so much of the water that people use is transported from elsewhere, but we'll explore the data we have and see what we can learn.
Let's first take a look at the precipitation data in the Southwest region. The southwest.csv
file contains total annual precipitation for 13 cities in the southwestern United States for each year from 1960 to 2021. This dataset is aggregated from the daily data and includes only the Southwest cities from the original dataset that have consistent precipitation records back to 1960.
In [21]:
southwest = Table.read_table('southwest.csv')
southwest.show(5)
City         Year  Total Precipitation
Albuquerque  1960  8.12
Albuquerque  1961  8.87
Albuquerque  1962  5.39
Albuquerque  1963  7.47
Albuquerque  1964  7.44
... (788 rows omitted)
Question 2.1.
Create a table totals
that has one row for each year in chronological order. It should contain
the following columns:
1. "Year": The year (a number)
2. "Precipitation": The total precipitation in all 13 southwestern cities that year

In [22]:
totals = southwest.group("Year", sum).drop("City sum").relabel("Total Precipitation sum", "Precipitation")
totals
Out[22]:
Year  Precipitation
1960  149.58
1961  134.82
1962  130.41
1963  132.18
1964  123.41
1965  187.53
1966  120.27
1967  179.02
1968  136.25
1969  191.72
... (51 rows omitted)
In [23]:
grader.check("q2_1")
Out[23]:
q2_1
passed! ✨
Run the cell below to plot the total precipitation in these cities over time, so that we can try to spot the drought visually. As a reminder, the drought years given by the EPA were (2002-2005) and (2012-2020).
In [24]:
# Just run this cell
totals.plot("Year", "Precipitation")
This plot isn't very revealing. Each year has a different amount of precipitation, and there is quite a bit of variability across years, as if each year's precipitation is a random draw from a distribution of possible outcomes.
Could it be that these so-called "drought conditions" from 2002-2005 and 2012-2020 can be explained by chance? In other words, could it be that the annual precipitation amounts in the Southwest for these drought years are like random draws from the same underlying distribution
as for other years? Perhaps nothing about the Earth's precipitation patterns has really changed, and the Southwest U.S. just happened to experience a few dry years close together.
To assess this idea, let's conduct an A/B test in which each year's total precipitation
is an outcome, and the condition is whether or not the year is in the EPA's drought period
.
This drought_label
function distinguishes between drought years as described in the U.S. EPA statement above (2002-2005 and 2012-2020) and other years. Note that the label "other" is perhaps misleading, since there
were other droughts before 2000, such as the massive 1988 drought
that affected much of the U.S. However, if
we're interested in whether these modern drought periods (2002-2005 and 2012-2020) are normal
or abnormal
, it makes sense to distinguish the years in this way.
In [25]:
def drought_label(n):
    """Return the label for an input year n."""
    if 2002 <= n <= 2005 or 2012 <= n <= 2020:
        return 'drought'
    else:
        return 'other'
Question 2.2.
Define null and alternative hypotheses for an A/B test that investigates whether drought years are drier
(have less precipitation) than other years.
Note:
Please format your answer using the following structure.
• Null hypothesis: ...
• Alternative hypothesis: ...

Null hypothesis: The average total precipitation in "drought" years is the same as the average total precipitation in other years.

Alternative hypothesis: The average total precipitation in "drought" years is less than the average total precipitation in other years.
Question 2.3.
First, define the table drought
. It should contain one row per year and the following two columns:
• "Label": Denotes if a year is part of a "drought" year or an "other" year
• "Precipitation": The sum of the total precipitation in 13 Southwest cities that year

Then, construct an overlaid histogram of two observed distributions: the total precipitation in drought years and the total precipitation in other years.
Note
: Use the provided bins
when creating your histogram, and do not re-assign the southwest
table. Feel free to use as many lines as you need!
Hint
: The optional group
argument in a certain function might be helpful!
In [27]:
bins = np.arange(85, 215+1, 13)
drought = Table().with_columns("Label", totals.apply(drought_label, "Year"), "Precipitation", totals.column("Precipitation"))
drought.hist("Precipitation", bins = bins, group = "Label")
Before you continue, inspect the histogram you just created and try to guess the conclusion of the A/B test. Building intuition about the result of hypothesis testing from visualizations is quite useful for data science applications.
Question 2.4.
Our next step is to choose a test statistic based on our alternative hypothesis in Question 2.2. Which of the following options are valid choices for the test statistic? Assign ab_test_stat
to an array of integers
corresponding to valid choices. Assume averages and totals are taken over the total precipitation sums for each year.
1. The difference between the total precipitation in drought years and the total precipitation in other years.
2. The difference between the total precipitation in other years and the total precipitation in drought years.
3. The absolute difference between the total precipitation in other years and the total precipitation in drought years.
4. The difference between the average precipitation in drought years and the average precipitation in other years.
5. The difference between the average precipitation in other years and the average precipitation in drought years.
6. The absolute difference between the average precipitation in other years and the average precipitation in drought years.

In [30]:
ab_test_stat = make_array(4, 5)
In [31]:
grader.check("q2_4")
Out[31]:
q2_4
passed!
Question 2.5.
Fellow climate scientists Noah and Sarah point out that there are more other
years than drought
years, and so measuring the difference between total precipitation will always favor the other
years. They conclude that all of the options above involving total
precipitation are invalid test statistic choices. Do you agree with them? Why or why not?
Hint:
Think about how permutation tests work with imbalanced classes!
While Noah and Sarah raise a reasonable concern, it is important to consider how permutation tests work. We disagree with their claim: in a permutation test, the number of observations assigned to each label is preserved in every simulated sample, so the "drought" and "other" groups keep their original sizes under the null hypothesis. Any imbalance between the groups therefore affects the observed statistic and the simulated statistics in exactly the same way, which means permutation tests handle imbalanced groups like ours without bias. As a result, the test statistics based on total precipitation are still valid choices, even though the difference in group sizes makes them harder to interpret than statistics based on averages.
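As a quick illustration of this point (our own check, not part of the project), randomly permuting the labels without replacement, which is what an A/B permutation test does, leaves the group sizes untouched: every simulated sample still compares a 13-year "drought" group with a 48-year "other" group, so any imbalance affects the observed and simulated statistics in the same way.

In [ ]:
# Our own sanity check: after shuffling the labels (without replacement), the group
# sizes are unchanged -- still 13 "drought" rows and 48 "other" rows.
shuffled_labels = drought.select("Precipitation").with_column(
    "Label", drought.sample(with_replacement=False).column("Label"))
shuffled_labels.group("Label")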
Before going on, check your drought
table. It should have two columns Label
and Precipitation
with 61 rows, 13 of which are for "drought"
years.
In [32]:
drought.show(3)
Label  Precipitation
other  149.58
other  134.82
other  130.41
... (58 rows omitted)
In [33]:
drought.group('Label')
Out[33]:
Label    count
drought  13
other    48

Question 2.6.
For our A/B test, we'll use the difference between the average precipitation in drought years and
the average precipitation in other years as our test statistic:
$$\text{average precipitation in "drought" years} - \text{average precipitation in "other" years}$$
First, complete the function test_statistic
. It should take in a two-column table t
with one row per year and two columns:
• Label: the label for that year (either 'drought' or 'other')
• Precipitation: the total precipitation in the 13 Southwest cities that year.

Then, use the function you define to assign observed_statistic to the observed test statistic.
In [34]:
def test_statistic(t):
    drought_mean = np.mean(t.where("Label", "drought").column("Precipitation"))
    other_mean = np.mean(t.where("Label", "other").column("Precipitation"))
    return drought_mean - other_mean
observed_statistic = test_statistic(drought)
observed_statistic
Out[34]:
-15.856714743589748
In [35]:
grader.check("q2_6")
Out[35]:
q2_6
passed!
Now that we have defined our hypotheses and test statistic, we are ready to conduct our hypothesis test. We’ll start by defining a function to simulate the test statistic under the null hypothesis, and then call that function 5,000 times to construct an empirical distribution under the null hypothesis.
Question 2.7.
Write a function to simulate the test statistic under the null hypothesis. The simulate_precipitation_null
function should simulate the null hypothesis once (not 5,000 times) and return the value of the test statistic for that simulated sample.
Hint
: Using t.with_column(...)
with a column name that already exists in a table t
will replace that column with the newly specified values.
In [36]:
def simulate_precipitation_null():
    # Shuffle the precipitation values (without replacement) to simulate one sample under the null.
    shuffled = drought.sample(with_replacement=False).column("Precipitation")
    sampled_table = drought.with_column("Precipitation", shuffled)
    return test_statistic(sampled_table)
# Run your function a couple times to make sure that it works
simulate_precipitation_null()
Out[36]:
-19.529150641025609
In [37]:
grader.check("q2_7")
Out[37]:
q2_7
passed!
Question 2.8.
Fill in the blanks below to complete the simulation for the hypothesis test. Your simulation should compute 5,000 values of the test statistic under the null hypothesis and store the result in the array sampled_stats
.
Hint:
You should use the simulate_precipitation_null
function you wrote in the previous question!
Note:
Running this cell may take a few seconds. If it takes more than a minute, try to find a different (faster) way to implement your simulate_precipitation_null
function.
In [43]:
sampled_stats = make_array()
repetitions = 5000
for i in np.arange(repetitions):
    sampled_stats = np.append(sampled_stats, simulate_precipitation_null())
# Do not change these lines
Table().with_column('Difference Between Means', sampled_stats).hist()
plt.scatter(observed_statistic, 0, c="r", s=50);
plt.ylim(-0.01);
In [44]:
grader.check("q2_8")
Out[44]:
q2_8
passed!
Question 2.9.
Compute the p-value for this hypothesis test, and assign it to the variable precipitation_p_val
.
In [45]:
precipitation_p_val = np.count_nonzero(sampled_stats <= observed_statistic)/5000
precipitation_p_val
Out[45]:
0.0246
In [46]:
grader.check("q2_9")
Out[46]:
q2_9
passed!
Question 2.10.
State a conclusion from this test using a p-value cutoff of 5%. What have you learned about the EPA's statement on drought?
With a p-value cutoff of 5%, we reject the null hypothesis, since the computed p-value of about 0.0246 is below the cutoff. The null hypothesis was that the average total precipitation in drought years is the same as in other years; the data instead support the alternative hypothesis that drought years are drier. This is consistent with the EPA's statement that the drought years (2002-2005 and 2012-2020) were abnormally dry across the region.
Question 2.11.
Does your conclusion from Question 2.10 apply to the entire Southwest region of the U.S.? Why
or why not?
Note:
Feel free to do some research into geographical features of this region of the U.S.!
Based on the EPA's statement, the entire Southwest was abnormally dry during the drought years. Our data cover a sample of 13 cities within the Southwest, and we believe the conclusion can be applied to the broader region because the climate of the Southwest does not differ greatly from area to area. Furthermore, the p-value of 0.0246 indicates a significant difference in precipitation during the drought years. Given this, we think the result can reasonably be generalized to the entire Southwest region.
Conclusion
¶
Data science plays a central role in climate change research because massive simulations of the Earth's climate are necessary to assess the implications of climate data recorded from weather stations, satellites, and other sensors. Berkeley Earth
is a common source of data for these kinds of projects.
In this project, we found ways to apply our statistical inference techniques that rely on random sampling even in
situations where the data were not generated randomly, but instead by some complicated natural process that appeared random. We made assumptions about randomness and then came to conclusions based on those assumptions. Great care must be taken to choose assumptions that are realistic, so that the resulting conclusions are not misleading. However, making assumptions about data can be productive when doing so allows inference
techniques to apply to novel situations.
Congratulations -- Sylvester is proud of you for finishing Project 2!
Important Reminders:

• If you worked on Project 2 with a partner, please remember to add your partner to your Gradescope submission. If you resubmit, make sure to re-add your partner, as Gradescope does not save any partner information.
• Make sure to wait until the autograder finishes running to ensure that your submission was processed properly and that you submitted to the correct assignment.
Submission
¶
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. Please save before
exporting!
In [47]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)
Running your submission against local test cases...
Your submission received the following results when run against available test cases:
q1_1_3 results: All test cases passed!
q1_1_4 results: All test cases passed!
q1_1_5 results: All test cases passed!
q1_2_1 results: All test cases passed!
q1_2_2 results: All test cases passed!
q1_2_5 results: All test cases passed!
q1_2_6 results: All test cases passed!
q1_2_8 results: All test cases passed!
q2_1 results: All test cases passed!
q2_4 results: All test cases passed!
q2_6 results: All test cases passed!
q2_7 results: All test cases passed!
q2_8 results: All test cases passed!
q2_9 results: All test cases passed!
Your submission has been exported. Click here
to download the zip file.