# Problem 21: Final exam, Fall 2020: The legacy of "redlining"

_Version 1.2 (added a clarification on the regression part)_

This problem builds on your knowledge of the Python data stack to analyze data that contains geographic information. It has 6 exercises, numbered 0 to 5. There are 13 available points, but the threshold to earn 100% is just 10 points. (Therefore, once you hit 10 points, you can stop. There is no extra credit for exceeding this threshold.)

Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. However, if you see a code cell introduced by the phrase "Sample result for ...", please run it. Some demo cells in the notebook may depend on these precomputed results.

The point values of the individual exercises are as follows:

- Exercise 0: 2 points
- Exercise 1: 3 points
- Exercise 2: 2 points
- Exercise 3: 2 points
- Exercise 4: 2 points
- Exercise 5: 2 points

**Pro-tips.**

- All test cells use randomly generated inputs. Therefore, try your best to write solutions that do not assume too much. To help you debug, when a test cell does fail, it will often tell you exactly what inputs it was using and what output it expected, compared to yours.
- If you need a complex SQL query, remember that you can define one using a triple-quoted (multiline) string (https://docs.python.org/3.7/tutorial/introduction.html#strings).
- If your program's behavior seems strange, try resetting the kernel and rerunning everything.
- If you mess up this notebook or just want to start from scratch, save copies of all your partial responses and use Actions → Reset Assignment to get a fresh, original copy of this notebook. (Resetting will wipe out any answers you've written so far, so be sure to stash those somewhere safe if you intend to keep or reuse them!)
- If you generate excessive output (e.g., from an ill-placed print statement) that causes the notebook to load slowly or not at all, use Actions → Clear Notebook Output to get a clean copy. The clean copy will retain your code but remove any generated output. However, it will also rename the notebook to clean.xxx.ipynb. Since the autograder expects a notebook file with the original name, you'll need to rename the clean notebook accordingly.

Good luck!

## Background

During the economic Great Depression of the 1930s, the United States government began "rating" neighborhoods on a letter-grade scale of "A" ("good") to "D" ("bad"). The purpose was to use such grades to determine which neighborhoods would qualify for new investments, in the form of residential and business loans. But these grades also reflected racial and ethnic bias toward the residents of those neighborhoods. Nearly 100 years later, the effects have taken the form of environmental and economic disparities.

In this notebook, you will get an idea of how such an analysis can come together using publicly available data and the basic computational data processing techniques that appeared in this course. (And after you finish the exam, we hope you will try the optional exercise at the end and refer to the "epilogue" for related reading.)
**Goal and workflow.** Your goal is to see if there is a relationship between the rating a neighborhood received in the 1930s and two attributes we can observe today: the average temperature of a neighborhood and the average home price. Temperature tells you something about the local environment. Areas with more parks, trees, and green space tend to experience more moderate temperatures. The average home price tells you something about the wealth or economic well-being of the neighborhood's residents.

Your workflow will consist of the following steps:

1. You'll start with neighborhood rating data, which was collected from public records as part of a University of Richmond study on redlining policies (https://dsl.richmond.edu/panorama/redlining).
2. You'll then combine these data with satellite images, which give information about climate. These data come from the US Geological Survey (https://usgs.gov/).
3. Lastly, you'll merge these data with home prices from the real estate website, Zillow (https://zillow.com/).

> Note: The analysis you will perform is correlational, but the deeper research that inspired this problem tries to control for a variety of factors and suggests causal effects.

## Part 0: Setup

At a minimum, you will need the following modules in this problem. They include a new one we did not cover called geopandas. While it may be new to you, if you have mastered pandas, then you know almost everything you need to use geopandas. Anything else you need will be given to you as part of this problem, so don't be intimidated!

In [1]:
```python
import sys
print(f"* Python version: {sys.version}")

# Standard packages you know and love:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

import geopandas
print("* geopandas version:", geopandas.__version__)
```
```
* Python version: 3.7.5 (default, Dec 18 2019, 06:24:58) [GCC 5.5.0 20171010]
* geopandas version: 0.6.2
```

Run the next code cell, which will load some tools needed by the test cells.

In [2]:
```python
### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2
### END HIDDEN TESTS
from testing_tools import data_fn, load_geopandas, load_df, load_pickle
from testing_tools import f_ex0__sample_result
from testing_tools import f_ex1__sample_result
from testing_tools import f_ex2__sample_result
from testing_tools import f_ex3__sample_result
from testing_tools import f_ex4__sample_result
from testing_tools import f_ex5__sample_result
```

## Part 1: Neighborhood ratings

The neighborhood rating data is stored in a special extension of a pandas DataFrame called a GeoDataFrame. Let's load the data into a variable named `neighborhood_ratings` and have a peek at the first few rows:
In [3]:
```python
neighborhood_ratings = load_geopandas('fullDownload.geojson')
print(type(neighborhood_ratings))
neighborhood_ratings.head()
```
```
Opening geopandas data file, './resource/asnlib/publicdata/fullDownload.geojson' ...
<class 'geopandas.geodataframe.GeoDataFrame'>
```

Out[3]:

|   | state | city | name | holc_id | holc_grade | area_description_data | geometry |
|---|---|---|---|---|---|---|---|
| 0 | AL | Birmingham | Mountain Brook Estates and Country Club Garden... | A1 | A | {'5': 'Both sales and rental prices in 1929 we... | MULTIPOLYGON (((-86.75678 33.49754, -86.75692 ... |
| 1 | AL | Birmingham | Redmont Park, Rockridge Park, Warwick Manor, a... | A2 | A | {'5': 'Both sales and rental prices in 1929 we... | MULTIPOLYGON (((-86.75867 33.50933, -86.76093 ... |
| 2 | AL | Birmingham | Colonial Hills, Pine Crest (outside city limits) | A3 | A | {'5': 'Generally speaking, houses are not buil... | MULTIPOLYGON (((-86.75678 33.49754, -86.75196 ... |
| 3 | AL | Birmingham | Grove Park, Hollywood, Mayfair, and Edgewood s... | B1 | B | {'5': 'Both sales and rental prices in 1929 we... | MULTIPOLYGON (((-86.80111 33.48071, -86.80099 ... |
| 4 | AL | Birmingham | Best section of Woodlawn Highlands | B10 | B | {'5': 'Both sales and rental prices in 1929 we... | MULTIPOLYGON (((-86.74923 33.53332, -86.74916 ... |

Each row is a neighborhood. Its location is given by name, city, and a two-letter state abbreviation code (the `name`, `city`, and `state` columns, respectively). The rating assigned to a neighborhood is a letter, `'A'`, `'B'`, `'C'`, or `'D'`, given by the `holc_grade` column.

In addition, there is a special column called `geometry`. It contains a geographic outline of the boundaries of the neighborhood. Let's take a look at row 4 (the last row shown above):

In [4]:
```python
g4_example = neighborhood_ratings.loc[4, 'geometry']
print("* Type of `g4_example`:", type(g4_example))
print("\n* Contents of `g4_example`:", g4_example)
print("\n* A quick visual preview:")
display(g4_example)
```
```
* Type of `g4_example`: <class 'shapely.geometry.multipolygon.MultiPolygon'>

* Contents of `g4_example`: MULTIPOLYGON (((-86.749227 33.533325, -86.749156 33.530809, -86.75388599999999 33.529075, -86.754373 33.529382, -86.754729 33.529769, -86.754729 33.530294, -86.75604800000001 33.531225, -86.75539499999999 33.532008, -86.754456 33.532335, -86.753196 33.531483, -86.749714 33.533295, -86.749227 33.533325)))

* A quick visual preview:
[small rendered image of the multipolygon]
```

The output indicates that this boundary is stored in a special object type called a MultiPolygon. It is usually a single connected polygon, but may also be the union of multiple such polygons. The coordinates of the multipolygon's corners are floating-point values, and correspond to longitude and latitude values (https://www.latlong.net/). But for this notebook, the exact format won't be important. Simply treat the shapes as being specified in some way via a collection of two-dimensional coordinates measured in arbitrary units. Lastly, observe that calling display() on a MultiPolygon renders a small picture of it.
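If you'd like to experiment with these shapes outside the provided data, they are easy to build by hand. Below is a minimal sketch using shapely, the library behind geopandas geometries; the coordinates are made up purely for illustration:

```python
from shapely.geometry import MultiPolygon, Polygon

# Two made-up triangles in arbitrary (x, y) coordinates:
tri1 = Polygon([(0, 0), (1, 0), (0, 1)])
tri2 = Polygon([(2, 2), (3, 2), (2, 3)])
mp = MultiPolygon([tri1, tri2])

# Prints the same WKT-style text you saw for `g4_example`:
# MULTIPOLYGON (((0 0, 1 0, 0 1, 0 0)), ((2 2, 3 2, 2 3, 2 2)))
print(mp)
```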
### Exercise 0: Filtering ratings (2 points)

Complete the function,

```python
def filter_ratings(ratings, city_st, targets=None):
    ...
```

so that it filters ratings data by city and state name, along with a set of targeted letter grades. In particular, the inputs are:

- `ratings`: A geopandas GeoDataFrame similar to the `neighborhood_ratings` example above.
- `city_st`: The name of a city and two-letter state abbreviation as a string, e.g., `city_st = 'Atlanta, GA'` to request only rows for Atlanta, Georgia.
- `targets`: A Python set containing zero or more ratings, e.g., `targets = {'A', 'C'}` to request only rows having either an `'A'` grade or a `'C'` grade.

The function should return a copy of the input GeoDataFrame that has the same columns as `ratings` but only those rows that match both the desired `city_st` value and any one of the target ratings.

For example, suppose `ratings` is the following:

|   | city | state | holc_grade | holc_id | (... other cols not shown ...) | geometry |
|---|---|---|---|---|---|---|
| 0 | Chattanooga | TN | C | C4 | ... | MULTIPOLYGON(...) |
| 1 | Augusta | GA | C | C5 | ... | MULTIPOLYGON(...) |
| 2 | Chattanooga | TN | B | B7 | ... | MULTIPOLYGON(...) |
| 3 | Chattanooga | TN | A | A1 | ... | MULTIPOLYGON(...) |
| 4 | Augusta | GA | B | B4 | ... | MULTIPOLYGON(...) |
| 5 | Augusta | GA | D | D11 | ... | MULTIPOLYGON(...) |
| 6 | Augusta | GA | B | B1 | ... | MULTIPOLYGON(...) |
| 7 | Chattanooga | TN | D | D8 | ... | MULTIPOLYGON(...) |
| 8 | Chattanooga | TN | C | C7 | ... | MULTIPOLYGON(...) |

Then `filter_ratings(ratings, 'Chattanooga, TN', {'A', 'C'})` would return

|   | city | state | holc_grade | holc_id | (... other cols not shown ...) | geometry |
|---|---|---|---|---|---|---|
| 0 | Chattanooga | TN | C | C4 | ... | MULTIPOLYGON(...) |
| 3 | Chattanooga | TN | A | A1 | ... | MULTIPOLYGON(...) |
| 8 | Chattanooga | TN | C | C7 | ... | MULTIPOLYGON(...) |

All of these rows match `'Chattanooga, TN'` and have a `holc_grade` value of either `'A'` or `'C'`. Other columns, such as `holc_id` and any columns not shown, would be returned as-is from the original input.

> Note 0: We will test your function on a randomly generated data frame. The input is guaranteed to have the columns `'city'`, `'state'`, `'holc_grade'`, and `'geometry'`. However, it may have other columns with arbitrary names; your function should ensure these pass through unchanged, including their types.
>
> Note 1: Observe that `targets` may be None, which is the default value if unspecified by the caller. In this case, you should not filter by rating, but only by `city_st`. The `targets` variable may also be the empty set, in which case your function should return an empty GeoDataFrame.
>
> Note 2: You may return the rows in any order. We will use a function similar to `tibbles_are_equivalent` from Notebook 7 to determine if your output matches what we expect.

In [5]:
```python
def filter_ratings(ratings, city_st, targets=None):
    assert isinstance(ratings, geopandas.GeoDataFrame)
    assert isinstance(targets, set) or (targets is None)
    assert {'city', 'state', 'holc_grade', 'geometry'} <= set(ratings.columns)
    ### BEGIN SOLUTION
    matches_city_st = (ratings['city'] + ', ' + ratings['state']) == city_st
    matches_targets = ratings['holc_grade'].isin(targets or set()) | (targets is None)
    return ratings[matches_city_st & matches_targets]
    ### END SOLUTION
```
In [6]:
```python
# Demo cell
ex0_demo_result = filter_ratings(neighborhood_ratings, 'Atlanta, GA', targets={'A', 'C'})
print(type(ex0_demo_result), len(ex0_demo_result))
# Result: `<class 'geopandas.geodataframe.GeoDataFrame'> 51`
ex0_demo_result.sample(5)
```
```
<class 'geopandas.geodataframe.GeoDataFrame'> 51
```

Out[6]:

|   | state | city | name | holc_id | holc_grade | area_description_data | geometry |
|---|---|---|---|---|---|---|---|
| 1352 | GA | Atlanta | Section north of Fourteenth Street between Pie... | C6 | C | {'0': 'Atlanta, Georgia', '5': 'Property if ac... | MULTIPOLYGON (((-84.38755 33.79744, -84.38753 ... |
| 1330 | GA | Atlanta | Glenwood Ave. to Georgia R.R., between Morelan... | C24 | C | {'0': 'Atlanta, Georgia', '5': 'Property if ac... | MULTIPOLYGON (((-84.33963 33.73995, -84.34898 ... |
| 1328 | GA | Atlanta | East Lake (All in DeKalb County, but portion i... | C22 | C | {'0': '', '5': 'Property if acquired in this s... | MULTIPOLYGON (((-84.30035 33.75901, -84.30035 ... |
| 1348 | GA | Atlanta | Newer portion of Hapeville | C40 | C | {'0': 'Atlanta, Georgia', '5': 'Property if ac... | MULTIPOLYGON (((-84.41621 33.66343, -84.41614 ... |
| 1326 | GA | Atlanta | Oakhurst and older portion of Decatur (in DeKa... | C20 | C | {'0': 'Atlanta, Georgia', '5': 'Property, if a... | MULTIPOLYGON (((-84.28502 33.77148, -84.28360 ... |

In [7]:
```python
# Test cell: f_ex0__filter_ratings (2 points)

### BEGIN HIDDEN TESTS
def f_ex0__gen_soln(grade=None, fn_base="atl", fn_ext="geojson", overwrite=False):
    from testing_tools import file_exists, load_geopandas, save_geopandas
    if grade is None:
        fn = f"{fn_base}.{fn_ext}"
        targets = None
    else:
        fn = f"{fn_base}-{grade}.{fn_ext}"
        targets = {grade}
    if file_exists(fn) and not overwrite:
        gdf = load_geopandas(fn)
    else:  # not file_exists(fn) or overwrite
        gdf = filter_ratings(neighborhood_ratings, 'Atlanta, GA', targets=targets)
        save_geopandas(gdf, fn, overwrite=overwrite)
    return gdf

for g_ex0 in [None, 'A', 'B', 'C', 'D']:
    f_ex0__gen_soln(grade=g_ex0, overwrite=False)
### END HIDDEN TESTS

from testing_tools import f_ex0__check
print("Testing...")
for trial in range(125):
    f_ex0__check(filter_ratings)
filter_ratings__passed = True
print("\n(Passed!)")
```
```
Opening geopandas data file, './resource/asnlib/publicdata/atl.geojson' ...
Opening geopandas data file, './resource/asnlib/publicdata/atl-A.geojson' ...
Opening geopandas data file, './resource/asnlib/publicdata/atl-B.geojson' ...
Opening geopandas data file, './resource/asnlib/publicdata/atl-C.geojson' ...
Opening geopandas data file, './resource/asnlib/publicdata/atl-D.geojson' ...
Testing...

(Passed!)
```
**Sample result of `filter_ratings` (Exercise 0) for Atlanta.** If you had a working solution to Exercise 0, then in principle you could use it to visualize these neighborhoods, color-coded by grade, as the following cell does for 'Atlanta, GA'. Run this cell even if you did not complete Exercise 0.

In [8]:
```python
f_ex0__sample_result();  # The black "star" is Georgia Tech!
```
```
Opening geopandas data file, './resource/asnlib/publicdata/atl.geojson' ...
[color-coded map of Atlanta neighborhoods by 1930s grade]
```

### Bounding boxes

Recall that a geopandas dataframe includes a `geometry` column, which defines the geographic shape of each neighborhood using special multipolygon objects. To simplify some geometric calculations, a useful operation is to determine a multipolygon's bounding box, which is the smallest rectangle that encloses it.

Getting a bounding box is easy! For example, recall the neighborhood in row 4 of the `neighborhood_ratings` geopandas dataframe:

In [9]:
```python
g4_example = neighborhood_ratings.loc[4, 'geometry']
print("* Type of `g4_example`:", type(g4_example))
print("\n* Contents of `g4_example`:", g4_example)
print("\n* A quick visual preview:")
display(g4_example)
```

The bounding box is given to you by the multipolygon's `.bounds` attribute. This attribute is a Python 4-tuple (a tuple with four components) that encodes both the lower-left corner and the upper-right corner of the shape. Here is what that tuple looks like for the previous example:

In [10]:
```python
print("* Recall: `g4_example` ==", g4_example)
print("\n* ==> `g4_example.bounds` ==", g4_example.bounds)
```
```
* Recall: `g4_example` == MULTIPOLYGON (((-86.749227 33.533325, -86.749156 33.530809, -86.75388599999999 33.529075, -86.754373 33.529382, -86.754729 33.529769, -86.754729 33.530294, -86.75604800000001 33.531225, -86.75539499999999 33.532008, -86.754456 33.532335, -86.753196 33.531483, -86.749714 33.533295, -86.749227 33.533325)))

* ==> `g4_example.bounds` == (-86.756048, 33.529075, -86.749156, 33.533325)
```
The first two elements of the tuple are the smallest possible x-value and the smallest possible y-value among all points of the multipolygon. The last two elements are the largest x-value and y-value. If it's helpful, here is a plot that superimposes the bounding box on `g4_example`:

In [11]:
```python
# Draw the multipolygon as a solid gray line:
from testing_tools import plot_multipolygon, plot_bounding_box
plot_multipolygon(g4_example, color='gray')

# Add the bounding box as a dashed black line:
plot_bounding_box(g4_example.bounds, color='black', linestyle='--')
```

### Exercise 1: Bounding box of all neighborhoods (3 points)

Complete the function, `get_bounds(gdf)`, below, so that it returns the coordinates of a single bounding box for all neighborhoods in a given dataframe.

For example, suppose `gdf_ex1_demo` holds rows 3 and 4 of the `neighborhood_ratings` dataframe:

In [12]:
```python
gdf_ex1_demo = neighborhood_ratings.loc[[3, 4]]
gdf_ex1_demo
```

Out[12]:

|   | state | city | name | holc_id | holc_grade | area_description_data | geometry |
|---|---|---|---|---|---|---|---|
| 3 | AL | Birmingham | Grove Park, Hollywood, Mayfair, and Edgewood s... | B1 | B | {'5': 'Both sales and rental prices in 1929 we... | MULTIPOLYGON (((-86.80111 33.48071, -86.80099 ... |
| 4 | AL | Birmingham | Best section of Woodlawn Highlands | B10 | B | {'5': 'Both sales and rental prices in 1929 we... | MULTIPOLYGON (((-86.74923 33.53332, -86.74916 ... |

This dataframe has these bounds for each of the two rows:

In [13]:
```python
print(gdf_ex1_demo.loc[3, 'geometry'].bounds)
print(gdf_ex1_demo.loc[4, 'geometry'].bounds)
```
```
(-86.815458, 33.464794, -86.767064, 33.483678)
(-86.756048, 33.529075, -86.749156, 33.533325)
```

Therefore, the bounding box for `gdf_ex1_demo` is the smallest rectangle that covers both neighborhoods, or `(-86.815458, 33.464794, -86.749156, 33.533325)`. The next code cell illustrates the result.
In [14]:
```python
plot_multipolygon(gdf_ex1_demo.loc[3, 'geometry'], color='blue')
plot_bounding_box(gdf_ex1_demo.loc[3, 'geometry'].bounds, color='blue', linestyle=':')

plot_multipolygon(gdf_ex1_demo.loc[4, 'geometry'], color='gray')
plot_bounding_box(gdf_ex1_demo.loc[4, 'geometry'].bounds, color='gray', linestyle=':')

gdf_ex1_demo_bounding_box = (-86.815458, 33.464794, -86.749156, 33.533325)
plot_bounding_box(gdf_ex1_demo_bounding_box, color='black', linestyle='--')
```

The plot shows two multipolygons, along with the bounding box around each one as dotted lines. Your function should return a single bounding box for all multipolygons, which we show as the dashed black line that encloses both.

> Note 0: The test cell will use randomly generated input dataframes. Per the example above, your solution should only depend on the presence of a column named `'geometry'`, and should return a correct result no matter what other columns are present in the input.
>
> Note 1: We've provided a partial solution that handles the corner case of an empty input dataframe, so your solution can focus on dataframes having at least one row.

In [15]:
```python
def get_bounds(gdf):
    assert isinstance(gdf, geopandas.GeoDataFrame)
    if len(gdf) == 0:
        return None
    assert len(gdf) >= 1
    ### BEGIN SOLUTION
    return (gdf['geometry'].apply(lambda x: x.bounds[0]).min(),
            gdf['geometry'].apply(lambda x: x.bounds[1]).min(),
            gdf['geometry'].apply(lambda x: x.bounds[2]).max(),
            gdf['geometry'].apply(lambda x: x.bounds[3]).max())
    ### END SOLUTION
```

In [16]:
```python
# Demo cell
your_gdf_ex1_demo_bounding_box = get_bounds(gdf_ex1_demo)
print("Your result on the demo dataframe:", your_gdf_ex1_demo_bounding_box)
print("Expected result:", gdf_ex1_demo_bounding_box)
assert all([np.isclose(a, b) for a, b in zip(your_gdf_ex1_demo_bounding_box,
                                             gdf_ex1_demo_bounding_box)]), \
       "*** Your result does not match our example! ***"
print("Great -- so far, your result matches our expected result.")
```
```
Your result on the demo dataframe: (-86.815458, 33.464794, -86.749156, 33.533325)
Expected result: (-86.815458, 33.464794, -86.749156, 33.533325)
Great -- so far, your result matches our expected result.
```
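As an aside (a hedged alternative, not the exam's reference approach): geopandas can compute this aggregate bounding box directly. The `total_bounds` attribute of a geometry column returns an array `[minx, miny, maxx, maxy]` over all geometries, so it makes a handy one-line cross-check against your solution:

```python
# total_bounds aggregates the per-geometry bounds across the whole column:
print(tuple(gdf_ex1_demo['geometry'].total_bounds))
# Expect: (-86.815458, 33.464794, -86.749156, 33.533325)
```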
In [17]:
```python
# Test cell: f_ex1__get_bounds (3 points)

### BEGIN HIDDEN TESTS
def f_ex1__gen_soln(fn_base="atl-bb", fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_pickle, save_pickle
    fn = f"{fn_base}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        bounds = load_pickle(fn)
    else:
        gdf = f_ex0__gen_soln()
        bounds = get_bounds(gdf)
        save_pickle(bounds, fn)
    return bounds

f_ex1__gen_soln(overwrite=False)
### END HIDDEN TESTS

from testing_tools import f_ex1__check
print("Testing...")
for trial in range(250):
    f_ex1__check(get_bounds)
print("\n(Passed!)")
```
```
Opening pickle from './resource/asnlib/publicdata/atl-bb.pickle' ...
Testing...

(Passed!)
```

**Sample result of `get_bounds` (Exercise 1) for Atlanta.** If your function was working, then you could calculate the bounding box for Atlanta, which would be the following. Run this cell even if you did not complete Exercise 1.

In [18]:
```python
_, _, f_ex1__atl_bounds = f_ex1__sample_result();
print(f"Bounding box for Atlanta: {f_ex1__atl_bounds}")
```
```
Opening geopandas data file, './resource/asnlib/publicdata/atl.geojson' ...
Opening pickle from './resource/asnlib/publicdata/atl-bb.pickle' ...
Bounding box for Atlanta: (-84.457945, 33.637042, -84.254692, 33.869701)
```

## Part 2: Temperature analysis

We have downloaded satellite images that cover some of the cities in the `neighborhood_ratings` dataset. Each pixel of an image is the estimated temperature at the earth's surface. The images we downloaded were taken by the satellite on a summer day.

Here is an example of a satellite image that includes the Atlanta, Georgia neighborhoods used in earlier examples. The code cell below loads this image, draws it, and superimposes the Atlanta bounding box. The image is stored in the variable `sat_demo`. The geopandas dataframe for Atlanta is stored in `gdf_sat_demo`, and its bounding box in `bounds_sat_demo`.
In [19]:
```python
from testing_tools import load_satellite_image, plot_satellite_image

# Load a satellite image that includes the Atlanta area
sat_demo = load_satellite_image('LC08_CU_024013_20190808_20190822_C01_V01_ST--EPSG_4326.tif')

fig = plt.figure()
plot_satellite_image(sat_demo, ax=fig.gca())

# Add the bounding box for Atlanta
_, gdf_sat_demo, bounds_sat_demo = f_ex1__sample_result(do_plot=False);
plot_bounding_box(bounds_sat_demo, color='black', linestyle='dashed')
```
```
Opening satellite image, './resource/asnlib/publicdata/LC08_CU_024013_20190808_20190822_C01_V01_ST--EPSG_4326.tif' ...
Opening geopandas data file, './resource/asnlib/publicdata/atl.geojson' ...
Opening pickle from './resource/asnlib/publicdata/atl-bb.pickle' ...
[satellite image with the Atlanta bounding box superimposed]
```

**Masked images: merging the satellite and neighborhood data.** A really cool feature of a geopandas dataframe is that you can "intersect" its polygons with an image! We wrote a function called `mask_image_by_geodf(img, gdf)` that does this merging for you. It takes as input a satellite image, `img`, and a geopandas dataframe, `gdf`. It then clips the image to the bounding box of `gdf` and masks out all the pixels. By "masking," we mean that pixels falling within the multipolygon regions of `gdf` retain their original value; everything outside those regions gets a special "undefined" value.

Here is an example. First, let's call `mask_image_by_geodf` to generate the Numpy array, stored as `sat_demo_masked`:

In [20]:
```python
def mask_image_by_geodf(img, gdf):
    from json import loads
    from rasterio.mask import mask
    gdf_json = loads(gdf.to_json())
    gdf_coords = [f['geometry'] for f in gdf_json['features']]
    out_img, _ = mask(img, shapes=gdf_coords, crop=True)
    return out_img[0]

sat_demo_masked = mask_image_by_geodf(sat_demo, gdf_sat_demo)
print(sat_demo_masked.shape)
sat_demo_masked
```
```
(798, 698)
```

Out[20]:
```
array([[-9999, -9999, -9999, ..., -9999, -9999, -9999],
       [-9999, -9999, -9999, ..., -9999, -9999, -9999],
       [-9999, -9999, -9999, ..., -9999, -9999, -9999],
       ...,
       [-9999, -9999,  3167, ..., -9999, -9999, -9999],
       [-9999, -9999, -9999, ..., -9999, -9999, -9999],
       [-9999, -9999, -9999, ..., -9999, -9999, -9999]], dtype=int16)
```

The output shows the clipped result has a shape of 798 x 698 pixels, and the values are 16-bit integers (dtype=int16). The first thing you might notice is a bunch of values equal to -9999. That is the special value indicating that the given pixel falls outside of any neighborhood polygon. Any other integer is the estimated surface temperature in degrees Kelvin (https://en.wikipedia.org/wiki/Kelvin#2019_redefinition) multiplied by 10. For instance, consider the pixel with the value 3167 embedded in the sample output above. That is 3167 / 10 = 316.7 degrees Kelvin, which is 316.7 - 273.15 = 43.55 degrees Celsius. (That, in turn, is approximately (316.7 − 273.15) * 9/5 + 32 = 110.39 degrees Fahrenheit.) In our analysis, we'd like to inspect the average temperatures of the neighborhoods, ignoring the -9999 values.
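To make the unit conversion concrete, here is a tiny helper that redoes the arithmetic from the example above. (The function names are ours, for illustration only; they are not part of the exam's code.)

```python
def raw_to_celsius(raw):
    # Raw pixel values encode degrees Kelvin times 10.
    return raw / 10 - 273.15

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

c = raw_to_celsius(3167)                   # 43.55 degrees Celsius
print(round(c, 2), round(celsius_to_fahrenheit(c), 2))  # 43.55 110.39
```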
If it's helpful, here is a picture of that Numpy array. The dark regions correspond to the -9999 values that fall outside the neighborhoods of `gdf_sat_demo`; the bright ones indicate the presence of valid temperatures. If they appear to have the same color or shade, it's because the -9999 values make other "real" temperatures look nearly the same.

In [21]:
```python
plt.imshow(sat_demo_masked)
```
```
Out[21]: <matplotlib.image.AxesImage at 0x7f17e6ac85d0>
[mostly dark image; valid temperature pixels appear bright]
```

### Exercise 2: Cleaning masked images (2 points)

To help our analysis, your next task is to clean a masked image, converting its values to degrees Celsius. In particular, let `masked_array` be any Numpy array holding int16 values, where the value -9999 represents masked or missing values, and any other integer is a temperature in degrees Kelvin times 10. You should complete the function, `masked_to_degC(masked_array)`, so that it returns a new Numpy array having the same shape as `masked_array`, but with the following properties:

- The new array should hold floating-point values, not integers. That is, the new Numpy array should have `dtype=float`.
- Every -9999 value should be converted into a not-a-number (NaN) value.
- Any other integer value should be converted to degrees Celsius.

For instance, suppose `masked_array` is the following 2-D Numpy array:

```
[[-9999  2950 -9999]
 [-9999  3167  2014]
 [-9999  3075  3222]
 [ 2801 -9999  2416]]
```

Then the output array should have the following values:

```
[[   nan  21.85    nan]
 [   nan  43.55 -71.75]
 [   nan  34.35  49.05]
 [  6.95    nan -31.55]]
```

> Note 0: The simplest way to use a NaN value is through the predefined constant, np.nan (https://numpy.org/doc/stable/user/misc.html).
>
> Note 1: There are three demo cells. Two of them show plots in addition to input/output pairs, in case you work better with visual representations. In the plots, any NaN entries will appear as blanks (white space).
>
> Note 2: Your function must work for an input array of any dimension greater than or equal to 1. That is, it could be a 1-D array, a 2-D array (e.g., like true images), or 3-D or higher. Solutions that only work on 2-D arrays will only get half credit (one point instead of two).

In [22]:
```python
# Note:
print(np.nan)  # a single NaN value
```
```
nan
```
In [23]:
```python
def masked_to_degC(masked_array):
    assert isinstance(masked_array, np.ndarray)
    assert masked_array.ndim >= 1
    assert np.issubdtype(masked_array.dtype, np.integer)
    ### BEGIN SOLUTION
    new_array = masked_array.astype(float)
    new_array[new_array == -9999] = np.nan
    new_array *= 0.1
    new_array -= 273.15
    return new_array
    ### END SOLUTION
```

In [24]:
```python
# Demo cell 0:
img_ex2_demo = np.array([[-9999,  2950, -9999],
                         [-9999,  3167,  2014],
                         [-9999,  3075,  3222],
                         [ 2801, -9999,  2416]], dtype=np.int16)
img_ex2_demo_clean = masked_to_degC(img_ex2_demo)
print(img_ex2_demo)
print()
print(img_ex2_demo_clean)
plt.imshow(img_ex2_demo_clean)
plt.colorbar();
```
```
[[-9999  2950 -9999]
 [-9999  3167  2014]
 [-9999  3075  3222]
 [ 2801 -9999  2416]]

[[   nan  21.85    nan]
 [   nan  43.55 -71.75]
 [   nan  34.35  49.05]
 [  6.95    nan -31.55]]
```

In [25]:
```python
# Demo cell 1: Try a 1-D array. Expected output:
#   array([-260.85, nan, -227.55, -194.25, nan])
masked_to_degC(np.array([123, -9999, 456, 789, -9999], dtype=np.int16))
```
```
Out[25]: array([-260.85,    nan, -227.55, -194.25,    nan])
```
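An equivalent approach (a hedged alternative sketch, not the reference solution above) uses np.where to branch on the sentinel value in a single expression; like the reference, it works for arrays of any dimension:

```python
def masked_to_degC_alt(masked_array):
    # Promote to float so NaN is representable, then map the -9999
    # sentinel to NaN and convert everything else to degrees Celsius.
    arr = masked_array.astype(float)
    return np.where(arr == -9999, np.nan, arr / 10 - 273.15)
```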
In [26]:
```python
# Demo cell 2: Apply to the example satellite image
sat_demo_clean_ex2 = masked_to_degC(sat_demo_masked)
print(sat_demo_clean_ex2)
plt.imshow(sat_demo_clean_ex2);
```
```
[[  nan   nan   nan ...   nan   nan   nan]
 [  nan   nan   nan ...   nan   nan   nan]
 [  nan   nan   nan ...   nan   nan   nan]
 ...
 [  nan   nan 43.55 ...   nan   nan   nan]
 [  nan   nan   nan ...   nan   nan   nan]
 [  nan   nan   nan ...   nan   nan   nan]]
```

In [27]:
```python
# Test cell 0: f_ex2__masked_to_degC_2d (1 point)

### BEGIN HIDDEN TESTS
def f_ex2__gen_soln(fn_base="atl-masked-cleaned", fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_pickle, save_pickle
    fn = f"{fn_base}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        img_clean = load_pickle(fn)
    else:  # not file_exists(fn) or overwrite
        gdf = f_ex0__gen_soln()
        img = load_satellite_image('LC08_CU_024013_20190808_20190822_C01_V01_ST--EPSG_4326.tif')
        img_masked = mask_image_by_geodf(img, gdf)
        img_clean = masked_to_degC(img_masked)
        save_pickle(img_clean, fn)
    return img_clean

f_ex2__gen_soln(overwrite=False)
### END HIDDEN TESTS

from testing_tools import f_ex2__check
print("Testing...")
for trial in range(250):
    f_ex2__check(masked_to_degC, ndim=2)
masked_to_degC__passed_2d = True
print("\n(Passed the 2-D case!)")
```
```
Opening pickle from './resource/asnlib/publicdata/atl-masked-cleaned.pickle' ...
Testing...

(Passed the 2-D case!)
```

In [28]:
```python
# Test cell 1: f_ex2__masked_to_degC_nd (1 point)
from testing_tools import f_ex2__check
print("Testing...")
for trial in range(250):
    f_ex2__check(masked_to_degC, ndim=None)
print("\n(Passed the any-D case!)")
```
```
Testing...

(Passed the any-D case!)
```
**Sample result of `masked_to_degC` (Exercise 2) on the Atlanta data.** A correct implementation of `masked_to_degC` would, when applied to the Atlanta data, produce a masked image resembling what follows. Run this cell even if you did not complete Exercise 2.

In [29]:
```python
sat_demo_clean = f_ex2__sample_result();
```
```
Opening pickle from './resource/asnlib/publicdata/atl-masked-cleaned.pickle' ...
[masked temperature image of the Atlanta neighborhoods]
```

### Exercise 3: Average temperature (2 points)

Suppose you are given `masked_array`, a Numpy array of masked floating-point temperatures like that produced by `masked_to_degC` in Exercise 2. That is, it has floating-point temperature values except at "masked" entries, which are marked by NaN values. Complete the function `mean_temperature(masked_array)` so that it returns the mean temperature value over all pixels, ignoring any NaNs.

For example, suppose `masked_array` equals the Numpy array,

```
[[   nan  21.85    nan]
 [   nan  43.55 -71.75]
 [   nan  34.35  49.05]
 [  6.95    nan -31.55]]
```

where the values are in degrees Celsius. Then `mean_temperature(masked_array)` would equal (21.85 + 43.55 − 71.75 + 34.35 + 49.05 + 6.95 − 31.55) / 7, which is approximately 7.49 degrees Celsius.

> Note 0: Your approach should work for an input array of any dimension. You'll get partial credit (1 point) if it works for 2-D input arrays, and full credit (2 points) if it works for arrays of all dimensions.
>
> Note 1: If all input values are NaN values, then your function should return NaN.

In [30]:
```python
def mean_temperature(masked_array):
    assert isinstance(masked_array, np.ndarray)
    assert np.issubdtype(masked_array.dtype, np.floating)
    ### BEGIN SOLUTION
    return np.nanmean(masked_array)
    ### END SOLUTION
```

In [31]:
```python
# Demo cell 0:
img_ex3_demo_clean = np.array([[np.nan,  21.85, np.nan],
                               [np.nan,  43.55, -71.75],
                               [np.nan,  34.35,  49.05],
                               [  6.95, np.nan, -31.55]])
mean_temperature(img_ex3_demo_clean)  # Expected result: ~ 7.49
```
```
Out[31]: 7.492857142857143
```
In [32]:
```python
# Demo cell 1: Check the 1-D case, as an example (expected output is roughly -227.55)
mean_temperature(np.array([-260.85, np.nan, -227.55, -194.25, np.nan]))
```
```
Out[32]: -227.55000000000004
```

In [33]:
```python
# Demo cell 2: Mean temperature in Atlanta (a.k.a., "Hotlanta!")
mean_temperature(sat_demo_clean)
```
```
Out[33]: 39.24372126540902
```

In [34]:
```python
# Test cell 0: f_ex3__mean_temperature_2d (1 point)

### BEGIN HIDDEN TESTS
def f_ex3__gen_soln(grade=None, fn_base="atl-temp", fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_pickle, save_pickle
    if grade is None:
        fn = f"{fn_base}.{fn_ext}"
    else:
        fn = f"{fn_base}-{grade}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        temperature = load_pickle(fn)
    else:  # not file_exists(fn) or overwrite
        gdf = f_ex0__gen_soln(grade=grade)
        img = load_satellite_image('LC08_CU_024013_20190808_20190822_C01_V01_ST--EPSG_4326.tif')
        img_masked = mask_image_by_geodf(img, gdf)
        img_clean = masked_to_degC(img_masked)
        temperature = mean_temperature(img_clean)
        save_pickle(temperature, fn)
    return temperature

for g_ex2 in [None, 'A', 'B', 'C', 'D']:
    f_ex3__gen_soln(grade=g_ex2, overwrite=False)
### END HIDDEN TESTS

from testing_tools import f_ex3__check
print("Testing...")
for trial in range(250):
    f_ex3__check(mean_temperature, ndim=2)
mean_temperature__passed_2d = True
print("\n(Passed the 2-D case!)")
```
```
Opening pickle from './resource/asnlib/publicdata/atl-temp.pickle' ...
Opening pickle from './resource/asnlib/publicdata/atl-temp-A.pickle' ...
Opening pickle from './resource/asnlib/publicdata/atl-temp-B.pickle' ...
Opening pickle from './resource/asnlib/publicdata/atl-temp-C.pickle' ...
Opening pickle from './resource/asnlib/publicdata/atl-temp-D.pickle' ...
Testing...

(Passed the 2-D case!)
/usr/lib/python3.7/site-packages/ipykernel_launcher.py:5: RuntimeWarning: Mean of empty slice
  """
```

In [35]:
```python
# Test cell 1: f_ex3__mean_temperature_nd (1 point)
from testing_tools import f_ex3__check
print("Testing...")
for trial in range(250):
    f_ex3__check(mean_temperature, ndim=None)
print("\n(Passed the N-D case!)")
```
```
Testing...

(Passed the N-D case!)
/usr/lib/python3.7/site-packages/ipykernel_launcher.py:5: RuntimeWarning: Mean of empty slice
  """
```
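The "Mean of empty slice" RuntimeWarning above is np.nanmean's way of flagging an all-NaN input (the corner case from Note 1); it still returns NaN as required. If the noise bothers you, one way to silence just that warning is sketched below (our own convenience wrapper; the tests do not require it):

```python
import warnings

def mean_temperature_quiet(masked_array):
    # Suppress only the RuntimeWarning that np.nanmean emits for
    # all-NaN inputs; the NaN return value is unchanged.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(masked_array)

print(mean_temperature_quiet(np.array([np.nan, np.nan])))  # nan, without the warning
```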
**Sample result of `mean_temperature` (Exercise 3) for Atlanta.** If all of your code were working up until now, you could analyze the average temperature in each type of neighborhood by rating. You would see the result below. It shows that there is an observable difference in temperature based on the rating of the neighborhood: a difference of 5 to 6 degrees Celsius is about 10 degrees Fahrenheit. Run this cell even if you did not complete Exercise 3.

In [36]:
```python
f_ex3__sample_result();
```
```
Average temperatures in Atlanta during some summer day:
* Overall: ~ 39.2 degrees Celsius (~ 102.6 deg F)
* In 1930s 'A'-rated neighborhoods: ~ 35.3 degrees Celsius (~ 95.6 deg F)
* In 1930s 'B'-rated neighborhoods: ~ 37.6 degrees Celsius (~ 99.7 deg F)
* In 1930s 'C'-rated neighborhoods: ~ 39.3 degrees Celsius (~ 102.8 deg F)
* In 1930s 'D'-rated neighborhoods: ~ 41.5 degrees Celsius (~ 106.7 deg F)
```

## Part 3: Real estate data

The last piece of data we'll incorporate is real estate data. Here is the raw data:

In [37]:
```python
home_prices = load_df("Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv")  # From Zillow
print("\nColumns:\n", home_prices.columns, "\n")
home_prices.head(3)
```
```
Reading a regular pandas dataframe from './resource/asnlib/publicdata/Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv' ...

Columns:
 Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       'State', 'City', 'Metro', 'CountyName', '1996-01-31',
       ...
       '2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31',
       '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30', '2020-10-31'],
      dtype='object', length=307)
```

Out[37]:

|   | RegionID | SizeRank | RegionName | RegionType | StateName | State | City | Metro | CountyName | 1996-01-31 | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 61639 | 0 | 10025 | Zip | NY | NY | New York | New York-Newark-Jersey City | New York County | 223469.0 | ... |
| 1 | 84654 | 1 | 60657 | Zip | IL | IL | Chicago | Chicago-Naperville-Elgin | Cook County | 205864.0 | ... |
| 2 | 61637 | 2 | 10023 | Zip | NY | NY | New York | New York-Newark-Jersey City | New York County | 227596.0 | ... |

3 rows × 307 columns

This dataframe has a lot of information, but here are the elements you need:

- Each row gives historical average home price estimates for a different area of the United States. The areas are uniquely identified by their 5-digit zip code, stored as integers in the `'RegionName'` column. Zip codes are areas that are different from the neighborhoods you'd been considering previously.
- The city and two-letter state abbreviations are given by the `'City'` and `'State'` columns. Their values match the city and state abbreviations you've seen in the other data.
- The home price estimates appear in the columns whose names are date strings in the format `'yyyy-mm-dd'`.
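Because the date columns use the fixed-width `yyyy-mm-dd` format, plain string comparison orders them chronologically. Here is a minimal sketch for locating the most recent one, anticipating Exercise 4 (the regex is our own choice, not mandated by the exam):

```python
# Date-named columns sort chronologically as strings, since the
# format is fixed-width yyyy-mm-dd:
date_cols = home_prices.filter(regex=r'^\d{4}-\d{2}-\d{2}$').columns
print(max(date_cols))  # '2020-10-31'
```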
### Exercise 4: Cleaning the dataframe (2 points)

Given a regular pandas dataframe `df` formatted like `home_prices` above, complete the function `clean_zip_prices(df)` so that it returns a new dataframe containing the following columns:

- `'ZipCode'`: The 5-digit zip code, taken from the `'RegionName'` column and stored as integers.
- `'City'`: The city name, taken directly from `'City'`.
- `'State'`: The two-letter state abbreviation, taken directly from `'State'`.
- `'Price'`: The home price, taken from the latest (most recent) date column and stored as floating-point values. In `home_prices`, the latest or most recent date is `'2020-10-31'`; therefore, the `'Price'` column of the output would contain the values from this column.

For example, suppose `df` is the following:

|   | RegionID | SizeRank | RegionName | RegionType | StateName | State | City | Metro | CountyName | 1996-01-31 | 2020-09-30 | 2020-10-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98046 | 6533 | 95212 | Zip | CA | CA | Stockton | Stockton-Lodi | San Joaquin County | nan | 424606 | 430334 |
| 1 | 68147 | 16308 | 24445 | Zip | VA | VA | Hot Springs | nan | Bath County | nan | 138424 | 138496 |
| 2 | 84364 | 3748 | 60110 | Zip | IL | IL | Carpentersville | Chicago-Naperville-Elgin | Kane County | 138980 | 178311 | 179852 |

Then your function would return:

|   | ZipCode | City | State | Price |
|---|---|---|---|---|
| 0 | 95212 | Stockton | CA | 430334.0 |
| 1 | 24445 | Hot Springs | VA | 138496.0 |
| 2 | 60110 | Carpentersville | IL | 179852.0 |

> Note 0: We will test your code on randomly generated input dataframes. Therefore, your solution should only depend on the existence of the columns `'RegionName'`, `'City'`, `'State'`, and at least one column whose name is formatted as a date string (yyyy-mm-dd). Any other columns may have different names from what is shown above and, in any case, are immaterial to your solution.
>
> Note 1: A helpful function for searching for column names matching a given pattern is df.filter (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html).
>
> Note 2: Row ordering does not matter, since we will use a `tibbles_are_equivalent`-type function to check for dataframe equivalence.

In [38]:
```python
def clean_zip_prices(df):
    assert isinstance(df, pd.DataFrame)
    ### BEGIN SOLUTION
    last_date = sorted(df.filter(regex=r'\d{4}-\d{2}-\d{2}', axis=1).columns)[-1]
    df_new = df[['RegionName', 'City', 'State', last_date]]
    df_new = df_new.rename(columns={'RegionName': 'ZipCode', last_date: 'Price'})
    df_new['ZipCode'] = df_new['ZipCode'].astype(int)
    df_new['Price'] = df_new['Price'].astype(float)
    return df_new
    ### END SOLUTION
```
In [39]:
```python
# Demo cell
clean_zip_prices(home_prices)
```

Out[39]:

|   | ZipCode | City | State | Price |
|---|---|---|---|---|
| 0 | 10025 | New York | NY | 1073416.0 |
| 1 | 60657 | Chicago | IL | 492585.0 |
| 2 | 10023 | New York | NY | 1152889.0 |
| 3 | 77494 | Katy | TX | 347871.0 |
| 4 | 60614 | Chicago | IL | 629989.0 |
| ... | ... | ... | ... | ... |
| 30225 | 47865 | Carlisle | IN | 44241.0 |
| 30226 | 20052 | Washington | DC | 1343080.0 |
| 30227 | 801 | Charlotte Amalie | UT | 30100.0 |
| 30228 | 820 | Choudrant | LA | 191183.0 |
| 30229 | 822 | Choudrant | LA | 190667.0 |

30230 rows × 4 columns

In [40]:
```python
# Test cell: f_ex4__clean_zip_prices (2 points)

### BEGIN HIDDEN TESTS
def f_ex4__gen_soln(fn_base="zip-prices", fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_df, load_pickle, save_pickle
    fn = f"{fn_base}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        df_clean = load_pickle(fn)
    else:  # not file_exists(fn) or overwrite
        df = load_df("Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv")  # From Zillow
        df_clean = clean_zip_prices(df)
        save_pickle(df_clean, fn)
    return df_clean

f_ex4__gen_soln(overwrite=False)
### END HIDDEN TESTS

from testing_tools import f_ex4__check
print("Testing...")
for trial in range(250):
    f_ex4__check(clean_zip_prices)
print("\n(Passed!)")
```
```
Opening pickle from './resource/asnlib/publicdata/zip-prices.pickle' ...
Testing...

(Passed!)
```

**Sample result of `clean_zip_prices` (Exercise 4).** A successful implementation of Exercise 4 would produce a cleaned dataframe for `home_prices` as shown below. Run this cell even if you did not complete Exercise 4.
In [41]:
```python
home_prices_clean = f_ex4__sample_result()
home_prices_clean.head()
```
```
Opening pickle from './resource/asnlib/publicdata/zip-prices.pickle' ...
```

Out[41]:

|   | ZipCode | City | State | Price |
|---|---|---|---|---|
| 0 | 10025 | New York | NY | 1073416.0 |
| 1 | 60657 | Chicago | IL | 492585.0 |
| 2 | 10023 | New York | NY | 1152889.0 |
| 3 | 77494 | Katy | TX | 347871.0 |
| 4 | 60614 | Chicago | IL | 629989.0 |

### Zip code boundaries

To merge the home prices with the neighborhood rating information, we need the geographic boundaries of the zip codes. The following code loads a geopandas dataframe with this information:

In [42]:
```python
zip_geo = load_pickle('tl_2017_us_zcta510.pickle')
zip_geo.head(3)
```
```
Opening pickle from './resource/asnlib/publicdata/tl_2017_us_zcta510.pickle' ...
```

Out[42]:

|   | GEOID10 | geometry |
|---|---|---|
| 0 | 43451 | POLYGON ((-83.70873 41.32733, -83.70815 41.327... |
| 1 | 43452 | POLYGON ((-83.08698 41.53780, -83.08256 41.537... |
| 2 | 43456 | MULTIPOLYGON (((-82.83558 41.71082, -82.83515 ... |

This dataframe has just two columns: the zip code, stored as a string in the column named `'GEOID10'`, and `'geometry'`, which holds the shape of the zip code's area. Being stored in a geopandas dataframe, each zip code's boundary can be visualized easily and a bounding box computed, as the code cell below demonstrates.

> Note 0: Zip codes in this dataframe are stored as strings, rather than as integers as in the pricing dataframe.
>
> Note 1: The sample zip code visualized by the following code cell is a bit unusual in that it consists of three spatially disconnected regions. However, that won't matter. Just note that each zip code is associated with some shape, just like the neighborhoods of the 1930s ratings data.

In [43]:
```python
plot_multipolygon(zip_geo.loc[2, 'geometry'], color='blue')
plot_bounding_box(zip_geo.loc[2, 'geometry'].bounds, color='black', linestyle='dashed')
```
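One subtlety worth keeping in mind before the merge (a reminder, not exam content): converting zip codes from strings to integers drops leading zeros, so `'00801'` becomes `801`. Matching therefore works only if both sides are converted consistently:

```python
print(int('00801'))        # 801 -- the leading zeros are gone
print(str(801).zfill(5))   # '00801' -- zfill restores them if ever needed
```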
### Exercise 5 (last one!): Merging price and geographic boundaries (2 points)

Complete the function, `merge_prices_with_geo(prices_clean, zip_gdf)`, so that it merges price information stored in `prices_clean` with geographic boundaries stored in `zip_gdf`. The `prices_clean` object is a pandas dataframe with four columns, `'ZipCode'`, `'City'`, `'State'`, and `'Price'`, as would be produced by `clean_zip_prices` (Exercise 4). The `zip_gdf` input is a geopandas dataframe with two columns, `'GEOID10'` and `'geometry'`. Your function should return a new geopandas dataframe with five columns: `'ZipCode'`, `'City'`, `'State'`, `'Price'`, and `'geometry'`.

> Note 0: Recall that the `'ZipCode'` column of `prices_clean` stores values as integers, whereas the `'GEOID10'` column of `zip_gdf` stores values as strings. In your final result, store the `'ZipCode'` column using integer values.
>
> Note 1: We are only interested in zip codes with both price information and known geographic boundaries. That is, if a zip code is missing in either `prices_clean` or `zip_gdf`, you should ignore and omit it from your output.
>
> Note 2: If `df` is a pandas dataframe, you can convert it to a geopandas one simply by calling `geopandas.GeoDataFrame(df)`.

In [44]:
```python
def merge_prices_with_geo(prices_clean, zip_gdf):
    assert isinstance(prices_clean, pd.DataFrame)
    assert isinstance(zip_gdf, geopandas.GeoDataFrame)
    ### BEGIN SOLUTION
    zip_gdf_int = zip_gdf.copy()
    zip_gdf_int['ZipCode'] = zip_gdf_int['GEOID10'].astype(int)
    prices_gdf = geopandas.GeoDataFrame(prices_clean)
    return prices_gdf.merge(zip_gdf_int[['ZipCode', 'geometry']], on='ZipCode')
    ### END SOLUTION
```

In [45]:
```python
# Demo cell
merge_prices_with_geo(home_prices_clean, zip_geo).head(3)
```

Out[45]:

|   | ZipCode | City | State | Price | geometry |
|---|---|---|---|---|---|
| 0 | 10025 | New York | NY | 1073416.0 | POLYGON ((-73.97701 40.79281, -73.97695 40.792... |
| 1 | 60657 | Chicago | IL | 492585.0 | POLYGON ((-87.67850 41.94504, -87.67802 41.945... |
| 2 | 10023 | New York | NY | 1152889.0 | POLYGON ((-73.99015 40.77231, -73.98992 40.773... |
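A design note on the solution sketch above: pandas' merge defaults to an inner join, which is exactly the behavior Note 1 asks for, since zip codes present on only one side are dropped automatically. A toy illustration:

```python
left = pd.DataFrame({'ZipCode': [10025, 99999], 'Price': [1.0, 2.0]})
right = pd.DataFrame({'ZipCode': [10025, 60657], 'geometry': ['g0', 'g1']})
print(left.merge(right, on='ZipCode'))  # only 10025 survives the inner join
```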
In [47]:
```python
# Test cell: f_ex5__merge_prices_with_geo (2 points)

### BEGIN HIDDEN TESTS
def f_ex5__gen_soln(fn_base="prices-geo", fn_ext="pickle", overwrite=False):
    from testing_tools import file_exists, load_df, load_pickle, save_pickle
    fn = f"{fn_base}.{fn_ext}"
    if file_exists(fn) and not overwrite:
        result = load_pickle(fn)
    else:  # not file_exists(fn) or overwrite
        prices = f_ex4__sample_result()
        geo = load_pickle('tl_2017_us_zcta510.pickle')
        prices_geo = merge_prices_with_geo(prices, geo)
        neighborhood_ratings = load_geopandas('fullDownload.geojson')
        result = geopandas.overlay(neighborhood_ratings,
                                   prices_geo[['ZipCode', 'Price', 'geometry']],
                                   how='intersection')
        save_pickle(result, fn)
    return result

f_ex5__gen_soln(overwrite=False)
### END HIDDEN TESTS

from testing_tools import f_ex5__check
print("Testing...")
for trial in range(250):
    f_ex5__check(merge_prices_with_geo)
print("\n(Passed!)")
```
```
Opening pickle from './resource/asnlib/publicdata/prices-geo.pickle' ...
Testing...

(Passed!)
```

## Part 4: Fin! (Epilogue and optional wrap-up)

There are no additional required exercises: you've reached the end of the final exam and, therefore, of the class! Don't forget to restart and run all cells again to make sure everything works when run in sequence, and make sure your work passes the submission process. Good luck!

The code cells below provide a bit more supplementary information and analysis. If you've finished early and want to try an interesting analysis, give the optional "Exercise 6," below, a shot!

**Sample result of `merge_prices_with_geo` (Exercise 5).** One incredibly cool feature of geopandas is that it can do spatial (geographic) queries for you. For instance, let's merge the neighborhood rating data with the housing price data. The geopandas merging routines will account for how the geographic zones in one dataframe intersect with those of the other. Visually, imagine laying the two "geographies" on top of one another. (The shading corresponds with house prices by zip code region, and the hollow polygons correspond to neighborhoods.) If a neighborhood overlaps with two zip codes, the merge can create two rows in the output, one for each (neighborhood, zip code) combination. That allows you to run subsequent queries, like examining the relationship between rating and price.

We have carried out this merge for you. Run the cell below to load that precomputed result into a geopandas dataframe named `neighborhood_prices` (as opposed to the original, `neighborhood_ratings`).
In [48]:
```python
neighborhood_prices = f_ex5__sample_result()
neighborhood_prices.head()
```
```
Opening pickle from './resource/asnlib/publicdata/prices-geo.pickle' ...
```

Out[48]:

|   | state | city | name | holc_id | holc_grade | area_description_data | ZipCode | Price | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AL | Birmingham | Mountain Brook Estates and Country Club Garden... | A1 | A | {'5': 'Both sales and rental prices in 1929 we... | 35223 | 610462.0 | MULTIPOLYGON (((-86.76062 33.49298, -86.76202 ... |
| 1 | AL | Birmingham | Redmont Park, Rockridge Park, Warwick Manor, a... | A2 | A | {'5': 'Both sales and rental prices in 1929 we... | 35223 | 610462.0 | MULTIPOLYGON (((-86.77425 33.49538, -86.77430 ... |
| 2 | AL | Birmingham | Colonial Hills, Pine Crest (outside city limits) | A3 | A | {'5': 'Generally speaking, houses are not buil... | 35223 | 610462.0 | POLYGON ((-86.75454 33.48883, -86.76227 33.488... |
| 3 | AL | Birmingham | Grove Park, Hollywood, Mayfair, and Edgewood s... | B1 | B | {'5': 'Both sales and rental prices in 1929 we... | 35223 | 610462.0 | MULTIPOLYGON (((-86.77070 33.47585, -86.76970 ... |
| 4 | AL | Birmingham | First Addition to South Highlands | B3 | B | {'5': 'Both sales and rental prices in 1929 we... | 35223 | 610462.0 | POLYGON ((-86.77843 33.49367, -86.77847 33.490... |

If we consider just the Atlanta area, here is how today's average house price varies by that original 1930s neighborhood rating. This code cell requires a working solution to Exercise 0.

In [49]:
```python
from testing_tools import f_ex5b__sample_result
if 'filter_ratings__passed' in globals() and filter_ratings__passed:
    f_ex5b__sample_result(neighborhood_prices, 'Atlanta, GA', filter_ratings)  # Try other cities!
else:
    print("This code cell was not run because it needs a working version of `filter_ratings` from Exercise 0.")
```
```
Average house price in Atlanta, GA:
* Overall: ~ $398,127.40
* In 1930s 'A'-rated neighborhoods: ~ $589,818.12
* In 1930s 'B'-rated neighborhoods: ~ $445,985.37
* In 1930s 'C'-rated neighborhoods: ~ $393,349.46
* In 1930s 'D'-rated neighborhoods: ~ $317,403.05
```

### OPTIONAL Exercise 6: Put it all together (no points)

We've provided satellite images for not just Atlanta, but all the cities defined in the dictionary below. Use all of your code from earlier exercises to repeat the Atlanta analysis for all other neighborhoods. In particular, construct a table that shows, for each city, the percent differences in mean temperature and house price among the A/B/C/D-rated neighborhoods. Then try running a multiple regression (Notebook 12) to see how the 1930s rating predicts them.

> Hint: To conduct the regression analysis, you'll want to "dummy-code" the ratings variable, since it is categorical (A, B, C, and D values). See this explanation (https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/#DUMMYCODING) if you aren't familiar with this practice. When you construct the data matrix, pandas's pd.get_dummies() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function is a good tool; a small example follows.
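To make that hint concrete, here is a minimal example of dummy-coding a categorical ratings column with pd.get_dummies (toy data of our own, not the exam's):

```python
toy = pd.DataFrame({'ratings': ['A', 'C', 'D', 'B']})
print(pd.get_dummies(toy['ratings'], prefix='ratings'))
#    ratings_A  ratings_B  ratings_C  ratings_D
# 0          1          0          0          0
# 1          0          0          1          0
# 2          0          0          0          1
# 3          0          1          0          0
```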
In [50]:
```python
# Cities and their available satellite images
satellite_image_data = {
      'Birmingham, AL': 'LC08_CU_022014_20200614_20200628_C01_V01_ST--EPSG_4326.tif'
    , 'Los Angeles, CA': 'LC08_CU_003012_20200703_20200709_C01_V01_ST--EPSG_4326.tif'
    , 'Denver, CO': 'LC08_CU_012009_20200812_20200824_C01_V01_ST--EPSG_4326.tif'  # incomplete
    , 'New Haven, CT': 'LC08_CU_029006_20200613_20200627_C01_V01_ST--EPSG_4326.tif'  # incomplete
    , 'Jacksonville, FL': 'LC08_CU_026016_20160709_20190430_C01_V01_ST--EPSG_4326.tif'  # mostly complete
    , 'Atlanta, GA': 'LC08_CU_024013_20190808_20190822_C01_V01_ST--EPSG_4326.tif'
    , 'Chicago, IL': 'LC08_CU_021007_20160624_20181205_C01_V01_ST--EPSG_4326.tif'  # incomplete (multi-tile)
    , 'Indianapolis, IN': 'LC08_CU_022009_20200824_20200907_C01_V01_ST--EPSG_4326.tif'
    , 'Louisville, KY': 'LC08_CU_023010_20200817_20200825_C01_V01_ST--EPSG_4326.tif'  # incomplete (multi-tile)
    , 'New Orleans, LA': 'LC08_CU_020016_20200612_20200627_C01_V01_ST--EPSG_4326.tif'
    , 'Boston, MA': 'LC08_CU_030006_20180719_20190614_C01_V01_ST--EPSG_4326.tif'  # incomplete (multi-tile)
    , 'Baltimore, MD': 'LC08_CU_028008_20190812_20190822_C01_V01_ST--EPSG_4326.tif'  # mostly complete
    , 'Detroit, MI': 'LC08_CU_024007_20190714_20190723_C01_V01_ST--EPSG_4326.tif'  # mostly complete
    , 'Minneapolis, MN': 'LC08_CU_018005_20190613_20190621_C01_V01_ST--EPSG_4326.tif'
    , 'St.Louis, MO': 'LC08_CU_020010_20180723_20190614_C01_V01_ST--EPSG_4326.tif'
    , 'Charlotte, NC': 'LC08_CU_026012_20180823_20190614_C01_V01_ST--EPSG_4326.tif'
    , 'Bergen Co., NJ': 'LC08_CU_029007_20190830_20190919_C01_V01_ST--EPSG_4326.tif'
    , 'Manhattan, NY': 'LC08_CU_029007_20190830_20190919_C01_V01_ST--EPSG_4326.tif'  # same tile as Bergen Co., NJ!
    , 'Brooklyn, NY': 'LC08_CU_029007_20190830_20190919_C01_V01_ST--EPSG_4326.tif'  # same tile as Bergen Co., NJ!
    , 'Columbus, OH': 'LC08_CU_024009_20190824_20190908_C01_V01_ST--EPSG_4326.tif'  # partial
    , 'Portland, OR': 'LE07_CU_003003_20190813_20190910_C01_V01_ST--EPSG_4326.tif'
    , 'Philadelphia, PA': 'LC08_CU_028008_20190720_20190803_C01_V01_ST--EPSG_4326.tif'
    , 'Nashville, TN': 'LC08_CU_022012_20190603_20190621_C01_V01_ST--EPSG_4326.tif'
    , 'Dallas, TX': 'LC08_CU_016014_20190816_20190906_C01_V01_ST--EPSG_4326.tif'
    , 'Richmond, VA': 'LC08_CU_027010_20190727_20190803_C01_V01_ST--EPSG_4326.tif'
    , 'Seattle, WA': 'LC08_CU_003002_20190828_20190905_C01_V01_ST--EPSG_4326.tif'  # partial
    , 'Milwaukee Co., WI': 'LC08_CU_021007_20180801_20190614_C01_V01_ST--EPSG_4326.tif'
    , 'Charleston, WV': 'LC08_CU_025010_20190817_20190905_C01_V01_ST--EPSG_4326.tif'
}
```
In [51]:
```python
# Develop your analysis here!
### BEGIN SOLUTION
# Merges neighborhood ratings, a satellite image with temperatures, and Zillow data
# for a given city and (optionally) target rating. Returns the mean temperature
# and house price.
def merge_city(neighborhood_ratings, satimg, neighborhood_prices, city_st, targets=None):
    ratings = filter_ratings(neighborhood_ratings, city_st, targets)
    satimg_clean = mask_image_by_geodf(satimg, ratings)
    degC = mean_temperature(masked_to_degC(satimg_clean))
    prices = filter_ratings(neighborhood_prices, city_st, targets)
    price = prices['Price'].mean()
    return degC, price

# Merges data for every city having ratings, a temperature satellite image, and prices
def merge_all_data(ratings, images, prices):
    all_rows = []
    for city_st, satimg_filename in images.items():
        print(f"Processing {city_st} [image={satimg_filename}] ...")
        satimg = load_satellite_image(satimg_filename, verbose=False)
        degC_overall, price_overall = merge_city(ratings, satimg, prices, city_st)
        for rating in ['A', 'B', 'C', 'D']:
            degC_r, price_r = merge_city(ratings, satimg, prices, city_st, targets={rating})
            delta_degC = round(degC_r - degC_overall, 1)
            price_percent = round((price_r - price_overall) / price_overall * 100.0, 1)
            all_rows.append((city_st, rating, delta_degC, price_percent))
    return pd.DataFrame(all_rows, columns=['city_st', 'ratings', 'delta_degC', '%price'])

# Construct a data matrix from a table with a single categorical predictor
def build_matrix_dataframe(df, continuous=[], categorical=[], standardize=False, add_bias_term=False):
    df_data = df[continuous] if continuous else pd.DataFrame()
    if standardize:
        for col in continuous:
            mu = df[col].mean(skipna=True)
            df_data[col] = (df[col] - mu) / mu
    for col in categorical:
        df_cat_col = pd.get_dummies(df[col], prefix=col)
        df_data = pd.concat([df_data, df_cat_col], axis=1)  # dummy columns join side-by-side
    if add_bias_term:
        df_data['__ones__'] = np.ones(len(df))
    return df_data

# ========== Analysis begins here ==========

# First, verify that the dependent functions work:
if not ('filter_ratings__passed' in globals() and filter_ratings__passed \
        and 'mean_temperature__passed_2d' in globals() and mean_temperature__passed_2d \
        and 'masked_to_degC__passed_2d' in globals() and masked_to_degC__passed_2d):
    print("This code cell was not run because it needs working solutions from earlier exercises.")
    assert False, "*** Stopping execution here. ***"

summary_tibble = merge_all_data(neighborhood_ratings, satellite_image_data, neighborhood_prices)
summary_df = summary_tibble.pivot(index='city_st', values=['delta_degC', '%price'],
                                  columns=['ratings']).reset_index()
display(summary_df)

from numpy.linalg import lstsq
print(f"Regressing on neighborhood ratings ({tuple(summary_tibble['ratings'].unique())}):")
X = build_matrix_dataframe(summary_tibble, categorical=['ratings']).values
for response in ['delta_degC', '%price']:
    y = summary_tibble[response]
    theta, _, _, _ = lstsq(X, y, rcond=None)
    print(f"* Response '{response}' has these weights: {theta.T}")
### END SOLUTION
```
Processing Birmingham, AL [image=LC08_CU_022014_20200614_20200628_C01_V01_ST--EPSG_4326.tif] ...
Processing Los Angeles, CA [image=LC08_CU_003012_20200703_20200709_C01_V01_ST--EPSG_4326.tif] ...
Processing Denver, CO [image=LC08_CU_012009_20200812_20200824_C01_V01_ST--EPSG_4326.tif] ...
Processing New Haven, CT [image=LC08_CU_029006_20200613_20200627_C01_V01_ST--EPSG_4326.tif] ...
Processing Jacksonville, FL [image=LC08_CU_026016_20160709_20190430_C01_V01_ST--EPSG_4326.tif] ...
Processing Atlanta, GA [image=LC08_CU_024013_20190808_20190822_C01_V01_ST--EPSG_4326.tif] ...
Processing Chicago, IL [image=LC08_CU_021007_20160624_20181205_C01_V01_ST--EPSG_4326.tif] ...
Processing Indianapolis, IN [image=LC08_CU_022009_20200824_20200907_C01_V01_ST--EPSG_4326.tif] ...
Processing Louisville, KY [image=LC08_CU_023010_20200817_20200825_C01_V01_ST--EPSG_4326.tif] ...
Processing New Orleans, LA [image=LC08_CU_020016_20200612_20200627_C01_V01_ST--EPSG_4326.tif] ...
Processing Boston, MA [image=LC08_CU_030006_20180719_20190614_C01_V01_ST--EPSG_4326.tif] ...
Processing Baltimore, MD [image=LC08_CU_028008_20190812_20190822_C01_V01_ST--EPSG_4326.tif] ...
Processing Detroit, MI [image=LC08_CU_024007_20190714_20190723_C01_V01_ST--EPSG_4326.tif] ...
Processing Minneapolis, MN [image=LC08_CU_018005_20190613_20190621_C01_V01_ST--EPSG_4326.tif] ...
Processing St.Louis, MO [image=LC08_CU_020010_20180723_20190614_C01_V01_ST--EPSG_4326.tif] ...
Processing Charlotte, NC [image=LC08_CU_026012_20180823_20190614_C01_V01_ST--EPSG_4326.tif] ...
Processing Bergen Co., NJ [image=LC08_CU_029007_20190830_20190919_C01_V01_ST--EPSG_4326.tif] ...
Processing Manhattan, NY [image=LC08_CU_029007_20190830_20190919_C01_V01_ST--EPSG_4326.tif] ...
Processing Brooklyn, NY [image=LC08_CU_029007_20190830_20190919_C01_V01_ST--EPSG_4326.tif] ...
Processing Columbus, OH [image=LC08_CU_024009_20190824_20190908_C01_V01_ST--EPSG_4326.tif] ...
Processing Portland, OR [image=LE07_CU_003003_20190813_20190910_C01_V01_ST--EPSG_4326.tif] ...
Processing Philadelphia, PA [image=LC08_CU_028008_20190720_20190803_C01_V01_ST--EPSG_4326.tif] ...
Processing Nashville, TN [image=LC08_CU_022012_20190603_20190621_C01_V01_ST--EPSG_4326.tif] ...
Processing Dallas, TX [image=LC08_CU_016014_20190816_20190906_C01_V01_ST--EPSG_4326.tif] ...
Processing Richmond, VA [image=LC08_CU_027010_20190727_20190803_C01_V01_ST--EPSG_4326.tif] ...
Processing Seattle, WA [image=LC08_CU_003002_20190828_20190905_C01_V01_ST--EPSG_4326.tif] ...
Processing Milwaukee Co., WI [image=LC08_CU_021007_20180801_20190614_C01_V01_ST--EPSG_4326.tif] ...
Processing Charleston, WV [image=LC08_CU_025010_20190817_20190905_C01_V01_ST--EPSG_4326.tif] ...
    city_st             delta_degC               %price
ratings                    A     B     C     D       A      B      C      D
 0  Atlanta, GA         -3.9  -1.6   0.1   2.3    48.1   12.0   -1.2  -20.3
 1  Baltimore, MD       -2.7  -1.2   1.2   2.8    11.1   10.6   -4.7  -14.8
 2  Bergen Co., NJ      -2.9  -1.0   0.2   2.3    22.4    3.3   -1.1  -10.3
 3  Birmingham, AL      -3.6   0.1   1.6  -0.2   166.6   52.9  -27.6  -35.2
 4  Boston, MA          -5.4  -0.9  -0.2   2.3    17.9    3.4   -5.6    6.1
 5  Brooklyn, NY        -0.0  -0.2  -0.2   0.3   -32.8   -4.9   -2.0    5.5
 6  Charleston, WV      -2.8   0.6   1.1  -1.0    21.5   -3.6   -5.0    3.8
 7  Charlotte, NC       -2.4  -0.6   0.5   1.1    38.5   27.2  -17.6   -7.4
 8  Chicago, IL         -5.3  -1.9   0.8   0.8   103.7   23.7  -11.5  -24.1
 9  Columbus, OH        -1.6  -0.1   0.3   0.8    27.5    5.5   -4.0  -10.9
10  Dallas, TX          -3.0  -1.1   1.0   1.9    57.8   -3.8  -20.1  -20.3
11  Denver, CO          11.6   1.5  -2.8   0.2    12.2   10.4   -0.5  -11.8
12  Detroit, MI         -3.0  -0.8   0.0   1.0    59.1   14.9   -1.4  -19.9
13  Indianapolis, IN    -1.7  -0.5  -0.3   1.2    15.5   -1.4   -3.2    1.1
14  Jacksonville, FL    -3.7  -0.6   0.1   0.9    32.5    7.7   -7.8  -10.0
15  Los Angeles, CA     -5.2  -0.8   1.2   1.1    40.5    7.9   -6.9  -22.2
16  Louisville, KY      -2.5  -0.4   0.3   1.9    46.7   10.0  -12.7  -15.3
17  Manhattan, NY       -1.0  -1.0   0.0   0.6     1.0  -12.3  -12.9    7.5
18  Milwaukee Co., WI   -2.9   0.4  -0.2   0.7    29.8   12.0   -5.4  -10.3
19  Minneapolis, MN     -2.8  -0.7   1.0   2.0     4.5    6.0   -4.7   -6.0
20  Nashville, TN       -2.2  -0.3  -0.2   1.3    20.7    5.3    0.5  -11.2
21  New Haven, CT       -3.2  -1.5   0.8   1.8     8.2   -1.5   -4.9    5.9
22  New Orleans, LA     -1.9  -0.7   0.2   0.2    16.6   13.7    2.0  -10.7
23  Philadelphia, PA    -4.4  -1.2   1.6   2.9    34.9    4.9  -15.0  -10.5
24  Portland, OR        -5.8   0.0   0.7   2.1    13.2    1.7   -7.8    7.6
25  Richmond, VA        -3.0  -0.3   0.2   1.6    18.9   11.8   -5.6  -20.7
26  Seattle, WA         -3.2  -0.6   1.0  -0.2     8.1    3.0   -1.0   -8.9
27  St.Louis, MO        -2.0   0.2   0.1   1.2    40.6  -11.2  -12.6  -14.2

Regressing on neighborhood ratings ('A', 'B', 'C', 'D'):
* Response 'delta_degC' has these weights: [-2.51785714 -0.54285714  0.36071429  1.21071429]
* Response '%price' has these weights: [31.61785714  7.47142857 -7.15357143 -9.91071429]
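Because the design matrix here is a pure one-hot encoding with no bias column, each least-squares weight is exactly the mean response within that rating group. So the weights above read directly: averaged over these 28 cities, "A" neighborhoods run about 2.5 deg C cooler and 32% pricier than their city overall, while "D" neighborhoods run about 1.2 deg C warmer and 10% cheaper. Here is a quick self-contained check of that group-mean equivalence on toy data (the numbers below are hypothetical, not the exam data):

In [ ]: import numpy as np
import pandas as pd
from numpy.linalg import lstsq

# Toy data: two groups whose means are 10.0 and -4.0 by construction
toy = pd.DataFrame({'g': ['A', 'A', 'B', 'B'], 'y': [9.0, 11.0, -3.0, -5.0]})
X = pd.get_dummies(toy['g']).values.astype(float)  # one-hot, no bias column
theta, *_ = lstsq(X, toy['y'].values, rcond=None)
print(theta)                                   # -> [10. -4.]
print(toy.groupby('g')['y'].mean().values)     # -> [10. -4.], identical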
Epilogue. The analysis in this notebook is inspired by a New York Times article (https://www.nytimes.com/interactive/2019/08/09/climate/city-heat-islands.html) about the disproportionate effects of climate on different racial and socioeconomic groups. What you've done here just scratches the surface of that analysis, available in this paper (https://www.mdpi.com/2225-1154/8/1/12/htm), but we hope you can appreciate how remarkable it is that, with just a semester's worth of experience, this kind of data analysis is well within your grasp! Indeed, although we ended up cutting it from the problem, you could easily imagine applying any number of the analyses from Notebooks 12-15, starting with simple regression models that relate ratings to temperature and home prices.

Data sources for this notebook:
  • Redlining data: The Mapping Inequality website (https://cse6040.gatech.edu/sp23/img/dsl.richmond.edu/panorama/redlining)
  • Satellite data: The US Geological Survey (USGS) Earth Explorer (https://earthexplorer.usgs.gov/) (we used the "provisional land surface temperature" images)
  • Real estate data: Zillow Home Price Forecast Data for Researchers (https://www.zillow.com/research/data)
  • Geographic boundaries for zip codes: this blog post (https://n8henrie.com/uploads/2017/11/plotting-us-census-data-with-python-and-geopandas.html), which derives this information from US Census data (see the post for details)