problem14

School

Georgia Institute Of Technology

Course

CS6040

Subject

Computer Science

Date

Dec 6, 2023


11/28/23, 8:13 PM problem14 file:///Users/dannie/Downloads/pmt1-sample-solutions-su21/problem14-sample-solutions.html

Problem 14: Scraping data from "FiveThirtyEight"

There are a ton of fun interactive visualizations at the website FiveThirtyEight (http://fivethirtyeight.com). For example, consider the one that tracks the US President's approval ratings: https://projects.fivethirtyeight.com/trump-approval-ratings/

Here is a screenshot of the interactive graph it contains:

In it, you can select each day ("movable cursor") and get information about the approval ratings for that day.

As it turns out, this visualization is implemented in JavaScript and all of the individual data items are embedded within the web page itself. For example, here is a 132-page PDF file, which is the source code for the web page taken on September 6, 2018: PDF file (https://cse6040.gatech.edu/datasets/538-djt-pop/2018-09-06.pdf). The raw data being rendered in the visualization starts on page 50.

Of course, that means you can use your Python-fu to try to extract this data for your own purposes! Indeed, that is your task for this problem.

Although the data in this problem comes from an HTML file with embedded JavaScript, you do not need to know anything about HTML or JavaScript to solve this problem. It is purely an exercise of rudimentary Python and computational problem solving.

Reading the raw HTML file

Let's read the raw contents of the FiveThirtyEight approval ratings page (i.e., the same contents as the PDF) into a variable named raw_html. Like the groceries problem in Notebook 2, this cell contains a bunch of code for getting the data file you need, which you can ignore.

In [1]:

    def download(url, local_file, overwrite=False):
        # Fetch `url` into `local_file` unless the file is already present.
        import os, requests
        if not os.path.exists(local_file) or overwrite:
            print("Downloading: {} ...".format(url))
            r = requests.get(url)
            with open(local_file, 'wb') as f:
                f.write(r.content)
            return True
        return False  # File existed already

    def get_checksum(local_file):
        # Return the MD5 hex digest of the file's bytes.
        import io, hashlib
        with io.open(local_file, 'rb') as f:
            body = f.read()
            body_checksum = hashlib.md5(body).hexdigest()
            return body_checksum

    def download_or_load_locally(file, local_dir="", url_base=None, checksum=None):
        if url_base is None: url_base = "https://cse6040.gatech.edu/datasets/"
        local_file = "{}{}".format(local_dir, file)
        remote_url = "{}{}".format(url_base, file)
        download(remote_url, local_file)
        if checksum is not None:
            body_checksum = get_checksum(local_file)
            assert body_checksum == checksum, \
                "Downloaded file '{}' has incorrect checksum: '{}' instead of '{}'".format(local_file, body_checksum, checksum)
        print("'{}' is ready!".format(file))

    def on_vocareum():
        # The Vocareum platform is detected by the presence of a `.voc` path.
        import os
        return os.path.exists('.voc')

    if on_vocareum():
        URL_BASE = None
        DATA_PATH = "./resource/asnlib/publicdata/538-djt-pop/"
    else:
        URL_BASE = "https://cse6040.gatech.edu/datasets/538-djt-pop/"
        DATA_PATH = ""
    datasets = {'2018-09-06.html': '291a7c1cbf15575a48b0be8d77b7a1d6'}

    for filename, checksum in datasets.items():
        download_or_load_locally(filename, url_base=URL_BASE, local_dir=DATA_PATH, checksum=checksum)

    # Read the whole page into a single string.
    with open('{}{}'.format(DATA_PATH, '2018-09-06.html')) as fp:
        raw_html = fp.read()

    print("\n(All data appears to be ready.)")
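Once the page is held in a string like raw_html, pulling out data embedded in its JavaScript is ordinary string processing. The sketch below is only an illustration of that idea, not the solution to this problem: sample_html, the variable name approval, the field names, and the helper extract_embedded_records are all made up for the example, and the real page may lay out its data differently. It assumes the data is a JSON-compatible array assigned to a JavaScript variable.

```python
import json
import re

# Hypothetical stand-in for `raw_html`. The real page's variable names and
# record fields are NOT known here and may differ.
sample_html = """
<html><body>
<script>
var approval = [{"date": "2018-09-05", "approve": 40.1}, {"date": "2018-09-06", "approve": 40.3}];
</script>
</body></html>
"""

def extract_embedded_records(html, var_name):
    """Return the array assigned by `var <var_name> = [...];` as Python objects.

    Returns an empty list if no such assignment is found.
    """
    # Capture the shortest bracketed span that is followed by a semicolon.
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))

records = extract_embedded_records(sample_html, "approval")
for rec in records:
    print(rec["date"], rec["approve"])
```

The non-greedy match stops at the first "]" that is followed by a semicolon, which is enough for a flat array like the one above. It would need more care if the embedded data contained that character sequence inside a string or nested structure.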
