STEP 1: Begin work within your Jupyter Notebook by importing the following modules:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import re
Q1. Within your Jupyter Notebook, write the code for a Python function called
def parseWeatherByYear(year) :
This function will parse an HTML page containing an entire year of weather data for the city of Toronto.
The html pages containing weather data can be downloaded from: https://www.extremeweatherwatch.com/cities/toronto/year-2023
The file to parse for this lab however can be downloaded here: https://matrix.senecacollege.ca/~danny.abesdris/prg550.232/labs/lab6/torontoWeather.2023.html
The HTML file itself contains markers indicating where to begin parsing the data to extract. The 3 pieces of data that must be extracted are the high and low temperatures (in degrees Celsius) and the amount of precipitation (in cm) for every day so far in the current year (2023).
A sample of the lines showing where to begin extracting data is listed below:
<td><div class='width-130'><a href='/cities/toronto/day/january-1'>January 1</a></div></td>
<td class='text-right temp40'>5.0</td>
<td class='text-right temp30'>2.7</td>
<td class='text-right rainsnow1'>0.15</td>
</tr>
Notice the marker in the lines above:
/cities/toronto/day/month-n
In the example above, the data to extract would be 5.0, 2.7, and 0.15. The extraction can be achieved in several ways, but a carefully structured regular expression (using match.group( ) together with the re.S and re.M flags) is recommended for speed and simplicity. The trick is to first match the text up to the point where the data begins (as groups) and then to form another regular expression that matches the data itself (again as a group).
As always, the website https://regex101.com will be invaluable in helping you to achieve your solution with this.
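The marker-then-data idea described above can be sketched as follows. This is only one possible pattern, not the required solution: the names ROW_PATTERN, SAMPLE_HTML, and extract_rows are hypothetical, the first sample row is copied from the excerpt above, and the second row is assembled from the January 8 values in the expected CSV with assumed class names.

```python
import re

# First row is taken from the lab's HTML excerpt. The second row is an
# assumption: its numbers come from the expected CSV (January 8), but the
# class names are guessed and may differ in the real file.
SAMPLE_HTML = """
<td><div class='width-130'><a href='/cities/toronto/day/january-1'>January 1</a></div></td>
<td class='text-right temp40'>5.0</td>
<td class='text-right temp30'>2.7</td>
<td class='text-right rainsnow1'>0.15</td>
</tr>
<td><div class='width-130'><a href='/cities/toronto/day/january-8'>January 8</a></div></td>
<td class='text-right temp30'>-1.5</td>
<td class='text-right temp20'>-4.8</td>
<td class='text-right rainsnow0'>0.00</td>
</tr>
"""

# Groups 1-2 capture the marker (month name, day of month); groups 3-5
# capture the three numeric cells that follow it. re.S lets ".*?" span
# line breaks between the marker and the data cells.
ROW_PATTERN = re.compile(
    r"/cities/toronto/day/([a-z]+)-(\d+)'>"
    r".*?<td[^>]*>(-?\d+\.\d+)</td>"   # high temperature
    r".*?<td[^>]*>(-?\d+\.\d+)</td>"   # low temperature
    r".*?<td[^>]*>(-?\d+\.\d+)</td>",  # precipitation
    re.S | re.M,
)


def extract_rows(html):
    """Return (month, dayOfMonth, high, low, precipitation) for each day found."""
    rows = []
    for match in ROW_PATTERN.finditer(html):
        month = match.group(1)
        day = int(match.group(2))
        # Values are kept as strings so that "0.00" is written back unchanged.
        rows.append((month, day, match.group(3), match.group(4), match.group(5)))
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Matching the cells with `<td[^>]*>` rather than a fixed class name is a deliberate choice here, since the class (temp40, temp30, and so on) appears to vary with the value.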
The data to be extracted must range from January 1, 2023 to the cutoff date for this file, March 16, 2023.
It would be helpful to create a NumPy array of the number of days in each month of the year and then to investigate the Pandas date_range( ) function and the Series.dt.month_name( ) method, which allow you to capture the month names programmatically.
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month_name.html
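As a rough illustration of that hint, the sketch below builds the days-per-month array and derives the month names, day of year, and day of month from a date range. The variable names are hypothetical, and the lowercase conversion is an assumption based on the lowercase month names in the expected CSV.

```python
import numpy as np
import pandas as pd

# Days in each month of 2023 (not a leap year, so February has 28 days).
days_in_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

# One timestamp per day from January 1 to the March 16 cutoff.
dates = pd.Series(pd.date_range(start="2023-01-01", end="2023-03-16", freq="D"))

# month_name() is a method; lowercase it to match the CSV ("january").
month_names = dates.dt.month_name().str.lower()
day_of_year = dates.dt.dayofyear
day_of_month = dates.dt.day

print(len(dates))                       # number of days in the range
print(month_names.unique().tolist())    # distinct month names in order
print(days_in_month[:2].sum() + 16)     # same day count from the array
```

Both routes give the same day count, which is a handy cross-check against the number of records the regular expression finds.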
The html file itself must be opened and the entire contents read into a string.
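A minimal sketch of that step follows. It assumes the downloaded file keeps the name torontoWeather.<year>.html in the working folder and is UTF-8 encoded; the helper name read_html_for_year and its folder parameter are hypothetical.

```python
from pathlib import Path


def read_html_for_year(year, folder="."):
    """Read the whole HTML file for the given year into one string."""
    # Assumption: the file name follows the lab's download name.
    path = Path(folder) / f"torontoWeather.{year}.html"
    with open(path, "r", encoding="utf-8") as html_file:
        return html_file.read()
```

Reading the file in one call (rather than line by line) keeps each table row intact, which matters because a single day's data spans several lines.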
As the data from the html file is extracted, your function must also write the data into a CSV (comma separated values) file using the initial heading (title) of:
City,dayOfyear,month,dayOfMonth,Year,highTemp,lowTemp,precipitation
You are to write the fields separated by commas (,), with each record followed by a newline.
The first 10 lines of the resulting file (the heading followed by the first 9 records) should be exactly as listed below:
City,dayOfyear,month,dayOfMonth,Year,highTemp,lowTemp,precipitation
Toronto,1,january,1,2023,5.0,2.7,0.15
Toronto,2,january,2,2023,5.6,3.5,0.00
Toronto,3,january,3,2023,4.4,2.8,0.33
Toronto,4,january,4,2023,4.4,2.5,2.11
Toronto,5,january,5,2023,4.8,3.2,0.02
Toronto,6,january,6,2023,5.1,2.9,0.00
Toronto,7,january,7,2023,3.2,-4.1,0.00
Toronto,8,january,8,2023,-1.5,-4.8,0.00
Toronto,9,january,9,2023,2.2,-1.7,0.01
There are exactly 75 records in the html file to extract and therefore 75 records are to be written to the CSV file.
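The CSV-writing step could look something like the sketch below. It assumes the parsed rows arrive as (month, dayOfMonth, high, low, precipitation) tuples with the numeric values still held as strings; that shape, the function name write_weather_csv, and the city parameter are illustrative choices, not part of the lab specification.

```python
HEADER = "City,dayOfyear,month,dayOfMonth,Year,highTemp,lowTemp,precipitation"


def write_weather_csv(csv_path, rows, year, city="Toronto"):
    """Write the heading and one comma-separated line per day; return the record count."""
    with open(csv_path, "w", encoding="utf-8", newline="") as out:
        out.write(HEADER + "\n")
        # dayOfyear is simply the running position of the record, starting at 1.
        for day_of_year, (month, day, high, low, precip) in enumerate(rows, start=1):
            fields = [city, str(day_of_year), month, str(day), str(year), high, low, precip]
            out.write(",".join(fields) + "\n")
    return len(rows)


# The first two records from the expected output, in the assumed tuple shape.
sample_rows = [
    ("january", 1, "5.0", "2.7", "0.15"),
    ("january", 2, "5.6", "3.5", "0.00"),
]
```

Returning the record count makes it easy to confirm that 75 records were written for the 2023 file.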
Once the file has been created and all records written, your function must load the CSV file into a Pandas data frame and display ALL records in the data frame using the functions:
pd.read_csv(csvFile) # read csv file into Data Frame
pd.set_option('display.max_rows', None) # set a flag to display all rows in the output
The data frame's shape attribute and describe( ) method must also be invoked and displayed.
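The loading and display step might be sketched as below. To stay self-contained, the example reads three records from an in-memory string (read_csv also accepts a file path such as the CSV written earlier); the helper name show_weather and the sample text are hypothetical.

```python
import io

import pandas as pd


def show_weather(csv_file):
    """Load the CSV, print every row, then print the shape and summary statistics."""
    pd.set_option("display.max_rows", None)  # show all rows instead of truncating
    df = pd.read_csv(csv_file)
    print(df)
    print(df.shape)       # shape is an attribute: (rows, columns)
    print(df.describe())  # describe() is a method: count, mean, std, min, ...
    return df


# Stand-in for the real CSV file: heading plus the first three records.
sample_csv = io.StringIO(
    "City,dayOfyear,month,dayOfMonth,Year,highTemp,lowTemp,precipitation\n"
    "Toronto,1,january,1,2023,5.0,2.7,0.15\n"
    "Toronto,2,january,2,2023,5.6,3.5,0.00\n"
    "Toronto,3,january,3,2023,4.4,2.8,0.33\n"
)
df = show_weather(sample_csv)
```

With the full file, the shape should report 75 rows and 8 columns.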
The exact output on the command line should be as listed below:
       City  dayOfyear     month  dayOfMonth  Year  highTemp  lowTemp  precipitation
0   Toronto          1   january           1  2023       5.0      2.7           0.15
1   Toronto          2   january           2  2023       5.6      3.5           0.00
2   Toronto          3   january           3  2023       4.4      2.8           0.33
3   Toronto          4   january           4  2023       4.4      2.5           2.11
4   Toronto          5   january           5  2023       4.8      3.2           0.02
5   Toronto          6   january           6  2023       5.1      2.9           0.00
...TO 75