
Question

Python code

Project 5 Data Plots
Objective: To work with a data set and generate graphical charts. To gain experience with pandas and
matplotlib modules/libraries.
Description: You will perform a simple data analysis given a CSV file of Amazon rain forest data. The data
set contains the number of fires that have occurred over an 18 year period in the Amazon. Working with raw
data is typical in Data Science. You will need to clean the data before visualizing the data in the form of a plot
or bar chart. About the data:
year is the year when the forest fire happened.
state is the Brazilian state.
month is the month when the forest fire happened.
number is the number of forest fires reported. You will notice that the format of the numbers is not US standard.
In many countries, thousands are represented with a period. So 2.588 is really 2588 forest fires. Don't fix the
data in Excel. You will correct for this during the data import.
date is the date when the forest fire was reported.
● In this project, you will first perform a high-level analysis of the data using pandas functions.
● Clean the data, looking for missing values.
● Create subsets of the data.
● Then plot the data.
Detailed Instruction Steps
It is recommended that you go through this project step by step. Make sure each step works before moving to
the next one. Use a lot of print() commands to show results at each intermediate step. These can be commented
out once you have the plot working.
1. This project needs to import the pandas, matplotlib.pyplot, and numpy modules/libraries.
2. Open the data file using the pandas read_csv function. Research how to convert decimal thousands to
regular numbers. There is an input parameter available to handle this formatting.
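A minimal sketch of steps 1 and 2, assuming the `thousands` parameter of `read_csv` is the input parameter the step refers to. The real file name and contents are not shown here, so the CSV text below is a small hypothetical stand-in that uses the five column names from the description, and the variable names are illustrative only. The sketch imports only the modules it uses.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's CSV file (made-up rows, real column names).
SAMPLE_CSV = """year,state,month,number,date
1998,Acre,January,0,1998-01-01
1998,Acre,February,10,1998-01-01
1998,Acre,March,35,1998-01-01
1999,Mato Grosso,January,2.588,1999-01-01
1999,Mato Grosso,February,1.204,1999-01-01
1999,Mato Grosso,March,7,1999-01-01
"""

# thousands="." tells read_csv to treat the period as a thousands separator,
# so the text "2.588" is read as the number 2588.
data = pd.read_csv(io.StringIO(SAMPLE_CSV), thousands=".")

print(data)
print(data["number"].tolist())
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.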
3. Perform high level data summaries:
a. Use the pandas shape attribute to print the number of rows and columns in the data set.
b. Use the pandas head() and tail() functions to print the first and last row sets.
c. Use the pandas describe() function to provide some simple statistics about the data. Research
options to show all statistics available. Print this.
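The summaries in step 3 could look like the sketch below. It assumes `include="all"` is the option meant by "all statistics available", and it runs on a small hypothetical data set rather than the project's file. Note that `shape` is an attribute of the DataFrame, so it is read without parentheses.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's CSV file (made-up rows, real column names).
SAMPLE_CSV = """year,state,month,number,date
1998,Acre,January,0,1998-01-01
1998,Acre,February,10,1998-01-01
1998,Acre,March,35,1998-01-01
1999,Mato Grosso,January,2.588,1999-01-01
1999,Mato Grosso,February,1.204,1999-01-01
1999,Mato Grosso,March,7,1999-01-01
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV), thousands=".")

# (rows, columns) tuple
print(data.shape)

# First and last row sets (five rows each by default)
print(data.head())
print(data.tail())

# include="all" adds statistics for the text columns (unique, top, freq)
# next to the numeric ones (mean, std, min, quartiles, max).
summary = data.describe(include="all")
print(summary)
```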
4. Clean the data
a. Check for missing data values. Use the isna() method with sum() to identify missing cells. The
result should look like this:
year 0
state 0
month 0
number 0
date 0
dtype: int64
b. The goal is to generate a bar chart with a count of the number of fires per month. Since there are
months with 0 fires, you can eliminate these values from the data set. First, use the replace
function to replace 0s with NaN values (Not a Number). Use the np.nan value as the replacement
value. Do a print of the head() of the data to now see NaN values.
c. To remove the lines, use the dropna() function. This function looks for NaN values in a specific
column. Research how to specify a column as the input parameter. Use the "number" column.
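One way step 4 might be written is sketched below, assuming `subset` is the `dropna()` parameter that names the column. The data is a hypothetical stand-in, and `missing` and `cleaned` are illustrative names. The replacement is applied to the "number" column only, which is a choice made here so that zeros in other columns would be left alone.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the project's CSV file (made-up rows, real column names).
SAMPLE_CSV = """year,state,month,number,date
1998,Acre,January,0,1998-01-01
1998,Acre,February,10,1998-01-01
1998,Acre,March,35,1998-01-01
1999,Mato Grosso,January,2.588,1999-01-01
1999,Mato Grosso,February,1.204,1999-01-01
1999,Mato Grosso,March,7,1999-01-01
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV), thousands=".")

# a. Count missing cells per column
missing = data.isna().sum()
print(missing)

# b. Replace 0 with NaN in the "number" column, then look at the first rows
data["number"] = data["number"].replace(0, np.nan)
print(data.head())

# c. Drop the rows whose "number" value is NaN
cleaned = data.dropna(subset=["number"])
print(cleaned)
```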
5. Group the data
a. The goal in this step is to create a pandas series to be used in the chart. The data must be
transformed so that there are totals by month. Research the pandas groupby() function syntax.
The goal is to group the rows by month, select the number column by its key, and then use the sum() function to get the totals.
Assign the results of the groupby() function to a new variable which is the data series.
b. Use the print() command for the variable in (a). This should show you totals for each month - in
alphabetical order.
c. The data needs to be sorted with January being first. Note that the original CSV file is sorted
correctly by month. Use the following command to create a list of unique months from the data
set - months_unique = list(data.month.unique())
d. Use the pandas reindex() function on the variable in (a). Use months_unique from (c) as input
parameter. The second parameter is axis=0. This function then sorts the data in the correct month
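The grouping and reordering in step 5 can be sketched as follows, again on a hypothetical stand-in data set with English month names. The names `totals` and `ordered` are illustrative. The zero-row cleanup from step 4 is skipped here because zeros do not change the sums.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's CSV file (made-up rows, real column names).
SAMPLE_CSV = """year,state,month,number,date
1998,Acre,January,0,1998-01-01
1998,Acre,February,10,1998-01-01
1998,Acre,March,35,1998-01-01
1999,Mato Grosso,January,2.588,1999-01-01
1999,Mato Grosso,February,1.204,1999-01-01
1999,Mato Grosso,March,7,1999-01-01
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV), thousands=".")

# a/b. Totals per month; groupby sorts the group keys, so the result is alphabetical
totals = data.groupby("month")["number"].sum()
print(totals)

# c. Months in the order they first appear in the file
months_unique = list(data.month.unique())

# d. Reorder the series to follow that calendar order
ordered = totals.reindex(months_unique, axis=0)
print(ordered)
```

`reindex()` returns a new series and leaves `totals` unchanged, so the result has to be assigned to a variable before it is used for the chart.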