School
Cumberland University
Course
3268
Subject
Statistics
Date
Nov 24, 2024
Uploaded by ngararichrisgmail.com
Data Wrangling Assignment
Name
Institutional Affiliation
Date
Data Wrangling Task
Step 1: Loading the dataset
I started by loading the crime data and COVID-19 data into pandas DataFrames. This
allowed me to work with the datasets in a format that is easy to manipulate and analyze.
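A minimal sketch of this loading step, assuming both datasets are CSV files. The report does not name the files or their columns, so the column names (zip_code, offense, cases) and the inline sample rows below are hypothetical; io.StringIO stands in for a file path so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real CSV files.
crime_csv = io.StringIO(
    "zip_code,offense\n"
    "02108,Shoplifting\n"
    "02108,Burglary\n"
    "02109,Shoplifting\n"
)
covid_csv = io.StringIO(
    "zip_code,cases\n"
    "02108,120\n"
    "02109,85\n"
)

# Reading ZIP codes as text keeps leading zeros intact.
crime_df = pd.read_csv(crime_csv, dtype={"zip_code": str})
covid_df = pd.read_csv(covid_csv, dtype={"zip_code": str})
```

With real data, the StringIO objects would be replaced by the paths of the two files.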
Step 2: Exploring the structure and content
Next, I explored the structure and content of both datasets using the head() function.
This gave me an initial understanding of the data and helped me identify the relevant
variables for analysis.
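The exploration step might look like the sketch below. Only head() is mentioned in the report; the dtypes and isna() checks are additions that are commonly run at the same time, and the sample DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical crime records; the real dataset's columns are not given in the report.
crime_df = pd.DataFrame({
    "zip_code": ["02108", "02108", "02109", "02110"],
    "offense": ["Shoplifting", "Burglary", "Shoplifting", None],
})

preview = crime_df.head(3)       # first three rows
missing = crime_df.isna().sum()  # missing values per column

print(preview)
print(crime_df.dtypes)
print(missing)
```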
Step 3: Identifying the analytical question
To guide my data wrangling and analysis process, I formulated a specific analytical question:
What is the relationship between the number of reported shoplifting offenses and
COVID-19 cases?
This question provided a clear objective for the analysis.
Step 4: Data wrangling
I performed data wrangling to prepare the datasets for analysis. This involved
handling missing values, converting data types, and filtering relevant columns. These
operations ensured that the data was in a suitable format and contained only the necessary
information.
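The three operations named above can be sketched as follows. The report does not say which columns were cleaned or how, so every column name here is hypothetical, and counting shoplifting records per ZIP code is an assumption about how the "number of reported shoplifting offenses" was obtained.

```python
import pandas as pd

# Hypothetical raw data.
crime_df = pd.DataFrame({
    "zip_code": ["02108", "02108", "02109", None, "02109"],
    "offense": ["Shoplifting", "Burglary", "Shoplifting", "Shoplifting", "Shoplifting"],
    "officer_id": [1, 2, 3, 4, 5],  # irrelevant column, dropped below
})
covid_df = pd.DataFrame({
    "zip_code": ["02108", "02109"],
    "cases": ["120", "85"],  # stored as text in this sketch
})

# Handle missing values: drop records that have no ZIP code.
crime_clean = crime_df.dropna(subset=["zip_code"])

# Filter to the relevant offense and column, then count records per ZIP code.
shoplifting = (
    crime_clean.loc[crime_clean["offense"] == "Shoplifting", ["zip_code"]]
    .groupby("zip_code")
    .size()
    .reset_index(name="shoplifting_count")
)

# Convert data types: case counts from text to numbers.
covid_clean = covid_df.assign(cases=pd.to_numeric(covid_df["cases"]))
```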
Step 5: Merging the datasets
In order to analyze the relationship between shoplifting offenses and COVID-19
cases, I merged the crime data and COVID-19 data based on the common ZIP code column.
This created a unified dataset that contained information about both variables.
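A sketch of the merge, assuming one row per ZIP code in each table and an inner join. The report does not state the join type, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-ZIP tables produced by the wrangling step.
shoplifting = pd.DataFrame({
    "zip_code": ["02108", "02109", "02110"],
    "shoplifting_count": [1, 2, 4],
})
covid = pd.DataFrame({
    "zip_code": ["02108", "02109", "02111"],
    "cases": [120, 85, 60],
})

# An inner join keeps only ZIP codes present in both datasets.
# validate raises an error if either side has duplicate ZIP codes.
merged = shoplifting.merge(covid, on="zip_code", how="inner", validate="one_to_one")
print(merged)
```

An inner join silently drops ZIP codes that appear in only one dataset, so it is worth comparing the row counts before and after the merge.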
Step 6: Data visualization
I created a scatter plot to visualize the relationship between the number of reported
shoplifting offenses and COVID-19 cases. This allowed me to observe any patterns or trends
in the data and gain initial insights.
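The plot could be produced along these lines with matplotlib, which is an assumption because the report does not name a plotting library. The axes follow the layout described in Step 7 (shoplifting offenses on the x-axis, COVID-19 cases on the y-axis); the data values are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical merged data, one row per ZIP code.
merged = pd.DataFrame({
    "shoplifting_count": [1, 2, 4, 3, 7],
    "cases": [120, 85, 60, 150, 140],
})

fig, ax = plt.subplots()
ax.scatter(merged["shoplifting_count"], merged["cases"])
ax.set_xlabel("Reported shoplifting offenses")
ax.set_ylabel("COVID-19 cases")
ax.set_title("Shoplifting offenses vs. COVID-19 cases by ZIP code")
# fig.savefig("scatter.png")  # uncomment to write the figure to disk
```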
Step 7: Drawing conclusions
By examining the scatter plot and the distribution of its data points, I drew initial
conclusions about the relationship between the variables:
In the scatter plot, the x-axis is the number of reported shoplifting offenses and the
y-axis is the number of COVID-19 cases by ZIP code. The data points are concentrated
towards the left side of the plot.
Because the x-axis measures shoplifting offenses, this concentration means that most ZIP
codes reported relatively few shoplifting offenses, whatever their number of COVID-19
cases. The clustering alone does not establish a trend between the two variables, so I
measured the relationship numerically in Step 8.
Step 8: Calculating correlation coefficient and hypothesis testing
To quantitatively measure the strength and significance of the relationship, I
calculated the correlation coefficient and conducted hypothesis testing. This allowed me to
determine the statistical significance of the observed relationship.
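A sketch of this calculation, assuming the Pearson correlation coefficient and SciPy's pearsonr function. The report does not say which coefficient or library was used, and the data values below are hypothetical, so the printed numbers will not match the ones reported in Step 9.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical merged data, one row per ZIP code.
merged = pd.DataFrame({
    "shoplifting_count": [1, 2, 4, 3, 7, 5],
    "cases": [120, 85, 60, 150, 140, 90],
})

# pearsonr returns the coefficient and the two-sided p-value for the
# null hypothesis that the two variables are uncorrelated.
r, p_value = pearsonr(merged["shoplifting_count"], merged["cases"])
print(f"r = {r:.4f}, p = {p_value:.4f}")
```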
Step 9: Interpreting the results
Finally, I interpreted the correlation coefficient and p-value to draw conclusions about
the relationship between the variables. This helped me answer the analytical question.
The findings were:
Based on the calculation results, the correlation coefficient between the number of reported
shoplifting offenses and COVID-19 cases is approximately 0.2391, which suggests a weak
positive correlation between the number of reported shoplifting offenses and the number of
COVID-19 cases by ZIP code.
The p-value is approximately 0.1426. The null hypothesis states that there is no
correlation between the two variables. The p-value is greater than the commonly used
significance level of 0.05, so there is not enough evidence to reject the null hypothesis.
Therefore, I cannot conclude that the relationship between the number of reported shoplifting
offenses and COVID-19 cases is statistically significant.
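The decision rule applied above can be written out as a short sketch using the two values reported in this step. The cut-offs in describe_strength are a hypothetical rule of thumb for labelling correlation strength, not a standard taken from the report.

```python
# Values reported in this analysis.
r = 0.2391
p_value = 0.1426
alpha = 0.05  # commonly used significance level


def describe_strength(coefficient):
    """Label the size of a correlation; the cut-offs are a convention, not a rule."""
    size = abs(coefficient)
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"


direction = "positive" if r > 0 else "negative" if r < 0 else "no"
reject_null = p_value < alpha  # reject only when p falls below alpha

print(f"{describe_strength(r)} {direction} correlation; reject null: {reject_null}")
# prints "weak positive correlation; reject null: False"
```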
In summary, based on the correlation coefficient and p-value, there is a weak positive
correlation between the reported shoplifting offenses and the number of COVID-19 cases by
ZIP code, which is not statistically significant.
By following these steps, I was able to systematically analyze the data, perform
relevant data wrangling operations, visualize the relationship between shoplifting offenses
and COVID-19 cases, and conduct statistical analysis to draw meaningful conclusions. The
chosen solutions, such as merging the datasets and calculating correlation coefficients, were
directly aligned with the analytical question and allowed me to investigate the relationship
between the variables in a comprehensive and structured manner.