ANA 515 Assignment 2 - Loading, Saving and Describing data
Srilekha Gurrapu
2023-06-15
Section 1: Description of the data
The dataset I am using is the COVID-19 dataset, which provides comprehensive information about the COVID-19 pandemic across multiple countries. The data measures various aspects of the pandemic, including the number of cases, deaths, and recoveries. The dataset was collected from reliable sources such as national health agencies and international organizations tracking the pandemic.
The data is saved in a CSV (Comma-Separated Values) format. It is a flat file that uses commas as the delimiter to separate the values in each row. The dataset is organized in a tabular format, where each row represents a specific observation (e.g., a specific country and date), and each column represents a variable or attribute related to that observation (e.g., cases, deaths, recoveries). The CSV format is a widely used and easily accessible format for storing structured data. It can be opened and processed using various tools and programming languages, including R, Python, and spreadsheet software.
To read this data into R, I will use the read_csv() function from the readr package, which is designed for reading CSV files. This function parses the comma-delimited file and guesses an appropriate data type for each column. The resulting data will be stored in a dataframe (a tibble), which is a common data structure in R for handling tabular data.
Section 2: Reading the data into R
In this code, I used the unzip() function to extract the CSV file from the zip archive. The path of the extracted file is then passed to the read_csv() function from the readr package, which reads the CSV file and assigns the result to the data dataframe object.
## Warning: package 'readr' was built under R version 4.2.3
## Rows: 35156 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country/Region, WHO Region
## dbl (7): Confirmed, Deaths, Recovered, Active, New cases, New deaths, New r...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
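The original code chunk is not shown in the report, so the following is only a minimal sketch of the step described above. The helper name read_zipped_csv, its arguments, and the two toy rows are assumptions made for illustration; the real archive name is not given in the text.

```r
library(readr)

# Hypothetical helper (not from the original report): extract the first file
# from a zip archive and read it as a CSV.
read_zipped_csv <- function(zip_path, exdir = tempdir()) {
  csv_path <- unzip(zip_path, exdir = exdir)[1]  # unzip() returns the extracted paths
  read_csv(csv_path, show_col_types = FALSE)
}

# Toy stand-in file so the sketch runs without the real archive.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Country/Region,Confirmed,Deaths",
  "2020-01-22,Afghanistan,0,0",
  "2020-01-22,Albania,0,0"
), toy_csv)

# read_csv() guesses the column types: Date becomes a date, counts become doubles.
toy <- read_csv(toy_csv, show_col_types = FALSE)
print(toy)
```

With the real archive, the call would look like read_zipped_csv("path/to/archive.zip"), where the path is a placeholder.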
Section 3: Cleaning the data
In the code chunk, I performed some basic data cleaning operations on the dataset.
First, I renamed the columns of the dataframe using the colnames function, providing a vector of new column names, one for each column.
Next, I converted the “Date” column to the “Date” format using the as.Date function. This ensures that the date values are treated as dates in R, allowing for easier manipulation and analysis.
Finally, I subset the dataset to keep only the “Country”, “Date”, “Total_Cases”, and “Total_Deaths” columns using indexing with square brackets ([]). This creates a new dataframe called data_subset that contains the selected columns.
To verify the changes and check the cleaned dataset, I used the head function to display the first few rows of the data_subset dataframe.
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
## # A tibble: 6 × 4
## Date Country Total_Cases Total_Deaths
## <date> <chr> <dbl> <dbl>
## 1 2020-01-22 Afghanistan 0 0
## 2 2020-01-22 Albania 0 0
## 3 2020-01-22 Algeria 0 0
## 4 2020-01-22 Andorra 0 0
## 5 2020-01-22 Angola 0 0
## 6 2020-01-22 Antigua and Barbuda 0 0
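The cleaning steps described in this section can be sketched in base R as follows. This is an illustration on a toy data frame, not the original chunk: the two rows and the choice of five columns are assumptions, while the new column names are taken from the report.

```r
# Toy data frame using original column names reported by read_csv().
data <- data.frame(
  Date = c("2020-01-22", "2020-01-22"),
  `Country/Region` = c("Afghanistan", "Albania"),
  Confirmed = c(0, 0),
  Deaths = c(0, 0),
  Recovered = c(0, 0),
  check.names = FALSE  # keep "Country/Region" as written
)

# 1. Rename the columns with a vector of new names, one per column.
colnames(data) <- c("Date", "Country", "Total_Cases",
                    "Total_Deaths", "Total_Recoveries")

# 2. Convert the Date column from character to the Date class.
data$Date <- as.Date(data$Date)

# 3. Keep only the selected columns using square-bracket indexing.
data_subset <- data[, c("Date", "Country", "Total_Cases", "Total_Deaths")]

# 4. Inspect the first rows of the cleaned subset.
head(data_subset)
```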
Section 4: Characteristics of the data.
In the code, the num_rows and num_cols variables are assigned the number of rows and columns in the dataframe data. The column_descriptions variable contains the descriptions for each column. The paste() function is used to insert the column names and descriptions.
I then created a table using the kable function from the knitr package in R Markdown, formatted as Markdown. The table has two columns: “Column Name” and “Description”. The “column_table” data frame contains the column names and descriptions, which are then passed to the kable function for formatting.
## Warning: package 'knitr' was built under R version 4.2.3
| Column Name | Description |
|---|---|
| Date | The date of the recorded data for COVID-19 cases. |
| Country | The name of the country where the COVID-19 cases were reported. |
| Total_Cases | The total number of confirmed COVID-19 cases in a specific country or region. |
| Total_Deaths | The total number of deaths caused by COVID-19 in a specific country or region. |
| Total_Recoveries | The total number of individuals who have recovered from COVID-19 in a specific country or region. |
| Active_Cases | The number of active COVID-19 cases in a specific country or region. |
| New cases | The number of new COVID-19 cases reported on a specific date. |
| New deaths | The number of new deaths due to COVID-19 reported on a specific date. |
| New recovered | The number of new recoveries from COVID-19 reported on a specific date. |
| WHO Region | The World Health Organization (WHO) region to which the country or region belongs. |
Inline code: “This data set has 35156 rows and 10 columns. The names of the columns and a brief description of each are in the table above.”
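A minimal sketch of how the counts, the description table, and the summary sentence could be produced is shown below. The toy data frame and the summary_sentence variable are assumptions for illustration; num_rows, num_cols, column_descriptions, and column_table follow the names used in the text.

```r
library(knitr)

# Toy data frame standing in for the cleaned dataset.
data <- data.frame(
  Date = as.Date(c("2020-01-22", "2020-01-22")),
  Country = c("Afghanistan", "Albania"),
  Total_Cases = c(0, 0)
)

# Dimensions of the data frame.
num_rows <- nrow(data)
num_cols <- ncol(data)

# One description per column, in column order.
column_descriptions <- c(
  "The date of the recorded data for COVID-19 cases.",
  "The name of the country where the COVID-19 cases were reported.",
  "The total number of confirmed COVID-19 cases in a specific country or region."
)

# Pair each column name with its description and format as a Markdown table.
column_table <- data.frame(colnames(data), column_descriptions)
kable(column_table, format = "pipe",
      col.names = c("Column Name", "Description"))

# Build the summary sentence with paste().
summary_sentence <- paste("This data set has", num_rows,
                          "rows and", num_cols, "columns.")
print(summary_sentence)
```

In an R Markdown document, the same values can also be written inline in the prose with `r num_rows` and `r num_cols`, so that the sentence updates whenever the data changes.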
Section 5: Summary statistics.
In this code, we first define the “selected_columns” variable to contain the names of the three columns we want to analyze. We then create a subset of the dataframe “data” by selecting only those columns using “subset_data <- data[selected_columns]”.
Next, we use the apply() function to calculate the minimum, maximum, and mean values for each column in the subset_data dataframe. The na.rm = TRUE argument is used to exclude missing values from the calculations.
We also use the colSums() and is.na() functions to count the number of missing values in each column.
Finally, we create a new dataframe summary_data to store the summary statistics, with columns for the column names, minimum values, maximum values, mean values, and number of missing values.
The summary_data dataframe is then printed to display the summary statistics in a tabular format.
## Column Minimum Maximum Mean Missing_Values
## Total_Cases Total_Cases 0 4290259 23566.631 0
## Total_Deaths Total_Deaths 0 148011 1234.068 0
## Total_Recoveries Total_Recoveries 0 1846641 11048.135 0
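The summary computation described in this section can be sketched as follows. The toy values, including the deliberate NA, are made up for illustration and are not from the COVID-19 dataset. The names min_values, max_values, mean_values, and missing_values are assumptions, while selected_columns, subset_data, and summary_data follow the text.

```r
# Toy data frame with one missing value to show the effect of na.rm = TRUE.
data <- data.frame(
  Total_Cases = c(0, 10, NA),
  Total_Deaths = c(0, 1, 2),
  Total_Recoveries = c(0, 4, 5)
)

# Select the three columns to analyze.
selected_columns <- c("Total_Cases", "Total_Deaths", "Total_Recoveries")
subset_data <- data[selected_columns]

# Column-wise statistics (MARGIN = 2), ignoring missing values.
min_values <- apply(subset_data, 2, min, na.rm = TRUE)
max_values <- apply(subset_data, 2, max, na.rm = TRUE)
mean_values <- apply(subset_data, 2, mean, na.rm = TRUE)

# Count missing values per column.
missing_values <- colSums(is.na(subset_data))

# Collect everything into one summary table.
summary_data <- data.frame(
  Column = selected_columns,
  Minimum = min_values,
  Maximum = max_values,
  Mean = mean_values,
  Missing_Values = missing_values
)
print(summary_data)
```

Because apply() returns named vectors, the column names also appear as row names of summary_data, which is why each name is printed twice in the output above.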
Histogram of Confirmed cases
## Warning: package 'ggplot2' was built under R version 4.2.3
Scatterplot of Confirmed cases and Deaths