Project 1 – R Practice: Results and Key Findings
Harsha Katreddy
College of Professional Studies, Northeastern University, Boston
ALY 6000: Introduction to Analytics S9 Fall 2023 (CRN: 70407)
Dr. Richard He
September 25, 2023
Introduction and Key Findings
This report summarizes the results and key inferences obtained by setting up a project in RStudio and executing code according to the 42 instructions given in ALY_6000_Project1.pdf. The assignment provides clear instructions for setting up the project in RStudio, along with code to clear the RStudio environment.
Key Findings
Problem 1:
The screenshot attached below shows the computed results of all the mathematical and logical operations performed as instructed. This problem demonstrates how arithmetic and logical operators work in R.
Figure 1. Answers for Problem 1
Problem 14:
Executing the four lines of code given in the instructions produces the results explained below.
second_vector + 20: Adds 20 to each element of second_vector and returns a new vector.
second_vector * 20: Multiplies each element of second_vector by 20 and returns a new vector.
second_vector >= 20: Returns TRUE for each element of second_vector that is greater than or equal to 20 and FALSE for each element that is less than 20. The output is a logical vector.
second_vector != 20: Returns TRUE for each element of second_vector that is not equal to 20 and FALSE for each element that is equal to 20. The output is a logical vector.
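The four operations above can be reproduced with a minimal sketch. It assumes second_vector holds the values listed under Problem 24 of this report; the result names (plus_20 and so on) are illustrative and not part of the assignment.

```r
# Assumption: second_vector holds the values shown under Problem 24
second_vector <- seq(from = 10, to = 30, by = 2)

# Arithmetic operators act element by element and return a new numeric vector
plus_20  <- second_vector + 20
times_20 <- second_vector * 20

# Comparison operators act element by element and return a logical vector
at_least_20 <- second_vector >= 20
not_20      <- second_vector != 20

print(plus_20)
print(times_20)
print(at_least_20)
print(not_20)
```

None of these lines changes second_vector itself; each result has to be assigned to a name to be kept.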
Problem 23:
The code extracts elements from first_vector [17 12 -33 5] by indexing it with the logical vector [FALSE TRUE FALSE TRUE]. It returns the elements of first_vector whose index value is TRUE and stores them in vector_from_boolean_brackets: [12 5]
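As a short illustration of the logical indexing described in Problem 23, the sketch below uses the values given in the report. The name same_by_position is an illustrative addition that shows the equivalent selection by position.

```r
first_vector <- c(17, 12, -33, 5)

# Keep only the positions where the logical index is TRUE
vector_from_boolean_brackets <- first_vector[c(FALSE, TRUE, FALSE, TRUE)]
print(vector_from_boolean_brackets)  # 12 5

# Selecting positions 2 and 4 directly gives the same elements
same_by_position <- first_vector[c(2, 4)]
print(identical(vector_from_boolean_brackets, same_by_position))  # TRUE
```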
Problem 24:
For each element of second_vector [10 12 14 16 18 20 22 24 26 28 30], the comparison returns TRUE if the element is greater than or equal to 20 and FALSE if it is less than 20. The output is the logical vector [FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE]
Problem 25:
Returns a vector containing the sequence of numbers that starts at 10, increments by 2, and ends at 30, and stores it in ages_vector: [10 12 14 16 18 20 22 24 26 28 30]
Problem 26:
ages_vector >= 20 generates a logical vector whose elements are TRUE where the corresponding element of ages_vector is greater than or equal to 20 and FALSE where it is less than 20. This logical vector is then used to index ages_vector, which returns the elements of ages_vector that are greater than or equal to 20. Answer: [20 22 24 26 28 30]
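Problems 25 and 26 can be combined into one small sketch: build the sequence, build the logical mask, then index with it. The names is_at_least_20 and ages_20_and_over are illustrative and not taken from the assignment.

```r
# Sequence from 10 to 30 in steps of 2
ages_vector <- seq(from = 10, to = 30, by = 2)

# Logical mask: TRUE where the element is 20 or more
is_at_least_20 <- ages_vector >= 20

# Indexing with the mask keeps only the TRUE positions
ages_20_and_over <- ages_vector[is_at_least_20]
print(ages_20_and_over)  # 20 22 24 26 28 30
```

The mask and the indexing are usually written in one step as ages_vector[ages_vector >= 20].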
Problem 30:
set.seed(5) initializes the random number generator to a fixed starting point. This is used to ensure reproducibility.
runif(n=10, min=0, max=1000) generates a vector of 10 random numbers between 0 and 1000 that follow a uniform distribution. The result is stored in random_vector: [200.2145 685.2186 916.8758 284.3995 104.6501 701.0575 527.9600 807.9352 956.5001 110.4530]
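The reproducibility that set.seed() provides can be checked with a short sketch: resetting the seed before each call should give the same draws. The names first_draw and second_draw are illustrative.

```r
# Same seed before each call, so the two draws should match exactly
set.seed(5)
first_draw <- runif(n = 10, min = 0, max = 1000)

set.seed(5)
second_draw <- runif(n = 10, min = 0, max = 1000)

print(identical(first_draw, second_draw))  # TRUE

# All values lie between min and max
print(range(first_draw))
```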
Problem 37:
set.seed(5) ensures the random number generator is set to the same starting point as in Problem 30.
rnorm(n=1000, mean=50, sd=15) generates a vector of 1000 random numbers drawn from a normal distribution with a mean of 50 and a standard deviation of 15. Unlike runif(), rnorm() does not restrict the values to a fixed range. The result is stored in random_vector
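A quick way to inspect what rnorm() returns is to look at the sample mean, the sample standard deviation, and the range of the draws. This is a minimal sketch. The summary names are illustrative, and the exact values depend on the random number generator, so none are stated here.

```r
set.seed(5)
random_vector <- rnorm(n = 1000, mean = 50, sd = 15)

# Sample statistics should be close to, but not exactly, the requested 50 and 15
sample_mean <- mean(random_vector)
sample_sd   <- sd(random_vector)
print(sample_mean)
print(sample_sd)

# Normal draws are not bounded by a min or max argument
print(range(random_vector))
```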
Problem 38:
Figure 2. Histogram generated in Problem 38
hist(random_vector) generates a histogram of the random_vector generated previously. It graphically represents the distribution of random_vector. The histogram shows the count of values (Frequency) that fall within each of the intervals marked on the x-axis. As random_vector contains 1000 random numbers drawn from a normal distribution with a mean of 50, the histogram shows the expected bell-shaped curve centered around 50. The spread of values is determined by the standard deviation of 15.
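The counts behind the histogram can also be inspected without drawing it. The sketch below assumes random_vector is generated as in Problem 37 and uses the plot = FALSE argument of hist(). The names histogram and bin_table are illustrative.

```r
set.seed(5)
random_vector <- rnorm(n = 1000, mean = 50, sd = 15)

# plot = FALSE returns the bin information instead of drawing the chart
histogram <- hist(random_vector, plot = FALSE)

# One row per bin: the midpoint of the interval and the number of values in it
bin_table <- data.frame(bin_mid = histogram$mids, count = histogram$counts)
print(bin_table)
```

The counts add up to the 1000 values in random_vector, and the tallest bins sit near the mean.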
Problem 42:
first_dataframe is created by reading the provided “ds_salaries.csv” file with the read_csv function.
head(first_dataframe): Returns the first 6 rows of "first_dataframe", as n=6 is the default number of rows to be returned.
head(first_dataframe, n=7): Returns the first 7 rows of "first_dataframe", as n=7 sets the number of rows to be returned.
names(first_dataframe): Returns the names of all columns in "first_dataframe".
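The behavior of head() and names() can be tried without the original file. The sketch below builds a small hypothetical data frame (demo_dataframe, whose rows are invented for illustration and are not taken from ds_salaries.csv) in place of first_dataframe.

```r
# Hypothetical stand-in for first_dataframe; values are made up for illustration
demo_dataframe <- data.frame(
  job_title     = paste("Job", 1:8),
  salary_in_usd = seq(from = 50000, to = 120000, by = 10000)
)

default_head <- head(demo_dataframe)         # n defaults to 6
seven_rows   <- head(demo_dataframe, n = 7)  # n set explicitly
column_names <- names(demo_dataframe)

print(nrow(default_head))  # 6
print(nrow(seven_rows))    # 7
print(column_names)
```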
smaller_dataframe <- select(first_dataframe, job_title, salary_in_usd)
smaller_dataframe
select() creates a new dataframe containing only the two columns "job_title" and "salary_in_usd" from "first_dataframe" and assigns it to "smaller_dataframe", which is subsequently displayed.
better_smaller_dataframe <- arrange(smaller_dataframe, desc(salary_in_usd))
better_smaller_dataframe
arrange() creates a new dataframe by sorting all rows of "smaller_dataframe" from highest to lowest based on the "salary_in_usd" column and assigns it to "better_smaller_dataframe", which is subsequently displayed.
better_smaller_dataframe <- filter(smaller_dataframe, salary_in_usd > 80000)
better_smaller_dataframe
filter() creates a new dataframe that keeps only the rows of "smaller_dataframe" where "salary_in_usd" is greater than 80,000 USD and assigns it to "better_smaller_dataframe", which is subsequently displayed.
better_smaller_dataframe <- mutate(smaller_dataframe, salary_in_euros = salary_in_usd*.94)
better_smaller_dataframe
mutate() adds a new column "salary_in_euros" whose values are generated by multiplying the values in "salary_in_usd" by 0.94 (the exchange rate). The new dataframe is stored in "better_smaller_dataframe" and subsequently displayed.
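The four dplyr verbs above can be tried together on a small example. The sketch assumes the dplyr package is installed. The rows of first_dataframe and the column extra_column are hypothetical, not taken from ds_salaries.csv, apart from the 79,833 USD salary mentioned in the Conclusion. Separate result names are used here so that each result can be compared side by side.

```r
library(dplyr)

# Hypothetical stand-in for first_dataframe; these rows are NOT from ds_salaries.csv
first_dataframe <- data.frame(
  job_title     = c("Data Scientist", "Data Analyst", "ML Engineer", "Data Engineer"),
  salary_in_usd = c(79833, 60000, 150000, 95000),
  extra_column  = c("a", "b", "c", "d")
)

# select(): keep two columns
smaller_dataframe <- select(first_dataframe, job_title, salary_in_usd)

# arrange() with desc(): sort from highest to lowest salary
sorted_dataframe <- arrange(smaller_dataframe, desc(salary_in_usd))

# filter(): keep rows with a salary above 80,000 USD
filtered_dataframe <- filter(smaller_dataframe, salary_in_usd > 80000)

# mutate(): add a column computed from an existing one
euros_dataframe <- mutate(smaller_dataframe, salary_in_euros = salary_in_usd * 0.94)

print(sorted_dataframe)
print(filtered_dataframe)
print(euros_dataframe)
```

Each verb returns a new data frame and leaves smaller_dataframe unchanged, which is why the report can reuse smaller_dataframe as the input at every step.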
better_smaller_dataframe <- slice(smaller_dataframe,1,1,2,3,4,10,1)
better_smaller_dataframe
slice() creates a new dataframe from the rows of "smaller_dataframe" at the given indices. An index that is passed more than once returns its row more than once, so row 1 appears three times here. The new dataframe is stored in "better_smaller_dataframe" and subsequently displayed.
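The effect of repeating an index in slice() can be seen on a small example. This sketch assumes the dplyr package is installed and uses a hypothetical 10-row data frame in place of smaller_dataframe. The name rows_for_job_1 is illustrative.

```r
library(dplyr)

# Hypothetical 10-row stand-in for smaller_dataframe
smaller_dataframe <- data.frame(
  job_title     = paste("Job", 1:10),
  salary_in_usd = seq(from = 50000, to = 140000, by = 10000)
)

# Same indices as in the report: row 1 is requested three times
better_smaller_dataframe <- slice(smaller_dataframe, 1, 1, 2, 3, 4, 10, 1)

print(nrow(better_smaller_dataframe))  # 7

# Count how many times the first row's job title appears in the result
rows_for_job_1 <- sum(better_smaller_dataframe$job_title == "Job 1")
print(rows_for_job_1)  # 3
```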
ggplot(better_smaller_dataframe) +
Initializes the plot with ggplot(), using "better_smaller_dataframe" as the data argument.
geom_col(mapping = aes(x = job_title, y=salary_in_usd), fill="blue")+
Adds a bar chart layer with "job_title" as the x-axis variable and "salary_in_usd" as the y-axis variable. The bars are filled in "blue".
xlab("Job Title") +
Sets the x-axis label to "Job Title".
ylab("Salary in US Dollars") +
Sets the y-axis label to "Salary in US Dollars".
labs (title = "Comparision of Jobs ") +
Sets the title of the plot to "Comparision of Jobs".
scale_y_continuous(labels = scales::dollar) +
Formats the y-axis labels as dollar values, as they represent salaries in USD.
theme(axis.text.x = element_text(angle = 50, hjust = 1))
Rotates the x-axis labels by 50 degrees and sets their horizontal justification to 1 (right-aligned). This prevents the job titles from overlapping.
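For reference, the layers above can be assembled into one runnable call. This is a sketch that assumes the ggplot2 and scales packages are installed. The rows of better_smaller_dataframe are hypothetical placeholders rather than values from ds_salaries.csv, and salary_plot is an illustrative name.

```r
library(ggplot2)

# Hypothetical rows; the report's data frame comes from slice() on ds_salaries.csv
better_smaller_dataframe <- data.frame(
  job_title     = c("Data Scientist", "Data Analyst", "ML Engineer"),
  salary_in_usd = c(79833, 60000, 150000)
)

salary_plot <- ggplot(better_smaller_dataframe) +
  geom_col(mapping = aes(x = job_title, y = salary_in_usd), fill = "blue") +
  xlab("Job Title") +
  ylab("Salary in US Dollars") +
  labs(title = "Comparison of Jobs") +  # spelling differs from the title string in the report's code
  scale_y_continuous(labels = scales::dollar) +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))

# Draws the chart on the active graphics device
print(salary_plot)
```

Storing the plot in a variable is optional. The report's version, which ends the chain without assignment, displays the chart directly when run in RStudio.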
Figure 3. Bar Chart generated in Problem 43
Conclusion
This assignment helped me learn some basic concepts of the R programming language and become familiar with several functions in R. However, the last ggplot() call, executed on “better_smaller_dataframe”, produced a misleading bar chart because of the dataframe it was given. That dataframe contains multiple rows with the job title “Data Scientist”, because the index 1 is passed to slice() three times. geom_col() stacks the bars of rows that share the same x value, so the salary shown for “Data Scientist” is around 240,000 USD instead of 79,833 USD.
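The inflated bar can be checked with simple arithmetic in base R: summing the salary over rows that share a job title gives the height of a stacked bar. In the sketch below only the “Data Scientist” salary of 79,833 USD comes from this report. The other rows, and the names plotted_rows, bar_heights and deduplicated_rows, are hypothetical.

```r
# Row 1 ("Data Scientist", 79833) appears three times, as produced by
# slice(smaller_dataframe, 1, 1, 2, 3, 4, 10, 1); the other rows are placeholders
plotted_rows <- data.frame(
  job_title     = c("Data Scientist", "Data Scientist", "Job B", "Job C",
                    "Job D", "Job E", "Data Scientist"),
  salary_in_usd = c(79833, 79833, 60000, 70000, 80000, 90000, 79833)
)

# Height of each stacked bar: the sum of salary_in_usd per job title
bar_heights <- tapply(plotted_rows$salary_in_usd, plotted_rows$job_title, sum)
print(bar_heights[["Data Scientist"]])  # 239499

# One possible remedy: drop the duplicated rows before plotting
deduplicated_rows <- unique(plotted_rows)
print(nrow(deduplicated_rows))  # 5
```

Three copies of 79,833 add up to 239,499, which matches the value of around 240,000 USD read from the chart.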
Citations:
Kabacoff, R. I. (2022). R in Action (3rd ed.). Manning Publications.
DataCamp. (2023). https://rdocumentation.org/
American Psychological Association. (2023). https://www.apa.org
Introduction to problem solving with R. (2023). Instructions set. ALY6000_Project_1.pdf
OpenAI. (September 25, 2023). https://chat.openai.com/ Prompt: Explain geom_col(mapping = aes(x = job_title, y=salary_in_usd), fill="blue") in context on R Language