IE6400_Day14
School: Northeastern University
Course: 6400
Subject: Industrial Engineering
Date: Feb 20, 2024
IE6400 Foundations for Data Analytics Engineering
¶
Fall 2023
¶
Module 2: Joint, Marginal and Conditional Probability
¶
Probability Concepts
¶
1. Joint Probability:
¶
• Definition: The joint probability of two events, A and B, denoted as $P(A \cap B)$ or $P(A, B)$, is the probability that both events occur at the same time.
• Formula: $P(A \cap B) = P(A) \times P(B|A)$ or $P(A \cap B) = P(B) \times P(A|B)$
2. Conditional Probability:
¶
• Definition: The conditional probability of an event A given that another event B has occurred is denoted as $P(A|B)$. It represents the probability of A occurring, assuming that B has already occurred.
• Formula: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
3. Marginal Probability:
¶
• Definition: The marginal probability of an event A is simply the probability of that event occurring without any condition on another event. It is also known as the "unconditional probability," or simply the "probability."
• Formula: For two events A and B, the marginal probability of A can be found by summing the joint probabilities of A occurring with each possible state of B. That is, $P(A) = \sum_{b} P(A, B=b)$, where B=b represents each possible state of B.
Relationship:
• These probabilities provide different perspectives on the likelihood of events: joint probability considers two events together, conditional probability considers one event given the occurrence of another, and marginal probability considers one event without any conditions.
Joint, Conditional and Marginal Probability
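Before turning to the exercises, the three definitions above can be checked numerically on a small joint distribution. A minimal sketch; the joint table below is made up for illustration and is not taken from any exercise:

```python
import numpy as np

# Illustrative joint distribution P(A, B) for two binary events.
# Row index = value of A, column index = value of B; entries sum to 1.
joint = np.array([[0.30, 0.20],    # P(A=0, B=0), P(A=0, B=1)
                  [0.10, 0.40]])   # P(A=1, B=0), P(A=1, B=1)

# Marginals: sum the joint over the other variable.
P_A = joint.sum(axis=1)            # P(A=a) = sum_b P(A=a, B=b)
P_B = joint.sum(axis=0)            # P(B=b) = sum_a P(A=a, B=b)

# Conditional: P(A=1 | B=1) = P(A=1, B=1) / P(B=1)
P_A1_given_B1 = joint[1, 1] / P_B[1]

print(P_A, P_B, P_A1_given_B1)
# Chain rule: the joint factors as P(B) * P(A | B)
print(np.isclose(joint[1, 1], P_B[1] * P_A1_given_B1))  # True
```

The same three operations (sum a row or column, divide a cell by a marginal, multiply back) reappear in every exercise below.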
¶
Exercise 1 Understanding Joint Probability through Dice Rolling Simulation
¶
Problem Statement
¶
Imagine you have two six-sided dice:
• Die A: A standard die with faces [1, 2, 3, 4, 5, 6].
• Die B: Another standard die with faces [1, 2, 3, 4, 5, 6].
We will simulate the rolling of die A and die B 10,000 times. Our goal is to calculate the joint probability of the following two specific events:
1. Event 1: Die A rolls a 2.
2. Event 2: Die B rolls a 4.
The joint probability is the probability of both events happening at the same time.
Objective:
¶
1. Simulate the rolling of two dice 10,000 times.
2. Calculate the joint probability of rolling a 2 with die A and a 4 with die B.
3. Visualize the outcomes.
4. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Simulating the Dice Rolls
¶
We'll use numpy
to simulate the rolling of two dice 10,000 times.
In [2]:
np.random.seed(0) # for reproducibility
n_rolls = 10000
# Simulating the rolls
rolls_A = np.random.randint(1, 7, n_rolls)
rolls_B = np.random.randint(1, 7, n_rolls)
Step 3: Calculating the Joint Probability
¶
We'll calculate the joint probability of rolling a 2 with die A and a 4 with die B.
In [3]:
# Identifying the successful events
success_events = np.logical_and(rolls_A == 2, rolls_B == 4)
# Calculating the joint probability
joint_prob = np.sum(success_events) / n_rolls
# Print the result
print(f"Joint Probability of event A (rolling a 2) and event B (rolling a 4) is: {joint_prob}")
Joint Probability of event A (rolling a 2) and event B (rolling a 4) is: 0.028
Step 4: Visualization
¶
We'll visualize the outcomes of the dice rolls using seaborn
.
In [4]:
# Creating a DataFrame for visualization
import pandas as pd
df = pd.DataFrame({'Die A': rolls_A, 'Die B': rolls_B})
# Plotting
sns.histplot(df, bins=np.arange(1, 9), discrete=True, stat='probability', common_norm=False)
plt.title('Distribution of Dice Rolls')
plt.xlabel('Die Face')
plt.ylabel('Probability')
plt.legend(['Die A', 'Die B'])
plt.show()
Interpretation
¶
The joint probability calculated gives us the probability of both events (rolling a 2 with die A and rolling a 4 with die B) occurring together in a single roll. The visualization shows the distribution of outcomes for each die over the 10,000 rolls.
Conclusion
¶
Through simulation, we can estimate probabilities of various events. The joint probability provides insights into the likelihood of multiple events occurring together. Understanding this concept is crucial in various fields like statistics, data science, and various research areas where dependency between events is analyzed.
Exercise 2 Understanding Joint and Marginal Probabilities from Customer Complaints
¶
Problem Statement
¶
Consider a scenario at a popular company service center where they receive various complaints from their customers. Out of a total of 100 complaints:
• 80 customers complained about late delivery of the items.
• 60 customers complained about poor product quality.
We want to answer the following questions:
1. What is the probability that a customer complaint will be about both product quality and late delivery?
2. What is the probability that a complaint will be only about late delivery?
Objective:
¶
1. Calculate the joint probability of complaints about both product quality and late delivery.
2. Calculate the marginal probability of complaints only about late delivery.
3. Visualize the outcomes.
4. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [5]:
import matplotlib.pyplot as plt
Step 2: Calculating Probabilities
¶
Given the data, we can use the principle of Inclusion-Exclusion to find the joint and marginal probabilities.
In [6]:
# Given data
total_complaints = 100
late_delivery_complaints = 80
poor_quality_complaints = 60
# Using Inclusion-Exclusion principle to find complaints about both
both_complaints = late_delivery_complaints + poor_quality_complaints - total_complaints
# a) Probability of both complaints
prob_both = both_complaints / total_complaints
# b) Probability of only late delivery
prob_only_late_delivery = (late_delivery_complaints - both_complaints) / total_complaints
#prob_both, prob_only_late_delivery
print('a. Probability that a customer complaint will be about both product quality and late delivery is %1.4f' % prob_both)
print('b. Probability that a complaint will be only about late delivery is %1.4f' % prob_only_late_delivery)
a. Probability that a customer complaint will be about both product quality and late delivery is 0.4000
b. Probability that a complaint will be only about late delivery is 0.4000
Step 3: Visualization
¶
We'll visualize the complaints using a Venn diagram for better understanding.
In [7]:
!pip install matplotlib_venn
In [8]:
from matplotlib_venn import venn2
# Plotting Venn diagram
plt.figure(figsize=(8, 8))
# venn2 expects region sizes in the order (only A, only B, A and B)
venn2(subsets=(late_delivery_complaints - both_complaints,
               poor_quality_complaints - both_complaints,
               both_complaints),
      set_labels=('Late Delivery', 'Poor Quality'))
plt.title('Venn Diagram of Customer Complaints')
plt.show()
Interpretation
¶
From the calculated probabilities:
1. The joint probability represents the likelihood of a customer complaining about both late delivery and poor product quality.
2. The marginal probability for only late delivery gives us the proportion of customers who had issues solely with late delivery and not with product quality.
The Venn diagram visually represents the overlap between the two types of complaints, helping us understand the distribution of complaints better.
Conclusion
¶
Understanding joint and marginal probabilities is crucial in real-world scenarios, especially in customer service and product management. It helps businesses identify areas of improvement and prioritize issues based on their impact and frequency.
Conditional probability
¶
Conditional probability is a concept in probability theory that quantifies the likelihood of one event occurring given that another event has already occurred. It expresses how the probability of an event is influenced or constrained by the knowledge of another event. Conditional probability is denoted as P(A | B), where A is the event of interest, and B is the condition under which we're assessing the probability. The formula for conditional probability is:
\begin{equation} P(A | B) = \frac{P(A \cap B)}{P(B)} \end{equation}
Exercise 3 Understanding Conditional Probability with a Deck of Cards
¶
Problem Statement
¶
Given a standard deck of 52 playing cards, we want to:
1. Calculate the probability of drawing an Ace on the first draw.
2. Calculate the conditional probability of drawing a King on the second draw given that an Ace was drawn first.
Objective:
¶
1. Define the deck of cards.
2. Calculate the probabilities.
3. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [9]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Defining the Deck and Calculating Probabilities
¶
In [10]:
# Define the deck of cards
deck = ['Ace', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Jack', 'Queen', 'King'] * 4  # Four suits
# Probability of drawing an Ace on the first draw
prob_Ace_first_draw = deck.count('Ace') / len(deck)
# Probability of drawing a King on the second draw after drawing an Ace
deck.remove('Ace')  # Remove one Ace: 51 cards remain, all four Kings among them
prob_King_second_draw = deck.count('King') / len(deck)
# The conditional probability P(King second | Ace first) is exactly this second-draw probability
conditional_probability = prob_King_second_draw
prob_Ace_first_draw, conditional_probability
Out[10]:
(0.07692307692307693, 0.0784313725490196)
Step 3: Visualization
¶
We'll visualize the probabilities using a bar chart for better understanding.
In [11]:
# Plotting the probabilities
labels = ['P(Ace on 1st draw)', 'P(King on 2nd draw | Ace on 1st draw)']
values = [prob_Ace_first_draw, prob_King_second_draw]
plt.bar(labels, values, color=['blue', 'green'])
plt.ylabel('Probability')
plt.title('Conditional Probability with a Deck of Cards')
plt.show()
print(f"Conditional Probability (P(King Second | Ace First)): {conditional_probability}")
Conditional Probability (P(King Second | Ace First)): 0.0784313725490196
Interpretation
¶
From the calculated probabilities:
1. The first probability gives us the likelihood of drawing an Ace from a full deck of cards.
2. The conditional probability represents the likelihood of drawing a King given that an Ace was drawn in the previous draw.
Conclusion
¶
Conditional probability is a fundamental concept in probability theory and statistics. The provided code demonstrates how to compute conditional probabilities using a real-world example of drawing cards from a deck: once an Ace has been drawn and removed, the probability of drawing a King is taken over the 51 cards that remain.
Exercise 5 Understanding the Probability of Consecutive Events with Dice Rolling
¶
Problem Statement
¶
Imagine you have a standard six-sided die. We want to understand the probability of a specific scenario:
What is the probability of rolling a "6" in two consecutive trials when rolling the die?
Through this exercise, we will simulate rolling the die multiple times and compute the probability of the event of interest.
Objective:
¶
1. Simulate rolling a die multiple times.
2. Calculate the probability of rolling a "6" in two consecutive trials.
3. Visualize the outcomes.
4. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [12]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Simulating Dice Rolls
¶
We'll use numpy
to simulate rolling a die multiple times.
In [13]:
np.random.seed(0) # for reproducibility
n_trials = 10000
# Simulating the rolls
rolls = np.random.randint(1, 7, n_trials)
# Checking for consecutive "6"s
consecutive_sixes = np.sum((rolls[:-1] == 6) & (rolls[1:] == 6))
# Probability of getting two consecutive "6"s
prob_consecutive_sixes = consecutive_sixes / (n_trials - 1)
prob_consecutive_sixes
Out[13]:
0.028502850285028504
Step 3: Visualization (Revised)
¶
We'll visualize the outcomes of the dice rolls and highlight the instances of consecutive
"6"s.
In [14]:
plt.figure(figsize=(15, 6))
plt.plot(rolls[:100], 'o-', label='Dice Rolls') # Plotting the first 100 rolls for clarity
# Identifying positions where a "6" is followed by another "6"
positions_of_consecutive_sixes = np.where((rolls[:99] == 6) & (rolls[1:100] == 6))[0]
plt.plot(positions_of_consecutive_sixes, rolls[positions_of_consecutive_sixes], 'ro', label='First of Consecutive 6s')
plt.plot(positions_of_consecutive_sixes + 1, rolls[positions_of_consecutive_sixes + 1],
'ro') # Second of Consecutive 6s
plt.xlabel('Trial')
plt.ylabel('Dice Face')
plt.title('First 100 Dice Rolls with Consecutive "6"s Highlighted')
plt.legend()
plt.grid(True)
plt.show()
Interpretation
¶
From the simulation:
• The probability calculated gives us the likelihood of rolling a "6" in two consecutive trials.
• The visualization provides a snapshot of the first 100 dice rolls, with the instances of consecutive "6"s highlighted in red.
Conclusion
¶
Understanding the probability of consecutive events is crucial in various scenarios, from gaming strategies to statistical analyses. Through this exercise, we've demonstrated how to compute such probabilities using a simple dice-rolling example.
Exercise 6 Understanding Conditional Probability with Sports Preferences and Gender
¶
Problem Statement
¶
A survey was conducted among 300 individuals, asking them about their favorite sport
among the following options: baseball, basketball, football, or soccer. The survey also recorded the gender of each respondent.
Given the survey results, we want to answer questions like:
1. What is the probability that a randomly selected individual prefers basketball?
2. Given that an individual is male, what is the probability they prefer basketball?
3. Given that an individual prefers basketball, what is the probability they are female?
Through this exercise, we will compute and understand conditional probabilities based on the survey results.
Objective:
¶
1. Analyze the survey results.
2. Calculate the probability of an individual preferring basketball.
3. Calculate the conditional probabilities based on gender.
4. Visualize the outcomes.
5. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Creating the Survey Dataset
¶
We'll use the provided sample data to represent the survey results.
In [16]:
# Create pandas DataFrame with raw data
df = pd.DataFrame({'gender': np.repeat(np.array(['Male', 'Female']), 150),
                   'sport': np.repeat(np.array(['Baseball', 'Basketball', 'Football', 'Soccer',
                                                'Baseball', 'Basketball', 'Football', 'Soccer']),
                                      (34, 40, 58, 18, 34, 52, 20, 44))})
df.head()
Out[16]:
  gender     sport
0   Male  Baseball
1   Male  Baseball
2   Male  Baseball
3   Male  Baseball
4   Male  Baseball
Step 3: Calculating Probabilities
¶
Given the dataset, we can now calculate the required probabilities.
In [17]:
# Probability of preferring basketball
prob_basketball = len(df[df['sport'] == 'Basketball']) / len(df)
# Conditional probability of preferring basketball given male
prob_basketball_given_male = len(df[(df['sport'] == 'Basketball') & (df['gender'] == 'Male')]) / len(df[df['gender'] == 'Male'])
# Conditional probability of being female given preferring basketball
prob_female_given_basketball = len(df[(df['sport'] == 'Basketball') & (df['gender'] == 'Female')]) / len(df[df['sport'] == 'Basketball'])
prob_basketball, prob_basketball_given_male, prob_female_given_basketball
Out[17]:
(0.30666666666666664, 0.26666666666666666, 0.5652173913043478)
Step 4: Visualization
¶
We'll visualize the survey results and the calculated probabilities.
In [18]:
# Plotting the survey results based on gender and sport preference
pivot_count = df.groupby(['gender', 'sport']).size().unstack()
pivot_count.plot(kind='bar', stacked=True, figsize=(10, 7))
plt.title('Survey Results: Favorite Sports by Gender')
plt.ylabel('Number of Individuals')
plt.show()
Interpretation
¶
From the visualization and calculated probabilities:
• The bar chart shows the distribution of sports preferences among males and females.
• The calculated probabilities provide insights into specific scenarios, such as the likelihood of a male preferring basketball and the likelihood of a basketball fan being female.
Conclusion
¶
Conditional probability is a crucial concept in probability theory and statistics. Through this exercise, we've demonstrated how to compute conditional probabilities using a real-world example of a sports preference survey segmented by gender.
Exercise 7 Understanding Marginal Probability with Dice Rolling
¶
Problem Statement
¶
Imagine you have a standard six-sided die. We want to understand a specific probability scenario:
What is the marginal probability of rolling a "3" when rolling the die?
Through this exercise, we will compute the marginal probability of the event of interest
and visualize the outcomes of rolling the die multiple times.
Objective:
¶
1. Simulate rolling a die multiple times.
2. Calculate the marginal probability of rolling a "3".
3. Visualize the outcomes.
4. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [19]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Simulating Dice Rolls
¶
We'll use numpy
to simulate rolling a die multiple times.
In [20]:
np.random.seed(0) # for reproducibility
n_trials = 1000
# Simulating the rolls
rolls = np.random.randint(1, 7, n_trials)
# Marginal probability of rolling a "3"
prob_rolling_3 = np.sum(rolls == 3) / n_trials
prob_rolling_3
Out[20]:
0.157
Step 3: Visualization
¶
We'll visualize the outcomes of the dice rolls and highlight the instances of rolling a "3".
In [21]:
# Plotting the outcomes of the dice rolls
plt.figure(figsize=(15, 6))
plt.hist(rolls, bins=np.arange(1, 8) - 0.5, rwidth=0.8, align='mid', color='skyblue', edgecolor='black')
plt.xlabel('Dice Face')
plt.ylabel('Frequency')
plt.title('Distribution of 1000 Dice Rolls')
plt.xticks(np.arange(1, 7))
plt.axvline(x=3, color='red', linestyle='dashed', label='Dice Face = 3')
plt.legend()
plt.show()
print(f"Marginal Probability (P(3)): {prob_rolling_3}")
Marginal Probability (P(3)): 0.157
Interpretation
¶
From the simulation:
• The histogram shows the distribution of dice faces over 1000 rolls.
• The dashed red line indicates the dice face "3", for which we calculated the marginal probability.
• The calculated probability gives us the likelihood of rolling a "3" in any given trial.
Conclusion
¶
Marginal probability is a fundamental concept in probability theory. Through this exercise, we've demonstrated how to compute marginal probabilities using a simple dice-rolling example and visualized the outcomes for better understanding.
Probability Mass Function
¶
Probability Mass Function
A probability mass function (pmf) is the probability distribution of a discrete random variable: it lists the possible values of the variable and their associated probabilities, giving for each value the probability that X is equal to it.
Let X be a discrete random variable on a sample space S. Then the probability mass function f(x) is defined as f(x) = P[X = x].
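As a minimal sketch of the definition, the pmf of a fair six-sided die assigns 1/6 to each face; the probabilities sum to 1, and the pmf also yields the expected value (nothing here is assumed beyond the definition above):

```python
# pmf of a fair six-sided die: f(x) = P[X = x] = 1/6 for each face x
pmf = {x: 1/6 for x in range(1, 7)}

print(pmf[3])                              # f(3) = P[X = 3], about 0.1667
print(sum(pmf.values()))                   # a pmf's probabilities sum to 1
print(sum(x * p for x, p in pmf.items()))  # expected value E[X] = 21/6 = 3.5
```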
Exercise 8 Understanding Probability Mass Function (PMF) with Dice Rolling
¶
Problem Statement
¶
Imagine you have a standard six-sided die. We want to understand the Probability Mass Function (PMF) for this scenario.
What is the PMF when rolling a six-sided die?
Through this exercise, we will compute the PMF for each possible outcome of the die and visualize the results.
Objective:
¶
1. Define the PMF for a fair six-sided die.
2. Visualize the PMF.
3. Interpret the results.
Step 1: Importing Necessary Libraries
¶
In [22]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Defining the PMF
¶
For a fair six-sided die, each face has an equal probability of 1/6. We'll define the PMF accordingly.
In [23]:
# Possible outcomes of the die
outcomes = np.arange(1, 7)
# PMF for each outcome
pmf = [1/6 for _ in outcomes]
pmf
Out[23]:
[0.16666666666666666,
0.16666666666666666,
0.16666666666666666,
0.16666666666666666,
0.16666666666666666,
0.16666666666666666]
Step 3: Visualization
¶
We'll visualize the PMF to better understand the distribution of probabilities for each outcome.
In [24]:
# Plotting the PMF
plt.figure(figsize=(10, 6))
plt.bar(outcomes, pmf, color='lightblue', edgecolor='black')
plt.xlabel('Dice Face')
plt.ylabel('Probability')
plt.title('Probability Mass Function (PMF) of a Fair Six-Sided Die')
plt.xticks(outcomes)
plt.ylim(0, 1/6 + 0.05) # Adjusting y-axis to better visualize the probabilities
plt.show()
Interpretation
¶
From the visualization:
• The bar chart shows the PMF of a fair six-sided die.
• Each face of the die has an equal probability of 1/6, as represented by the equal heights of the bars.
• This confirms our understanding that in a fair die, each face has an equal chance of landing face up.
Conclusion
¶
The Probability Mass Function (PMF) provides a clear way to represent the probabilities
of discrete random variables. Through this exercise, we've visualized the PMF for a simple dice-rolling scenario, reinforcing the concept of equal probabilities for each face
of a fair die.
Probability Density Function
¶
Probability Density Function
The probability density function (PDF) expresses the relative likelihood of a continuous random variable taking on a particular value. We can leverage powerful libraries like NumPy, SciPy, and Matplotlib to plot the PDF of a continuous random variable in Python. Probabilities correspond to areas under the density, $P(a \le X \le b) = \int_{a}^{b} f(x) \, dx$, and because any single point carries zero probability for a continuous variable,
$P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b)$
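A minimal sketch of the area-under-the-curve idea, using the standard normal density from scipy; the interval [-1, 1] is an illustrative choice:

```python
# P(a <= X <= b) for a continuous variable is the integral of the pdf
# from a to b. Check it two ways for the standard normal.
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)       # numerical integral of the pdf
via_cdf = norm.cdf(b) - norm.cdf(a)  # same probability via the cdf

print(area)     # ~0.6827, the familiar "68% within one standard deviation"
print(via_cdf)  # agrees with the direct integral
```

Since endpoints have zero probability, the same number answers all four interval forms in the equation above.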
Exercise 9 Understanding Probability Density Function (PDF) with the Normal Distribution
¶
Problem Statement
¶
Consider a scenario where we are studying the heights of adult males in a particular region. The heights are normally distributed with a mean of 175 cm and a standard deviation of 7 cm.
Objective:
Understand and visualize the Probability Density Function (PDF) for this scenario.
Step 1: Importing Necessary Libraries
¶
In [25]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
Step 2: Defining the PDF
¶
For the given scenario, we'll use a normal distribution with the given mean and standard deviation to define the PDF. We'll generate a range of height values and compute the PDF for each value.
In [26]:
# Parameters for the normal distribution
mean = 175
std_dev = 7
# Generating a range of height values
heights = np.linspace(mean - 4*std_dev, mean + 4*std_dev, 1000)
# Computing the PDF for each height value
pdf_values = norm.pdf(heights, mean, std_dev)
Step 3: Visualization
¶
We'll visualize the PDF to better understand the distribution of heights.
In [27]:
# Plotting the PDF
plt.figure(figsize=(12, 6))
plt.plot(heights, pdf_values, color='blue', linewidth=2)
plt.fill_between(heights, pdf_values, color='skyblue', alpha=0.4)
plt.title('Probability Density Function (PDF) of Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.grid(True)
plt.show()
Interpretation
¶
From the visualization:
• The curve represents the distribution of heights of adult males in the region.
• The peak of the curve is at the mean height of 175 cm.
• The spread of the curve is determined by the standard deviation, indicating the variability in heights.
• The area under the curve represents probability, and for a continuous distribution, the total area under the curve is 1.
Conclusion
¶
The Probability Density Function (PDF) provides a way to represent the probabilities of continuous random variables. Through this exercise, we've visualized the PDF for the distribution of heights, reinforcing the concept of how probabilities are distributed for continuous variables.
Cumulative Distribution Function
¶
Cumulative Distribution Function
The cumulative distribution function (cdf) is the probability that the variable takes a value less than or equal to x. That is, $F(x) = P[X \le x]$.
For a continuous distribution, this can be expressed mathematically as $F(x) = \int_{-\infty}^{x} f(u) \, du$
For a discrete distribution, the cdf can be expressed as $F(x) = \sum_{i=0}^{x} f(i)$
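For the discrete case, the cdf is just the running sum of the pmf. A minimal sketch for a fair die:

```python
import numpy as np

# Discrete cdf of a fair die: F(x) = P[X <= x], the cumulative sum of the pmf
faces = np.arange(1, 7)
pmf = np.full(6, 1/6)
cdf = np.cumsum(pmf)

for x, F in zip(faces, cdf):
    print(x, round(F, 4))  # F(3) is 0.5; F(6) is 1.0
```

The cdf is non-decreasing and reaches 1 at the largest value of the support.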
Exercise 10 Understanding Cumulative Distribution Function (CDF) with the Normal Distribution
¶
Problem Statement
¶
Imagine a scenario where we are studying the exam scores of students in a particular class. The scores are normally distributed with a mean of 70 and a standard deviation of 10.
Objective:
Understand and visualize the Cumulative Distribution Function (CDF) for this scenario, which represents the probability that a student scored less than or equal
to a particular score.
Step 1: Importing Necessary Libraries
¶
In [28]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
Step 2: Defining the CDF
¶
For the given scenario, we'll use a normal distribution with the given mean and standard deviation to define the CDF. We'll generate a range of score values and compute the CDF for each value.
In [29]:
# Parameters for the normal distribution
mean = 70
std_dev = 10
# Generating a range of score values
scores = np.linspace(mean - 4*std_dev, mean + 4*std_dev, 1000)
# Computing the CDF for each score value
cdf_values = norm.cdf(scores, mean, std_dev)
Step 3: Visualization
¶
We'll visualize the CDF to better understand the distribution of exam scores.
In [30]:
# Plotting the CDF
plt.figure(figsize=(12, 6))
plt.plot(scores, cdf_values, color='green', linewidth=2)
plt.title('Cumulative Distribution Function (CDF) of Exam Scores')
plt.xlabel('Score')
plt.ylabel('Probability')
plt.grid(True)
plt.show()
Interpretation
¶
From the visualization:
• The curve represents the cumulative probabilities of exam scores.
• For any given score on the x-axis, the corresponding y-value gives the probability that a student scored less than or equal to that score.
• The curve starts at 0 and ends at 1, representing the cumulative probability range.
• The steeper regions of the curve indicate where most students' scores lie, while flatter regions indicate fewer scores.
Conclusion
¶
The Cumulative Distribution Function (CDF) provides a way to represent the cumulative probabilities of continuous random variables. Through this exercise, we've visualized the CDF for the distribution of exam scores, reinforcing the concept of how cumulative probabilities are distributed for continuous variables.
Marginal Probability Distribution
¶
Definition:
The marginal probability distribution of a random variable in a multivariate
distribution represents the probability distribution of that single variable, ignoring the values of other variables in the distribution. It is obtained by summing (for discrete variables) or integrating (for continuous variables) the joint probability distribution over all possible values of the variable of interest.
Mathematical Representation:
For a discrete random variable X:
\begin{equation}P(X=x) = \sum_{y} P(X=x, Y=y) \end{equation}
This equation states that the probability that X takes on a specific value (x) is obtained
by summing over all possible values of the other random variable Y, considering all pairs (x, y) where x is the value of interest for X.
For a continuous random variable X: \begin{equation} f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y) \, dy \end{equation}
In this equation, $f_X(x)$ represents the probability density function (PDF) of X. It is obtained by integrating the joint density function $(f_{XY}(x, y))$ with respect to (y) over the entire range of possible (y) values.
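The continuous case can be checked numerically. A minimal sketch: take the illustrative joint density $f_{XY}(x, y) = x + y$ on the unit square (a valid density, since it integrates to 1), whose marginal is $f_X(x) = x + 1/2$ analytically:

```python
# Marginal density by integrating out y, per the equation above.
from scipy.integrate import quad

def joint_pdf(x, y):
    return x + y  # illustrative joint density on [0,1] x [0,1]

def marginal_x(x):
    # f_X(x) = integral over y of f_XY(x, y)
    val, _ = quad(lambda y: joint_pdf(x, y), 0.0, 1.0)
    return val

print(marginal_x(0.3))       # ~0.8, matching x + 1/2
total, _ = quad(marginal_x, 0.0, 1.0)
print(total)                 # ~1.0: the marginal is itself a density
```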
Exercise 11 Understanding Marginal Probability Distribution
¶
Problem Statement
¶
A company conducted a survey among its customers to understand their preferences for two products: Product A and Product B. The survey also recorded the age group of the respondents: "Young" (below 30) and "Old" (30 and above).
Given the joint distribution of age group and product preference, we want to find the marginal probabilities for each product and each age group.
Objective:
Understand and visualize the Marginal Probability Distribution for the given
scenario.
Step 1: Importing Necessary Libraries
¶
In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Generating the Dataset
¶
We'll create a dataset representing the joint distribution of age group and product preference.
In [32]:
# Sample data representing joint distribution
data = {
'Product': ['Product A', 'Product A', 'Product B', 'Product B'],
'Age Group': ['Young', 'Old', 'Young', 'Old'],
'Count': [120, 80, 100, 150]
}
df = pd.DataFrame(data)
df
Out[32]:
     Product Age Group  Count
0  Product A     Young    120
1  Product A       Old     80
2  Product B     Young    100
3  Product B       Old    150
Step 3: Calculating Marginal Probabilities
¶
Given the joint distribution, we'll calculate the marginal probabilities for each product and each age group.
In [33]:
# Marginal probability for each product
marginal_product = df.groupby('Product')['Count'].sum() / df['Count'].sum()
# Marginal probability for each age group
marginal_age = df.groupby('Age Group')['Count'].sum() / df['Count'].sum()
marginal_product, marginal_age
Out[33]:
(Product
 Product A    0.444444
 Product B    0.555556
 Name: Count, dtype: float64,
 Age Group
 Old      0.511111
 Young    0.488889
 Name: Count, dtype: float64)
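As a cross-check, the same marginals fall out of `pd.crosstab` with `normalize='all'`, which builds the full joint probability table first; its row and column sums are the marginals. This is a sketch that rebuilds the survey table from the exercise so it runs on its own.

```python
import pandas as pd

# Rebuild the survey data from the exercise above.
data = {
    'Product': ['Product A', 'Product A', 'Product B', 'Product B'],
    'Age Group': ['Young', 'Old', 'Young', 'Old'],
    'Count': [120, 80, 100, 150],
}
df = pd.DataFrame(data)

# Joint probability table: each cell is Count / total Count.
table = pd.crosstab(df['Product'], df['Age Group'], values=df['Count'],
                    aggfunc='sum', normalize='all')

# Row sums give the product marginals, column sums give the age-group marginals.
print(table.sum(axis=1))   # Product A ~0.444, Product B ~0.556
print(table.sum(axis=0))   # Old ~0.511, Young ~0.489
```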
Step 4: Visualization
¶
We'll visualize the marginal probabilities to better understand the distribution.
In [34]:
# Plotting the marginal probabilities
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
# For products
marginal_product.plot(kind='bar', ax=axes[0], color='lightblue', edgecolor='black')
axes[0].set_title('Marginal Probability Distribution for Products')
axes[0].set_ylabel('Probability')
axes[0].set_xlabel('Product')
# For age groups
marginal_age.plot(kind='bar', ax=axes[1], color='lightgreen', edgecolor='black')
axes[1].set_title('Marginal Probability Distribution for Age Groups')
axes[1].set_ylabel('Probability')
axes[1].set_xlabel('Age Group')
plt.tight_layout()
plt.show()
Interpretation
¶
From the visualization:
• The first bar chart shows the marginal probabilities for each product. This gives us the overall likelihood of a customer preferring each product, irrespective of their age group.
• The second bar chart shows the marginal probabilities for each age group. This provides the overall distribution of age groups among the respondents, irrespective of their product preference.
Conclusion
¶
Marginal Probability Distribution provides a way to understand the probabilities of individual events by summing or averaging out the other events. Through this exercise, we've visualized the marginal probabilities for product preferences and age groups, reinforcing the concept of how probabilities are distributed for individual events.
Joint Density Function
¶
It describes the probability distribution of two or more continuous random variables, typically denoted as X and Y, simultaneously taking on specific values.
The joint density function is primarily used when dealing with continuous random variables. These are random variables that can take on an uncountably infinite number of values within a certain range.
The joint density function $(f_{XY}(x, y))$ for two continuous random variables X and Y is defined as follows:
• It satisfies two key properties:
1. Non-negativity: $f_{XY}(x, y) \geq 0$ for all pairs of values $(x, y)$. This ensures that probabilities are always non-negative.
2. Total Probability: $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y) \, dx \, dy = 1$
Joint density functions are used in various statistical analyses, including:
• Probability Calculation: You can use $f_{XY}(x, y)$ to calculate probabilities associated with events involving both $X$ and $Y$.
• Expected Values: They are useful for calculating expected values (means) and variances of functions of $X$ and $Y$.
• Correlation and Covariance: Joint density functions are essential for understanding the correlation and covariance between $X$ and $Y$.
Exercise 12 Understanding Joint Density Function
¶
Problem Statement
¶
A company is analyzing the relationship between the ages and monthly expenditures of its customers. The age (in years) and monthly expenditure (in dollars) are continuous random variables.
Given a dataset of ages and expenditures, we want to understand the joint density of these two variables.
Objective:
Understand and visualize the Joint Density Function for the given scenario.
Step 1: Importing Necessary Libraries
¶
In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Generating the Dataset
¶
We'll create a dataset representing the ages and monthly expenditures of 1000 customers.
In [36]:
np.random.seed(0)
# Generating sample data
ages = np.random.normal(35, 10, 1000).astype(int)
expenditures = np.random.normal(500, 100, 1000) + (ages - 35) * 5
df = pd.DataFrame({'Age': ages, 'Expenditure': expenditures})
df.head()
Out[36]:
   Age  Expenditure
0   52   640.596268
1   39   609.247389
2   44   502.768518
3   57   620.471403
4   53   612.805333
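As a complement to the KDE used in the next step, a 2-D histogram gives a coarse, assumption-light empirical estimate of the joint density. This sketch regenerates the same simulated data and checks that the estimated density integrates to 1, as the Total Probability property requires.

```python
import numpy as np

# Regenerate the same simulated data as above.
np.random.seed(0)
ages = np.random.normal(35, 10, 1000).astype(int)
expenditures = np.random.normal(500, 100, 1000) + (ages - 35) * 5

# density=True rescales bin counts so the histogram integrates to 1 over the
# (age, expenditure) plane, i.e. it approximates f_XY(x, y) on a grid.
hist, age_edges, exp_edges = np.histogram2d(ages, expenditures,
                                            bins=[10, 10], density=True)

# Riemann-sum check: density times cell area should total 1.
cell_area = np.outer(np.diff(age_edges), np.diff(exp_edges))
print((hist * cell_area).sum())  # 1.0 up to floating-point error
```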
Step 3: Estimating Joint Density
¶
We'll use kernel density estimation (KDE) to estimate the joint density of age and expenditure.
In [37]:
# Estimating joint density using KDE
sns.jointplot(x='Age', y='Expenditure', data=df, kind='kde', cmap='Blues', fill=True)
plt.suptitle('Joint Density Estimation of Age and Expenditure', y=1.02)
plt.show()
Interpretation
¶
From the visualization:
• The plot shows the joint density of age and expenditure.
• The darker regions indicate higher density, meaning many customers fall into those age and expenditure brackets.
• We can observe a trend where as age increases, the expenditure also tends to
increase. This might be due to older customers having more disposable income or different purchasing habits.
Conclusion
¶
The Joint Density Function provides a way to understand the relationship between two continuous random variables. Through this exercise, we've visualized the joint density of age and expenditure, gaining insights into how these two variables are related in the dataset.
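For a known parametric joint density, marginalization can be verified numerically. This sketch (assuming SciPy is available) uses a bivariate normal whose parameters are derived from the simulation above: Var(age) = 10², Cov(age, spend) = 5 · Var(age) = 500, Var(spend) = 100² + 5² · Var(age) = 12500. Integrating the joint pdf over expenditure at a fixed age recovers the marginal N(35, 10²) density there.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Parametric analogue of the age/expenditure model above.
joint = multivariate_normal(mean=[35.0, 500.0],
                            cov=[[100.0, 500.0],
                                 [500.0, 12500.0]])

# Riemann-sum approximation of the integral of f_XY(40, y) over y.
ys = np.linspace(0.0, 1000.0, 4001)
dy = ys[1] - ys[0]
pts = np.column_stack([np.full_like(ys, 40.0), ys])
marginal_at_40 = joint.pdf(pts).sum() * dy

# Should closely match the marginal normal density f_X(40).
print(marginal_at_40, norm(35, 10).pdf(40))
```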
Variance of Random Variable
¶
Variance is a statistical measure that quantifies the degree of spread or dispersion in the values of a random variable. It provides insight into how much individual data points deviate from the expected or average value. In the context of a random variable, variance helps us understand the variability or uncertainty associated with its
possible outcomes.
Let $X$ be a random variable with mean $\mu$. The variance of $X$ -- denoted by $\sigma^2$ or $\sigma_X^2$ or $\mathbb{V}(X)$ or $\mathbb{V}X$ -- is defined by
$$ \sigma^2 = \mathbb{E}(X - \mu)^2 = \int (x - \mu)^2\; dF(x) $$
assuming this expectation exists. The standard deviation is $\text{sd}(X) = \sqrt{\mathbb{V}(X)}$ and is also denoted by $\sigma$ and $\sigma_X$.
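The definition can be applied directly to a small discrete distribution. For a fair six-sided die, each face is equally likely, so $\mathbb{E}(X - \mu)^2$ reduces to an ordinary average of squared deviations:

```python
import numpy as np

# Var(X) = E[(X - mu)^2] for a fair six-sided die.
faces = np.arange(1, 7)
mu = faces.mean()                      # 3.5
variance = ((faces - mu) ** 2).mean()  # 35/12, about 2.9167
print(mu, variance)
```

This theoretical value is what the simulation in the next exercise should approach.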
Exercise 13 Understanding Variance through Dice Rolling
¶
Problem Statement
¶
Imagine you have a standard six-sided die. We want to understand the variance in the outcomes when rolling this die.
Objective:
Simulate the rolling of a six-sided die 10,000 times, visualize the outcomes, and calculate the variance.
Step 1: Importing Necessary Libraries
¶
In [38]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Simulating Dice Rolls
¶
We'll simulate rolling the die 10,000 times and store the outcomes.
In [39]:
np.random.seed(0)
# Number of simulations
n_simulations = 10000
# Simulating dice rolls
rolls = np.random.choice([1, 2, 3, 4, 5, 6], n_simulations)
Step 3: Visualization
¶
We'll visualize the outcomes to understand the distribution of dice rolls.
In [40]:
# Plotting the outcomes
plt.figure(figsize=(10, 6))
plt.hist(rolls, bins=np.arange(1, 8) - 0.5, edgecolor='black', alpha=0.7, align='mid')
plt.xticks([1, 2, 3, 4, 5, 6])
plt.xlabel('Dice Face')
plt.ylabel('Frequency')
plt.title('Distribution of 10,000 Dice Rolls')
plt.grid(axis='y')
plt.show()
Step 4: Calculating Variance
¶
Variance measures how far a set of numbers are spread out from their average value. We'll calculate the variance of our simulated dice rolls.
In [41]:
# Calculating variance
variance = np.var(rolls)
variance
Out[41]:
2.92216279
Interpretation
¶
From the visualization:
• The histogram shows the distribution of outcomes from 10,000 dice rolls.
• Since it's a fair die, each face has an approximately equal chance of landing, as reflected in the similar heights of the bars.
From the variance calculation:
• The variance gives us a measure of how spread out the outcomes are from the mean. For a fair six-sided die, the theoretical variance is $(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2)/6 - 3.5^2 = 35/12 \approx 2.9167$, so the simulated value of about 2.92 is exactly what we should expect.
Conclusion
¶
Through this exercise, we've simulated the rolling of a fair six-sided die, visualized the outcomes, and calculated the variance. This helps reinforce the concept of variance and how it measures the spread of data.
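One detail worth knowing: `np.var` defaults to `ddof=0`, the population variance (dividing by $n$). For an unbiased sample estimate, pass `ddof=1` (dividing by $n-1$); with 10,000 rolls the two differ only slightly. A sketch reproducing the simulation above:

```python
import numpy as np

np.random.seed(0)
rolls = np.random.choice([1, 2, 3, 4, 5, 6], 10000)

pop_var = np.var(rolls)             # divides by n (default ddof=0)
sample_var = np.var(rolls, ddof=1)  # divides by n - 1 (unbiased estimator)
print(pop_var, sample_var)          # nearly identical at this sample size
```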
Co-variance
¶
Covariance is a statistical measure that quantifies the degree to which two random variables change together. In simpler terms, it tells us whether two variables tend to increase or decrease at the same time.
If $X$ and $Y$ are random variables, then the covariance and correlation between $X$ and $Y$ measure how strong the linear relationship between $X$ and $Y$ is. Let $X$ and $Y$ be random variables with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. Define the covariance between $X$ and $Y$ by
$$ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] $$
and the correlation by
$$ \rho = \rho_{X, Y} = \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$
Exercise 14 Understanding Co-variance
¶
Problem Statement
¶
Imagine a scenario where we are studying the relationship between the hours studied by students and their scores in a particular exam.
Objective:
Generate a dataset representing hours studied and exam scores, visualize the relationship, and calculate the co-variance.
Step 1: Importing Necessary Libraries
¶
In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Generating the Dataset
¶
We'll create a dataset representing the hours studied and the corresponding exam scores of 100 students.
In [43]:
np.random.seed(0)
# Generating sample data
hours_studied = np.random.normal(5, 2, 100) # Hours studied: normally distributed, mean 5 hours, SD 2
exam_scores = 50 + 10 * hours_studied + np.random.normal(0, 5, 100) # Base score is 50, with 10 points for each hour studied
df = pd.DataFrame({'Hours_Studied': hours_studied, 'Exam_Scores': exam_scores})
df.head()
Out[43]:
   Hours_Studied  Exam_Scores
0       8.528105   144.696800
1       5.800314   101.264349
2       6.957476   113.222335
3       9.481786   149.664848
4       8.735116   131.485543
Step 3: Visualization
¶
We'll visualize the relationship between hours studied and exam scores to get an initial
understanding.
In [44]:
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Hours_Studied'], df['Exam_Scores'], alpha=0.6)
plt.title('Relationship between Hours Studied and Exam Scores')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Scores')
plt.grid(True)
plt.show()
Step 4: Calculating Co-variance
¶
Co-variance measures the joint variability of two random variables. We'll calculate the co-variance between hours studied and exam scores.
In [45]:
# Calculating co-variance
covariance_matrix = np.cov(df['Hours_Studied'], df['Exam_Scores'])
covariance = covariance_matrix[0, 1]
covariance
Out[45]:
42.22040604887268
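The off-diagonal entry of `np.cov` can be reproduced by hand from the definition. Note that `np.cov` defaults to `ddof=1`, so the manual formula must divide by $n-1$ to match. A sketch regenerating the same data:

```python
import numpy as np

np.random.seed(0)
hours_studied = np.random.normal(5, 2, 100)
exam_scores = 50 + 10 * hours_studied + np.random.normal(0, 5, 100)

# Sample covariance: mean product of deviations, with the n-1 denominator
# that np.cov uses by default.
x, y = hours_studied, exam_scores
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(manual_cov)  # matches np.cov(x, y)[0, 1]
```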
Interpretation
¶
From the visualization:
• The scatter plot shows a positive relationship between hours studied and exam scores. As the hours studied increase, the exam scores tend to increase as well.
From the co-variance calculation:
• A positive co-variance value indicates that the two variables move in the same direction. In our case, as hours studied increases, exam scores also tend to increase.
• The magnitude of the co-variance gives us an idea of the strength of this relationship, though it's not bounded like correlation.
Conclusion
¶
Through this exercise, we've generated a dataset, visualized the relationship between hours studied and exam scores, and calculated the co-variance. This helps reinforce the concept of co-variance and its role in understanding the relationship between two variables.
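The link to the next section is direct: correlation is just covariance rescaled by both standard deviations, which bounds it to $[-1, 1]$. A sketch reusing the same simulated data:

```python
import numpy as np

np.random.seed(0)
hours = np.random.normal(5, 2, 100)
scores = 50 + 10 * hours + np.random.normal(0, 5, 100)

# rho = Cov(X, Y) / (sigma_X * sigma_Y); ddof=1 throughout for consistency.
cov = np.cov(hours, scores)[0, 1]
rho = cov / (np.std(hours, ddof=1) * np.std(scores, ddof=1))
print(rho)  # equals np.corrcoef(hours, scores)[0, 1]
```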
Correlation
¶
Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. In simpler terms, it tells us how closely two variables are related and whether they tend to move together in a predictable way.
The formula for the (Pearson) correlation coefficient is
\begin{equation} r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \end{equation}
Exercise 15 Understanding Correlation
¶
Problem Statement
¶
Imagine a scenario where we are analyzing the relationship between the daily exercise
duration and the corresponding energy levels of individuals.
Objective:
Generate a dataset representing daily exercise duration and energy levels,
visualize the relationship, and calculate the correlation coefficient.
Step 1: Importing Necessary Libraries
¶
In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Generating the Dataset
¶
We'll create a dataset representing the daily exercise duration (in hours) and the corresponding energy levels (on a scale of 1 to 10) of 200 individuals.
In [47]:
np.random.seed(0)
# Generating sample data
exercise_duration = np.random.normal(1.5, 0.5, 200) # Exercise duration: normally distributed, mean 1.5 hours, SD 0.5
energy_levels = 5 + 2 * exercise_duration + np.random.normal(0, 1, 200) # Base energy level is 5, with 2 points for each hour of exercise
df = pd.DataFrame({'Exercise_Duration': exercise_duration, 'Energy_Levels': energy_levels})
df.head()
Out[47]:
   Exercise_Duration  Energy_Levels
0           2.382026       9.394871
1           1.700079       8.160778
2           1.989369      10.078398
3           2.620447      10.896157
4           2.433779      10.507690
Step 3: Visualization
¶
We'll visualize the relationship between exercise duration and energy levels to get an initial understanding.
In [48]:
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Exercise_Duration'], df['Energy_Levels'], alpha=0.6, color='purple')
plt.title('Relationship between Exercise Duration and Energy Levels')
plt.xlabel('Exercise Duration (hours)')
plt.ylabel('Energy Levels (1-10 scale)')
plt.grid(True)
plt.show()
Step 4: Calculating Correlation
¶
Correlation measures the strength and direction of a linear relationship between two variables. We'll calculate the correlation coefficient between exercise duration and energy levels.
In [49]:
# Calculating correlation
correlation = df['Exercise_Duration'].corr(df['Energy_Levels'])
correlation
Out[49]:
0.7579114556127929
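The `Series.corr` result above can be reproduced directly from the summation formula for Pearson's $r$. A sketch regenerating the same data:

```python
import numpy as np

np.random.seed(0)
duration = np.random.normal(1.5, 0.5, 200)
energy = 5 + 2 * duration + np.random.normal(0, 1, 200)

# Pearson's r computed term-by-term from the formula:
# sum of deviation products over the product of root sums of squares.
dx = duration - duration.mean()
dy = energy - energy.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r)  # same value as df['Exercise_Duration'].corr(df['Energy_Levels'])
```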
Interpretation
¶
From the visualization:
• The scatter plot shows a clear positive relationship between exercise duration and energy levels. As the duration of exercise increases, the energy levels also seem to rise.
From the correlation calculation:
• A correlation value close to 1 indicates a strong positive linear relationship. In our case, the positive value suggests that as exercise duration increases, energy levels also tend to increase.
• The magnitude of the correlation coefficient gives us an idea of the strength of this linear relationship.
Conclusion
¶
Through this exercise, we've generated a dataset, visualized the relationship between
exercise duration and energy levels, and calculated the correlation coefficient. This helps reinforce the concept of correlation and its role in understanding the linear relationship between two variables.
Causation
¶
Causation, also known as cause and effect, refers to the relationship between two events where one event (the cause) brings about another event (the effect). In other words, causation implies that a change in one variable is responsible for a change in another. This is a stronger statement than correlation, which merely indicates that two
variables change together.
Key Points:
¶
1. Directionality: Causation indicates a direction. If A causes B, then changes in A will lead to changes in B, but not necessarily the other way around.
2. Isolation: All other factors are held constant when considering causation. This means that it's only the changes in A causing changes in B, and not some other lurking variable.
3. Consistency: The cause always leads to the effect. If A causes B, then every time A happens, B will also happen (assuming all other conditions are the same).
Equations:
¶
While causation in its essence is a conceptual relationship, in many statistical methods, we try to quantify this relationship. For instance, in a simple linear regression:
$y = \beta_0 + \beta_1 x + \epsilon$
Here, $x$ is the independent variable (potential cause), $y$ is the dependent variable (effect), and $\beta_1$ measures the change in $y$ for a unit change in $x$. If $\beta_1$ is statistically significant, it suggests that changes in $x$ are associated with changes in $y$. However, this doesn't necessarily mean $x$ causes $y$. Establishing causation requires more rigorous experimental design and evidence.
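For a simple linear regression, the least-squares slope has a closed form, $\hat{\beta}_1 = \text{Cov}(x, y)/\text{Var}(x)$. A toy sketch (the data here are hypothetical, with a known true slope of 2) shows the estimate recovering the generating parameters:

```python
import numpy as np

# Simulated data where y really is generated from x: y = 3 + 2x + noise.
np.random.seed(1)
x = np.random.normal(0, 1, 500)
y = 3 + 2 * x + np.random.normal(0, 0.5, 500)

# Closed-form least-squares estimates for slope and intercept.
beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)  # close to the true values (3, 2)
```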
Remember:
¶
Correlation does not imply causation. Just because two variables are correlated does not mean that changes in one variable cause changes in another. There could be lurking variables or other reasons for the observed correlation.
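A lurking variable is easy to demonstrate in simulation. In this hypothetical sketch, a hidden common cause `z` drives both `x` and `y`; the two end up strongly correlated even though neither causes the other.

```python
import numpy as np

np.random.seed(42)
z = np.random.normal(0, 1, 1000)        # hidden common cause (lurking variable)
x = z + np.random.normal(0, 0.5, 1000)  # x depends only on z, not on y
y = z + np.random.normal(0, 0.5, 1000)  # y depends only on z, not on x

r = np.corrcoef(x, y)[0, 1]
print(r)  # strongly positive despite no causal link between x and y
```

Intervening on `x` here would change nothing about `y`, which is exactly the distinction between correlation and causation.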
Exercise 16 Understanding Causation
¶
Problem Statement
¶
Imagine a scenario where a health organization is analyzing the relationship between the consumption of a new health supplement and improvement in immune system strength.
Objective:
Generate a dataset representing the daily dosage of the supplement and the corresponding immune strength levels. Analyze if the supplement causes an improvement in the immune system.
Step 1: Importing Necessary Libraries
¶
In [50]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
Step 2: Generating the Dataset
¶
We'll create a dataset representing the daily dosage of the supplement (in mg) and the
corresponding immune strength levels (on a scale of 1 to 10) of 200 individuals.
In [51]:
np.random.seed(0)
# Generating sample data
daily_dosage = np.random.normal(50, 10, 200) # Daily dosage: normally distributed, mean 50 mg, SD 10
immune_strength = 5 + 0.05 * daily_dosage + np.random.normal(0, 0.5, 200) # Base strength is 5, with a slight increase for each mg of supplement
df = pd.DataFrame({'Daily_Dosage': daily_dosage, 'Immune_Strength': immune_strength})
df.head()
Out[51]:
   Daily_Dosage  Immune_Strength
0     67.640523         8.197435
1     54.001572         7.580389
2     59.787380         8.539199
3     72.408932         8.948078
4     68.675580         8.753845
Step 3: Visualization
¶
We'll visualize the relationship between daily dosage and immune strength to get an initial understanding.
In [52]:
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Daily_Dosage'], df['Immune_Strength'], alpha=0.6, color='green')
plt.title('Relationship between Daily Dosage and Immune Strength')
plt.xlabel('Daily Dosage (mg)')
plt.ylabel('Immune Strength (1-10 scale)')
plt.grid(True)
plt.show()
Step 4: Regression Analysis
¶
To understand if there's a causal relationship, we'll perform a simple linear regression. If the coefficient for daily dosage is statistically significant, it suggests a potential causal relationship.
In [53]:
# Adding a constant for the intercept term
X = sm.add_constant(df['Daily_Dosage'])
Y = df['Immune_Strength']
model = sm.OLS(Y, X).fit()
model.summary()
Out[53]:
OLS Regression Results
Dep. Variable:     Immune_Strength     R-squared:            0.574
Model:             OLS                 Adj. R-squared:       0.572
Method:            Least Squares       F-statistic:          267.3
Date:              Mon, 09 Oct 2023    Prob (F-statistic):   1.38e-38
Time:              16:23:33            Log-Likelihood:       -132.90
No. Observations:  200                 AIC:                  269.8
Df Residuals:      198                 BIC:                  276.4
Df Model:          1
Covariance Type:   nonrobust

                 coef   std err        t    P>|t|   [0.025   0.975]
const          4.7590     0.169   28.117    0.000    4.425    5.093
Daily_Dosage   0.0535     0.003   16.348    0.000    0.047    0.060

Omnibus:         0.111    Durbin-Watson:     2.249
Prob(Omnibus):   0.946    Jarque-Bera (JB):  0.008
Skew:           -0.004    Prob(JB):          0.996
Kurtosis:        3.031    Cond. No.          262.
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation
¶
From the visualization:
• The scatter plot shows a positive relationship between daily dosage and immune strength.
From the regression analysis:
• The coefficient for daily dosage indicates the change in immune strength for each additional mg of the supplement.
• If the p-value for the coefficient is less than 0.05, it suggests that the relationship is statistically significant.
• The regression analysis indicates a statistically significant positive relationship between Daily_Dosage of the supplement and Immune_Strength. For each additional mg of the supplement, the Immune_Strength increases by approximately 0.0535. The model accounts for about 57.4% of the variability in immune strength (R-squared of 0.574). The residuals of the model appear to be normally distributed, and there's no evidence of autocorrelation, suggesting the model's assumptions are met.
Conclusion
¶
Through this exercise, we've generated a dataset, visualized the relationship between daily dosage and immune strength, and performed regression analysis. This helps
reinforce the concept of causation and the importance of experimental design in establishing causal relationships.
Revised Date: October 9, 2023
In [ ]: