IE6400_Day12
IE6400 Foundations for Data Analytics Engineering
¶
Fall 2023
¶
Module 2: Nonparametric Methods
¶
Nonparametric methods refer to a broad category of statistical techniques that do not make strong assumptions about the form or parameters of the underlying population distribution from which the sample data is drawn. These methods are often used when the assumptions of parametric methods (like normal distribution) are not met. Here are some key points about nonparametric methods:
1. Distribution-Free: Nonparametric methods do not assume a specific distribution for the data, such as the normal distribution. This makes them more flexible in handling data from unknown or non-normal distributions.
2. Rank-Based: Many nonparametric tests are based on the ranks of the data rather than their actual values. Examples include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
3. Applications: Nonparametric methods are particularly useful for analyzing ordinal or nominal data, as well as interval or ratio data that doesn't meet the assumptions of parametric tests.
4. Advantages:
• Flexibility: Can be applied to data that doesn't meet the assumptions of parametric tests.
• Robustness: Less sensitive to outliers and skewed data.
• Simplicity: Often easier to understand and interpret.
5. Disadvantages:
• Less Powerful: When the assumptions of parametric tests are met, nonparametric tests are generally less powerful (i.e., they might have a lower chance of detecting a true effect).
• Limited Parameters: Nonparametric methods do not provide estimates of population parameters like the mean or standard deviation.
6. Common Nonparametric Tests:
• Mann-Whitney U Test (or Wilcoxon Rank-Sum Test): Compares the distributions of two independent samples.
• Wilcoxon Signed-Rank Test: Compares the distributions of two paired samples.
• Kruskal-Wallis Test: An extension of the Mann-Whitney U test for comparing more than two samples.
• Spearman's Rank Correlation: Measures the strength and direction of the association between two ranked variables.
• Chi-Squared Test: Tests the association between two categorical variables.
7. Kernel Density Estimation: A nonparametric way to estimate the probability density function of a continuous random variable (a short sketch appears after this overview).
8. Nonparametric Regression: Techniques like LOESS (locally weighted scatterplot smoothing) that do not assume a specific form for the relationship between predictors and the response variable.
In summary, nonparametric methods offer a versatile toolkit for statistical analysis when the assumptions of traditional parametric methods are not met. They are especially useful for analyzing data that is skewed, has outliers, or comes from an unknown distribution.
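As a brief illustration of point 7 above, here is a minimal kernel density estimation sketch. It is an addition to the notebook, it uses a made-up bimodal sample, and it relies only on scipy.stats.gaussian_kde with its default bandwidth rule:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
# Hypothetical bimodal sample; no parametric family is assumed for it
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
kde = gaussian_kde(sample)              # bandwidth chosen automatically (Scott's rule)
grid = np.linspace(sample.min() - 1, sample.max() + 1, 200)
plt.hist(sample, bins=30, density=True, alpha=0.4, label='histogram')
plt.plot(grid, kde(grid), label='KDE')  # estimated probability density function
plt.legend()
plt.show()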
Exercise 1 Ranking data
¶
Ranking data refers to data representing the order or position of items relative to one another without necessarily indicating the magnitude of differences between them. In other words, ranking data tells you the order of items but not the actual values or scores that led to that order.
In [1]:
import pandas as pd
In [2]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 72, 92, 72]}
In [3]:
df = pd.DataFrame(data)
df
Out[3]:
      Name  Score
0    Alice     85
1      Bob     72
2  Charlie     92
3    David     72
In [4]:
df['Rank'] = df['Score'].rank(method='average', ascending=False)
print(df)
Name Score Rank
0 Alice 85 2.0
1 Bob 72 3.5
2 Charlie 92 1.0
3 David 72 3.5
This method assigns ranks to data points based on their values. Ties can be handled in
various ways, such as averaging the ranks.
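As a side note (not part of the original notebook), pandas supports several tie-handling rules; this minimal sketch, assuming the df defined above, applies each of them to the same Score column so the differences are easy to compare:
# 'average', 'min', 'max', 'dense', and 'first' are the tie options accepted by rank()
for method in ['average', 'min', 'max', 'dense', 'first']:
    df[f'Rank_{method}'] = df['Score'].rank(method=method, ascending=False)
print(df)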
Exercise 2 Ranking Using the scipy.stats.rankdata() Function:
¶
In [5]:
import numpy as np
from scipy.stats import rankdata
In [6]:
scores = np.array([85, 72, 92, 72])
ranks = rankdata(-scores, method='average')
In [7]:
print('Data:', scores)
print('Ranks:', ranks)
Data: [85 72 92 72]
Ranks: [2. 3.5 1. 3.5]
The rankdata() function from SciPy can be used to rank data, and it provides various methods for handling ties.
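For comparison, this short added sketch (assuming the scores array and rankdata import from above) prints the ranks under each of the tie-handling options rankdata() accepts:
# rankdata tie-handling options
for method in ['average', 'min', 'max', 'dense', 'ordinal']:
    print(method, rankdata(-scores, method=method))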
Exercise 3 Ranking Using the pandas.Series.rank() Method:
¶
In [8]:
import pandas as pd
In [9]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 72, 92, 72]}
In [10]:
df = pd.DataFrame(data)
df['Rank'] = df['Score'].rank(ascending=False, method='average')
print(df)
Name Score Rank
0 Alice 85 2.0
1 Bob 72 3.5
2 Charlie 92 1.0
3 David 72 3.5
Exercise 4 Ranking Using the argsort() Function:
¶
Perform an indirect sort along the given axis: argsort() returns the indices that would sort the array.
In [11]:
import numpy as np
In [12]:
scores = np.array([85, 72, 92, 72])
ranks = np.argsort(-scores) + 1
print('Data:', scores)
print('Ranks:', ranks)
Data: [85 72 92 72]
Ranks: [3 1 2 4]
The argsort() function returns the indices that would sort an array. Note that this gives the sort order of the elements (which index comes first, second, and so on) rather than the rank of each element; applying argsort() twice recovers the ranks, as in the sketch below.
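A minimal added sketch, reusing the scores array from above (here ties are broken by position, unlike the averaged ranks in the earlier exercises):
order = np.argsort(-scores)    # indices of the scores from highest to lowest: [2 0 1 3]
ranks = np.argsort(order) + 1  # rank of each original element: [2 3 1 4]
print('Order:', order)
print('Ranks:', ranks)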
Exercise 5
¶
The SciPy library provides the rankdata() function to rank numerical data, and it supports a number of variations on ranking. The example below demonstrates how to rank a numerical dataset.
In [13]:
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata
In [14]:
# seed random number generator
seed(1)
In [15]:
# generate dataset
data = rand(1000)
In [16]:
# review first 10 samples
print(data[:10])
[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01
3.96767474e-01 5.38816734e-01]
Display the first 10 elements of the generated data.
In [17]:
# rank data
ranked = rankdata(data)
# review first 10 ranked samples
print(ranked[:10])
[408. 721. 1. 300. 151. 93. 186. 342. 385. 535.]
Exercise 6 Kendall Tau Coefficient
¶
The Kendall Tau Coefficient, often denoted as τ (tau), is a non-parametric statistic used to measure the strength and direction of the association between two ordinal variables. It's a rank correlation coefficient that assesses the degree of concordance between two sets of rankings.
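For reference (this formula is an addition to the notebook), in the simplest case with no ties, tau compares the number of concordant pairs C with the number of discordant pairs D:
$$\tau = \frac{C - D}{\binom{n}{2}} = \frac{C - D}{n(n-1)/2}$$
scipy.stats.kendalltau computes the tau-b variant by default, which additionally adjusts for ties. For the sample rankings used below, 6 of the 10 pairs are concordant and 4 are discordant, giving τ = (6 − 4)/10 = 0.2, matching the output in Step 3.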
In this exercise, we will compute the Kendall Tau Coefficient using Python to measure the rank correlation between two sets of data. We'll use the scipy.stats library, which provides a function to calculate this coefficient.
Step 1: Import Necessary Libraries
¶
In [18]:
import numpy as np
from scipy.stats import kendalltau
import matplotlib.pyplot as plt
Step 2: Create Sample Data
¶
Let's create two sets of ranked data to compute the Kendall Tau Coefficient.
In [19]:
# Sample data
rankings_A = np.array([1, 2, 3, 4, 5])
rankings_B = np.array([3, 2, 4, 1, 5])
Step 3: Compute the Kendall Tau Coefficient
¶
In [20]:
tau, p_value = kendalltau(rankings_A, rankings_B)
print(f'Kendall Tau Coefficient: {tau:.2f}')
print(f'p-value: {p_value:.2f}')
Kendall Tau Coefficient: 0.20
p-value: 0.82
In [21]:
# Interpret the correlation
alpha = 0.05
if p_value < alpha:
    print("There is a statistically significant correlation.")
else:
    print("There is no statistically significant correlation.")
There is no statistically significant correlation.
The Kendall Tau Coefficient value will lie between -1 and 1. A positive value indicates that the two sets of rankings are similar, while a negative value indicates that the rankings are dissimilar. A value close to 0 suggests little to no association.
In the conclusion, you can determine the strength and direction of the association between the two sets of rankings. A significant p-value (typically < 0.05) suggests that
the observed association is statistically significant.
Exercise 7 Case Study - Analyzing Exam Scores and Study Hours
¶
Background:
¶
A researcher wants to investigate if there is a significant correlation between the number of hours students spend studying for an exam and their scores on the exam.
Data Collection:
¶
The researcher collects data from a group of 20 students. For each student, they record the number of hours spent studying and the exam score achieved. Here's the dataset:
In [22]:
import pandas as pd
In [23]:
df = pd.read_csv('example.csv')
In [24]:
df
Out[24]:
    Hours_Study  Exam_Score
0            10          85
1             5          60
2             8          75
3             3          50
4            12          92
5            15          98
6             9          80
7             7          70
8             2          45
9            11          88
10            6          62
11            4          55
12           14          96
13           16          99
14           13          90
15           18         100
16            1          40
17           17          98
18           20         105
19           19         110
In [25]:
from scipy import stats
In [26]:
# Calculate Kendall's Tau and p-value
tau, p_value = stats.kendalltau(df['Hours_Study'], df['Exam_Score'])
# Set significance level
alpha = 0.05
In [27]:
# Compare p-value to alpha
if p_value < alpha:
    result = "Reject the null hypothesis. There is a significant correlation."
else:
    result = "Fail to reject the null hypothesis. There is no significant correlation."
In [28]:
print(f"Kendall's Tau (τ) = {tau:.2f}")
print(f"P-value = {p_value:.4f}")
print(result)
Kendall's Tau (τ) = 0.97
P-value = 0.0000
Reject the null hypothesis. There is a significant correlation.
Since the p-value is less than the chosen significance level (α = 0.05), we reject the null hypothesis. This suggests that there is a significant positive correlation between the number of hours spent studying and the exam scores.
Exercise 8 Wilcoxon test
¶
The Wilcoxon test, also known as the Wilcoxon signed-rank test, is a non-parametric statistical test used to compare two paired groups. It is used to test the null hypothesis
that two related paired samples come from the same distribution. Specifically, it tests whether the differences between the pairs follow a symmetric distribution around zero.
In this exercise, we will perform the Wilcoxon signed-rank test using Python to compare two paired groups. We'll use the scipy.stats library, which provides a function
to conduct this test.
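For reference (added here, not part of the original notebook), the test ranks the absolute pairwise differences d_i and sums those ranks separately for positive and negative differences; for a two-sided test, SciPy reports the smaller of the two sums:
$$W = \min(W^+, W^-), \qquad W^+ = \sum_{d_i > 0} \operatorname{rank}|d_i|, \qquad W^- = \sum_{d_i < 0} \operatorname{rank}|d_i|$$
For the paired data below, the sum of ranks for the negative differences is 23, which matches the test statistic printed in Step 3.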
Step 1: Import Necessary Libraries
¶
In [29]:
import numpy as np
from scipy.stats import wilcoxon
import matplotlib.pyplot as plt
Step 2: Create Sample Paired Data
¶
Let's create two sets of paired data to perform the Wilcoxon test.
In [30]:
# Sample paired data
before_treatment = np.array([20, 22, 24, 19, 18, 21, 25, 23, 22, 20])
after_treatment = np.array([19, 24, 22, 20, 19, 20, 26, 22, 21, 19])
Step 3: Perform the Wilcoxon Test
¶
In [31]:
w, p_value = wilcoxon(before_treatment, after_treatment)
print(f'Wilcoxon Test Statistic: {w}')
print(f'p-value: {p_value:.4f}')
Wilcoxon Test Statistic: 23.0
p-value: 0.6953
In [32]:
# Interpret the results
alpha = 0.05 # significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two groups.")
Fail to reject the null hypothesis: There is no significant difference between the two groups.
The p-value will help us determine whether the differences between the paired samples are statistically significant. A small p-value (typically < 0.05) suggests that the two sets of paired data are significantly different.
Note that you can conclude whether there's a statistically significant difference between the two paired groups. If the p-value is less than 0.05, it suggests that the treatment had a significant effect on the sample. Otherwise, there's no evidence to suggest a significant effect.
Exercise 9 Kruskal-Wallis H Test
¶
The Kruskal-Wallis H Test, often simply referred to as the Kruskal-Wallis test, is a non-
parametric statistical test used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or
ordinal dependent variable. It is the non-parametric equivalent to the one-way ANOVA.
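For reference (an addition to the notebook), with N total observations split into k groups, n_j observations in group j, and R_j the sum of the ranks (over all N values) falling in group j, the statistic in the no-ties case is
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)$$
scipy.stats.kruskal additionally applies a tie correction. For the three groups defined below, the rank sums are 58, 41, and 21, giving H = 6.86 as printed in the output.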
In [33]:
import numpy as np
from scipy.stats import kruskal
Define three or more independent groups of data for the Kruskal-Wallis H Test.
In [34]:
group1 = [22, 25, 28, 30, 32]
group2 = [18, 21, 23, 26, 29]
group3 = [15, 16, 19, 20, 24]
Perform the Kruskal-Wallis H Test
In [35]:
statistic, p_value = kruskal(group1, group2, group3)
In [36]:
print("Kruskal-Wallis H Test:")
print(f"Kruskal-Wallis H Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Kruskal-Wallis H Test:
Kruskal-Wallis H Statistic: 6.86
P-value: 0.0324
In [37]:
if p_value < 0.05:
    print("There is a significant difference among the groups.")
else:
    print("There is no significant difference among the groups.")
There is a significant difference among the groups.
Exercise 10 Friedman Test
¶
The Friedman Test is a non-parametric statistical test used to detect differences in
treatments across multiple test attempts or blocks. It is the non-parametric alternative
to the repeated measures ANOVA and is used to test for differences between groups when the dependent variable is ordinal or when the assumptions of parametric ANOVA
are not met for interval data.
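For reference (added, not from the original notebook), with n blocks (here, the 5 paired positions), k treatments, and R_j the sum of within-block ranks for treatment j, the statistic in the no-ties case is
$$\chi_F^2 = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_j^2 - 3 n (k+1)$$
For the three groups below, the rank sums are 14, 7, and 9, which gives 5.2, matching the printed statistic.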
In [38]:
import numpy as np
from scipy.stats import friedmanchisquare
Collect related (paired) data for three or more groups for the Friedman Test
In [39]:
group1 = [10, 12, 15, 8, 11]
group2 = [8, 10, 13, 6, 12]
group3 = [9, 11, 14, 7, 10]
In [40]:
statistic, p_value = friedmanchisquare(group1, group2, group3)
In [41]:
print("Friedman Test:")
print(f"Friedman Test Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Friedman Test:
Friedman Test Statistic: 5.20
P-value: 0.0743
We compare the p-value with 0.05: if it is greater than 0.05, we conclude there is no significant difference among the groups.
In [42]:
if p_value < 0.05:
    print("There is a significant difference among the groups.")
else:
    print("There is no significant difference among the groups.")
There is no significant difference among the groups.
Exercise 11 Mann-Whitney U Test
¶
The Mann-Whitney U Test, also known as the Wilcoxon rank-sum test, is a non-
parametric statistical test used to determine if there are significant differences between two independent groups on a continuous or ordinal dependent variable. It is the non-parametric alternative to the independent samples t-test.
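For reference (an addition to the notebook), rank all n₁ + n₂ observations together and let R₁ be the sum of the ranks belonging to the first group; then
$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = n_1 n_2 - U_1$$
In the example below every value in group1 is smaller than every value in group2, so the U statistic for group1 is 0, consistent with the printed value of 0.00.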
In [43]:
import numpy as np
from scipy.stats import mannwhitneyu
Define two independent groups of data for the Mann-Whitney U Test
In [44]:
group1 = [20, 24, 22, 26, 21]
group2 = [30, 32, 31, 34, 35]
In [45]:
statistic, p_value = mannwhitneyu(group1, group2)
In [46]:
print("Mann-Whitney U Test:")
print(f"Mann-Whitney U Statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
Mann-Whitney U Test:
Mann-Whitney U Statistic: 0.00
P-value: 0.0079
In [47]:
if p_value < 0.05:
    print("There is a significant difference between the two groups.")
else:
    print("There is no significant difference between the two groups.")
There is a significant difference between the two groups.
Exercise 12 Pearson’s Chi-Squared Test
¶
Pearson’s Chi-Squared Test (often simply referred to as the Chi-Squared Test) is a statistical test used to determine if there is a significant association between two categorical variables in a contingency table. It's one of the most commonly used tests for independence in categorical data.
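For reference (this formula is an addition to the notebook), the statistic compares the observed counts O_ij with the counts E_ij expected under independence:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
Note that for 2×2 tables, scipy.stats.chi2_contingency applies Yates' continuity correction by default, so the reported value can differ slightly from this uncorrected formula.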
In this exercise, we will perform the Pearson’s Chi-Squared Test using Python to determine if there is a significant association between two categorical variables. We'll use the scipy.stats library, which provides a function to conduct this test.
Step 1: Import Necessary Libraries
¶
In [48]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Create Sample Data
¶
Let's create a sample contingency table representing the relationship between two categorical variables: Gender and Preference.
In [49]:
# Sample data
data = {'Gender': ['Male', 'Male', 'Female', 'Female'],
'Preference': ['Like', 'Dislike', 'Like', 'Dislike'],
'Count': [50, 10, 20, 40]}
df = pd.DataFrame(data)
df
Out[49]:
   Gender Preference  Count
0    Male       Like     50
1    Male    Dislike     10
2  Female       Like     20
3  Female    Dislike     40
Step 3: Perform the Chi-Squared Test
¶
In [50]:
# Create a contingency table
contingency_table = df.pivot(index='Gender', columns='Preference', values='Count').fillna(0)
# Perform the test
chi2, p_value, _, expected = chi2_contingency(contingency_table)
print(f'Chi-Squared Value: {chi2:.2f}')
print(f'p-value: {p_value:.4f}')
print('Expected Frequencies:')
print(expected)
Chi-Squared Value: 28.83
p-value: 0.0000
Expected Frequencies:
[[25. 35.]
[25. 35.]]
Step 4: Visualize the Data
¶
In [51]:
# Interpretation of the Chi-Squared Test results
alpha = 0.05 # significance level
print(f'Chi-Squared Value: {chi2:.2f}')
print(f'p-value: {p_value:.4f}')
if p_value <= alpha:
    print("The results are statistically significant. There is an association between the categorical variables.")
else:
    print("The results are not statistically significant. There is no evidence of an association between the categorical variables.")
# Visualization of Expected vs. Observed Frequencies
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
# Observed Frequencies
sns.heatmap(contingency_table, annot=True, cmap='coolwarm', fmt='g', ax=ax[0])
ax[0].set_title('Observed Frequencies')
# Expected Frequencies
sns.heatmap(expected, annot=True, cmap='coolwarm', fmt='.1f', ax=ax[1])
ax[1].set_title('Expected Frequencies')
plt.tight_layout()
plt.show()
Chi-Squared Value: 28.83
p-value: 0.0000
The results are statistically significant. There is an association between the categorical variables.
Decision:
¶
• If the p-value is less than or equal to a significance level (often 0.05), you reject the null hypothesis and conclude that there is a significant association between the two categorical variables.
• If the p-value is greater than 0.05, you fail to reject the null hypothesis and conclude that there is no significant evidence to suggest an association between the two categorical variables.
Visualization (Observed vs. Expected Frequencies):
¶
• By visually comparing the observed frequencies (from your data) with the expected frequencies (calculated by the test), you can get a sense of where the differences lie.
• Areas (cells) in the heatmap with the most contrast between observed and expected frequencies are where the most significant differences occur.
Exercise 13 Spearman’s Rank Correlation
¶
Spearman's rank correlation coefficient is a non-parametric statistical test used to measure the strength and direction of the association between two ranked (ordinal) variables. It's an alternative to Pearson's correlation coefficient when the assumptions of linearity or normality are not met.
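For reference (added here), when there are no tied ranks the coefficient can be computed from the differences d_i between the two ranks assigned to each observation:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Equivalently, it is Pearson's correlation applied to the ranks, which is how scipy.stats.spearmanr handles ties.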
Step 1: Import Libraries and Seed Generator
¶
In [52]:
# Import necessary libraries
from numpy.random import rand
from numpy.random import seed
from scipy.stats import spearmanr
In [53]:
# Seed the random number generator
seed(1)
• Generate two sets of random data, data1 and data2, each containing 1000 data points.
• rand(1000) generates 1000 random numbers between 0 and 1.
• rand(1000) * 20 scales the random numbers in data1 to have values between 0 and 20.
• data2 is created by adding random values between 0 and 10 to data1. This introduces a degree of correlation between data1 and data2.
In [54]:
# Prepare data
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)
In [55]:
# Calculate Spearman's correlation coefficient and p-value
coef, p = spearmanr(data1, data2)
In [56]:
# Print Spearman's correlation coefficient
print('Spearmans correlation coefficient: %.3f' % coef)
Spearmans correlation coefficient: 0.900
In [57]:
# Interpret the significance
alpha = 0.05
if p > alpha:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p)
Samples are correlated (reject H0) p=0.000
Running the example calculates Spearman's correlation coefficient between the two variables in the test dataset. The test reports a strong positive correlation with a value of 0.9. The p-value is close to zero, which means that observing such a correlation would be very unlikely if the samples were truly uncorrelated, so we can reject the null hypothesis that the samples are uncorrelated at the 95% confidence level.
Exercise 14 Bootstrap Resampling
¶
Bootstrap resampling, commonly referred to as "bootstrapping," is a powerful statistical technique used for estimating the distribution of a statistic (like the mean or variance) by repeatedly sampling with replacement from an observed dataset. It's particularly useful when the sample size is small or when the underlying distribution of
the data is unknown.
In this exercise, we will perform bootstrap resampling to estimate the mean and its 95% confidence interval of a dataset using Python. We'll use the numpy library for data manipulation and matplotlib for visualization.
Step 1: Import Necessary Libraries
¶
In [58]:
import numpy as np
import matplotlib.pyplot as plt
Step 2: Create Sample Data
¶
Let's start with a small dataset of 15 observations.
In [59]:
data = np.array([23, 45, 56, 78, 89, 12, 67, 49, 55, 77, 88, 90, 34, 56, 71])
Step 3: Bootstrap Resampling
¶
We'll draw bootstrap samples from the original dataset and calculate the mean for
each sample.
In [60]:
n_iterations = 10000
bootstrap_means = []
for _ in range(n_iterations):
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_means.append(np.mean(bootstrap_sample))
Step 4: Calculate 95% Confidence Interval
¶
In [61]:
confidence_level = 0.95
lower_percentile = (1 - confidence_level) / 2 * 100
upper_percentile = (1 + confidence_level) / 2 * 100
confidence_interval = (np.percentile(bootstrap_means, lower_percentile),
np.percentile(bootstrap_means, upper_percentile))
print(f'95% Confidence Interval for the Mean: {confidence_interval}')
95% Confidence Interval for the Mean: (47.46666666666667, 70.66666666666667)
Step 5: Visualize the Bootstrap Distribution
¶
In [62]:
plt.hist(bootstrap_means, bins=50, color='skyblue', edgecolor='black')
plt.axvline(confidence_interval[0], color='red', linestyle='dashed')
plt.axvline(confidence_interval[1], color='red', linestyle='dashed')
plt.title('Bootstrap Distribution of the Mean')
plt.xlabel('Mean')
plt.ylabel('Frequency')
plt.show()
Conclusion
¶
The histogram visualizes the distribution of the bootstrap means. The dashed red lines mark the 95% confidence interval. If a value hypothesized for the population mean (for example, the value stated in a null hypothesis) falls outside this interval, that hypothesis can be rejected at the 5% significance level.
Interpretation
¶
Bootstrapping provides an empirical representation of the sampling distribution of the mean. The 95% confidence interval gives us a range in which we are 95% confident that the true population mean lies. This method is especially useful when the sample size is small or the underlying distribution of the data is unknown.
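The same resampling loop works for other statistics. As a quick added illustration (assuming the data array and numpy import from the steps above), this sketch bootstraps the median instead of the mean:
bootstrap_medians = []
for _ in range(10000):
    sample = np.random.choice(data, size=len(data), replace=True)  # resample with replacement
    bootstrap_medians.append(np.median(sample))
median_ci = (np.percentile(bootstrap_medians, 2.5),
             np.percentile(bootstrap_medians, 97.5))
print(f'95% Confidence Interval for the Median: {median_ci}')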
Exercise 15 Normality Assumption
¶
The normality assumption refers to the assumption that the residuals (or errors) of a model are normally distributed. This assumption is foundational for many statistical tests and methods, especially in the context of linear regression and other parametric tests. When the normality assumption is met, it allows for the use of certain statistical techniques that are based on the properties of the normal distribution.
In this exercise, we will explore the normality assumption by checking whether a given
dataset is normally distributed. We'll use Python libraries such as numpy, scipy, and matplotlib for this purpose.
Step 1: Import Necessary Libraries
¶
In [63]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro, probplot
Step 2: Generate Sample Data
¶
Let's create a sample dataset using numpy.
In [64]:
data = np.random.randn(100)
Step 3: Visual Inspection using Histogram
¶
A simple way to check for normality is to visualize the data using a histogram.
In [65]:
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram of the Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 4: Quantile-Quantile Plot
¶
A Q-Q plot is another graphical method to assess normality. If the data is normally distributed, the points should fall roughly on a straight line.
In [66]:
probplot(data, plot=plt)
plt.title('Q-Q Plot')
plt.show()
Step 5: Shapiro-Wilk Test
¶
The Shapiro-Wilk test is a statistical test for normality. A p-value less than 0.05 typically indicates that the data is not normally distributed.
In [67]:
stat, p = shapiro(data)
print(f'Statistic: {stat}, p-value: {p}')
Statistic: 0.9924927949905396, p-value: 0.8556731343269348
Conclusion
¶
Based on the histogram and Q-Q plot, you can visually assess the normality of the data. The Shapiro-Wilk test provides a statistical measure. If the p-value is less than 0.05, it suggests that the data may not be normally distributed.
Interpretation
¶
• If the data appears to be normally distributed based on visualizations and the Shapiro-Wilk test p-value is greater than 0.05, you can proceed with statistical tests that assume normality.
• If the data does not appear to be normally distributed, consider data transformations or non-parametric statistical methods.
Exercise 16 Make Data Gaussian and Gaussian-Like
¶
"Making data Gaussian" or "Gaussian-like" refers to the process of transforming a dataset so that its distribution becomes closer to a Gaussian distribution (also known as a normal distribution).
In this exercise, we will explore various techniques to transform a dataset so that its distribution becomes closer to a Gaussian distribution. We'll use Python libraries such as numpy, scipy, and matplotlib for this purpose.
Step 1: Import Necessary Libraries
¶
In [68]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import boxcox, yeojohnson, shapiro, probplot
from sklearn.preprocessing import QuantileTransformer
Step 2: Generate Sample Data
¶
Let's create a skewed dataset using numpy.
In [69]:
data = np.random.exponential(scale=2, size=1000)
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title('Original Skewed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 3: Log Transformation
¶
In [70]:
data_log = np.log(data)
plt.hist(data_log, bins=30, color='lightgreen', edgecolor='black')
plt.title('Log Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 4: Box-Cox Transformation
¶
In [71]:
data_boxcox, _ = boxcox(data)
plt.hist(data_boxcox, bins=30, color='lightcoral', edgecolor='black')
plt.title('Box-Cox Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 5: Yeo-Johnson Transformation
¶
In [72]:
data_yj, _ = yeojohnson(data)
plt.hist(data_yj, bins=30, color='lightpink', edgecolor='black')
plt.title('Yeo-Johnson Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Step 6: Quantile Transformation
¶
In [73]:
transformer = QuantileTransformer(output_distribution='normal')
data_quantile = transformer.fit_transform(data.reshape(-1, 1)).flatten()
plt.hist(data_quantile, bins=30, color='lightsalmon', edgecolor='black')
plt.title('Quantile Transformed Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Conclusion
¶
Each transformation method has its own characteristics and may be suitable for different types of skewed data. It's essential to visualize the transformed data and possibly use statistical tests like the Shapiro-Wilk test to check the effectiveness of the
transformation.
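As a rough check (not part of the original notebook), the Shapiro-Wilk test can be applied to each transformed array produced above; larger p-values suggest a closer match to a Gaussian shape. This sketch assumes data_log, data_boxcox, data_yj, and data_quantile from the steps above:
for name, arr in [('log', data_log), ('box-cox', data_boxcox),
                  ('yeo-johnson', data_yj), ('quantile', data_quantile)]:
    stat, p = shapiro(arr)
    print(f'{name}: W = {stat:.4f}, p-value = {p:.4f}')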
Interpretation
¶
The goal of these transformations is to make the data more Gaussian-like, which can be beneficial for certain statistical methods and algorithms. However, the transformed data might require a different interpretation, especially in the context of the problem domain.
Exercise 17 Normality Assumption Exercise Using the Iris Dataset
¶
In this exercise, we'll explore the normality assumption by checking whether the features in the Iris dataset are normally distributed. We'll use Python libraries such as pandas, seaborn, scipy, and matplotlib.
Step 1: Import Necessary Libraries
¶
In [74]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro
Step 2: Load the Iris Dataset
¶
The Iris dataset is available in the seaborn library.
In [75]:
iris = sns.load_dataset('iris')
iris.head()
Out[75]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Step 3: Visual Inspection using Histograms
¶
A simple way to check for normality is to visualize the distribution of each feature using histograms.
In [76]:
iris.hist(figsize=(12, 10), bins=30, color='skyblue', edgecolor='black')
plt.suptitle('Histograms of Iris Dataset Features')
plt.show()
Step 4: Shapiro-Wilk Test for Normality
¶
The Shapiro-Wilk test is a statistical test for normality. A p-value less than 0.05 typically indicates that the data is not normally distributed.
In [77]:
features = iris.columns[:-1] # Exclude the 'species' column
for feature in features:
    stat, p = shapiro(iris[feature])
    print(f'{feature} - Statistic: {stat:.4f}, p-value: {p:.4f}')
sepal_length - Statistic: 0.9761, p-value: 0.0102
sepal_width - Statistic: 0.9849, p-value: 0.1011
petal_length - Statistic: 0.8763, p-value: 0.0000
petal_width - Statistic: 0.9018, p-value: 0.0000
Step 5: Q-Q Plots
¶
A Q-Q plot is another graphical method to assess normality. If the data is normally distributed, the points should fall roughly on a straight line.
In [78]:
from scipy.stats import probplot
for feature in features:
    plt.figure(figsize=(8, 6))
    probplot(iris[feature], plot=plt)
    plt.title(f'Q-Q Plot for {feature}')
    plt.show()
Conclusion
¶
Based on the histograms, Q-Q plots, and the Shapiro-Wilk test results, you can assess the normality of each feature in the Iris dataset.
Interpretation
¶
• If a feature appears to be normally distributed based on visualizations and the Shapiro-Wilk test p-value is greater than 0.05, it suggests that the feature is approximately normally distributed.
• If the p-value is less than 0.05, it suggests that the feature may not be normally distributed. In such cases, consider data transformations or non-parametric statistical methods.
Summary:
¶
• sepal_length, petal_length, and petal_width are not normally distributed.
• sepal_width is approximately normally distributed.
Revised Date: October 7, 2023
In [ ]: