Midterm: Applied Statistics
1. State the differences between descriptive statistics and inferential statistics.

2. Determine the level of data measurement and explain your reasoning. Highest grade level completed.

3. Determine the level of data measurement and explain your reasoning. Number of years at last job.

4. Determine the method for collecting data and explain your reasoning. To study the effect of music on driving habits, 8 drivers drove 500 miles while listening to music. a) observational b) census c) experimental d) simulation

5. Determine the method for collecting data and explain your reasoning. Determining the average household income of homes in Salt Lake City.

6. Determine the type of sampling and explain your reasoning. Paul is the Vice President for Sacred Heart University. He is responsible for the capital campaign to raise money for the new student services building. Paul selects the first 100 alumni listed on a web-based social networking site for the University. He intends to contact these individuals regarding possible donations. His sample is a ___.

7. Identify whether the data are a population or a sample and explain your reasoning. 62 of the 97 passengers aboard American Airlines survived its explosion.

8. Determine the type of sampling and explain your reasoning. Chosen at random, 300 people who received care at the University Hospital participated in a survey.

9. Determine whether the data are qualitative or quantitative, and the level of data measurement. Explain your reasoning. Telephone numbers in a directory.

10. Explain the 'placebo effect'.

Suppose that a random sample of size 35 is to be selected from a population with a mean of 70 and a standard deviation of 10. Write your Python code for the following calculations. Round decimal numbers to the nearest thousandth (3 decimal places).

11. Probability of getting x̄ above 65 and below 85
12. Probability of getting x̄ below 60 or above 90

Probability of getting x̄ above 65 and below 85: 0.998
Probability of getting x̄ below 60 or above 90: 0.0

Questions 13-15: 40% of Americans say they are confident that passenger trips to the moon will occur during their lifetime. You randomly select 200 Americans and ask whether he or she thinks passenger trips to the moon will occur in his or her lifetime. For the following questions, use (1) the normal approximation with the continuity correction and (2) the binomial distribution. Use 4 decimal places.

13. What is the probability that at most 150 people will say yes?
14. What is the probability that exactly 75 people will say yes?
15. What is the probability that more than 70 and no more than 90 people will say yes?

Question 13:
Probability (Normal Approximation): 1.0
Probability (Binomial Distribution): 1.0

Question 14:
Probability (Binomial Distribution): 0.0448

Question 15:
Probability (Normal Approximation): 0.85
Probability (Binomial Distribution): 0.8501
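The answer above reports only the binomial value for Question 14, although the question asks for both methods. A minimal sketch (not in the original notebook; variable names are illustrative) of the normal approximation with the continuity correction, treating "exactly 75" as the interval from 74.5 to 75.5:

from scipy import stats

n, p = 200, 0.40
mean = n * p                         # 80
std_dev = (n * p * (1 - p)) ** 0.5   # sqrt(48), about 6.928

# P(X = 75) is approximated by P(74.5 < X < 75.5) under the normal curve
z_low = (74.5 - mean) / std_dev
z_high = (75.5 - mean) / std_dev
prob_exactly_75_normal = stats.norm.cdf(z_high) - stats.norm.cdf(z_low)
print(round(prob_exactly_75_normal, 4))   # roughly 0.044, close to the binomial 0.0448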
16. Using the 'salaries' data set in the exam folder, construct the 90%, 95%, and 99% confidence intervals for the average salary of male professors and female professors, respectively. Interpret the results and compare the widths of the confidence intervals.

90.0% Confidence Intervals:
Male Professors: (112444.43903175788, 117736.39895706894)
Female Professors: (94166.94757793451, 107837.87293488599)

95.0% Confidence Intervals:
Male Professors: (111937.53940340012, 118243.2985854267)
Female Professors: (92857.45410471286, 109147.36640810764)

99.0% Confidence Intervals:
Male Professors: (110946.83283407938, 119234.00515474744)
Female Professors: (90298.12339341138, 111706.69711940912)

Interpretation: At each level we are 90%, 95%, or 99% confident that the interval contains the true mean salary for that group. For both groups the intervals widen as the confidence level increases, and at every level the interval for female professors is substantially wider than the interval for male professors (for example, about $13,700 versus about $5,300 at 90%), reflecting greater uncertainty about the female mean.

17. In a random sample of eight people, the mean commute time to work was 35.5 minutes and the sample standard deviation was 7.2 minutes. Assume the population is normally distributed. Construct a 95% confidence interval for the population mean. Interpret the result.

With n = 8, the critical value is t(0.025, 7) ≈ 2.365, so the margin of error is 2.365 × 7.2/√8 ≈ 6.02 minutes.

95% Confidence Interval for the population mean commute time to work: (29.48 minutes, 41.52 minutes)

Interpretation: We are 95% confident that the true population mean commute time to work lies between 29.48 minutes (35.5 minutes minus the margin of error) and 41.52 minutes (35.5 minutes plus the margin of error).

18. Suppose the population of all public universities shows that the annual parking fee per student has a mean of $110 and a standard deviation of $18. If a random sample of size 49 is drawn from the population, the probability of drawing a sample with a mean of more than $115 is ___.

Given:
Population mean (μ) = 110
Population standard deviation (σ) = 18
Sample size (n) = 49
Sample mean (x̄) = 115

We calculate the z-score for the sample mean and then find the probability from the standard normal distribution. First, the standard error of the mean: SE = σ/√n = 18/√49 ≈ 2.571. Then the z-score: z = (x̄ − μ)/SE = (115 − 110)/2.571 ≈ 1.944. Finally, the probability of drawing a sample with a mean of more than $115 is P(Z > 1.944), found from the standard normal table or the cumulative distribution function.

Probability of drawing a sample with a mean of more than $115: 0.0259

19. Use the dataset of Minutes Spent on the Phone (10 points):

102 124 108 86 103 82 58 78 93 90 35 71 104 112 118 87 95 130 45 95 57 78 103 116 85 122 87 100 120 97 39 133 184 105 97 107 67 78 125 49 86 97 88 103 109 99 105 99 101 92 293 149 82 204 192

(1) Draw the box-and-whisker plot with the 5-number summary on the plot.
(2) Calculate the population S.D. and sample S.D. using the formula. Show all steps.
(3) Calculate the population S.D. and sample S.D. using any Python library and its built-in functions.
(4) Draw the histogram using 10-point bins [20, 30), [30, 40), ...
(5) Construct an extended frequency table with the columns: Class, Frequency, Midpoint, Relative Frequency, and Cumulative Frequency.
(6) Test whether the dataset is normally distributed.

To calculate the population standard deviation (σ):
Step 1: Calculate the deviations from the mean (xi − μ) by subtracting the population mean (μ) from each data point (xi).
Step 2: Square the deviations ((xi − μ)²) obtained in Step 1.
Step 3: Sum all the squared deviations (∑(xi − μ)²).
Step 4: Divide the sum of squared deviations by the total number of observations (N).
Step 5: Take the square root of the result from Step 4 to obtain the population standard deviation (σ).

For the sample standard deviation (s), the steps are similar:
Step 1: Calculate the deviations from the sample mean (xi − x̄) by subtracting the sample mean (x̄) from each data point (xi).
Step 2: Square the deviations ((xi − x̄)²) obtained in Step 1.
Step 3: Sum all the squared deviations (∑(xi − x̄)²).
Step 4: Divide the sum of squared deviations by the number of observations minus one (n − 1).
Step 5: Take the square root of the result from Step 4 to obtain the sample standard deviation (s).
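Part (2) asks for the standard deviations "using the formula" with all steps shown; the original answer only describes the steps in words. A minimal sketch that follows those steps directly, without calling np.std (it should reproduce the library values reported just below):

data = [102, 124, 108, 86, 103, 82, 58, 78, 93, 90, 35, 71, 104, 112, 118,
        87, 95, 130, 45, 95, 57, 78, 103, 116, 85, 122, 87, 100, 120, 97,
        39, 133, 184, 105, 97, 107, 67, 78, 125, 49, 86, 97, 88, 103, 109,
        99, 105, 99, 101, 92, 293, 149, 82, 204, 192]

n = len(data)
mean = sum(data) / n

squared_devs = [(x - mean) ** 2 for x in data]   # Steps 1-2: deviations, then squared
ss = sum(squared_devs)                           # Step 3: sum of squared deviations

population_sd = (ss / n) ** 0.5        # Steps 4-5 with divisor N
sample_sd = (ss / (n - 1)) ** 0.5      # Steps 4-5 with divisor n - 1

print("Population S.D. (formula):", population_sd)
print("Sample S.D. (formula):", sample_sd)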
Population Standard Deviation (using Python library): 41.16330709794427
Sample Standard Deviation (using Python library): 41.542700436971586

Extended Frequency Table:

Class      Frequency   Midpoint   Relative Frequency   Cumulative Frequency
20-30      0           25.0       0.0000               0
30-40      2           35.0       0.0364               2
40-50      2           45.0       0.0364               4
50-60      2           55.0       0.0364               6
60-70      1           65.0       0.0182               7
70-80      4           75.0       0.0727               11
80-90      8           85.0       0.1455               19
90-100     10          95.0       0.1818               29
100-110    12          105.0      0.2182               41
110-120    3           115.0      0.0545               44
120-130    4           125.0      0.0727               48
130-140    2           135.0      0.0364               50
140-150    1           145.0      0.0182               51
150-160    0           155.0      0.0000               51
160-170    0           165.0      0.0000               51
170-180    0           175.0      0.0000               51
180-190    1           185.0      0.0182               52
190-200    1           195.0      0.0182               53
200-210    1           205.0      0.0182               54
210-220    0           215.0      0.0000               54
220-230    0           225.0      0.0000               54
230-240    0           235.0      0.0000               54
240-250    0           245.0      0.0000               54
250-260    0           255.0      0.0000               54
260-270    0           265.0      0.0000               54
270-280    0           275.0      0.0000               54
280-290    0           285.0      0.0000               54
290-300    1           295.0      0.0182               55

Shapiro-Wilk test statistic: 0.8130385279655457
p-value: 7.040425202831102e-07
Since the p-value < 0.05, we reject the null hypothesis that the data are normally distributed.

20. A survey reports that the average price for a gallon of regular unleaded gasoline is $3.56. You believe that the actual price in the Northeast area is not equal to this price. You decide to test this claim by using 24 randomly surveyed prices:

3.87 3.54 3.90 3.33 2.99 3.25 3.48 3.52 3.39 4.24 3.95 3.28 3.48 3.27 3.58 3.39 3.29 3.52 3.55 3.91 2.88 3.02 3.26 3.74

Do the hypothesis testing using both the rejection region and the p-value for α = 0.01. Show all steps. (20 points)

Hypothesis Testing for the Gasoline Price in the Northeast Area

Rejection Region Method:

Step 1: Define the hypotheses.
Null hypothesis (H0): The actual price for a gallon of regular unleaded gasoline in the Northeast area is equal to $3.56.
Alternative hypothesis (Ha): The actual price for a gallon of regular unleaded gasoline in the Northeast area is not equal to $3.56.

Step 2: Set the significance level: α = 0.01.

Step 3: Calculate the test statistic.
Sample mean (x̄) = 3.4846
Sample standard deviation (s) = 0.3279
Sample size (n) = 24
Because the population standard deviation is unknown, we use the t statistic:
t = (x̄ − μ) / (s / √n) = (3.4846 − 3.56) / (0.3279 / √24) ≈ −1.127

Step 4: Determine the rejection region.
Since this is a two-tailed test, α/2 = 0.005. With n − 1 = 23 degrees of freedom, the critical t-values are approximately −2.807 and 2.807.

Step 5: Make a decision.
Since −1.127 lies between −2.807 and 2.807 (the non-rejection region), we fail to reject the null hypothesis.

Python output:
Sample Mean: 3.484583333333333
Test Statistic (t): -1.1268349418452264
Critical Values (rejection-region bounds for the sample mean): 3.372110991928031 3.747889008071969
P-value: 0.27143192185245585
Fail to reject the null hypothesis.
Conclusion of the Hypothesis Test

Based on the hypothesis test conducted for the average price of gasoline in the Northeast area at a significance level of 0.01:

Sample Mean Price: The sample mean price of gasoline in the Northeast area is approximately $3.48 per gallon.
Test Statistic (t): The calculated test statistic is approximately −1.127.
Critical Values: The rejection-region bounds for the sample mean at α = 0.01 are approximately 3.372 and 3.748.
P-value: The calculated p-value is approximately 0.271.

Conclusion: Since the test statistic falls inside the non-rejection region and the p-value is greater than the significance level (0.01), we fail to reject the null hypothesis. Therefore, we do not have sufficient evidence to conclude that the average price of gasoline in the Northeast area is different from $3.56 per gallon at a significance level of 0.01.

Importing Libraries

In [14]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

1. Descriptive Statistics:
* Describes the basic features of a dataset.
* Provides insight into the characteristics of the data, such as central tendency, variability, and distribution.
* Examples include the mean, median, mode, standard deviation, range, and percentiles.
* Used to summarize and understand the data within a sample.
* Does not involve making inferences or predictions about a larger population.

Inferential Statistics:
* Makes inferences or predictions about a population based on sample data.
* Involves testing hypotheses and making predictions about population parameters.
* Examples include hypothesis testing, confidence intervals, regression analysis, and ANOVA.
* Used to draw conclusions about the population from which the sample was drawn.
* Results are subject to uncertainty and estimation due to sampling variability.
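As a brief illustration of the distinction (not part of the original answer), the sketch below first summarizes a sample descriptively and then makes an inferential statement about the population mean; the sample values are invented for the example.

import numpy as np
from scipy import stats

# A hypothetical sample of commute times (minutes)
sample = np.array([32, 41, 28, 35, 39, 30, 44, 36])

# Descriptive statistics: summarize this sample only
print("mean:", np.mean(sample))
print("sample std:", np.std(sample, ddof=1))
print("median:", np.median(sample))

# Inferential statistics: use the sample to estimate the population mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=np.mean(sample),
                      scale=stats.sem(sample))
print("95% CI for the population mean:", ci)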
In the case of "Number of years at last job," it represents a count of years spent at a particular job , and it has a true zero point , which implies the absence of time spent at the job . "Number of years at last job" falls under the ratio level because it satisfies the criteria for this level of measurement : It possesses a true zero point : A value of zero indicates the absence of years spent at the last job . Ratios between measurements are meaningful : For instance , if one person spent 4 years at their last job and another spent 8 years , the second person spent twice as long as the first person at their last job . Arithmetic operations such as addition , subtraction , multiplication , and division are applicable and meaningful . In [ ]: The method for collecting data in this scenario is experimental . Experimental methods involve manipulating one or more variables to observe the effect on another variable , while controlling other factors that could influence The researchers are studying the effect of music ( the independent variable ) on driving habits ( the dependent variable ) . They conducted an experiment where they exposed a group of drivers to music while driving . By having the drivers listen to music during their 500 - mile journey , the researchers are directly manipulating the independent variable ( presence of music ) to The driving habits of the participants ( such as speed , reaction time , attention , etc . ) are the dependent variables being measured . The experiment allows researchers to compare the driving habits of the participants while listening to music to their driving habits in the absence of music ( w By controlling other factors ( such as the type of music , driving conditions , vehicle type , etc . ), the researchers can isolate the effect of music on driving ha The experiment provides data that can be analyzed to determine whether there is a statistically significant difference in driving habits between the music and In [ ]: The method for collecting data in this scenario is a census . Census method involves collecting data from every member of the population of interest . In this case , the population of interest is all households in Salt Lake City . The objective is to determine the average household income of all homes in Salt Lake City . To achieve this , data would be collected from every household in Salt Lake City , ensuring that no households are left out . By collecting data from every household , the census method provides a comprehensive and accurate picture of the average household income in Salt Lake City . Census data can be obtained through various means such as surveys , administrative records , or other data collection methods . Once the data is collected from all households , the average household income can be calculated by summing up the incomes of all households and dividing by the In [ ]: The type of sampling used in this scenario is convenience sampling . Convenience sampling involves selecting individuals who are readily available or easily accessible to the researcher . In this case , Paul selects the first 100 alumni listed on a web - based social networking site for the University . The selection of individuals is based on convenience and accessibility rather than a random or systematic method . Paul 's decision to select alumni from a web-based social networking site suggests that he chose individuals who were easily reachable through this platform without considering other factors . 
While convenience sampling is quick and easy to implement, it may not provide a representative sample of the population because it can exclude certain groups. In this scenario, Paul's sample of the first 100 alumni listed on the social networking site may not accurately represent all alumni of the university.

7. In this scenario, the data represent a sample. The data provided (62 survivors out of 97 passengers) describe a subset of the passengers of interest. A population refers to the entire group that is the subject of interest in a study, while a sample is a subset of the population selected for observation and analysis. Because the data cover only a portion of the passengers of interest, they are treated as a sample rather than the entire population; if the data had covered every passenger of interest, they would represent the population. The sample data can be analyzed to draw conclusions about the characteristics of the survivors and the overall survival rate, but they may not necessarily represent the entire population.

8. The type of sampling used in this scenario is simple random sampling. Simple random sampling involves randomly selecting individuals from the population without any specific criteria or systematic method. Here, 300 people who received care at the University Hospital were chosen at random to participate in the survey, so each individual in that population has an equal chance of being selected. Random selection ensures that every member of the population has an equal chance of being included, making the sample representative of the population. Simple random sampling is considered one of the most unbiased and reliable sampling methods because it eliminates the potential for selection bias, so the researchers can obtain a sample that accurately represents the population of people who received care at the University Hospital.

9. Telephone numbers in a directory are typically considered qualitative data, because telephone numbers serve as identifiers or labels for individuals or organizations and do not represent numerical quantities that can be subjected to meaningful mathematical operations. In terms of the level of data measurement, telephone numbers are nominal data. Nominal data represent categories or labels without any inherent order or numerical value: each telephone number uniquely identifies a specific entity (a person or organization) but does not imply any particular ranking or order.

10. The placebo effect refers to the phenomenon in which a person experiences a perceived improvement in their condition or symptoms after receiving a treatment that has no therapeutic effect. In other words, the person's belief in the effectiveness of the treatment, rather than the treatment itself, leads to a positive response.
Psychological response: The placebo effect is primarily a psychological response, influenced by a person's expectations, beliefs, and perceptions about the treatment.
Mechanism: The exact mechanism underlying the placebo effect is not fully understood, but it is believed to involve complex interactions between the brain, nervous system, and body.
Placebo control: In clinical research, a placebo control is often used to distinguish between the effects of a treatment and the placebo effect; participants in the control group receive an inactive treatment instead of the one being tested.
Variability: The placebo effect can vary widely among individuals and across different conditions. Some people may be highly responsive to placebos, while others show little or no response.
Ethical considerations: While the placebo effect can be beneficial in some cases, it also raises ethical considerations, particularly in clinical practice and research.

In [1]:
population_mean = 70
population_std = 10
sample_size = 35

# Standard error of the sample mean (standard deviation of the sampling distribution)
standard_error = population_std / (sample_size ** 0.5)

# Z-scores for the given sample-mean thresholds
z_score_65 = (65 - population_mean) / standard_error
z_score_85 = (85 - population_mean) / standard_error
z_score_60 = (60 - population_mean) / standard_error
z_score_90 = (90 - population_mean) / standard_error

# Probabilities from the cumulative distribution function (CDF) of the standard normal distribution
probability_xbar_65_to_85 = stats.norm.cdf(z_score_85) - stats.norm.cdf(z_score_65)
probability_xbar_below_60_or_above_90 = 1 - (stats.norm.cdf(z_score_90) - stats.norm.cdf(z_score_60))

# Round the probabilities to 3 decimal places
probability_xbar_65_to_85 = round(probability_xbar_65_to_85, 3)
probability_xbar_below_60_or_above_90 = round(probability_xbar_below_60_or_above_90, 3)

# Print the results
print("Probability of getting x̄ above 65 and below 85:", probability_xbar_65_to_85)
print("Probability of getting x̄ below 60 or above 90:", probability_xbar_below_60_or_above_90)
In [3]:
p = 0.40
n = 200

# Normal approximation parameters
mean = n * p
std_dev = (n * p * (1 - p)) ** 0.5

# Question 13: probability that at most 150 people will say yes
# Normal approximation with continuity correction
z_score_150 = (150 + 0.5 - mean) / std_dev
prob_at_most_150_normal = stats.norm.cdf(z_score_150)
# Binomial distribution
prob_at_most_150_binomial = stats.binom.cdf(150, n, p)

# Question 14: probability that exactly 75 people will say yes
# Binomial distribution
prob_exactly_75_binomial = stats.binom.pmf(75, n, p)

# Question 15: probability that more than 70 and no more than 90 people will say yes
# Normal approximation with continuity correction
z_score_70 = (70 + 0.5 - mean) / std_dev
z_score_90 = (90 + 0.5 - mean) / std_dev
prob_greater_than_70_and_no_more_than_90_normal = stats.norm.cdf(z_score_90) - stats.norm.cdf(z_score_70)
# Binomial distribution
prob_greater_than_70_and_no_more_than_90_binomial = stats.binom.cdf(90, n, p) - stats.binom.cdf(70, n, p)

# Print the results
print("Question 13:")
print("Probability (Normal Approximation):", round(prob_at_most_150_normal, 4))
print("Probability (Binomial Distribution):", round(prob_at_most_150_binomial, 4))
print("\nQuestion 14:")
print("Probability (Binomial Distribution):", round(prob_exactly_75_binomial, 4))
print("\nQuestion 15:")
print("Probability (Normal Approximation):", round(prob_greater_than_70_and_no_more_than_90_normal, 4))
print("Probability (Binomial Distribution):", round(prob_greater_than_70_and_no_more_than_90_binomial, 4))

In [7]:
# Load the dataset
data = pd.read_csv('Salaries (1).csv')

# Calculate confidence intervals for male and female professors' salaries
def calculate_confidence_intervals(data, confidence_levels):
    # Separate data for male and female professors
    male_data = data[data['sex'] == 'Male']['salary']
    female_data = data[data['sex'] == 'Female']['salary']

    # Mean and standard deviation for both groups
    mean_male = np.mean(male_data)
    mean_female = np.mean(female_data)
    std_male = np.std(male_data, ddof=1)    # ddof=1 for the sample standard deviation
    std_female = np.std(female_data, ddof=1)

    # Sample sizes
    n_male = len(male_data)
    n_female = len(female_data)

    # Dictionary to store results
    results = {}

    # Confidence intervals for both groups
    for confidence_level in confidence_levels:
        # Critical value
        z_score = stats.norm.ppf((1 + confidence_level) / 2)

        # Margins of error
        margin_of_error_male = z_score * (std_male / np.sqrt(n_male))
        margin_of_error_female = z_score * (std_female / np.sqrt(n_female))

        # Confidence intervals
        confidence_interval_male = (mean_male - margin_of_error_male, mean_male + margin_of_error_male)
        confidence_interval_female = (mean_female - margin_of_error_female, mean_female + margin_of_error_female)

        # Store the results
        results[confidence_level] = {
            'Male Professors': confidence_interval_male,
            'Female Professors': confidence_interval_female
        }

    return results

# Confidence levels
confidence_levels = [0.90, 0.95, 0.99]

# Calculate the confidence intervals
confidence_intervals = calculate_confidence_intervals(data, confidence_levels)

# Print the results
for confidence_level, intervals in confidence_intervals.items():
    print(f"{confidence_level * 100}% Confidence Intervals:")
    for group, interval in intervals.items():
        print(f"{group}: {interval}")
    print()
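Question 16 also asks for a comparison of the interval widths; the cell above prints only the endpoints. A small follow-up sketch (not in the original notebook) that computes each width from the reported intervals, with the values copied from the output above and rounded to two decimals:

# Widths of the reported confidence intervals
intervals = {
    "90%": {"Male": (112444.44, 117736.40), "Female": (94166.95, 107837.87)},
    "95%": {"Male": (111937.54, 118243.30), "Female": (92857.45, 109147.37)},
    "99%": {"Male": (110946.83, 119234.01), "Female": (90298.12, 111706.70)},
}
for level, groups in intervals.items():
    for group, (low, high) in groups.items():
        print(level, group, "width:", round(high - low, 2))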
In [8]:
# Given data
sample_mean = 35.5
sample_std = 7.2
sample_size = 8

# Critical value (t) for a 95% confidence level with (n - 1) degrees of freedom
t_critical = stats.t.ppf(0.975, df=sample_size - 1)

# Margin of error
margin_of_error = t_critical * (sample_std / np.sqrt(sample_size))

# Confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Print the result
print(f"95% Confidence Interval for the population mean commute time to work: ({lower_bound:.2f} minutes, {upper_bound:.2f} minutes)")

In [13]:
population_mean = 110
population_std_dev = 18
sample_size = 49
sample_mean = 115

# Standard error
standard_error = population_std_dev / (sample_size ** 0.5)

# Z-score
z_score = (sample_mean - population_mean) / standard_error

# Probability from the cumulative distribution function (CDF)
probability_more_than_115 = 1 - stats.norm.cdf(z_score)

# Print the result rounded to 4 decimal places
print("Probability of drawing a sample with a mean of more than $115:", round(probability_more_than_115, 4))

In [4]:
# Dataset
data = [102, 124, 108, 86, 103, 82, 58, 78, 93, 90, 35, 71, 104, 112, 118,
        87, 95, 130, 45, 95, 57, 78, 103, 116, 85, 122, 87, 100, 120, 97,
        39, 133, 184, 105, 97, 107, 67, 78, 125, 49, 86, 97, 88, 103, 109,
        99, 105, 99, 101, 92, 293, 149, 82, 204, 192]

# Custom colors for box-plot elements
boxprops = dict(color="orange")
whiskerprops = dict(color="red")
medianprops = dict(color="blue")
meanprops = dict(marker='o', markerfacecolor='green', markersize=8, linestyle='none')

# Horizontal box plot
plt.boxplot(data, vert=False, boxprops=boxprops, whiskerprops=whiskerprops,
            medianprops=medianprops, meanprops=meanprops)

# Title and labels
plt.title('Horizontal Box Plot of Minutes Spent on the Phone')
plt.xlabel('Minutes')
plt.ylabel('Data')

# Show the plot
plt.show()

In [15]:
# Population S.D. using numpy
population_std_dev = np.std(data)

# Sample S.D. using numpy
sample_std_dev = np.std(data, ddof=1)

print("Population Standard Deviation (using Python library):", population_std_dev)
print("Sample Standard Deviation (using Python library):", sample_std_dev)

In [16]:
# Histogram with 10-point bins
plt.hist(data, bins=range(20, 301, 10), edgecolor='black')
plt.title('Histogram of Minutes Spent on the Phone')
plt.xlabel('Minutes')
plt.ylabel('Frequency')
plt.show()

In [17]:
# Bin edges
bins = [20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160,
        170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300]

# Frequency table
frequency_table, _ = np.histogram(data, bins=bins)
midpoints = [(bins[i] + bins[i + 1]) / 2 for i in range(len(bins) - 1)]
relative_frequency = frequency_table / len(data)
cumulative_frequency = np.cumsum(frequency_table)

# Print the extended frequency table
print("Extended Frequency Table:")
print("Class\tFrequency\tMidpoint\tRelative Frequency\tCumulative Frequency")
for i in range(len(midpoints)):
    print(f"{bins[i]}-{bins[i + 1]}\t{frequency_table[i]}\t\t{midpoints[i]}\t\t{relative_frequency[i]:.4f}\t\t\t{cumulative_frequency[i]}")

In [18]:
# Shapiro-Wilk test for normality
statistic, p_value = stats.shapiro(data)

print("Shapiro-Wilk test statistic:", statistic)
print("p-value:", p_value)

if p_value > 0.05:
    print("Since p-value > 0.05, we fail to reject the null hypothesis that the data is normally distributed.")
else:
    print("Since p-value < 0.05, we reject the null hypothesis that the data is normally distributed.")
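Question 19(1) asks for the five-number summary to be displayed on the plot, which the box-plot cell In [4] above does not do. A minimal sketch (not part of the original notebook) that computes the summary with np.percentile and annotates it on a horizontal box plot:

import numpy as np
import matplotlib.pyplot as plt

data = [102, 124, 108, 86, 103, 82, 58, 78, 93, 90, 35, 71, 104, 112, 118,
        87, 95, 130, 45, 95, 57, 78, 103, 116, 85, 122, 87, 100, 120, 97,
        39, 133, 184, 105, 97, 107, 67, 78, 125, 49, 86, 97, 88, 103, 109,
        99, 105, 99, 101, 92, 293, 149, 82, 204, 192]

# Five-number summary: minimum, Q1, median, Q3, maximum
five_num = np.percentile(data, [0, 25, 50, 75, 100])

plt.boxplot(data, vert=False)
for value in five_num:
    # Label each summary value just above the box
    plt.annotate(f"{value:.1f}", xy=(value, 1), xytext=(value, 1.15),
                 ha="center", rotation=45)
plt.title("Box-and-Whisker Plot with Five-Number Summary")
plt.xlabel("Minutes")
plt.show()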
In [12]:
import numpy as np
from scipy import stats

# Given data
prices = np.array([3.87, 3.54, 3.90, 3.33, 2.99, 3.25, 3.48, 3.52, 3.39, 4.24,
                   3.95, 3.28, 3.48, 3.27, 3.58, 3.39, 3.29, 3.52, 3.55, 3.91,
                   2.88, 3.02, 3.26, 3.74])

# Given parameters
population_mean = 3.56
sample_size = len(prices)
alpha = 0.01

# Sample mean and standard deviation
sample_mean = np.mean(prices)
sample_std = np.std(prices, ddof=1)

# Test statistic
t_statistic = (sample_mean - population_mean) / (sample_std / np.sqrt(sample_size))

# Critical value
t_critical = stats.t.ppf(1 - alpha / 2, df=sample_size - 1)

# Rejection-region bounds on the sample-mean scale
lower_critical = population_mean - t_critical * (sample_std / np.sqrt(sample_size))
upper_critical = population_mean + t_critical * (sample_std / np.sqrt(sample_size))

# Two-tailed p-value
p_value = 2 * stats.t.cdf(-np.abs(t_statistic), df=sample_size - 1)

# Print results
print("Sample Mean:", sample_mean)
print("Test Statistic (t):", t_statistic)
print("Critical Values:", lower_critical, upper_critical)
print("P-value:", p_value)

# Make a decision
if np.abs(t_statistic) > t_critical or p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
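For reference, the "Critical Values" printed by the cell above are the rejection-region bounds expressed on the sample-mean scale. A small check (not in the original notebook) that relates them back to the t-scale critical value used in the written solution:

import numpy as np
from scipy import stats

prices = np.array([3.87, 3.54, 3.90, 3.33, 2.99, 3.25, 3.48, 3.52, 3.39, 4.24,
                   3.95, 3.28, 3.48, 3.27, 3.58, 3.39, 3.29, 3.52, 3.55, 3.91,
                   2.88, 3.02, 3.26, 3.74])
mu0, alpha, n = 3.56, 0.01, len(prices)

se = np.std(prices, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # about 2.807 with 23 degrees of freedom
print(t_crit)
# Converting back to the sample-mean scale reproduces the bounds printed above
print(mu0 - t_crit * se, mu0 + t_crit * se)     # about 3.372 and 3.748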