IS Project -Suraj Mishra

School

Chitkara University

Course

125

Subject

Mechanical_engineering

Date

May 8, 2024

INFERENTIAL STATISTICS Project Report By – Suraj Mishra
Contents
Problem 1 - An independent research organization is trying to estimate the probability that an accident at a nuclear power plant will result in radiation leakage. The types of accidents possible at the plant are fire hazards, mechanical failure, or human error. The research organization also knows that two or more types of accidents cannot occur simultaneously.
Problem 2 - Grades of the final examination in a training course are found to be normally distributed, with a mean of 77 and a standard deviation of 8.5.
Problem 3 - E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current web page is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.
Problem 1
An independent research organization is trying to estimate the probability that an accident at a nuclear power plant will result in radiation leakage. The types of accidents possible at the plant are fire hazards, mechanical failure, or human error. The research organization also knows that two or more types of accidents cannot occur simultaneously.
According to the studies carried out by the organization, the probability of a radiation leak in case of a fire is 20%, the probability of a radiation leak in case of a mechanical failure is 50%, and the probability of a radiation leak in case of a human error is 10%. The studies also showed the following:
- The probability of a radiation leak occurring simultaneously with a fire is 0.1%.
- The probability of a radiation leak occurring simultaneously with a mechanical failure is 0.15%.
- The probability of a radiation leak occurring simultaneously with a human error is 0.12%.
On the basis of the information available, answer the questions below:
1.1 What are the probabilities of a fire, a mechanical failure, and a human error respectively?
1.2 What is the probability of a radiation leak?
1.3 Suppose there has been a radiation leak in the reactor for which the definite cause is not known. What is the probability that it has been caused by: a) a fire? b) a mechanical failure? c) a human error?
Solution
Defining the events:
F = Fire
M = Mechanical failure
H = Human error
R = Radiation leak
N = No accident
Given probabilities:
P(R | F) = 0.2
P(R | M) = 0.5
P(R | H) = 0.1
P(R ∩ F) = 0.001
P(R ∩ M) = 0.0015
P(R ∩ H) = 0.0012
1.1 What are the probabilities of a fire, a mechanical failure, and a human error respectively?
By the definition of conditional probability, P(A) = P(R ∩ A) / P(R | A) for each accident type A:
P(F) = P(R ∩ F) / P(R | F) = 0.001 / 0.2 = 0.005
P(M) = P(R ∩ M) / P(R | M) = 0.0015 / 0.5 = 0.003
P(H) = P(R ∩ H) / P(R | H) = 0.0012 / 0.1 = 0.012
1.2 What is the probability of a radiation leak?
Since fire, mechanical failure, and human error are the only possible accident types and no two of them can occur simultaneously, the probability of no accident is:
P(N) = 1 - (0.005 + 0.003 + 0.012) = 0.98
Without an accident there is no leak, so P(R | N) = 0 and P(R ∩ N) = P(R | N) P(N) = 0.
By the law of total probability:
P(R) = P(R ∩ F) + P(R ∩ M) + P(R ∩ H) + P(R ∩ N)
P(R) = 0.001 + 0.0015 + 0.0012 + 0
P(R) = 0.0037
1.3 Suppose there has been a radiation leak in the reactor for which the definite cause is not known. What is the probability that it has been caused by: a) a fire? b) a mechanical failure?
c) a human error?
Given that a radiation leak has occurred and its cause is not known, the probability of each cause follows from Bayes' theorem:
a) Fire: P(F | R) = P(R ∩ F) / P(R) = 0.001 / 0.0037 = 0.270
b) Mechanical failure: P(M | R) = P(R ∩ M) / P(R) = 0.0015 / 0.0037 = 0.405
c) Human error: P(H | R) = P(R ∩ H) / P(R) = 0.0012 / 0.0037 = 0.324
Problem 2
Grades of the final examination in a training course are found to be normally distributed, with a mean of 77 and a standard deviation of 8.5. Based on the given information, answer the questions below.
2.1 What is the probability that a randomly chosen student gets a grade below 85 on this exam?
2.2 What is the probability that a randomly selected student scores between 65 and 87?
2.3 What should be the passing cut-off so that 75% of the students clear the exam?
2.1 What is the probability that a randomly chosen student gets a grade below 85 on this exam?
To find the probability that a randomly chosen student gets a grade below 85, we need to calculate the cumulative probability up to the value of 85.
Using the Z-score formula: Z = (X - μ) / σ
where:
X = the value we want to find the probability for (85 in this case)
μ = the mean (77)
σ = the standard deviation (8.5)
Z = (85 - 77) / 8.5
Z = 0.941176
Now, we can use a standard normal distribution table or a calculator to find the cumulative probability corresponding to the Z-score of 0.941176. The cumulative probability (area under the curve to the left of Z) for a Z-score of 0.941176 is approximately 0.8267. Therefore, the probability that a randomly chosen student gets a grade below 85 is approximately 0.8267, or 82.67%.
2.2 What is the probability that a randomly selected student scores between 65 and 87?
To find the probability that a randomly selected student scores between 65 and 87, we need to calculate the cumulative probability for both values and subtract the smaller probability from the larger probability.
Using the Z-score formula:
For 65: Z1 = (65 - 77) / 8.5 = -1.411765
For 87: Z2 = (87 - 77) / 8.5 = 1.176471
Using the standard normal distribution table or a calculator, we find the cumulative probabilities corresponding to Z1 and Z2.
For Z1 = -1.411765, the cumulative probability is approximately 0.0790.
For Z2 = 1.176471, the cumulative probability is approximately 0.8803.
The probability of scoring between 65 and 87 is the difference between the two cumulative probabilities:
0.8803 - 0.0790 = 0.8013
Therefore, the probability that a randomly selected student scores between 65 and 87 is approximately 0.8013, or 80.13%.
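The hand calculations so far (the conditional-probability steps in Problem 1 and the normal probabilities in 2.1 and 2.2) can be cross-checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the original report. Values computed directly from the normal CDF can differ in the third or fourth decimal from values read off a table with Z rounded to two decimals.

```python
from statistics import NormalDist

# Problem 1: inputs taken from the problem statement.
p_leak_given = {"fire": 0.2, "mechanical": 0.5, "human": 0.1}        # P(R | cause)
p_leak_and = {"fire": 0.001, "mechanical": 0.0015, "human": 0.0012}  # P(R and cause)

# 1.1: P(cause) = P(R and cause) / P(R | cause)
p_cause = {c: p_leak_and[c] / p_leak_given[c] for c in p_leak_and}

# 1.2: the causes are mutually exclusive and "no accident" contributes 0,
# so P(R) is the sum of the joint probabilities.
p_leak = sum(p_leak_and.values())

# 1.3: Bayes' theorem, P(cause | R) = P(R and cause) / P(R)
posterior = {c: p_leak_and[c] / p_leak for c in p_leak_and}

# Problem 2: grades ~ Normal(mean=77, sd=8.5)
grades = NormalDist(mu=77, sigma=8.5)
p_below_85 = grades.cdf(85)                    # 2.1
p_65_to_87 = grades.cdf(87) - grades.cdf(65)   # 2.2

print("P(cause):", {c: round(p, 4) for c, p in p_cause.items()})
print("P(R):", round(p_leak, 4))
print("P(cause | R):", {c: round(p, 3) for c, p in posterior.items()})
print("P(X < 85):", round(p_below_85, 4))
print("P(65 < X < 87):", round(p_65_to_87, 4))
```

The three posterior probabilities should sum to 1, which is a quick sanity check on the Problem 1 arithmetic.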
2.3 What should be the passing cut-off so that 75% of the students clear the exam?
For 75% of the students to clear the exam, 75% of the grades must lie at or above the cut-off. The cut-off is therefore the grade with a cumulative probability of 0.25 (the 25th percentile), not 0.75. From the standard normal distribution table or a calculator, the Z-score corresponding to a cumulative probability of 0.25 is approximately -0.6745.
Using the Z-score formula: Z = (X - μ) / σ
Substituting the known values: -0.6745 = (X - 77) / 8.5
Solving for X:
X - 77 = -0.6745 * 8.5
X - 77 = -5.73325
X = 71.26675
Therefore, the passing cut-off should be set at approximately 71.27 for 75% of the students to clear the exam.
Problem 3: Business Context
The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on the day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. There are multiple advantages of transmitting news electronically, like faster access to the content and the ability to utilize different technologies such as audio, graphics, video, and other interactive elements that are either not being used or aren’t common yet in traditional newspapers.
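For the passing cut-off asked for in 2.3, a short numeric check is sketched below, again with the standard library only. It assumes that "clearing the exam" means scoring at or above the cut-off, so the cut-off is the 25th percentile of the grade distribution; the variable names are illustrative.

```python
from statistics import NormalDist

# Grades ~ Normal(mean=77, sd=8.5), as stated in Problem 2.
grades = NormalDist(mu=77, sigma=8.5)

pass_fraction = 0.75  # share of students who should clear the exam

# Students at or above the cut-off pass, so the cut-off leaves
# (1 - pass_fraction) of the distribution below it.
cutoff = grades.inv_cdf(1 - pass_fraction)

# Sanity check: share of students scoring at or above the cut-off.
share_clearing = 1 - grades.cdf(cutoff)

print("Passing cut-off:", round(cutoff, 2))
print("Share clearing the exam:", round(share_clearing, 4))
```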
E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyse these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express are of the opinion that there has been a decline in new monthly subscribers compared to the past year because the current web page is not designed well enough in terms of the outline & recommended content to keep customers engaged long enough to make a decision to subscribe.
[Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]
Objective
The design team of the company has researched and created a new landing page that has a new outline & more relevant content shown compared to the old page. In order to test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (control group) and the new landing page to the second group (treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. Being a data scientist in E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:
1. Do the users spend more time on the new landing page than on the existing landing page?
2. Does the converted status depend on the preferred language?
3. Is the mean time spent on the new page the same for the different language users?
3.1 Define the problem and perform an Exploratory Data Analysis
Typical data exploration activity consists of the following steps:
1. Importing Data
2. Variable Identification
3. Variable Transformation/Feature Creation
4. Missing value detection
5. Univariate Analysis
6. Bivariate Analysis
Basic observations
The data frame has a total of 100 rows and 6 columns. Each row holds the information gathered for one user of the online news portal. As expected, the Landing Page values for the Groups "control" and "treatment" are always "old" and "new," respectively.
The data types are:
Numerical:
o Integer: User ID
o Float: Time Spent on the Page
Non-numerical:
o Object: Group, Landing Page, Converted, and Language Preferred
There are no null values in any row or column.
Summary of Stats
The mean, median, minimum, and maximum values of the time spent on the news pages by users are 5.38, 5.42, 0.19, and 10.71 minutes, respectively.
As expected, there are two unique values for the columns Group ("control" and "treatment"), Landing Page ("old" and "new"), and Converted ("yes" or "no"). Converted users outnumber those who did not convert (54 vs. 46).
There are three unique values for Language Preferred: "French", "Spanish", and "English." "French" and "Spanish" have the highest frequencies in the collected sample (34 out of 100 each).
Univariate Analysis
Observations: The time spent on the page appears to be normally distributed and seems to have no outliers.
Observations: It appears that the sample is equally split between the control and treatment groups.
Observations: The average time spent on the page appears to be similar for all the preferred languages. However, users who preferred Spanish appear to have the smallest spread in time spent on the page.
Bivariate Analysis
Observations: Users appear to spend more time on the new landing page compared to the old landing page.
Observations: It appears that those who converted to a subscriber spent more time on the page.
Observations: It appears that users who preferred Spanish and French opted not to convert to a subscriber when viewing the old landing page. However, users of all language preferences converted to subscribers when viewing the new landing page.
Observations: It appears that more people converted to subscribers in the treatment group compared to the control group. Additionally, users in the treatment group spent more time on the page than users in the control group.
1. Do the users spend more time on the new landing page than the existing landing page?
Observations: The average time spent on the new page seems greater than on the old page.
Step 1: Define the null and alternate hypotheses.
H0: The mean time spent by the users on the new page is equal to the mean time spent by the users on the old page.
Ha: The mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.
Step 2: Select Appropriate test.
This is a one-tailed test concerning two population means from two independent populations. The population standard deviations are unknown. Based on this information, a two-sample independent t-test would be the most appropriate.
Step 3: Decide the significance level.
As given in the problem statement, we select α = 0.05.
The sample standard deviation of the time spent on the new page is 1.82.
The sample standard deviation of the time spent on the old page is 2.58.
Based on the sample standard deviations of the two groups, the population standard deviations can be assumed to be unequal.
Two-sample independent t-test assumptions:
Continuous data - Yes, the time spent on the pages is measured on a continuous scale.
Normally distributed populations - Yes, we are informed that the populations are assumed to be normal.
Independent populations - As we are taking random samples for two different groups, the two samples are from two independent populations.
Unequal population standard deviations - As the sample standard deviations are different, the population standard deviations may be assumed to be different.
Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.
The p-value is 0.0001392381225166549. As the p-value is less than the level of significance (0.05), we reject the null hypothesis.
Step 7: Draw inference.
Since the p-value is much less than the level of significance of 5%, the null hypothesis is rejected. This means that there is significant evidence that the mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.
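Because the population standard deviations are treated as unequal, the test above is the unequal-variance (Welch) form of the two-sample t-test. The sketch below shows how its standard error and degrees of freedom follow from the reported sample standard deviations (1.82 and 2.58) and the group size of 50 per page. The function names are hypothetical, and the report does not state the group means, so no t-statistic is computed here. With the raw data, a call such as `scipy.stats.ttest_ind(new, old, equal_var=False, alternative="greater")` is one way to obtain the p-value; whether the author used exactly that call is an assumption.

```python
import math

def welch_se_and_df(sd1, n1, sd2, n2):
    """Standard error of the mean difference and the Welch-Satterthwaite
    degrees of freedom for two independent samples with unequal variances."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, df

def welch_t(mean1, mean2, se):
    """t-statistic for H0: mean1 == mean2, given the Welch standard error."""
    return (mean1 - mean2) / se

# Sample standard deviations from the report; 100 users split equally.
se, df = welch_se_and_df(sd1=1.82, n1=50, sd2=2.58, n2=50)
print("Standard error of the difference:", round(se, 4))
print("Welch degrees of freedom:", round(df, 1))
```

The degrees of freedom come out below the pooled-variance value of 98, which is the usual effect of allowing unequal variances.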
3.4 Does the converted status depend on the preferred language?
Step 1: Define the null and alternate hypotheses.
H0: The converted status is independent of the preferred language.
Ha: The converted status is dependent on the preferred language.
Step 2: Select Appropriate test.
This is a problem of the test of independence, concerning two categorical variables - converted status and preferred language. Based on this information, a chi-square test for independence would be the most appropriate.
Step 3: Decide the significance level.
As given in the problem statement, we select α = 0.05.
Chi-square test for independence assumptions:
Categorical variables - Yes.
Expected value of the number of sample observations in each level of the variable is at least 5 - Yes, the number of observations in each level is greater than 5.
Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.
The p-value is 0.2129888748754345. As the p-value is greater than the level of significance (0.05), we fail to reject the null hypothesis.
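The mechanics of the chi-square test for independence can be sketched without any external library. The contingency counts below are an assumption: the report does not list the language-by-conversion table, so the counts are illustrative values chosen only to match the reported totals (54 converted, 46 not converted; 34 French, 34 Spanish, 32 English). For a 3 x 2 table the test has 2 degrees of freedom, in which case the chi-square upper-tail probability has the closed form exp(-x/2). For general tables, `scipy.stats.chi2_contingency` is the usual tool.

```python
import math

# ASSUMED counts (not from the report): (converted, not converted) per language.
observed = {
    "English": (21, 11),
    "French": (15, 19),
    "Spanish": (18, 16),
}

def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a
    contingency table given as {row_label: (count, count, ...)}."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (obs - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof

stat, dof = chi_square_independence(observed)

# Closed form valid only for 2 degrees of freedom: P(X > x) = exp(-x / 2).
assert dof == 2
p_value = math.exp(-stat / 2)

print("chi-square statistic:", round(stat, 3))
print("degrees of freedom:", dof)
print("p-value:", round(p_value, 3))
```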
Step 7: Draw inference.
Since the p-value is greater than the level of significance of 5%, we fail to reject the null hypothesis. This means that there is not enough evidence to conclude that the converted status depends on the preferred language.
3.5 Is the mean time spent on the new page the same for the different language users?
Step 1: Define the null and alternate hypotheses.
H0: The mean time spent on the new landing page is the same across all preferred languages.
Ha: At least one of the mean times spent on the new landing page is different amongst the preferred languages.
Step 2: Select Appropriate test.
This is a problem concerning three population means. Based on this information, a one-way ANOVA test would be the most appropriate.
Step 3: Decide the significance level.
As given in the problem statement, we select α = 0.05.
Shapiro-Wilk’s test
H0: The time spent on the new landing page follows a normal distribution.
Ha: The time spent on the new landing page does not follow a normal distribution.
Step 2: Select Appropriate test.
This problem involves comparing the means of three independent populations of a continuous variable. Provided that the populations are normally distributed and their variances are equal, and since the samples have been randomly selected, a one-way ANOVA test can be used to compare the population means. To check the normality of the distributions and the equality of their variances, the Shapiro-Wilk’s test and the Levene’s test are performed as follows.
For the Shapiro-Wilk’s test, the p-value is found to be 0.8040016293525696. Since the p-value (0.80400) is much larger than the level of significance (0.05), the null hypothesis cannot be rejected. Hence, the normality assumption is reasonable.
Levene’s test
H0: All the population variances are equal.
Ha: At least one variance is different from the rest.
The p-value is 0.46711357711340173. Since the p-value is larger than the level of significance, we fail to reject the null hypothesis, so the equal-variance assumption is reasonable.
One-way ANOVA test assumptions:
The populations are normally distributed - Yes, the normality assumption is verified using the Shapiro-Wilk’s test.
Samples are independent simple random samples - Yes, we are informed that the collected sample is a simple random sample.
Population variances are equal - Yes, the homogeneity of variance assumption is verified using the Levene's test.
Step 5: Calculate the p-value.
The p-value is 0.43204138694325955. As the p-value is greater than the level of significance (0.05), we fail to reject the null hypothesis.
Step 7: Draw inference.
Since the p-value is greater than the level of significance of 5%, we fail to reject the null hypothesis. This means that there is not enough evidence to conclude that the mean time spent on the new landing page differs across the preferred languages.
3.6 Actionable Insights & Recommendations
Actionable Insights
To answer the question of whether users spend more time on the new landing page than on the existing landing page, a two-sample independent t-test was performed. The test yielded a p-value of 0.0001, which is less than the level of significance of 5%. Therefore, the null hypothesis is rejected. In context, this means that there is significant evidence that the mean time spent by the users on the new page is greater than the mean time spent by the users on the old page.
To answer the question of whether the conversion rate for the new page is greater than the conversion rate for the old page, a two-proportion z-test was performed. The test yielded a p-value of 0.008, which is less than the level of significance of 5%. Therefore, the null hypothesis is rejected. In context, this means that there is significant evidence that the conversion rate of the new landing page is greater than the conversion rate of the old landing page.
To answer the question of whether the conversion status and preferred language are related, a chi-square test for independence was performed. The test yielded a p-value of 0.213, which is more than the level of significance of 5%. Therefore, we fail to reject the null hypothesis. In context, this means that there is no significant evidence that conversion status depends on the preferred language.
To answer the question of whether the time spent on the new landing page differed based on preferred language, a one-way ANOVA test was performed. The test yielded a p-value of 0.432, which is more than the level of significance of 5%. Therefore, we fail to reject the null hypothesis. In context, this means that there is no significant evidence that the mean time spent on the new landing page differs across the preferred languages.
Recommendations:
E-News Express should fully implement the new landing page, as it appears to gain a lot more traction than the old landing page.
The fact that users spend more time on the new landing page than on the old landing page is evidence that they prefer it. It might be beneficial to retire the old landing page, as it shows lower average time spent and a lower conversion rate.
The new landing page has a higher conversion rate; therefore, more resources should be directed towards it, as it offers more opportunity to increase membership.
Deploy the new landing page in all the existing preferred languages. As there is no significant difference in the average time spent on the new page across the preferred languages, and conversion status did not depend significantly on preferred language, the conversion rate to subscribers can be expected to be similar throughout. Perhaps consider adding more languages to the portal to reach a wider audience.
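The two-proportion z-test cited in the insights above is not worked through in the report, so a minimal sketch of the calculation is given here. The per-page conversion counts are an assumption: the report states only that 54 of the 100 users converted and that each page was shown to 50 users, so the split used below (33 on the new page, 21 on the old page) is illustrative. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """One-sided two-proportion z-test of H0: p1 == p2 against Ha: p1 > p2,
    using the pooled proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value

# ASSUMED counts (not from the report): conversions on the new and old page.
z, p_value = two_proportion_z(x1=33, n1=50, x2=21, n2=50)

print("z statistic:", round(z, 3))
print("one-sided p-value:", round(p_value, 4))
```

A pooled standard error is used because the null hypothesis states that both pages share a single conversion rate.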