Assignment 12 - Problems

IE6400 Foundations for Data Analytics Engineering, Fall 2023
Assignment 12, Module 2: Probability

Question 1: Monty Hall Problem Simulation and Analysis

Background: The Monty Hall problem is a famous probability puzzle named after the host of the television game show "Let's Make a Deal." The problem goes as follows: a contestant is presented with three doors. Behind one of them is a car (which the contestant wants), and behind the other two are goats. The contestant selects one of the doors, say Door A. The host, Monty Hall, who knows what's behind each door, opens another door, say Door B, revealing a goat. Monty now asks the contestant if they want to stick with their initial choice (Door A) or switch to the remaining door (Door C). The contestant makes a decision: stick or switch. The question is, is it in the contestant's best interest to stick with their initial choice, to switch, or does it not matter?

Task: Your goal is to simulate the Monty Hall problem using Python and determine the empirical probabilities of winning the car for both strategies: sticking with the initial choice and switching after Monty reveals a goat.

Dataset: 'monty_hall_trials.csv'

In [ ]:
import pandas as pd
import numpy as np

np.random.seed(999)

def simulate_monty_hall(num_trials):
    doors = ['A', 'B', 'C']
    results = []
    for i in range(num_trials):
        car_location = np.random.choice(doors)
        initial_choice = np.random.choice(doors)
        remaining_doors = [door for door in doors
                           if door != initial_choice and door != car_location]
        monty_reveal = np.random.choice(remaining_doors)
        # Simulating equal probability of sticking or switching
        final_decision = np.random.choice(['Stick', 'Switch'])
        if final_decision == 'Stick':
            win = 1 if initial_choice == car_location else 0
        else:
            switch_to = [door for door in doors
                         if door != initial_choice and door != monty_reveal][0]
            win = 1 if switch_to == car_location else 0
        results.append([i + 1, initial_choice, monty_reveal, car_location,
                        final_decision, win])
    return pd.DataFrame(results, columns=['trial', 'initial_choice', 'monty_reveal',
                                          'actual_car_location', 'final_decision', 'win'])

df = simulate_monty_hall(1000)
df.to_csv('monty_hall_trials.csv', index=False)
The dataset contains six columns:
trial: The trial number.
initial_choice: The initial door chosen by the contestant.
monty_reveal: The door Monty reveals to have a goat.
actual_car_location: The door behind which the car is actually located.
final_decision: The contestant's final decision, either "Stick" or "Switch".
win: Whether the contestant won the car (1 for win, 0 for lose).

Requirements:
1. Data Loading and Preprocessing: Load the dataset monty_hall_trials.csv into a Pandas DataFrame. Check for any missing or inconsistent data entries and handle them. Display a summary of the dataset.
2. Simulation Analysis: Calculate the empirical probability of winning the car for both strategies: sticking and switching (see the sketch after the evaluation criteria below).
3. Visualization: Plot a bar chart comparing the winning probabilities for both strategies. Ensure the graph is appropriately labeled with a relevant title and annotations.
4. Interpretation: Discuss the empirical results in the context of the theoretical probabilities. Offer insights into the optimal strategy for a contestant based on the simulation results.

Evaluation Criteria:
Correctness and efficiency of the Python code.
Proper handling and preprocessing of the dataset.
Accurate calculation and interpretation of empirical probabilities.
Quality and clarity of visualizations.
Insightful interpretations and conclusions regarding the Monty Hall problem.
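The following is a minimal sketch of one way to approach requirements 2 and 3, assuming 'monty_hall_trials.csv' generated above is in the working directory and that matplotlib is available; it is a starting point, not a prescribed solution.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the simulated trials and drop any rows with missing values
trials = pd.read_csv('monty_hall_trials.csv').dropna()

# Empirical P(win) per strategy: mean of the 0/1 'win' column within each decision group
win_rates = trials.groupby('final_decision')['win'].mean()
print(win_rates)

# Bar chart comparing the two strategies
ax = win_rates.plot(kind='bar', color=['steelblue', 'darkorange'], rot=0)
ax.set_xlabel('Strategy')
ax.set_ylabel('Empirical probability of winning the car')
ax.set_title('Monty Hall simulation: Stick vs. Switch')
for idx, value in enumerate(win_rates):
    ax.annotate(f'{value:.3f}', (idx, value), ha='center', va='bottom')
plt.tight_layout()
plt.show()

Theoretically, sticking wins with probability 1/3 and switching with probability 2/3, so with 1000 trials the two bars should land close to those values.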
Question 2: Poisson Process Analysis of Website Hits

Background: A Poisson process is a mathematical model for events that happen at random points in time and space, where the average rate of occurrence is constant and known. A common application of this process is modeling the number of times a website is accessed over a given time interval.

Scenario: You are a data analyst at a tech company. The company's main website has been receiving hits, and you suspect that the hits can be modeled as a Poisson process. Your task is to analyze the website hits data and verify whether it indeed follows a Poisson distribution.

Dataset: 'website_hits.csv'

In [ ]:
import pandas as pd
import numpy as np

np.random.seed(12345)

# Generating hits using Poisson distribution
# Assuming mean hits per hour is 6
hits_per_hour = np.random.poisson(lam=6, size=24)
time_intervals = [f"{i}-{i+1}" for i in range(24)]

df = pd.DataFrame({
    'time_interval': time_intervals,
    'hits': hits_per_hour
})
df.to_csv('website_hits.csv', index=False)

The dataset contains two columns:
time_interval: Hourly intervals over a 24-hour period (e.g., "0-1" represents the interval from midnight to 1 AM).
hits: The number of website hits recorded during the corresponding time interval.

Requirements:
1. Data Loading and Preprocessing: Load the dataset website_hits.csv into a Pandas DataFrame. Check for any missing or inconsistent data entries and handle them. Display the basic statistics of the dataset.
2. Poisson Distribution Fitting: Calculate the mean hit rate from the data. Using the calculated mean, generate the expected hit frequencies for each hour if the process follows a Poisson distribution (see the sketch after the evaluation criteria below).
3. Visualization: Plot a bar chart showing both the observed and expected hits for each hourly interval. The bars for observed and expected hits should be placed side by side for comparison. Ensure the graph is properly labeled with a relevant title, legend, and annotations.
4. Hypothesis Testing: Conduct a goodness-of-fit test (e.g., a chi-squared test) to determine whether the observed hits differ significantly from a Poisson distribution with the calculated mean rate.
5. Interpretation: Discuss the results of the visualization and hypothesis test. Provide insights and recommendations to the company based on your findings.

Evaluation Criteria:
Correctness and efficiency of the Python code.
Proper handling and preprocessing of the dataset.
Accurate fitting of the Poisson distribution and calculation of expected frequencies.
Quality and clarity of visualizations.
Thoroughness of hypothesis testing.
Insightful interpretations and conclusions drawn from the analysis.
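For requirements 2 and 4, one commonly used approach is sketched below, assuming scipy is installed and 'website_hits.csv' is in the working directory. Under a Poisson model with rate $\lambda$, the probability of observing $k$ hits in an hour is $P(X = k) = e^{-\lambda}\lambda^{k}/k!$, so the expected number of hours (out of 24) showing $k$ hits is $24 \cdot P(X = k)$. Binning the observed hourly counts and comparing them with these expected counts gives a chi-squared goodness-of-fit statistic; the exact binning below is an illustrative choice, not the only valid one.

In [ ]:
import numpy as np
import pandas as pd
from scipy import stats

hits = pd.read_csv('website_hits.csv').dropna()['hits'].to_numpy()
lam = hits.mean()  # estimated mean hit rate (hits per hour)

# Bin the 24 hourly counts by hit value: 0, 1, ..., k_max - 1, plus a tail bin ">= k_max"
k_max = int(hits.max())
observed = np.array([(hits == k).sum() for k in range(k_max)] + [(hits >= k_max).sum()])

# Expected frequencies under Poisson(lam); the survival function covers the upper tail,
# so the bin probabilities sum to 1 and the expected counts sum to 24 like the observed ones
probs = np.append(stats.poisson.pmf(np.arange(k_max), lam), stats.poisson.sf(k_max - 1, lam))
expected = probs * hits.size

# ddof=1 removes one extra degree of freedom because lam was estimated from the data
chi2, p_value = stats.chisquare(observed, f_exp=expected, ddof=1)
print(f"lambda-hat = {lam:.2f}, chi-squared = {chi2:.2f}, p-value = {p_value:.3f}")

With only 24 observations, several bins will have expected counts below 5; in practice adjacent sparse bins are usually merged before relying on the chi-squared approximation, a step this sketch omits for brevity.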
Question 3: Bayesian Analysis of Product Review Sentiments

Background: Bayesian statistics is a branch of probability theory in which probabilities represent degrees of belief about unknown quantities. It uses Bayes' theorem to update those probabilities as more evidence or information becomes available. In the world of online retail, Bayesian techniques can be used to analyze and update beliefs about product sentiments based on user reviews.

Scenario: Imagine an e-commerce platform that has recently launched a new electronic gadget. The company is interested in understanding customer sentiment about this product based on text reviews. They believe that the reviews can be categorized as "Positive," "Neutral," or "Negative." The company has a prior belief about the sentiment distribution but wants to update this belief based on observed reviews.

Dataset: 'product_reviews.csv'

In [ ]:
import pandas as pd
import numpy as np

np.random.seed(56789)

# Generating review sentiments
sentiments = ["Positive", "Neutral", "Negative"]
probabilities = [0.55, 0.25, 0.2]
reviews_count = 1000
generated_sentiments = np.random.choice(sentiments, size=reviews_count, p=probabilities)

review_texts = [
    "Loved it! Amazing product.",
    "It's okay. Does the job.",
    "Not what I expected. Disappointed.",
    "Works like a charm!",
    "Mediocre experience.",
    "Wouldn't recommend to anyone."
]

df = pd.DataFrame({
    'review_id': range(1, reviews_count + 1),
    'text': np.random.choice(review_texts, reviews_count),
    'sentiment': generated_sentiments
})
df.to_csv('product_reviews.csv', index=False)

The dataset contains three columns:
review_id: A unique identifier for each review.
text: The text content of the review.
sentiment: The actual sentiment of the review; this is to be used for analysis and not to be shown directly in the output.

Prior Belief: The company believes the following prior distribution of sentiments for its product:
$P(\text{Positive}) = 0.5$
$P(\text{Neutral}) = 0.3$
$P(\text{Negative}) = 0.2$
Requirements:
1. Data Loading and Preprocessing: Load the dataset product_reviews.csv into a Pandas DataFrame. Check for any missing or inconsistent data entries and handle them. Display a summary of the sentiment distribution.
2. Bayesian Updating: Using the given prior beliefs and the observed reviews, update the probabilities of the sentiments using Bayes' theorem (one possible approach is sketched at the end of this document).
3. Visualization: Plot a bar chart comparing the prior beliefs, observed frequencies, and posterior probabilities for each sentiment category. Ensure the graph is properly labeled with a relevant title, legend, and annotations.
4. Interpretation: Discuss the significance of the shift from prior to posterior beliefs. Provide insights into the perceived quality of the product based on the updated beliefs.

Evaluation Criteria:
Correctness and efficiency of the Python code.
Proper handling and preprocessing of the dataset.
Accurate Bayesian updating based on the provided prior and observed data.
Quality and clarity of visualizations.
Insightful interpretations and conclusions from the Bayesian analysis.
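One way to read requirement 2, sketched below, is to treat the stated prior probabilities as a Dirichlet prior over the three sentiment categories and update it with the observed sentiment counts; the posterior mean then gives the updated belief. The prior strength of 100 pseudo-reviews is an assumption made here for illustration and is not specified in the assignment, and other readings of "Bayesian updating" (for example, updating review by review) are equally valid.

In [ ]:
import pandas as pd

reviews = pd.read_csv('product_reviews.csv').dropna()
counts = reviews['sentiment'].value_counts()

# Prior belief stated in the assignment
prior = {'Positive': 0.5, 'Neutral': 0.3, 'Negative': 0.2}

# Dirichlet-multinomial update: alpha_k = strength * prior_k + observed_count_k,
# where 'strength' is an assumed number of pseudo-reviews behind the prior
strength = 100
alpha = {s: strength * prior[s] + counts.get(s, 0) for s in prior}
total = sum(alpha.values())
posterior = {s: a / total for s, a in alpha.items()}

observed_freq = {s: counts.get(s, 0) / len(reviews) for s in prior}
for s in prior:
    print(f"{s:<8}  prior = {prior[s]:.2f}  observed = {observed_freq[s]:.3f}  "
          f"posterior = {posterior[s]:.3f}")

Because 1000 observed reviews far outweigh 100 pseudo-reviews, the posterior is pulled strongly toward the observed frequencies, which is the kind of prior-to-posterior shift that requirement 4 asks you to discuss.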