Entity Academy Lesson 6 Statistical Inference

Sindy Saintclair
Monday, November 28 2021
Lesson 6 – Statistical Inference

Learning Objectives and Questions / Notes and Answers

Sampling Methods
Balance accuracy with practicality.
Sampling: taking a subset of the population and making assertions about the entire population by observing only that subset.
- population: everyone (not as practical)
- sample: some (more practical)
The risk is that, because you are purposely excluding most of the households in the city, you are obtaining incomplete information. You can take multiple samples and they will all be slightly different, with no way to tell which one is the most accurate. The advantage is that my workload is dramatically decreased if I decide to sample.

Sample Size
Sample size is referred to as n. For instance, if you talk to 30 people out of a larger group, this is a sample size of 30, or n = 30.

Simple Random Sample
Everyone has an equal chance of being selected.
- drawing names out of a hat
- scientists will assign each person a number and have Excel or Python select someone

Cluster Sampling
Randomly select a group, not an individual.

Stratified Sampling
Usually demographic in nature.
Example:
Population: 10,000      Sample: 100
Females: 20% = 2,000    Females: 20% = 20
Males: 80% = 8,000      Males: 80% = 80

Systematic Sampling
Convenience Sampling
Sampling only those that are easy to sample, for instance the people closest to you, the smallest file, or the beginning of the alphabet: usually whatever is convenient for the scientist.

Sample Size: Number of People in the Sample
How many people do you need in your sample?
- enough to represent the population accurately
- not so many that it becomes impractical

Simple Random Sampling
Often referred to as SRS. Every potential candidate for data collection has the same probability of being chosen as every other candidate, so who gets selected is random. The best way to do this is to assign all candidates a unique number, and then have a random number generator select which of those candidates should be part of the sample. This method provides the best samples but may not always be used because it can be logistically difficult or expensive.

Sampling Method Examples
I work for a healthcare provider. My team is tasked by a federal agency to add to the knowledge base on 10th grade students by collecting medical information. The population is all 10th grade students in the state of Ohio. You will collect height, weight, and hearing test data for each child sampled.

Simple Random Sampling Example
Each 10th grader in Ohio is assigned a number from 1 to 38,559 (that is the number of 10th graders in the state). A random number generator is used to select 3,000 of these numbers, with no repeats. Your team will spread out across the state to meet with each of these 3,000 students and collect the health data you need.

Cluster Sampling
Occurs when the population is broken down into groups based on some sort of information such as location, age, or gender. The grouping is usually demographic in nature, but not always. Then a few of the groups, called clusters, are randomly selected, and within each chosen cluster you sample at 100%, which means that everyone in a chosen cluster becomes part of the sample and everyone in a non-chosen cluster is not.

Example: Rather than looking at individual students as sampling candidates, you decide to look at each school as a sampling candidate. There are 270 high schools in Ohio, so you randomly select 18 of these high schools and then sample every 10th grader at each of these 18 high schools. No data will be collected from the 10th graders at the high schools that weren't selected.

Stratified Sampling
Occurs when the population is divided into strata, usually based on demographic characteristics such as gender, age, or education level, and then within each stratum candidates are randomly chosen. The strata are sampled according to their relative size within the population. For example, if the largest stratum has 5x as many candidates as the smallest stratum, then the final sample should have five times as many sampled candidates from the largest stratum as from the smallest.

In this image, the population has been stratified, or broken into groups, based on a particular characteristic (color). However, unlike cluster
sampling, you don't randomly select the strata and sample everyone in them. Instead, you randomly sample within each stratum at a rate proportionate to how large that stratum is. In this image, since the purple stratum is the largest, you will randomly select more people from that stratum than from either the red or the green strata, which are smaller.

Example: There are some huge high schools in Ohio (3,500+ students), and there are some smaller schools and charter schools with as few as 50 students. I break the schools into three strata: schools with 1,500 or more students, schools with 400 to 1,499 students, and schools with 399 or fewer students. I will label these as big, medium, and small schools. The big schools are where 45% of Ohio 10th grade students go to school, with 38% going to medium-sized schools and 17% going to small schools. I want my total sample size to be 3,000 students, so I randomly select 3,000 x 45% = 1,350 students from the big schools, 3,000 x 38% = 1,140 students from the medium schools, and 3,000 x 17% = 510 students from the small schools. This way, I have broken the population down into strata and sampled from each stratum proportionally, so that each is fairly represented in the final sample of 3,000 students.

Systematic Sampling
Systematic sampling happens when a start point is chosen at random, and then every nth item is selected after that. Because only the start point is random, the sampling as a whole is no longer random.

Example
Each 10th grader is assigned a number again, and then a random number between 1 and 13 is generated by a random number generator. Suppose you generate the number 10. Then every 13th student after that is selected. In this case, that would be student numbers 10, 23, 36, 49, 62, etc. Sampling in this manner will get you pretty close to a sample size of 3,000 (38,559 / 13 is about 2,966).

Convenience Sampling
When someone makes little or no effort to randomly sample and collects the information that is easiest to get. Convenience sampling is usually the most biased type of sampling but, unfortunately, also the most common.

Example
Right across the street from your facility is a high school. You send a team member over there to measure all of the 10th graders at that school. One of your team members has a spouse who teaches high school, so tomorrow she goes with her husband to the high school and measures all of the 10th grade students there. Another team member has a cousin who is the school nurse at a high school in another city, so you ask the cousin to collect data from the 10th grade students at that high school, and so forth. It turns out that at some of the high schools there was a volleyball tournament, so some of the 10th graders were missing because they were at the tournament, and another high school had a bunch of students missing because of band camp, but you just don't care.

Choosing a Sampling Method
Dependent on one's sampling goals and on what is most practical. Most academics believe that convenience sampling should be avoided whenever possible. When it cannot be avoided, the trick is to approach the convenience sampling in a thoughtful rather than a 'lazy' manner and to incorporate some aspects of randomization within the constraints of the convenience sample.

For example, a manufacturer is interested in how much of their output gets returned for being defective. Their goal is to sample all returned materials and then read through the description of reasons for the return, because some of it gets returned for defects, but there are many other reasons for returns. Most of the returns data comes back through vendors, and some vendors won't share the information, some don't collect usable data, and some collect plenty of data but in languages the manufacturer does not know. So the manufacturer carefully selects the vendors that provide useful data, and then uses some randomization to look at returns for different products, from different vendors, and at different times of the year. They purposely ignore a lot of data because it just isn't useful to them.

Systematic sampling doesn't make much sense in the 10th graders example, because you haven't made the sampling any easier than simple random sampling would be. All you have done is systematically select students instead of randomly picking them. However, if you were auditing quality on a manufactured item such as a wireless keyboard, and you are standing in the warehouse to select the keyboards you will test, systematic sampling makes perfect sense because you have shelves and shelves of finished product waiting to be shipped. Pulling out every 40th keyboard is no problem at all.

Why Sampling and Analysis?

Why Should You Sample?
- it takes a lot of time and effort to reach an entire population
- not every person will consent in a
population anyway
- because you want to know something about a bigger set, usually called a population
- it helps me develop estimates of what is going on with the whole population, and usually can provide a confidence interval around that estimate
- balance rigor and feasibility

Why Analyze Data?
From a practicality perspective:
- is 1.1 really bigger than 1?
- is 2.5 really bigger than 2.3?

Hypothesis Testing
Statistical inference has been called the science of comparison. Mathematics has a lot of branches, and most of them are academic and theoretical.
Null: H0, equal
Alternative: Ha or H1, not equal, greater than, or less than

Steps of Hypothesis Testing
- create a hypothesis
- calculate a test statistic
- calculate the probability (p value)
- determine alpha
- decision rule

Determine Alpha (α)
α level     Confidence    Accepted chance of error
α = 0.01    99%           1%
α = 0.05    95%           5%
α = 0.10    90%           10%

Alpha is a pre-determined number that represents how often you are willing to draw a conclusion from the data but be wrong. I will compare my p value with the alpha value to determine whether something is significant or not. The most typical value for alpha is 0.05. This means I am willing to accept a 5% chance of rejecting a null hypothesis that is actually true, which is often loosely described as being 95% confident in the result. However, on the more rigorous side, I may also see alpha = 0.01,
meaning that I am willing to accept only a 1% chance of rejecting a null hypothesis that is actually true. On the less rigorous side, from time to time I may see an alpha of 0.10, meaning that I am willing to accept a 10% chance of that error.

Creating a Hypothesis
Statistical inference is used to test hypotheses. But to test a hypothesis, it first has to be created. The most important part of creating a hypothesis is to remember that the null hypothesis must contain the '=' sign, either explicitly or implicitly.

Example #1) You have invented a new drug to treat HTN (hypertension). You are sure your drug is better than the most commonly used current drug, but as you develop a hypothesis, you hypothesize that the two drugs are exactly the same in effectiveness. What you really want to do is prove yours is better, but you start by assuming they are both equally good.
H0: new drug = baseline drug
Ha: new drug > baseline drug

Example #2) You are taking a survey and you believe that the crab cakes sold at "The Shack" are better than the crab cakes sold at "Seafood Supreme." Your hypothesis should be that both crab cakes are equally desirable to customers, and you hope to use data to disprove the hypothesis.
H0: The Shack = Seafood Supreme
Ha: The Shack > Seafood Supreme

Example #3) You want to compare baseball players from 2 different eras, and you think one is better than the other. Your hypothesis is that they are equally skilled, and then you hope to use data to disprove that hypothesis.
H0: Player A = Player B
Ha: Player A ≠ Player B

Example #4) You want to collect some data to see if movie preference is dependent on gender. Your
hypothesis is that movie preference is the same for both genders, and you then collect data to see whether the hypothesis should be rejected or not.
H0: movie preference is independent of gender
Ha: movie preference depends on gender

Calculating a Test Statistic
A test statistic is calculated under the assumption that the null hypothesis is true (which is the assumption in every case of hypothesis testing). Know the distribution of the test statistic, assuming that the null hypothesis is true. Then you can calculate the probability.

Calculating the Probability
You have already done this a bunch, but now it is formalized. If you know what a distribution looks like, you can calculate the probability of getting a value greater than or less than the test statistic value. The statistical term for this probability is the p value.

Decision Rule
If the p value is less than alpha, I should reject the null hypothesis in favor of the alternative hypothesis. However, if the p value is greater than alpha, I fail to reject the null hypothesis. These notes also call this "accepting" the null, although strictly it only means there was not enough evidence to reject it.

Type I and Type II Error
When trying to make inferences about a population from a sample, there will always be some amount of error no matter what you do.
Type I (α, alpha): reject the null when you should have accepted it. Alpha is the probability of making this error when the null is true.
Type II (β, beta): accept the null when you should have rejected it.

              Reject Null    Accept Null
Null True     Type I         Correct
Null False    Correct        Type II

ART-BAF stands for:
A: Alpha    R: Reject    T: True
B: Beta     A: Accept    F: False

EXAMPLES

COURTROOM ANALOGY
The null hypothesis is "not guilty," because everyone is innocent until proven guilty.
- If the null is false (the person is guilty), then the person is found guilty if I reject the null. If I chose to accept the null, that would be a Type II error.
- If the null is true (the person is innocent), then the person goes free if I accept the null. If I chose to reject the null, that would be a Type I error.

Type II Error Example: if the defendant is guilty, but the jury finds him "not guilty," they have committed a Type II error.
Type I Error Example: if the defendant did not commit the crime, but the jury finds him guilty anyway, the jury has committed a Type I error.

A Real-Life Example of Type II Error
OJ Simpson murder trial. In short, the test doesn't determine whether or not A = B; it determines whether there is enough convincing evidence that A and B are different. In the same way, trials don't determine innocence or guilt, but whether the prosecution has enough convincing evidence to find the defendant guilty. Insufficient evidence leads to a verdict of not guilty.
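
The decision rule and the Type I / Type II table from this lesson can be sketched in a few lines of Python. This is only an illustration, not part of the original lesson: the function names `decide` and `error_type` are made up for this sketch, and treating a p value exactly equal to alpha as "fail to reject" is an assumption, since the notes only cover the less-than and greater-than cases.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Decision rule: reject H0 only when the p value is less than alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Assumption: p_value == alpha is treated as "fail to reject".
    return "reject H0" if p_value < alpha else "fail to reject H0"


def error_type(null_is_true: bool, decision: str) -> str:
    """Look up the outcome in the Reject/Accept x True/False table (ART-BAF)."""
    rejected = decision == "reject H0"
    if null_is_true and rejected:
        return "Type I error"    # Alpha: Reject a True null
    if not null_is_true and not rejected:
        return "Type II error"   # Beta: Accept a False null
    return "correct"


if __name__ == "__main__":
    # The same p value can lead to different decisions at different alpha levels.
    for alpha in (0.01, 0.05, 0.10):
        print(f"p = 0.04, alpha = {alpha}: {decide(0.04, alpha)}")

    # Courtroom analogy: the defendant is guilty (null is false),
    # but the jury returns "not guilty" (fails to reject the null).
    print(error_type(null_is_true=False, decision="fail to reject H0"))
```

In practice you never know whether the null is really true, so `error_type` is only useful for reasoning through examples like the courtroom analogy, not for checking a real test result.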