stat503_Spring_2024_hw1

STAT 503 – Statistical Methods for Biologists Modified: 2024-01-08

STAT 503 – Statistical Methods for Biology
Homework 1
30 Points, Due at 11:59 PM on Friday, January 19, 2024

1. [Total: 3 points] In your own words, please explain the difference between each of the following terms.

a. [1 point] A statistical population and a statistical sample.
A statistical population consists of all the individuals of interest in a study, whereas a statistical sample is only a subset of that population, which should represent the population's characteristics.

b. [1 point] A statistical sample and a biological sample.
A statistical sample consists of multiple units drawn from the population, whereas a biological sample usually refers to a single unit of the population.

c. [1 point] An ordinal variable and a nominal variable.
An ordinal variable is a categorical (qualitative) variable whose categories have a logical order, such as age classes. A nominal variable is a categorical variable whose categories have no inherent order, such as the different majors of study in a department of biological sciences.

2. [Total: 4 points, 0.5 point each blank] Fill in the blank with the terminology listed below (each word can be used more than once).
distribution, relative frequency, population, parameter, sample, statistic, variable

1) A _____ variable _____ is a trait or characteristic that describes the members of a population (e.g., height, weight, hair color, etc.).
2) A _____ distribution _____ is a mathematical function that describes the _____ relative frequency _____ (i.e., commonness or rarity) of different values of a variable in a population.
3) A parameter is a _____ population _____ quantity. It is used to describe some characteristic of a _____ population _____. Usually, we cannot measure a parameter. It is an unknown constant that we would like to estimate.
4) A _____ statistic _____ is any quantity computed from values in a sample.
5) A _____ statistic _____ describes a sample. A _____ parameter _____ describes a population.

3. [Total: 8 points] For each of the following scenarios, please identify (i) the sample unit [0.5 pts], (ii) the response variable [0.5 pts], (iii) the specific variable type for the response (i.e., continuous, discrete, nominal, or ordinal) [0.5 pts], and (iv) the explanatory variable [0.5 pts] (if there is no explanatory variable, write None). Complete sentences are not required.

a. [2 points] A study used a particular strain of lab mice, C57BL/6, to examine the effects of nutrition on diabetes risk. A total of 36 male lab mice were maintained on one of three randomly assigned diets (high-sugar, high-fat, or control). After two months, the mice were classified into one of four levels of insulin resistance: normal, mild, moderate, or severe.
Sample unit: Male lab mouse
Response variable: Insulin resistance level
Specific variable type for the response: Ordinal
Explanatory variable: Diet type

b. [2 points] Koella et al. (1998, Proceedings of the Royal Society of London: Series B Biological Sciences 265:763-768) investigated the hypothesis that mosquitoes infected by Plasmodium (the cause of malaria) bite more people than uninfected mosquitoes do. They captured 262 mosquitoes that had human blood in their guts. For each mosquito, they determined whether the mosquito was infected with malaria, and used a genetic technique to determine whether it had fed on only 1 or > 1 person.
Sample unit: Mosquito with human blood in its gut
Response variable: Number of people the mosquito fed on (1 or > 1)
Specific variable type for the response: Nominal
Explanatory variable: Malaria infection status

c. [2 points] Ten people volunteered for a study to investigate the relationship between sleep deprivation and cortisol levels (cortisol is a stress hormone). Half of the subjects were randomly assigned to sleep for 8 hours, and the other half slept for 4 hours.
In each volunteer, blood samples were collected at 6:00 am on two successive days, before the subject ate breakfast, and the cortisol level on the first day was subtracted from the level on the second day. The change in cortisol levels over the 24-hour period was compared between the two groups.
Sample unit: Person
Response variable: Change in cortisol level over 24 hours
Specific variable type for the response: Continuous
Explanatory variable: Sleep time (hours)

d. [2 points] Blind forms of the cavefish Astyanax mexicanus spend much less time sleeping than their non-blind, surface-living conspecifics. Duboué et al. (2011, Current Biology 21:671-676) measured the number of minutes that each of 23 blind cavefish spent sleeping, out of a 24-hour day.
Sample unit: Blind cavefish
Response variable: Sleeping time (minutes)
Specific variable type for the response: Discrete
Explanatory variable: None

4. [Total: 5 points] An important quantity in conservation biology is the number of plant and animal species inhabiting a given area. To survey the community of small mammals inhabiting Kruger National Park in South Africa, a large series of live traps were placed randomly throughout the park for one week in the main dry season of 2014. Traps were set each evening and checked the following morning. Individuals caught were identified, tagged (so that new captures could be distinguished from recaptures), and released. At the end of the survey, the total number of small mammal species in the park was estimated by the total number of species captured in the survey.

a. [1 point] What is the parameter being estimated in the survey?
The total number of small mammal species in Kruger National Park.

b. [2.5 points] Is the sample of individuals captured in the traps likely to be a random sample? Why or why not? In your answer, address the two criteria [1 point for each criterion] that define a sample as random.
Because the traps were set only overnight (from evening to the following morning), nocturnal animals are more likely to be captured than diurnal animals.
Therefore, the individuals do not have an equal chance of being captured. However, capturing one individual does not appear to affect the capture of any other, so the selection appears to be independent.
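The equal-chance criterion can be illustrated with a small calculation in R. This is only a sketch: the abundances and capture probabilities below are hypothetical values chosen for illustration, not data from the survey.

```r
# Hypothetical numbers for illustration only; none come from the Kruger survey.
n_animals <- c(nocturnal = 500, diurnal = 500)    # assumed true abundances
p_capture <- c(nocturnal = 0.30, diurnal = 0.05)  # assumed chance that an individual is trapped

# Expected number of captured individuals in each group
expected_caught <- n_animals * p_capture

# Share of each group in the population versus in the expected catch
true_share   <- n_animals / sum(n_animals)
caught_share <- expected_caught / sum(expected_caught)

print(round(rbind(true_share, caught_share), 3))
```

Under these assumed values, nocturnal animals make up half of the population but about 86% of the expected catch (150 of 175 captures), so the catch does not represent the population.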
c. [1.5 points] Is the number of species in the sample likely to be an unbiased (accurate) estimate of the total number of small mammal species in the park? Why or why not?
Since the sampling cannot be considered random and some groups can be over- or underrepresented, the estimate is likely to be biased. In addition, the number of species captured can never exceed the true number of species, so the estimate tends to be too low.

5. [Total 2 points] Suppose that to investigate the effect of exercise on blood triglyceride levels, 60 healthy subjects who do not exercise are randomly selected from a population, and their triglyceride levels are measured in mmol/L. Half are then assigned to an exercise regime that includes 1 hour of aerobic exercise, repeated three times a week (assume they follow the protocol). After one month, all of the participants' triglyceride levels are measured a second time, and the change in triglyceride levels is recorded for each individual. How will the differences in methodology described below affect the (i) accuracy and (ii) precision of our estimates for the change in triglyceride? Remember that accuracy and precision refer to the results that we would see if we were to repeat the same study many times.

a. [1 point] We reduce the number of sample units from 60 to 30 (i.e., from 30/group to 15/group).
A smaller sample reduces precision: if the study were repeated many times, each repetition would draw a different set of subjects, and the estimates would vary more from one repetition to the next. Accuracy is not affected, because the subjects are still randomly selected from the population, so the estimates would still be centered on the true value.

b. [1 point] Instead of randomly sampling individuals from the general population, we recruit them from the student population at our local university.
University students are generally younger than the general population, and therefore tend to be healthier, and they are also more uniform.
This type of sampling reduces the accuracy of the estimate but increases its precision because of that uniformity.

6. [Total: 2 points] In lecture, we discuss randomization in two different contexts: random selection of sample units from the population, and random assignment of treatments to the individual sample units within a study. Please explain how these two forms of randomization differ from each other. What goals do they serve? [1 point each]
Random Selection: Individuals are sampled at random from the population so that the sample represents the overall characteristics of the population, which results in unbiased estimates.
Random Assignment to Treatment: Each sample unit is randomly assigned to a treatment group, with the goal of balancing the variation among units across the groups so that it does not bias the comparison of treatments.

7. [Total: 6 points] This exercise will begin building your coding skills by asking you to think about how you approach unfamiliar code and requiring you to debug problematic code. When you are working with a new script, do not try to run it all at once. Be systematic. Take the time to run it line-by-line and look at the output after each line. Try to understand what each part of each line does, and why it is written the way that it is. If there is an error, try to figure out what caused the error, and fix it.

The code on the next page has 7 statements, labeled A-G. It also contains several errors. Please copy and paste the code as-is to a new script in RStudio. Then answer questions 7a-b. Possible resources include:
1) Run each line of code to see what happens.
2) Look at the help file for any unfamiliar functions.
3) See Tutorial 1 and the material in the Cheat sheets folder on the Brightspace.
4) Check Google (e.g., “R qplot”)
5) Discuss the problem with your classmates or TA (I recommend that you try each question first before seeking help from others).

You will know that you have successfully debugged the code when you see this graph in the Plots pane in RStudio:

[Figure: scatter plot of n_seeds (y-axis, ticks 10 to 30) against soil_temp_c (x-axis, ticks 16 to 21)]

# seed germination # A
n_seeds <- c(11, 10, 20, 11, 22, 16, 16, 25, 24, 27, 27, 30) # B
soil_temp_c <- c(15.8, 15.5, 16.5, 15.4, 16.2, 16.3 # C
16.6, 17.2, 17.5, 17.4, 18.3, 21.0)
qplot(n_seeds, bins=5) # D
library(ggplot2) # E
mean(n_seeds) # F
qplot(soil_tmp_c, N_seeds) # G

a. *[3 points] Briefly explain what each line of code does, after debugging. See the entry for line #B as an example. If you rearrange the lines during debugging, make sure to keep their original labels with them. [0.5 point per line]

Line   Interpretation
#A     A comment that gives the script a title.
#B     creates a vector of 12 numeric values and stores it in memory with the name "n_seeds."
#C     Creates a vector of 12 numeric values and stores it in memory with the name "soil_temp_c."
#D     Uses the qplot function to draw a histogram of the values in the "n_seeds" vector.
#E     Loads the ggplot2 package.
#F     Calculates the mean of the values in the "n_seeds" vector.
#G     Draws a scatter plot of "n_seeds" (y-axis) against "soil_temp_c" (x-axis).

b. [2 points] Describe each of the mistakes that you found in the code, and briefly explain how you fixed each one. This does not need to be complicated; something like, "In line #X, the object name thisBuggyCode should not have any capitals in it, so I changed it to thisbuggycode," is fine.

1- Line #C was missing a comma between 16.3 and 16.6, so I added it.
2- Line #D calls qplot(), a function from the ggplot2 package, before the package is loaded in line #E. To solve this problem, I installed the package using install.packages("ggplot2") and moved line #E above line #D.
3- When I reran line #D after loading the package, the argument bins = 5 produced a histogram with only a few wide bars. I replaced it with qplot's default of 30 bins.
4- In line #G, the object names were misspelled: soil_tmp_c should be soil_temp_c, and N_seeds should be n_seeds (R is case-sensitive). I corrected the names and labeled each argument explicitly, as below:
qplot(x = soil_temp_c, y = n_seeds) #G

c. [1 point] Paste a copy of your final, debugged script here. You will be REQUIRED to include your R script with all future homework assignments that involve any work with R.

install.packages("ggplot2")
# seed germination
n_seeds <- c(11, 10, 20, 11, 22, 16, 16, 25, 24, 27, 27, 30)
soil_temp_c <- c(15.8, 15.5, 16.5, 15.4, 16.2, 16.3, 16.6, 17.2, 17.5, 17.4, 18.3, 21.0)
library(ggplot2)
qplot(n_seeds, bins = 30)
mean(n_seeds)
qplot(x = soil_temp_c, y = n_seeds)
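A variant of the final script is sketched below with the original line labels kept, as part 7a requests. It makes two assumptions that are not part of the assignment: the plots are skipped rather than failing when ggplot2 is not installed, and a length check is added before plotting. Note that qplot() is deprecated in ggplot2 3.4.0 and later, so a deprecation warning is expected there.

```r
# seed germination                                             # A
n_seeds <- c(11, 10, 20, 11, 22, 16, 16, 25, 24, 27, 27, 30)   # B
soil_temp_c <- c(15.8, 15.5, 16.5, 15.4, 16.2, 16.3,           # C (comma restored after 16.3)
                 16.6, 17.2, 17.5, 17.4, 18.3, 21.0)

# Added check (not in the original script): a scatter plot needs paired values.
stopifnot(length(n_seeds) == length(soil_temp_c))

seed_mean <- mean(n_seeds)                                     # F
print(seed_mean)

# Assumption: only plot when ggplot2 is available, instead of stopping with an error.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)                                             # E (moved above D)
  print(qplot(n_seeds, bins = 5))                              # D
  print(qplot(soil_temp_c, n_seeds))                           # G (object names corrected)
} else {
  message('ggplot2 is not installed; run install.packages("ggplot2") to see the plots.')
}
```

The mean is stored in seed_mean only so that it can be checked or reused later; calling mean(n_seeds) directly, as in the original line #F, gives the same value.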