439 sample midterm

School

Rutgers University

Course

439

Subject

Mathematics

Date

Apr 3, 2024

Pages

9

Uploaded by MasterEnergyCapybara27

Question 1 - Multiple Choice - 16 points
This question covers multiple topics. Each question is worth 2 points.

1. Suppose that the locations of 100 sites are given as the latitude and longitude of each site. What is the best way to visualize this data?
   (a) histogram
   (b) scatter plot
   (c) bar plot
   (d) KDE plot

2. Your letter grade (e.g., A+, A, B+, ...) in a class that grades on a curve is most accurately described as what kind of data?
   (a) ordinal
   (b) nominal
   (c) none of the above

3. The set that consists of all possible values of a random variable is called a
   (a) random range
   (b) sample space
   (c) specific range
   (d) none of the above

4. SVD and PCA applied to a matrix can be used to
   (a) factor matrices
   (b) find linearly independent columns
   (c) reduce dimensions
   (d) all of the above
   (e) none of the above

5. A data scientist must always consider potential sources of bias in a given dataset.
   (a) True
   (b) False
   (c) Maybe
6. Which data formats would be well suited for nested data? Select all that apply.
   (a) .csv
   (b) .xml
   (c) .ipynb
   (d) .json
   (e) .tsv

7. Which of the following are reasonable motivations for applying a log transformation? Select all that apply.
   (a) Perform dimensionality reduction on the data.
   (b) To help straighten relationships between pairs of variables.
   (c) Removing missing values.
   (d) Bring data distribution closer to random sampling.
   (e) To help visualize highly skewed distributions.

8. The return type of the pandas.DataFrame.groupby function can either be a DataFrame or a Series object.
   (a) TRUE
   (b) FALSE

Question 2 - EDA - 12 points

1. Suppose you are given a data set that contains the stock market performance from January 1, 1981 to January 1, 2019. The presidents during this period were Reagan (8 yrs), H.W. Bush (4 yrs), Clinton (8 yrs), W. Bush (8 yrs), Obama (8 yrs), and Trump (2 yrs). The performances are given by the following chart. During EDA, what are some questions one can ask? We are looking for 3 brief, but good, questions/observations.
   (a) Question 1:
   (b) Question 2:
   (c) Question 3:

2. During the data cleaning process, is it always a good idea to remove records that contain missing values? Briefly justify your answer.

3. TRUE or FALSE. Exploratory data analysis is the process of testing key hypotheses.

4. TRUE or FALSE. The structure of the data describes how it is formatted and organized.

5. TRUE or FALSE. Throughout the process of exploratory data analysis it is often necessary to transform and clean data.

Question 3 - Data Visualization - 12 points

1. What is the best data type description for home prices in New York City? Circle the answer and briefly justify.
   (a) Nominal
   (b) Ordinal
   (c) Quantitative
   (d) Numerical

2. Justification:

3. Consider the following graph that shows registered male and female names and the year in which they were sampled. The graph seems to show an unlikely phenomenon. Assuming the data are valid, briefly explain what might have caused this.

4. TRUE/FALSE. The descriptive statistics of a data set, such as mean and variance, are good metrics for understanding the distribution of the data. Justify your answer (briefly).

5. For each of the following cases, choose the ideal plot type from:
   1D: bar chart, histogram
   2D: scatter plot, line plot, box-and-whisker plot, heatmap
   3D: scatter matrix, bubble chart
   (a) Plot 10,000 student grades consisting of letters A, B, C, D, F.
   (b) Compare chicken and beef prices from 50 states for 1 year of data (365 data points).
   (c) Compare the average, median, max, and min temperature in 3 different counties.
   (d) Density of traffic in New York City during rush hour.

6. Consider the following heatmap showing the height-weight distribution of Americans. State two important facts revealed by this chart. Please be brief.
   (a)
   (b)

Question 4 - KDE - 15 points

1. What is the purpose of using kernel density estimators (KDE) in visualizing data? Explain in 2-4 sentences.

2. Consider the following histogram. Draw the best KDE that you think will represent this data.

3. The equation for the Triweight density function is given by $K(u) = \frac{15}{16}(1 - u^2)^2$ where $|u| < 1$. Show that this function satisfies all 3 properties of a kernel density function. That is,
   (a) $K(u)$ is non-negative for all $u$
   (b) $K(u)$ is symmetric, that is, $K(u) = K(-u)$
   (c) $\int_{-1}^{1} K(u)\,du = 1$
   (Hint: $\int_{-1}^{1} 1\,du = 2$, $\int_{-1}^{1} u^2\,du = 2/3$, and $\int_{-1}^{1} u^4\,du = 2/5$.)
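
A numerical sanity check can complement the analytic argument requested in item 3. The sketch below assumes the kernel exactly as stated in the question, K(u) = (15/16)(1 - u^2)^2 for |u| < 1 and zero elsewhere (a form more commonly called the biweight or quartic kernel). The helper names `kernel` and `midpoint_integral` are hypothetical and chosen only for illustration; the check samples a finite grid, so it supports the three properties but does not prove them.

```python
def kernel(u: float) -> float:
    """Kernel as stated in item 3: (15/16) * (1 - u^2)^2 for |u| < 1, else 0."""
    if abs(u) >= 1:
        return 0.0
    return 15.0 / 16.0 * (1.0 - u * u) ** 2


def midpoint_integral(f, a: float, b: float, n: int = 100_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Sample points in [-1.5, 1.5] so the region outside the support is covered too.
grid = [i / 1000 for i in range(-1500, 1501)]

non_negative = all(kernel(u) >= 0 for u in grid)                      # property (a)
symmetric = all(abs(kernel(u) - kernel(-u)) < 1e-12 for u in grid)    # property (b)
area = midpoint_integral(kernel, -1.0, 1.0)                           # property (c)

print(non_negative, symmetric, round(area, 6))
```

Expanding (1 - u^2)^2 = 1 - 2u^2 + u^4 and applying the three integrals from the hint gives the same area analytically: (15/16)(2 - 4/3 + 2/5) = 1.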

Question 5 - Probability Fundamentals - 15 points

1. Let X be the random variable that represents the outcome of a coin toss. Suppose a "biased" coin has P(X = H) = 0.2. Write down the entries in the sample space for tossing a biased coin twice, and their probabilities.
   P(H,H) =

2. Using the formulas $E[X] = \sum_x x \cdot P(X = x)$ and $\mathrm{var}[X] = E[X^2] - (E[X])^2$, compute the expected value $E[X]$ and the variance $\mathrm{var}[X]$.

3. If a coin is biased, is there a way to agree to a trial where you have a 50-50 chance of winning regardless of the coin's bias? Briefly explain your answer.

4. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?

5. There is a 30% chance that a user will click an advertisement on the page. It is known from past data that about 80% of the users who click on the ad buy the product. What percentage of people both clicked on the ad and bought the product?

6. Suppose that the probability distribution of two random variables, Weather and Cavity, is given by the following chart. Answer the following questions. Show all work.
   (a) What is P(weather = sunny) =
   (b) What is P(weather = sunny | cavity = yes) =
       What is P(cavity = yes | weather = sunny) =
   (c) Are the two random variables weather and cavity independent? Justify your answer.

7. It is suspected that children sleeping with night lamps develop nearsightedness. From a data sample, the following probabilities were observed: P(night lamp) = 0.4, P(nearsightedness) = 0.25, P(night lamp ∩ nearsightedness) = 0.2. Do you think that night lamps could be responsible for nearsightedness? Justify your answer.

8. It was also suspected that having a nearsighted parent may be responsible for the nearsightedness of a child. The following data were observed. Of the 100 nearsighted parents sampled, 24% of their children grew up with a night lamp and were nearsighted. There is a 60% chance that a nearsighted parent will have a nearsighted child. It is also known that approximately 40% of the nearsighted parents slept with a night lamp. Does this data validate or invalidate the claim that children sleeping with night lamps might develop nearsightedness? Justify your answer.
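
The arithmetic behind items 1, 2, 4, 5, and 7 of Question 5 can be checked with a short enumeration. The sketch below is one possible way to do it, not a model answer, and it rests on assumptions that the questions leave open: the two coin tosses are independent, H is encoded as 1 and T as 0 for the expectation, each child is equally likely to be a girl or a boy, and the value 0.2 in item 7 is read as the joint probability of night lamp and nearsightedness. All variable names are hypothetical.

```python
from itertools import product

# Item 1: two tosses of a coin with P(H) = 0.2 (tosses assumed independent).
p = {"H": 0.2, "T": 0.8}
two_tosses = {a + b: p[a] * p[b] for a, b in product("HT", repeat=2)}

# Item 2: E[X] and var[X] for one toss. Encoding H -> 1, T -> 0 is an assumption.
value = {"H": 1, "T": 0}
mean = sum(value[s] * p[s] for s in p)
variance = sum(value[s] ** 2 * p[s] for s in p) - mean ** 2

# Item 4: enumerate equally likely two-child families (assumed),
# then condition on "at least one girl".
families = [a + b for a, b in product("GB", repeat=2)]
with_girl = [f for f in families if "G" in f]
p_two_girls = with_girl.count("GG") / len(with_girl)

# Item 5: product rule, P(click and buy) = P(click) * P(buy | click).
p_click_and_buy = 0.3 * 0.8

# Item 7: under independence the joint would equal the product of the marginals.
p_lamp, p_near, p_both = 0.4, 0.25, 0.2
independent = abs(p_both - p_lamp * p_near) < 1e-9

print({k: round(v, 4) for k, v in two_tosses.items()})
print(round(mean, 4), round(variance, 4))
print(round(p_two_girls, 4), round(p_click_and_buy, 4), independent)
```

A failed independence check in item 7 shows only that the two events are associated in the sample; it does not by itself establish that night lamps cause nearsightedness, which is the point item 8 goes on to probe.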

Question 6 - Linear Algebra - 15 points

1. Given two vectors A = [3, 4] and B = [6, 8], find the cosine of the angle between A and B.

2. Given two vectors A = [3, 4] and B = [6, 8], find the projection of A onto B.

3. Suppose a data file contains 100,000 observations with 47 features each, represented as a 100,000 x 47 matrix. What is the maximum possible number of linearly independent rows/observations in the matrix, and why?

4. An eigenvalue x and eigenvector v of a matrix A are given by the equation $Av = xv$. Explain the geometric interpretation of an eigenvalue and an eigenvector.

5. Consider the following matrix A.
   1 0 0 1 1 1
   The SVD of the matrix A yields the following results, where the matrices $U$, $D$, and $V^T$ are shown. Write a vector expression for the rank-2 approximation of the original matrix A. Use only one decimal place in the answers. Do not simplify the answer.
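
For items 1 and 2, the standard definitions are cos(theta) = (A . B) / (|A| |B|) and proj_B(A) = ((A . B) / (B . B)) B. The following sketch applies them to the vectors given in the question using only the Python standard library; the function names `dot`, `cosine`, and `project` are hypothetical helpers introduced here for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project(u, v):
    """Projection of u onto v: ((u . v) / (v . v)) * v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * x for x in v]


A = [3, 4]
B = [6, 8]

print(cosine(A, B))
print(project(A, B))
```

Because B is a scalar multiple of A, the two vectors point in the same direction, which is worth keeping in mind when interpreting both results.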

Question 7 - Text processing - 15 points

1. Suppose three documents d1, d2, d3 are represented by the vectors d1 = [1, 0, 2, 5, 0], d2 = [0, 1, 0, 0, 10], d3 = [1, 0, 1, 3, 0], where each vector represents the word counts for each word (5 different words across all documents).
   (a) Which two of these documents are "similar"? Why?
   (b) Which two of these documents are significantly different from each other? Why?
   (c) Are these vectors linearly independent or dependent? How would you interpret that in the context of documents? Show work for linear independence and explain.

2. Solve the following regex problems.
   (a) Write a regex for an identifier that must start with an upper-case alphabetic character and contain a total of 5 to 10 upper- or lower-case alphanumeric characters.
   (b) What would the following line of code return? There are no spaces in any of the strings.
       re.findall(r"\..*", "VIXX-Error.mp3.bak")
       [note: findall(pattern, text) returns a list of matches]
   (c) What would this return?
       re.findall(r"[cat|dog]", "bobcat")
   (d) What would this return?
       re.findall(r"a?p*[le]", "apple")
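
The regex items in part 2 can be explored directly with Python's `re` module. The sketch below is illustrative rather than authoritative: the three `findall` patterns are read as `\..*`, `[cat|dog]`, and `a?p*[le]`, which is a reconstruction of the garbled extracted text and therefore an assumption, and the identifier pattern reflects one reading of item (a), namely one upper-case letter followed by 4 to 9 further alphanumeric characters. The name `identifier` and the sample strings passed to it are hypothetical.

```python
import re

# Item (a), under one reading: an upper-case letter first, 5 to 10 characters
# in total, all alphanumeric. fullmatch anchors the pattern to the whole string.
identifier = re.compile(r"[A-Z][A-Za-z0-9]{4,9}")

print(bool(identifier.fullmatch("Data2024")))   # upper-case start, 8 characters
print(bool(identifier.fullmatch("data2024")))   # lower-case start
print(bool(identifier.fullmatch("Abc")))        # only 3 characters

# Item (b): "\." is a literal dot; ".*" then greedily consumes the rest of the line.
dot_match = re.findall(r"\..*", "VIXX-Error.mp3.bak")

# Item (c): square brackets define a character class, so "|" is just another
# member of the set, not alternation between "cat" and "dog".
class_match = re.findall(r"[cat|dog]", "bobcat")

# Item (d): optional "a", zero or more "p", then exactly one of "l" or "e".
apple_match = re.findall(r"a?p*[le]", "apple")

print(dot_match)
print(class_match)
print(apple_match)
```

Writing `(cat|dog)` or `cat|dog` instead of `[cat|dog]` would match the whole words, which is the distinction item (c) is designed to test.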