TML_Assignment_2

pdf

School

University at Buffalo *

*We aren’t endorsed by this school

Course

6261

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf

Pages

6

Uploaded by ElderOstrich640

Report
CIS 6261: Trustworthy Machine Learning (Spring 2023) Homework 2 — Data Privacy & Differential Privacy Name: Ankireddypalli Sravani March 22, 2023 This is an individual assignment. Academic integrity violations (i.e., cheating, plagiarism) will be reported to SCCR! The official CISE policy recommended for such offenses is a course grade of E. Additional sanctions may be imposed by SCCR such as marks on your permanent educational transcripts, dismissal or expulsion. Reminder of the Honor Pledge: On all work submitted for credit by Students at the University of Florida, the following pledge is either required or implied: “On my honor, I have neither given nor received unauthorized aid in doing this assignment.” Instructions Please read the instructions and questions carefully. Write your answers directly in the space provided. Compile the tex document and hand in the resulting PDF. For this assignment, you will solve several data privacy problems. The third problem asks you to implement a differential privacy mechanism using Python3. Use the code skeleton provided and submit the completed source file(s) alongside with the PDF. Note: bonus points you get on this assignment *do* carry across assignments. 1
Problem 1: Syntactic Metrics (20 pts) Consider the data set depicted in Table 2 . Answer the following questions. (Justify your answers as appropriate.) Age Zip Code Sex Diagnosis 40-49 32605 M COVID-19 30-39 32607 M Broken Leg 30-39 32607 M Cancer 40-49 32611 F Heart Disease 20-29 32607 F Asthma 20-29 32607 F Heart Disease 40-49 32611 M Hypertension 40-49 32611 F COVID-19 Table 1: Anonymized Data Set 1. 1. (5 pts) What are the quasi-identifier(s)? What are the sensitive attribute(s)? Quasi-identifiers are a collection of attributes that can be combined with external dataset to uniquely identify an entry in dataset. In the dataset given above Age,Zip Code,Sex are Quasi-identifiers. Sensitive attributes are those that we want to safeguard or keep secret in a dataset. These could include things like personal preferences, financial information, or medical information. In the dataset given above Diagnosis is a Sensitive attribute. 2. (5 pts) What is the largest integer k such that the data set satisfies k -anonymity? What is the largest integer l such that the data set satisfies l -diversity? K - 2 l - 2 3. (10 pts) Modify the data set using generalization and suppression to ensure that it satisfies 4- anonymity and 3-diversity. Here we are looking for a solution that minimally affects the utility of the data. Write the modified data set below. Age Zip Code Sex Diagnosis 40-49 326xx * COVID-19 30-39 326xx * Broken Leg 30-39 326xx * Cancer 40-49 326xx * Heart Disease 20-29 326xx * Asthma 20-29 326xx * Heart Disease 40-49 326xx * Hypertension 40-49 326xx * COVID-19 Table 2: Modified Data Set. 2
Problem 2: Differential Privacy Mechanisms (40 pts) Social science researchers at the University of Florida want to conduct a study to explore the prevalence of crime among students. They have a list of crimes and for each they want to know if the proportion of students that committed a specific crime exceeds some threshold t (0 , 1). For example: is the proportion of students who have ever stolen a bike larger than t = 0 . 05? Researchers are ethical so they want to carefully design the study to ensure that participants respond truthfully and that privacy is protected. They reached out to you, a CIS 6261 student, to evaluate their methods. Suppose X is the database of students’ binary answers (yes (1) or no (0)) to the question of whether they have ever committed a specific crime and let t be a threshold t (0 , 1). Define g ( X ) to be number of yes (1) answers in the database X and let n = | X | denote the number of students in the database. Also define the following thresholding function parameterized by a threshold t : thresh t : R → { 0 , 1 } such that thresh t ( x ) = 1 if x t and thresh t ( x ) = 0 otherwise (i.e., if x < t ). Let f t be the query function. It is defined as follows. On input database X , let f t ( X ) = thresh t ( g ( X ) | X | ). In other words: f t ( X ) = 1 if the proportion of yes (1) in X is equal or larger than t (i.e., if g ( X ) n t ) and f t ( X ) = 0 otherwise ((i.e., if g ( X ) n < t )). Answer the following questions. 1. (5 pts) What is ∆ g , the global sensitivity of g ? Give the definition of global sensitivity and then give the answer. Definition: A single data point added or withdrawn from a dataset can influence the output of a function by a maximum amount known as Global sensitivity. let f be a function that takes as input a dataset D, and let D’ be a dataset that differs from D by at most one data point. The global sensitivity of f is defined as: g = max { D, D’ } abs(f(D) - f(D’)) where abs denotes the absolute value or magnitude of the difference between two quan- tities. 2. (5 pts) Now let’s turn to f t the actual query function. What is ∆ f t , the global sensitivity of f t ? f t = Your answer here. 3. (5 pts) Does ∆ f t depend on the thresold t ? (Why or why not?) Your answer here. 4. (5 pts) Give an example of a database X (with a least 5 entries) such that the local sensitivity of f t given threshold t = 0 . 3 is 0. First give the definition of local sensitivity then answer the question. Definition: Your answer here. X = Your answer here. Now let’s consider the following mechanisms. For each of them, you are asked to prove or disprove whether they satisfy differential privacy. (To prove differential privacy you need to show that the mech- anism satisfies the definition. To disprove you can simply give a counterexample.) Recall the definition of (pure) ε -differential privacy. 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
Definition 1. A randomized algorithm F satisfies ε - differential privacy (for ε > 0 ) if for any two neighboring databases X, X and any S Range( F ) : Pr {F ( X ) S } ≤ e ε Pr {F ( X ) S } . 5. (5 pts) Mechanism A: F ( X ) = f t ( X ). That is the mechanism simply returns the output of f t . Prove or disprove. Your answer here. 6. (5 pts) Mechanism B: F ( X ) = thresh 0 . 5 ( f t ( X ) + z ), where z Lap(0 , b ) for b = f t ε . That is the mechanism computes the output of f t , adds Laplace noise to it, and then thresholds the value to 0.5 before returning it. Prove or disprove. Your answer here. 7. (5 pts) Mechanism C: F ( X ) = thresh t ( g ( X )+ z | X | ), where z Lap(0 , b ) for b = g ε . That is the mechanism computes the output of g , adds Laplace noise to it (calibrated to g), and then thresholds the value to t before returning it. Prove or disprove. Your answer here. 8. (5 pts) Mechanism D: F ( X ) = flip p ( f t ( X )), where flip p : { 0 , 1 } → { 0 , 1 } is a function to randomly flip the bit based on a biased coin flip p [0 , 1] (probability of heads is p ). Let flip p ( x ) = 1 x with probability p (the coin comes up heads) and flip p ( x ) = x with probability 1 p (the coin comes up tails). In other words, this mechanism computes the output of f on the database and then randomly flips it with probability p . For what value of p (if any) does mechanism D satisfy ε -differential privacy? Prove or disprove. State the relationship (if any) between p and ε . Your answer here. 4
Problem 3: Implementing DP Mechanisms (40 pts) For this problem you will implement several differential privacy mechanisms we talked about in class. Please use the comments in the Python files provided to guide you in the implementation. For this question, we will use the dataset data/ds.csv . It contains GPA and three favorite CISE courses for several students. For the purpose of calculating sensitivity, assume that the GPA range for any student is [1.0, 4.0]. 0. (5 pts) What is the (global) sensitivity of mean gpa query() ? (You can assume that the size of the dataset is known.) Your answer here. 1. (5 pts) Fill in the implementation of laplace mech() , gaussian mech() . Also fill in the (global) sensitivity in the mean gpa query() function. Note that the function also returns its privacy budget. You can test your implementation by running: ’ python3 hw1.py problem3.1 ’. (The provided code allows you to optionally specify the privacy budget epsilon (default ε = 1 . 0) as the last command line argument.) How close are the noisy answers to the true answer? Your answer here. 1b. ([Bonus] 5 pts) Implement truncate gpa() to make any noisy GPA value fall within the valid range (i.e., between 1.0 and 4.0). Does using truncate gpa() after the Gaussian or Laplace mechanism as done in the main () function preserve differential privacy? Why or why not? Your answer here. Now we want to use the Exponential mechanism to find the most popular CISE course. Fill in the implementation of exp mech() . The function takes a quality function qual score fn() and also its sensitivity delta qual score . Similarly to other mechanisms, the function also returns its privacy budget. (Hint: take a look at the data file before answering the following questions.) 2. (15 pts) Propose a quality score function q ( X, r ) to compute the most popular CISE course. What is the global sensitivity? Your answer here. Now implement your quality function in qual score most popular course() . Now test your im- plementation by running: ’ python3 hw1.py problem3.2 ’. Paste the plots it produced here. How close are the noisy answers to the true answer? Does it depend on the privacy budget epsilon ( ε )? Explain what you observe. Your answer here. 3. (5 pts) Now calculate the privacy budget of computing the mean GPA query using the Gaussian mechanism and then invoking the Exponential mechanism 10 times to get (10 random samples of) the most popular CISE course. Give an expression for both naive/sequential composition and advanced composition. The expression should be in terms of ε and δ assuming each query satisfies ( ε 0 , δ 0 )-DP. Naive composition: Your answer here. 5
Advanced composition: Your answer here. 4. (5 pts) Given you answer to the previous question, which composition theorem would you apply in which scenario? (Justify your answer.) Your answer here. 5. (5 pts) Now go back to the code in main () for problem 3.2 and locate the hardcoded course list that the Exponential mechanism uses. Why is the list of courses not computed directly from the dataset? (Justify your answer.) Your answer here. 6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help