tml

pdf

School

University at Buffalo *

*We aren’t endorsed by this school

Course

6261

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf

Pages

Uploaded by ElderOstrich640

CIS 6261: Trustworthy Machine Learning (Spring 2023) Homework 2 — Data Privacy & Differential Privacy Name: Your Name Here March 22, 2023 This is an individual assignment. Academic integrity violations (i.e., cheating, plagiarism) will be reported to SCCR! The official CISE policy recommended for such offenses is a course grade of E. Additional sanctions may be imposed by SCCR such as marks on your permanent educational transcripts, dismissal or expulsion. Reminder of the Honor Pledge: On all work submitted for credit by Students at the University of Florida, the following pledge is either required or implied: “On my honor, I have neither given nor received unauthorized aid in doing this assignment.” Instructions Please read the instructions and questions carefully. Write your answers directly in the space provided. Compile the tex document and hand in the resulting PDF. For this assignment, you will solve several data privacy problems. The third problem asks you to implement a differential privacy mechanism using Python3. Use the code skeleton provided and submit the completed source file(s) alongside with the PDF. Note: bonus points you get on this assignment *do* carry across assignments. 1

Problem 1: Syntactic Metrics (20 pts) Consider the data set depicted in Table 1 . Answer the following questions. (Justify your answers as appropriate.) Age Zip Code Sex Diagnosis 40-49 32605 M COVID-19 30-39 32607 M Broken Leg 30-39 32607 M Cancer 40-49 32611 F Heart Disease 20-29 32607 F Asthma 20-29 32607 F Heart Disease 40-49 32611 M Hypertension 40-49 32611 F COVID-19 Table 1: Anonymized Data Set 1. 1. (5 pts) What are the quasi-identifier(s)? What are the sensitive attribute(s)? In this dataset, the quasi-identifiers are the combination of Age, Zip Code, and Sex, because When linked with external data, quasi-identifiers are a collection of attributes or variables in a dataset that can potentially identify people. These characteristics do not immediately reveal an individual’s identity, but when combined with additional data, they can be used to re-identify or de-anonymize someone. The sensitive attribute in this dataset is the diagnosis column because a sensitive attribute is an attribute or variable in a dataset that contains private or sensitive data regarding an individual. 2. (5 pts) What is the largest integer k such that the data set satisfies k -anonymity? What is the largest integer l such that the data set satisfies l -diversity? Your answer here. 3. (10 pts) Modify the data set using generalization and suppression to ensure that it satisfies 4- anonymity and 3-diversity. Here we are looking for a solution that minimally affects the utility of the data. Write the modified data set below. The intervals 40-49 have four rows, and all other remaining intervals have four rows as well, thus we have four anonymity and three diversity. Age Zip Code Sex Diagnosis 40-49 326** * COVID-19 30-39 326** * Broken Leg 30-39 326** * Cancer 40-49 326** * Heart Disease 20-29 326** * Asthma 20-29 326** * Heart Disease 40-49 326** * Hypertension 40-49 326** * COVID-19 2

Problem 2: Differential Privacy Mechanisms (40 pts) Social science researchers at the University of Florida want to conduct a study to explore the prevalence of crime among students. They have a list of crimes and for each they want to know if the proportion of students that committed a specific crime exceeds some threshold t ∈ (0 , 1). For example: is the proportion of students who have ever stolen a bike larger than t = 0 . 05? Researchers are ethical so they want to carefully design the study to ensure that participants respond truthfully and that privacy is protected. They reached out to you, a CIS 6261 student, to evaluate their methods. Suppose X is the database of students’ binary answers (yes (1) or no (0)) to the question of whether they have ever committed a specific crime and let t be a threshold t ∈ (0 , 1). Define g ( X ) to be number of yes (1) answers in the database X and let n = | X | denote the number of students in the database. Also define the following thresholding function parameterized by a threshold t : thresh t : R → { 0 , 1 } such that thresh t ( x ) = 1 if x ≥ t and thresh t ( x ) = 0 otherwise (i.e., if x < t ). Let f t be the query function. It is defined as follows. On input database X , let f t ( X ) = thresh t ( g ( X ) | X | ). In other words: f t ( X ) = 1 if the proportion of yes (1) in X is equal or larger than t (i.e., if g ( X ) n ≥ t ) and f t ( X ) = 0 otherwise ((i.e., if g ( X ) n < t )). Answer the following questions. 1. (5 pts) What is ∆ g , the global sensitivity of g ? Give the definition of global sensitivity and then give the answer. Definition: The maximum absolute difference in the output of g for any two neighboring databases, where neighboring databases differ by at most one individual’s data, is defined as the global sensitivity of g. Let X and X’ to be two neighboring databases that differ by no more than one individual’s data. The global sensitivity of g is then: ∆ g = max ∥ g ( X ) − g ( X ′ ) ∥ , where ∥ denotes the absolute value. In this situation, g(X) simply represents the number of students in X who answered ”yes” to a certain crime, therefore g(X) is a count of ”yes” answers. Because each person’s answer can only vary by one (from 0 to 1 or from 1 to 0), the maximum absolute difference between g(X) and g(X’) is one. 2. (5 pts) Now let’s turn to f t the actual query function. What is ∆ f t , the global sensitivity of f t ? ∆ f t = max ∥ f t ( X ) − f t ( X ′ ) ∥ = max ∥ thresh t ( g ( X ) / | X | ) − thresh t ( g ( X ′ ) / | X ′ | ) ∥ . (Where the formula gives the absolute value). The threshold function affects g(X). In this case, g(X) is the number of students whose value of thresh t ( x ) = 1 3. (5 pts) Does ∆ f t depend on the thresold t ? (Why or why not?) Yes,the global sensitivity ft is affected by the threshold t. This is due to the thresholding function thresh being depending on the threshold t, and hence the value of f t ( X ) being dependent on t. In other words, as t varies, the thresh- olding function will transfer distinct g(X)/ ∥ X ∥ values to 0 or 1, resulting in different values of f t ( X ) . 3

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

4. (5 pts) Give an example of a database X (with a least 5 entries) such that the local sensitivity of f t given threshold t = 0 . 3 is 0. First give the definition of local sensitivity then answer the question. Definition: Local sensitivity evaluates how much a function’s output varies when a single record in the input database is added, removed, or updated. X = Suppose we have a database X consisting of 5 binary values indicating whether a student has committed a certain crime or not: x= { 1,0,1,1,1 } then g(x) is 4 and x is 5 then g ( x ) / | x | is 0.8 greater than threshold limit(0.3) and f t is 1 .If we modify the 4th values as to 0 then X1 is { 1,0,1,0,1 } then g(x1) is 3 and | X 1 | is 5 then g ( x 1) / | X | is 0.6 it is greater than threshold limit f t is 1 .now calculating the local sensitivity is difference between f t ( x ) and f t ( x 1) i.e | 1 − 1 | =0 Now let’s consider the following mechanisms. For each of them, you are asked to prove or disprove whether they satisfy differential privacy. (To prove differential privacy you need to show that the mech- anism satisfies the definition. To disprove you can simply give a counterexample.) Recall the definition of (pure) ε -differential privacy. Definition 1. A randomized algorithm F satisfies ε - differential privacy (for ε > 0 ) if for any two neighboring databases X, X ′ and any S ⊆ Range( F ) : Pr {F ( X ) ∈ S } ≤ e ε Pr {F ( X ′ ) ∈ S } . 5. (5 pts) Mechanism A: F ( X ) = f t ( X ). That is the mechanism simply returns the output of f t . This mechanism does not satisfy ε -differential privacy. Consider two neighboring databases, X and X’, which differ only in one element. Assume X has a 0 in that entry and X’ has a 1.Let t represent the f t thresholdvalue. letsbesetofpossibleoutcome = 0,1 andeventbetheoutputoff t is 1 . p = pr F ( X ) ∈ S = Pr { f t ( X ) } = 1 p 1 = pr F ( X 1 ) ∈ S = Pr { f t ( X 1 ) } = 1 Now, since X and X’ differ in only one entry, the output of ft on these databases can differ by at most 1/n, where n is the size of the database. This is because if X and X’ have n entries and differ in only one entry, then the maximum difference in the number of 1s between X and X’ is 1. Now, suppose t = 0.5, and let = ln (2) . Then we have: e = e l n (2) = 2 Let X be the database where all entries are 0, except for one entry which is 1. Let X’ be the database where all entries are 0. Then we have: p=1/n p 1 = 0 Pr { p = 1 /n } ≤ Pr { p 1 = 0 } 6. (5 pts) Mechanism B: F ( X ) = thresh 0 . 5 ( f t ( X ) + z ), where z ∼ Lap(0 , b ) for b = ∆ f t ε . That is the mechanism computes the output of f t , adds Laplace noise to it, and then thresholds the value to 0.5 before returning it. Prove or disprove. Your answer here. 7. (5 pts) Mechanism C: F ( X ) = thresh t ( g ( X )+ z | X | ), where z ∼ Lap(0 , b ) for b = ∆ g ε . That is the mechanism computes the output of g , adds Laplace noise to it (calibrated to g), and then thresholds the value to t before returning it. Prove or disprove. Your answer here. 4

8. (5 pts) Mechanism D: F ( X ) = flip p ( f t ( X )), where flip p : { 0 , 1 } → { 0 , 1 } is a function to randomly flip the bit based on a biased coin flip p ∈ [0 , 1] (probability of heads is p ). Let flip p ( x ) = 1 − x with probability p (the coin comes up heads) and flip p ( x ) = x with probability 1 − p (the coin comes up tails). In other words, this mechanism computes the output of f on the database and then randomly flips it with probability p . For what value of p (if any) does mechanism D satisfy ε -differential privacy? Prove or disprove. State the relationship (if any) between p and ε . Your answer here. 5

Problem 3: Implementing DP Mechanisms (40 pts) For this problem you will implement several differential privacy mechanisms we talked about in class. Please use the comments in the Python files provided to guide you in the implementation. For this question, we will use the dataset data/ds.csv . It contains GPA and three favorite CISE courses for several students. For the purpose of calculating sensitivity, assume that the GPA range for any student is [1.0, 4.0]. 0. (5 pts) What is the (global) sensitivity of mean gpa query() ? (You can assume that the size of the dataset is known.) The mean GPA query’s global sensitivity is determined by how the mean GPA is cal- culated and the amount of the dataset. If the mean GPA is calculated by averaging all GPAs in the dataset, the mean GPA query’s global sensitivity is 3/n, where n is the size of the dataset. This is due to the fact that modifying a single individual’s GPA can only affect the mean GPA by 3/n, which is the individual’s contribution to the overall average. 1. (5 pts) Fill in the implementation of laplace mech() , gaussian mech() . Also fill in the (global) sensitivity in the mean gpa query() function. Note that the function also returns its privacy budget. You can test your implementation by running: ’ python3 hw1.py problem3.1 ’. (The provided code allows you to optionally specify the privacy budget epsilon (default ε = 1 . 0) as the last command line argument.) How close are the noisy answers to the true answer? Both the Laplace noisy (3.22) and Gaussian noisy (2.77) responses are reasonably close to the true mean GPA (3.24). In contrast, the Gaussian approach appears to have added more noise into the result than the Laplace process. The Gaussian mechanism may be more sensitive due to the nature of its noise distribution. While the privacy budget () for both strategies is the same, the change in noise level is most likely due to mechanism selection and inherent features. 1b. ([Bonus] 5 pts) Implement truncate gpa() to make any noisy GPA value fall within the valid range (i.e., between 1.0 and 4.0). Does using truncate gpa() after the Gaussian or Laplace mechanism as done in the main () function preserve differential privacy? Why or why not? Your answer here. Now we want to use the Exponential mechanism to find the most popular CISE course. Fill in the implementation of exp mech() . The function takes a quality function qual score fn() and also its sensitivity delta qual score . Similarly to other mechanisms, the function also returns its privacy budget. (Hint: take a look at the data file before answering the following questions.) 2. (15 pts) Propose a quality score function q ( X, r ) to compute the most popular CISE course. What is the global sensitivity? Your answer here. Now implement your quality function in qual score most popular course() . Now test your im- plementation by running: ’ python3 hw1.py problem3.2 ’. Paste the plots it produced here. How close are the noisy answers to the true answer? Does it depend on the privacy budget epsilon ( ε )? Explain what you observe. 6

Your preview ends here

Eager to read complete document? Join bartleby learn and gain access to the full version

Access to all documents
Unlimited textbook solutions
24/7 expert homework help

Your answer here. 3. (5 pts) Now calculate the privacy budget of computing the mean GPA query using the Gaussian mechanism and then invoking the Exponential mechanism 10 times to get (10 random samples of) the most popular CISE course. Give an expression for both naive/sequential composition and advanced composition. The expression should be in terms of ε and δ assuming each query satisfies ( ε 0 , δ 0 )-DP. Naive composition: Your answer here. Advanced composition: Your answer here. 4. (5 pts) Given you answer to the previous question, which composition theorem would you apply in which scenario? (Justify your answer.) Your answer here. 5. (5 pts) Now go back to the code in main () for problem 3.2 and locate the hardcoded course list that the Exponential mechanism uses. Why is the list of courses not computed directly from the dataset? (Justify your answer.) Your answer here. 7

tml

Related Documents