Question
Data Protection & Cybersecurity. Data Anonymization Exercises
Exercise on statistical disclosure control
You have found the following sanitized dataset that was released in 2020:
Relationship    | Gender | Age   | City of birth | Favorite TV Series
single          | male   | 19-25 | Málaga        | Game of Thrones
*               | female | 16-18 | León          | Game of Thrones
in relationship | male   | 12-15 | Barcelona     | Friends!
in relationship | female | 19-25 | Valencia      | Big Bang Theory
in relationship | female | 19-25 | Madrid        | Big Bang Theory
single          | female | 19-25 | Málaga        | Game of Thrones
single          | male   | 16-18 | León          | Game of Thrones
single          | female | 12-15 | Barcelona     | Game of Thrones
in relationship | male   | 19-25 | Valencia      | Big Bang Theory
single          |        |       |               |
Additionally, you found a dataset with a TV show ratings survey. In the survey, a rating of 1
means "Definitely, I do not like it" and 5 means "I am a great fan":
Name    | Email                  | TV Show         | Rating (1 is bad)
Alice   | alice1995@email.com    | Friends!        | 1
Bob     | bobbybob@email.com     | Friends!        | 4
Charlie | s9charchar@email.com   | Friends!        | 2
Eve     | evelyn@myhighscool.com | Friends!        | 1
Bob     | bobbybob@email.com     | Game of Thrones | 1
Alice   | alice1995@email.com    | Game of Thrones | 5
Charlie | s9charchar@email.com   | Game of Thrones | 5
Bob     | bobbybob@email.com     | Big Bang Theory | 3
Charlie | s9charchar@email.com   | Big Bang Theory | 5
Alice   | alice1995@email.com    | Big Bang Theory | 2
Eve     | evelyn@myhighscool.com | Big Bang Theory | 5
Assume that all candidates are present in both databases.
1. Where was Alice most likely born, and what is her most likely relationship status? How do
you infer this solution? Hint: There is enough data for a unique solution.
2. What can you learn about Charlie?
3. And what can you learn about Bob?
4. Which are the quasi-identifiers in the first dataset?
5. Does the first dataset satisfy k-anonymity? What is the value of k?
6. Suppose you are required to have a minimum of 2-anonymity, and you are allowed to
add "invented" rows to achieve it. How many rows do you have to invent?
7. What approach do you propose to achieve 2-anonymity without "inventing" data?
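
Questions 5 and 6 can be checked mechanically. The following is a minimal sketch, not a reference solution. It assumes that Gender, Age and City of birth are the quasi-identifiers (which is itself what question 4 asks you to decide) and uses only the nine released rows in which all three values are legible. The function names are illustrative.

```python
from collections import Counter

# (gender, age range, city of birth) for the nine released rows in which
# all three values are legible. Treating exactly these three columns as the
# quasi-identifiers is an assumption made for this sketch.
ROWS = [
    ("male", "19-25", "Málaga"),
    ("female", "16-18", "León"),
    ("male", "12-15", "Barcelona"),
    ("female", "19-25", "Valencia"),
    ("female", "19-25", "Madrid"),
    ("female", "19-25", "Málaga"),
    ("male", "16-18", "León"),
    ("female", "12-15", "Barcelona"),
    ("male", "19-25", "Valencia"),
]


def k_anonymity(rows):
    """Return k, the size of the smallest group of rows that share
    the same quasi-identifier values (the smallest equivalence class)."""
    return min(Counter(rows).values())


def rows_to_invent(rows, k):
    """Count the invented rows needed when every equivalence class
    smaller than k is padded with copies until it reaches size k."""
    return sum(max(0, k - size) for size in Counter(rows).values())


print(k_anonymity(ROWS))        # 1: every combination occurs only once
print(rows_to_invent(ROWS, 2))  # 9: one extra row per singleton class
```

Under these assumptions every row is its own equivalence class, so padding is expensive. Generalizing or suppressing quasi-identifier values (question 7) shrinks the number of distinct combinations instead of adding rows.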