Data Sharing: A Guide to Anonymization
Storing and sharing data requires a thoughtful balance of privacy protection and business needs. Use this resource as a guide to consider your current practices and drive change in your organization.
Step 1: Take Stock Before You Share
Take stock of the data that you collect, store, and share. What types of data do you collect? How sensitive is that data? How precise is it?
Precision drives identifiability. The more precise, or specific, the data is, the greater the risk to the person it describes if the data is compromised. Keep two relationships in mind to protect data privacy.
Data precision should always have an INVERSE relationship with:
- Access
and
- Retention period
TL;DR: Precise data should be short-lived and less accessible. Aggregated data can be available to more people and stored for longer periods of time.
Step 2: Minimize and Coarsen Sensitive Data
Delete - or better yet never collect - sensitive data that you don't need.
Next, coarsen any remaining sensitive data that presents a privacy risk. Coarsening data means making it less precise using tactics like:
- Replacing personally identifiable data with internally generated unique values, or with values generated by a keyed pseudorandom function
- Rounding values such as timestamps to be less specific
- Converting or truncating coordinates such as GPS coordinates to represent a broader area
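The three tactics above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended implementation: it assumes HMAC-SHA256 as the keyed pseudorandom function, truncation to the hour for timestamps, and two decimal places (roughly a 1 km grid at the equator) for coordinates. The key value and the function names are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime
from typing import Tuple

# Hypothetical key for illustration only; a real key should come from a
# secrets manager and never be stored alongside the data it protects.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable token using HMAC-SHA256."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def truncate_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds, and microseconds so only the hour remains."""
    return ts.replace(minute=0, second=0, microsecond=0)


def coarsen_coordinates(lat: float, lon: float, decimals: int = 2) -> Tuple[float, float]:
    """Round coordinates so they describe a broader area instead of a point."""
    return (round(lat, decimals), round(lon, decimals))
```

The same input always yields the same token under a given key, so pseudonymized records can still be joined for analysis. How coarse to go on timestamps and coordinates depends on the analysis the data has to support.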
Remember:
You will always have to balance the need for data privacy with business outcomes. That means minimizing and coarsening data enough that it is sufficiently anonymous, while still being able to carry out the operations and analysis needed for your business.
Step 3: Measure Impact
Obfuscating and anonymizing data is a critical step in protecting data privacy... but it's important to understand how successful your efforts were.
Measuring the impact of anonymization techniques is more of an art than a science. K-Anonymity and L-Diversity are two metrics that help you demonstrate value and impact to stakeholders in your organization.
K-Anonymity:
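Attributes are suppressed or generalized until each row is identical to at least K-1 other rows on its quasi-identifiers (attributes that are not identifying on their own but can be when combined)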
Best Practice: Target a K-Anonymity of 5
A K-Anonymity of 5 means that you will have obfuscated the data such that for each record, there will be at least 4 others that are indistinguishable from it, making that record less individually identifiable
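As a rough sketch of how this could be measured, the function below groups records by their quasi-identifier values and reports the size of the smallest group, which is the K the dataset achieves. The records, column names, and the choice of which columns count as quasi-identifiers are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical records for illustration.
records = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "flu"},
]

print(k_anonymity(records, ["age_band", "zip3"]))  # prints 1
```

The third record is alone in its group, so this sample has a K of 1. Reaching the target of 5 would mean coarsening further or suppressing records until every group has at least five members.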
L-Diversity:
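Where K-Anonymity hides an individual in the crowd by ensuring that every combination of quasi-identifier values appears in at least K records, L-Diversity measures the diversity of sensitive attribute values within each of those groups. A group in which every record shares the same sensitive value still reveals that value for everyone in it.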
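A minimal sketch of one way to measure this, assuming the simplest variant (often called distinct L-Diversity): count the distinct sensitive values in each quasi-identifier group and report the smallest count. The records and column names are hypothetical.

```python
from collections import defaultdict


def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    quasi-identifier group (0 for an empty dataset)."""
    values_by_group = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values_by_group[key].add(row[sensitive])
    return min((len(values) for values in values_by_group.values()), default=0)


# Hypothetical records: every group has two members, but the first group
# contains only one diagnosis.
records = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "asthma"},
]

print(l_diversity(records, ["age_band", "zip3"], "diagnosis"))  # prints 1
```

Every group here has two members, yet the result is 1: anyone known to be in the first group is known to have the flu. That is the gap L-Diversity is meant to expose.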
Check out these resources to learn more about K-Anonymity and L-Diversity.