
Beyond Encryption: Protect sensitive data using k-anonymity


Businesses and organizations now hold more personal information than ever before. Storing large volumes of data is useful in many ways, such as reporting and analytics, but those same uses can expose the PII linked to the data being analyzed. Encryption protects data while it is being transmitted or stored, whereas anonymization protects data while it’s being used or released. Anonymization is better suited to smaller datasets, while encryption works better with larger ones. For compliance purposes, a practical approach is to combine encryption for data at rest and in transit with anonymization for data in use.

The goal of anonymizing data is not merely to obfuscate; it’s critical that the data cannot be re-identified after anonymization. This means weighing factors such as the amount of data, the kind of information in it, and the risk of identification when anonymizing. For most businesses, the goal will be to anonymize sensitive data such as PII/PHI.

Data hiding

In 1997, MassGov compiled and released to the public a database of hospital visits by state employees for research purposes. All columns containing PII, such as name, phone number, address, and social security number, were removed. Generic identifiers like gender, date of birth, and zip code were retained for clustering and analysis purposes. The governor announced that these measures sufficiently safeguarded the privacy of the state employees.

Unless, of course, a PhD student could come up with $20.

Latanya Sweeney purchased public voter records from Massachusetts, which included names, addresses, zip codes, and dates of birth. The recordset included the governor’s details. Because only one record in the hospital data matched the governor’s gender, zip code, and date of birth, it was simple to identify the governor’s prescriptions and visits. This is called a reidentification attack. Sweeney developed a formal definition of privacy called k-anonymity to prevent this type of attack.
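
The mechanics of such an attack amount to a join on the shared columns. The following is a minimal sketch using made-up toy records (every name, zip code, date, and diagnosis below is hypothetical, as are the `link` helper and the column names); it only illustrates the idea and is not a reconstruction of the actual datasets.

```python
# Hypothetical toy data: the "anonymized" hospital table has no names,
# and the public voter table has no medical information.
hospital_records = [
    {"gender": "M", "zip": "01001", "dob": "1950-01-15", "diagnosis": "hypertension"},
    {"gender": "F", "zip": "01001", "dob": "1962-03-14", "diagnosis": "asthma"},
    {"gender": "M", "zip": "01002", "dob": "1950-01-15", "diagnosis": "fracture"},
]
voter_records = [
    {"name": "Alice Example", "gender": "F", "zip": "01001", "dob": "1962-03-14"},
    {"name": "Bob Example", "gender": "M", "zip": "01001", "dob": "1950-01-15"},
]

# Columns present in both tables (assumed for this sketch).
QUASI_IDENTIFIERS = ("gender", "zip", "dob")


def link(voters, hospital, keys=QUASI_IDENTIFIERS):
    """Return {name: [hospital rows sharing that voter's quasi-identifier values]}."""
    index = {}
    for row in hospital:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    return {v["name"]: index.get(tuple(v[k] for k in keys), []) for v in voters}


matches = link(voter_records, hospital_records)

# A voter is re-identified when exactly one hospital row matches.
reidentified = {
    name: rows[0]["diagnosis"] for name, rows in matches.items() if len(rows) == 1
}
print(reidentified)
```

In this toy data every voter matches exactly one hospital row, so both are re-identified. If several hospital rows shared the same gender, zip code, and date of birth, the join would return multiple candidates and the attacker could not tell them apart.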


The uniqueness of the governor’s record within the dataset was the key to the reidentification attack. Sweeney posited that if a dataset is k-anonymized, an attacker might still be able to use another database to find the demographic information of their target. But many different people in the dataset will share that same information, so the attacker cannot know which record belongs to their target.


k-anonymity protects against re-identification by replacing values with impersonators in an anonymity set. An anonymity set is “a group of individuals who are linked to one another by a common attribute and with whom we wish to keep their identity hidden.” A dependent set utilizes the same anonymized value for more than one field in a record, whereas an independent set uses distinct anonymized values for each attribute.

Consider which columns might be used by the adversary you’re concerned about. Quasi-identifiers (or QIs) are data elements that, while not themselves sensitive, may be employed in a reidentification attempt. There is no single list of quasi-identifiers for all types of attacks; the list is determined by the attack model. A dataset is said to be k-anonymous if every combination of quasi-identifier values that occurs in the dataset appears in at least k different records.
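
That definition translates directly into a group-and-count check. The sketch below assumes rows are plain dictionaries and that the caller supplies the quasi-identifier column names; the function name `k_anonymity` and the sample rows are illustrative, not part of any library.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    This is the size of the smallest group of rows that share the same
    combination of quasi-identifier values (0 for an empty dataset).
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalized rows.
rows = [
    {"gender": "F", "zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"gender": "F", "zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"gender": "M", "zip": "021**", "age": "40-49", "diagnosis": "fracture"},
    {"gender": "M", "zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"gender": "M", "zip": "021**", "age": "40-49", "diagnosis": "asthma"},
]

k = k_anonymity(rows, ("gender", "zip", "age"))
print(k)  # 2: the smallest group (the two "F" rows) has two members
```

Note that the result depends entirely on which columns are treated as quasi-identifiers. If an attacker could also know the diagnosis, every row above would be unique and k would drop to 1.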

Understanding what the k means in k-anonymity is critical to implementing an effective privacy protocol. The k-value is the minimum number of records that must share any given combination of quasi-identifier values. That is, for each record in the dataset, at least k-1 other records have the same quasi-identifier values, so an adversary who knows a target’s quasi-identifiers cannot determine which of those k records belongs to the target.

The two most important techniques for turning a dataset into a k-anonymous table are generalization and suppression. Generalization makes a quasi-identifier value less precise, so that records with different original values end up sharing the same generalized value. For example, a whole number such as an age can be converted into a numerical range. Suppression complements generalization by removing outliers from the original dataset, so the remaining records do not have to be generalized as far to reach k.
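
As a rough illustration of the two techniques working together, the sketch below generalizes an exact age into a ten-year range and a zip code into a three-digit prefix, then suppresses any row whose group is still smaller than k. The bucket width, prefix length, column names, and sample rows are all assumptions chosen for the example; real tools search for the least lossy generalization rather than applying one fixed rule.

```python
from collections import Counter


def generalize(row, age_bucket=10, zip_digits=3):
    """Coarsen quasi-identifiers: exact age -> range, zip code -> masked prefix."""
    low = (row["age"] // age_bucket) * age_bucket
    masked_zip = row["zip"][:zip_digits] + "*" * (len(row["zip"]) - zip_digits)
    return {**row, "age": f"{low}-{low + age_bucket - 1}", "zip": masked_zip}


def anonymize(rows, k, quasi_identifiers=("age", "zip")):
    """Generalize every row, then suppress rows whose group has fewer than k members."""
    generalized = [generalize(r) for r in rows]

    def key(r):
        return tuple(r[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in generalized)
    return [r for r in generalized if sizes[key(r)] >= k]


# Hypothetical input rows.
rows = [
    {"age": 34, "zip": "02138", "diagnosis": "asthma"},
    {"age": 37, "zip": "02139", "diagnosis": "flu"},
    {"age": 31, "zip": "02134", "diagnosis": "fracture"},
    {"age": 82, "zip": "90210", "diagnosis": "flu"},  # outlier
]

result = anonymize(rows, k=2)
for r in result:
    print(r)
```

The first three rows collapse into the group ("30-39", "021**"), while the fourth row remains alone in its group after generalization and is suppressed. Dropping that one outlier avoids having to widen the age and zip ranges for everyone else.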

Some research has questioned the effectiveness of k-anonymity, particularly for large datasets.


Yves-Alexandre de Montjoye et al. (2013) found that the reidentification risk of an individual from an anonymous database may be approximated using a function of their “relative” change in information content, which means that the more elements of their information are revealed, the more likely they are to be “reidentified.”
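
The intuition that each additional known data point narrows the set of candidates can be shown on a toy example. The sketch below is not the study’s method or data; the `unicity` function, the user labels, and the (place, day) points are hypothetical. It measures the fraction of users who are singled out when an attacker knows p points drawn from their trace.

```python
import random


def unicity(traces, p, seed=0):
    """Fraction of users whose p randomly drawn points match no other user's trace.

    `traces` maps a user label to a set of points; each trace must hold
    at least p points.
    """
    rng = random.Random(seed)
    unique = 0
    for user, points in traces.items():
        known = set(rng.sample(sorted(points), p))
        # Every user whose trace contains all the points the attacker knows.
        candidates = [u for u, trace in traces.items() if known <= trace]
        if candidates == [user]:
            unique += 1
    return unique / len(traces)


# Hypothetical traces: every single point is shared by two users,
# but every pair of points belongs to exactly one user.
traces = {
    "A": {("cafe", "Mon"), ("gym", "Tue")},
    "B": {("gym", "Tue"), ("shop", "Wed")},
    "C": {("shop", "Wed"), ("cafe", "Mon")},
}

one_point = unicity(traces, p=1)
two_points = unicity(traces, p=2)
print(one_point, two_points)  # 0.0 1.0
```

With a single known point nobody is unique, but with two known points everyone is. Removing names from the traces does nothing to change this.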

According to 2014 research led by Mark Yatskar, k-anonymization can be easily broken. Many of the individuals in the cell-phone dataset created by Yatskar and his team were reidentified.

In a 2015 study conducted by Vanessa Teague, credit card transactions from 1.1 million people in Australia were made public. For privacy reasons, the data was anonymized using a technique that removes the name, address, and account numbers of each person. Researchers found that if four additional details about an individual were known, such as the place where a purchase was made and the time it occurred, 90% of credit card users could be re-identified.

The researchers were able to develop a new algorithm that did not have the same flaws as the initial one. The team unveiled a novel method for anonymizing called “l-diversity anonymization” in this study. They found that their technique “reduces transaction traceability by more than an order of magnitude” compared to other anonymization techniques. So, what is l-diversity? That’s for another blog.



David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
