

Beyond Encryption: Implementing anonymity algorithms


Businesses and organizations now hold more personal information than ever before. Storing large volumes of data is useful for purposes such as reporting and analytics, but those uses can expose PII linked to the data being analyzed. Encryption protects data while it is being transmitted or stored, whereas anonymization protects data while it is being used or released. Anonymization is better suited to smaller datasets while encryption works better with larger ones. For compliance purposes, a practical approach is to combine encryption for data at rest and in transit with anonymization for data in use.

The goal of anonymizing data is not merely to obfuscate; it is critical that the data cannot be re-identified after anonymization. This means considering metrics such as the amount of data, the kind of information in it, and the risk of identification while anonymizing. For most businesses, the goal will be to anonymize sensitive data such as PII/PHI. In an earlier post, I spoke about using k-anonymity for protecting sensitive data. However, research has questioned the effectiveness of k-anonymity, particularly for large datasets. I followed up with another post about l-diversity, a technique that addresses some of the concerns around k-anonymity. Finally, I reviewed t-closeness, which in turn deals with some concerns around l-diversity.

To be effective, these algorithms need to be implemented. Most of the research has been about privacy concerns around specific datasets to be released to the public. We are concerned about reducing the potential threat of non-malicious insider activity by protecting internal datasets.

Practical Enterprise Considerations

Two major challenges have been addressed in this blog series. The first is how to identify the most practical and effective algorithms to achieve “practical privacy”. Guaranteeing zero risk of re-identification is not the industry standard and is not required by current privacy protection laws. However, researchers have shown that as few as four additional data points can be enough to identify an individual in an anonymized data set. The second, and potentially more difficult, challenge is how to implement anonymization controls across the polyglot data sources used by most corporations. Many banks have a customer’s account information on a mainframe while capturing data exhaust from their cell phone in the cloud. Many reporting and analytics use cases require aggregations from these multiple data sources.

Insider risk

The risk of outside malicious actors accessing and misusing sensitive data is typically handled through network security protocols, access controls, data encryption, etc. Managing insider risk is a more common use case for implementing internal data privacy practices. An insider is a known user with legitimate access to data. Insider threat refers to an insider acting with malicious intent. Insider risk is concerned with the likelihood and potential business impact of a threat incident, regardless of intent.

A study by Aberdeen found that out of 3,255 confirmed data breaches over two years, 33% involved insiders and 78% of those insider incidents involved unintentional data loss or exposure. Considering about 2% of data movement events result in data exposure, implementing data privacy controls as a component in a security posture can significantly reduce the potential threat of non-malicious insider activity.


The question now becomes how to achieve a practical balance between effectiveness and utility. This question does not have a clear-cut answer, as it is very dependent upon several internal factors related to your business needs and data metrics.

Practical Enterprise Implementation

Assume the goal is to secure PII from authorized users in a way that still allows them to continue their work. Ideally, there should be a sense of confidence that if an employee is working on a confidential dataset on an unsecured wi-fi hotspot in a coffee shop, there is no risk of a sensitive data leak. They have been told not to, but …

The first step is to take a data inventory and identify where sensitive data exists. This can include mainframes, databases, flat files, Excel workbooks, etc. It is possible to build crawlers in Python that can find sensitive data, including custom-formatted sensitive data such as customer identification codes.
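As a rough illustration of such a crawler, here is a minimal sketch that scans text files for a few patterns. The regular expressions, the `CUST-` customer ID format, and the function names `scan_text` and `scan_directory` are all assumptions made for this example; a real inventory would need patterns tuned and validated against your own data, plus connectors for databases and spreadsheets.

```python
import re
from pathlib import Path

# Illustrative patterns only; real inventories need tuned, validated rules.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Hypothetical custom format: "CUST-" followed by 8 digits.
    "customer_id": re.compile(r"\bCUST-\d{8}\b"),
}


def scan_text(text):
    """Return {pattern_name: match_count} for each pattern found in text."""
    counts = {}
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            counts[name] = len(hits)
    return counts


def scan_directory(root, suffixes=(".csv", ".txt", ".log")):
    """Walk root and report files that appear to contain sensitive data."""
    findings = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue  # unreadable file; skip rather than fail the crawl
            counts = scan_text(text)
            if counts:
                findings[str(path)] = counts
    return findings
```

Calling `scan_directory` on a shared drive or export folder returns a mapping from file path to the kinds of sensitive values found, which can seed the data inventory.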

Once you have identified the source data on the servers, you have a decision to make. Do you modify the source tables to replace them with anonymized data or do you implement an abstraction layer and revisit ACL controls to prohibit users from directly accessing tables that have sensitive data?
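To make the abstraction-layer option more concrete, here is a minimal sketch using an in-memory SQLite database. The `customers` table, its columns, and the `customers_anon` view are hypothetical names chosen for illustration. SQLite has no user grants, so the ACL half of the approach (denying direct access to the base table) is not shown and would be handled by your actual database platform.

```python
import sqlite3


def build_anonymized_view():
    """Create a sample table and a view that hides direct identifiers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(name TEXT, ssn TEXT, age INTEGER, zip TEXT, balance REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
        [
            ("Ann Lee", "123-45-6789", 34, "46220", 1200.0),
            ("Bob Ray", "987-65-4321", 47, "46218", 310.5),
        ],
    )
    # The view omits name and ssn entirely and generalizes the
    # quasi-identifiers: age becomes the start of a 10-year band and
    # the ZIP code keeps only its first three digits.
    conn.execute(
        """
        CREATE VIEW customers_anon AS
        SELECT (age / 10) * 10 AS age_band_start,
               substr(zip, 1, 3) || '**' AS zip_prefix,
               balance
        FROM customers
        """
    )
    return conn


conn = build_anonymized_view()
rows = conn.execute(
    "SELECT * FROM customers_anon ORDER BY age_band_start"
).fetchall()
```

Users would query `customers_anon` rather than `customers`, so the source table stays untouched while reports see only generalized values.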

The answer may depend on the source data and your data movement services. For example, you may have legacy mainframe databases that contain the audited golden record of all your critical data. A possible option may be to have the privacy algorithms built into the data movement pipelines that populate the more user-friendly relational database systems that drive OLTP and OLAP functions.
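One way to picture privacy logic inside a data movement pipeline is as a transform applied to each record between extract and load. The sketch below is only an assumption about what such a step might look like: the column names, the 10-year age bands, and the 3-digit ZIP prefix are illustrative choices, not recommendations for any particular dataset.

```python
# Hypothetical column names treated as direct identifiers and dropped.
DIRECT_IDENTIFIERS = {"name", "ssn"}


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the leading digits of a ZIP code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def anonymize_row(row):
    """Drop direct identifiers and generalize quasi-identifiers."""
    out = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    out["age"] = generalize_age(int(out["age"]))
    out["zip"] = generalize_zip(str(out["zip"]))
    return out


def anonymizing_pipeline(rows):
    """Yield anonymized rows one at a time, suitable for streaming loads."""
    for row in rows:
        yield anonymize_row(row)
```

Because the transform is a generator, it can sit between a mainframe extract and a relational load without holding the full dataset in memory, and the golden record on the source system is never modified.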

There are commercial applications that you can leverage to implement privacy protocols, or you can roll your own using an open-source tool like ARX.


Data privacy is an important consideration for any business, and it can be difficult to balance the need to protect sensitive data with other needs like ease of use. It’s imperative that you take time to do a thorough inventory of all your company’s sources of customer information in order to identify what should be anonymized or encrypted. Once you know this, it will make choosing among k-anonymity, t-closeness and l-diversity much easier.

This blog series hopefully provided some insight into when one method may work better than the other depending on the nature of your organization’s data flow and legacy systems. For example, most organizations will want to start with k-anonymity where k has a value of ~10. That seems to satisfy most business use cases. You may want to revisit this practice after all the issues have been worked out and see if there is a need to include t-closeness.
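A simple way to check whether a dataset meets a chosen k is to group records by their quasi-identifiers and look at the smallest group. The following is a minimal sketch, assuming rows are dictionaries and that you have already decided which columns count as quasi-identifiers; the function names are hypothetical.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifiers."""
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    # An empty dataset has no groups; report 0 so it never passes a check.
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows, quasi_identifiers, k=10):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(rows, quasi_identifiers) >= k
```

If the check fails, the usual options are to generalize the quasi-identifiers further (wider age bands, shorter ZIP prefixes) or to suppress the records in the undersized groups.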

If you’re ready to move to the next level of your data-driven enterprise journey by exploring how privacy can be used to enhance your data security profile, contact Perficient’s Data Solutions team.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
