Protect PII with anonymized datasets for Data Scientists with differential privacy

Businesses and organizations now hold more personal information than ever before. Storing large amounts of structured and unstructured data is useful in many ways, such as reporting and analytics, but it can expose personally identifiable information (PII) linked to the data being analyzed. As organizations come under increasing pressure to comply with data privacy laws and regulations, it is important that personal information about customers or employees not be compromised. Data scientists typically perform analytics on a representative sample from a big data store, generally around ten percent. Differential privacy has been shown to be an effective approach to data anonymization for this type of task. ARX is comprehensive open source software for anonymizing sensitive personal data and can be leveraged to support this type of analysis.

Differential Privacy

Consider the problem of using data to learn about a population without learning about specific individuals within the population. De-identification using k-anonymity is the most widely used approach, but it is subject to some fairly straightforward re-identification attacks under some conditions. Other models such as l-diversity and t-closeness involve a loss of data and are either still subject to attacks (such as skewness attacks) or difficult to implement. Differential privacy is not an algorithm, but rather a mathematical definition of what it means to have privacy. It is a model that guarantees privacy by ensuring that the probability of any possible output of the anonymization process does not change “by much” if the data of an individual is added to or removed from the input data. In other words, the output of the process must be roughly the same whether or not it uses your specific data. This makes it very difficult for attackers to derive information about specific individuals.
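The “by much” above is usually made precise with the standard definition of ε-differential privacy. The notation below (M for the randomized anonymization process, D and D′ for two input datasets that differ in one individual's record, S for any set of possible outputs) is introduced here for illustration and does not appear in the original text:

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S
```

A smaller ε forces the two probabilities closer together, which means stronger privacy. A common relaxation, (ε, δ)-differential privacy, adds a small constant δ to the right-hand side.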

Differential privacy has several important advantages over previous privacy techniques:

  • All information is assumed to be identifying information.
  • It is resistant to privacy attacks based on auxiliary information, so it can effectively defend against the linkage attacks that may be used against de-identified data.
  • It is compositional: we can assess the privacy loss of running two differentially private analyses on the same data by adding up the separate privacy losses of the two analyses.
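The compositional property in the last bullet can be sketched as a simple privacy budget tracker. This is a minimal illustration of basic sequential composition only. The class name `PrivacyBudget` and its methods are hypothetical and are not part of ARX or any other library.

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic sequential composition.

    Hypothetical helper for illustration: each analysis run on the same
    data spends some epsilon, and the total loss is the sum of the parts.
    """

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> float:
        """Record one analysis and return the remaining budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject
        # a request that exactly exhausts the budget.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon
        return self.total_epsilon - self.spent


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first differentially private analysis
budget.spend(0.5)  # second analysis on the same data
print(f"total privacy loss so far: {budget.spent:.1f}")
```

Tracking the sum this way gives an upper bound on the combined privacy loss. Tighter composition bounds exist but are outside the scope of this sketch.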

There is also one key difference: differential privacy is a property of the data processing method rather than a property of a dataset.

Differential privacy is becoming more and more important as we move towards a world where data is constantly being collected and analyzed. For a concrete implementation, you need an algorithm that satisfies the differential privacy definition and a mechanism to implement that algorithm on your data.
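As a small illustration of what "an algorithm that satisfies the definition" can look like, here is a sketch of randomized response, a classic textbook mechanism. It is not the method used by SafePub or ARX, and the function names are hypothetical. With two fair coin flips, a truthful "yes" is reported as "yes" with probability 0.75 and a truthful "no" is reported as "yes" with probability 0.25, so the ratio is 3 and the mechanism satisfies differential privacy with ε = ln 3.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report a sensitive yes/no answer with plausible deniability."""
    # First coin flip: heads means answer truthfully.
    if rng.random() < 0.5:
        return truth
    # Tails: ignore the truth and answer with a second fair coin flip.
    return rng.random() < 0.5


def estimate_true_rate(responses: list[bool]) -> float:
    """Recover an estimate of the population rate from noisy responses.

    The expected observed rate is 0.5 * true_rate + 0.25, so we invert it.
    """
    observed = sum(responses) / len(responses)
    return 2.0 * observed - 0.5


rng = random.Random(42)
# Hypothetical population in which 30% of individuals would answer "yes".
truths = [i < 6000 for i in range(20000)]
responses = [randomized_response(t, rng) for t in truths]
print(f"estimated rate: {estimate_true_rate(responses):.3f}")
```

No individual response reveals much about the person who gave it, yet the aggregate estimate stays close to the true rate. That is the population-versus-individual trade-off described above.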



SafePub is a differentially private algorithm that provides truthful data anonymization with strong privacy guarantees.

The algorithm does not perturb the original input data or fabricate artificial output data. Instead, records are drawn at random from the input dataset and their attribute values are generalized. This achieves truthfulness, which is not available in a number of other privacy algorithms that perturb the data.

The SafePub algorithm takes as input a dataset, the privacy parameter ε, and parameters that control the anonymization, the search, and the number of search steps. It outputs a smaller, anonymized dataset. SafePub draws a random sample of the initial dataset and initializes a set of candidate transformations. For each transformation, SafePub anonymizes the data using that transformation and then measures the quality of the result. These quality measures include granularity and intensity, discernibility, non-uniform entropy, statistical classification and group size.
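The sample-then-generalize idea can be sketched in a few lines. This is a heavily simplified illustration and not the SafePub implementation. The sampling probability `beta`, the group-size threshold `k`, and the fixed `generalize_person` function are assumptions passed in directly, whereas the published algorithm derives its parameters from the privacy parameters and searches for a transformation. The final step of dropping small groups is also an assumption of this sketch, and on its own the sketch carries no formal privacy guarantee.

```python
import random
from collections import Counter
from typing import Callable, Hashable


def sample_generalize_suppress(
    records: list[tuple],
    generalize: Callable[[tuple], Hashable],
    beta: float,
    k: int,
    rng: random.Random,
) -> list[Hashable]:
    """Return a smaller, generalized dataset without fabricated values."""
    # Step 1: keep each record independently with probability beta.
    sampled = [r for r in records if rng.random() < beta]
    # Step 2: coarsen attribute values (nothing is perturbed or invented).
    generalized = [generalize(r) for r in sampled]
    # Step 3 (assumption of this sketch): drop records whose generalized
    # form appears fewer than k times.
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]


def generalize_person(record: tuple) -> tuple:
    """Hypothetical transformation: age to decade, ZIP code to 3 digits."""
    age, zip_code = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**")


rng = random.Random(0)
people = [(34, "12345")] * 50 + [(87, "99999")]
output = sample_generalize_suppress(people, generalize_person, 0.5, 5, rng)
print(len(output), set(output))
```

Every surviving record is a coarsened version of a real input record, which is the truthfulness property described above. The single outlier is dropped whenever it is sampled, because its group is smaller than `k`.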


ARX (Anonymization and Reweighting eXtended) is a comprehensive open source software for anonymizing sensitive personal data. ARX is designed to be applied in different domains, such as data integration and information extraction.

ARX focuses on two major concepts of data privacy: anonymity and reweighting. Anonymity implies that the probability of attributing any person’s attribute to another should not change if a record is added to or removed from the input data. Reweighting is a technique for adjusting the probability of assigning the same attribute to different persons in response to changes in the input data.

ARX implements differential privacy by computing an approximation of the Laplace distribution at several different scales. The final output is an anonymized dataset that has statistical properties similar to those of the original dataset. This makes it very difficult for attackers to infer information about individuals in the input data by looking at the anonymized output.
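To show how Laplace noise relates to differential privacy in general, here is a sketch of the textbook Laplace mechanism for a counting query. It is not ARX's implementation, and the function names are hypothetical. A count changes by at most 1 when one individual is added or removed, so noise drawn from a Laplace distribution with scale 1/ε is enough to satisfy ε-differential privacy for that query.

```python
import random
from typing import Callable, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(
    records: Iterable,
    predicate: Callable[[object], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Answer a counting query with noise calibrated to sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    # Sensitivity of a count is 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)
ages = [23, 35, 41, 58, 62, 19, 44]  # hypothetical data
noisy = private_count(ages, lambda age: age > 40, epsilon=1.0, rng=rng)
print(f"noisy count of people over 40: {noisy:.2f}")
```

Unlike the truthful approach described for SafePub, this mechanism perturbs the reported value. A smaller ε means a larger noise scale and therefore stronger privacy at the cost of accuracy.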


When it comes to anonymizing sensitive personal data, there are a few different options available. Implementing differential privacy with SafePub on data extracts from big data sources, on-premises or in the cloud, provides the privacy standards that businesses need for compliance and the truthfulness that data scientists need for model accuracy. For implementation, one of the most comprehensive and well-supported solutions is ARX, open-source software that provides a variety of privacy protections, including differential privacy.



David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.
