

From Data Swamp to Data Lake: Data Classification


This is the fourth blog in a series that explains how organizations can prevent their Data Lake from becoming a Data Swamp, with insights and strategy from Perficient’s Senior Data Strategist and Solutions Architect, Dr. Chuck Brooks. Read the previous blog here.



In the first article in this series, I explained the five components necessary to prevent a Data Lake from becoming a Data Swamp. In this blog, we discuss the fourth component: implementing classification-based security in the Data Lake.

Data classification tags data according to its type, sensitivity, and value to the organization if altered, stolen, or destroyed. It helps an organization understand the value of its data, determine whether the data is at risk, and implement controls to mitigate risks. Data classification also helps an organization comply with relevant industry-specific regulatory mandates such as SOX, HIPAA, PCI DSS, and GDPR.




Data Sensitivity Levels and Classification Models

Data is classified according to its sensitivity level—high, medium, or low.

  • High sensitivity data—if compromised or destroyed in an unauthorized transaction, would have a catastrophic impact on the organization or individuals. For example, financial records, intellectual property, and authentication data.
  • Medium sensitivity data—intended for internal use only, but if compromised or destroyed, would not have a catastrophic impact on the organization or individuals. For example, emails and documents with no confidential data.
  • Low sensitivity data—intended for public use. For example, public website content.


There are typically four data classification levels in information security:

Public: data that is in, or can be in, the public domain and can be openly shared with anyone outside of the organization. For example, a datasheet about the company’s products and services.

Internal: company-wide data that is kept within the organization and, while not sensitive, should not be shared externally. For example, a guide about how to get help from the IT helpdesk.

Confidential: domain-specific data that can be shared with specific people or teams and contains sensitive company information. For example, a price list for one of the company’s products.

Restricted: highly sensitive information that should only be available on a need-to-know basis. For example, employee agreements.

If a database table, database column, file, or other data resource includes data that can be classified at two different levels, it’s best to classify all the data at the higher level.
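This "classify at the higher level" rule can be sketched in a few lines of Python. The enum and function names here are illustrative assumptions, not part of any specific classification tool:

```python
from enum import IntEnum

class Classification(IntEnum):
    # Ordered so that a higher value means more sensitive.
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

def classify_resource(part_levels):
    """A mixed resource (table, file) takes the highest level found in its parts."""
    return max(part_levels)

# A table with one confidential column is treated as confidential overall.
table_columns = [Classification.PUBLIC, Classification.INTERNAL, Classification.CONFIDENTIAL]
print(classify_resource(table_columns).name)  # CONFIDENTIAL
```

Ordering the levels numerically makes the rule a one-line `max`, which is why most classification schemes define a strict hierarchy rather than overlapping categories.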


Types of Data Classification

Data classification can be performed based on content, context, or user selections:

  • Content-based classification—involves inspecting the contents of files, documents, and database columns and classifying them according to the sensitive information they contain.
  • Context-based classification—involves classifying files, documents, and database columns based on metadata like the application that created the file.
  • User-based classification—involves classifying files, documents, and database columns according to a manual judgment of a knowledgeable user. Individuals who work with documents can specify how sensitive they are—they can do so when they create the document, after a significant edit or review, or before the document is released.


Applying Your Data Classifications Policy

Once you have classified your data, you must also classify your people. Assigning each person a classification level ensures they can see only data at their assigned level and below. It is this association between data and people that ensures data and knowledge workers have access to all the data they need to identify trends and make smarter, data-driven business decisions.
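At access time, this data-to-people association reduces to a single comparison. A minimal sketch, assuming the same ordered levels are assigned to both users and data (names are illustrative):

```python
from enum import IntEnum

class Level(IntEnum):
    # Ordered so that a higher value means more sensitive.
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

def can_access(user_level: Level, data_level: Level) -> bool:
    """A user may see data at their assigned classification level and below."""
    return user_level >= data_level

print(can_access(Level.CONFIDENTIAL, Level.INTERNAL))  # True
print(can_access(Level.INTERNAL, Level.RESTRICTED))    # False
```

In practice this check is enforced by the Data Lake platform (for example, via column- or row-level security policies) rather than application code, but the underlying rule is the same comparison.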

Read the next blog in the series here.


Perficient’s Cloud Data Expertise

The world’s leading brands choose to partner with us because we are large enough to scale major cloud projects, yet nimble enough to provide focused expertise in specific areas of your business. Our cloud, data, and analytics team can assist with your entire data and analytics lifecycle, from data strategy to implementation. We will help you make sense of your data and show you how to use it to solve complex business problems. We’ll assess your current data and analytics issues and develop a strategy to guide you to your long-term goals. Learn more about our Google Data capabilities here.

Download the guide, Becoming a Data-Driven Organization with Google Cloud Platform, to learn more about Dr. Chuck’s GCP data strategy.


Chuck Brooks

Dr. Chuck is a Senior Data Strategist / Solution Architect. He is a technology leader and visionary in big data, data lakes, analytics, and data science. Over a career that spans more than 40 years, Dr. Chuck has developed many large data repositories based on advancing data technologies, and has helped many companies become data-driven and develop comprehensive data strategies. The cloud is the modern ecosystem for data and data lakes. Dr. Chuck’s expertise lies in the Google Cloud Platform, Advanced Analytics, Big Data, SQL and NoSQL Databases, Cloud Data Management Engines, and business management and development technologies such as SQL, Python, Data Studio, Qlik, PowerBI, Talend, R, Data Robot, and more. This sales enablement and data strategy work is the result of Dr. Chuck’s 40-year career in the data space. For more information or to engage Dr. Chuck, contact him at
