A few years back I worked for a client that was implementing cell-level security on every data structure within their data warehouse. They had nearly 1,000 tables and 200,000 columns — yikes! Talk about administrative overhead. The logic was that data access should only be granted on a need-to-know basis: users would have to request access to specific tables and columns.
Need-to-know is a term frequently used in military and government institutions; it refers to granting cleared individuals access to sensitive information. This is a good concept, but the key is the word "sensitive": the information has to be classified first, and only then is need-to-know applied to cleared individuals.
Most government documents are not sensitive, which lets administrative resources focus on the information that is sensitive and classified. The system for classifying information as Top Secret, Secret, or Confidential has relatively stringent rules, and it also discourages over-classification, because once a document is classified its use becomes limited.
This same phenomenon holds in the corporate world. The more a set of data is locked down, the less it will be used. Unnecessarily limiting information workers' access to data obviously does not help the overall objectives of the organization. Big Data just magnifies this dynamic: unreasonably lock down Big Data and you have found the surest way to limit its value.
Now this is not to say that no data should be restricted. Social Security Numbers (SSNs), HIPAA-governed data elements, and account numbers are a few examples. We do need solutions that restrict access to this critical information, but those solutions should not escalate the same controls onto information that does not need to be as tightly held.
A Classify, Separate, and Secure strategy is quite effective for locking down only the critical data elements. First, Classify information, ideally at the field/column level, using specific, consistent guidelines that do not unnecessarily restrict information. Then, when we load information into a data reservoir (or data lake), we Separate sensitive information from unrestricted information. This should be executed at the column level: for example, if a table has a field containing SSNs, physically split that column out into its own table. Masking may also be appropriate, and depending on the other data elements, we may choose not to load the sensitive columns into our cluster at all. This prevents the security escalation that happens when an entire table is classified as sensitive because of a single sensitive column. Lastly, we Secure the sensitive information, perhaps in a separate directory or system (such as Apache Accumulo). The objective is to focus our efforts on locking down the truly sensitive information while minimizing administrative overhead.
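To make the Separate step concrete, here is a minimal Python sketch. The table, the column names (customer_id, name, ssn), and the masking format are hypothetical illustrations, not from any real client system. It shows splitting a sensitive column out at load time, keeping a join key so cleared users can re-link the two halves, and leaving only a masked value in the open copy.

```python
SENSITIVE_COLUMNS = {"ssn"}  # hypothetical output of the Classify step

def mask_ssn(ssn):
    """Mask all but the last four digits, e.g. ***-**-6789."""
    return "***-**-" + ssn[-4:]

def separate(rows, key_column):
    """Split each row into an open record and a sensitive record.

    Both halves keep the join key so cleared users can re-link them;
    the open copy carries only a masked SSN.
    """
    open_rows, sensitive_rows = [], []
    for row in rows:
        # Sensitive table: the restricted column(s) plus the join key.
        sensitive_rows.append(
            {k: v for k, v in row.items()
             if k in SENSITIVE_COLUMNS or k == key_column})
        # Open table: everything else, with sensitive values masked.
        open_rows.append(
            {k: (mask_ssn(v) if k in SENSITIVE_COLUMNS else v)
             for k, v in row.items()})
    return open_rows, sensitive_rows

rows = [
    {"customer_id": "1001", "name": "Ada", "ssn": "123-45-6789"},
    {"customer_id": "1002", "name": "Grace", "ssn": "987-65-4321"},
]

open_rows, sensitive_rows = separate(rows, key_column="customer_id")
print(open_rows)       # SSNs masked; safe to load into the open data lake
print(sensitive_rows)  # full SSN + join key only; goes to the secured store
```

The design point is that only the small sensitive table needs the heavyweight controls; the rest of the data stays broadly usable.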
Maybe everyone doesn’t need access to every single piece of data, but if you keep all the data locked in an ivory tower, then how are people supposed to use it to make better decisions?