I see companies start down their Big Data/NoSQL journey with a Proof of Concept mindset, and they almost always end up funding a science project by confusing early wins on established products with progress. Cassandra is ten years old and DataStax has 500 customers in 50 countries. This stuff works; what you need is a Proof of Compliance. Can you go to production at your specific company? Most of the projects I see fall down on security compliance, not performance. In its latest upgrade, DataStax Enterprise has improved its advanced security offering, making it easier to build a small-scope, time-boxed PoC that demonstrates how a highly available, highly scalable, always-on persistence layer can also meet the rigorous enterprise compliance requirements your current databases satisfy. There’s a lot of fun to be had with DSE, but first you need to eat your vegetables.
Implementing sufficient global data security measures to ensure compliance around PII and NPI is a real challenge for open-source NoSQL databases. Falling short of Sarbanes-Oxley, Basel II, the Health Insurance Portability and Accountability Act (HIPAA), the General Data Protection Regulation (GDPR), or the Payment Card Industry Data Security Standard (PCI DSS) exposes regulated industries to substantial reputational and financial risk. Let’s take a look at what the open-source product offers.
Apache Cassandra’s security model provides role-based authentication and authorization at the internal, or database, level. An internal-only authentication model precludes integrating external authentication providers such as LDAP, Kerberos, and Active Directory. This is almost certainly a deal-breaker from a corporate security perspective. Cassandra’s TLS/SSL encryption is available both between the client and the database cluster and intra-node, providing encryption for data in flight. However, there are many use cases where encryption for data at rest, often for a relatively small subset of data, is mandatory. If you were to start along this open-source path, you would get a feel for how to use a wide-column store, but that actually gives very little insight into how to offer a wide-column store solution in a regulated industry. I hear the argument that it’s best to start with small steps. Honestly, DataStax Academy makes getting a small, realistic use case up and running very easy. It’s better to focus on how you are going to get to production from day one.
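For context, the internal-only model described above boils down to a few settings in cassandra.yaml. This is a minimal sketch using the stock open-source class names, not a recommended configuration:

```yaml
# cassandra.yaml: open-source Cassandra's internal-only security model
authenticator: PasswordAuthenticator   # credentials stored in Cassandra itself
authorizer: CassandraAuthorizer        # permissions stored in Cassandra itself
role_manager: CassandraRoleManager     # roles stored in Cassandra; no LDAP/Kerberos/AD hook
```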
Do the hard things first. — Michael Bloomberg
I propose that instead of starting with a small “Hello, World” use case that is pretty much guaranteed to work, you start with a small “Hello, My Company” project. In this project, we are going to do the simplest version of a production-grade customer journey. The assumption is that DSE is fast enough, resilient enough and scalable enough to meet your data requirements. We need to know if the following are sufficient for your corporate security and governance requirements:
- authentication
- authorization
- encryption
- auditing
DataStax Enterprise supports SSL encryption for both client-to-node and node-to-node communication. All communication occurs over TCP sockets and can be secured using the standard Java Security SSL/TLS implementation in the JVM. Since DSE supports bringing your own root certificate authority, you can just use a self-signed certificate. Use OpsCenter Lifecycle Manager (LCM) to automate preparing and distributing the certificates, and set inter-node encryption to all.
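If you want to see what LCM is configuring on your behalf, the result is roughly the two encryption blocks in cassandra.yaml. The keystore paths and passwords below are placeholders:

```yaml
# cassandra.yaml (illustrative values; LCM normally writes these for you)
client_encryption_options:
    enabled: true
    keystore: /etc/dse/conf/keystore.jks
    keystore_password: changeit
    truststore: /etc/dse/conf/truststore.jks
    truststore_password: changeit
    require_client_auth: false

server_encryption_options:
    internode_encryption: all               # encrypt all node-to-node traffic
    keystore: /etc/dse/conf/keystore.jks
    keystore_password: changeit
    truststore: /etc/dse/conf/truststore.jks
    truststore_password: changeit
```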
Use DSE Unified Authentication to manage external authentication and authorization with LDAP or Kerberos. (I’m going to assume LDAP.) Unified Authentication is composed of DSE Authenticator, DSE Role Manager, and DSE Authorizer. Authenticator validates user identity with either LDAP or Kerberos and is a prerequisite for enabling authorization and role management. Role Manager maps the user’s LDAP group names to DSE roles. Authorizer checks a request against the role’s permissions on a resource before allowing execution.
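Here is a minimal sketch of the moving parts, assuming LDAP and placeholder connection details; verify the option names against the dse.yaml reference for your DSE version:

```yaml
# cassandra.yaml: hand authentication, role management, and authorization to DSE
authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
role_manager: com.datastax.bdp.cassandra.auth.DseRoleManager
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer
```

```yaml
# dse.yaml: enable the three components and point them at LDAP (placeholder values)
authentication_options:
    enabled: true
    default_scheme: ldap
role_management_options:
    mode: ldap                              # map LDAP group names to DSE roles
authorization_options:
    enabled: true
ldap_options:
    server_host: ldap.example.com
    server_port: 636
    use_ssl: true
    search_dn: cn=dse_lookup,ou=svc,dc=example,dc=com
    search_password: change_me
    user_search_base: ou=users,dc=example,dc=com
    user_search_filter: (uid={0})
    group_search_type: directory_search
    group_search_base: ou=groups,dc=example,dc=com
    group_name_attribute: cn
```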
By enabling SSL with LCM, locking down ports, and integrating our database access with the corporate LDAP, we have a persistence layer that isn’t in direct violation of most corporate security policies. One step below compliance is negligence, so this is not really that impressive yet. Let’s get better.
We have already set up encryption for data over the wire. Next we need to set up transparent data encryption for tables, hint files, commit logs, and configuration properties. Use local encryption at this stage, and then consider using a KMIP server for remote encryption-key storage and management later. With local encryption you do have to distribute the keys across the cluster manually, because this isn’t handled by LCM.
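As a sketch of where these settings live; option names vary across DSE versions, so treat the values below as placeholders and confirm them against the transparent data encryption docs for your release:

```yaml
# dse.yaml: encrypt system resources that can hold sensitive data, such as hint files
system_info_encryption:
    enabled: true
    cipher_algorithm: AES
    secret_key_strength: 128
```

```sql
-- CQL: encrypt an individual table's SSTables with a locally stored key
-- (keyspace/table names and exact compression options are illustrative)
ALTER TABLE ks.sensitive_table
  WITH COMPRESSION = {
    'class': 'Encryptor',
    'cipher_algorithm': 'AES/ECB/PKCS5Padding',
    'secret_key_strength': 128
  };
```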
The final step in setting up a NoSQL persistence layer with production-grade enterprise characteristics is an auditing mechanism. You can capture audit data to a log file or to a table. I recommend a log file because there may already be mechanisms in place in your organization that you can leverage out of the box. In the logback.xml file, configure logging levels and other options the same way you do in a Java application, but also mask sensitive data. From the beginning, use a regex to filter the keyspaces you target, limit the number of event categories (maybe just data manipulation), and specify roles to filter, but choose to monitor the whole cluster. To protect the audit trail, restrict the audit files to 600 permissions to limit exposure from OS-level breaches, and use the encoder pattern element of logback.xml to redact sensitive information:
%replace(%msg){"password='.*'", "password='xxxxx'"}
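Pulling that together, the audit configuration is split between dse.yaml (what gets audited) and logback.xml (where and how it is written). The values below are illustrative and should be checked against the audit logging docs for your DSE version:

```yaml
# dse.yaml: turn on auditing and scope it
audit_logging_options:
    enabled: true
    logger: SLF4JAuditWriter                # route audit events through logback
    included_categories: DML                # e.g. data-manipulation events only
    included_keyspaces: "pii_.*"            # regex of keyspaces to target
```

```xml
<!-- logback.xml: audit appender that redacts sensitive values -->
<appender name="SLF4JAuditWriterAppender" class="ch.qos.logback.core.FileAppender">
  <file>/var/log/cassandra/audit/audit.log</file>
  <encoder>
    <pattern>%date{ISO8601} %replace(%msg){"password='.*'", "password='xxxxx'"}%n</pattern>
  </encoder>
</appender>
<logger name="SLF4JAuditWriter" level="INFO" additivity="false">
  <appender-ref ref="SLF4JAuditWriterAppender"/>
</logger>
```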
Now imagine you create two tables: customer_sales and product. Create some basic prepared statements to do CRUD operations that minimize injection exposure. Create a set of users and groups that have different levels of access, and use Authorizer to enable row-level access control (RLAC), as sketched below. Then come up with a process for making sure that legal operations are permitted and illegal operations are captured by a rapid identification and response mechanism. Maybe quantify your processes and practices using a published guide.
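Here is a rough sketch of that schema and its grants. The keyspace, table, and role names are made up for illustration, and the RLAC syntax in particular varies by DSE version, so verify it against the CQL reference before relying on it:

```sql
-- Illustrative keyspace and tables (names are placeholders)
CREATE KEYSPACE IF NOT EXISTS shop
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};

CREATE TABLE shop.product (
  product_id uuid PRIMARY KEY,
  name       text,
  price      decimal
);

CREATE TABLE shop.customer_sales (
  region      text,
  customer_id uuid,
  sale_id     timeuuid,
  amount      decimal,
  PRIMARY KEY ((region), customer_id, sale_id)
);

-- Roles are mapped from LDAP groups by DSE Role Manager; permissions stay in CQL
GRANT SELECT ON TABLE shop.product TO sales_analyst;
GRANT SELECT, MODIFY ON TABLE shop.customer_sales TO sales_app;

-- Row-level access control: restrict customer_sales by its text partition key,
-- then grant each role visibility into only the rows matching its filter
RESTRICT ROWS ON shop.customer_sales USING region;
GRANT SELECT ON 'EMEA' ROWS IN shop.customer_sales TO emea_analyst;
```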
Now it’s time to put in some quality Patrick McFadin time and build out a ridiculously fast, highly scalable data powerhouse on top of your production-grade Proof of Compliance.