Perficient Enterprise Information Solutions Blog

Posts Tagged ‘Big Data’

The End of IT?

A few years ago Adrian Cockcroft, cloud architect at Netflix at the time, posted a blog that caused quite a stir in the IT community. It described how Netflix had almost done away with DevOps (or even plain Ops) by using the cloud (AWS in this case), coining yet another new IT buzzword: NoOps.

Many in the DevOps community took strong issue with this, arguing that Ops by any name, whether NoOps or DevOps, is still Ops. A number of Platform-as-a-Service (PaaS) vendors jumped on the NoOps bandwagon, even declaring the following year to be the definitive year of NoOps.

Vendors like Heroku, AWS Elastic Beanstalk and AppFog tout their PaaS platforms as purely development-focused, with no need for operations support. I witnessed this in person during a Heroku workshop (Heroku itself, by the way, is hosted on AWS). It is frighteningly simple and easy to create a website or web service using any of the supported language platforms and connect it to a set of standard database backends and tools; it scales efficiently, and the setup is a breeze if you have ever worked on any kind of multi-stack project.

I think a key drawback of PaaS today is that unless the project is self-contained, or all of your company's data and services already live in the cloud or are accessible externally, it is difficult to justify punching enough holes through the corporate firewall to move to PaaS, especially if the data is sensitive. Organizations are still uncomfortable with the idea of highly sensitive data they own being hosted on systems outside their control. Also, being locked into a limited toolset or a particular database might not appeal to every project owner, given the proliferation of specialized software resources, especially in the Big Data landscape.

Cassandra NoSQL Data Modeling Snippet

Data modeling in Cassandra is a little tricky and requires a combination of science and art. Think of a Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key, with both maps kept sorted. To get the most out of Cassandra, and for the sake of long-term maintenance, it is better to analyze, understand and follow certain high-level rules while implementing it.
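As a rough mental model only (this is not an API that Cassandra exposes), the nested-map view can be pictured in plain Python, with made-up row and column keys:

# Conceptual sketch of a column family as a map of a map.
# Outer map: keyed by the row (partition) key; inner map: keyed by the column key.
# Cassandra keeps both levels sorted.
column_family = {
    "user:1001": {                    # row key
        "email": "ada@example.com",   # column key -> column value
        "join_date": "2015-06-01",
        "name": "Ada",
    },
    "user:1002": {
        "email": "bob@example.com",
        "join_date": "2015-06-03",
        "name": "Bob",
    },
}

# Reading a row is a lookup in the outer map; reading one column of that row
# is a further lookup in the inner map.
print(column_family["user:1001"]["email"])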

A few things to consider while implementing Cassandra:

  • Column based
  • Cluster
  • Nodes
  • Duplicated data
  • Distributed data platform
  • Performance should scale linearly when more nodes are added to the cluster
  • Writes in Cassandra are cheaper and less problematic than reads
  • Denormalization and duplication are encouraged; Cassandra's efficiency comes partly from data duplication
  • Forget what you know about joins in an RDBMS, because there are no joins in Cassandra

In Cassandra you have clusters and nodes, and you want to make sure that during writes, data is written evenly across the nodes of the cluster. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. To increase read efficiency, make sure that data is read from as few nodes as possible.

The Cassandra stress test below, run with the consistency level set first to ALL and then to ONE, illustrates why it is better to read from as few nodes as possible:


  • cassandra-stress read n=2000 cl=ALL no-warmup -rate threads=1


Stress Test Cassandra


  • cassandra-stress read n=2000 cl=ONE no-warmup -rate threads=1



Isolate clusters by functional area and criticality. Use cases with similar criticality from the same functional area share a cluster but reside in different keyspaces (databases). Determine your queries first and build the model around those queries: design and think about the query patterns up front, and design the column families ahead of time as well. Another reason to follow this rule is that, unlike a relational database, Cassandra does not make it easy to tune or introduce new query patterns later. In other words, you can't just add complex SQL (T-SQL, PL/SQL, etc.) or secondary indexes to Cassandra, because of its highly distributed nature.

At a high level, below are some of the things you need to determine about your query patterns:

  • Enforcing uniqueness in the result set
  • Filtering based on some set of conditions
  • Ordering by an attribute
  • Grouping by an attribute
  • Identify the most frequently used query pattern
  • Identify queries that are sensitive to latency

Create your queries to read from one partition. Keep in mind that your data is replicated to multiple nodes, so you can create individual queries that each read from a single partition. When a query has to read from multiple nodes, it must go to each individual node to get the data, and that takes time; when it gets the data from one node, it saves time.

An example would be the CREATE TABLE statements below:

CREATE TABLE users_by_email (name VARCHAR, dob TIMESTAMP, email VARCHAR, join_date TIMESTAMP, PRIMARY KEY (email));

CREATE TABLE users_by_join_date (name VARCHAR, dob TIMESTAMP, email VARCHAR, join_date TIMESTAMP, PRIMARY KEY (join_date, email));

The statements above create tables that let you read from one partition; in effect, each user gets their own partition.
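A minimal sketch of reading from these tables with the DataStax Python driver (the contact point, keyspace name and sample values are assumptions, not part of the original example):

# Requires the DataStax Python driver: pip install cassandra-driver
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # assumed local, unauthenticated node
session = cluster.connect("demo")     # assumed keyspace holding the tables above

# Each query targets exactly one partition, so only the replicas for that
# partition are involved in the read.
user = session.execute(
    "SELECT name, dob, join_date FROM users_by_email WHERE email = %s",
    ("jane@example.com",),
).one()

joined_that_day = session.execute(
    "SELECT name, email FROM users_by_join_date WHERE join_date = %s",
    (datetime(2015, 6, 1),),
)
for row in joined_that_day:
    print(row.name, row.email)

cluster.shutdown()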

If you are trying to fit a whole group into one partition, you can use a compound PRIMARY KEY, as in this example:

CREATE TABLE groups (groupname TEXT, username TEXT, email TEXT, join_date INT, PRIMARY KEY (groupname, username));
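With that compound key, the whole group lives in a single partition and its rows are clustered by username, so a query like the one below (using the same assumed driver session as in the earlier sketch; the group name is made up) returns every member of a group from one partition:

# Same assumed session as in the earlier sketch.
rows = session.execute(
    "SELECT username, email, join_date FROM groups WHERE groupname = %s",
    ("analytics",),
)
for row in rows:    # rows come back ordered by the clustering column, username
    print(row.username, row.email)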




NoSQL NoSecurity – Security issues with NoSQL Database

As more companies debate adopting a Big Data solution, some of the discussion that comes up is whether to use Hadoop or Spark, a NoSQL database, or to continue using their current RDBMS. The ultimate question is, "Is this technology for us?" NoSQL databases are highly scalable, provide better performance, are designed to store and process significant amounts of unstructured data at speeds up to 10 times faster than an RDBMS, and offer high availability and strong failover capabilities. So why hesitate to use a NoSQL database?

Security is a major concern for enterprise IT infrastructures, and security in NoSQL databases is very weak: authentication and encryption are almost nonexistent, or very weak when implemented. The following are security issues associated with NoSQL databases:

  • Administrative users and authentication are not enabled by default
  • Very weak password storage
  • Clients communicate with the server in plaintext (MongoDB)
  • No integration with external authentication tools such as LDAP or Kerberos
  • Lack of encryption support for the data files
  • Weak authentication between the client and the servers
  • Vulnerability to injection attacks (the NoSQL counterpart of SQL injection)
  • Susceptibility to denial-of-service attacks
  • Data at rest is unencrypted
  • The available encryption solutions aren't production ready
  • Encryption isn't available for client communication

With all of these security problems, it is best to understand that NoSQL databases are still new technologies, and more security enhancements will be added in newer versions. Enterprise Cassandra packages provided by companies like DataStax do add significant security enhancements, and hence are more secure and give companies much of the security they need.

DataStax Enterprise provides:

  • Client-to-node encryption for Cassandra: an optional, secure form of communication from the client machine to the database cluster (client-to-server SSL). This ensures that data is not compromised in flight (a short connection sketch follows this list).
  • Administrators can create, drop and alter internal users using CQL; these users are authenticated against the Cassandra database cluster
  • Permissions can be granted to users to perform certain tasks after their initial authentication
  • JMX authentication can be enabled, and tools such as nodetool and DataStax OpsCenter can be configured to use it
  • The ability to configure and use external security tools like Kerberos
  • Transparent data encryption (TDE) to help protect at-rest data (at-rest data is data that has been flushed from the memtable in system memory to the SSTables on disk)
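As an illustration of the first two items above, here is a minimal, hedged sketch of a client connecting with internal authentication and client-to-node SSL using the DataStax Python driver; the contact point, credentials and certificate path are placeholders, and the ssl_context argument is available in driver versions 3.17 and later:

# Requires the DataStax Python driver: pip install cassandra-driver
import ssl
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Assumption: the cluster has authentication and client-to-node encryption
# enabled, and the CA certificate used by the nodes is available locally.
ssl_context = ssl.create_default_context(cafile="/path/to/cluster-ca.pem")
ssl_context.check_hostname = False   # depends on how the node certificates were issued

auth = PlainTextAuthProvider(username="app_user", password="app_password")  # placeholder credentials

cluster = Cluster(["10.0.0.10"], auth_provider=auth, ssl_context=ssl_context)
session = cluster.connect()

# Once authenticated, internal users and permissions are managed with CQL, for example:
#   CREATE ROLE reporting WITH PASSWORD = 'change_me' AND LOGIN = true;
#   GRANT SELECT ON KEYSPACE demo TO reporting;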

Hadoop, Spark, Cassandra, Oh My!

On Monday, I reviewed why Spark will not by itself replace Hadoop, but how Spark combined with other data storage and resource management technologies creates other options for managing Big Data. Today, in this blog post, we will investigate how an enterprise should proceed in this new "Hadoop is not the only option" world.

Open source Hadoop and NoSQL are moving pretty fast; no wonder some companies might feel like Dorothy in the Land of Oz. To make sense of things and find the yellow brick road, first we need to understand Hadoop's market position.
HDFS and YARN can support the storage of just about any type of data. With Microsoft's help, it is now possible to run Hadoop on Windows and leverage .NET, although most deployments run on Linux and Java. HBase, Hive, and Cassandra can all run on top of Hadoop. Hadoop and Hive support is quickly becoming ubiquitous across data discovery, BI, and analytics toolsets. Hadoop is also maturing fast from a security perspective: thanks to Hortonworks, Apache Knox and Ranger have delivered enterprise security capabilities, and Cloudera and IBM both have their own stories on enterprise security as well. WANdisco provides robust multi-site data replication and state-of-the-art data protection. The bottom line is that Hadoop has matured, is continuing to mature, AND there is an extensive amount of support from most vendors and related Apache open-source projects.

Spark Gathers More Momentum

Yesterday, IBM threw its weight behind Spark. This announcement is significant because it is a leading indicator of a transition from IT-focused Big Data efforts to business-driven analytics and Big Data investments. If you are interested in learning more about this announcement and what it means in the bigger picture, I wrote a blog entry on our IBM blog, which can be found here.

Big Data Challenges

As companies start adapting to handle Big Data, the challenges still remain. Barring the obvious applications, the challenge of getting value out of the new-found data continues to be at the top of the list. ROI and potential revenues are yet to be realized; as the technology and its usage become more sophisticated, we will start to see results.

From an IT perspective, the top two challenges are governance and skills. Securing Big Data for greater use within the organization is complex, and the technology is still evolving. Securing the right people with in-depth knowledge of managing Big Data is an even bigger challenge. These two aspects feed into the larger challenge of how to get value out of Big Data.

Organizations find that the key resources who need to be driving Big Data are also the resources who are vital to managing the existing core enterprise applications. Balancing the precious time of these key resources and leveraging external thought leadership and expertise is essential to successful Big Data initiatives.

Identifying the strengths of the organization and prioritizing the critical areas of investment is equally important. Those areas include:

  • Managing Big Data initiatives (Value creation, thought leadership, Monetizing)
  • Managing the existing enterprise applications (integration, combining enterprise data)
  • Implementing Governance (security, oversight, process)
  • Infrastructure and tools (Hadoop, analytics, data in motion)

Data Quality – Don’t Fix It If It Ain’t Broke

What is broke?  If I drive a pickup truck around that has a small, unobtrusive crack in the windshield and a few dings in the paint, it will still pull a boat and haul a bunch of lumber from Home Depot. Is the pickup broke if it still meets my needs?

So, when is data broke? In our legacy data integration practices, we would profile data and identify everything that is wrong with it. Orphan keys, inappropriate values, and incomplete data (to name a few) would be identified before data was moved. In the more stringent organizations, data would need to be near perfect before it could be used in a data warehouse. This ideal world of perfect data was strived for, but rarely attained: it was too expensive, required too much business buy-in, and lengthened BI and DW projects.

Big Data Changes Everything – Has Your Governance Changed?

A few years ago, Big Data/Hadoop systems were generally a side project, either for storing bulk data or for analytics. But now, as companies have pursued a data unification strategy, leveraging the Next Generation Data Architecture, Big Data and Hadoop systems are becoming a strategic necessity in the modern enterprise.


Big Data and Hadoop are technologies with so much promise and a very broad and deep value proposition. So why are enterprises struggling to see real-world results from their Big Data investments? Simply put, it is governance.

Analytics in the Digital Transformation Era

Successful enterprises compete on many capabilities, ranging from product excellence to customer service and marketing, to name a few. Increasingly, the back office / Information Technology (IT) is becoming a strategic player in the digital business model that supports these key capabilities; in other words, back office/IT capability itself is becoming a differentiator. All of the key strategies, such as customer excellence, product excellence, and market segmentation, depend on a successful digital business model.

Having more data, especially noisy data, is complex to deal with, and new platforms and tools are a must to make it manageable. Working with internally captured enterprise data to answer strategic questions like "Should there be a pricing difference between life, annuities, and long-term care?" or to set a benchmark for "servicing cost per policy for life, annuities, and long-term care" can only go so far. Ingesting and integrating external data, including machine data, will change the way pricing and segmentation are done today.

In the technology space, a wide variety of capabilities in terms of tools, platforms, and architecture offer time-to-market opportunities, from leading-edge predictive and prescriptive models to enabling the business to operate and execute efficiently. What this all means is that the business has to embrace a digital transformation that is happening faster than ever.

Traditional Analytics


Key strategies from IT should include two kinds of applications and platforms, covering both new and traditional analytical methods. The first kind handles slow-moving or traditional enterprise data, which ends up in the warehouse and is made available for "what happened" questions, traditional reporting, and business intelligence / analytics.

Fast Analytics

The second kind is the real-time analytical response to the interactive customer, keeping in constant touch through multiple channels while providing seamless interaction and user experience. The technologies, platforms, architectures and applications are different for these two types of processing.

In the new world of information management, traditional enterprise applications and the data warehouse become just another source rather than the complete source of data. Even the absence of data is relevant information if the context is captured. Analytics is becoming more real-time, with adaptive algorithms influencing different outcomes based on the contextual data. Building modern information platforms to address these two different needs of the enterprise is becoming the new standard.

Lambda Architecture for Big Data – Quick peek…

In the Big Data world, the Lambda architecture created by Nathan Marz is a standard technique applied to many predictive analytics problems. The architecture delivers both streaming data and batch data, combining past information with current changes to produce a comprehensive platform for a predictive framework.


Lambda Architecture

At a very high, generic level, the architecture has three components:

  • The batch layer, which holds all the processed batch data from the past.
  • The speed layer, a real-time feed of similar or the same information.
  • The serving layer, which holds the batch views relevant to the queries needed by the predictive analytics.

The Lambda architecture also addresses the issue that the intended output can change because of code changes: enhancements to the processing code are accommodated by keeping the original input data intact and read-only, so results can be recomputed from it. Whether the Lambda architecture is an exception to the CAP theorem, as some claim, is debatable.

In reality, programming for the batch path and the stream path typically requires two different sets of code. This is an issue because business logic and other enhancements have to be maintained in two different places. Creating a single API over both batch and real-time data is one way to hide the complexity from higher-level code, but the fact remains that there are two different processing branches at the lower level.
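As a small illustration of that idea, here is a plain-Python sketch (with made-up view contents) that hides the batch view and the speed layer's incremental view behind one query function; the merge rule simply sums counts, which only makes sense when the speed layer holds deltas not yet reflected in the batch view:

from collections import defaultdict

# Hypothetical precomputed batch view: page -> total count up to the last batch run.
batch_view = {"/home": 10200, "/pricing": 3450}

# Hypothetical speed-layer view: counts for events that arrived after that run.
realtime_view = defaultdict(int, {"/home": 37, "/signup": 5})

def page_view_count(page):
    """Single query API over both layers: batch total plus real-time delta."""
    return batch_view.get(page, 0) + realtime_view[page]

# When the next batch run finishes, the batch view is replaced wholesale and the
# speed-layer counts it now covers are discarded.
print(page_view_count("/home"))    # 10237
print(page_view_count("/signup"))  # 5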

Extended Lambda Architecture

Assuming you are satisfied with the limitations of the Lambda architecture, most predictive analytics also needs past data along with the data captured within the enterprise. Including those key data sets will enhance the overall quality and provide the most complete data available to the predictive engine.

As the industry matures, these techniques will become more robust and will provide the best available data faster than ever. Just as we now take star schemas and their variations as a given for data warehousing, the Lambda architecture and its variations will be just as prevalent in the near future.