Welcome to “Big Data and You (the enterprise IT leader),” the Enterprise Content Intelligence group’s demystification of “Big Data.” The words “Big Data” are thrown around boardrooms like a magic fix for any data-related shortcomings the enterprise has. “Big Data” is not a magic wand, but the technologies that fall under its umbrella really do have the potential to turn troves of unexplored knowledge into actionable insights. We’re setting out to cut through the buzzwords and get to the heart of how “Big Data” technologies can be leveraged to reshape our thinking (and the ways they cannot).
To understand the meteoric rise of the buzzword, a brief history lesson is in order. I’ve seen whole blog posts tracing the genesis of “Big Data” back to prehistoric times. Exposition for the modern IT shop, however, can start in 1997 with the debut of Google.
In 1997, Google changed the way we would think about data, though it took a while to sink in, both for them and for us. As Google set about solving its problem of ever-increasing data Volume, Variety and Velocity, it determined that a paradigm shift was necessary to keep up. With the increased popularity of the search engine, the average user began to expect relevant answers from infinitely large, and growing, pools of data, near instantaneously.
The papers Google published in 2003 and 2004, on the specialized distributed file system and on the MapReduce framework that runs on top of it, were a lightbulb moment for many in the information industry. While the concepts were simple, and not entirely novel, MapReduce and distributed file systems had not previously been formalized and framed as tools to be used independently of other systems.
After Google’s debut, its now-vanquished competitors began to play catch-up. Lucene and Nutch were developed by Doug Cutting to search indexes and create them, respectively. Through some foresight by Mr. Cutting and the other folks at Yahoo!, these technologies were donated to the Apache Software Foundation. What followed was a deluge of new development and a proliferation of new products. From this proliferation, we got Hadoop and its ecosystem, and Lucene and its children, Solr and Elasticsearch. These technologies enable everything from IBM’s incredible Jeopardy! champion, Watson, to modern system monitoring tools.
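For readers who want to see just how simple the MapReduce idea is, here is a minimal single-machine sketch in Python. It is an illustration only, not Google’s or Hadoop’s implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would spread each phase across many machines and a distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is rich data", "data is not magic"]
    print(word_count(docs))
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both steps can in principle run in parallel on separate machines, which is the property that made the pattern attractive at Google’s scale.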
What made these technologies so unique? Why did they change the way we think? Why are there so many of them?
Before the paradigm shift of 2003/4, data was defined as structured or unstructured. Structured data was accessed using SQL queries; unstructured data was images or other files that were found by identifier and returned whole. Earlier still, structured data was stored on the mainframe and access was limited to the capabilities of COBOL programs. The paradigm shift was the recognition that as data volume, variety and velocity grow, storage paradigms and access patterns must evolve to match the use case. Generic standards, while useful for training developers, will always have downsides for some kinds of access.
In the past you needed to think about the data structures: how many fields there are, how they are related to each other, how the objects are related to each other. In the new world order, we care less about the data itself and its intricacies than about where it’s coming from and how we want to access it. Are there situations where you need to search on every field, or will key-value retrieval suffice? We now care about intent more than forcing data into our pre-existing notions of efficiency.
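The difference between those two access patterns can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the `records` data and the helper names (`get_by_key`, `build_inverted_index`, `search`) are made up for the example, and the in-memory index only hints at the kind of structure a search engine such as Lucene maintains.

```python
from collections import defaultdict

# Hypothetical sample data: three customer records stored by identifier.
records = {
    "cust-1": {"name": "Ada", "city": "London"},
    "cust-2": {"name": "Grace", "city": "New York"},
    "cust-3": {"name": "Alan", "city": "London"},
}


def get_by_key(store, key):
    """Key-value retrieval: one lookup by identifier, record returned whole."""
    return store.get(key)


def build_inverted_index(store):
    """Search on every field: map each (field, value) pair to matching keys."""
    index = defaultdict(set)
    for key, record in store.items():
        for field, value in record.items():
            index[(field, str(value).lower())].add(key)
    return index


def search(index, field, value):
    """Return the sorted keys of all records whose field equals the value."""
    return sorted(index.get((field, value.lower()), set()))


if __name__ == "__main__":
    print(get_by_key(records, "cust-2"))
    index = build_inverted_index(records)
    print(search(index, "city", "London"))
```

Key-value retrieval needs nothing beyond the identifier, while searching on every field requires building and maintaining an extra index. That extra cost is only worth paying when the intended access pattern calls for it.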
So what is Big?
Big Data is Rich Data. Big Data technologies give us the ability to use the richness of the dataset with real-time responses: the ability to ask a new analytical question, on the fly, without rebuilding the data mart. These technologies rely on the cheapness of storage, the cheapness of memory, and the grace of the open-source community. As this series continues, we’ll be exploring some specific technologies, their intended use cases and some examples of how we’ve leveraged them to solve problems faced by our clients. The goal: enable you to make informed choices on “Big Data.” Big Data is not magic, but it is pretty incredible.