Big Data and You: What is Data Variety?

Welcome to “Big Data and You (the enterprise IT leader),” the Enterprise Content Intelligence group’s demystification of “Big Data.”  Of the three V’s of big data processing (Volume, Velocity, and Variety), Variety is perhaps the least understood.  The modern business landscape constantly changes due to the emergence of new types of data, and the ability to handle data variety and use it to your advantage has become more important than ever.  With traditional data frameworks, ingesting different types of data and building the relationships between records is expensive and difficult, especially at scale.  Not all paths of inquiry and analysis are apparent to a business at first.  Perhaps one day the relationship between user comments on certain webpages and sales forecasts becomes interesting; once you have built your relational data structure, accommodating that analysis is nearly impossible without restructuring your model.

What makes big data tools ideal for handling Variety?

The key is flexibility.  In general, big data tools are less concerned with the type of data and the relationships between records than with how to ingest, transform, store, and access that data.  Apache Pig, a high-level abstraction over the MapReduce processing framework, embodies this flexibility.  Pig jobs are automatically parallelized and distributed across a cluster, and a single process can contain multiple data pipelines.  New data fields can be ingested with ease, and nearly all of the data types familiar from traditional database systems are available.  In addition, Pig natively supports a more flexible data structure called a “databag”: a collection of tuples that can hold data of varying size, type, and complexity.

Sample Data Grouped into a Pig DataBag
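
To make that idea concrete, here is a minimal Java sketch using Pig’s org.apache.pig.data API that builds a databag whose tuples differ in size and type; the field values are purely illustrative:

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class DataBagExample {
        public static void main(String[] args) throws Exception {
            TupleFactory tupleFactory = TupleFactory.getInstance();
            BagFactory bagFactory = BagFactory.getInstance();

            // A databag is a collection of tuples, and the tuples do not
            // have to share a size or schema.
            DataBag bag = bagFactory.newDefaultBag();

            Tuple pageView = tupleFactory.newTuple(2);
            pageView.set(0, "page_17");        // chararray
            pageView.set(1, 42);               // int

            Tuple comment = tupleFactory.newTuple(3);
            comment.set(0, "page_17");
            comment.set(1, "Great product!");  // tuples can vary in type...
            comment.set(2, 4.5);               // ...and in length

            bag.add(pageView);
            bag.add(comment);
            System.out.println(bag.size());    // 2
        }
    }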


Transformation and storage of data in Pig occur through built-in functions as well as UDFs (User Defined Functions).  These functions can be written as standalone procedures in Java, JavaScript, or Python, and can be reused at will within a Pig process.  Storage methods for writing data to different file formats are available natively and in common Pig UDF repositories, and custom load and store functions exist for big data storage tools such as Hive, HBase, and Elasticsearch.

Sample Pig UDF Written in Java
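
For readers without the image, a minimal sketch of such a UDF might look like the following; the class name and logic are illustrative rather than a reproduction of the original figure:

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A simple eval UDF that upper-cases a chararray field.
    public class Upper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

Once compiled and registered in a Pig script (e.g., REGISTER myudfs.jar, where the jar name is hypothetical), the function can be invoked like any built-in.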

Flexibility in data storage is offered by tools such as Apache HBase and Elasticsearch.  Which storage system will provide the most efficient and expedient processing of and access to your data depends on the access patterns you anticipate.  HBase, for example, stores data as key/value pairs, allowing for quick random look-ups.  If the access pattern for the data changes, the data can easily be duplicated in storage under a different set of keys.  This practice highlights one of the core differences between relational database systems and big data storage: instead of normalizing the data, splitting it between multiple data objects, and defining relationships between them, data is duplicated and denormalized for quicker and more flexible access at scale.  Elasticsearch, on the other hand, is primarily a full-text search engine, offering multi-language support, fast querying and aggregation, geolocation support, autocomplete, and other features that open up a wide range of access patterns.
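
As a sketch of that duplicate-and-denormalize pattern, using the standard HBase Java client (the table name, column family, and row-key layout are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CommentStore {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("comments"))) {

                byte[] cf = Bytes.toBytes("d");
                byte[] text = Bytes.toBytes("text");
                byte[] value = Bytes.toBytes("Great product!");

                // Row key designed for "all comments on a page" look-ups.
                Put byPage = new Put(Bytes.toBytes("page_17#2016-01-15#user42"));
                byPage.addColumn(cf, text, value);
                table.put(byPage);

                // The same record duplicated under a second row key, so that
                // "all comments by a user" is also a fast random look-up.
                Put byUser = new Put(Bytes.toBytes("user42#2016-01-15#page_17"));
                byUser.addColumn(cf, text, value);
                table.put(byUser);

                Result result = table.get(new Get(Bytes.toBytes("page_17#2016-01-15#user42")));
                System.out.println(Bytes.toString(result.getValue(cf, text)));
            }
        }
    }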

One of the places where a large amount of data is lost from an analytical perspective is Electronic Medical Records (EMR).  With a relational database system, all you can analyze is the data that fits into nicely normalized, structured fields.  With big data technologies like Pig and Elasticsearch, you can unlock valuable unstructured physician data such as written notes and comments from doctors’ visits.  With the MapReduce framework, you can begin large-scale processing of medical images to assist radiologists, or expose the images in friendly formats via a patient portal.  With Kafka, Storm, HBase, and Elasticsearch, you can collect data from at-home monitoring sources (anything from pacemaker telemetry to Fitbit data) at scale and in real time.  The flexibility provided by big data allows you to start building databases that correlate measurements to outcomes and to explore the predictive power of your data.
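
To sketch the ingestion end of such a pipeline, here is a minimal Kafka producer in Java that publishes a telemetry reading; the topic name, patient key, and JSON payload are all hypothetical:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TelemetryProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Keying by patient keeps each patient's readings ordered
                // within a partition for downstream consumers (e.g., Storm).
                String patientId = "patient-123";
                String reading = "{\"device\":\"pacemaker\",\"bpm\":72,\"ts\":1453819200}";
                producer.send(new ProducerRecord<>("telemetry", patientId, reading));
            }
        }
    }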

[Thanks to Eric Walk for his contributions]
