Column-Oriented Storage for Massive Data / Blogs / Perficient

We have been using a number of database management system (DBMS) such as SQL Server, DB2, and Oracle, but probably we don’t care about how the data is stored and organized in disk. Actually most of popular DBMS are built upon the row-oriented structure which stores its content by row; while today’s DB market, column-oriented DB is more attractive and becoming a key part of massive data storage & computing engine. The famous database vendors for column-oriented include Teradata, Sybase IQ, MonetDB and Vertica.

What is Column-Oriented?

Comparing to the concept of row-oriented, column-oriented DBMS stores data tables as sections of columns of data rather than as rows of data, you can imagine that the data for the same column are stored in the same section and another column dataset for another section. It will help to understand through an example in which we have two dimensional tables logically regarding employee info:

Employee Id	First Name	Last Name	Phone Number
1	David	Yang	13812345678
2	John	Lu	13823456789
3	Smith	Zhang	13834567890

The data is serialized into one form or another in the hard disk. In the row-oriented system the storage will be like:

#1:1,David,Yang, 13812345678;#2:2, John, Lu, 13823456789;#3:3, Smith, Zhang, 13834567890;

While in the column-oriented solution, the format will be:

Choosing a Global Software Development Partner to Accelerate Your Digital Strategy

To be successful and outpace the competition, you need a software development partner that excels in exactly the type of digital projects you are now faced with accelerating, and in the most cost effective and optimized way possible.

Get the Guide

1:#1,2:#2,3:#3;David:#1,John:#2,Smith:#3;Yang:#1,Lu:#2,Zhang:#3; 13812345678:#1, 13823456789:#2,13834567890:#3;

In general the column-oriented system serializes all of the values of a column together, then the values of the next column, and so on. Through this example we can understand the storage in DBMS.

Why Column-Oriented?

Whatever row or column oriented data storage has its advantage and disadvantage. Column-oriented data system possess advantages in data warehouses, customer relationship management (CRM) systems, library card catalogs, and other ad-hoc inquiry systems where aggregates are computed over large numbers of similar data items. Particularly over large data volume, there are much research and recommendation of column-oriented data store as it performed outstanding. Two major features in computing efficiency from wikipedia;

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.

In data warehousing or ad-hoc analytics practice, the users don’t compose their query over many fields but may cover large historical data sets, thus column-oriented form is really good choice.

Column Family Oriented in HBase

We know that HBase is key part of the Hadoop family big data open source framework. By nature HBase acts just as its definition “Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable.”

In HBase, the important concept is Column family instead of column itself. Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family.

In HBase table structure, it must contain Primary key, versioning (timestamp) and column family. For example:

Row Key	Time Stamp	Column Family anchor
“com.cnn.www”	t9	anchor:cnnsi.com = “CNN”
“com.cnn.www”	t8	anchor:my.look.ca = “CNN.com”

“com.cnn.www” is the primary key, it has 2 versions t9, t8 and usually t9 is the latest one. It also contains column family Anchor.

This is a column-oriented organization with key-versioning-value combination, which is serialized into Hadoop File System (HDFS) with multiple nodes. Through the special HBase operation such as Get, Put, Scan, Delete, the developers or users can easily retrieve the filtered data and can also populate data from other data source into HBase system.

Column-Oriented Storage for Massive Data

by Kent Jiang on July 3rd, 2013 | ~ minute read

Choosing a Global Software Development Partner to Accelerate Your Digital Strategy

Tags

Leave a Reply

Kent Jiang

Categories

Follow Us