Skip to main content

Development

Column-Oriented Storage for Massive Data

We have been using a number of database management system (DBMS) such as SQL Server, DB2, and Oracle, but probably we don’t care about how the data is stored and organized in disk. Actually most of popular DBMS are built upon the row-oriented structure which stores its content by row; while today’s DB market, column-oriented DB is more attractive and becoming a key part of massive data storage & computing engine. The famous database vendors for column-oriented include Teradata, Sybase IQ, MonetDB and Vertica.

What is Column-Oriented?

Comparing to the concept of row-oriented, column-oriented DBMS stores data tables as sections of columns of data rather than as rows of data, you can imagine that the data for the same column are stored in the same section and another column dataset for another section. It will help to understand through an example in which we have two dimensional tables logically regarding employee info:

Employee Id

First Name

Last Name

Phone Number

1

David

Yang

13812345678

2

John

Lu

13823456789

3

Smith

Zhang

13834567890

The data is serialized into one form or another in the hard disk. In the row-oriented system the storage will be like:

#1:1,David,Yang, 13812345678;#2:2, John, Lu, 13823456789;#3:3, Smith, Zhang, 13834567890;

While in the column-oriented solution, the format will be:

1:#1,2:#2,3:#3;David:#1,John:#2,Smith:#3;Yang:#1,Lu:#2,Zhang:#3; 13812345678:#1, 13823456789:#2,13834567890:#3;

In general the column-oriented system serializes all of the values of a column together, then the values of the next column, and so on. Through this example we can understand the storage in DBMS.

Why Column-Oriented?

Whatever row or column oriented data storage has its advantage and disadvantage. Column-oriented data system possess advantages in data warehouses, customer relationship management (CRM) systems, library card catalogs, and other ad-hoc inquiry systems where aggregates are computed over large numbers of similar data items. Particularly over large data volume, there are much research and recommendation of column-oriented data store as it performed outstanding. Two major features in computing efficiency from wikipedia;

  • Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
  • Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.

In data warehousing or ad-hoc analytics practice, the users don’t compose their query over many fields but may cover large historical data sets, thus column-oriented form is really good choice.

Column Family Oriented in HBase

We know that HBase is key part of the Hadoop family big data open source framework. By nature HBase acts just as its definition “Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable.”

In HBase, the important concept is Column family instead of column itself. Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family.

In HBase table structure, it must contain Primary key, versioning (timestamp) and column family. For example:

Row Key Time Stamp Column Family anchor
“com.cnn.www” t9 anchor:cnnsi.com = “CNN”
“com.cnn.www” t8 anchor:my.look.ca = “CNN.com”

“com.cnn.www” is the primary key, it has 2 versions t9, t8 and usually t9 is the latest one. It also contains column family Anchor.

This is a column-oriented organization with key-versioning-value combination, which is serialized into Hadoop File System (HDFS) with multiple nodes. Through the special HBase operation such as Get, Put, Scan, Delete, the developers or users can easily retrieve the filtered data and can also populate data from other data source into HBase system.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Kent Jiang

Currently I was working in Perficient China GDC located in Hangzhou as a Lead Technical Consultant. I have been with 8 years experience in IT industry across Java, CRM and BI technologies. My interested tech area includes business analytic s, project planning, MDM, quality assurance etc

More from this Author

Follow Us
TwitterLinkedinFacebookYoutubeInstagram