Before 2000, the primary challenge for companies was to build systems that could capture transactional data faster and improve organizational productivity. The focus has now shifted towards delivering information to business users through reporting, analytical systems and actionable drill-down dashboards, drawing on the files, data, and audio and video streams that organizations have stored on proprietary clustered and open source file systems chosen to suit their business needs.
Organizations have been using storage technology for decades to keep information on clustered file systems mounted on multiple servers (and some that are not clustered), but the complexity of the underlying storage environment grows as new servers and systems are added for scalability.
Organizations that adopted big data storage in the past and now want to better analyze, monetize and capitalize on their information channels, and integrate them with the business, are facing challenges around large-scale data indexing and on-demand availability with low latency. Some of them are therefore choosing, changing, or integrating with better enterprise-class large-scale data storage through connector technologies, so they can capture and monetize the value of their information in less time.
We should understand the technologies, protocols and network challenges before adopting a new big data storage platform and its features. There are a few architectural approaches to how clustering works in such scenarios.
Shared Disk: This approach uses a SAN (storage area network) at the block level. It again has a few variants: some distribute file information across all servers in the cluster, while others employ a centralized metadata server. SGI CXFS, IBM GPFS, DataPlow, Microsoft CSV, Oracle OCFS, Red Hat GFS, Sun QFS, VMware VMFS and Ceph are among the most widely used cluster file systems.
Distributed file system: This approach uses a network protocol. Lustre's data storage technology is very popular here, Ceph also offers this model, and Microsoft has DFS as well.
NAS: This uses file-based protocols such as NFS and SMB (Server Message Block)/CIFS (Common Internet File System).
Shared Nothing Architecture: Each storage node communicates its changes to the other nodes, or to a master, for replication. Ceph, Lustre and Hadoop are among the implementers of this approach.
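To make the shared-nothing idea concrete, here is a toy Python sketch. It is a hypothetical illustration, not code from Ceph, Lustre or Hadoop: each node keeps its own private store, key ownership is decided by hashing, and every write is forwarded to the owner's replica peers.

```python
# Toy sketch of shared-nothing storage: nodes share no disk or memory;
# each node owns a key range and forwards writes to its replica peers.
import hashlib

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}      # local, private storage; nothing is shared
        self.replicas = []   # peer nodes holding copies of this node's data

    def put(self, key, value):
        self.store[key] = value
        for peer in self.replicas:        # communicate the change to peers
            peer.store[key] = value

class Cluster:
    def __init__(self, nodes, replication=2):
        self.nodes = nodes
        for i, node in enumerate(nodes):  # simple ring-style replica placement
            node.replicas = [nodes[(i + j) % len(nodes)]
                             for j in range(1, replication)]

    def owner(self, key):
        # hash partitioning: the key alone decides which node owns it
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self.owner(key).put(key, value)

cluster = Cluster([Node("n1"), Node("n2"), Node("n3")])
cluster.put("sales/2024.csv", b"...block of data...")
```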
Selecting the most suitable technology reduces time to solution and keeps budgets under control, so based on the above architectures, let us list the general and most critical SLAs expected from these solutions.
Common selection parameters for a big data storage technology:
- High availability
- Scalability of data storage within fixed IT budgets
- Fault tolerance
- Cost of ownership; ability to run on commodity hardware
- Global workload sharing
- MapReduce support over high-bandwidth networks (see the sketch after this list)
- Reduced time to solution
- Centralized storage management
- Support for a wide range of hardware and software
- High application I/O throughput for analytical systems
- Event stream processing/storage
- Caching of data for better performance
- Holistic network design (Unified Ethernet Fabric)
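Since MapReduce support appears in the list above, a minimal, framework-free sketch of the paradigm may help. This is a plain Python illustration of the map, shuffle and reduce phases, not the Hadoop API; real engines distribute these phases across the storage cluster.

```python
# Word count as a single-process MapReduce sketch.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data storage", "big data analytics", "storage scales out"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'big': 2, 'data': 2, 'storage': 2, 'analytics': 1, 'scales': 1, 'out': 1}
```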
List of big data storage technologies available in the market
Most of the features listed above are claimed to be supported by the technologies below, except Unified Ethernet Fabric, which is a separate case; Cisco has network offerings to scale out big data storage.
HDFS
It is the de facto solution in big data technology for large-scale data processing over clusters of commodity hardware and is very well suited to that role. However, if you are trying to process dynamic datasets (data in motion), ad-hoc analysis or graph data structures, stop and read about Google's alternatives to the MapReduce paradigm (Percolator, Dremel and Pregel). Cassandra and enterprise distributions built on HDFS are trying to provide improvements and solutions in this area.
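To show what working with HDFS looks like in practice, here is a minimal sketch assuming the HdfsCLI Python package and a WebHDFS endpoint; the namenode address, user and paths below are placeholders, not values from any particular cluster.

```python
# Minimal HDFS read/write sketch over WebHDFS using the HdfsCLI package.
from hdfs import InsecureClient

# Placeholder namenode URL and user; adjust to your cluster.
client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

# Write a small file into HDFS (blocks are replicated across commodity nodes).
client.write('/data/events/2024/part-0000.csv',
             data=b'id,amount\n1,100\n2,250\n',
             overwrite=True)

# Read it back; HDFS is optimized for large sequential scans like this.
with client.read('/data/events/2024/part-0000.csv') as reader:
    print(reader.read())

# List what is stored under the directory.
print(client.list('/data/events/2024'))
```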
GPFS
It has been available in the market since 1993 and thousands of organizations use it (pharmaceuticals, financial institutions, life sciences, US national weather forecasting, the energy sector, etc.). It also runs on commodity hardware and supports many operating systems and platforms. It claims to handle low-latency ad-hoc analysis and streaming data at very high volumes. It is a proprietary offering from IBM and certainly one of the most suitable big data storage options if licensing is not a concern. Cluster manager failure, file system manager failure, secondary cluster/configuration server failure and rack failure are all claimed to be addressed by GPFS-SNC.
Lustre Distributed File System
This is a well-recognized, scalable cluster file storage system that is widely used by supercomputers, and it has open licensing. There are many commercial suppliers that bundle it with hardware, such as NetApp and Dell. It also claims to fulfill all of the requirements listed above, including low latency for analytical systems.
Isilon's OneFS by EMC
This is a major entry in the big data storage arena, and companies like Oracle and IBM are also trying to tame this big data beast. In 2011 EMC re-engineered HDFS and created its own version of the data storage layer, which uses the MapR File System. The MapR File System claims to be an alternative to HDFS with full random-access read/write support, plus advanced snapshot and mirroring features, and it addresses the single point of failure created by HDFS's centralized NameNode metadata.
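To illustrate what "full random-access read/write" means, the sketch below assumes a hypothetical POSIX mount (e.g. an NFS-exported MapR volume at the placeholder path /mapr/cluster). An in-place overwrite like this works on a POSIX mount but is not possible on stock HDFS, which only supports append-style writes.

```python
# Hypothetical example: /mapr/cluster is a placeholder for a POSIX-mounted
# (e.g. NFS-exported) volume. Seeking into the middle of an existing file
# and overwriting bytes in place is a random write.
import os

path = '/mapr/cluster/data/events.log'   # placeholder path

with open(path, 'r+b') as f:
    f.seek(1024)                  # jump to an arbitrary offset
    f.write(b'corrected-record')  # overwrite in place

print(os.path.getsize(path))
```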
NetApp's RAID array on HDFS
NetApp claims its improvements to HDFS make it faster and more reliable, but the offering still relies on HDFS.
Cleversafe's dispersed storage file system is another option: highly scalable, object-based storage.
Appistry and KosmosFS are a few more computational big data storage options.
Conclusion
In order to monetize information and present actionable insights to the business for proactive organizational decision making, analytical systems rely heavily on data storage technologies, which determine how quickly data is made available to front-end and middleware applications. As we know, the GFS-inspired HDFS is cheap and rock solid, but for scalability into the petabyte range perhaps the better fit is an enterprise-class solution from EMC, IBM (GPFS) or one of the many others available in the market.
But remember that many commercial offerings do not run on commodity hardware, and the cost advantage of HDFS and its related bundles is fundamental to their current success and growing popularity. The low-latency issues with HDFS can be addressed with the right design and implementation by skilled big data experts. Whether an organization chooses a pre-bundled commercial offering or open source HDFS, which allows flexible customization to organizational needs, ultimately depends on the business requirements and the investment budget for the solution.