One of the benefits of Hadoop is that it can be configured to address a number of diverse business challenges and integrated into a variety of enterprise information ecosystems. With proper planning, these analytical big data systems have proven to be valuable assets for companies. However, without significant attention to data architecture best practices, this flexibility can turn into a crude April Fool’s joke: a system that is difficult to use and expensive to maintain.
- Establish and Adhere to Data Standards – A data scientist should be able to easily find the data he/she is seeking and not have to worry about converting code pages, changing delimiters, and unpacking decimals. Establish a standard encoding and delimiter, stick to them, and convert incoming data to that standard during the ingestion process.
- Implement a Metadata Configured Framework – Remember when ETL was all hand-coded? Don’t repeat the sins of the past by creating a vast set of point-to-point custom Sqoop and Flume jobs, which will quickly become a support nightmare. If the cost of a commercial off-the-shelf (COTS) ETL tool is prohibitive, then build a data ingestion and refining framework from a small number of components that can be configured using metadata. The goal is for a new data feed to be added by configuring a few lines of metadata rather than by scripting or writing code for each feed.
- Organize Your Data – This practice may seem obvious; however, we have seen a number of Hadoop implementations that look like a network file share rather than a standards-driven data environment. Establish a directory structure that allows for the different flavors of data. Incremental data (aka deltas), consolidated data, transformed data, user data, and data stored in Hive should be separated into different directory structures. Adopt a directory naming convention, then publish the standard so that data scientists and other users can find the data they are seeking.
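To make the three practices above concrete, here is a minimal Python sketch of how they might fit together. Everything in it is an assumption for illustration only: the house standard (UTF-8, pipe-delimited), the `FeedConfig` fields, the zone names, the `/data/<zone>/<feed>/load_date=<date>` layout, and the sample feeds are hypothetical and not taken from any particular tool or implementation. The point is only that a new feed becomes one more metadata entry, while the conversion and path logic stay generic.

```python
import csv
import io
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical house standard: every ingested file ends up in this form.
STANDARD_ENCODING = "utf-8"
STANDARD_DELIMITER = "|"

# Hypothetical top-level zones, one per "flavor" of data.
ZONES = {"incremental", "consolidated", "transformed", "user", "hive"}


@dataclass(frozen=True)
class FeedConfig:
    """Metadata describing one feed; field names are illustrative."""
    name: str
    source_encoding: str
    source_delimiter: str
    zone: str


# Adding a feed means adding a line here, not writing a new job.
FEED_METADATA = [
    {"name": "orders", "source_encoding": "cp1252",
     "source_delimiter": ",", "zone": "incremental"},
    {"name": "customers", "source_encoding": "latin-1",
     "source_delimiter": ";", "zone": "consolidated"},
]
FEEDS = {row["name"]: FeedConfig(**row) for row in FEED_METADATA}


def normalize(raw: bytes, feed: FeedConfig) -> bytes:
    """Convert a delimited file to the standard encoding and delimiter."""
    text = raw.decode(feed.source_encoding)
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=feed.source_delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=STANDARD_DELIMITER,
                        lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue().encode(STANDARD_ENCODING)


def target_path(feed: FeedConfig, load_date: str) -> str:
    """Build the target directory from the published naming convention."""
    if feed.zone not in ZONES:
        raise ValueError(f"unknown zone: {feed.zone}")
    return str(PurePosixPath("/data") / feed.zone / feed.name
               / f"load_date={load_date}")


if __name__ == "__main__":
    feed = FEEDS["orders"]
    raw = "id,name\n1,José\n".encode("cp1252")
    print(target_path(feed, "2024-04-01"))
    print(normalize(raw, feed).decode(STANDARD_ENCODING))
```

In a real deployment the metadata would more likely live in a configuration file or table than in the script, and the normalized bytes would be written to HDFS at the computed path; both steps are left out here to keep the sketch self-contained.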
Addressing these three best practices will help ensure that your Big Data environment is usable and maintainable. If you are implementing or considering a Big Data solution, Perficient has the thought leadership, partnerships, and experience to make your Big Data program a success.