Three Big Data Best Practices

One of the benefits of Hadoop is that it can be configured to address a number of diverse business challenges and integrated into a variety of enterprise information ecosystems. With proper planning, these analytical big data systems have proven to be valuable assets for companies. However, without significant attention to data architecture best practices, this flexibility can turn into a crude April Fool’s joke: a system that is difficult to use and expensive to maintain.

At Perficient, we typically recommend a number of best practices for implementing Big Data. Three of these practices are:

  1. Establish and Adhere to Data Standards – A data scientist should be able to easily find the data they are seeking without having to worry about converting code pages, changing delimiters, or unpacking decimals. Establish a standard and stick to it, then convert the data to the standard encoding and delimiter during the ingestion process.
  2. Implement a Metadata Configured Framework – Remember when ETL was all hand-coded? Don’t repeat the sins of the past by creating a vast set of point-to-point custom Sqoop and Flume jobs; these will quickly become a support nightmare. If the cost of a COTS ETL tool is prohibitive, then build a data ingestion and refining framework from a small number of components that can be configured using metadata. The goal is for a new data feed to be added by configuring a few lines of metadata rather than scripting or writing code for each feed.
  3. Organize Your Data – This practice may seem obvious; however, we have seen a number of Hadoop implementations that look like a network file share rather than a standards-driven data environment. Establish a directory structure that allows for the different flavors of data. Incremental data (aka deltas), consolidated data, transformed data, user data, and data stored in Hive should be separated into different directory structures. Adopt a directory naming convention, then publish the standard so that data scientists and users can find the data they are seeking.
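
As a rough illustration of how the three practices fit together, here is a minimal Python sketch of a metadata-configured ingestion step. Everything in it is an assumption made for the example, not a description of any particular framework: the `FeedConfig` fields, the UTF-8 and pipe-delimited standard, the zone names, and the `/data/...` path convention are all hypothetical. The point is only that each feed is a few lines of metadata, and that one shared component converts every feed to the standard and decides where it lands.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical house standard: every landed file is UTF-8 and pipe-delimited.
STANDARD_ENCODING = "utf-8"
STANDARD_DELIMITER = "|"

# Hypothetical top-level zones, one per flavor of data.
ZONES = {"incremental", "consolidated", "transformed", "user", "hive"}


@dataclass(frozen=True)
class FeedConfig:
    """Metadata describing one feed (all field names are illustrative)."""
    name: str
    source_encoding: str
    source_delimiter: str
    zone: str


def target_dir(feed: FeedConfig, load_date: str) -> str:
    """Build the landing directory from the published naming convention."""
    if feed.zone not in ZONES:
        raise ValueError(f"unknown zone: {feed.zone}")
    return f"/data/{feed.zone}/{feed.name}/load_date={load_date}"


def normalize(raw: bytes, feed: FeedConfig) -> bytes:
    """Convert a delimited file to the standard encoding and delimiter."""
    text = raw.decode(feed.source_encoding)
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=feed.source_delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=STANDARD_DELIMITER, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue().encode(STANDARD_ENCODING)


# Adding a new feed means adding one line of metadata, not a new script.
FEEDS = [
    FeedConfig("orders", "cp1252", ",", "incremental"),
    FeedConfig("customers", "utf-16", "\t", "consolidated"),
]

if __name__ == "__main__":
    sample = "id,name\r\n1,José\r\n".encode("cp1252")
    print(target_dir(FEEDS[0], "2015-01-31"))
    print(normalize(sample, FEEDS[0]).decode(STANDARD_ENCODING), end="")
```

In a real environment the metadata would more likely live in a table or configuration file than in code, and the normalized bytes would be written to HDFS under the directory that `target_dir` returns. The sketch stops short of that so it stays runnable without a cluster.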

Following these three best practices will help ensure that your Big Data environment is usable and maintainable. If you are implementing or considering a Big Data solution, Perficient has the thought leadership, partnerships, and experience to make your Big Data program a success.

Bill Busch

Bill is a Director and Senior Data Strategist leading Perficient's Big Data Team. Over his 27 years of professional experience, he has helped organizations transform their data management, analytics, and governance tools and practices. As a veteran in analytics, Big Data, data architecture, and information governance, he advises executives and enterprise architects on the latest pragmatic information management strategies. He is keenly aware of how to advise and lead companies through developing data strategies, formulating actionable roadmaps, and delivering high-impact solutions. As one of Perficient’s prime thought leaders for Big Data, he provides the visionary direction for Perficient’s Big Data capability development and has led many of our clients’ largest Data and Cloud transformation programs. Bill is an active blogger and can be followed on Twitter @bigdata73.
