BI Tools – Source Control

This one’s easy. Every team needs source control. Period. No exceptions. A team of one needs source control.

Source control tools serve two primary purposes in BI development groups:

  1. Version control during development, including playing a part in code reviews and ensuring the security (backup) of the code base.
  2. Deployment support as a controlled repository for all code moving between environments (code promotion). This also implies the ability to group artifacts across all tools into a deployment group and track testing and support issues to a particular code base.

Source control isn’t unique to BI, but the variety of tools in a BI environment makes it unusually complex to manage.  Also, many of the tools may not create artifacts (files) that are easily controlled outside of a tool’s proprietary environment (think ETL or reporting tools).

Branching and Merging

When making recommendations to clients, I often take a multi-tier approach to source control to take advantage of integrated repositories in many tool platforms.  The general criteria for defining a source control strategy are:

  1. Usability – for each tool, what features make source control the most usable?  Often ETL, BI, modeling, etc. tools include integrated repositories that contribute to the development efforts beyond simple versioning.  Source control should not reduce productivity.
  2. Branch/merge support – how does each source control tool handle the possibility of multiple simultaneous development tracks (branches)?  In a production environment there will be bug fixes that need to be applied independently of the development branch, and in iterative environments on large teams, multiple development tracks may need to be managed.  Merging of binary artifacts must also be considered, although I’ve found that with a little scripting, most tools can handle it.
  3. Version labeling – the capability to group a particular version of each artifact into a release package.
  4. Scripting support – can the repositories be controlled from a programming and/or scripting environment?  Most can, and this is absolutely essential for continuous integration and “deploy from source control” processes.
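
To make the version labeling criterion concrete, here is a minimal sketch of a release manifest that groups one version of each artifact, across tools, under a single label. The `Artifact` type, the `build_manifest` function, and the tool names are all hypothetical and only illustrate the idea; a real repository would normally provide its own tagging or labeling feature.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    path: str       # repository-relative path of the exported artifact
    tool: str       # which tool produced it (illustrative labels only)
    content: bytes  # exported file contents


def build_manifest(label, artifacts):
    """Group exactly one version of each artifact under a release label."""
    entries = {}
    for artifact in artifacts:
        if artifact.path in entries:
            raise ValueError(f"duplicate artifact path: {artifact.path}")
        entries[artifact.path] = {
            "tool": artifact.tool,
            # The checksum pins the exact version included in the release.
            "sha256": hashlib.sha256(artifact.content).hexdigest(),
        }
    # Sort by path so the manifest is stable from one build to the next.
    return {"label": label, "artifacts": dict(sorted(entries.items()))}


manifest = build_manifest(
    "release-1.0",
    [
        Artifact("etl/load_sales.xml", "etl", b"<job/>"),
        Artifact("ddl/001_tables.sql", "database", b"CREATE TABLE sales (id INT);"),
    ],
)
```

Because the manifest records a checksum per artifact, a test or support issue can be tied back to one specific code base rather than to "whatever was in the repository that week."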

A concept I constantly preach is “deploy from source control.”  If you can’t deploy an entire environment directly from source control, you can’t (reasonably) be assured that the deployment is traceable and repeatable.  In practice this means that every script, every application, every static data set, etc. is applied to an environment build directly from source control via a deployment script that checks out the artifact and executes it against the environment appropriately.
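
As a rough illustration of "deploy from source control," the sketch below plans a deployment as a list of commands: check out a labeled release with Git, then apply each artifact through a handler chosen by file type. It assumes a Git repository and a release tag; `run-sql` and `load-static-data` are placeholder executables, and `plan_deployment`, `execute`, and the repository URL are hypothetical names invented for this example.

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical mapping from artifact type to the command that applies it.
# "run-sql" and "load-static-data" are placeholders, not real tools.
HANDLERS = {
    ".sql": ["run-sql", "--target", "{env}", "{file}"],
    ".csv": ["load-static-data", "--target", "{env}", "{file}"],
}


def plan_deployment(repo_url, tag, artifacts, env, workdir="build"):
    """Return the ordered commands that deploy one tagged release."""
    # Step 1: check out exactly the tagged release, nothing else.
    plan = [["git", "clone", "--branch", tag, "--depth", "1", repo_url, workdir]]
    # Step 2: apply each artifact, in the order given, from the checkout.
    for rel_path in artifacts:
        suffix = PurePosixPath(rel_path).suffix
        if suffix not in HANDLERS:
            raise ValueError(f"no deployment handler for {rel_path}")
        file_path = f"{workdir}/{rel_path}"
        plan.append([part.format(env=env, file=file_path) for part in HANDLERS[suffix]])
    return plan


def execute(plan, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in plan:
        runner(command, check=True)


plan = plan_deployment(
    "https://example.invalid/bi.git",  # placeholder repository URL
    "release-1.0",
    ["ddl/001_tables.sql", "data/regions.csv"],
    env="test",
)
```

Separating the plan from its execution means the same plan can be reviewed, logged, and replayed against test and production, which is where the traceability and repeatability come from.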

A tiered approach means that for each tool a development source control strategy may be adopted that then feeds the centralized deployment source control process.  This allows developers to be as productive as possible while still providing a controlled, repeatable deployment process.

This level of rigor is becoming more common due to an emphasis on corporate controls and on regulations such as Sarbanes-Oxley (SOX).  However, a strong case can be made that these practices are beneficial and defensible for a team of nearly any size.  Deployment from source control allows:

  1. Repeatable deployments – deployments that can be tested in a test or pre-production environment to assure a smooth production build.
  2. Reliable forensics – troubleshooters can be confident that they know exactly what code/data was deployed and in what manner.
  3. System integrity – manual access to the production system can be severely restricted since normal operations (including maintenance deployments) don’t require an individual to have escalated privileges.  This enables “break the glass” strategies where administrators only access privileged accounts in exceptional circumstances, which are then carefully audited afterward.
  4. Frequent deployments – deployment automation enabled by source control allows administrators to deploy environments quickly at a low cost.  This enables multiple specialized test environments, frequent environment refreshes, and true re-tests (build from known state) when deployment defects are discovered.  In iterative development, continuous integration methods build on these concepts.
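
The forensics point can be sketched as a simple comparison between what a release says should be deployed and what is actually found in an environment. This assumes each side can be reduced to a map of artifact path to checksum; `diff_deployment` and the sample values are hypothetical.

```python
def diff_deployment(expected, deployed):
    """Compare two {path: checksum} maps and report any drift."""
    expected_paths = set(expected)
    deployed_paths = set(deployed)
    return {
        # In the release but not found in the environment.
        "missing": sorted(expected_paths - deployed_paths),
        # Found in the environment but not part of the release.
        "unexpected": sorted(deployed_paths - expected_paths),
        # Present on both sides, but the contents differ.
        "changed": sorted(
            path
            for path in expected_paths & deployed_paths
            if expected[path] != deployed[path]
        ),
    }


drift = diff_deployment(
    expected={"ddl/001_tables.sql": "aaa", "etl/load_sales.xml": "bbb"},
    deployed={"ddl/001_tables.sql": "aaa", "etl/load_sales.xml": "ccc", "hotfix.sql": "ddd"},
)
```

An empty result in all three lists is what lets a troubleshooter say with confidence that the environment matches the release.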

Source control tools are widely available, and there are many free, open source options.  Again, I encourage clients to evaluate the tools they (probably) already own and determine their suitability before looking around.

As the options for source control are so numerous, I won’t attempt to list them here.  I will say that I’ve found the distributed model of source control as implemented in Mercurial and Git to be a significant advance in the usability of source control.  I personally have Mercurial running on my laptop and version control working documents of all kinds (including Word documents like proposals, RFPs, etc.) in addition to any code I may be developing.  It’s basically zero-footprint and it allows me to confidently interact with colleagues with no concern regarding lost work.

Joel Spolsky of Joel on Software fame put together a little Mercurial DVCS tutorial, and the Tekpub folks have a nice video tutorial.

Chris Grenz