Welcome to part 3 of the “Managing BI Performance Problems” series. Last time I discussed the importance of indexing and how to approach it. But I also said that indexing, while cheap and easy, can only solve cheap and easy problems. So how do you address the more complicated performance problems? The first answer is aggregate tables. Ralph Kimball (arguably the father of data warehousing) once wrote:
“Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.”
And to be sure, he’s not overstating it. I’ve seen queries go from running in 4 minutes to running in 4 seconds with the proper use of aggregate tables. Given gains like that, most companies could make good use of aggregate tables. Yet few actually employ them, and many of those that do implement them seldom use them.
Properly creating an aggregate table is a bit more involved than simply running an aggregating SQL statement and slapping the result into the data warehouse. Below are some tips on how to approach implementing aggregate tables properly.
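To make the discussion concrete, here is a minimal sketch of what a basic aggregate table might look like. Every name in it (sales_fact, dim_date, dim_product, daily_sales_agg, and the column names) is a hypothetical stand-in for your own schema, and the “proper” part lives in the comments: conformed keys and a scheduled refresh, not just the GROUP BY.

```sql
-- Hypothetical detail fact: one row per sales line item.
-- The aggregate below rolls it up to one row per product per day.
CREATE TABLE daily_sales_agg (
    date_key     INT           NOT NULL,  -- same conformed key as sales_fact / dim_date
    product_key  INT           NOT NULL,  -- same conformed key as dim_product
    sales_amount DECIMAL(18,2) NOT NULL,
    units_sold   INT           NOT NULL,
    row_count    INT           NOT NULL,  -- lets averages be derived later
    PRIMARY KEY (date_key, product_key)
);

-- Loaded and refreshed as part of the regular ETL cycle (never ad hoc)
-- so the aggregate always agrees with the detail fact.
INSERT INTO daily_sales_agg
SELECT date_key,
       product_key,
       SUM(sales_amount),
       SUM(units_sold),
       COUNT(*)
FROM   sales_fact
GROUP  BY date_key, product_key;
```

The row_count column is a small design choice worth copying: it keeps averages and other ratios computable from the aggregate without a trip back to the detail fact.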
Cultivate a deep functional knowledge
There are many reasons why companies don’t use aggregate tables effectively. Many users simply accept that their reports run against hundreds of thousands (if not millions) of rows and wait it out. And even when users do complain, many developers throw up their hands and say there’s just no way to make a report over that many rows run quickly.
Even when a developer sees that an aggregate table could help a certain query, many build to that one query alone and leave out others that could have benefited from the same change. What ends up happening is that a plethora of narrow, single-purpose aggregate tables gets built.
I mentioned in the previous article that a deep understanding of how users approach the data, what questions they ask of it, and what actions they will take depending on the answers can only help a development team. Not to sound like a broken record, but the same is true here. That understanding cuts back on the number of misguided aggregation attempts, leads to more thorough fixes, and, in certain cases, lets the development team proactively address performance problems before they surface.
Build dimensions in a disciplined manner
Whether you’re an “Inmon-ite,” a “Kimball-ite,” or somewhere in between, what makes or breaks your aggregation possibilities is the set of levels in your conformed dimensions. As such, setting a strong architectural vision for your data warehouse is crucial. Establish good, simple rules for creating new dimensions so that when the need for an aggregate table arises, it can actually be implemented. Sure, it takes time to make certain everything is built in as scalable a way as possible. But being able to dovetail solutions in neatly later, because of a disciplined approach early on, is well worth the cost.
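As a sketch of what that discipline buys you, consider a conformed date dimension. The names here are hypothetical, and the CREATE TABLE ... AS syntax varies a bit by platform, but the idea carries: because every level lives on one conformed dimension, a month-level aggregate can get its own “shrunken” dimension that still answers the same questions as the detail star.

```sql
-- Hypothetical conformed date dimension: day, month, quarter, and
-- year attributes all live on one row, keyed the same way everywhere.
CREATE TABLE dim_date (
    date_key     INT         PRIMARY KEY,
    full_date    DATE        NOT NULL,
    month_key    INT         NOT NULL,   -- e.g. 202401 for January 2024
    month_name   VARCHAR(20) NOT NULL,
    quarter_name VARCHAR(10) NOT NULL,
    year_number  INT         NOT NULL
);

-- A "shrunken" month dimension derived from the conformed dimension.
-- A month-level aggregate joined to this answers the same questions
-- the detail star does, because nothing was renamed or re-keyed.
CREATE TABLE dim_month AS
SELECT DISTINCT month_key, month_name, quarter_name, year_number
FROM   dim_date;
```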
When looking at performance problems, look for chances to aggregate and save
Sometimes teams are so busy throwing hardware at a performance problem, or trusting that their vendor software will save them, that they never get around to actually creating aggregate tables. True, giving the server more RAM will help. It is also true that the meta-layer software gets better with each release. But neither of these approaches scales. It is always better to tackle the root of a performance problem. So if a performance problem points to something “in the code,” the possibility of creating an aggregate table should be investigated early in the process.
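Here is what that early investigation might turn up, using the hypothetical tables sketched above. The question asked of the warehouse does not change; only the table being scanned does:

```sql
-- Before: scans every line item in the detail fact.
SELECT d.year_number,
       SUM(f.sales_amount) AS yearly_sales
FROM   sales_fact f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP  BY d.year_number;

-- After: the same answer from the day-level aggregate, which holds
-- one row per product per day instead of one row per line item.
SELECT d.year_number,
       SUM(a.sales_amount) AS yearly_sales
FROM   daily_sales_agg a
JOIN   dim_date d ON d.date_key = a.date_key
GROUP  BY d.year_number;
```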
Keep aggregate tables simple
I have seen aggregate tables so convoluted that it was impossible to tell whether they were aggregating the data correctly. Remember that aggregate tables are ultimately invisible to the user; as such, they should be relatively simple. A good rule I follow: you should be able to explain what an aggregate table does to a non-technical user in a single sentence.
Help the users help themselves
Users tend to be their own worst enemy. Too often they demand that every report they run carry detailed data, which negates any possibility of improving performance through aggregation. Train users to start at a “high” level of data and drill down to the detail they need. Make “start high and drill down” a leading paradigm when helping users construct their reports. Built this way, reports let aggregate tables yield the highest possible return. Eventually users will see that they don’t need detailed data on everything and adjust their usage accordingly.
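In query terms, “start high and drill down” might look like the pair below (again using the hypothetical schema from earlier). The opening view of the report is served entirely by the aggregate; only the drill-down, now constrained to a single month, ever touches the detail fact:

```sql
-- Step 1: the report opens at month level, served by the aggregate.
SELECT d.month_name,
       SUM(a.sales_amount) AS monthly_sales
FROM   daily_sales_agg a
JOIN   dim_date d ON d.date_key = a.date_key
WHERE  d.year_number = 2024
GROUP  BY d.month_name;

-- Step 2: the user drills into one month, so the detail fact is read
-- only for the rows that month actually covers.
SELECT f.*
FROM   sales_fact f
JOIN   dim_date d ON d.date_key = f.date_key
WHERE  d.month_key = 202401;   -- hypothetical key for January 2024
```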