If you follow the SQL Server community at all, you’ve probably heard a lot of buzz around PDW (Parallel Data Warehouse). This is the first in a series of blogs I am going to be writing about PDW. In this series, I am going to cover everything from PDW nuts and bolts to how to deliver “Big Data” solutions to end users with PDW as the foundation.
In this first installment, I am going to introduce you to PDW at a high level and talk about how PDW fits into the larger BI/Big Data market landscape.
Very Large Databases and Microsoft PDW
A “Very Large Database”, or VLDB, is a database system that stores volumes of data that traditional database systems just can’t handle. Typically, data in the dozens or hundreds of terabytes – or higher – is where you hear people start to talk about VLDBs as a solution.
Enter Microsoft’s PDW. PDW is a self-contained data warehouse appliance architected from the ground up on Massively Parallel Processing (MPP) to optimize the storage and retrieval of very large volumes of data. It was introduced a few years ago, and version 2 was just released in April of this year.
In the jump from v1 to v2, Microsoft made several architectural changes to the internal networking and storage that greatly reduced PDW’s purchase price. When compared to the VLDB competition, MS PDW v2 is now the lowest cost per terabyte in the market, beating out the likes of Teradata, Netezza, Greenplum and Exadata. These same architectural changes also increased performance over PDW v1.
PDW and Hadoop
Hadoop is another big name in the Big Data arena. At its core, it pairs distributed file storage (HDFS) with a parallel processing framework for working across that storage. There is much promise in this area, especially in terms of being able to store and access unstructured and semi-structured data (think large files like server logs or social media content).
One of the big advances in PDW version 2 is its ability to interact with Hadoop. PDW uses a technology called PolyBase to access unstructured data in Hadoop. And one of the amazing things Microsoft is doing with this technology is giving customers the ability to query Hadoop directly with SQL. For those of us who are familiar with SQL, this is a tremendous advantage over having to learn Hadoop-specific query tools like Hive and its HiveQL language.
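To give a flavor of what that looks like, here is a rough sketch of the PolyBase approach: define an external table over a file sitting in Hadoop, then query it with ordinary T-SQL. The exact syntax varies by PDW version, and the HDFS address, file path, and column names below are made up purely for illustration.

```sql
-- Hypothetical example: expose a pipe-delimited web log file stored in
-- Hadoop (HDFS) as an external table in PDW. The HDFS location, delimiter,
-- and column list are illustrative assumptions, not a real environment.
CREATE EXTERNAL TABLE dbo.WebLogs
(
    LogDate   DATE,
    UserName  VARCHAR(50),
    PageUrl   VARCHAR(500)
)
WITH
(
    LOCATION = 'hdfs://hadoop-namenode:8020/logs/weblogs/',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Once the external table exists, plain SQL works against the Hadoop data,
-- and it can even be joined to regular PDW tables:
SELECT TOP 10 UserName, COUNT(*) AS PageViews
FROM dbo.WebLogs
GROUP BY UserName
ORDER BY PageViews DESC;
```

The point is that nobody on the team had to write a line of Hive or MapReduce code; the Hadoop data looks like just another table to the SQL developer.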
Engaging Clients Interested in PDW
Due to the expense of the PDW appliance, the process of engaging clients is quite different than for traditional data warehouse projects. There are basically two scenarios where a consultant may become involved in a PDW project. The first is when a client has already decided they want to purchase a PDW appliance. The second is when a client is considering PDW, but wants some sort of proof of concept done before they decide to buy.
From a consulting perspective, the first scenario would normally be preferable, but the proof of concept scenario is likely to occur just as often. Unfortunately, the proof of concept phase is not like a typical PoC for a regular client engagement. Microsoft is kind enough to have several appliances available for PoC endeavors, but access and availability are limited. There is a formal process that consulting companies must go through with Microsoft to start a PoC engagement, and it takes 6 to 8 weeks to complete the PoC. Delivering one successfully takes both a strong relationship with Microsoft and the technical expertise to execute.
Part 1 Conclusion
PDW v2 has hit the market, and clients are excited about what it brings to the table. It’s a game-changer for enterprises that deal with large volumes of data and are familiar with the SQL Server platform. Stay tuned in the coming weeks as I dive deeper into the guts of PDW and talk about things like appliance architecture, purchase options, storage choices, data movement, and how to deliver BI on top of the PDW platform.