In previous posts I’ve provided an introduction to Microsoft’s cloud-based Cortana Analytics Suite, and we’ve looked at the data and event ingest capabilities it provides. This time, I want to shed some light on what happens after ingest. Where do you land the data so that you can start transforming it into actionable intelligence?
Traditionally, Azure has provided basic storage in both Blob and Table formats. In the Cortana Analytics Process — the steps used to go from problem to toolset to full Cortana Analytics-based solution — Azure Blob storage, Azure VM-based SQL Server, and Azure SQL Database are all defined as common first destinations for the ingest process. The choice here is determined largely by business needs and data characteristics. A common scenario is:
- First, load raw data into Azure Blob storage (using Python, JavaScript, or a number of other techniques).
- Next, create a Hadoop cluster with that data using HDInsight, and then use Hive queries to manipulate the Hadoop data.
- Finally, the results of those Hive queries can be exported, saved back to Blob storage, or even exported to Azure Machine Learning (more on that in future posts).
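The first step in that scenario, landing raw data in Blob storage from Python, can be sketched as follows. This is a minimal illustration rather than a recommended client: it uses only the standard library to build a Put Blob request for the Blob service REST endpoint, and it does not send anything. The account name, container, blob path, and SAS token are hypothetical placeholders, and in practice you would more likely use an Azure storage SDK than hand-built requests.

```python
from urllib.parse import quote
from urllib.request import Request


def build_blob_upload_request(account, container, blob_name, sas_token, data):
    """Build (but do not send) a Put Blob request for a block blob.

    All arguments are illustrative; sas_token is assumed to be a
    shared access signature query string with write permission.
    """
    url = "https://{0}.blob.core.windows.net/{1}/{2}?{3}".format(
        account,
        quote(container),
        quote(blob_name),        # "/" is kept, so virtual folders work
        sas_token.lstrip("?"),   # tolerate a leading "?" on the token
    )
    headers = {
        "x-ms-blob-type": "BlockBlob",    # required by Put Blob
        "Content-Length": str(len(data)),
    }
    return Request(url, data=data, headers=headers, method="PUT")


if __name__ == "__main__":
    raw = b"id,value\n1,42\n"
    request = build_blob_upload_request(
        "exampleaccount",                    # hypothetical storage account
        "rawdata",                           # hypothetical container
        "ingest/sample.csv",                 # hypothetical blob path
        "sv=PLACEHOLDER&sig=PLACEHOLDER",    # placeholder SAS token
        raw,
    )
    print(request.get_method(), request.full_url)
    # Passing the request to urllib.request.urlopen would perform the upload.
```

Once the raw files are in a container, an HDInsight cluster can be pointed at that storage account and the data queried with Hive, as described in the steps above.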
For org- or enterprise-level solutions, the Cortana Analytics Suite (CAS) provides more comprehensive offerings: Azure Data Lake and Azure SQL Data Warehouse. Azure SQL Data Warehouse is an elastic, massively parallel data warehouse that can grow, shrink, and pause as needed, scaling up to petabytes of data. Effectively, this offering is SQL Server PDW (Parallel Data Warehouse) running in the cloud rather than on the APS (Analytics Platform System) appliance. Virtually all the same capabilities are there, notably including PolyBase, which allows T-SQL querying across both structured and unstructured data. Data from on-premises sources can also be incorporated via ETL.
Currently in preview, Azure Data Lake provides a unified solution capable of ingestion, storage, and analysis of any kind of data at any size. Built on Apache YARN, the Data Lake service can scale dynamically and offers fully managed Hadoop, Spark, Storm, and HBase clusters.
A new tool that Microsoft is including in the CAS is an EIM (Enterprise Information Management) service called Azure Data Catalog. This is a fully managed service that serves as a system of registration and discovery for enterprise data sources. Users can register and annotate data sources to provide context and metadata, and other users can then discover those sources and use them for reporting, analytics, and so on. Data Catalog basically crowdsources knowledge in your organization, helping to uncover important data sources and new uses for them.
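To make the register, annotate, and discover workflow concrete, here is a toy in-memory model of the idea. It is not the Azure Data Catalog API and makes no claims about how the service is implemented; the `DataSource` and `ToyCatalog` names, their fields, and the sample sources are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """A registered data source plus user-supplied metadata (hypothetical)."""
    name: str
    location: str
    tags: set = field(default_factory=set)
    description: str = ""


class ToyCatalog:
    """In-memory stand-in for a registration and discovery service."""

    def __init__(self):
        self._sources = {}

    def register(self, name, location):
        # Registration records where the data lives, not the data itself.
        self._sources[name] = DataSource(name, location)
        return self._sources[name]

    def annotate(self, name, tags=(), description=None):
        # Any user can add context; tags are stored in lower case.
        source = self._sources[name]
        source.tags.update(tag.lower() for tag in tags)
        if description is not None:
            source.description = description

    def discover(self, term):
        # Match on name, exact tag, or description; return sorted names.
        term = term.lower()
        return sorted(
            source.name
            for source in self._sources.values()
            if term in source.name.lower()
            or term in source.tags
            or term in source.description.lower()
        )


if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.register("SalesDW", "sqldw://example/sales")    # made-up location
    catalog.annotate("SalesDW", tags=["Revenue"], description="Monthly sales facts")
    catalog.register("WebLogs", "blob://example/weblogs")   # made-up location
    catalog.annotate("WebLogs", tags=["clickstream"])
    print(catalog.discover("revenue"))
```

The point of the sketch is that the catalog stores only pointers and annotations, so its value grows with every person who contributes context about a source.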
As we’ve seen, Cortana Analytics leverages existing and new Azure services to provide comprehensive ingest and store capability for Big Data solutions, as well as advanced cloud-based data warehousing that scales up to the petabyte range. In addition, Microsoft is innovating with Azure Data Catalog, which amounts to a “crowdsourced data governance” tool.
Next time, we’ll take a look at what happens once we have our data at rest. We’ll review the Cortana Analytics Process for building Data Science solutions, and along the way we’ll examine some other new and existing tools bundled in the suite.