In the last post in this series, we looked at why Azure Machines are the most likely candidate for workable SQL Server-based cloud BI implementations. Today, we talk turkey about actual limitations and things you need to know as you look at a potential solution in this environment.
The first thing to know is that Azure VMs come in sizes relating to:
- The number of available CPU cores
- Memory capacity
- Available OS and VM disk space
- Maximum number of 1TB data disks (and associated IOPS count)
Definitely examine the options in light of your own requirements but, for general BI usage, VMs of size “Large” and above are the only real candidates. You will want a virtual machine with at least 4 or 8 CPU Cores if you want to run SQL Server Enterprise Edition. Obviously running SharePoint has its own considerations, and for SharePoint VMs you should follow appropriate sizing guidelines.
The Large VM is the smallest size with at least 4 cores, and it also has 7GB RAM and up to 8 1TB data disks each rated at 500 maximum IOPS. Sizes beyond that range to the beefiest available, the “A7.” With 8 cores, 56GB of RAM, and up to 16TB data storage, this type of VM provides substantial BI infrastructure, and could host a significantly sized Data Warehouse, Cube, or Tabular model.
Note that the uppermost storage limit in Azure VMs is 16TB (again, on 16 1TB disks) and that IOPS will also be limited by the data disk breakdown. Also be aware that Azure’s Cloud service role instances (i.e. Web Roles and Worker Roles) require more disk space than a virtual machine. System files will take up 4 GB of space for the Windows pagefile, and another 2 GB of space for the Windows dumpfile — so account for that space in your storage calculations.
What other concerns than pure hardware capability are there? In two words: data movement. If you are familiar with writing ETL processes, you will already understand how connectivity issues can bugger up a data extraction. ETL processes must be vetted in the cloud environment to ensure that throughput and execution windows are viable. Some typical SSIS techniques don’t all work when, for example, moving data from an on-premise to a cloud environment (AKA “Hybrid data movement”). For these moves, you will always have a restricted bandwidth and high latency network; the Cloud in this scenario basically means “internet.” The following steps are recommended for hybrid data movement (as well as on-premises data movement) to load data the fastest way possible:
- Bulk load data; avoid RBAR operations that incur “roundtripping” between on-prem and cloud sources/destinations, Consider ways in which the Load operation might be moved entirely into the cloud VM (i.e. extracting data to files at the source, and then moving the files to the Azure VM for actual load by SSIS — also called a “two-stage transfer”)
- Consider compressing the data before sending it to the destination. This can help with both latency and, if moving data from Azure to on-premise, can save charges related to data egress from Azure.
- While SSIS will not automatically retry an operation, you can add code to handle this. A For loop container and its MaximumErrorCount property can get you most of the way there.
- Log the load process as minimally as possible. Extra activity while the load is being performed will only increase the risk of packet loss or other such related issues.
- Avoid SORT in queries and SSIS data flows (not to mention other “blocking” type data flow tasks). These transformations delay data movement into the destination tables. If the operations are required, try to adjust the design such that you can load simply and quickly, and apply such rules later.
- And on that point about “cloud = internet”: don’t forget security! I personally think it’s worth considering aftermarket SSIS tasks that can support secure connections between your on-prem and cloud installations.
For more detail about ETL with Azure BI, this whitepaper offers a thorough discussion of Hybrid data movement concerns, and techniques for countering them in SSIS.
Next time, I’ll cover where things sit with Office 365, SharePoint Online, and the upcoming Power BI. Cheers!