Databricks on Azure versus AWS

As a Databricks Champion working for Perficient’s Data Solutions team, I spend most of my time installing and managing Databricks on Azure and AWS. The decision on which cloud provider to use is typically outside my scope since it’s already been made by the organization. However, there are occasions where the client is already using both hyperscalers or has not yet moved to the cloud. It is helpful in those situations to be able to advise the client on the advantages and disadvantages of one platform over another from a Databricks perspective. I’m aware that I am skipping over Google Cloud Platform, but I want to focus on the questions I am actually asked rather than questions that could be asked. I am also not advocating for one cloud provider over another. I am limiting myself to the question of AWS versus Azure from a Databricks perspective.

Advantages of Databricks on Azure

Databricks is a first-party service on Azure, which means it enjoys deep integration with the Microsoft ecosystem. Identity management in Databricks is integrated with Azure Active Directory (AAD, since renamed Microsoft Entra ID) authentication, which can save time and effort in an area that I have found can be difficult in large, regulated organizations. The same is true of the deep integration with networking, Private Link and Azure’s compliance frameworks. The value of this integration is amplified if the client also uses some combination of Azure Data Lake Storage (ADLS), Azure Synapse Analytics, or Power BI. The Databricks integration with these products on Azure is seamless. FinOps gets a boost in Azure for companies with a Microsoft Azure Consumption Commitment (MACC), as Databricks costs can be applied against that commitment. On the topic of cost management, Azure spot VMs can be used in some situations to reduce cost. Azure Databricks and ADLS Gen2/Blob Storage are optimized for high throughput, which reduces latency and improves I/O performance.

Disadvantages of Databricks on Azure

Databricks and Azure are tightly integrated as long as you stay within the Microsoft ecosystem. Azure Databricks depends on Azure AD, role-based access control (RBAC), and network security groups (NSGs). If you want to take a hybrid or multi-cloud approach, these dependencies will require additional and sometimes complex configuration. Some of these advanced networking configurations require enterprise licensing or additional manual configuration in the Azure Marketplace.

Advantages of Databricks on AWS

Azure is focused on seamless integration with Databricks under the assumption that the organization is a committed Microsoft shop. AWS takes the approach of providing more dials to tune, trading some simplicity for greater flexibility. AWS offers a broad selection of EC2 instance types, Spot Instance options, and scalable S3 storage, which can result in better cost and performance optimization. It has more instance types than Azure, including more options for GPU and memory-optimized workloads, and a more flexible spot pricing model. VPC Peering, Transit Gateway, and more granular IAM security controls than Azure make AWS a stronger choice for organizations with advanced security requirements and/or organizations committed to multi-cloud or hybrid Databricks deployments. Finally, many advanced features are released on AWS before Azure. Photon is a good example.
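
To make the spot comparison concrete, here is a minimal sketch of how the request body for the Databricks Clusters API differs between the two clouds. The helper name cluster_spec, the cluster name, the runtime version and the node types are illustrative assumptions, not recommendations; the attribute names follow the cloud-specific azure_attributes and aws_attributes blocks of the Clusters API as I understand them, so check them against the current API reference before relying on them.

```python
from typing import Any, Dict


def cluster_spec(cloud: str, workers: int = 4) -> Dict[str, Any]:
    """Build a Clusters API request body that prefers spot capacity.

    Hypothetical helper for illustration only; all values are examples.
    """
    spec: Dict[str, Any] = {
        "cluster_name": f"spot-demo-{cloud}",  # hypothetical name
        "spark_version": "13.3.x-scala2.12",   # example runtime; use one your workspace lists
        "num_workers": workers,
    }
    if cloud == "azure":
        spec["node_type_id"] = "Standard_DS3_v2"  # example VM size
        spec["azure_attributes"] = {
            "first_on_demand": 1,  # keep the first node (the driver) on on-demand capacity
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "spot_bid_max_price": -1,  # -1: do not evict on the basis of price
        }
    elif cloud == "aws":
        spec["node_type_id"] = "i3.xlarge"  # example EC2 instance type
        spec["aws_attributes"] = {
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": 100,  # bid as a percentage of the on-demand price
        }
    else:
        raise ValueError(f"unsupported cloud: {cloud!r}")
    return spec


if __name__ == "__main__":
    import json

    for cloud in ("azure", "aws"):
        print(json.dumps(cluster_spec(cloud), indent=2))
```

The common fields are identical on both clouds; only the node type and the cloud-specific attributes block change, which is where the extra dials on AWS (bid percentage, zones, instance profiles) show up.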

Disadvantages of Databricks on AWS

AWS charges for cross-region data transfers, and S3 read/write operations can become costly, especially for data-intensive workloads. This can result in higher networking costs. AWS also has weaker native BI integration when you compare Tableau on AWS with Power BI on Azure.
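
A rough way to see how quickly transfer charges add up is to multiply daily volume by a per-GB rate. The sketch below does only that. The function name and the rate are placeholders I made up for illustration, not published AWS prices; substitute the figures from the current pricing page for your regions.

```python
def monthly_transfer_cost(gb_per_day: float, rate_per_gb: float, days: int = 30) -> float:
    """Estimate a monthly cross-region transfer bill.

    rate_per_gb is a caller-supplied placeholder, not a published price.
    """
    if gb_per_day < 0 or rate_per_gb < 0 or days < 0:
        raise ValueError("inputs must be non-negative")
    return round(gb_per_day * rate_per_gb * days, 2)


if __name__ == "__main__":
    # Hypothetical workload: 500 GB replicated across regions every day
    # at a placeholder rate of 0.02 currency units per GB.
    print(monthly_transfer_cost(gb_per_day=500, rate_per_gb=0.02))
```

The estimate ignores per-request S3 charges and tiered pricing, so treat it as a lower bound for a first conversation about data-intensive workloads.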

Conclusion

Databricks is a strong cloud data platform on all the major cloud providers. If your organization has already committed to a particular cloud provider, Databricks will work. However, I have been asked about the differences between AWS and Azure often enough that I wanted to get all of my thoughts down in one place. I also recommend a multi-cloud strategy for most of our client organizations for Disaster Recovery and Business Continuity purposes.

Contact us to discuss the pros and cons of your planned or proposed Databricks implementation so we can help you navigate the technical complexities that affect security, cost and BI integrations.

 

 

 

 

 


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4j as the off-blockchain repository.
