As we move into the modern cloud data architecture era, enterprises are deploying two primary classes of data integration tools to handle the traditional ETL and ELT use cases.
The first type of data integration tool is the GUI-based data integration solution.
Talend, InfoSphere DataStage, Informatica, and Matillion are good examples. These tools use a UI either to configure a data integration engine or to compile code for data integration. GUI integration tools promise fast, friendly user interfaces for rapidly creating new data pipelines, and they have a proven record of increasing developer productivity. They are good for organizations that have:
- Many data integration pipelines to manage.
- Complex MDM requirements and business rules that need to integrate into data pipelines.
- A ubiquitous relational database ecosystem.
- Requirements to move data to and from cloud platforms (e.g. AWS, Azure, GCP).
The second type of data integration tool is the script/code-based data integration solution.
Script/code-based data integration leverages a series of tools to develop a data pipeline. This capability usually requires:
- A programming language like Python or Scala
- A data processing framework such as Spark
- An orchestration tool similar to Apache Airflow.
Code or scripts are written as vertices (nodes) using a programming language and framework. These vertices are then structured into Directed Acyclic Graphs (DAGs) by the orchestration tool. DAGs can scale to handle very large data pipelines (think tens of terabytes per day). DAGs are also extremely useful for handling the customized or complex processing that one would see in Artificial Intelligence or Machine Learning use cases.
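To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not Airflow or Spark code: it uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) as a stand-in for what an orchestration tool does when it decides the order in which nodes run. The task names (`extract`, `transform`, `load`), the sample rows, and the shared `ctx` dictionary are all hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Each function is one vertex (node) of the pipeline.
# ctx is a hypothetical shared dictionary used to pass data between nodes.
def extract(ctx):
    # Stand-in for reading from a source system.
    ctx["raw"] = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"}]

def transform(ctx):
    # Cast the amount field from string to float.
    ctx["clean"] = [{**row, "amount": float(row["amount"])} for row in ctx["raw"]]

def load(ctx):
    # Stand-in for writing to a target; here we only aggregate.
    ctx["total"] = sum(row["amount"] for row in ctx["clean"])

TASKS = {"extract": extract, "transform": transform, "load": load}

# Edges of the DAG: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def run_pipeline(tasks, dependencies):
    """Run every task once, in an order that respects the dependencies."""
    ctx = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](ctx)
    return order, ctx

order, ctx = run_pipeline(TASKS, DEPENDENCIES)
print(order)
print(ctx["total"])
```

A real orchestrator adds scheduling, retries, and distributed execution on top of this ordering step, and the heavy lifting inside each node would typically be delegated to a processing framework such as Spark rather than done in local Python.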
The 0.5: Cloud Native
When I was initially socializing the idea for this blog on the two types of cloud ETL, a counterpart asked, “What about cloud-native?” Good question! The cloud-native options are just flavors of the two types of data integration. For instance, AWS Glue and Google Dataproc have UIs that generate code (e.g. Python and Scala). Unlike their legacy counterparts with rich UI functionality, these cloud-native tools still require editing the generated code. The cloud-native tools are quickly catching up, but they still need to add significant functionality to their UIs before they can deliver the same productivity gains as traditional GUI-based solutions.