“Extraction, Transformation, and Loading” also known as ETL is among one of the key steps in data warehouse life cycle. Generally source data structures in operational data store are not similar to target data structures. To solve the issue of incompatible data structures ETL acquires data from operational data stores and transform it into information that can be used by decision support systems. As ETL is such an important part of data warehouse lifecycle, it is important that ETL is completely understood and properly implemented. However ETL is a very vast concept and not particularly defines tools and processes that needed to be involved in ETL. This post addresses the issues faced in extraction, transformation, and load components of data warehousing.
Common ETL issues
Requirements and Business Needs
ETL must start system with gathering in one place all the known requirements, realities, and constraints affecting the ETL system. Following the requirements, a number of architectural decisions must be made at the beginning of your ETL project. These decisions are major commitments because they drive the ETL implementation. The architectural decision includes selection of hardware, software, coding practices, personnel, and operations.
Business needs are the information requirements of the end users of the data warehouse. ETL team must make sure to introduce the business needs into the data warehouse. Without these business needs the decision support system will not be very effective.
Data Profiling
Data profiling is a systematic examination of the quality, scope, and context of a data source to allow an ETL system to be built.
Dirty data source may require Elimination of some input fields completely
· Flagging of missing data and generation of special surrogate keys
· Best-guess automatic replacement of corrupted values
· Human intervention at the record level
· Development of a full-blown normalized representation of the data
Security Requirements
Security becomes very interesting when it comes to data warehousing. The data warehouse seeks to publish data widely to decision makers, whereas the security interests assume that data should be restricted to those with a need to know. As the job of ETL process is to read data from several operational data stores, improper or restrictive security can cause ETL process to become hard to understand and hard to implement. ETL team needs to work in a special environment, since they need full read/write access to the physical tables of the data warehouse.
Data Latency
Data latency obviously has a huge effect on the architecture and the system implementation. Nowadays many modern applications require data warehouses to behave like operational data stores. It may be impossible to reprocess data according to the old processing algorithms if enough time has passed. If the data latency requirement is sufficiently urgent, the architecture of the ETL system must convert from batch oriented to streaming oriented.
Available Skills
An ETL system shouldn’t depend on critical processing modules of programming platform if those programming skills are not in house, and those skills cannot be reasonably acquired and kept.
The choice of architecture is a fundamental and early decision in the design of the ETL system. The choice of architecture affects everything and a change in architecture almost always means implementing the entire system over again from the very start.
Current Solutions to solve these ETL Issues
To overcome these critical issues in ETL many vendors have introduced dedicated ETL tools that can help in ETL processing. For example, SQL Server 2005 comes with dedicated ETL tool called SQL Server Integration Services that provides simpler, faster, and cheaper ETL development.
Tool based ETL systems provide easy to use interfaces; this makes it easier for technical people with broad business skills to use ETL tools effectively. Many ETL tools have integrated metadata repositories that can synchronize metadata from source systems, target databases, and other Business Intelligence tools. Many ETL tools also offer in-line encryption and compression capabilities. ETL tool can often manage complex load-balancing scenarios across servers, avoiding server deadlock.
In the long run purchasing an ETL tool actually reduces the cost of building and maintaining a data warehouse. We need to keep in mind that ETL tools only take care of ETL process at the physical implementation level and logical level ETL design must be created before the implementation of ETL process. The main goal of ETL is to make data available to data warehouse for querying and analysis; if business requirements are not followed properly than every ETL technique will fail regardless of the implementation details.