In this technology driven world, where the world is moving towards AI, machine learning, Internet of Things (IOT) etc. the key entity is data. At present, if technology is a religion then Data is the God. It is everything and it is present everywhere in different forms.
Understanding Data is like a job of an “Oracle” which will help you to predict future, forecast what’s going to come next. Therefore for any Organization in any domain to have an edge over other organization under same domain they should be very good at one particular skill which is “Data warehousing – Accumulation and compartmentalization of data which in future can be used to study patterns, forecast trends and generate reports” and also they should be able to do it with performance and agility which makes sure that the business gets the information they need early as sunrise for any decision making process.
So, as a part of this blog, we are going to see how to handle huge amount of data using Informatica ETL tool without any compromise on the performance and the do’s and don’ts on this approach.
What is External Loader?
In Informatica, external loader option increases the performance of a session with huge volumes of data. External loader configuration is applicable in
- DB2
- Sybase IQ
- Oracle Teradata databases.
While using external loader, the PowerCenter session does not connect to the target database directly to load data. Instead it invokes the external application for the specific database (autoloader for DB2 EEE, sqlldr for Oracle, etc.), which load the rows into the target table. During the load, Intermediate stage files will be generated. The tables are loaded directly from the flat files, which is faster than SQL commands.
In this method, the session creates three files while loading.
- Control file
- Intermediate stage file
- External Loader Log file
- Control File: This file contains information about the target flat file, such as data format and loading instruction for the external loader. The control file has the extension “*.ctl”.
- Intermediate stage file: This file contains the source data, that needs to be loaded into the target table. The file format can be altered in the session level.
- External Loader Log file: While using External Loader, the session generates two logs:
- Session log – Contains the External loader initialization info and completion info
- External Loader log – Contains the detailed information about the steps that take place in External loader. The log has extensive Info regarding its performance.
Difference between External Loader and Bulk load Option
Advantages and Disadvantages of External loader
Advantages: The major advantage of using external loader is in the significant improvement of performance. Especially in the large volume table, external loader speeds up the process.
Disadvantages: If the Informatica job fails, the recovery is not possible due to the bypass of database logs. The only option during job failure is truncate and re-load.
Right platform for using External loader
External loader requires the right set of platforms to work at its best. We can select External Loader Option for any platform, by following the guidelines given below.
- Databases such as DB2, Sybase IQ, Oracle, or Teradata databases.
- Dropping and creating the indexes along with constraints before and after job run.
- Table can have any identity columns in it, unlike bulk mode.
- Table can have row change timestamp or auto generating feature in its columns.
External loader in Informatica
Now let us find more on implementing the External Loader Option in informatica in detail.
For demonstrating External loader in informatica, we are selecting the database as DB2. An existing mapping with Database Connector will be used for External loader for better comparison. The table with the following feature will be implemented. Before the demonstration of external loader, the mentioned pre-requisites must be fulfilled.
Pre-requisites
The first step in EL (External Loader) implementation is creating the pre-session and post session commands for dropping and re-creating the indexes, constraints and so on. Now let’s dive deep into the implementation.
Step 1: Indexes are not allowed for the table which will be loaded via EL (External Loader) approach. Identify all the indexes and constraints in the table, make a list of pre-commands and post-commands for dropping and re-creating the indexes and constraints.
Pre-Session Database statements:
drop index schema_name.index_name1; alter table schema_name.table_name alter foreign key constraint_name not enforced;
alter table schema_name.table_name drop unique constraint_name;
Post-Session Database statements:
create unique index schema_name.index_name1 on schema_name.table_name ( index_name1 asc ) not partitioned in tablespace_name allow reverse scans compress no; alter table schema_name.table_name alter foreign key constraint_name enforced; alter table schema_name.table_name add constraint constraint_name unique (column1, column2);
Creation of External Loader Connection:
The external loader connection for DB2 is created in the Connections tab present in the workflow manager. When the tab “Connections” is clicked, the below shown options are displayed.
While clicking the highlighted Option “Loader”, “Loader Connection Browser” pop-up will be displayed as given below.
The Future of Big Data
With some guidance, you can craft a data platform that is right for your organization’s needs and gets the most return from your data capital.
In the browser, click the Option “New” for creating a new External Loader Connection. Next a “Select subtype” pop-up will be displayed as given below. In this window, select “DB2 EE External Loader” highlighted Option.
In the next window, we will be filling the Connection’s Object definition. Information such as Database name, Username and password are filled as per our requirement. The rest of the details are to be filled as given below.
Changes to incorporate EL approach in existing relational mapping
To incorporate the EL approach in a workflow, the below sample session with relation target instance will be taken into consideration.
In the next section, we will find the step wise list of all the changes that are required to integrate EL approach into this existing job.
Step 1: Under the Mapping tab, select the intended target instance. Beneath the Writers tab, click the drop down as shown below. Select the highlighted Option “File Writer” in the writer tab.
Step 2: In the highlighted drop down of “Type” under Connections section, click the option “Loader”.
Step 3: After the selection of Loader Option in “Type” Column, the EL Connector “EXTRNAL_ODS” that we had created initially, will be selected in the “Value” column.
Step 4: After the loader connection selection, the target table name as well as prefix name has to be typed in the target section compulsorily.
Challenges faced during implementation:
Once the implementation of all the steps are completed, EL (External Loader) process is initiated for testing. The execution of pre-session commands has completed successfully. When we initiate the EL job the first time, the job had failed with the error given below.
Error Message:
Message: The server does not support fixed-width files with the DB2 external loaders.. Target instance [HLT_PLN_RT_QUALFR]
Cause of the error: In EL approach, the relational target instance will be changed into flat file. We have to select the default file setting in the target instance as fixed width. For the integration service to read the flat file, it must be of the delimited type.
Solution: Change the file setting from fixed length to delimited for the target instance in session level.
Error Message:
2018-12-11 22:24:51 : ERROR : (4260066 | WRITER_1_*_1) : (IS | IS_DEV) : NODE01_NASETLDEV01 : WRT_8047 : Error: External loader process [16908530] exited with error [4]
Cause of the error: Initially we had set the flat file target’s delimited file properties as given below, while implementing EL approach.
Initially under the Optional quotes option, we selected the value “None”, which led the failure.
Solution: The Optional quotes setting should be “Double”.
After troubleshooting the above-mentioned errors in the session, the EL job completed successfully without any further issues.
Run statistics for Original vs EL approach:
In this section, we will compare the run statistics of the original job against the job with EL approach. Below statistics with the busy percentage details depicts that the job with EL approach has improved performance comparatively.
Original Job:
Run Time for Original Job (h:mm:ss): 0:55:00
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] ***** Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_INC_CDC_RATE_QUALIFIER] has completed. Total Run Time = [3258.978390] secs Total Idle Time = [2893.409420] secs Busy Percentage = [11.217287] Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_INC_CDC_RATE_QUALIFIER] has completed. Total Run Time = [3098.616687] secs Total Idle Time = [1626.629197] secs Busy Percentage = [47.504665] Thread work time breakdown: exp_SOURCE_ANCHOR: 92.438070 percent exp_FLAG_RECORD: 2.737940 percent rtr_INSERT_UPDATE: 3.389831 percent lkp_HLT_PLN_RT_QUALFR: 1.434159 percent Thread [WRITER_1_*_1] created for [the write stage] of partition point [FF_ODS_TGT_SKIP, HLT_PLN_RT_QUALFR_INS, HLT_PLN_RT_QUALFR_UPD, HLT_PLN_RT_QUALFR_DEL] has completed. Total Run Time = [3204.533197] secs Total Idle Time = [248.310910] secs Busy Percentage = [92.251261
EL Approach job:
Run Time for EL Job (h:mm:ss): 0:40:29
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] ***** Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_INC_CDC_RATE_QUALIFIER] has completed. Total Run Time = [1393.412757] secs Total Idle Time = [1133.556475] secs Busy Percentage = [18.648909] Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_INC_CDC_RATE_QUALIFIER] has completed. Total Run Time = [1348.161069] secs Total Idle Time = [71.640107] secs Busy Percentage = [94.686087] Thread work time breakdown: exp_SOURCE_ANCHOR: 91.446345 percent exp_FLAG_RECORD: 2.799378 percent rtr_INSERT_UPDATE: 3.888025 percent lkp_HLT_PLN_RT_QUALFR: 1.866252 percent Thread [WRITER_1_*_1] created for [the write stage] of partition point [FF_ODS_TGT_SKIP, HLT_PLN_RT_QUALFR_INS, HLT_PLN_RT_QUALFR_UPD, HLT_PLN_RT_QUALFR_DEL] has completed. Total Run Time = [2341.820833] secs Total Idle Time = [1233.939335] secs Busy Percentage = [47.308551]
In the end, if the business have the liberty to house the Intermediate files, then external loader becomes a better suited approach.