Skip to main content

Development

An Error Handling Approach in Datastage Project

In year of 2012 I was involved in a datastage ETL project to load data from external flat files, as well as Oracle CRM data source into Teradata data warehouse. Comparing to other ETL appliances, IBM Datastage offers much flexible stages, components to accommodate each type of data feed business requirement. I would like to introduce an error handler solution utilized in the project for your reference.

Requirement

The administrator or power users will be monitoring ETL data flows and determine which stages or file run into error in the periodic jobs. Each individual data jobs will generate necessary error information file with specified format at project level. Now those separate piece of information need to be collected and put into centralized database. Basically the admin user need to know when & where the issue happens, what’s reason etc.

Approach

I believe there can be many solutions to be deployed to achieve the goal. We just wanted to make things simple and easy to implement in Datastage.  The overall steps are depicted in following diagram. The assumption is that we have different sequence job with different frequency, in each of which there are couple of parallel jobs. The reason that we are to use Sequential file instead of dataset file is that it is difficult to manipulate dataset file in Unix shell script. In this process,

1) Each parallel job will write error information into its individual sequential file.

2) In sequence job it will call each parallel job first and then go to a CommandExecute activity calling Unix shell to cat all files into single one and then copy to a pre-defined folder.

3) A designed error handling job need to be scheduled on the regular basis and watching on that specified folder. Once the sequential file onboard, it read error information and load to Teradata tables.

4) As long as the sequential files are processed without error, it should be move to archive folder. Otherwise, admin will have to look at the error and process it again.

 untitled

Pros

We don’t have to do manipulation in Datastage dataset file and call error handler data load job in many main sequence jobs so as to reduce degree of code coupling.

Cons

This is asynchronous way and it’s a bit hard to track whether the parallel job error information has been loaded into Teradata or not; And need to consider how to rollback if error out in one of the steps.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Kent Jiang

Currently I was working in Perficient China GDC located in Hangzhou as a Lead Technical Consultant. I have been with 8 years experience in IT industry across Java, CRM and BI technologies. My interested tech area includes business analytic s, project planning, MDM, quality assurance etc

More from this Author

Follow Us
TwitterLinkedinFacebookYoutubeInstagram