What is IBM’s InfoSphere DataStage? To sum it up, it’s an ETL tool: it extracts data, transforms it by applying business rules, and then loads it into a target.
Before we can get started with IBM’s InfoSphere DataStage, you need to have a DataStage project already set up. When you start a DataStage client, you are prompted to connect to a project. Each project contains:
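To make the three steps concrete, here is a minimal sketch in plain Python. It is not DataStage code; the sample data, the function names and the "drop negative amounts" business rule are all made up for illustration, and a plain list stands in for the target table.

```python
import csv
import io

# Hypothetical source data; in a real job this would come from a source stage.
SOURCE_CSV = """id,name,amount
1,alice,10.50
2,bob,-3.00
3,carol,7.25
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply business rules (here: drop negative amounts, tidy names)."""
    result = []
    for row in rows:
        amount = float(row["amount"])
        if amount < 0:
            continue  # assumed business rule: negative amounts are rejected
        result.append(
            {"id": int(row["id"]), "name": row["name"].title(), "amount": amount}
        )
    return result


def load(rows, target):
    """Load: append the transformed rows to the target (a list stands in for a table)."""
    target.extend(rows)
    return len(rows)


target_table = []
loaded = load(transform(extract(SOURCE_CSV)), target_table)
print(loaded, target_table)
```

In DataStage each of these steps would be a stage on the canvas rather than a function, but the flow of data is the same.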
- DataStage jobs.
- Built-in components. These are predefined components used in a job.
- User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.
A complete project may contain several jobs and user-defined components.
What is a DataStage Project?
There is a special class of project called a protected project. Normally nothing can be added, deleted, or changed in a protected project. Users can view objects in the project, and perform tasks that affect the way a job runs rather than the job’s design. Users with Production Manager status can import existing DataStage components into a protected project and manipulate projects in other ways.[1]
What is a DataStage Job?
Now that you understand what a DataStage project is, you need to know what a DataStage job is. A job is the core unit of development: it is what extracts, transforms, and loads your data into your target. Below you will see a sample DataStage job that pulls from a source database, joins in another data source, and performs multiple lookups before loading into its target database.
To create a DataStage job, you use specific stages, such as Join, which lets you join different sources, or Transformer, which lets you apply business and transformation rules. Below are just a few of the stage options available when creating a DataStage job.
If you go back to the sample job above, you will notice the links that connect the stages. Links carry specific data fields from one stage to the next and on to the target.
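If it helps to think of stages and links in code terms, the sketch below models them in plain Python. This is only an analogy, not how DataStage works internally: the data, the field names and the helper functions (`link`, `join_stage`, `lookup_stage`) are all hypothetical.

```python
# Hypothetical data standing in for two sources and a lookup table.
orders = [
    {"order_id": 1, "cust_id": 10, "region_code": "E", "amount": 25.0},
    {"order_id": 2, "cust_id": 11, "region_code": "W", "amount": 40.0},
    {"order_id": 3, "cust_id": 99, "region_code": "E", "amount": 5.0},
]
customers = [{"cust_id": 10, "name": "Acme"}, {"cust_id": 11, "name": "Globex"}]
regions = {"E": "East", "W": "West"}


def link(rows, fields):
    """A link carries only the selected fields from one stage to the next."""
    return [{f: row[f] for f in fields} for row in rows]


def join_stage(left, right, key):
    """Inner join of two inputs on a shared key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def lookup_stage(rows, table, key, out_field):
    """Enrich each row with a value looked up from a reference table."""
    return [{**row, out_field: table.get(row[key])} for row in rows]


joined = join_stage(
    link(orders, ["order_id", "cust_id", "region_code", "amount"]),
    link(customers, ["cust_id", "name"]),
    "cust_id",
)
enriched = lookup_stage(joined, regions, "region_code", "region")
# The final link passes only the fields the target needs.
target = link(enriched, ["order_id", "name", "region", "amount"])
print(target)
```

Order 3 has no matching customer, so the inner join drops it; a real job would usually decide explicitly what to do with rows like that.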
4 Components of InfoSphere DataStage
Now let’s step back a little. InfoSphere DataStage has four different components. The first is the Designer.
Here you can create, compile and even run your jobs.
Second is the Director.
The Director allows you to run and monitor your DataStage jobs.
Third is the Administrator.
Here you can create projects, users and specify roles.
Last is the Web Console, where you can change passwords.
To use DataStage, you need a project that has already been created, the domain, and your user ID and password. Contact your local DataStage administrator to get set up with this information.
Once you are logged in to the DataStage Designer, you will see the following: the Palette, the Standard toolbar, the Debug Bar, the Repository, and the Property Browser.
Palette – A list of all the stages and activities available in DataStage
Standard – Used to save and run the job from the Designer
Repository – Stores all the jobs, table definitions, transforms, etc. that we create
Property Browser – Name and description of the job
Debug Bar – Provides you the tools to debug your job
Note: Every toolbar can be hidden (closed), and rendered visible again via the View menu.
The Director
Now that we have talked about the Designer, let’s discuss the Director. The Director has four basic views:
- Status – Status of each job/job sequence
- Log – Log of each job/job sequence
- Schedule – Jobs/job sequences queued for later execution
- Monitor
The Status view shows the identity (IP address) of the DataStage server and the project to which the Director client is connected. The status bar (at the bottom) shows the count of jobs in the currently selected category and, on the bottom right, the server time. All reported times are server time. This is very important for offshore developers, whose local time might differ substantially from the server time.
Each job has its most recent status displayed in the Status column. Also displayed are the most recent start time, the most recent finish time, and the corresponding elapsed time (rounded to the nearest whole second), plus the description of the job (one of the job’s properties).
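As a quick illustration of how an elapsed time rounded to the nearest whole second relates to the start and finish times, here is a small Python sketch. The timestamps are made up, and rounding halves upward is my assumption; the Director may handle the exact half-second case differently.

```python
from datetime import datetime, timedelta


def elapsed_whole_seconds(started, finished):
    """Elapsed time in seconds, rounded to the nearest whole second.

    Assumes finished >= started and rounds halves upward.
    """
    seconds = (finished - started).total_seconds()
    return int(seconds + 0.5)


# Hypothetical start and finish times, both in server time.
started = datetime(2024, 1, 1, 10, 0, 0, 200000)
finished = datetime(2024, 1, 1, 10, 2, 5, 900000)

elapsed = elapsed_whole_seconds(started, finished)
print(elapsed)                            # 126
print(str(timedelta(seconds=elapsed)))    # 0:02:06
```

The raw difference here is 125.7 seconds, which shows up as 126 once rounded.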
The Director’s Log view gives details about a job by showing you the job’s log.
In the Director you can also schedule your jobs to run at specific times and monitor their progress.
The Administrator
But before we can do any of this, we need to set up our project, user IDs, and so on. To do this, you will need to use the Administrator client. The Administrator is used primarily to create and delete projects and, mainly for newly created projects, to set project-wide defaults, for example when job logs will automatically be purged of old entries. These defaults apply only to objects created subsequently in the project; changing a default value does not affect any existing objects in the project. You can also set up your user IDs within the Administrator client by assigning users to a project.
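The point about defaults only applying to objects created afterwards is easy to miss, so here is a tiny Python model of it. This is purely illustrative and not how DataStage stores its settings; the `Project` and `Job` classes and the `purge_after_days` setting are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    purge_after_days: int  # copied from the project default when the job is created


@dataclass
class Project:
    name: str
    default_purge_after_days: int
    jobs: dict = field(default_factory=dict)

    def create_job(self, name):
        """New jobs take a snapshot of the current project-wide default."""
        job = Job(name, self.default_purge_after_days)
        self.jobs[name] = job
        return job


project = Project("demo", default_purge_after_days=7)
old_job = project.create_job("load_customers")

# Changing the project default later does not touch jobs that already exist.
project.default_purge_after_days = 30
new_job = project.create_job("load_orders")

print(old_job.purge_after_days, new_job.purge_after_days)  # 7 30
```

In practice this means that if you change a default on an established project, you still have to update the existing jobs yourself.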
I hope this gave you a basic understanding of the IBM InfoSphere DataStage ETL tool. Please check out the other great posts on the site, and look out for my next InfoSphere DataStage post, which will cover how to create a DataStage job.