What is Unstructured Data?
Unstructured data is an information that does not have a predefined data model or does not fit well into relational tables. It is broadly classified into two types
- Non-Textual unstructured data is a multimedia data like still images, videos, and MP3 audio files
- Textual unstructured data are like email messages, instant messages, excel etc…,
The objective of this presentation is to project, how to design a Datastage job that extracts data from Excel file with multiple sheets and write into a text file or comma separated file.
When to use Unstructured Data?
Usually many of you may raise a question; Why can’t we use sequential file stage to read an excel file?
Of course, we can use the sequential file stage to read the data in excel file, but only if the sheet contains data in a single sheet. If data is spread across the sheets, we simply can’t use the sequential stage.
The beauty of Unstructured Data stage is that it has additional transformations to pull out data from the entire work sheet. Transformations can be extended to have more options in our output file like comment, author etc…,
The latest version of IBM Infosphere Datastage (11.5) has a limitation on handling Unstructured Data. It can support only Microsoft Excel files as data sources.
Implementation in Datastage
You can find the unstructured data stage under Files section palette in Palette in the designer. Data can be read from or written to any sources.
Step by Step Process:
- From the File section of Palette, drag an Unstructured Data stage and a dataset stage to the canvas and link two stages with copy stage. Stage looks as below
Source File: Excel sheet with employee details in two different work sheets, “Employee.xlsx”
Datastage Job has been designed with three stages:
- Unstructured Data Stage (Source)
- Copy Stage
- Dataset Stage (Target)
Next step is to define the properties to extract the data.
Below is the snapshot of Unstructured Data stage properties:
- Click on the Configure button highlighted in the above image, you will get the properties window of Configuration
Here is the snapshot for the configure properties:
Step 1a: Give the path of source file in the File name field, as in the snapshot below
Step 1b: In the Range expression field, give the range of the data in worksheet
What is Data Range?
When you use the Unstructured Data stage, you can extract data from a specified data range in a Microsoft Excel spreadsheet.
Data range represents a cell, a row, a column, or a selection of cells that contain one or more continuous blocks of cells. A data range is specified by the range expression. In the Unstructured Data stage, you can use a range expression to specify the data range to extract
The Range option property of Unstructured Data stage allows you to specify the data range either by selecting the Specify the start row option or the Specify the entire data range option. If you select the Specify the start row option, then identify the start row. Unstructured Data stage then identifies the end row of the data range. If you select the Specify the entire data range option, then you must specify the start and end rows of the data range to be extracted.
Sheet names to skip: Using this option we can skip the sheets, if that are not required
Then, click on the load button for importing the metadata
Step 1c: Click on Import to load the metadata in Map tab, as below
- we have few more options to be added in our output file like Row is hidden, Sheet name, comment, author etc…, in the Property tab
Step 2: I have used Copy stage connected to the Unstructured_Data stage to drop the columns if not needed and mapped to the target file (DataSet)
Step 3:Load data into the target DataSet
The employee information of two sheets are merged in one sheet in the output as below
With this, we got to know how to extract data from multiple sheets in an excel sheet using Unstructured Data in Datastage. This can be primarily used, when we process organizational/enterprise data that is used for calculating metrics and also to join data from multiple sheets for the reporting purpose.