Prior to running your project on a grid system, you must ensure that your grid environment is configured.
Why Do You Need Grid Configuration in DataStage?
- Grid computing enhances server performance by making maximum use of compute nodes across one or more projects simultaneously.
- Enables both grid distribution methods simultaneously
- Allows you to assign jobs to specific servers in the grid
- Allows you to assign a parallel job to run across multiple servers
Platforms you can use:
Red Hat / SUSE
AIX/Power
Why Is the Data Integration Grid Driving Rapid Customer Adoption?
You can make better decisions when you have better data.
- Grid-based integration makes it possible for companies to process and analyze larger data volumes, create a consolidated view of data, and put the right data into the enterprise data warehouse and other critical enterprise applications
- More sources of data, more data from each source, better matching, real-time versus batch
- Better business decisions
- Enhanced customer relationships
- More cross selling and upselling
- New services delivered to customers
Reduced Data Integration Costs:
- Reduced administration and operating costs – centralization of staff.
- Reduced data integration project costs – lower cost per project delivered by a data integration center of excellence versus siloed projects.
- Reduced hardware costs.
What are the Benefits of Grid Computing?
- Low cost hardware
- High-throughput processing
- Resource manager monitors availability of hardware at startup / job deployment time
- SLA (Service Level Agreement) – provides consistent run times and isolates concurrent job execution.
Comparison of Before and After Grid Configuration
Before Grid:
Architecture & proliferation of SMP servers:
• Higher capital costs through limited pooling of IT assets across silos
• Higher operational costs
• Limited responsiveness due to more manual scheduling and provisioning
• Inherently more vulnerable to failure
• No ability to exploit available capacity when other teams are idle
After Grid:
“Virtualized” infrastructure:
• Creates a virtual data integration collaboration environment
• Virtualizes application services execution
• Dynamically fulfills requests over a virtual pool of system resources (nodes)
• Offers an adaptive, self-managed operating environment that guarantees high availability
• Delivers maximum available capacity to anyone participating in the grid
Grid Environment Variables:
APT_GRID_ENABLE
• YES: The current osh run is intercepted and a new configuration file is created dynamically
• NO: The existing configuration file is used
APT_GRID_QUEUE
• Name of the Resource Manager queue the job will be submitted to
APT_GRID_COMPUTE_NODES
• The number of compute nodes required for the job
• Used to request the number of compute nodes in the dynamically created configuration file
• A compute node is a server that can be used for processing
• That is, not a node dedicated to I/O or DB2
• Default value is 1
APT_GRID_PARTITIONS
• Used to create multiple partitions for each compute node
• Default value is 1
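As an illustration only, the four variables above might be set as shell exports before a job run; in practice they are more commonly defined as job or project environment variables in the DataStage Administrator. The queue name and counts below are hypothetical:

    export APT_GRID_ENABLE=YES        # intercept the osh run and build a dynamic configuration file
    export APT_GRID_QUEUE=ds_grid     # hypothetical resource manager queue name
    export APT_GRID_COMPUTE_NODES=4   # request four compute nodes for this job
    export APT_GRID_PARTITIONS=2      # two partitions per compute node (4 x 2 = 8 partitions in total)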
Resource Management
• Tracks resources (nodes) based on which jobs are already running and which servers are down
• Queues jobs when no resources are available
• Provides a list of nodes that are assigned for a job
• Extensive advanced features
• We leverage a subset of the features
• Manager node where tasks are scheduled and resources allocated
• Usually happens on the head node
• Compute nodes have agent processes that communicate back to the manager
• Jobs (scripts or executables) are started on the compute nodes, not the head node
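For context, the sketch below shows what a submission to a resource manager queue looks like, assuming IBM Platform LSF as the resource manager; the queue name, slot count, and script are hypothetical, and with the grid toolkit in place this submission is made for you rather than by hand:

    # Submit the (hypothetical) run_job.sh to the ds_grid queue, requesting 4 slots
    bsub -q ds_grid -n 4 -o run_job.out ./run_job.sh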
Grid Enablement Toolkit:
What does it do?
• Prebuilt integration with resource managers
• Coordinates activities between the parallel framework and the resource manager
• Creates the parallel configuration file to drive the dynamic assignment of compute resources
• Logging (interaction with the resource manager, usage details)
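To make the configuration file point concrete, the sketch below shows the general shape of a dynamically generated parallel configuration file for a job that was granted two compute nodes with one partition each; the node names, host names, and paths are hypothetical:

    {
      node "node1"
      {
        fastname "compute01"
        pools ""
        resource disk "/grid/data" {pools ""}
        resource scratchdisk "/grid/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "compute02"
        pools ""
        resource disk "/grid/data" {pools ""}
        resource scratchdisk "/grid/scratch" {pools ""}
      }
    }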
Workflow of GRID: