What is Load Leveler?
In DataStage environments, LoadLeveler manages server resources and runs scheduled DataStage (DS) jobs according to the resources available.
How is Load Leveler Work Load Management Done?
Workload management covers the following:
- Job management
- Workload balancing (to maximize resource use)
- Control (centralized, by the system administrator)
- Usability (command-line interface)
- Supports NFS, DFS, AFS, and GPFS
Job Management includes the following:
- Build, Submit, Schedule, Monitor
- Change Priority
- Terminate
A LoadLeveler cluster dispatches work only to the machines (compute nodes) defined in it.
What is Included in a Load Leveler Cluster?
Job Manager/Scheduler Node (public or local)
- Manages jobs from submission through completion
- Receives submission from user, sends to Central Manager, schedules jobs
Central Manager
- Central resource manager and workload balancer
- Examines job requirements and finds the resources to meet them
Execute Node
- Runs work (serial job steps or parallel job tasks) dispatched by the Central Manager
Resource Manager
- Collects status from the execute nodes and job manager nodes
Region Manager
- Monitors node and adaptor status of executing machines
Submit-only Node
- Submits jobs to LoadLeveler from outside the cluster.
Examples of Load Leveler Commands:
To submit a batch job:
#llsubmit filename
Example:
#llsubmit script.ll
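A job command file such as script.ll is an ordinary shell script whose "# @" comment lines carry LoadLeveler keywords. The following is a minimal sketch of a serial job, not a definitive template: the job name hello_ll, the five-minute limit, and the echoed message are illustrative assumptions, and the class workq is only an example, so substitute a class that llclass reports on your cluster.

```shell
#!/bin/sh
# script.ll: minimal serial job command file (illustrative sketch).
# Lines starting with "# @" are LoadLeveler keywords; the shell treats
# them as comments, so this file also runs as a plain script.
# @ job_name         = hello_ll
# @ job_type         = serial
# @ class            = workq
# @ output           = $(job_name).$(jobid).out
# @ error            = $(job_name).$(jobid).err
# @ wall_clock_limit = 00:05:00
# @ queue

# Everything below runs on the execute node once the job is dispatched.
msg="hello from a LoadLeveler job"
echo "$msg"
```

Because the keywords are comments, the script can be tried locally with sh script.ll before it is submitted.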
Queue Information:
Provides information about each queue.
#llclass
Example output:
Name         MaxJobCPU   MaxProcCPU  Free   Max    Description
             d+hh:mm:ss  d+hh:mm:ss  Slots  Slots
-----------  ----------  ----------  -----  -----  ---------------------
interactive  undefined   undefined       4      8  Interactive Parallel jobs running on interactive node
workq        unlimited   unlimited       0     56  Default queue, up to 56 processors
preempt      unlimited   unlimited      16     48  queue resevered for on-demand jobs, up to 48 processors
checkpt      unlimited   unlimited      16    104  queue for checkpointing jobs, up to 104 processors, Job
                                                   running on this queue can be preempted for on-demand job
--------------------------------------------------------------------------------
“Free Slots” values of the classes “workq”, “preempt”, “checkpt” are constrained by the MAX_STARTERS limit(s)
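To pick a class that can start work right away, the llclass listing can be filtered for classes with free slots. The sketch below is one possible way to do that; the function name free_classes is hypothetical, and it assumes the column order of the example output (Name, MaxJobCPU, MaxProcCPU, Free Slots, Max Slots, Description).

```shell
#!/bin/sh
# free_classes: read llclass output on stdin and print "class free_slots"
# for every class that currently has at least one free slot.
# Header, separator, and wrapped description lines are skipped because
# their 4th and 5th fields are not plain integers.
free_classes() {
    awk '$4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && ($4 + 0) > 0 { print $1, $4 }'
}

# Demo on rows copied from the example output (no LoadLeveler needed).
free_classes <<'EOF'
interactive undefined undefined 4 8 Interactive Parallel jobs running on interactive node
workq unlimited unlimited 0 56 Default queue, up to 56 processors
preempt unlimited unlimited 16 48 queue resevered for on-demand jobs, up to 48 processors
EOF
```

On a live cluster the same function would be fed from the command itself, for example llclass | free_classes.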
View Job Status:
To list all jobs in queue:
#llq
To list the jobs of a specific user:
# llq -u username
To determine why a job has not started:
# llq -s job-id
Example output:
The class of this job step is “checkpt”.
Total number of available initiators of this class on all machines in the cluster: 8
Minimum number of initiators of this class required by job step: 32
The number of available initiators of this class is not sufficient for this job step.
Not enough resources to start now.
This step is top-dog.
Considered at: Fri Jul 1 12:12:04 2016
Will start by: Tue Jul 1 18:10:32 2016
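In the output above the job step needs 32 initiators but only 8 are available, a shortfall of 24. As a rough sketch, that difference can be pulled out of the llq -s text with awk; the function name initiator_shortfall is hypothetical, and the patterns assume the message wording shown in this example.

```shell
#!/bin/sh
# initiator_shortfall: read "llq -s" output on stdin and print how many
# more initiators the job step needs than are currently available.
# Prints 0 when enough initiators are available.
initiator_shortfall() {
    awk -F': *' '
        /available initiators of this class on all machines/ { avail = $2 }
        /initiators of this class required by job step/      { need  = $2 }
        END { d = need - avail; if (d < 0) d = 0; print d }
    '
}

# Demo on the two relevant lines from the example output.
initiator_shortfall <<'EOF'
Total number of available initiators of this class on all machines in the cluster: 8
Minimum number of initiators of this class required by job step: 32
EOF
```

On a live cluster this would be used as llq -s job-id | initiator_shortfall.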
To generate a long listing rather than the standard one:
# llq -l job-id
Job Status States
State | Abbreviation | Description |
Canceled | CA | The job has been canceled, for example by the llcancel command. |
Completed | C | The job has completed. |
Complete Pending | CP | The job is in the process of completing. Some tasks are finished. |
Deferred | D | The job will not be assigned until a specified date. The start date may have been specified by the user in the Job Command file or it may have been set by LoadLeveler because a parallel job could not obtain enough machines to run the job. |
Idle | I | The job is being considered to run on a machine though no machine has been selected yet. |
NotQueued | NQ | The job is not being considered to run. A job may enter this state due to an error in the command file or because LoadLeveler cannot obtain information that it needs to act on the request. |
Not Run | NR | The job will never run because a stated dependency in the Job Command file evaluated to be false. |
Pending | P | The job is in the process of starting on one or more machines. The request to start the job has been sent but has not yet been acknowledged. |
Rejected | X | The job did not start because of a mismatch between the requirements of your job and the resources on the target machine, or because the user does not have a valid ID on the target machine. |
Reject Pending | XP | The job is in the process of being rejected. |
Removed | RM | The job was canceled by either LoadLeveler or the owner of the job. |
Remove Pending | RP | The job is in the process of being removed. |
Running | R | The job is running. |
Starting | ST | The job is starting. |
Submission Error | SX | The job cannot start due to a submission error. Please notify the Bluedawg administration team if you encounter this error. |
System Hold | S | The job has been put on hold by a system administrator. |
System User Hold | HS | Both the user and a system administrator have put the job on hold. |
Terminated | TX | The job was terminated, presumably by means beyond LoadLeveler’s control. Please notify the Bluedawg administration team if you encounter this error. |
User Hold | H | The job has been put on hold by the owner. |
Vacated | V | The started job did not complete. The job will be scheduled again provided that the job may be rescheduled. |
Vacate Pending | VP | The job is in the process of vacating. |
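The standard llq listing shows only the abbreviations, so a small lookup helper can make scripts and reports easier to read. The sketch below simply encodes the table above; the function name ll_state_name is hypothetical and is not part of LoadLeveler.

```shell
#!/bin/sh
# ll_state_name: translate a job state abbreviation into its full name,
# using the abbreviations listed in the table above.
# Returns non-zero for an abbreviation that is not in the table.
ll_state_name() {
    case "$1" in
        CA) echo "Canceled" ;;
        C)  echo "Completed" ;;
        CP) echo "Complete Pending" ;;
        D)  echo "Deferred" ;;
        I)  echo "Idle" ;;
        NQ) echo "NotQueued" ;;
        NR) echo "Not Run" ;;
        P)  echo "Pending" ;;
        X)  echo "Rejected" ;;
        XP) echo "Reject Pending" ;;
        RM) echo "Removed" ;;
        RP) echo "Remove Pending" ;;
        R)  echo "Running" ;;
        ST) echo "Starting" ;;
        SX) echo "Submission Error" ;;
        S)  echo "System Hold" ;;
        HS) echo "System User Hold" ;;
        TX) echo "Terminated" ;;
        H)  echo "User Hold" ;;
        V)  echo "Vacated" ;;
        VP) echo "Vacate Pending" ;;
        *)  echo "Unknown"; return 1 ;;
    esac
}

# Demo: prints "Starting".
ll_state_name ST
```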
To cancel a job:
# llcancel job-id
To cancel all jobs of a specific user:
# llcancel -u username
Job History and Usage Summaries
# llsummary -u estrabd /var/loadl/archive/history.archive
Check status of each node
# llstatus
Example output:
Name       Schedd  InQ  Act  Startd  Run  LdAvg  Idle  Arch    OpSys
tstdevn01  Avail     4    2  Idle      0   1.01     0  Power5  AIX53
tstdevn02  Down      0    0  Busy      8   8.31  9999  Power5  AIX53
tstdevn03  Down      0    0  Idle      0   0.00  9999  Power5  AIX53
tstdevn04  Down      0    0  Idle      0   0.01  9999  Power5  AIX53
tstdevn05  Down      0    0  Busy      8   7.73  9999  Power5  AIX53
tstdevn06  Down      0    0  Busy      8   9.03  9999  Power5  AIX53
tstdevn07  Down      0    0  Busy      8   7.98  9999  Power5  AIX53
tstdevn08  Down      0    0  Busy      8   9.01  9999  Power5  AIX53
tstdevn09  Down      0    0  Busy      8   8.73  9999  Power5  AIX53
tstdevn10  Down      0    0  Busy      8   8.00  9999  Power5  AIX53
tstdevn11  Down      0    0  Idle      0   1.04  9999  Power5  AIX53
tstdevn12  Down      0    0  Idle      0   0.00  9999  Power5  AIX53
tstdevn13  Down      0    0  Idle      0   0.00  9999  Power5  AIX53
tstdevn14  Down      0    0  Busy      8   8.07  9999  Power5  AIX53
Power5/AIX53    14 machines  4 jobs  64 running
Total Machines  14 machines  4 jobs  64 running
The Central Manager is defined on tstdevn01
The BACKFILL scheduler is in use
All machines on the machine_list are present.
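For a quick health check, the per-node rows of llstatus can be condensed into a one-line summary. The following sketch counts nodes by Startd state and totals the Run column; the function name startd_summary is hypothetical, and it assumes the ten-column layout shown in the example output.

```shell
#!/bin/sh
# startd_summary: read llstatus output on stdin and print how many nodes
# have Startd Busy or Idle, plus the total of the Run column.
# Only per-node rows are counted: their InQ and Act fields (3rd and 4th)
# are integers, which excludes the header and the summary lines.
startd_summary() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
             if ($5 == "Busy") busy++
             else if ($5 == "Idle") idle++
             running += $6
         }
         END { printf "busy=%d idle=%d running=%d\n", busy, idle, running }'
}

# Demo on three rows copied from the example output.
startd_summary <<'EOF'
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
tstdevn01 Avail 4 2 Idle 0 1.01 0 Power5 AIX53
tstdevn02 Down 0 0 Busy 8 8.31 9999 Power5 AIX53
tstdevn03 Down 0 0 Idle 0 0.00 9999 Power5 AIX53
EOF
```

On a live cluster this would be used as llstatus | startd_summary. For the full 14-node listing above, the running total should agree with the "64 running" figure that llstatus reports.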