Management wants business processes to operate consistently, perfectly, and on time. Service Level Agreements (SLAs) were developed to ensure that this happens. These agreements govern the working lives of many IT professionals, who must monitor, evaluate, control, and sign off on processes to avoid consequences. For example, an SLA might specify that failing to meet certain service levels could result in fines, loss of business, negative performance reviews, or even job termination. This pressure can overwhelm people to the point that it affects their lives away from work. Even so, these agreements appear to be here to stay.
Why Did SLAs Originate?
Several factors contributed to the rise of the SLA:
- System consolidation puts more processes into a single footprint. This raises the stakes because a single failed event could affect an entire business.
- IT outsourcers have introduced the idea that they can run a company’s computers better than the in-house staff. They promise management they’ll deliver work on time, all the time, or pay a fine. To avoid having your job outsourced, you need to learn the new rules of running a data center. One of these rules is that automation is good. Many outsourcers can offer lower costs because they use automation effectively. We know, because they are some of our best customers.
- Government regulations, such as the Sarbanes-Oxley Act (SOX) and the Health Insurance Portability and Accountability Act (HIPAA), add further requirements to the mix.
- Around-the-clock availability is now expected. Management looks at “always on” resources like the Internet and wants the same for their IT operations.
So, if SLAs are here to stay, how can you handle them with the least amount of stress? The answer is automation.
How the Robot Products Help You Handle SLAs
In an IBM i-centric data center, Robot Schedule is the heart of any SLA program. It can work in conjunction with Robot Schedule Enterprise to control processes in a consistent manner on IBM i, UNIX, and Windows servers. It also provides the cornerstone for ensuring that these processes run on time.
Checks and balances are important in any schedule. You can’t just create a batch stream of jobs and hope to meet your SLA. As you build your schedule, you must incorporate mechanisms to check the status of critical processes. Robot Schedule, Robot Console, and Robot Alert all work together to help you do this.
- Use Robot Schedule Job Monitors to track whether critical jobs finish on time:
  - The Job Overrun Monitor checks if a job runs longer than it should.
  - The Job Underrun Monitor checks if a job completes too quickly.
  - The Late Start Monitor checks if a job starts later than its scheduled time.
- Use Robot Schedule and OPerator Assistance Language (OPAL) to check the status of critical jobs.
- Use the Robot Schedule Good Morning Report to analyze system job activity.
- Use Robot Schedule to create reactive jobs that notify you when critical jobs complete successfully.
- Use Robot Schedule with Robot Console Resource Monitoring to track jobs, job queues, subsystems, objects, controllers, and more.
- Use Robot Alert for text, pager, or email notification of any of the above actions.
Using Robot Schedule Job Monitors
Robot Schedule has three job monitors you can define for each critical job on your system. Depending on the requirements of your SLAs, you can monitor for jobs that run too long (job overruns), complete too quickly (job underruns), or start later than their scheduled time (late starts). Each monitor offers several options for notifying you of the potential problem. Set the monitor thresholds so that you have enough time to solve the problem before your SLAs are in danger of not being met.
Job Overrun Monitor
The Job Overrun job monitor allows you to specify the actions Robot Schedule should take if a job runs longer than it should. You can specify either a maximum time (in hours and minutes) the job should take to complete, or a time by which the job should finish (the choice depends on your SLA requirements):
Maximum duration:
Select Maximum duration to monitor a job based on how long it takes to complete. Enter the maximum time (in hours and minutes) the job can take to complete.
Must complete by:
Select Must complete by to monitor a job based on a time by which the job should finish. Enter the time by which the job should have completed.
Specify the actions Robot Schedule should take if the job does not complete in the time allowed. You can end the job, or send a warning to any combination of the following: the job’s message queue, a Robot Alert device, or the Robot Network Status Center.
Job Underrun Monitor
The Job Underrun job monitor allows you to specify the actions Robot Schedule should take if a job completes too quickly.
Enter the minimum time (in hours and minutes) the job should run before completing.
Specify where Robot Schedule should send a warning if the job completes faster than the time specified. You can select any combination, but must select at least one, of the following: the job’s message queue, a Robot Alert device, or Robot Network. Note: If you don’t specify an action, an event is logged in the job’s completion history and in the Job Monitor Events Log.
Late Start Monitor
The Late Start job monitor allows you to specify the actions Robot Schedule should take if a job starts later than its scheduled run time. You can enter either the maximum amount of time (in hours and minutes) after its scheduled run time that the job can start, or the latest time by which the job must start.
Later than scheduled by:
Enter the maximum time (in hours and minutes) after its scheduled run time that the job can start.
Must start by:
Select Must start by to monitor a job based on a time by which the job should start. Then, enter the time by which the job should have started.
Specify the actions Robot Schedule should take if the job doesn’t start within the time specified. You can end the job, or send a warning to any combination of the following: the job’s message queue, a Robot Alert device, or the Robot Network Status Center.
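The threshold logic behind the three monitors can be illustrated with a short Python sketch. This is not Robot Schedule code or its API; the names MonitorSettings and check_job, and the event labels, are hypothetical and exist only to show how the overrun, underrun, and late start conditions differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class MonitorSettings:
    """Hypothetical thresholds for one job; None disables that check."""
    max_duration: Optional[timedelta] = None      # overrun: maximum duration
    must_complete_by: Optional[datetime] = None   # overrun: must complete by
    min_duration: Optional[timedelta] = None      # underrun: minimum run time
    max_start_delay: Optional[timedelta] = None   # late start: later than scheduled by
    must_start_by: Optional[datetime] = None      # late start: must start by


def check_job(settings: MonitorSettings, scheduled: datetime, now: datetime,
              started: Optional[datetime] = None,
              ended: Optional[datetime] = None) -> List[str]:
    """Return the monitor events that apply to the job at time `now`."""
    events: List[str] = []

    if started is None:
        # The job has not started yet, so only the late start check applies.
        # (Assumption: the sketch accepts either threshold style.)
        limits = []
        if settings.max_start_delay is not None:
            limits.append(scheduled + settings.max_start_delay)
        if settings.must_start_by is not None:
            limits.append(settings.must_start_by)
        if limits and now > min(limits):
            events.append("LATE_START")
        return events

    # For a job that is still running, measure up to the current time.
    finished_at = ended if ended is not None else now
    elapsed = finished_at - started

    if settings.max_duration is not None and elapsed > settings.max_duration:
        events.append("OVERRUN")
    elif (settings.must_complete_by is not None
          and finished_at > settings.must_complete_by):
        events.append("OVERRUN")

    # An underrun can only be judged once the job has actually ended.
    if (ended is not None and settings.min_duration is not None
            and elapsed < settings.min_duration):
        events.append("UNDERRUN")
    return events
```

For example, a job scheduled for 22:00 with a 15-minute start allowance would produce a late start event if it still had not started at 22:20, which leaves time to react before a downstream deadline is missed.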
Using OPAL to Check Critical Jobs
You can use OPAL code to set up another type of Robot Schedule command-type job called a later-checker job. This type of job runs well after its critical job should have finished. The OPAL code in this job includes the RBASNDMSG command to send a message that the critical job is still running. You can run later-checker jobs at regular times, or only after an IPL, and you can use them to check the status of important subsystem and communication jobs. They help reduce the need for early morning physical checks around the data center. Because you are using OPAL code, you don’t need to keep track of the Robot Schedule job number, just the name.
For example, you could set up a later-checker job using the following command (where jobx is the name of your critical job):
RBTALRLIB/RBASNDMSG MSG('jobx STILL RUNNING') TOPG(SUPPORT)
The OPAL code for this job uses the ACTJOB keyword to check the status of the critical job. If the critical job is not active, the later-checker job is skipped.
This later-checker job runs every day, but sends a message only when jobx is running late.
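The decision the later-checker job makes can be sketched in Python. This is not OPAL syntax and does not call RBASNDMSG; the function later_check and its parameters are hypothetical stand-ins for the active-job test and the message command described above.

```python
from typing import Callable, Iterable


def later_check(critical_job: str,
                active_jobs: Iterable[str],
                send_message: Callable[[str], None]) -> bool:
    """Send a warning if the critical job is still active; otherwise skip.

    Returns True when a message was sent.
    """
    if critical_job not in set(active_jobs):
        # The critical job has already ended, so the checker does nothing.
        return False
    send_message(f"{critical_job} STILL RUNNING")
    return True


# Usage: collect messages in a list instead of paging anyone.
sent = []
later_check("jobx", ["QBATCH", "jobx"], sent.append)   # jobx is running late
later_check("jobx", ["QBATCH"], sent.append)           # jobx finished, no message
```

Because the check is keyed on the job name, the sketch mirrors the point made above: the checker does not need to know the critical job's number.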
Using the Good Morning Report to Analyze Job Activity
The Good Morning Report in Robot Schedule, which summarizes job processing during a specific time period, is both a great source of information and a great tool to help you analyze and manage your schedule. To schedule the Good Morning Report in a Robot Schedule job, enter the RBTGM command in the Job Properties Command Entry window and press (or click) the Prompt button to display the command prompt panel. Or, select Job History Reports in the tree view and right-click Good Morning Report in the list view to display the report setup window.
The Good Morning Report can include the following information:
- Number of jobs that ran
- Number of jobs that completed, both normally and abnormally
- Total batch execution
- Jobs that ran outside the average runtime
- Jobs that ran outside the job forecast
- Deviation from past runs
- Number of jobs that ended in error (jobs with a status of E)
To see the number of jobs that varied from the average runtime, enter a percentage of deviation. For example, if you enter 15, the Good Morning Report shows you the jobs that ran outside a 15 percent deviation of their average runtime.
To see the number of jobs that varied from a specific forecast, enter the forecast name and deviation. For example, if you enter a forecast name and 30, the Good Morning Report shows you the jobs that ran more than 30 minutes off the specified forecast. To select from a list of forecasts, click the Prompt button next to the Forecast Name field.
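As a rough illustration of the percentage filter, the Python sketch below flags jobs whose runtime differs from their average by more than a given percentage. How the report actually calculates deviation is not documented here, so the formula (absolute difference divided by the average) and the function name outside_average are assumptions.

```python
from typing import Dict, List


def outside_average(runtimes: Dict[str, float],
                    averages: Dict[str, float],
                    percent: float) -> List[str]:
    """Return jobs whose runtime deviates from average by more than `percent`."""
    flagged = []
    for job, runtime in runtimes.items():
        avg = averages.get(job)
        if not avg:
            continue  # no history (or a zero average): nothing to compare
        deviation = abs(runtime - avg) / avg * 100
        if deviation > percent:
            flagged.append(job)
    return flagged


# Runtimes in minutes for last night's run versus the historical average.
last_night = {"BACKUP": 70, "INVOICE": 33, "PAYROLL": 20}
average = {"BACKUP": 60, "INVOICE": 30, "PAYROLL": 40}
flagged = outside_average(last_night, average, 15)
```

In this sample data, BACKUP ran about 17 percent long and PAYROLL ran 50 percent short, so both fall outside a 15 percent deviation, while INVOICE (10 percent) does not.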
Using Reactive Jobs to Indicate Successful Job Completion
Some users like to set up reactive jobs to notify them when a critical job completes normally. This approach has its pros and cons. On the plus side, it provides a “peace of mind” reminder each day. On the minus side, you are notified every day, even when things are going right.
You can set up a Robot Schedule reactive job that is triggered by the normal completion of a backup job or other critical job. The reactive job has Robot Alert send a message when the critical job completes. Because most jobs run and complete at fairly consistent times each day, you know the approximate time by which you should be notified. If you aren’t notified, you check the system.
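The trigger relationship can be pictured with a small Python sketch. It is only a model of the idea, not how Robot Schedule stores or runs reactive jobs; the class ReactiveJobs and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ReactiveJobs:
    """Hypothetical registry of actions that react to normal completions."""

    def __init__(self) -> None:
        self._reactions: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def on_normal_completion(self, job: str, action: Callable[[str], None]) -> None:
        """Register an action to run when `job` completes normally."""
        self._reactions[job].append(action)

    def job_ended(self, job: str, completed_normally: bool) -> int:
        """Run the reactions for `job`; return how many were triggered."""
        if not completed_normally:
            return 0  # an abnormal end does not satisfy the prerequisite
        actions = self._reactions.get(job, [])
        for action in actions:
            action(job)
        return len(actions)


# Usage: record a notification when the nightly backup completes normally.
notices = []
jobs = ReactiveJobs()
jobs.on_normal_completion("BACKUP", lambda name: notices.append(f"{name} completed"))
jobs.job_ended("BACKUP", completed_normally=True)
```

If the backup ends abnormally, no notice is recorded, and the missing message at the usual time is the cue to check the system.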
Using Robot Schedule with Robot Console Resource Monitoring
A final way to check for late-running jobs is to use Robot Schedule with Robot Console resource monitoring. Resource monitoring lets you check on the availability of jobs, subsystems, job queues, objects, controllers, and so on. You can also check whether a batch job is running.
You can create a Robot Schedule job to run the Robot Console RBCCHKRSC command for any resource, at any time of the day. If the resource is not in the correct status, the Robot Schedule job fails, and Robot Alert notifies you. Robot Console also monitors QSYSOPR for inquiry messages or other critical messages that can affect your night processing. For example, with Robot Console and Robot Alert, you are notified within seconds if a night processing routine has a decimal data error, or if a file is full.
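The pattern of a scheduled check that fails when a resource is in the wrong status can be sketched in Python. The sketch does not call RBCCHKRSC; check_resource, the status strings, and the lookup table are hypothetical and only show the fail-to-notify idea.

```python
import sys
from typing import Callable, Optional


def check_resource(name: str, expected_status: str,
                   get_status: Callable[[str], Optional[str]]) -> int:
    """Return 0 if the resource is in the expected status, 1 otherwise.

    A non-zero result stands in for the scheduled job failing, which is
    what would trigger the notification.
    """
    actual = get_status(name)
    if actual == expected_status:
        return 0
    print(f"{name}: expected {expected_status}, found {actual}", file=sys.stderr)
    return 1


# Usage with a made-up status table; a real check would query the system.
statuses = {"QINTER": "ACTIVE", "QBATCH": "ENDED"}
ok = check_resource("QINTER", "ACTIVE", statuses.get)
failed = check_resource("QBATCH", "ACTIVE", statuses.get)
```

Keeping the check as a pass/fail result means the same notification path used for any failed job also covers resources that are down.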
Automation and SLAs
One of the biggest, and often hardest to quantify, benefits of automation is stress reduction. Without automation, you can spend a lot of time “fighting fires” and discussing what went wrong with night processing. With automation, monitoring your computer is automatic: you have your finger on the pulse of your computer operations, no matter where you are. Automating your systems can help you deliver on your SLAs while reducing your stress.