In my recent post, Anatomy of the Ideal Background Job, I asserted the ideal job should be monitorable.
Sling Health Checks provide a basis for monitoring a job, and AEM’s Health Reports console gives administrators an intuitive, visual representation of their status.
Proactive monitoring, on the other hand, requires a machine-readable status.
Ideally, we want an alert to go off in a NOC whenever a job fails or reports an unusual status. That way, we are alerted to the issue immediately instead of waiting until a customer or support engineer notices it.
Using the Sling Health Checks along with a Nagios script, you can have your Adobe Managed Services Customer Success Engineering team proactively monitor your AEM jobs and application.
The first step is to enable an endpoint to monitor the Sling Health Checks by configuring the Sling Health Check Servlet.
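Once the servlet is configured, a quick request can confirm that the endpoint responds. The following is a minimal sketch, assuming the servlet is exposed at /system/health (the path used by the example script in this post). The helper names health_url and check_endpoint, and the AEM_USER and AEM_PASS variables, are hypothetical and not part of any Sling or AEM tooling.

```shell
#!/bin/bash

# Hypothetical helper: build the Health Check servlet URL for a set of tags.
# Assumes the servlet path is /system/health; adjust to your configuration.
health_url() {
    local base="$1" tags="$2"
    printf '%s/system/health?tags=%s&format=json' "$base" "$tags"
}

# Hypothetical helper: print the HTTP status code returned by the endpoint.
# Credentials come from AEM_USER and AEM_PASS (assumed variable names).
check_endpoint() {
    curl -s -o /dev/null -w '%{http_code}' \
        -u "${AEM_USER}:${AEM_PASS}" "$(health_url "$1" "$2")"
}

# Example invocation (requires a running instance):
#   check_endpoint http://localhost:4502 client,author
```

A 404 response would suggest the servlet is not enabled at that path, while a 401 points to a credentials problem rather than a servlet problem.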
Next, create the Sling Health Checks, as I previously covered in the Anatomy of the Ideal Background Job. Sling Health Checks can be used for other parts of your application as well, not just background jobs, so you may want to implement monitoring on services or servlets to ensure your website is functioning correctly.
Once you have your Sling Health Check defined, create a script for Nagios to execute to monitor the Health Check. The script should exit with a status of 0 when the Health Check is in an OK state and a non-zero status otherwise; by Nagios plugin convention, 1 signals a warning and 2 a critical failure.
Here’s an example script created by my most recent project’s excellent Customer Support Engineer Jasmeet Dhiman:
#!/bin/bash
#
# check_doctor
# "Doctor Import"
# "Health Library Import"
# "Locations Import"
#
# This script checks if the Doctor Import service is working as expected.
#
# Author: Jasmeet Dhiman

set -o pipefail
export HOME=/home/nagios

# Read the status and name of the first Health Check result.
STATUS=$(curl -s -u admin:"$(/bin/pass CQ_Admin)" 'http://localhost:4502/system/health?tags=client,author&combineTagsWithOr=false&httpStatus=WARN:418&httpStatus=ERROR:500&format=json' | jq -r '.results[0].status')
NAME=$(curl -s -u admin:"$(/bin/pass CQ_Admin)" 'http://localhost:4502/system/health?tags=client,author&combineTagsWithOr=false&httpStatus=WARN:418&httpStatus=ERROR:500&format=json' | jq -r '.results[0].name')

# Map the status to an exit code: 0 for OK, 1 for WARN, 2 for anything else.
echo "$STATUS" | grep OK > /dev/null 2>&1
if [ $? == 0 ]
then
    ERROR=0
else
    echo "$STATUS" | grep WARN > /dev/null 2>&1
    if [ $? == 0 ]
    then
        ERROR=1
    else
        ERROR=2
    fi
fi

EXITMESSAGE="The current status of ${NAME} is ${STATUS}"
echo "$EXITMESSAGE"
exit $ERROR
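This script queries the Health Check servlet for the checks tagged with both client and author, reads the status and name of the first result, and reports it: it prints a one-line message and exits with 0 for OK, 1 for WARN, and 2 for anything else.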
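The script above reports only the first result. If several Health Checks share the same tags, one possible variation is to report each result and exit with the worst status seen. The following is a rough sketch of that idea, not part of the original script; the function names status_to_code and worst_code are hypothetical, and the input is assumed to be one "status, tab, name" line per check.

```shell
#!/bin/bash

# Map a Health Check status to a Nagios exit code, using the same mapping
# as the script above: OK -> 0, WARN -> 1, anything else -> 2.
status_to_code() {
    case "$1" in
        OK)   echo 0 ;;
        WARN) echo 1 ;;
        *)    echo 2 ;;
    esac
}

# Read "status<TAB>name" lines on stdin, print one message per check, and
# return the highest (worst) exit code seen. Empty input counts as critical.
worst_code() {
    local worst=-1 status name code
    while IFS=$'\t' read -r status name; do
        code=$(status_to_code "$status")
        echo "The current status of ${name} is ${status}"
        if (( code > worst )); then
            worst=$code
        fi
    done
    if (( worst < 0 )); then
        worst=2
    fi
    return "$worst"
}

# Usage sketch, with URL and credentials as in the script above:
#   curl -s -u admin:"$(/bin/pass CQ_Admin)" "$URL" \
#       | jq -r '.results[] | [.status, .name] | @tsv' | worst_code
#   exit $?
```

This variation also needs only one request per run, since the status and name of each result are extracted from the same response.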
Now that you have the script and Health Check, you can update your Run Book and work with the Adobe Managed Services team to get it all deployed and configured. In the Run Book, you will need to provide the contact procedures and information if any of the Health Checks enter an error state.
Once you have the Health Check monitoring in place with AMS Nagios, Adobe Managed Services will get alerts any time the Health Check goes into an Error or Warning state.