Do you like being awakened by a phone call from an unfriendly manager because a system you are responsible for has failed?
Do you enjoy a good old-fashioned scolding by an executive because a financial system you support is running so slowly that the user community has become unproductive?
Do you have so much time available that you avoid scripting a little automation for those routine and repeatable tasks?
If you answered YES to any of these questions, please stop reading and call your doctor. You are sick.
You are a system administrator responsible for supporting and managing a multi-server production environment where a variety of crucial transaction and analytic applications produce financial data that drives a number of key C-level decisions. You are confident that a failure could not occur because you have spent countless hours tweaking the operating system and virtual machine settings on those systems. In fact, you haven’t seen a failure or received a performance complaint since going live last month.
It’s O’dark thirty and you are comfortably asleep with your face buried in your overly soft pillow, dreaming that you are in the CIO’s office and she is offering you that director position that recently became vacant. Just as you are accepting the position, you notice outside her window a police car passing with its siren blaring… and you wake up, realizing that the siren is really your mobile phone. Thankfully, you forgot to place your phone on “vibrate.” Key users of a system you support have been calling the help desk because they can’t log in. You answer the phone and learn that users in Europe and Asia have been unable to log in since around 3:00 AM.
After a brief apology to your spouse for the early AM wake up, you rub the sleep out of your eyes and run to the kitchen to fire up your laptop. After what seems like an eternity (really only 8 minutes) you are connected to the corporate VPN and launch a remote desktop to one of your EPM servers. You quickly realize that all of the Oracle EPM services are stopped on the Windows servers and reflexively kick off the startup script. After another 30 minutes, you are able to verify that everything is operational and communicate this to the after-hours help desk. Your deed is done. Only an hour of lost sleep… for now.
The next Wednesday morning around the same time, you get another rude awakening: the same thing has occurred. This time it is your boss calling to find out what is going on, and she wants a root cause. So you break out the laptop again, apologize to your spouse, connect to the VPN, and kick off the startup script. You are wide awake this time, knowing that this is not an isolated occurrence, and you are determined to find the cause of the failure. After a ten-minute scan through the Windows event logs, you find the culprit. Your bright and shiny servers were never put into a Windows update rotation during planned maintenance periods. Two weeks in a row, the servers received updates that were installed automatically, and the servers restarted as a result.
You do not have any tools in place to identify issues, automatically resolve issues, or alert you when some issues cannot be automatically resolved.
You need a Remote Monitoring and Management (RMM) tool.
What is Remote Monitoring and Management?
Remote monitoring and management is the proactive monitoring of system activity, and it often includes the capability to remediate and alert on the issues it identifies. A number of software vendors produce RMM suites. A few of the more popular vendors of RMM software are Computer Associates, Nagios, ManageEngine, ConnectWise, and Kaseya. Most RMM suites provide the following functionality:
- Monitor services, processes, URLs, and ports for up/down and trigger alerts when down conditions arise.
- Secure tunneled remote desktop / VNC capability without the need for a separate VPN connection.
- Automatic remediation of condition-based failures such as system restarts, process failures, or excessive resource utilization.
- Manage Windows patching by adhering to specific maintenance windows.
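To make the first item concrete, here is a minimal sketch of an up/down check in Python. It is not how any particular RMM product works; the `Check` record, the `run_checks` helper, and the host name and port numbers in the usage note are all assumptions made for illustration.

```python
import socket
from dataclasses import dataclass


@dataclass
class Check:
    """One monitored endpoint (hypothetical structure for this sketch)."""
    name: str
    host: str
    port: int


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return False


def run_checks(checks, probe=is_port_open, alert=print):
    """Probe every check, alert on each failure, and return the names that are down."""
    down = []
    for check in checks:
        if not probe(check.host, check.port):
            down.append(check.name)
            alert(f"ALERT: {check.name} is DOWN ({check.host}:{check.port})")
    return down
```

Something like `run_checks([Check("RDP", "epm01.example.com", 3389)])` could run every minute from a scheduler. A real RMM agent would send the alert by text, phone, or email rather than printing it.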
Sounds pretty smart, huh? Well, it can be. Let’s break down the situation described above and identify how just about any RMM system could have helped you these past two weeks.
| Issue | Resolution with RMM |
| --- | --- |
| You didn’t know there was a failure until the user community complained. | Monitor the server, services, and processes for up/down and provide text, telephone, and email alerting when failures occur. You could have learned of the failure before the first user called the help desk. |
| Downtime continued after the servers restarted because the services did not start up gracefully. | Monitor the Windows System Event Log for event ID 6005, “The event log service was started”. When this event is encountered, trigger a remote script to start the appropriate services again. |
| You had to wait for VPN connectivity and then connect via RDP to determine the cause of failure. | Most RMM tools use VNC or other popular third-party remote desktop tools such as Splashtop, ScreenConnect, or TeamViewer. These tools typically establish their own secure connection to the server and do not require a VPN connection. Additionally, most RMM tools enable remote viewing of event logs. In fact, you could set an alert on any event that indicates a restart. |
| Windows patches snuck in and encouraged a restart of your servers. | You are not the first person this has caught by surprise. There is hope, however. If you use an RMM tool to manage patching, that same tool can alert you when a server is out of compliance or is unmanaged altogether, giving you an opportunity to get the server into a patch management rotation. |
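The event ID 6005 remediation in the table can be sketched as a small piece of decision logic. This is only an illustration under stated assumptions: the `Event` record, the `remediate_after_restart` function, and the idea that an agent hands over recent System log entries as a list are all hypothetical, and `start_services` stands in for whatever startup script you already run by hand.

```python
from dataclasses import dataclass

# Event ID 6005: "The event log service was started" (logged at boot).
BOOT_EVENT_ID = 6005


@dataclass(frozen=True)
class Event:
    """Simplified event log entry (hypothetical shape for this sketch)."""
    record_id: int
    event_id: int


def remediate_after_restart(events, last_handled, start_services):
    """Run start_services once if a boot event newer than last_handled exists.

    Returns the record ID to remember for the next polling cycle, so the
    same restart is never remediated twice.
    """
    new_boots = [
        e.record_id
        for e in events
        if e.event_id == BOOT_EVENT_ID and e.record_id > last_handled
    ]
    if not new_boots:
        return last_handled
    start_services()
    return max(new_boots)
```

Tracking the last handled record ID is a deliberate choice. Without it, every polling cycle would see the same boot event again and rerun the startup script.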
There are many ways to accomplish much the same result without investing in another software package. We decided to invest in RMM software after doing this the hard way for several years for a number of customers for whom we provided managed services. These tools have enabled our SupportNet team to keep a finger on the pulse of the systems we support. Using these tools for monitoring provides alerting not only for actual failures, but also for anticipated failures based on symptoms such as excessive resource utilization, stuck WebLogic threads, and even latency in URL response.
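As one example of symptom-based alerting, the sketch below times a URL request and classifies the result. The function names, the two-second warning threshold, and the status labels are assumptions for illustration, not settings taken from any RMM product.

```python
import time
import urllib.request


def measure_latency(url, timeout=10.0):
    """Return the seconds taken to get the first byte of url, or None on failure."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1)
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
    return time.perf_counter() - start


def classify(latency, warn_after=2.0):
    """Map a latency measurement to a status: 'down', 'slow', or 'ok'."""
    if latency is None:
        return "down"
    if latency > warn_after:
        return "slow"
    return "ok"
```

A "slow" result is the useful one here. It gives you a chance to investigate while users can still log in, rather than after the help desk phone starts ringing.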
The benefits? Increased availability, reduced response time when failures occur, improved forensics capability (post mortem), and fewer apologies to your spouse.