Security Incident Management
Incident Management can be defined as “effectively managing unexpected disruptive events with the objective of minimizing impacts and restoring normal operations” (1). For security-related incidents, this involves all of the steps before, during, and after an information security incident. The consequences may extend far beyond the restoration of normal service: for example, the involvement of law enforcement, notification of customers and clients, and public relations management. The purpose of establishing Incident Management is to ensure that the organization anticipates, and prepares ahead of time, the capability to respond rapidly and effectively to outages, attacks, intrusions, and other security-related events.
Incident Response Team
Incident Response Teams (IRTs) are typically well trained to identify an ongoing situation, take immediate protective steps to limit or contain the damage, and ensure that forensic evidence is captured to facilitate later research and investigation. These teams may be full-time business units (e.g. in financial institutions, health-care organizations, and government organizations) or may share other IT Security responsibilities (e.g. design/system review, policy education, secure system automation, secure coding practices, and operating system hardening). As an analogy, consider the unexpected event of a house fire. The initial incident passes through several stages: the time between ignition and detection, notification of the emergency, the arrival of “first response” in the form of firefighters, and mitigation by extinguishing the flames. However, after the fire is out, a large number of activities often follow, such as cleaning up the damage to restore normal use of the household, potentially an investigation into the cause of the fire (accidental, natural, or man-made), and some form of remediation for prevention (up to and including law enforcement!).
Likewise, an IT Security IRT will have a well-defined approach to managing an exceptional business event that impairs or disrupts normal IT system behavior.
Detection
The first stage of incident management is the detection of an exceptional condition or event. Detection may come from an automated intrusion detection system (IDS), observation of abnormal system behavior or resource utilization (e.g. excessive CPU use), reports from customers of a system outage, or other detection mechanisms. Proactive system surveillance and monitoring include several approaches for ensuring the integrity and resilience of a “defense in depth” approach to security. One such approach is to conduct regular penetration testing against production and non-production systems to look for weaknesses in the firewall, application, data protection, or other security features of the IT organization.
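As a minimal sketch of the resource-utilization case, the following Python flags CPU readings that stay above a threshold for several consecutive samples. The function name, the 90% threshold, and the window of three samples are illustrative assumptions, not values taken from any particular IDS.

```python
from typing import Iterable, List


def sustained_high_cpu(samples: Iterable[float],
                       threshold: float = 90.0,
                       min_consecutive: int = 3) -> List[int]:
    """Return the indexes at which CPU use has been above `threshold`
    for at least `min_consecutive` samples in a row.

    The defaults are hypothetical and would be tuned per system.
    """
    alerts = []
    run_length = 0
    for index, cpu_percent in enumerate(samples):
        if cpu_percent > threshold:
            run_length += 1
        else:
            run_length = 0  # a normal reading resets the run
        if run_length >= min_consecutive:
            alerts.append(index)
    return alerts


# Made-up CPU percentages sampled at a fixed interval
readings = [35.0, 92.5, 97.0, 40.0, 95.0, 96.0, 99.0, 98.0]
alert_indexes = sustained_high_cpu(readings)
print(alert_indexes)  # [6, 7]
```

Requiring a sustained run rather than a single high reading is one simple way to reduce false positives; a short spike at indexes 1 and 2 does not raise an alert here.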
Once a possible event has been detected, the appropriate IRT is notified and immediately begins to determine its validity (to rule out “false positive” events), capture and protect log and audit data for forensic analysis, run anti-malware and vulnerability remediation applications, and deploy other intrusion prevention systems (IPS).
Triage
Once a security incident is detected and verified, it is important to determine the nature of the threat and the appropriate response. This security “triage” is similar to the treatment of multiple injured patients when medical resources are limited: the most critical injuries are treated first. In the case of an IT Security event, the intent is to contain the systemic damage or information leakage so that unaffected systems do not become compromised. There is typically a period of time while the IRT determines exactly what the attack or incident involves, which makes pre-planning with a set of prepared contingencies critical. These plans must be prepared well ahead of any potential incident. They should include a hierarchy of notification points (see Escalation below), establish a set of emergency response procedures intended to minimize exposure, and provide for a business continuity plan that can be executed while the primary IT systems are not available.
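The "most critical first" rule can be sketched as a simple ordering of open incidents. The `Incident` fields, the 1 to 5 severity scale, and the tie-breaking rule (incidents that are still spreading come first) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    name: str
    severity: int    # hypothetical scale: 1 (low) to 5 (critical)
    spreading: bool  # True if unaffected systems are still at risk


def triage_order(incidents: List[Incident]) -> List[Incident]:
    """Sort by severity (highest first); among equals, spreading incidents first."""
    # False sorts before True, so `not spreading` puts spreading incidents first.
    return sorted(incidents, key=lambda i: (-i.severity, not i.spreading))


# Made-up incident queue
queue = [
    Incident("defaced intranet page", 2, False),
    Incident("ransomware on file server", 5, True),
    Incident("stolen laptop", 5, False),
    Incident("phishing email reported", 2, True),
]
ordered_names = [incident.name for incident in triage_order(queue)]
print(ordered_names)
```

With this queue, the ransomware incident is handled first, followed by the stolen laptop, the phishing report, and the defaced page.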
Escalation
Security incident management typically involves determining whether additional skills or knowledge are needed for resolution. Given the highly technical nature of IT system management, it is common practice to establish multiple levels of response to an incident. For most organizations, a three-tiered response structure with defined escalation points is considered sufficient:
Tier 1 – Call Center / Service Desk – This function allows IT system users to report incidents, system flaws, outages, and other abnormal IT events. A call center is usually provided with a set of standard procedures to diagnose and evaluate the seriousness of any reported incident. These can include workarounds for system behavior flaws, assistance in recovery from a given error condition, or other routine assistance. Situations of greater impact, such as an outage, are forwarded to the Tier 2 team for handling.
Tier 2 – Incident Response Team (IRT) – As discussed above, the IRT is a part-time or full-time group that is brought in to handle more serious IT security incidents. Because such requests are disruptive, it is important that the Tier 1 team be well trained in when to call upon the Tier 2 response team. This training should include the ability to recognize a potential security incident (intrusion, denial-of-service attack, data breach, etc.) and knowledge of the correct point of contact for escalation.
Tier 3 – Technical Support and System Development – The final response tier typically consists of the IT technical teams that have a deep understanding of the network, software, data, and other aspects of the IT environment. These groups are contacted and brought into an IT security event by the IRT to assist in the identification of the threat, containment of loss, and remediation for future prevention.
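The three tiers can be sketched as a small routing function. The category names, the `is_outage` and `needs_specialist` flags, and the function itself are hypothetical; a real service desk would encode its own escalation criteria.

```python
from enum import IntEnum


class Tier(IntEnum):
    SERVICE_DESK = 1       # Tier 1: call center / service desk
    INCIDENT_RESPONSE = 2  # Tier 2: incident response team (IRT)
    TECHNICAL_SUPPORT = 3  # Tier 3: technical support and system development


# Hypothetical categories that Tier 1 is trained to recognize as security incidents
SECURITY_CATEGORIES = {"intrusion", "denial-of-service", "data breach"}


def route(category: str, is_outage: bool = False,
          needs_specialist: bool = False) -> Tier:
    """Pick the tier that should handle a reported incident."""
    if category in SECURITY_CATEGORIES or is_outage:
        # Tier 3 is only engaged through the IRT, never directly from Tier 1.
        if needs_specialist:
            return Tier.TECHNICAL_SUPPORT
        return Tier.INCIDENT_RESPONSE
    return Tier.SERVICE_DESK


print(route("password reset").name)                     # SERVICE_DESK
print(route("data breach").name)                        # INCIDENT_RESPONSE
print(route("intrusion", needs_specialist=True).name)   # TECHNICAL_SUPPORT
```

Routine requests stay at Tier 1 even when `needs_specialist` is set, which reflects the rule that only the IRT brings in the Tier 3 teams.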
Root Cause Analysis
Once the immediate situation around an IT Security incident is resolved, the next step is to investigate the root cause of the incident. This can be a straightforward exercise for easily identified causes (e.g. a system design flaw), or it may involve deep analysis of log files, audit records, database changes, and other forensic evidence. In some cases, it will be necessary to involve law enforcement and the corporate legal department as part of the overall investigation.
- Collect Data – The sources of information will often be found in system logs, audit records, data change records, file edits, or other persistent records of a system change. For example, if a user’s credentials have been subverted, then any audit records generated after that point can be used to investigate changes made under those credentials.
- Construct Causal Factor Chart – This exercise creates a connected list of possible causes for the security event. One such model is the “fish-bone” chart, which shows a series of connected causes that lead directly to a specific event. Alternatively, risk/threat trees can be used to investigate all of the possible causes of an event.
- Identify Root Cause – The identification of the actual root cause of a security event may take a great deal of time and investigation. However, taking action before the root cause is understood may result in unnecessary business complications and revenue loss.
- Generate Remediation Approach – As the final step in root-cause analysis, a remediation approach is created and applied. These “controls” can include additional detection and monitoring, restriction of a network segment, installation of additional or upgraded firewalls, etc.
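The credential example in the Collect Data step can be sketched as a filter over audit records. The `AuditRecord` fields, the user names, and the timestamps below are made up for illustration; real audit formats vary by system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class AuditRecord:
    timestamp: datetime
    user: str
    action: str


def records_after_compromise(records: Iterable[AuditRecord],
                             user: str,
                             compromised_at: datetime) -> List[AuditRecord]:
    """Return the user's records at or after the compromise time, oldest first."""
    suspect = [r for r in records
               if r.user == user and r.timestamp >= compromised_at]
    return sorted(suspect, key=lambda r: r.timestamp)


# Hypothetical audit trail; records may arrive out of order
audit_log = [
    AuditRecord(datetime(2024, 1, 10, 9, 0), "alice", "login"),
    AuditRecord(datetime(2024, 1, 10, 14, 30), "alice", "export customer table"),
    AuditRecord(datetime(2024, 1, 10, 13, 5), "alice", "change password"),
    AuditRecord(datetime(2024, 1, 10, 13, 30), "bob", "login"),
]
compromised_at = datetime(2024, 1, 10, 12, 0)  # assumed time of credential theft
suspect_actions = [r.action for r in
                   records_after_compromise(audit_log, "alice", compromised_at)]
print(suspect_actions)  # ['change password', 'export customer table']
```

The resulting list gives investigators a time-ordered view of every change made under the subverted credentials, which can then feed the causal factor chart.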
Prevention
As a final aspect of security incident management, preventative action should be considered for all known risks. While not every risk can be mitigated at a reasonable cost, many of the common system errors can be avoided. Maintaining a regular operating system patching schedule, ensuring that applications use secure versions of third-party components, reviewing audit logs for irregularities, and deploying intrusion detection and prevention systems are among the many measures that can be taken to reduce or eliminate unexpected system security events.
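As one hedged illustration of the third-party component check, the sketch below compares installed versions against a list of minimum secure versions. The component names and version numbers are invented, and the parser assumes plain dotted numeric versions with the same number of parts (such as "2.4.1"); real version schemes often need a dedicated library.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically.

    Assumes dotted numeric versions only (no suffixes such as 'rc1').
    """
    return tuple(int(part) for part in text.split("."))


def outdated_components(installed: Dict[str, str],
                        minimum_secure: Dict[str, str]) -> List[str]:
    """Return names of installed components below their minimum secure version."""
    flagged = []
    for name, version in sorted(installed.items()):
        required = minimum_secure.get(name)
        # Components without a known minimum are skipped, not flagged.
        if required is not None and parse_version(version) < parse_version(required):
            flagged.append(name)
    return flagged


# Hypothetical inventory and advisory data
installed = {"libexample": "2.4.1", "webframework": "5.0.0", "imagetool": "1.9.12"}
minimum_secure = {"libexample": "2.10.0", "webframework": "4.2.7"}
needs_update = outdated_components(installed, minimum_secure)
print(needs_update)  # ['libexample']
```

Parsing into integer tuples matters here: as plain strings, "2.4.1" would sort after "2.10.0" and the outdated component would be missed.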