Should Incidents Be Re-Opened?

Should Incidents be re-opened? The simple answer is yes, if they were Closed incorrectly. Incorrect closure may include incorrect or incomplete testing, or failure to confirm service restoration with the customer or user. However, IT environments are complex and reality is seldom so simple. Instead, I advocate against reopening Incidents once a 2-3 day Resolved period has passed.

The best trade-off, in general, is to allow a 2-3 day burn-in period after the request is fulfilled or the Incident is Resolved. Resolved means service has been restored, the affected parties have been notified, and all records have been updated. The contact then has 2-3 days to test and validate before the Incident record is Closed, generally automatically by the tool workflow. Once Closed, the Incident cannot be reopened.
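The workflow above can be sketched in a few lines. This is only an illustration, not how any particular tool implements it: the `Incident` class, its method names, and the 3-day value are my assumptions, and I assume a Resolved Incident can be sent back to Open while the burn-in period is still running.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: a 3-day Resolved period; the post suggests 2-3 days.
RESOLVED_PERIOD = timedelta(days=3)


@dataclass
class Incident:
    """Hypothetical Incident record with three statuses."""
    status: str = "Open"  # "Open", "Resolved", or "Closed"
    resolved_at: Optional[datetime] = None

    def resolve(self, now: datetime) -> None:
        """Service restored, parties notified, records updated."""
        self.status = "Resolved"
        self.resolved_at = now

    def auto_close(self, now: datetime) -> None:
        """Close the record once the Resolved period has elapsed."""
        if self.status == "Resolved" and now - self.resolved_at >= RESOLVED_PERIOD:
            self.status = "Closed"

    def reopen(self, now: datetime) -> None:
        """Allow reopening only while the Incident is still Resolved."""
        self.auto_close(now)  # apply any overdue closure first
        if self.status != "Resolved":
            raise ValueError(f"cannot reopen an Incident that is {self.status}")
        self.status = "Open"
        self.resolved_at = None
```

In a real tool the `auto_close` step would normally run as a scheduled workflow rule rather than being called by hand.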

There exist perverse incentives to create multiple Incidents, particularly in a pay-per-issue billing model. On the other hand, there is also the opposite perverse incentive to re-open Incidents for new Incidents or requests, and to include multiple, unrelated requests in the same issue. Sometimes this happens just out of laziness, i.e. it is faster to reply to an existing email than to create a new one.

In addition, there is a gray area between what is a new Incident and what is an existing Incident. Some errors are intermittent. Restarting the device or application may restore service, but the Incident may occur again in a few hours, days, or weeks. In this case a Problem record should be raised, but the Incident may recur before the Problem Management and Change Management processes can run their course. Are these repeat Incidents new or existing? Every organization should have its own answer, and it depends on the Incident. A 2-3 day separation between recurrences is a good general policy for distinguishing new Incidents from existing ones.
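As a rough sketch of that separation rule, the check below treats a recurrence as a new Incident only when it falls outside the window. The function name and the 3-day constant are my assumptions; your own policy may use a different threshold or vary it by type of Incident.

```python
from datetime import datetime, timedelta

# Assumption: 3 days, the upper end of the 2-3 day separation suggested above.
RECURRENCE_WINDOW = timedelta(days=3)


def is_new_incident(previous_resolved_at: datetime, recurred_at: datetime) -> bool:
    """Return True when the recurrence should be logged as a new Incident."""
    return recurred_at - previous_resolved_at > RECURRENCE_WINDOW
```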

Organizations that choose to re-open Incidents should track these Incidents. An independent party should verify they were re-opened appropriately, and any inappropriate activities should be managed through administrative or disciplinary actions, hand-slapping, or public humiliation. If this sounds bureaucratic or patriarchal, it is. In general it is easier to define the policy in terms of time and enforce it with a tool.

The 2-3 day Resolved period is not perfect for all situations and not suitable for all organizations. However, I have found through experience it is a good solution that is flexible, widely applicable, unbureaucratic, conceptually simple, and generally fair to all parties. Once Closed, the Incident should remain Closed.

The Role of COBIT5 in IT Service Management

In “Improvement in COBIT5” I discussed my preference for the Continual Improvement life cycle.

Recently I was fact-checking a post on ITIL (priorities in Incident Management) and I became curious about the guidance in COBIT5.

The relevant location is “DSS02.02 Record, classify and prioritize requests and incidents” in “DSS02 Manage Service Requests and Incidents”. Here is what it says:

3. Prioritise service requests and incidents based on SLA service definition of business impact and urgency.

Yes, that’s all it says. Clearly COBIT5 has some room for improvement.

COBIT5 is an excellent resource that complements several frameworks, including ITIL, without being able to replace them. For the record, the COBIT5 framework says it serves as a “reference and framework to integrate multiple frameworks,” including ITIL. COBIT5 never claims it replaces other frameworks.

We shouldn’t expect to throw away ITIL books for a while. Damn! I was hoping to clear up some shelf space.

Incident Prioritization in Detail

One of the advantages of working with BMC FootPrints is the lack of definition “out of the box”.1 The tool provides easy configuration for fields, priorities and statuses, and workflows within multiple workspaces, but there are few defaults (besides sample workspaces that are not very usable). This lack of out of the box configurations has exposed me to an infinite variety of choices used by different organizations.

Priority

One organization used the term Severity. This has the advantage of abbreviating to “Sev”, so incidents can be described as Sev1, Sev2, etc. Nevertheless, most organizations stick with Priority.

I have seen these range all the way from 2 (Critical, Normal) to as many as 7 or 8.

 

2          3        4          5          6                 7
Critical   High     Critical   Critical   Critical          P1
Normal     Medium   High       High       High              P2
           Low      Medium     Medium     Medium            P3
                    Low        Normal     Low               P4
                               Project    Service Request   P5
                                          Normal

The table above shows the more common configurations. In my experience the use of terms (High, Medium, Low) is more common than numbering (P1, P2, P3), but the latter is also used.

One of my clients had used numbering, P1 through P5, but they had overused P1 so badly they had to insert a new P0 to achieve the purpose of P1. Fortunately, they have since fixed the issue. (This reminds me of the project “prioritization” of a former employer. Everything on the list was “High Priority”. They effectively said everything was the same priority, and they were all low.)

I encourage the use of “Normal” instead of “Low”, because no user wants their issue perceived as “low priority”. I have also seen a customer take this advice but use Normal in place of Medium rather than Low. Most organizations track Service Requests with Incidents, so we usually want some mechanism for differentiating them, but note that new priorities are not required (see Urgency below).

I also find it common to create a separate priority level, Project, for handling projects (or extended service requests) that are scheduled past normal Service Level targets. A Project choice is particularly useful when Service Level measurements are tied to Priority (my colleague, Evans Martin, has written about this already).

I have also seen duplicated sets of Priority choices tied to different Service Levels depending on which team was assigned the work, or on regional organizational differences. For example, a software issue assigned to a development group might have a separate set of service levels but remain assigned in the same Service Desk system for tracking purposes. In this case they might have choices like P1, P2, P3, P1-Dev, P2-Dev, and P3-Dev.

Impact

Impact describes the level of business impact. Usually this is described in terms of the number or percentage of users impacted. I had one customer who described the percentage of configuration items (CIs) at its facilities that were impacted (see column 5 below).

 

1        2            3              4                    5
High     10+ People   Organization   Entire Company       80-100%
Medium   2-9 People   Department     One/Multiple Sites   50-80%
Low      1 Person     Individual     Department           20-50%
                                     VIP                  Under 20%
                                     Individual

I have seen organizations describe the number of people affected (column 2), but most common are the choices ranging from Entire Organization to Individual. The choices in between need to reflect your own organization. One customer who ran fitness outlets needed to distinguish corporate sites from fitness centers.

The default configuration High/Medium/Low (column 1) is too ambiguous in most cases, but I have seen it used.

Many organizations separate VIPs from non-VIP individuals. VIPs will often map to the same Priority as Departments.

Urgency

Urgency describes how quickly the incidents should be resolved. In the simplest case this can be High, Medium, and Low, but as with Impact this is usually too ambiguous to be useful.

 

1        2             3
High     0-2 Hours     Down
Medium   3-4 Hours     Affected
Low      4-8 Hours     Service Request
         1-2 Days      Project
         Over 3 Days

I have also seen Urgency described in Resolution time frames. There are two issues with this: the time frame is easily confused with Service Levels (which they are not), and the time frames are also ambiguous especially in situations when no downtime is acceptable. I find Down, Affected, and Request to be useful.

The combination of column 4 in Impact and column 3 in Urgency results in sentences that read in English: the Company is Affected, or the Individual is Down. I like this because the intent is clear.

Mapping Table

The mapping from Impact and Urgency to Priority can usually be described in a table like below. There are no right or wrong answers here, and it varies by organization and by choices for Impact and Urgency. In the table, Impact runs in the first column and Urgency runs in the first row.

 

                 Down       Affected   Request   Project
Entire Company   Critical   Critical   High      Project
Department       Critical   High       Medium    Project
VIP              High       Medium     Normal    Project
Individual       Medium     Normal     Normal    Project

In some cases multiple choices of Impact or Urgency will always map to the same priority. For example, VIP often maps like Department. Although I encourage simplicity, sometimes it makes sense to break them out in order to make the choices clear. (You could also stack choices, such as “Department / VIP”).

You will also need to decide whether to allow overriding these choices. If so, you will need to add a third field (called something like Override or Priority Override) to your mapping table.
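The mapping table translates naturally into a lookup keyed by Impact and Urgency. The sketch below copies the values from the table above. The names `PRIORITY_ROWS`, `PRIORITY_MAP`, and `priority_for` are mine, and the optional `override` argument is one possible way to model the override field, not how any particular tool does it.

```python
from typing import Optional

# Urgency choices, in the same order as the columns of the mapping table.
URGENCIES = ("Down", "Affected", "Request", "Project")

# One row per Impact choice, copied from the mapping table.
PRIORITY_ROWS = {
    "Entire Company": ("Critical", "Critical", "High", "Project"),
    "Department": ("Critical", "High", "Medium", "Project"),
    "VIP": ("High", "Medium", "Normal", "Project"),
    "Individual": ("Medium", "Normal", "Normal", "Project"),
}

# Flatten the rows into a dictionary keyed by (impact, urgency).
PRIORITY_MAP = {
    (impact, urgency): priority
    for impact, row in PRIORITY_ROWS.items()
    for urgency, priority in zip(URGENCIES, row)
}


def priority_for(impact: str, urgency: str, override: Optional[str] = None) -> str:
    """Look up Priority; a non-empty override (hypothetical field) wins."""
    if override:
        return override
    return PRIORITY_MAP[(impact, urgency)]
```

For example, `priority_for("Department", "Affected")` returns `"High"`, while an unknown combination raises a `KeyError`, which is a reasonable signal that the table and the field choices have drifted apart.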

Other Issues

  1. Start the discussion with minimal choices for Impact, Urgency, and Priority. Add choices only as necessary.
  2. If the tool has default choices, start with those.
  3. You may have Service Level Agreements tied to your Priority that need to be factored in.
  4. Avoid duplicating terms across fields, such as using High/Medium/Low in both Urgency and Priority.
  5. You need to decide whether customers / users can choose the Priority. I don’t encourage it, because the user may not be qualified to understand the Impact. Moreover they will always choose critical. Nevertheless, many organizations do allow it.
  6. Decide if you want default choices for Impact and Urgency. Doing so may limit the usefulness of Priority (IT agents are lazy and often leave the defaults).
  7. As discussed before, you may need a policy for when and whether Priority can be changed.

1 Several customers preferred more options out of the box. I can understand the desire for the “standard configurations” provided by other vendors, but at the time it seemed strange and undesirable.

Changing Incident Priority

The correlation between sanity and LinkedIn Groups is inverted. I joined several groups because I like to stay connected with the industry, but the disinformation (and verbosity) can be infuriating. Recently I read the following, and several people agreed.

The priority of an incident must never be changed, once determined

For the record, here was my response:

Whether and how the priority should change is a policy issue for the organization. I am not aware of any “good practices” that say one way or the other. Some organizations allow the customer or user to provide the initial prioritization. The Service Desk should review the initial prioritization as a matter of good practice (and obvious necessity).

As Stephen suggested, and as described in ITIL 2011, the calculation of Priority will often be based on Urgency and Impact.

If you enforced this policy in the tool, just imagine the consequences of a simple data entry error that wasn’t detected prior to saving. Fortunately, few organizations use this policy, and ITIL 2011 is even more liberal.

It should be noted that an incident’s priority may be dynamic — if circumstances change, or if an incident is not resolved within SLA target times, then the priority must be altered to reflect the new situation. Changes to priority that might occur throughout the management of an incident should be recorded in the incident record to provide an audit trail of why the priority was changed.

In my experience few organizations create an audit trail for the change of an incident prioritization (although some tools, such as FootPrints Service Core, track these changes in the History). As a general good practice I stand by my original comment.
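For organizations that do want the audit trail ITIL describes, the idea is simple enough to sketch. This is a minimal illustration with hypothetical names (`IncidentRecord`, `PriorityChange`, `change_priority`); it is not a description of how FootPrints or any other tool stores its History.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PriorityChange:
    """One audit-trail entry: what changed, when, and why."""
    changed_at: datetime
    old: str
    new: str
    reason: str


@dataclass
class IncidentRecord:
    """Hypothetical incident record that keeps a priority history."""
    priority: str
    history: List[PriorityChange] = field(default_factory=list)

    def change_priority(self, new: str, reason: str, now: datetime) -> None:
        """Change the priority and record why it was changed."""
        if not reason.strip():
            raise ValueError("a reason is required for the audit trail")
        if new == self.priority:
            return  # nothing changed, so nothing to record
        self.history.append(PriorityChange(now, self.priority, new, reason))
        self.priority = new
```

Requiring a reason is a design choice on my part; it keeps the trail useful for answering why the priority changed, not just when.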

I will discuss the details of incident priorities in an upcoming post.

Changing Priority

Question: An Incident meets the criteria for P1. However, midway through resolution the impact has changed to that of P2. How should we treat the Incident now? How should we measure the SLA, based on P1 or P2?

Answer: I don’t believe ITIL provides much specific advice about this condition. How you want to handle this is really up to your organization and, more specifically, your Incident Policy.

In general organizations will prioritize Incidents based on the Impact (i.e. how many people or systems are affected) and Urgency (how long the organization can function with that service down or degraded).

Was the original assessment of Impact and Urgency incorrect? In this case you should change the prioritization.

Did the impact change because you applied a fix or workaround? In this case you should not change the prioritization.
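The two questions above amount to a small decision rule, which could be written down as follows. The enum and function names are hypothetical, and your own Incident Policy may recognize other causes.

```python
from enum import Enum


class ImpactChangeCause(Enum):
    """Why the apparent impact differs from the original assessment."""
    ORIGINAL_ASSESSMENT_WRONG = "original assessment was incorrect"
    FIX_OR_WORKAROUND_APPLIED = "fix or workaround reduced the impact"


def should_change_priority(cause: ImpactChangeCause) -> bool:
    """Re-prioritize only when the original assessment was wrong."""
    return cause is ImpactChangeCause.ORIGINAL_ASSESSMENT_WRONG
```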

Your SLAs usually measure to full restoration. You could also measure to the point a workaround is provided. Your SLAs won’t usually include fractional measurements based on changes to prioritization. For that matter, most organizations don’t even have formally agreed SLAs with their customers.