What can ITSM learn from Hurricane Irma?

Article first published 21st September 2017

I recently had the opportunity to take a short break between contracts and booked a holiday in Cuba. Unknown to me, this was about to coincide with Hurricane Irma (Sept 2017) moving from the mid-Atlantic across the range of islands. Out of every life experience I try to draw some learnings, especially in the ITSM space, and this once-in-a-lifetime event certainly taught me a lot. So, let's look at what happened and draw the comparison.

We started our break in the Cayo Coco region, which quickly became known to be in the hurricane's projected path. Once this was confirmed we were moved west, nine hours across the island, to an area outside the known path with enough hotel capacity to take the 1,200 people. Once established there, the hurricane changed course again: our hotel was put on hurricane watch, and then full-blown emergency procedures were implemented as the eye was due to pass close by.

Whilst this is predominantly an ITSM reflection, it would be wrong not to acknowledge both the Melia Hotel managers at Cayo Coco and Varadero and our Thomas Cook holiday rep, people who demonstrated incredible professionalism in their operational ranks.

So, once it was evident that our evacuation location was now in the direct path, all residents were briefed and taken to a safe area under the hotel, a large space that actually housed the kitchen prep and inbound delivery area. The Hotel Manager assumed the crisis manager role, introduced his deputy and the crisis operating model (in three languages), and then the carrier reps provided specific client updates throughout the duration.

Putting the crisis management element to one side, part of the focus of this article is on service. We were in the throes of a natural disaster with over 500 "paying" guests. It would have been easy to drop the level of service, and yet throughout the incident food was continually provided, the toilets were cleaned, and the holidaymakers endured over 18 hours in what was basically an access corridor without any cross words and no alcohol! In fact, the ratio of staff to guests was around 1:5 (an observation I am planning to cover in a later article).

So, what compelled me to write this article? Well, it's quite simple: in ITSM we regularly talk about three things, service design, DR and service levels / SLAs, but too often organisations do not take these seriously or do them well, and I challenged myself to ask why. What is the underlying factor?

If I reflect on the example of Irma, a number of things were clear:

1) There was a plan. This was not created a few days before. The level of service maintained and the level of execution could only be put down to strong planning and regular modelling / simulation.

2) There was a clear mandate – protect life and continue to deliver outstanding service.

3) The outcome (for point 2) was never in doubt. There was no hesitancy, management crisis or shortfall in delivery. Everything "just worked".

Now, on reflection, I suspect two things ensured this happened.

Firstly, lives depended on the outcome. Both the hotel chain and the holiday carrier had a duty of care; that was clear in their approach, and a 99.95% outcome in this area would simply not be acceptable. I would suggest that the service aspect was a secondary consideration, but in our ITSM terms, all they decided was that the Vital Business Functions were set with a very high bar.

Secondly, they were used to the natural phenomenon. Whilst this was a Cat 4-5 hurricane, lower categories are not uncommon, as are tropical storms, so at some point there would always be an expectation to initiate the plan. I would suspect this was not a procedure given lip service in a management meeting or taken out of scope due to a cost or time over-run.

So why the reflection?

Each year, the UK media seems to report significant IT failures, and I am sure on a global scale they occur just as regularly, yet in most cases the true root cause never gets reported. We as ITSM professionals can speculate on it, and I am sure our experiences of working in different organisations can cite system failures that should never have happened, or that should have been recovered much more quickly with a better level of service. In simple terms, feedback regularly tells us that we let our customers or users down.

Watching the recovery process during Hurricane Irma piqued my interest in a number of areas:

1) Do we approach any service with a mindset that our common "natural disasters" happen? In our world, this could be a CryptoLocker-type attack, a significant database corruption or a major datacentre / infrastructure outage. As a service function, we have a right to be the voice of doom and approach service with a mindset that our "Irma" is always out there.

2) As part of service design, do we challenge the architects to put the potential points of failure on the table and then have clear options to mitigate them? More than this, do we actually turn these scenarios into service metrics?

3) Once those are clearly identified and solutioned, we then put our service hat on and take up our position: is the contractual SLA good enough, or as a collective is a higher-level outcome our goal?

4) Regardless of the answer to the above question, once agreed, that becomes the recovery objective. In essence we reset the bar: if a four-hour recovery is our agreed outcome, then failure to achieve it should be treated with the same mindset as the mandate demonstrated by the crisis team in Cuba (i.e. not an option). A simple sketch of this idea follows.
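
To make that concrete, here is a minimal, hypothetical Python sketch of points 2 to 4: failure scenarios put on the table at design time, each with an agreed recovery objective, and a pass/fail check where a miss is treated as a failure rather than a statistic. The scenario names and figures are illustrative assumptions, not taken from any real plan.

    # Hypothetical sketch only: scenarios and objectives are illustrative
    # assumptions, not real figures.

    # Agreed recovery objectives (hours) per "natural disaster" scenario
    RECOVERY_OBJECTIVES_HRS = {
        "ransomware_attack": 8.0,
        "database_corruption": 6.0,
        "datacentre_outage": 4.0,
    }

    def recovery_met(scenario: str, actual_hours: float) -> bool:
        """Once the objective is agreed, it is the bar: a miss is a failure."""
        return actual_hours <= RECOVERY_OBJECTIVES_HRS[scenario]

    # e.g. feeding a simulation or rehearsal result back into the service review
    print(recovery_met("datacentre_outage", 3.5))  # True: within the agreed 4 hours
    print(recovery_met("datacentre_outage", 5.0))  # False: the bar was missed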

Acceptance of point 4 above is not uncommon, and I am sure a lot of organisations go through the four steps above during major projects and implementations; the problem is that they fail to turn them into an executable outcome whose only guaranteed conclusion is success. By that I mean truly exploring the "what ifs", getting the recovery processes clearly documented, testing them and then throwing in a curve ball at the last minute! (On a complete side note, I recently watched the film Sully, and it was interesting to see the "simulated outcome" change significantly once the "human factor" was added in, but that's for another day.)

Maybe it needs a change of perspective? Maybe if a life depended on it, or if service was truly king, then the focus on service design, recovery, continuity and ensuring that the customer continues to receive the same level of service in the face of adversity, regardless of the cost, would be more prevalent in the ITSM service cycle.

In closing, I suspect the real benefits of V3 (if you hang your hat on this model) are still not being realised, and as an industry we are sitting in a hybrid world: acknowledging V3 as a framework but working in a comfortable V2 manner. Otherwise, the elements of Service Strategy and Service Design in their new guise would be ensuring that services designed and implemented (certainly in the six years since the 2011 refresh of V3) would be closer to the service excellence I recently experienced.

If you have enjoyed reading this article and would like to discuss it in more detail, I would welcome your thoughts.

Has the SLA had its day?

Article first published 16th July 2017

I was first introduced to IT Service Management back in 2002. The concept of the SLA and availability targets was fairly straightforward, with each service component having a supplier-given availability figure and the overall service target being the result of multiplying the service elements out. A recent assignment has significantly challenged my view of this. If you are interested in my viewpoint, please read on.

Now for me, this was probably a late-to-the-game "eureka" moment, but when you are elbows deep in the day-to-day delivery of service it is sometimes difficult to step back and see the changes around you. My assignment involved a request to create a new set of SLAs, as a new IT Director wanted to quickly understand the portfolio of services.

The scope was agreed as a full end-to-end study: taking the commercial agreement to the end client, mapping the internal operational systems, identifying the service elements, and then carrying out the traditional mapping of support and support hours against the operational hours. All of that was straightforward, but the first step of reviewing the client contracts raised an interesting observation which, when reviewed in the cold light of day, was obvious, and it posed to me the basic question of the validity of the system-availability-driven SLA.

What was the new variable that challenged the foundation of my ITSM compass? It was quite simple: the majority of the commercial contracts had limited reference to systems, system uptime or availability. Instead, they now referenced "outcomes". Two examples are as follows:

  • All orders received into the supplier's system prior to 17:00 that are flagged as next-day delivery to be processed in time for the final planned transport pick-up, in order to fulfil the next-day delivery criteria
  • Order confirmation and order dispatch messages to be received back into the customer's EDI gateway no later than 15 minutes after the corresponding action is taken in the supplier's WMS

So what has changed? The basic premise of the old approach to the SLA was purely that the system was "available". I am sure when the concept was drawn up it was "good enough", both to give a level of reassurance that IT was taking the internal customer seriously and could "nail its colours to the mast", and to give a point of reference for conducting a service review. But technology has moved forward. In the retail and logistics world, for example, service delivery to the end user has become close to real time (who would have thought, at the changeover to the millennium, that the likes of Amazon would soon be offering a service proposition whereby you pay a fixed fee and can order a wide range of products that, if ordered by 5pm, can be delivered to you the next day at no cost), and a basic availability target is no longer sufficient.

Why is that? Quite simply, a standard availability figure does not allow for the constraints of time-bound activities. Taking the first example above, we can clearly show how this is no longer suitable, as follows (a worked calculation appears after the list):

  1. An operation works Monday to Saturday 06:00 – 23:00, but the final transport pick-up is at 21:00. Therefore the effective service window is 60 x 15 x 6 = 5400 mins per week
  2. The system availability target is calculated as 98.4% (from an example server target of 99.5%, a network target of 99.75%, and an application support target, based on a 90-minute P1 fix across 24 x 7, of 99.1%)
  3. The 1.6% of allowed unplanned downtime against 5400 minutes gives 86 minutes
  4. With an order cut-off of 17:00 and a last order pick-up of 21:00, the four-hour window to pick, pack and dispatch is now reduced to just over two and a half hours. The risk and potential penalty has now moved from IT to the operation. Unless a sliding application fix SLA reduces the P1 fix time to 30 minutes during that 17:00 – 21:00 window, the availability-driven SLA no longer supports the outcome-driven contract clause
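
To make the arithmetic above concrete, here is a minimal Python sketch of the calculation, using the figures from the example (note that the compound availability is rounded to 98.4% before the downtime budget is taken; the exact product gives a budget nearer 89 minutes):

    # Compound availability: components in series multiply out
    server, network, app_support = 0.995, 0.9975, 0.991
    availability = server * network * app_support        # ~0.9836, quoted as 98.4%

    # Weekly service window: 06:00 - 21:00 (15 hours) x 6 days, in minutes
    window_mins = 15 * 60 * 6                            # 5400

    # Downtime budget the SLA permits per week, using the rounded 98.4%
    downtime_budget = window_mins * (1 - 0.984)          # ~86 mins

    # Worst case, the 17:00 - 21:00 pick/pack/dispatch window erodes to:
    dispatch_mins = 4 * 60 - downtime_budget             # ~154 mins, just over 2.5 hrs
    print(f"{availability:.1%}, {downtime_budget:.0f} mins, {dispatch_mins:.0f} mins")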

In a similar way, if we take the second contract criterion of messages being delivered back to the customer system within 15 minutes, the traditional availability-target SLA, which allows around 86 minutes of unplanned downtime per week, clearly does not support that requirement.
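
By contrast, measuring that clause as an outcome is straightforward. Below is a hypothetical Python sketch, assuming pairs of timestamps (WMS action, EDI confirmation) pulled from system logs; the event data and names are illustrative assumptions:

    from datetime import datetime, timedelta

    SLA_WINDOW = timedelta(minutes=15)   # the contractual outcome: confirm within 15 mins

    # (wms_action_time, edi_confirmation_time) pairs - illustrative data only
    events = [
        (datetime(2017, 7, 10, 17, 2),  datetime(2017, 7, 10, 17, 9)),   # met
        (datetime(2017, 7, 10, 17, 30), datetime(2017, 7, 10, 17, 50)),  # breached
    ]

    breaches = [e for e in events if e[1] - e[0] > SLA_WINDOW]
    print(f"Outcome met for {len(events) - len(breaches)}/{len(events)} messages; "
          f"{len(breaches)} breach(es)")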

The challenge then is this: if you accept the observations above as a basic principle, what are the alternatives?

The obvious move is to realign the IT SLAs to the business outcomes, but in doing so a number of factors such as those below may need to be considered:

  • IT need a change of mindset to align themselves closer to the operational contract. In doing so, there is an inevitable risk that differences in operational culture could create natural obstacles
  • In order to prevent future challenges, IT need to be engaged as early as possible in any client contract negotiations
  • Traditional supplier SLAs may need to be realigned to deliver the outcome-based approach
  • The concept of a fix time for P1s has to be removed, as the measure is no longer time-bound but is measured by the success of "outcomes"
  • Systems may need to be designed with a greater level of resilience / availability, considering true high availability during change activities, to support business outcomes
  • Service processes such as Major Incident Management and service reporting will need to be realigned

Whilst the SLA may not quite have had its day, in order to keep up with increasing demands I would suggest that the traditional service measurements need to be revisited and replaced with an outcome-based expectation.