Nowadays, with more and more companies moving to externalized cloud services, SLAs have become more and more important. If we follow the ITIL guidance, SLAs are, among other things, a way of transferring risk to the provider, so that you are free of it. Is that true? Is this the way current SLAs are written?
In this post I will cover the typical SLA and comment on what it should be.
The Objective of an SLA
An SLA is a Service Level Agreement (if you don’t know it by now, think about what you are doing!). That is, a Service Level definition with a particular emphasis on the prices and penalties tied to the Service Level finally provided. What are the objectives of an SLA?
- A statement of the expectations of both customer and provider, so that both understand each other and a point of agreement is reached.
- An incentive for the provider, so that it will try to fulfill the agreement.
- A risk transfer, so that the losses derived from bad service are covered.
In fact, when you select a provider, you trust its sense of responsibility and its way of working. That is, you trust that it will try to fulfill the agreement whether or not an SLA is involved. So the second objective is the least important of all.
The current SLA philosophy
I can state that no CIO plans to contract an externalized service without signing an SLA. But what is an SLA? The same question again, but a different answer. An SLA is a way to make the CIO think that he’s in control. It lets him sleep at night, thinking that the provider is under his dominion. But an SLA is really a pre-agreement for the case of a loss of service level. That is, it defines the compensation for this loss.
Nowadays the cost of a service is normally a small part of the benefit that it provides to the customer. Or, seen from another point of view, it is only a small part of the cost of an outage or a temporary lack of service. Think about it: what is the monthly cost of an externalized messaging service? What is the cost of losing one day of this service? It obviously depends on the organization, but normally the cost of one day without email is much greater than the cost of one month of the service.
So, returning to the first paragraph of this section, an SLA is a reduced compensation in case of a loss of service. It won’t compensate the customer’s losses.
For instance, I remember contracting a hosting+admin service some years ago. The cost of the contracted service was less than 1% of the total cost of the service that was hosted, a booking web site that we provided to another customer. I asked the hosting service for an aggressive SLA, with a 50% penalty for an outage of more than 4 hours. The provider agreed (after a small increase in the cost) and I was very happy, since I thought that the risk of an outage was covered. We had an agreement with our customers that a lack of service implied a reduction in the monthly payment, proportional to double the outage duration divided by the length of the month. That means that a 3-day outage in a 30-day month gives a 20% reduction of the cost. I implicitly thought that I was completely covered: 4 hours / 50%, compared with 1 day / 6%. I discovered how wrong I was when, one Sunday morning, the RAID controller crashed. The emergency team of our admin service went on site, only to discover that they had run out of replacement stock for this RAID controller. So the service was down until Monday. Then the nightmare started:
- First, when I claimed the reduction of the payment for the next month, the provider argued that the lack of a replacement part was a justified cause and that the outage couldn’t be considered an SLA breach. They finally accepted, but only after many hours of telephone conversations.
- Second, the 50% reduction amounted to less than a quarter of the 6% reduction given to the customer.
- Third, our reputation suffered. A Sunday of downtime in high season is simply not acceptable.
- Fourth, the customer claimed that the 6% reduction couldn’t be compared to the revenue the system generates on a Sunday. So the customer felt cheated.
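To see the numbers of this story side by side, here is a minimal Python sketch of the two penalty rules. The percentages (50% after 4 hours, double the outage share of the month, hosting fee of at most 1%) come from the story; the absolute amounts, the function names and the cap at 100% are my own assumptions for illustration only.

```python
def customer_rebate_fraction(outage_days, days_in_month=30):
    """Share of the monthly payment returned to our customer:
    double the outage duration divided by the month length.
    The cap at 100% is an assumption, not part of the story."""
    return min(1.0, 2 * outage_days / days_in_month)


def provider_penalty_fraction(outage_hours, threshold_hours=4, penalty=0.5):
    """Share of the hosting fee returned by the provider:
    a flat penalty once the outage exceeds the threshold."""
    return penalty if outage_hours > threshold_hours else 0.0


# Hypothetical amounts: only the 1% ratio between them is taken from the story
# (and there it is an upper bound, "less than 1%").
customer_monthly_payment = 100_000.0
hosting_monthly_fee = 0.01 * customer_monthly_payment

outage_days = 1  # Sunday morning until Monday, roughly one day

rebate = customer_rebate_fraction(outage_days) * customer_monthly_payment
penalty = provider_penalty_fraction(outage_days * 24) * hosting_monthly_fee
coverage = penalty / rebate  # part of our loss that the SLA gives back

print(f"Rebate paid to the customer: {rebate:.2f}")
print(f"Penalty received from the provider: {penalty:.2f}")
print(f"Coverage: {coverage:.1%}")
```

With these assumed amounts the penalty received covers well under a quarter of the rebate paid, and that is before counting the lost reputation, which no formula returns.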
So, returning to the main topic, the standard SLA doesn’t cover your losses, and so the risk transfer is not achieved. That’s the rationale behind the last part of the joke “Five Managers Went to a Safari and…”, where the Service Level Manager offered to pay for a coffee as compensation for the possibility of starving to death in the middle of the desert.
What an SLA should be
We have talked about what an SLA is, and what it isn’t. But what should it be? It should be a risk transfer. The provider must be committed to the level required to fulfill the agreement. And that means that your expected losses in an outage must be the provider’s expected losses. This is the only way the provider can understand how important the service is.
Of course, if a company accepts this level of penalty, the cost of the service will increase, because the provider will have to raise the reliability and availability of its service. But that is exactly what we want.
OK, that’s the way it should be, but… let’s be realistic: it’s difficult (not to say impossible) to find a provider that accepts this kind of agreement.
What should you do
You must assess the risk of your service under the current SLA. Don’t forget that the SLA is not a risk container; it is just a means of putting pressure on the provider. So you must consider alternative security controls to guarantee the service level and business continuity in case of disaster. These could include redundant contracts, audits of the provided service, or taking out insurance.
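As a rough way to start that assessment, the sketch below estimates the yearly risk that stays with you after the SLA compensation. It is only an illustration: the function name, the simple expected-value model and all the figures are assumptions of mine, and a real assessment would also have to weigh reputation and the chance that the provider disputes the breach.

```python
def residual_risk(outages_per_year, loss_per_outage, sla_compensation_per_outage):
    """Expected yearly loss that the SLA does not give back.

    Assumes every outage costs the same and that the compensation
    can never exceed the actual loss (both are simplifications).
    """
    expected_loss = outages_per_year * loss_per_outage
    expected_compensation = outages_per_year * min(
        sla_compensation_per_outage, loss_per_outage
    )
    return expected_loss - expected_compensation


# Hypothetical figures: one outage every two years, 10,000 lost per outage,
# 500 returned by the provider each time.
uncovered = residual_risk(0.5, 10_000, 500)
print(f"Risk not covered by the SLA, per year: {uncovered:.2f}")
```

Whatever this number turns out to be for your service, it is the budget against which the alternative controls should be compared.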