Among the challenges facing cloud service providers is “up time”. Cloud services must compete with traditional on-premises services if they want to be seriously considered by enterprise and business customers. Up time, along with “security”, is one of the two primary challenges faced by cloud service providers today. Although everyone has experienced their own organization’s network, or elements of it, being down from time to time, the outrage is confined to that individual organization and its employees. However, when “cloud providers” like Microsoft, Google, Amazon or IBM have outages or service degradation, the impact is far wider and therefore much more critical.
These concerns are the primary reasons many information technology professionals give today for avoiding cloud services at the enterprise level. However, I suggest that these occurrences, although they do happen from time to time, are less frequent than those of many on-premises services. The difference is that these large outages are publicized on a much larger scale because of the larger customer impact. Do not misunderstand me: “down time” should not be easily accepted, and when it occurs customers should hold their cloud providers accountable. In many instances the cloud providers offer refunds or credit for any downtime experienced. Just as important, cloud providers need to continue improving their outreach to customers with meaningful information during service issues.
My local organization has been in Microsoft’s “cloud” for email services since 2011, and total down time in the 3 years since then has been less than a dozen hours. For the majority of those hours, only one element of the email service was affected. For example, email would not flow to the Outlook client but continued working normally through OWA (Outlook Web Access) and mobile devices. Or email would not flow to mobile devices but arrived normally elsewhere. These types of “outages” are really “service degradations”, so their impact is limited. Also, during these service issues the organization’s information technology staff is not tied up trying to determine “what the heck is wrong with the service”.
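To put “less than a dozen hours in 3 years” in the terms providers usually quote, here is a rough sketch of the arithmetic. It assumes 365-day years and treats the full 12 hours as a complete outage, which overstates the real impact; the function name and constants are my own for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignore leap days


def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Return the share of the period the service was up, as a percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return (1 - downtime_hours / period_hours) * 100


# Three years of service with, at worst, 12 hours of down time.
period = 3 * HOURS_PER_YEAR
worst_case = availability_percent(12, period)
print(f"Availability over 3 years: at least {worst_case:.3f}%")
```

By this estimate the service was available roughly 99.95% of the time, and since most of those hours affected only one element of the service, the effective figure was better than that.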
All of this leads me to this week’s Microsoft Azure outage.
Earlier this week there were reports of partial performance degradation incidents with Traffic Manager and service interruptions across multiple regions for Azure services. Microsoft Azure quickly acknowledged the problems on its Twitter account at 6:30 pm on Tuesday. The Twitter account reported the outage was resolved by 10:56 pm, yet its status page continued to report problems for hours afterward.
Thousands of sites using Azure as a web host were down for hours, including Microsoft’s own msn.com and Windows Store. There was also a storage outage in Western Europe.
Azure had other outages earlier this year as well, including several in August that coincided with the release of new Office 365 features. Azure also experienced an outage while Microsoft was promoting the online gaming features of the Xbox One, which launched in November of last year.
It’s important for cloud providers hoping to survive in the increasingly competitive cloud space to compete on service. Outages may be inevitable, but they certainly don’t help a provider’s case with customers looking for the most reliable service.
As I look towards moving more of our organization’s services into the cloud, I will certainly be asked “is this safe or wise”. I would suggest that outages like the one Azure experienced this week certainly do not help. I would also suggest that these are very complex systems and that Azure itself has only been publicly available for 4 years. It can therefore be anticipated that Microsoft’s service level with respect to performance will continue to improve. Even this most recent outage, although it impacted many customers, was identified and resolved within several hours, though the investigation into its root cause continues. (It is believed that an Azure update was the culprit.)
Considering the complexity of networking services and all that we have come to expect from technology in general, service-related issues will continue to occur from time to time. To think otherwise is foolish. Understanding this, organizations should work to find the right balance of “cloud” and “on-premises” services.