‘We understand how incredibly impactful and unacceptable this is and apologize deeply,’ Microsoft says in a post-incident review report on the outage.
Microsoft on Tuesday apologized for a worldwide outage that impacted Azure cloud services including Microsoft Teams, Office 365 and Dynamics 365.
“We understand how incredibly impactful and unacceptable this is and apologize deeply,” Microsoft said in a post-incident review report on the outage, which was the result of “authentication errors” across multiple Microsoft cloud services. “We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future.”
Microsoft referred in the report to changes made after a Sept. 28, 2020, outage that impacted Microsoft 365 users for five hours.
“In the September incident, we indicated our plans to apply additional protections to the Azure AD (Active Directory) service backend SDP (Safe Deployment Process) system to prevent the class of issues identified here.”
Microsoft said the first phase of SDP changes is finished, and the second phase is in a “very carefully staged deployment” that will finish mid-year.
“The initial analysis does indicate that once that is fully deployed, it will prevent the type of outage that happened today, as well as the related incident in September 2020,” Microsoft said. “In the meantime, additional safeguards have been added to our key removal process, which will remain until the second phase of the SDP deployment is completed.”
Microsoft on Tuesday morning said the “majority of services” impacted by the worldwide Azure and Teams outage were back online, except for Intune and Microsoft Managed Desktop.
The latest update on the outage came in a 6:34 a.m. tweet from the Microsoft 365 status account.
The Microsoft apology came after a global outage Monday affecting the Teams collaboration app, as well as “multiple” other Azure, Office 365 and Dynamics 365 services.
The issues -- disclosed by Microsoft on Twitter starting at 3:40 p.m. Eastern Time on Monday -- could be affecting any user “worldwide,” the company said at that time.
Even with the outage, some industry executives are calling on MSPs to move customers more quickly to the cloud in the wake of the March 2 on-premise Exchange Server attack by Chinese state-sponsored hackers.
That attack affected only on-premise versions of Exchange Server and not Exchange Online or the cloud-based Office 365 email service. Some 30,000 U.S. organizations and 60,000 organizations globally have had emails stolen as a result of the breach, since they were still running on-premise versions of Exchange.
Last week, Microsoft alerted customers to DearCry ransomware breaches as a result of the Exchange on-premise server attack. On March 12, it warned that “human operated ransomware attacks are utilizing the Microsoft Exchange vulnerabilities to exploit customers.”
Emmet Tydings, president of Columbia, Md.-based AB&T Telecom, which provides internet voice and data and failover stability for MSPs, said it is critical that partners move customers to the cloud to avoid serious security issues like those that came with the Chinese attack on Exchange on-premise servers.
“MSPs need to move their customers quicker to the cloud, and they also need to stabilize their communications infrastructure with diversity in their circuits and failover,” Tydings said. “Microsoft has emphasized that they are better able to provide security in the cloud than with on-premise Exchange.”
Tydings said partners need to provide robust internet connectivity with SD-WAN and wireless failover with carrier plans via a SIM module and a cable backup to a primary fiber line.
In the case of an outage of a service like Microsoft Teams, MSPs should fall back on alternative communications platforms such as Zoom or Cisco Webex, he said.
With the global pandemic leading to more distributed workforces, on-premise Exchange no longer makes sense for customers, according to Tydings.
“The MSPs we work with have been heroes for converting their clients from on-prem to cloud since the pandemic hit,” he said.
The rapid migration to the cloud has led companies to invest in making software products faster, but they’re not investing in making cloud services more resilient, said Ofer Smadari, co-founder and CEO of Portland, Ore.-based StackPulse, whose reliability platform helps teams detect, respond to and remediate incidents with code-based automation.
“We see the results in the headlines every week, it seems, as major brands have site outages,” Smadari said. “Most companies are still using traditional IT tools like ticketing systems, service management tools or communication apps to share information and collaborate to restore service. Companies need to change from an IT management mindset to an engineering mindset where they build resilience into their applications and their business operations to take a more risk-aware approach. Only then can they recover quickly from outages and deliver on their promise to their customers.”
The incident brought to the forefront how an outage of a SaaS-based offering can impact users, according to Michael Fraser, CEO of Refactr, a Seattle-based DevSecOps startup.
“We could not administer M365 or Azure AD during the outage and had intermittent service to … various M365 services, including Teams, Yammer and Exchange Online,” Fraser said. “And for services like M365, there is not much you can do to plan for this type of outage other than having backups of your data.”