Recovery time from overloading exceeded service goals.
Microsoft has published a root cause analysis of an outage of its Azure Domain Name System that struck the cloud platform over Easter, causing intermittent failures for customers accessing and managing their Microsoft services globally.
The problems started at around 8.30 am on April 2, when the Azure DNS servers received an anomalous surge in queries for an unspecified set of domains hosted on Microsoft's cloud.
Microsoft said it was ready for such surges, with layers of caches and traffic shaping to mitigate the effect, but a bug in its DNS service made the overloading worse.
"In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches," the company said.
"As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service.
"Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems."
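The feedback loop Microsoft describes, in which failed lookups trigger retries that add to the overload, can be illustrated with a toy model. The sketch below is not Microsoft's system or its figures. The function name, the parameters and the assumption that every failed query is retried immediately up to a fixed limit are all hypothetical, chosen only to show how retries can multiply the load on a server that is already over capacity.

```python
def offered_load(new_qps, capacity_qps, max_retries, iterations=1000):
    """Estimate the steady-state queries per second reaching a server.

    Toy model (an assumption for illustration): the server answers at
    most capacity_qps, and each unanswered query is retried at once,
    up to max_retries times.
    """
    if new_qps <= 0:
        return 0.0
    load = float(new_qps)
    for _ in range(iterations):
        # Fraction of queries that go unanswered at the current load.
        fail = max(0.0, 1.0 - capacity_qps / load)
        # Average attempts per original query: 1 + fail + fail^2 + ...
        attempts = sum(fail ** k for k in range(max_retries + 1))
        new_load = new_qps * attempts
        converged = abs(new_load - load) < 1e-9
        load = new_load
        if converged:
            break
    return load


if __name__ == "__main__":
    # Under capacity: retries never trigger, load equals demand.
    print(offered_load(100, 200, max_retries=3))
    # Slightly over capacity: retries push the load well above demand.
    print(offered_load(120, 100, max_retries=3))
```

In this model a server running below capacity sees only the original demand, while one running even modestly above capacity sees a load inflated by retries, which is why client-side limits such as capped retries or backoff are commonly used to dampen the effect.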
Multiple Microsoft services, including Azure, Office, Microsoft 365, Dynamics and Xbox Live, were impacted.
Some customers reported being unable to access the Azure service status web page, but it's not clear if that issue was related to the DNS outage.
Microsoft apologised for the impact caused by the outage and said it would repair the code defect so that all DNS requests can be effectively handled in cache.
At the same time, the company said the recovery time from the outage exceeded its design goals.
The Easter outage came just over two weeks after a wrongly removed digital key locked Microsoft customers out of their applications, causing access issues for 12 hours.