Microsoft’s preliminary analysis of an incident that took out its Australia East cloud region last week – and which appears also to have caused trouble for Oracle – attributes the incident in part to insufficient staff numbers on site, slowing recovery efforts.
The software colossus has blamed the incident on “a utility power sag [that] tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones.”
Microsoft is known to operate some cloud infrastructure in parts of Sydney, Australia, that experienced power outages after an electrical storm last week. The “power sag” explanation is therefore consistent with wider events.
The analysis document explains that the two data halls impacted by the sag had seven chillers – five in operation and two on standby. Once the sag struck, Microsoft’s staff executed Emergency Operational Procedures (EOPs) to bring them back online. But that didn’t work “because the corresponding pumps did not get the run signal from the chillers.”
That’s not what is supposed to happen, and Microsoft says it is talking to its suppliers to understand why it did.
Backup chillers didn’t completely live up to their name.
“We had two chillers that were in standby which attempted to restart automatically – one managed to restart and came back online, the other restarted but was tripped offline again within minutes,” Microsoft’s report states.
With just one chiller working in data halls that need five, “thermal loads had to be reduced by shutting down servers.”
Which is when bits of Azure and other Microsoft cloud services started to evaporate.
The report offers a very detailed timeline of events, showing that the on-site team made it onto the datacenter’s roof to inspect the chillers exactly an hour after the power sag, and that the chillers’ manufacturer had boots on the ground two hours and 39 minutes after the incident began.
But the document also notes that Microsoft had just three of its own people on site on the night of the outage, and admits that was too few.
“Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the report states. “We have temporarily increased the team size from three to seven, until the underlying issues are better understood, and appropriate mitigations can be put in place.”
The analysis also suggests the prepared emergency procedures did not include provisions for an incident of this sort.
“Moving forward, we are evaluating ways to ensure that the load profiles of the various chiller subsets can be prioritized so that chiller restarts will be performed for the highest load profiles first,” the document states.
Microsoft also had trouble understanding why its storage infrastructure didn’t come back online.
Storage hardware damaged by the data hall temperatures “required extensive troubleshooting” but Microsoft’s diagnostic tools could not find relevant data because the storage servers were down.
“As a result, our onsite datacenter team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting,” the report states.
Some kit had to be replaced outright, while other components had to be moved to different servers.
Microsoft also admitted “our automation was incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.”
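Microsoft’s report doesn’t describe the automation’s internals, but the failure mode it names is a familiar one: acting on a queued request without checking whether it is still current. A minimal, hypothetical sketch of the kind of freshness check that guards against it (the function names, request shape, and 60-second window are our illustration, not details from Microsoft’s report):

```python
import time

# Assumed freshness window: requests older than this are considered stale
MAX_REQUEST_AGE_SECONDS = 60

def should_apply(request: dict, now: float) -> bool:
    """Return True only if the health-verdict request is still fresh.

    Automation that skips this check can apply stale verdicts and,
    as Microsoft described, mark healthy nodes as unhealthy.
    """
    age = now - request["issued_at"]
    return age <= MAX_REQUEST_AGE_SECONDS

now = time.time()
fresh = {"node": "n1", "verdict": "unhealthy", "issued_at": now - 5}
stale = {"node": "n2", "verdict": "unhealthy", "issued_at": now - 3600}

print(should_apply(fresh, now))  # applied: issued 5 seconds ago
print(should_apply(stale, now))  # rejected: an hour old, likely stale
```

The point of the check is simply that a node’s state an hour ago says little about its state now; a recovery pipeline replaying a backlog of old requests needs to discard them rather than approve them.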
And that’s just the stuff the tech giant was able to discover in its immediate post-incident review, compiled within three days of the incident. The Beast of Redmond publishes full assessments of outages within fourteen days, and The Register awaits that document with interest – as, we imagine, will Azure customers. ®