Allspaw has a talk, How Your Systems Keep Running Day After Day, a 30-minute review of the STELLA report (on GitHub) that could have been titled 'Why Your Engineers Do Not Sleep Well'.
The crux is the correspondence (or lack thereof) between our understanding of systems and their actual workings. How do we avoid these smaragdine problems?
From the report:

Each of the six themes identified in this report could become an avenue for progress on coping with complexity. There is already progress in controlling the costs of coordination. The burgeoning use of "chat ops" and related automation is a 'hot' area and likely to remain so, especially because so many working groups are now geographically distributed. Visualization tools are appearing, especially associated with application and platform monitoring. Interest in non-technical debt gets a boost with every celebrated outage.

Less is happening in the area of blame versus sanctions and making postmortems more effective. These are areas that are difficult to approach, partly because there is no tradition of deliberate process tracing after events. Postmortems are hard to do and consume valuable resources. Many (most?) organizations have difficulty extracting useful learning from after-anomaly investigation and analysis. Management sensitivity to user community perceptions and publicity does not always lead to deep, thoughtful investigation and analysis, and wide sharing of the details of anomalies and their implications is not the norm. Although many organizations claim to be "blame-free", most are, at best, "sanction-free" over a limited range of outcomes.
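Of those themes, coordination cost has the most visible tooling today. Here is a toy sketch of the chat-ops idea, where `post_to_channel`, `run_deploy`, and `handle_command` are all hypothetical stand-ins rather than any real chat API: commands execute in a shared channel, so the channel itself becomes the team's common record of who did what.

```python
# Toy sketch of chat ops: operational commands run in a shared channel,
# so the whole team sees the same actions and the same output.
# post_to_channel and run_deploy are hypothetical stand-ins, not a real API.

def post_to_channel(channel, message):
    print(f"[#{channel}] {message}")  # stand-in for a real chat integration

def run_deploy(service, version):
    return f"{service} deployed at {version}"  # stand-in for real tooling

def handle_command(channel, user, command):
    # Echo the command first, so the action is visible before its result.
    post_to_channel(channel, f"{user} ran: {command}")
    if command.startswith("deploy"):
        _, service, version = command.split()
        post_to_channel(channel, run_deploy(service, version))

handle_command("ops", "alice", "deploy checkout v1.4.2")
```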
Shockingly, Allspaw said that differences in mental models were a good thing. The benefits of different viewpoints only accrue to normative questions; everything else seems a di-worsification. I.e., we can disagree about what we should do, but we should never disagree about the current state.
What should we do then?
Use more SOAP faster
To borrow from the medical world, how fast can we run through an iteration of SOAP? I.e., collect the subjective and objective data, form an assessment, and decide on a plan?
The faster we can iterate through the SOAPs, the lower our Mean Time To Restore (MTTR).
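As a concrete sketch of that loop (every function here is a hypothetical stand-in for your own paging, monitoring, and runbook hooks, not a real API):

```python
# A minimal sketch of an incident-response loop structured as SOAP
# iterations: Subjective, Objective, Assessment, Plan.
from dataclasses import dataclass, field

@dataclass
class Incident:
    resolved: bool = False
    notes: list = field(default_factory=list)

def gather_subjective(incident):
    # S: responder reports -- pages, "checkouts feel slow", user complaints
    return ["latency complaints in the support channel"]

def gather_objective(incident):
    # O: hard data -- metrics, logs, traces from monitoring
    return {"p99_latency_ms": 2400, "error_rate": 0.07}

def assess(subjective, objective):
    # A: reconcile the reports with the data into a current best theory
    return "elevated error rate suggests a bad deploy, not capacity"

def plan(assessment):
    # P: the single next corrective action to try
    return "roll back the most recent deploy and re-measure"

def respond(incident):
    # Each pass through the loop is one SOAP iteration; faster passes
    # mean a lower MTTR.
    while not incident.resolved:
        s = gather_subjective(incident)
        o = gather_objective(incident)
        a = assess(s, o)
        p = plan(a)
        incident.notes.append((s, o, a, p))
        incident.resolved = True  # stand-in: act on the plan, then re-check
```

The point of the structure is that the assessment and plan are recomputed on every pass, so a wrong theory gets discarded on the next iteration instead of hardening into the official story.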
Target fewer humans
Use commodity hardware/software and scale out, so that everything looks the same and is built on boring old well-understood technology. 1
Given that coordination costs quickly dominate as soon as a system is big enough to require more than one person to maintain it, a rule of thumb: the rate of change produced across all systems cannot be more than one person can consume.
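A back-of-the-envelope version of that rule, with illustrative numbers that are assumptions rather than anything from the talk or the report:

```python
# Back-of-the-envelope: how much change can one person absorb?
# Both numbers below are illustrative assumptions.
minutes_per_change = 10      # time to read, understand, and file one change
focused_hours_per_day = 6    # realistic attention budget per maintainer

budget = focused_hours_per_day * 60 / minutes_per_change
print(f"max digestible changes/day across ALL systems: {budget:.0f}")  # -> 36
```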
In terms of Cook's Line of Representation, that rule caps the rate of data transfer across that bright line.
Cook's Line of Representation 2