An example of this phenomenon occurred at Google in 2021. We set and enforce resource quotas for some kinds of internal software running on our infrastructure. To maximize efficiency, we also monitor how much of its quota each software service uses. If a service consistently uses fewer resources than its quota, we automatically reduce the quota. In STPA terms, this quota rightsizer has a control action to reduce a service’s quota. From a safety perspective, we then ask when this action would be unsafe. As one example, if the rightsizer ever reduced a service’s quota below the actual needs of that service, it would be unsafe: the service would be resource-starved. This is what STPA calls an unsafe control action (UCA).
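As a rough illustration of the control action described above, the sketch below lowers a quota only when every recent usage sample sits well below it. The function name `propose_quota`, the `threshold` and `headroom` parameters, and their default values are all assumptions made for this example; the article does not describe the actual rightsizing algorithm.

```python
from typing import Optional, Sequence


def propose_quota(
    quota: float,
    usage_samples: Sequence[float],
    threshold: float = 0.5,  # hypothetical: "consistently low" = every sample under 50% of quota
    headroom: float = 1.2,   # hypothetical: keep 20% above the peak observed usage
) -> Optional[float]:
    """Return a reduced quota, or None if no reduction is warranted."""
    if not usage_samples:
        return None  # no feedback at all: take no action
    peak = max(usage_samples)
    if peak >= threshold * quota:
        return None  # usage is not consistently low
    return peak * headroom
```

Note that the function trusts `usage_samples` completely. If those numbers are wrong, the proposed quota will be wrong too.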
STPA analyzes each interaction in a system to determine comprehensively how the interaction must be controlled in order for the system to be safe. Unsafe control actions lead to the system entering one or more hazard states. There are only four possible types of UCA:
- A required control action is not provided.
- An incorrect or inadequate control action is provided.
- A control action is provided at the wrong time or in the wrong sequence.
- A control action is stopped too soon or applied for too long.
This particular unsafe control action—reducing an assigned quota to be less than what the service requires—is an example of the second type of UCA.
Simply identifying this unsafe control action by itself is only partially useful. If “quota rightsizer reduces the assigned quota under what the service requires” is unsafe, then preventing that behavior is what the system must do, i.e. “quota rightsizer must not reduce the assigned quota under what the service currently requires.” This is a safety requirement. Safety requirements can be very useful for formulating future designs, elaborating testing plans, and helping people understand the system. And let’s be honest—even mature software systems can operate in ways that are undocumented, unclear, and surprising.
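A safety requirement like this one can be written down as an executable check. The following is a minimal sketch, assuming the service's current requirement is available as a number; `UnsafeControlAction` and `enforce_quota_floor` are hypothetical names introduced for illustration, not part of any real quota system.

```python
class UnsafeControlAction(Exception):
    """Raised when a proposed action would violate a safety requirement."""


def enforce_quota_floor(proposed_quota: float, required: float) -> float:
    """Safety requirement: the quota rightsizer must not reduce the assigned
    quota under what the service currently requires."""
    if proposed_quota < required:
        raise UnsafeControlAction(
            f"proposed quota {proposed_quota} is below required {required}"
        )
    return proposed_quota
```

The check is only as trustworthy as the `required` value passed to it, which is itself derived from feedback about the service.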
Nonetheless, what we really want is to anticipate all of the concrete scenarios that lead to a hazard state. Again, STPA has a simple and comprehensive way to structure an analysis to find all of the scenarios that could lead the quota rightsizer to violate this safety requirement.
So in the case of the rightsizer, there are four archetypal scenarios that we can investigate.
- Scenarios in which the rightsizer has incorrect behavior.
- Scenarios in which the rightsizer gets incorrect feedback (or no feedback at all).
- Scenarios in which the quota system never receives an action from the rightsizer (even though the rightsizer tried to send one).
- Scenarios in which the quota system has incorrect behavior.
One specific scenario quickly jumped out to us when analyzing the rightsizer. It gets feedback on the current resource usage from the quota service. As implemented, the calculation of current resource usage is complicated, involving different data collectors and some tricky aggregation logic. What if something went wrong with this complex calculation, resulting in a value that was too low? In short, the rightsizer would react exactly as designed and reliably shrink a service’s quota to the incorrect lower usage level.
Exactly the disaster we wanted to prevent.
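One generic way to guard a feedback path against this kind of scenario is to check a new usage reading against recent history before acting on it. The sketch below is an assumption-laden illustration, not the change Google made: `usage_is_plausible`, `safe_new_quota`, and the `max_drop` default are hypothetical.

```python
from statistics import median
from typing import Sequence


def usage_is_plausible(
    current: float,
    history: Sequence[float],
    max_drop: float = 0.5,  # hypothetical: distrust drops of more than 50% vs. the recent median
) -> bool:
    """Return False when the reported usage looks suspiciously low."""
    if current < 0 or not history:
        return False  # hypothetical policy: no baseline means no trust
    baseline = median(history)
    return current >= (1.0 - max_drop) * baseline


def safe_new_quota(
    quota: float, proposed: float, current_usage: float, history: Sequence[float]
) -> float:
    """Keep the existing quota if a reduction rests on implausible feedback."""
    if proposed < quota and not usage_is_plausible(current_usage, history):
        return quota
    return proposed
```

For example, with a recent median usage of 100, a reported usage of 5 is treated as suspect and the quota is left unchanged, while a reported usage of 90 lets the reduction proceed.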
Up to this point, lots of attention had been paid to getting the quota adjustment algorithm right and reliably producing the correct outputs, namely, the action to adjust a service’s quota. However, the feedback path—including the service’s current resource usage—had been less well understood.
This highlights a major advantage of STPA—by looking at the system level and by modeling the system in terms of control-feedback loops, we find issues both in the control path and the feedback path. As we run STPA on more and more systems, we see that the feedback path is often less well understood than the control path, but just as important from a system safety perspective.
As we dug into the feedback paths for the rightsizer, we saw many opportunities to improve them. None of these changes looked like a traditional reliability solution; they didn’t boil down to managing the rightsizer with a different SLO and error budget. Instead, the solutions showed up in other parts of the system and involved redesigning parts of the stack that had previously appeared to be unrelated, which is again an advantage of STPA’s system theory approach.
In the 2021 incident, incorrect feedback about the resources used by a critical service in Google’s infrastructure was sent to the rightsizer. The rightsizer calculated a new quota, allocating far fewer resources than the service was actually using. As a precautionary measure, this quota reduction was not immediately applied, but was held for several weeks to give time for someone to intervene in case the quota was wrong.
Of course, major incidents are never simple events. The next problem was that, although the delay had been added as a safety feature, feedback about the pending change was never sent to anyone. The entire system was in a hazard state for weeks, but because we weren’t looking for it, we missed our chance to prevent the loss that followed. After several weeks, the quota reduction was applied, resulting in a significant outage. Using STPA, we have anticipated problems just like this one in many different systems across Google.
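The missing feedback in this incident suggests a simple design rule: a held change should not become eligible to apply unless someone has actually been told about it. The sketch below illustrates that rule under stated assumptions; the `PendingReduction` class, its fields, and the `send` callback are hypothetical and do not describe Google's systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class PendingReduction:
    service: str
    new_quota: float
    created: datetime
    hold: timedelta  # hypothetical hold period before the change applies
    notified: List[str] = field(default_factory=list)

    def notify(self, owner: str, send: Callable[[str, str], bool]) -> None:
        """Tell an owner about the pending change; record only confirmed sends."""
        due = self.created + self.hold
        message = (
            f"Quota for {self.service} will drop to {self.new_quota} on {due:%Y-%m-%d}"
        )
        if send(owner, message):
            self.notified.append(owner)

    def may_apply(self, now: datetime) -> bool:
        """The hold must have elapsed AND at least one owner must have been notified."""
        return bool(self.notified) and now >= self.created + self.hold
```

With this structure, a delay that nobody hears about never silently expires into an applied change; the reduction stays blocked until the notification path has demonstrably worked.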
As Leveson writes in Engineering a Safer World: “In [STAMP], understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints.” This shift in perspective – from trying to prove the absence of problems to effectively managing known and potential hazards – is a key principle in our system safety approach.