Contextualizing Reliability
Spend a few minutes in the SRE community and you will surely hear: “reliability is the most important feature.” From the Google SRE Workbook:
The argument is sound and, as stated, “people don’t normally disagree.” However, what does this look like in practice?
This morning I spent some time contextualizing this argument with numbers.
Setting the Scene
Consider a typical web service. This service has users and we’ll make the following assumptions:
- we aim to serve 100% of users
- users are aware of and have access to the site
- while users have their own reliability issues (e.g. network outages, slow connections, etc.), we’ll ignore those for now
Basic Definitions
Intent: the user has the intention, or desire, to use the site.
Able: the user, when navigating to the site, has the ability to accomplish their goals and extract value from it. Here “able” will function as a synonym for “reliable.”
The Interplay of Intent and Ability
As it stands, we have the ideal, 100%, and then reality, the combination of those with intent and the ability to use the site: 100% of those aware and with access can use the site, X% of those have the intent to use the site, and Y% of those are able to use the site.
Mathematically this looks like this:
ACCESS: 100%
INTENT: X% = ACCESS - THOSE_WITHOUT_INTENT
ABLE: Y% = X% - THOSE_WITHOUT_ABILITY
Visually:
Which is based on the following truth table:
Considering the extreme cases, lack of intent or lack of ability, we get:
No ability: if 0% of those with intent have the ability, 0% will use the site. Thus, even if 100% of users have the intent (i.e. design, marketing, engineering, etc. have done their jobs exceptionally well), no one can use the site.
No intent: if 0% of users have the intent to use the site, 0% will use it. Thus, even if the site is completely reliable, if no one wants to use it, no one will.
In reality we have some combination of the two.
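To put numbers on that combination, here is a minimal Python sketch. It assumes intent is expressed as a fraction of users with access and ability as a fraction of those users with intent, so the share that actually gets value from the site (Y above) is their product. The function name `served_share` and the sample figures are made up for illustration.

```python
def served_share(intent: float, ability: float) -> float:
    """Share of all users with access who want to use the site and can.

    intent:  fraction of users with access who intend to use the site (X).
    ability: fraction of those users with intent who are able to use it.
    """
    for name, value in (("intent", intent), ("ability", ability)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return intent * ability


# The two extreme cases, plus one illustrative mix (numbers are invented).
print(served_share(intent=1.0, ability=0.0))   # no ability: nobody is served
print(served_share(intent=0.0, ability=1.0))   # no intent: nobody is served
print(served_share(intent=0.4, ability=0.99))  # somewhere in between
```

Either factor at zero wipes out the other, which is the behaviour the two extreme cases describe.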
The Don’t Leave Any Money on the Table Principle
At the end of the day, we live in a capitalist society. Most web services provide some value to the user in exchange for time and/or money (which then, in turn, pays your salary). So, business leaders want to convert as much of their addressable market as possible into customers. In business lingo, it’s something like this:
However, converting 100% of users with access to 100% with access and intent is a challenging (and maybe impossible?) endeavor. It would involve turning the following triangle into a square? 🤔
As a site reliability engineer, designing an effective user funnel is outside my wheelhouse.
However, I do believe the following.
If we’ve invested the time and money to achieve a decent conversion rate, we are, up to a certain point, wasting resources if we leave any users with intent unable to access our site.
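As a rough illustration of that belief, the sketch below estimates the value lost when users with intent are turned away. Everything in it is an assumption for the sake of the example: the function name `money_on_the_table`, the idea of a flat `value_per_user`, and the sample numbers.

```python
def money_on_the_table(users: int, intent: float, ability: float,
                       value_per_user: float) -> float:
    """Value lost because users with intent could not use the site.

    users:          users who are aware of and have access to the site.
    intent:         fraction of those users who intend to use the site.
    ability:        fraction of users with intent who are able to use it.
    value_per_user: hypothetical flat value of one served user.
    """
    users_with_intent = users * intent
    users_turned_away = users_with_intent * (1.0 - ability)
    return users_turned_away * value_per_user


# Invented numbers: 100,000 users, 25% intent, 99% ability, 20 per user.
lost = money_on_the_table(users=100_000, intent=0.25, ability=0.99,
                          value_per_user=20.0)
print(f"Left on the table: {lost:,.2f}")
```

The conversion work that produced the 25% is already paid for; the loss here comes only from the gap between intent and ability.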
No site is perfectly reliable. In fact, you don’t want it to be (each extra nine of reliability is exponentially more expensive). However, there is a threshold level of service that your users expect, which, if not met, results in missed business. Defining that threshold (service level objective) based on a decent proxy of the user experience (service level indicator) is not easy (heck… there is a whole book about this stuff). However, as hard as it is, site reliability engineering is a science. There are principles, best practices, patterns, and existing precedents (across industry domains, medicine, bridges, airplanes) we can follow.
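To get a feel for what each extra nine means, the sketch below converts an availability target into a yearly downtime budget. It assumes a 365-day year and a simple time-based availability measure; `allowed_downtime_minutes` is a name chosen for this example. It shows only how quickly the budget shrinks and does not model what each nine costs.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability target with `nines` nines."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for nines in range(1, 6):
    target = 100 * (1 - 10 ** -nines)
    budget = allowed_downtime_minutes(nines)
    print(f"{target:.3f}% -> {budget:,.1f} minutes of downtime per year")
```

Each additional nine cuts the budget by a factor of ten, from roughly 3.65 days per year at 99% to a little over five minutes at 99.999%.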
Conclusion
Based on the above, any discrepancy between the level of user intent and the level of user ability is, in my opinion, money on the table. So, circling back to our original point of investigation, reliability is the most important feature: without it, the rest of your organization’s work loses its effectiveness and momentum and, in the worst case, is completely wasted.