Imagine your team has just pushed a hot fix for a problem. Once the first 24hrs has passed do you relax? How many days of not seeing the problem do you conclude the fix worked? Let’s start by discussing coin tosses.
Assuming you have a fair coin, that is just as likely to land heads as tails, the probability of getting a heads on a single flip is 50%. Now two questions. First, what is the chance of getting two heads in a row? Second, what is the chance of the next flip being heads? While these two questions seem similar they are very different. To answer the first question, two heads in a row, we can look at all the possible combinations of two coin tosses:
(H,H) (H,T) (T,H) (T,T)
With four possible outcomes one of which is our two heads (H,H), we can easily compute that we have a 25% chance of getting two heads in a row. Another way of computing this is by multiplying the probability of getting a head on each coin toss. Because each toss is independent the likelihood of getting a head is 50% for each. We therefore have 50% * 50% = 25%.
This gets us to the second question, what is the probability that the next flip is a heads. As mentioned above, and contrary to what gamblers and sportscaster often believe, each flip is independent there is no “law of averages” that would indicate heads or tails is more likely. The first flip and the second flip and the third and each subsequent flip are independent of each other. However, we do expect that as the number of flips gets large the porportion of heads and tails approaches 50%. Another way to look at this is to go back to our diagram above. If our first flip was a heads which of the scenarios could exist for our second flip? The answer is the scenarios of (H,H) (H,T) because they both have ‘H’ as their first flip. Therefore the chance that the second flip is a tails is 1 out of 2 or 50%.
Back to our hot fix scenario. Let’s ignore the probability that our fix actually solved the problem and just focus on the likelihood of a problem occurring each day this week. We can get into prior probabilities in a later post. As we discussed above, independent events maintain their probability for each event so the probability of the problem occuring today is 50%, that it occurs tomorrow is 50%, and so on. A different question is, what is the probability that the problem will not occur three days in a row? Let’s first look at this visually using N = No Problem and P = Problem.
(N, N, N) (N, N, P) (N, P, N) (P, N, N) (N, P, P) (P, N, P) (P, P, N) (P, P, P)
From the figure above we can see that there is only 1 out of 8 scenarios where we have No Problem three days in a row. This equates to 1/8 = 0.125 or 12.5%. We can also solve this as we did above by multiplying our probability each day 50% * 50% * 50% = 12.5%.
One final note about independent failure events. We often tell clients to avoid synchronous calls because if one service fails (because of hardware or software) it causes others to fail. If you have 99% uptime on one service and 99% uptime on another but they are both required to service a request, the total system availability is 99% * 99% = 98% unless of course they happen to fail at the exact same time every time. This is what we call the multiplicative effect of failure.