Adapted from an internal p2 post.
Fail-fast is a systems design approach that intentionally fails early and visibly. The gist is that it is better for both users and developers to halt a system that finds itself in a “critical enough” situation.
Wait, What, Why?
There are worse things than a crash.
Mike Stall (via Jeff Atwood) sketched an “app health hierarchy”, ordered from best to worst, that looks like this:
1) Application works as expected and never crashes.
2) Application crashes due to rare bugs that nobody notices or cares about.
3) Application crashes due to a commonly encountered bug.
4) Application deadlocks and stops responding due to a common bug.
5) Application crashes long after the original bug.
6) Application causes data loss and/or corruption.
Complex systems can reduce their exposure to the darkest parts of that hierarchy by failing fast. Users are immediately presented with the truth rather than discovering sometime later that their actions didn’t take. Developers get immediate feedback in critical scenarios that points directly toward the root cause.
And, perhaps most importantly, halting in the right places prevents ongoing corruption, which is what keeps me up at night as a software developer.
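To make that concrete, here is a minimal Python sketch of halting before corruption can spread. Everything in it is hypothetical and for illustration only: `store_order`, `InvariantViolation`, and the order layout are assumptions, not part of any real system. The function checks an invariant before it writes and raises the moment the invariant is broken, rather than saving a record that someone has to untangle later.

```python
class InvariantViolation(Exception):
    """Hypothetical error type for states we know of no way to recover from."""


def store_order(orders: list, order: dict) -> None:
    """Persist an order, halting before the write if its totals disagree."""
    expected = sum(item["price_cents"] * item["qty"] for item in order["items"])
    if order["total_cents"] != expected:
        # Writing this record would spread an upstream bug into stored data,
        # so stop here, where the failure still points at its cause.
        raise InvariantViolation(
            f"order total {order['total_cents']} != line items {expected}"
        )
    orders.append(order)


orders = []  # stand-in for a real data store (assumption)
store_order(orders, {"items": [{"price_cents": 250, "qty": 2}], "total_cents": 500})

try:
    store_order(orders, {"items": [{"price_cents": 250, "qty": 2}], "total_cents": 400})
except InvariantViolation as err:
    print(f"halted before writing: {err}")
```

The bad order never reaches `orders`, and the error surfaces at the point where the totals first disagree instead of long after the original bug.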
We can’t just BSOD our users!
Yes and no.
Yes, we can, if the situation is that critical and we know of no way to recover.
No, we can’t, on anything less than that.
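One way to encode that yes-and-no split, sketched in Python under assumptions: the two exception classes, `load_avatar`, `render_avatar`, and the default image path are all hypothetical names made up for this example. Anything less than critical is handled and the user carries on; the critical case is deliberately left uncaught so it halts.

```python
class RecoverableError(Exception):
    """Hypothetical: something went wrong, but we know how to carry on."""


class CriticalError(Exception):
    """Hypothetical: state can no longer be trusted and we cannot recover."""


def load_avatar(user: dict) -> str:
    """Return the user's avatar URL, raising by severity when it cannot."""
    if "id" not in user:
        # A user record without an id means something upstream is badly broken.
        raise CriticalError("user record has no id")
    if not user.get("avatar_url"):
        raise RecoverableError("no avatar set")
    return user["avatar_url"]


def render_avatar(user: dict) -> str:
    """Degrade gracefully on recoverable errors; let critical ones propagate."""
    try:
        return load_avatar(user)
    except RecoverableError:
        # Less than critical: fall back to a default and keep going.
        return "/static/default-avatar.png"
    # CriticalError is intentionally not caught here, so it stops the caller.
```

Where exactly the line between the two classes falls is the judgment call; the code only makes the decision explicit and visible in one place.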
Fail-fast requires (team) wisdom in implementation. Each situation must be considered in its own context. Tricky, at times, but in my experience the effort is very much worth it.