S36
On Wednesday night I got home to my roommate asking me if I had heard the explosion. “What explosion?” I asked. I quickly learned that Ship 36 had been lost; he had heard the explosion from our apartment 20 miles away.
I didn’t work on S36 directly, but I was still quite crestfallen about it. We’ve had 4 failures in a row now, and while I believe that each one has taught, or is still teaching, us a lot about how to build and fly this vehicle, I’ve talked to a lot of people who see this as a manifestation of incompetence rather than as part of the process. Frankly, I don’t think these people fully understand risk management, or how a hardware-rich approach to engineering risk management works, so I figured I’d lay my thoughts on the matter out here.
The poker pro Annie Duke wrote a book called Thinking in Bets that captures the way I have thought about this failure pretty well. In it, Duke argues that we should evaluate the bets we make according to the spread of possible outcomes, rather than by which one happens to be realized. Yes, blowing up a ship and a lot of the associated test infrastructure was a black-swan, worst-case outcome for that test. But rather than assessing the test based on the fact that this happened, the better question to ask is “was it a good idea to run this test in the first place?” The answer to that question says far more about the competence of the people who organized the test than jumping to conclusions and judging it solely by the outcome it realized.
From my point of view, here is the decision tree for running the test vs. not:
Option 1: Run the test
• Worst case: rocket blows up and damages test infrastructure, but you get data to learn how to make it better next time
• Best case: rocket works perfectly, and you proceed into launch with confidence
Option 2: Don’t run the test
• Worst case: rocket fails on the pad while the world is watching, destroying both the vehicle and the pad and halting launches for the foreseeable future
• Best case: launch goes fine, but you accepted a huge, uncharacterized risk to get there
Thinking about things this way, running the test was clearly the correct choice, even though the worst-case outcome of option 1 was the one realized. If you make decisions this way, and then work to mitigate the downsides, you will eventually end up with a system that is robust and reliable.
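To make that concrete, here’s a toy expected-cost sketch in Python. Every probability and cost figure below is invented purely for illustration (none of them are real numbers); the point is only that the quality of the bet depends on the whole spread of outcomes, weighted by how likely they are, not on the single outcome you happened to draw.

    # Toy expected-cost comparison of "run the test" vs. "skip the test".
    # All probabilities and costs are made-up illustrative numbers.

    # Each option is a list of (probability, cost) outcomes, costs in arbitrary units.
    run_test = [
        (0.10, 150),  # worst case: vehicle and test stand lost, but the flaw is found on the ground
        (0.90, 5),    # nominal: test passes, launch proceeds with confidence
    ]
    skip_test = [
        (0.10, 400),  # worst case: the same flaw destroys the vehicle and pad on launch day
        (0.90, 0),    # nominal: launch goes fine despite the unexamined risk
    ]

    def expected_cost(outcomes):
        """Sum each outcome's cost weighted by its probability."""
        return sum(p * cost for p, cost in outcomes)

    print(f"run test:  {expected_cost(run_test):.1f}")   # 0.1*150 + 0.9*5 = 19.5
    print(f"skip test: {expected_cost(skip_test):.1f}")  # 0.1*400 + 0.9*0 = 40.0

Under these made-up numbers, running the test is the better bet by roughly a factor of two, even though the 10% branch is the one that actually came up. That’s the whole argument in miniature.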
Failure sucks, but ultimately it will lead to a better design and qualification procedure. The right thing to do (in any domain where you are able to try again) is to expose your hardware to its operational environment, or something close to it, early and often. This is the only way to learn where the unknown risks in your design are, which are almost inevitably the ones that bite you.