Kind readers of this site have pointed me to several papers related to complex-system failure. I'm grateful for the references - they've helped me work through a lot of my thinking on the subject. Please keep them coming!
One theme I've seen is the "Swiss cheese" analogy for system failure. That is, individual activities or process steps are like slices of Swiss cheese, and the holes are individual errors. When the holes in several consecutive slices line up, the result is a systemic failure.
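To put rough numbers on the analogy, here is a minimal sketch. It assumes each slice fails independently of the others, which real systems often violate, and the per-slice probabilities and the function name `p_holes_line_up` are made up purely for illustration.

```python
from math import prod


def p_holes_line_up(hole_probs):
    """Chance that every slice has a hole in the same spot.

    Assumes the slices fail independently (an illustrative assumption),
    so the combined probability is the product of the per-slice ones.
    """
    return prod(hole_probs)


# Hypothetical numbers: three slices, each with a 10% chance of a hole.
three_slices = [0.1, 0.1, 0.1]
print(p_holes_line_up(three_slices))

# Adding a fourth slice (say, a peer review step) shrinks it further.
print(p_holes_line_up(three_slices + [0.1]))
```

Under these assumptions each extra slice cuts the chance of a lined-up failure by a factor of ten, which is why the analogy feels so reassuring.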
By this analogy, as long as we put actions in place to make sure the holes don't line up, we can avoid large failures. Those actions would be things like peer review, creating redundancy & backup systems, etc. This makes intuitive sense. Just prevent the holes from lining up! But there are two significant cases for which this analogy breaks down.
The first is a system that is constantly evolving. The processes and practices that were put in place yesterday could not take into account the change that happened today. For example, a product sales plan is derailed by the increased adoption of a substitute product. Or a doctor's diagnosis is affected because the doctor was involved in a minor car accident on the way to work.
In these examples, new holes are being punched into the cheese. New failure states are being created all the time, and reliance on process and behavior-type tools won't cover all the contingencies. Worse than that, false confidence in these tools may lead people to overlook the brand new holes that have appeared.
The second situation is that of a highly unlikely but high-impact event (a Black Swan). In this case, the probability of the event is so small that planners tend to overlook it (or at least underestimate its likelihood - what's the difference between 0.1% and 1%, anyway?). But when it occurs, the hole in the Swiss cheese is so huge that our safety measures can't cover it up.
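As a rough illustration of why the gap between 0.1% and 1% matters, here is a small sketch. It assumes the event has the same fixed chance in every period and that periods are independent. Both are simplifications, and the 100-period horizon and the function name `p_at_least_once` are hypothetical choices for this example.

```python
def p_at_least_once(p_per_period, periods):
    """Chance the event happens at least once across `periods` tries.

    Assumes a fixed, independent probability per period (illustrative
    only): 1 minus the chance that it never happens at all.
    """
    return 1.0 - (1.0 - p_per_period) ** periods


# Compare a 0.1% and a 1% per-period chance over 100 periods.
for p in (0.001, 0.01):
    print(f"{p:.1%} per period -> {p_at_least_once(p, 100):.1%} over 100 periods")
```

Under these assumptions, the 0.1% event shows up in roughly 1 of every 10 such stretches, while the 1% event shows up in roughly 6 of every 10. A difference that looks like rounding error in a single period becomes very large over time.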
For the second situation, there is a wealth of information in Nassim Taleb's book Antifragile and on his Facebook page.
The first situation is the one I'm thinking about right now. How do we create an environment where we both build processes that keep the holes from lining up and keep everyone constantly on the lookout for new holes?