In safety engineering, it’s important to maintain a clear understanding of different failure categories. Per ISO 26262, we define random hardware failures as “failures that can occur unpredictably during the lifetime of a hardware element, and that follow a probability distribution.” We distinguish random hardware failures from systematic failures, which ISO 26262 defines as “failure related in a deterministic way to a certain cause, that can only be eliminated by a change of the design or of the manufacturing process, operational procedures, documentation or other relevant factors.”
Put more simply, systematic failures are mistakes or oversights in the design; a systematic failure is caused by human error. By contrast, random hardware failures “just happen,” in a way that is not deterministic and not tied to any obvious problem or mistake. So even in a “correct” design, with no evident flaws or oversights, we must account for the possibility of a random hardware failure. (We can counteract such failures with, for example, redundancy, run-time monitoring, or proof testing.)
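To make the run-time-monitoring idea concrete, here is a minimal Python sketch of a dual-channel computation: the same conversion is performed two different ways and the results are compared, so a divergence at run time flags a possible hardware fault. The function names and tolerance are illustrative assumptions, not anything prescribed by ISO 26262.

```python
# Minimal sketch of run-time monitoring via redundant computation.
# All names and values here are illustrative, not from any standard.

def scale_sensor_reading(raw: int) -> float:
    """Primary channel: convert a raw 10-bit ADC count to a voltage."""
    return raw * 5.0 / 1023.0

def scale_sensor_reading_check(raw: int) -> float:
    """Redundant channel: the same conversion, computed a different way."""
    return (raw / 1023.0) * 5.0

def monitored_reading(raw: int, tolerance: float = 1e-9) -> float:
    """Compute both channels; a mismatch suggests a hardware fault."""
    primary = scale_sensor_reading(raw)
    check = scale_sensor_reading_check(raw)
    if abs(primary - check) > tolerance:
        raise RuntimeError("channel mismatch: possible random hardware failure")
    return primary
```

In a real design the two channels would run on independent hardware, so that a single random failure cannot corrupt both results the same way.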
Digging deeper, what do we really mean when we describe a failure as “random”? Don’t all failures happen for a reason? Are we trying to say that some failures have no root cause? It might sound that way, but in fact every failure has a root cause. It’s just that the root cause isn’t always observable or controllable in practice.
As a thought exercise, let’s consider a hypothetical production run of one billion resistors. Let’s suppose that they are all shipped in one big box to the ECU assembly plant. And let’s also suppose that when that box is unloaded, it is accidentally dropped on its corner, and the one resistor sitting in that corner is slightly dented or stressed. Then that one resistor, along with all the rest, is assembled into one billion controllers installed in one billion cars. And then time goes by…
Over time, we may observe that almost all of those controllers live out their useful life with no failures. But one resistor, the same one that was slightly dented, cannot survive the repeated thermo-mechanical stresses and fails open. A reliability engineer might observe this result and record that the “probability of random failure over system life is one in one billion.” Was there a root cause? Yes, that one resistor was damaged in shipping! But that’s a pretty difficult fact to discern years later with scant data. We don’t know what caused it, we can’t find anything wrong in the design, and all the rest were working fine. It just seems so…random. And so we call this a random failure.
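To put the reliability engineer’s number in more standard units, here is a back-of-the-envelope conversion of “one failure in one billion units over system life” into a FIT rate (failures per 10^9 device-hours). The 8,000-hour operating life is an assumption for illustration, not a figure from the example.

```python
# Convert "1 failure among 1e9 units over system life" into a FIT rate.
# The operating-life figure below is an illustrative assumption.

failures = 1
units = 1_000_000_000
operating_hours_per_unit = 8_000  # e.g. ~2 hours/day over ~11 years

device_hours = units * operating_hours_per_unit  # 8e12 device-hours

# FIT (failures in time) = failures per 1e9 device-hours
fit = failures * 1e9 / device_hours
print(fit)  # 0.000125 FIT under these assumptions
```

An extraordinarily low rate for a single component, which is exactly why the shipping-damage root cause is effectively invisible in field data.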
If we had the insight to see what made that one resistor fail, we might call it a systematic failure. The proper fix in that case might be to better manage shipping and handling of the parts (maybe using smaller boxes!). But in real-world cases, some seemingly minor issues are simply not observable to us; we cannot follow every part at every moment. In many cases, it’s the cumulative effect of these minor, unobservable quality-related issues that leads us to so-called random hardware failure. Transient errors, or “bit flips,” are a similar but distinct case: we know the root cause (radiation particles), but we can only control it so much. The uncontrollable portion is treated as “random.”
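The detect-and-handle approach to bit flips can be sketched in a few lines: store each critical value alongside its bitwise complement, and verify the pair on every read. This is a toy illustration of the principle only (real ECUs rely on hardware ECC, lockstep cores, and similar measures), and the class name and API are invented for this example.

```python
# Toy illustration of bit-flip detection: keep a complement "shadow" copy
# of each critical value and check consistency on every read.
# Invented example; real systems use hardware ECC and hardened memory cells.

class GuardedWord:
    """A 32-bit value stored redundantly so single-bit upsets are detectable."""

    MASK = 0xFFFFFFFF

    def __init__(self, value: int):
        self._value = value & self.MASK
        self._shadow = ~value & self.MASK  # bitwise-complement copy

    def read(self) -> int:
        # If any bit flipped in either copy, value XOR shadow
        # is no longer all ones.
        if (self._value ^ self._shadow) != self.MASK:
            raise RuntimeError("bit flip detected: value/shadow mismatch")
        return self._value

w = GuardedWord(0xCAFE)
assert w.read() == 0xCAFE

w._shadow ^= 0x4  # simulate a radiation-induced single-bit upset
# A subsequent w.read() raises RuntimeError instead of returning bad data
```

Note that this scheme only detects corruption; correcting it requires more redundancy (e.g., ECC codes or triple modular redundancy with voting).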
Next time we’ll look at some of the hidden numbers used in random failure analysis…