The question comes up too often and has to be addressed… Are a Checksum and Rolling Counter Enough?
I love Annex D within Part 5 of the ISO 26262:2011 standard, probably more than I should. It is where I start when I’m looking into whether an architecture is likely to meet its target ASIL. What is Annex D, you ask? It is the annex of ISO 26262 Part 5 that lists the potential safety mechanisms you can design into your system for coverage against common hardware failure modes. It is a Functional Safety Bible. If you make it through the bulk of Parts 1 through 5, you have found a gold mine. In Annex D, the authors allow us to claim high diagnostic coverage on our communication bus with a combination of information redundancy, frame counter, and timeout monitoring. However, multiple times a year, someone demonstrates to me how this combination can fail and suggests that it is therefore still not good enough. The key, however, is statistics.
It’s a hard concept to grasp, but our end goal is to reduce residual risk to an acceptable level, not to eliminate it. It is to create the safest systems possible within the confines of current state-of-the-art technology and society’s acceptance of risk. In the case of communication channel protection, though, it is easy to point out a potential flaw: a case where our chosen CRC algorithm fails. As engineers, we see this as a problem that we want to fix. Again, I point you to the purpose of the standard: to reduce residual risk to a tolerable level within reasonable engineering boundaries.
Still, I hope I can open you up to the concept that these are sufficient. Let’s head straight to the determination of hardware metrics, where these concepts become important. The lines of the FMEDA surrounding the communication bus could look something like this:
The standard gives you permission to do this. It also gives you several notes on how to achieve a CRC that provides high diagnostic coverage for message corruption and masquerading.
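To make the masquerading point concrete, here is a minimal sketch of one common approach: a message identifier is fed into the CRC calculation but never transmitted, so a frame sent under the wrong identity fails the receiver's check. The function names, the 16-bit `data_id`, the 4-bit counter, and the CRC-8 parameters (polynomial `0x1D`, initial value and final XOR of `0xFF`) are all assumptions chosen for illustration, not something the standard prescribes.

```python
def crc8(data: bytes, poly: int = 0x1D, init: int = 0xFF, xor_out: int = 0xFF) -> int:
    """Bitwise CRC-8, MSB first. Parameters are illustrative assumptions."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ xor_out


def protect(data_id: int, counter: int, payload: bytes) -> bytes:
    """Build counter + payload + CRC. The (hypothetical) data ID is covered
    by the CRC but is not part of the transmitted frame."""
    body = bytes([counter & 0x0F]) + payload
    crc = crc8(data_id.to_bytes(2, "big") + body)
    return body + bytes([crc])


def check(expected_data_id: int, frame: bytes) -> bool:
    """Recompute the CRC using the ID the receiver expects for this message."""
    body, received_crc = frame[:-1], frame[-1]
    return crc8(expected_data_id.to_bytes(2, "big") + body) == received_crc


frame = protect(data_id=0x0123, counter=5, payload=b"\x10\x20")
print(check(0x0123, frame))  # receiver expecting this message
print(check(0x0124, frame))  # receiver expecting a different message
```

A receiver that expects message `0x0123` accepts the frame, while one that expects `0x0124` rejects it, even though the bytes on the bus are identical.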
A better FMEDA may look like this (failure rates and distribution are for demonstration only):
I don’t think it takes much convincing to see how timeout monitoring and a frame counter (where we are actually concerned about the sequence of the values, not just the fact that it changed) are sufficient to provide high diagnostic coverage for their indicated failure modes. However, the case for information redundancy via a CRC is less convincing, especially when it’s easy with a small data packet to show that only a few flipped bits in the message can result in the same protection value. True, this could happen, and there are plenty of conflicting opinions and papers on the topic. (See references below.)
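For readers who want to see what "checking the sequence, not just the change" means in practice, here is a rough receiver-side sketch. The class name, the 4-bit counter width, and the 50 ms timeout are hypothetical values picked for the example; a real system would derive them from its message cycle time and fault tolerant time interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReceiverMonitor:
    """Hypothetical receiver-side monitor for one cyclic message."""
    timeout_s: float = 0.05      # assumed timeout, for illustration only
    counter_modulo: int = 16     # assumed 4-bit rolling counter
    last_counter: Optional[int] = None
    last_rx_time: Optional[float] = None

    def on_frame(self, counter: int, now: float) -> str:
        """Classify a received frame by comparing its counter with the last one."""
        previous = self.last_counter
        self.last_counter = counter
        self.last_rx_time = now
        if previous is None:
            return "ok"  # first frame: nothing to compare against yet
        delta = (counter - previous) % self.counter_modulo
        if delta == 0:
            return "repeated"              # same frame seen again
        if delta > 1:
            return "lost_or_out_of_order"  # one or more frames skipped
        return "ok"                        # exactly one step, wrap-around included

    def timed_out(self, now: float) -> bool:
        """True if no frame has arrived within the timeout window.
        Start-up handling (no frame ever received) is left out of this sketch."""
        if self.last_rx_time is None:
            return False
        return (now - self.last_rx_time) > self.timeout_s


monitor = ReceiverMonitor()
for counter, t in [(14, 0.00), (15, 0.01), (0, 0.02), (0, 0.03), (3, 0.04)]:
    print(counter, monitor.on_frame(counter, now=t))
```

Passing `now` in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.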
The rationale, as stated before, is statistics. We can detail these statistics in our safety analysis course, but for now, I am going to try to simplify it with this new FMEDA table expanding on the message corruption line:
So the analysis required of you is this: how effective is your algorithm for a specific number of bit failures? Each specific number of errant bits has its own, very low, probability of occurring, and your algorithm will provide some coverage in every case. Can you see how you will achieve 99% coverage overall when you consider the coverage achieved for each corruption case? You may have an algorithm that only covers 90% of 2-bit-flip corruption failures, but that figure is weighted against the cases where the algorithm covers 100% of other multiple-bit faults.
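The weighting argument can be tried out numerically. The sketch below brute-forces every 1- to 4-bit error pattern on a tiny frame to measure how many a CRC-8 catches, then weights those per-case figures by how often each case is assumed to occur. Everything specific here is an assumption for demonstration: the CRC-8 parameters, the two-byte payload, and above all the `share` distribution, which is invented and not taken from any failure-rate source.

```python
from itertools import combinations


def crc8(data: bytes, poly: int = 0x1D, init: int = 0xFF, xor_out: int = 0xFF) -> int:
    """Bitwise CRC-8, MSB first. Parameters are illustrative assumptions."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc ^ xor_out


def detection_fraction(frame: bytes, flipped_bits: int) -> float:
    """Fraction of all patterns with exactly `flipped_bits` bit flips
    (anywhere in payload or CRC byte) that the CRC check detects."""
    n_bits = len(frame) * 8
    total = detected = 0
    for positions in combinations(range(n_bits), flipped_bits):
        corrupted = bytearray(frame)
        for pos in positions:
            corrupted[pos // 8] ^= 0x80 >> (pos % 8)
        total += 1
        if crc8(bytes(corrupted[:-1])) != corrupted[-1]:
            detected += 1
    return detected / total


payload = bytes([0x12, 0x34])
frame = payload + bytes([crc8(payload)])

# Measured coverage per corruption case (number of flipped bits)
coverage = {k: detection_fraction(frame, k) for k in (1, 2, 3, 4)}

# Invented share of corruption events per case, for demonstration only
share = {1: 0.90, 2: 0.08, 3: 0.015, 4: 0.005}

overall = sum(share[k] * coverage[k] for k in share) / sum(share.values())

for k in coverage:
    print(f"{k}-bit errors: {coverage[k]:.2%} detected")
print(f"weighted coverage: {overall:.2%}")
```

The per-case figures will change with the polynomial and the message length, which is exactly the analysis you need to perform for your own bus. The point of the sketch is only the last step: a case with imperfect coverage contributes to the overall number in proportion to how rarely it occurs.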
I understand that being able to point out a fault in your safety mechanism leaves you uneasy. That makes me proud of you as a functional safety engineer. After all, what good safety engineer wouldn’t struggle with some unease? As a diligent functional safety engineer myself, I wish I were 100% comfortable suggesting that you believe in the power of statistics and in the randomness of random hardware failures, and accept what the standard permits. Don’t worry though, I am still going to leave you with the suggestion of a parity bit and really good data rationality checks.
- Undetected Error Probability for Data Services in a Terrestrial DAB Single Frequency Network, R. Schiphorst, F.W. Hoeksema and C.H. Slump
- Koopman-CRC Webinar