CRC, or two copies?

Hello, I need to read and write a set of data to a semi-reliable medium.  So, when I need to retrieve a portion of the data I need to do an integrity check.

One way to do this would be to (a) compute a CRC of some sort and write the CRC after the data, and then when I retrieve the data, make sure it is consistent with the CRC.  If so, use it; if not, treat it as bad data.
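For concreteness, approach (a) might look like this sketch (Python, standard-library CRC-32; the helper names are just illustrative):

```python
import binascii
import struct

def pack_with_crc(payload):
    # Write the data followed by a little-endian CRC-32 of the data.
    return payload + struct.pack("<I", binascii.crc32(payload))

def unpack_checked(blob):
    # Recompute the CRC on retrieval; None means "treat as bad data".
    payload, crc = blob[:-4], struct.unpack("<I", blob[-4:])[0]
    return payload if binascii.crc32(payload) == crc else None
```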

Another way would be to (b) just write the data twice in a row.  Retrieve both data items, and check if they are the same.

The (b) approach is obviously more expensive in terms of storage space.  But assume I don't have much data and don't care about this -- which approach is more reliable, i.e., less chance of "bad" data being mistaken for "good" data?

It seems to me that storing a complete second copy of your data (approach b) should be more reliable, simply because fewer verification bits (CRCs are shorter) mean more chance for overlooked errors, no matter the method.  Or... on the other hand, is there some mathematical wizardry in some CRC calculations that somehow makes them more reliable than a full duplicate?

Any thoughts or insight is appreciated...

Store 3 copies and make sure that at least 2 agree.
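A minimal sketch of that 2-of-3 vote, assuming a hypothetical `read_copy(i)` accessor:

```python
def read_voted(read_copy):
    # read_copy(i) is a hypothetical accessor returning the i-th stored copy.
    a, b, c = read_copy(0), read_copy(1), read_copy(2)
    if a == b or a == c:
        return a  # a is confirmed by at least one other copy
    if b == c:
        return b  # a is the odd one out
    raise IOError("all three copies disagree")
```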
RonMexicoAuthor Commented:
Sure, that would be a different approach.  

My question had to do with the academic principle though -- is two copies always more reliable than a CRC simply by virtue of more bits and more verification power?  Or are there CRC calculations that somehow do a better job than straight copies?

My guess is the former, and CRCs only exist because it is not often practical to store or send two copies of the important data.  But I'm not 100% confident of this hypothesis.
CRC tells you there is a problem, but not where.

Duplicate copies narrow down where the error is, but still don't help you proceed.

Might as well add enough redundancy to actually be useful.

RonMexicoAuthor Commented:
I'm only interested in one thing: detecting an error.  Which does a better job, CRC or duplicate copy?
You have to look at the characteristics of the data, the characteristics and probabilities of the errors, and the costs associated with both kinds of failures.

Do you have a particular unreliable media or noisy channel in mind?
RonMexicoAuthor Commented:
I'm actually storing important bytes from my CPU to some dataflash over SPI.   When I retrieve them later, they could be corrupted by either (a) chip degradation or (b) electrical integration while transmitting between chips.

It sounds like you are disagreeing with my hypothesis, then: that if it is practical to store a duplicate copy, then it is always stronger than a CRC... simply by virtue of having (significantly) more verification bits.   Is that correct, you disagree?
RonMexicoAuthor Commented:
* interference not integration
You really have to know (or make assumptions about) error probability.
How big is the file, and how big is the checksum?

If the duplicated file is twice as large, isn't it twice as likely to have an error?
You can make the chance of random errors escaping the CRC arbitrarily small.
In practice, 2^-64 should be good enough for most purposes.

CRCs can also be used for error correction, so that if there are sufficiently few errors,
the original data can be recovered even after some of it has been corrupted.

This can be done in a much more efficient way than keeping 3 copies of the data,
and when done correctly, can recover from more severe data corruption.
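As a rough sketch of that correction idea (an illustration, not ozo's exact scheme): with a CRC-32 you can even brute-force a single-bit repair by re-flipping each bit until the check passes; real systems use the CRC syndrome or a dedicated ECC instead:

```python
import binascii

def repair_single_bit(data, stored_crc):
    # If the CRC already matches, the data is presumed intact.
    if binascii.crc32(data) == stored_crc:
        return data
    # Try flipping each bit in turn; a CRC match identifies the flipped bit.
    buf = bytearray(data)
    for i in range(len(buf) * 8):
        buf[i // 8] ^= 1 << (i % 8)
        if binascii.crc32(bytes(buf)) == stored_crc:
            return bytes(buf)
        buf[i // 8] ^= 1 << (i % 8)  # undo and try the next bit
    return None  # more than one bit flipped (or the CRC itself is corrupt)
```

This works because CRC-32 assigns a distinct syndrome to every single-bit error at these lengths, so exactly one candidate flip can match.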
1st answer is the best one.

On one hand, a checksum is a pretty reliable way to know if the file is corrupt. But then again, it is only as reliable as the medium it is stored in... if you can't rely on the medium, how can you rely on the checksum?

On the other hand, having 2 copies only makes matters worse: you waste space and you don't know which one is uncorrupted.

so... d-glitch is right: store 3 copies and make sure 2 of them check.
I think I do disagree, but admittedly I'm guessing.
If you have enough extra data bits to store 2 copies of the data, using those bits for CRC type checks can be much more reliable than simply keeping a second copy of each bit.
(which may not detect two errors, even if it always detects a single error)
RonMexicoAuthor Commented:
d-glitch: My hypothesis is that duplicate is *always* better than CRC (but usually not practical -- it happens to work fine for me).  If you are saying that it depends on error probability, then you must disagree with my hypothesis?  So then I'll turn it around... what circumstances (error probability, file size, etc) are you thinking of in which a CRC would be a better detector of corruption?

ozo: thank you, yes talking about random errors -- but you hit on the unique part of my question... I don't care about efficiency, I have lots of storage space, I only care about corruption detection, not treating bad data as good.  In that case would you choose a CRC over a duplicate copy?
RonMexicoAuthor Commented:
(Missed a couple of responses -- catching up)
3 copies allows you to recover from any single error, but may be fooled by a double error.
There are ways to use CRCs so that you can recover from double errors, using much less space than storing 3 copies, and with more sophisticated error correction, you can recover from even more errors.
RonMexicoAuthor Commented:
designatedinitializer: understand your response, thank you.  But in my case I don't care about either (a) the wasted space or (b) not knowing how to proceed.  Clarification on (b): i don't care which of the two is uncorrupted, only that some corruption has occurred.  I will either try to retrieve it again, or I will go with a default value -- the thing I'm trying to avoid is treating a corrupted value as good.  So between CRC and duplicate copy I'm trying to select the strongest filter.

Also, agree if the medium is bad the checksum is iffy -- but it's much better to throw away good values than to accept bad values.
If the probability of two errors is roughly the square of the probability of one error,
then the probability that a duplicate copy will not catch them may still be high.
With a few more CRC bits it should be easy to reduce the probability of not catching errors to well below that level.
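A back-of-envelope model makes this concrete (all numbers below are assumptions for illustration, with independent random bit flips):

```python
p = 1e-6      # assumed per-bit flip probability
n = 8 * 1024  # assumed payload size in bits (1 KiB)
k = 32        # assumed CRC width in bits

# Duplicate copy: the dominant undetected case is the same bit
# flipping in both copies, which can happen at any of the n positions.
dup_miss = n * p ** 2

# k-bit CRC: some corruption occurs somewhere in the n+k stored bits,
# and the corrupted block still happens to pass the check (~2^-k).
crc_miss = (n + k) * p * 2.0 ** -k

print(dup_miss, crc_miss)
```

Under these assumptions the CRC's undetected-error probability comes out a few thousand times smaller than the duplicate copy's, even though it spends 32 extra bits instead of 8192.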
RonMexicoAuthor Commented:
Hm, everyone likes the 3 copies idea, so thinking about that.  

If I have two copies, and get the same bit flipped between the two copies, then I will miss the error and accept a bad value.

If I have three copies and choose two that agree, and get the same bit flipped between any pair of copies, then I will miss the error and accept a bad value -- not sure this is any less likely.

It would be better than two copies, obviously, to have three copies and reject ALL if any disagree.  Could go to four and five copies, too, if I want to improve my filter.  

I was mostly wondering about the science behind it though: is rejecting based on a duplicate copy always stronger than a CRC?
RonMexicoAuthor Commented:
(reading your last comment ozo)
RonMexicoAuthor Commented:
So ozo you are basically saying a CRC is more powerful than a duplicate copy, because it is less likely to have the CRC and data both (randomly) corrupted in a way that matches, than it is to have two full sets of data both corrupted in a way that matches.
If you can tell me the error characteristics of the medium and the acceptable probability of an undetected error, I should be able to tell you how many CRC bits it would take to reduce the likelihood of an undetected error to that level, and how many copies of the original it would take to reduce it to the same level.  In almost all such cases, CRC should be more efficient.
Alternatively, you could tell me the error characteristics of the medium and the amount of extra space you want to use, and I should be able to tell you the probability of an undetected error if you fill that space with copies of the original, and the probability if you fill it with CRC-type checks.  Again, CRC should be more efficient in almost all scenarios.
"it is less likely to have the CRC and data both (randomly) corrupted in a way that matches, than it is to have two full sets of data both corrupted in a way that matches."
For an equivalent number of extra bits, yes.  (Unless the error characteristics of the media are very peculiar or the number of bits in the original data is very small)
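That claim can be checked exhaustively on a toy example: an 8-bit payload protected either by an 8-bit CRC or by an 8-bit duplicate (same overhead), hit by every possible 2-bit error in the 16 stored bits. This sketch assumes the common CRC-8 polynomial 0x07 (x^8+x^2+x+1):

```python
from itertools import combinations

def crc8(data, poly=0x07):
    # Bitwise CRC-8 over an iterable of byte values.
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def count_undetected(payload):
    stored_crc = crc8([payload])
    dup_miss = crc_miss = 0
    for i, j in combinations(range(16), 2):  # every 2-bit error pattern
        # Duplicate scheme: the 16 stored bits are payload,payload.
        word = (payload | (payload << 8)) ^ (1 << i) ^ (1 << j)
        a, b = word & 0xFF, word >> 8
        if a == b and a != payload:
            dup_miss += 1  # copies agree on a wrong value
        # CRC scheme: the 16 stored bits are payload,crc.
        word = (payload | (stored_crc << 8)) ^ (1 << i) ^ (1 << j)
        d, c = word & 0xFF, word >> 8
        if d != payload and crc8([d]) == c:
            crc_miss += 1  # corrupted data still passes the check
    return dup_miss, crc_miss
```

For any payload, the duplicate scheme is fooled by the 8 patterns that flip the same bit in both copies, while this CRC-8 detects every 2-bit error at this length (its minimum distance is 4 for short codewords).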
There's one other way:
Compute and store multiple checksums (CRC + MD5 + ...) Then check them at read. The only downside is that's more processing-intensive.
However, this way you get the best of both worlds.
If you are worried about false positives, then you can rest assured it is extremely unlikely that the data is corrupted if it passes all tests.
"is rejecting based on a duplicate copy always stronger than a CRC?"
It is almost always weaker than a properly designed CRC.
RonMexicoAuthor Commented:
"It is almost always weaker than a properly designed CRC."

ozo, sounds like I might benefit from some research, do you have any references for how you would calculate "how many CRC bits it would take to reduce the likelihood of an undetected error to that level"?
MD5 can be better than CRC for non-random errors,
like if someone was deliberately trying to corrupt the data in a way that escapes detection.
If the errors are random, a CRC with a sufficiently large polynomial should suffice.

If the random errors have certain known characteristics, (e.g. tending to occur in bursts)
then a CRC (or perhaps more sophisticated error detection) can be designed to optimize error detection for those types of errors.  (for example it might detect all bursts shorter than a certain length)
"how many CRC bits it would take to reduce the likelihood of an undetected error to that level"
In the simplest case, if you want the likelihood to be below a probability p,
you would need -lg(p) bits in your CRC.
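As a worked instance (the target probability here is an assumed example, not from the thread): for p = 10^-12, -lg(p) is just under 40, so a 40-bit check suffices against random errors:

```python
import math

p = 1e-12  # assumed acceptable undetected-error probability
bits = math.ceil(-math.log2(p))
print(bits)  # 40
```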

RonMexicoAuthor Commented:
Can you point me to an article or web page that explains this that you agree with?  In addition to understanding it, I need to provide a basis for my decision, with references...
