Solved

CRC, or two copies?

Posted on 2012-03-24
422 Views
Last Modified: 2012-03-25
Hello, I need to read and write a set of data to a semi-reliable medium.  So, when I need to retrieve a portion of the data I need to do an integrity check.

One way to do this would be to (a) compute a CRC of some sort and write the CRC after the data, and then when I retrieve the data, make sure it is consistent with the CRC.  If so, use it; if not, treat it as  bad data.

Another way would be to (b) just write the data twice in a row.  Retrieve both data items, and check if they are the same.

The (b) approach is obviously more expensive in terms of storage space.  But assume I don't have much data and don't care about this -- which approach is more reliable, i.e., less chance of "bad" data being mistaken for "good" data?

It seems to me that storing a complete second copy of the data (approach b) should be more reliable, simply because fewer verification bits (CRCs are shorter) mean more chance of an overlooked error, no matter the method.  Or, on the other hand, is there some mathematical wizardry in some CRC calcs that somehow makes them more reliable than a full duplicate?

Any thoughts or insight is appreciated...
0
Question by:RonMexico
28 Comments
 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 167 total points
ID: 37761711
Store 3 copies and make sure that at least 2 agree.
0
 

Author Comment

by:RonMexico
ID: 37761718
Sure, that would be a different approach.  

My question had to do with the academic principle though -- is two copies always more reliable than a CRC simply by virtue of more bits and more verification power?  Or are there CRC calculations that somehow do a better job than straight copies?

My guess is the former, and CRCs only exist because it is not often practical to store or send two copies of the important data.  But I'm not 100% confident of this hypothesis.
0
 
LVL 27

Expert Comment

by:d-glitch
ID: 37761721
CRC tells you there is a problem, but not where.

Duplicate copies narrow down where the error is, but still don't help you proceed.

Might as well add enough redundancy to actually be useful.
0
 

Author Comment

by:RonMexico
ID: 37761723
I'm only interested in one thing: detecting an error.  Which does a better job, CRC or duplicate copy?
0
 
LVL 27

Expert Comment

by:d-glitch
ID: 37761726
You have to look at the characteristics of the data, the characteristics and probabilities of the errors, and the costs associated with both kinds of failures.

Do you have a particular unreliable media or noisy channel in mind?
0
 

Author Comment

by:RonMexico
ID: 37761736
I'm actually storing important bytes from my CPU to some dataflash over SPI.   When I retrieve them later, they could be corrupted by either (a) chip degradation or (b) electrical integration while transmitting between chips.

It sounds like you are disagreeing with my hypothesis, then: that if it is practical to store a duplicate copy, it is always stronger than a CRC... simply by virtue of having (significantly) more verification bits.  Is that correct -- you disagree?
0
 

Author Comment

by:RonMexico
ID: 37761739
* interference not integration
0
 
LVL 27

Expert Comment

by:d-glitch
ID: 37761756
You really have to know (or make assumptions about) error probability.
How big is the file, and how big is the checksum?

If the duplicated file is twice as large, isn't it twice as likely to have an error?
0
 
LVL 84

Expert Comment

by:ozo
ID: 37761761
You can make the chance of random errors escaping the CRC arbitrarily small.
In practice, 2^-64 should be good enough for most purposes.

CRCs can also be used for error correction, so that if there are sufficiently few errors,
the original data can be recovered even after some of it has been corrupted.

This can be done in a much more efficient way than keeping 3 copies of the data,
and when done correctly, can recover from more severe data corruption.
0
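To make approach (a) concrete, here is a minimal bitwise CRC-32 sketch in C.  The reflected polynomial 0xEDB88320 and the zero seed are assumptions for illustration, not anything specified in this thread.

#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, zlib-compatible), no lookup table.
   Pass 0 as the initial crc; the returned value is the checksum to store after the data. */
uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

On write, append the 32-bit result after the payload; on read, recompute over the payload and compare with the stored value, treating any mismatch as bad data.  For random corruption, the chance of an error slipping past an n-bit check is roughly 2^-n, which is the "arbitrarily small" knob ozo refers to.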
 
LVL 7

Expert Comment

by:designatedinitializer
ID: 37761768
1st answer is the best one.

On one hand, a checksum is a pretty reliable way to know if the file is corrupt. But then again, it is only as reliable as the medium it is stored on... if you don't trust the medium, how can you trust the checksum?

On the other hand, having 2 copies only makes matters worse: you waste space and you don't know which one is the uncorrupted one.

So... d-glitch is right: store 3 copies and make sure 2 of them check.
0
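For what it's worth, a minimal sketch of the 2-of-3 read in C.  The function name and buffer layout are made up for illustration.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 2-of-3 vote across three stored copies: fills 'out' and
   returns 0 if at least two copies agree byte-for-byte, -1 otherwise. */
int read_2of3(const uint8_t *a, const uint8_t *b, const uint8_t *c,
              uint8_t *out, size_t len)
{
    if (memcmp(a, b, len) == 0 || memcmp(a, c, len) == 0) {
        memcpy(out, a, len);
        return 0;
    }
    if (memcmp(b, c, len) == 0) {
        memcpy(out, b, len);
        return 0;
    }
    return -1;  /* no two copies agree: treat the record as corrupted */
}

Note that this whole-record vote still accepts a bad value if two copies happen to be corrupted identically, which is the weakness discussed further down the thread.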
 
LVL 27

Expert Comment

by:d-glitch
ID: 37761772
I think I do agree, but I admit I'm just guessing.
0
 
LVL 84

Expert Comment

by:ozo
ID: 37761773
If you have enough extra data bits to store 2 copies of the data, using those bits for CRC type checks can be much more reliable than simply keeping a second copy of each bit.
(which may not detect two errors, even if it always detects a single error)
0
 

Author Comment

by:RonMexico
ID: 37761775
d-glitch: My hypothesis is that duplicate is *always* better than CRC (but usually not practical -- it happens to work fine for me).  If you are saying that it depends on error probability, then you must disagree with my hypothesis?  So then I'll turn it around... what circumstances (error probability, file size, etc) are you thinking of in which a CRC would be a better detector of corruption?

ozo: thank you, yes talking about random errors -- but you hit on the unique part of my question... I don't care about efficiency, I have lots of storage space, I only care about corruption detection, not treating bad data as good.  In that case would you choose a CRC over a duplicate copy?
0
 

Author Comment

by:RonMexico
ID: 37761778
(Missed a couple of responses -- catching up)
0

 
LVL 84

Expert Comment

by:ozo
ID: 37761789
3 copies allows you to recover from any single error, but may be fooled by a double error.
There are ways to use CRCs so that you can recover from double errors, using much less space than storing 3 copies, and with more sophisticated error correction, you can recover from even more errors.
0
 

Author Comment

by:RonMexico
ID: 37761793
designatedinitializer: I understand your response, thank you.  But in my case I don't care about either (a) the wasted space or (b) not knowing how to proceed.  Clarification on (b): I don't care which of the two is uncorrupted, only that some corruption has occurred.  I will either try to retrieve it again, or I will go with a default value -- the thing I'm trying to avoid is treating a corrupted value as good.  So between CRC and duplicate copy I'm trying to select the strongest filter.

Also, agree if the medium is bad the checksum is iffy -- but it's much better to throw away good values than to accept bad values.
0
 
LVL 84

Expert Comment

by:ozo
ID: 37761804
If the probability of two errors is roughly the square of the probability of one error,
then the probability that a duplicate copy will not catch them may still be high.
With a few more CRC bits it should be easy to reduce the probability of not catching errors to well below that level.
0
 

Author Comment

by:RonMexico
ID: 37761806
Hm, everyone likes the 3 copies idea, so I'm thinking about that.

If I have two copies, and get the same bit flipped between the two copies, then I will miss the error and accept a bad value.

If I have three copies and choose two that agree, and get the same bit flipped between any pair of copies, then I will miss the error and accept a bad value -- not sure this is any less likely.

It would be better than two copies, obviously, to have three copies and reject ALL if any disagree.  Could go to four and five copies, too, if I want to improve my filter.  

I was mostly wondering about the science behind it though: is rejecting based on a duplicate copy always stronger than a CRC?
0
 

Author Comment

by:RonMexico
ID: 37761810
(reading your last comment ozo)
0
 

Author Comment

by:RonMexico
ID: 37761816
So ozo you are basically saying a CRC is more powerful than a duplicate copy, because it is less likely to have the CRC and data both (randomly) corrupted in a way that matches, than it is to have two full sets of data both corrupted in a way that matches.
0
 
LVL 84

Expert Comment

by:ozo
ID: 37761829
If you can tell me the error characteristics of the medium and the acceptable probability of an undetected error, I should be able to tell you how many CRC bits it would take to reduce the likelihood of an undetected error to that level, and how many copies of the original it would take to reduce it to the same level.  In almost all such cases, CRC should be more efficient.
You could also tell me the error characteristics of the medium and the amount of extra space you want to use, and I should be able to tell you the probability of an undetected error if you fill that space with copies of the original, versus the probability if you fill it with CRC-type checks.  Again, CRC should be more efficient in almost all scenarios.
0
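As a rough illustration of the kind of back-of-envelope calculation ozo describes, here is a sketch that assumes independent random bit flips; the 1 KB size, 1e-6 per-bit flip rate, and CRC-32 width are made-up numbers.

#include <stdio.h>
#include <math.h>

/* Rough model: n data bits, independent per-bit flip probability q,
   k-bit CRC.  A duplicate copy is fooled when both copies suffer the
   same nonzero error pattern; the dominant term is the same single bit
   flipping in both copies (~ n*q^2).  A k-bit CRC is fooled by random
   corruption roughly once per 2^k corrupted reads. */
int main(void)
{
    double n = 1024.0 * 8;   /* 1 KB of data, in bits (assumption)      */
    double q = 1e-6;         /* per-bit flip probability (assumption)   */
    int    k = 32;           /* CRC-32 width (assumption)               */

    double p_any = 1.0 - pow(1.0 - q, n);   /* some corruption occurs   */
    double p_dup = n * q * q;               /* duplicate copy fooled    */
    double p_crc = p_any * pow(2.0, -k);    /* CRC-32 fooled            */

    printf("P(corruption)       ~ %.3g\n", p_any);
    printf("P(duplicate fooled) ~ %.3g\n", p_dup);
    printf("P(CRC-32 fooled)    ~ %.3g\n", p_crc);
    return 0;
}

With these particular numbers the 32-bit CRC is fooled a few thousand times less often than the full duplicate, while using 1/256th of the extra space.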
 
LVL 84

Expert Comment

by:ozo
ID: 37761834
"it is less likely to have the CRC and data both (randomly) corrupted in a way that matches, than it is to have two full sets of data both corrupted in a way that matches"
For an equivalent number of extra bits, yes.  (Unless the error characteristics of the media are very peculiar or the number of bits in the original data is very small)
0
 
LVL 7

Assisted Solution

by:designatedinitializer
designatedinitializer earned 166 total points
ID: 37761847
There's one other way:
Compute and store multiple checksums (CRC + MD5 + ...), then check them all at read time. The only downside is that it's more processing-intensive.
However, this way you get the best of both worlds.
If you are worried about false positives, then you can rest assured it is extremely unlikely that the data is corrupted if it passes all tests.
0
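A sketch of that combined check in C, assuming zlib (for crc32) and OpenSSL (for MD5) are available and the program is linked with -lz -lcrypto.

#include <string.h>
#include <zlib.h>          /* crc32(), assumes zlib is available    */
#include <openssl/md5.h>   /* MD5(), assumes OpenSSL is available   */

/* Recompute both digests over the payload and require both to match
   the stored values before trusting the data. */
int record_ok(const unsigned char *data, size_t len,
              unsigned long stored_crc,
              const unsigned char stored_md5[MD5_DIGEST_LENGTH])
{
    unsigned char md[MD5_DIGEST_LENGTH];
    unsigned long crc = crc32(0L, data, (uInt)len);
    MD5(data, len, md);
    return crc == stored_crc && memcmp(md, stored_md5, sizeof md) == 0;
}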
 
LVL 84

Expert Comment

by:ozo
ID: 37761859
"is rejecting based on a duplicate copy always stronger than a CRC?"
It is almost always weaker than a properly designed CRC.
0
 

Author Comment

by:RonMexico
ID: 37761880
"It is almost always weaker than a properly designed CRC."

ozo, sounds like I might benefit from some research, do you have any references for how you would calculate "how many CRC bits it would take to reduce the likelihood of an undetected error to that level"?
0
 
LVL 84

Expert Comment

by:ozo
ID: 37761887
MD5 can be better than CRC for non-random errors,
like if someone was deliberately trying to corrupt the data in a way that escapes detection.
If the errors are random, a CRC with a sufficiently large polynomial should suffice.

If the random errors have certain known characteristics (e.g. tending to occur in bursts),
then a CRC (or perhaps more sophisticated error detection) can be designed to optimize error detection for those types of errors.  (For example, it might detect all bursts shorter than a certain length.)
0
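For example, a 16-bit CRC whose generator polynomial has a nonzero constant term catches every error burst of 16 bits or less.  A bitwise CRC-16-CCITT sketch in C; the polynomial 0x1021 and init 0xFFFF are chosen here only as a common example, not because the thread specifies them.

#include <stdint.h>
#include <stddef.h>

/* CRC-16-CCITT (poly 0x1021, init 0xFFFF, no reflection).  Because the
   polynomial has a nonzero constant term, this 16-bit CRC detects every
   error burst of length 16 bits or less, wherever it lands. */
uint16_t crc16_ccitt(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}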
 
LVL 84

Accepted Solution

by:ozo
ozo earned 167 total points
ID: 37761892
"how many CRC bits it would take to reduce the likelihood of an undetected error to that level"
In the simplest case, if you want the likelihood to be below a probability p,
you would need -lg(p) bits in your CRC.
0
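A quick worked example of that rule of thumb; the target probability 1e-9 is just an assumption.

#include <stdio.h>
#include <math.h>

/* ceil(-log2(p)) CRC bits pushes the undetected-random-error rate below p. */
int main(void)
{
    double p = 1e-9;                 /* acceptable undetected-error probability (assumption) */
    double bits = ceil(-log2(p));    /* -lg(p), rounded up */
    printf("p = %g  ->  about %.0f CRC bits; a CRC-32 more than covers it\n", p, bits);
    return 0;
}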
 

Author Comment

by:RonMexico
ID: 37761901
Can you point me to an article or web page that explains this and that you agree with?  In addition to understanding it, I need to provide a basis for my decision, with references...

Thanks.
0
