Brandon Lyon

Filesystems with data integrity

When I fix people's computers 4 times out of 5 it's because operating system files get corrupted. I understand that can be prevented by using a filesystem built for data integrity. I am unclear which disk format to use and why.

After some research it appears that the only option for Windows is ReFS but that is focused on enterprise and can't be used for the OS disk, so Windows doesn't really have any options yet as far as I can tell.

For my Linux machines it looks like Btrfs and ZFS are both options, but I'm not sure which one to use or whether there are any better options. I tend to run Arch-based machines but am comfortable and happy with Debian options as well.

From what I can gather, Apple uses a combination of hardware ECC and APFS to ensure data integrity, but is that enough, or should I use another *nix-based format like ZFS?

What are the downsides if any to these data integrity focused disk formats other than potentially performance?

Edit: Let's assume that the system is not impacted by malware and already has a working UPS with graceful shutdown.
Gary Patterson, CISSP

This really isn't a filesystem issue. It is a disk integrity issue. Disks get corrupted when a drive fails, during an abrupt power outage, or from malware.

You solve these by using mirrored disks or a RAID array, using a UPS that can shut the system down gracefully, and by following good AV and security practices.
Four out of five times you're seeing corrupted OS files? Wow, 80% seems like an unusually high ratio for corrupted files unless you're just known as the go-to guy for that kind of thing, or unless you're in a corporate environment where you're using the same hardware across a lot of machines and that hardware is failing. For me, usually if someone needs their computer fixed, it's because the system is teeming with malware.

The fact that you're asking about disk formats as a preventive measure tells me you're in a position to do things to prevent the corruption from happening, but some details would help out here.

Now, the filesystem choice almost seems irrelevant unless you can identify the root cause for the corruption. I mean, let's say that malware is responsible for corrupting the files (e.g. replacing standard files with infected ones). Chances are that the filesystem alone isn't going to prevent this from happening because the filesystem probably can't distinguish malware from legitimate software (e.g. Windows Update).

If it's hardware related (which seems more likely, but that's just my anecdotal opinion), then I would recommend higher-quality hardware first, and then a more redundant hardware architecture second. For example, let's say you're buying a lot of hard drives and you're seeing a 30% failure rate within one year. It might be cheaper to buy a more expensive brand (e.g. HGST is pretty consistently the most reliable per most reports, like this one: https://www.digitaltrends.com/computing/backblaze-reliability-report-2018-hgst/), and have fewer failures.

Alternatively, you could look at a RAID setup. A basic RAID-1 setup would guard against any single drive failure, which would give you the opportunity to swap out the drive without losing any data (and the user could continue to limp along on the one remaining drive until you got to them). Sure, the performance isn't great, but you'd be able to use a standard filesystem for better software-level compatibility, and RAID-1 is pretty trivial to set up.
Brandon Lyon

ASKER

That is technically correct, but those measures wouldn't fully mitigate the issue and, in my experience, aren't the cause of it most of the time. A RAID array can propagate corrupted files. AV and security don't really have anything to do with corrupted files 99% of the time. Ungraceful shutdowns are often not the cause either.
Looks like Gary beat me to the punch while I was writing that up, but he included an important (and cheap) option that I missed, which is a UPS battery backup. Sudden power loss on a system can have a detrimental effect on hardware (not to mention that the guy who just spent the last 3 hours on an unsaved Photoshop project is probably ready to set something on fire). So yeah, UPS can help out a lot here, too.
Unfortunately this is fixing other people's computers. Family and friends, stuff like that. I can't buy their hardware for them. I'm often asked for hardware and software recommendations but that is after it's too late. When I rebuild their computers I would like to make sure their filesystems are as resilient as their hardware.

I'm not really concerned with technical difficulty of implementing a fix. I'm more concerned with not coming back and fixing something again because I fixed it right the last time.
Yes, a RAID array would propagate corrupted files if the corruption is due to something other than hardware. Again, you should be isolating the root cause of the corruption first and looking to mitigate that. Corruption doesn't usually just happen for no reason at all, and the odds of corruption hitting critical operating system files before it hits a bunch of other non-essential files is usually pretty slim (simply because you usually have more non-OS data than OS data).
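One way to start isolating the root cause, as suggested above, is to record known-good hashes right after a rebuild and compare against them on the next visit. The following is only a minimal sketch: the function names (`build_manifest`, `changed_files`) are made up for illustration, and a mismatch is not proof of corruption, since legitimate updates also change file contents.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a hash for every regular file under root (hypothetical helper)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are now missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel)
    return bad
```

Comparing how many of the changed files sit in OS directories versus user data gives a rough check on the point that random corruption should hit non-OS files more often, simply because there are more of them.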

The AV can be critical here if it's malware-induced corruption. Several AV vendors nowadays (Comodo, for example) have automatic sandboxes that can prevent unknown software from making changes to the filesystem (while allowing known, digitally-signed software to continue operating normally).
I agree that isolating the root cause is important. It's one of the reasons I'm trying out different filesystems.
So for family and friends, I'd almost definitely go down the route of instituting some decent A/V. My parents often fell victim to malware, especially when a malicious ad would present some legitimate-looking pop-up that used social engineering to get them to download and install things. They've gone through many cycles of problems that often required a complete wipe and reinstall, but ever since I added Comodo (there are other options, too), my mom will occasionally email me to tell me that Comodo presented some message about something being blocked and whether it's okay or not, etc, etc - the end result is that they've had a pretty consistently smooth experience ever since.

Trying to solve the problem with a different filesystem is drastic and unlikely to address the underlying root cause. It also puts those users in a less common setup that might make their lives harder and leave them with little support if they need help from someone else. Whenever I help out my parents, I always think in the back of my head, "Would someone else be able to step in and help if I were to get hit by a bus tomorrow?"

I mean, if it were just as simple as having a better filesystem, then that would be the default filesystem for Windows instead of NTFS.
Plus, a good A/V should be more effective at not only mitigating malware-caused corruption but also identifying it (or eliminating it) as a root cause.
And when I say a "good A/V" - I'm talking about something that is a little more full-fledged and NOT just an on-demand, free scanner like ClamAV or Avast or something.  Those might be okay when you know you have a problem file or you're trying to scan something before executing it, but nowadays the malware can sometimes be spread without any human interaction at all, so it's important to have something that is constantly active and monitoring the filesystem and has a sandbox and can cover multiple channels (email, web, etc).

Yes, it costs money, but it's usually far cheaper in terms of time and/or money than the alternative.

This is only my opinion, but I'd suggest either Comodo or Kaspersky or Symantec for brands. They tend to produce quality, consumer-targeted security bundles, and they often sell the licenses in bulk so you can buy one package and literally cover all the PCs in a household.
ASKER CERTIFIED SOLUTION
David
This solution is only available to members.
I see the original question was edited to mention an assumption of no malware and clean shutdowns.

Again, I have to emphasize that corruption doesn't just happen for no reason (and when it does happen, it doesn't usually target OS files), and a filesystem change isn't going to really provide insight into the root cause of the corruption.

Also a different filesystem is more likely to CREATE problems than solve any, especially on a system like Windows where NTFS is so common (and if I recall correctly, is the ONLY filesystem for the OS unless you're considering older FAT-based options). ZFS doesn't work on Windows, last time I checked.

So I'm trying to figure out why you're intent on the filesystem path. It's puzzling, to say the least. If you're absolutely certain that there is no malware, then the more likely option is a failing drive.
I should also note that I'm focused on ZFS because you're talking about fixing others' computers, and in my experience, if a family or friend is asking for help, they probably aren't a power user who would have anything but Windows installed anyway (unless they're on Macs).

When you say 4 out of 5 - how many unique computers have you encountered with this problem and how did you verify that they were free from malware?
Is it the filesystem itself that is being corrupted, requiring a chkdsk (which points to a hardware problem), or are files on the filesystem getting corrupted? I think you are barking up the wrong tree by blaming the filesystem; NTFS is rather robust.
My company provides desktop and network services, including data recovery.  I've been dealing with support issues like this since PCs were introduced (in 1981), so I've got a pretty good handle on the subject.

"4 out of 5" problems caused by "corrupt OS files" (what does that mean, by the way?) is a weirdly high number.  If that's an accurate number based on your experience, then you're clearly dealing with some sort of "microcosm" of support issues that is skewing your perspective:  a single system you are coming back to over and over due to malware, a user who yanks the power cord frequently, or a failing drive you haven't isolated and replaced.  

"Reliable file systems" primarily protect against physical data corruption: a failing disk drive and, to some extent, improper shutdown. Even then, they don't provide a lot of benefit in end-user systems - we use them in very busy enterprise systems.

The kind of problems you are describing with end user systems happen from 1) physical disk failure, 2) improper shutdown, and 3) malware.  Reliable file systems can reduce the risk of corruption due to physical disk damage (but can't eliminate it, especially in a "single disk" failure scenario), but you still have to diagnose and replace the failing disk or eventually no amount of file system redundancy will help you.  They can also provide additional protection against corruption from improper shutdown - you will have data loss, but files should be left in a valid state.  

They provide no additional protection against malware.
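The difference between detecting and repairing corruption, and why a single disk limits what a checksumming filesystem can do, can be sketched with a toy model. This is an illustration under stated assumptions, not how ZFS or Btrfs are implemented: `Disk`, `read_plain_mirror` and `read_checksummed` are made-up names, and `zlib.crc32` stands in for whatever checksum a real filesystem uses.

```python
import zlib


class Disk:
    """Toy disk: a list of blocks whose contents can silently change."""

    def __init__(self, blocks):
        self.blocks = [bytes(b) for b in blocks]


def read_plain_mirror(primary, secondary, index):
    """Plain mirror read: returns the primary copy without verifying it."""
    return primary.blocks[index]


def read_checksummed(disks, checksums, index):
    """Return the first copy whose checksum matches, repairing bad copies."""
    good = None
    for disk in disks:
        if zlib.crc32(disk.blocks[index]) == checksums[index]:
            good = disk.blocks[index]
            break
    if good is None:
        # Corruption is detected, but there is nothing to repair it from.
        raise IOError(f"block {index}: no copy matches its checksum")
    for disk in disks:
        if disk.blocks[index] != good:
            disk.blocks[index] = good  # self-heal from the good copy
    return good
```

With two disks, a silently damaged block is detected and rewritten from the good copy. With one disk the same read raises an error: the damage is reported rather than returned as valid data, but the data is still lost, which matches the "single disk" caveat above.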

For protection against a failing disk, nothing beats redundant hardware.

For protection against improper shutdown, nothing beats a UPS, followed closely by user training (for end user systems, that is).

That said, to answer the question of "what harm?", other than performance, the answer is "none".
Thanks for trying to help everyone. This is all off-topic. I apologize for framing the question with extraneous details. I'll delete the question and try again with a clearer question.
I wouldn't necessarily delete this - there's some decent content in here that might be useful to others. Just my two cents. -shrug-
If 4 out of 5 of the computers that come to you for repair have data corruption, it's very likely that it's user error amongst the users you support, rather than computer error. No amount of technology will fix that. I've handled several hundred computers and I just don't see that kind of failure that frequently. Most mainstream file systems today use journalling (or copy-on-write), which should already make them more robust than early file systems that didn't. In those olden days, I did see a lot more file and disk corruption than I do now. If they're causing so much failure even with journalling enabled, this suggests the users are forcing hard power shutdowns rather frequently, or they have an unstable electrical power supply, or something else in the environment.
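For readers unfamiliar with what journalling buys you, here is a rough sketch of the idea. It is a toy model with assumptions throughout: `JournaledStore` and the `crash_at` labels are invented for illustration, both the journal and the data live in memory, and real filesystems differ in what they journal (often only metadata).

```python
class JournaledStore:
    """Toy model: 'journal' and 'data' both stand in for on-disk state."""

    def __init__(self):
        self.journal = []
        self.data = {}

    def write(self, updates, crash_at=None):
        """Apply updates as one transaction.

        crash_at simulates a power cut at "before_commit" or "mid_apply"
        (hypothetical labels used only in this sketch).
        """
        self.journal.append(("begin", dict(updates)))
        if crash_at == "before_commit":
            raise RuntimeError("power lost before the commit record")
        self.journal.append(("commit",))
        for count, (key, value) in enumerate(updates.items()):
            if crash_at == "mid_apply" and count == 1:
                raise RuntimeError("power lost while updating in place")
            self.data[key] = value
        self.journal.clear()  # checkpoint: everything is applied

    def recover(self):
        """Replay committed transactions and discard uncommitted ones."""
        pending = None
        for record in self.journal:
            if record[0] == "begin":
                pending = record[1]
            elif record[0] == "commit" and pending is not None:
                self.data.update(pending)
                pending = None
        self.journal.clear()
```

After a simulated crash, `recover()` leaves the store in either the old state or the new one, never a half-applied mix. Note what this does not do: it never checks whether stored data has changed on its own, which is the gap that checksumming filesystems such as ZFS and Btrfs aim to fill.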