Advice needed on C# file compression/encryption

kingy0489 asked:
I have a fairly large data collection system, which stores its data in XML format.

This has two limitations:

1. Each file has approx. 1000 <reading> tags and contains a lot of repeated text; I would like to compress the files somehow to remove some of this bloat.

2. Data is stored in plaintext. The system is not human accessible, but for 21 CFR reasons I want at least some sort of rudimentary tamper protection.

I am using LINQ to XML quite heavily, but ideally I also want to compress, then encrypt, each file.

I am a bit stuck on how to continue here. Also, I generally access 30-100 of these files in quick succession. Is there a large processing overhead associated with compression and encryption in C#?
Avodah replied:
There will be an overhead, and it will be relative to the size of the data being compressed and encrypted, plus which encryption algorithm is used.

I believe compression and encryption in .NET are simple enough for you to knock up a quick example and check the performance for yourself.

Compression / Decompression using GZIP
http://www.csharphelp.com/2007/09/compress-and-decompress-strings-in-c/

Encryption / Decryption
http://support.microsoft.com/kb/307010
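
For example, something along these lines would let you benchmark both steps at once. It's only a rough sketch: the choice of AES and the key/IV parameters are illustrative (a real key should come from secure storage; AES wants a 16/24/32-byte key and a 16-byte IV). Note that compression has to come before encryption, because encrypted bytes look random and won't compress.

```csharp
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

class ArchivePipeline
{
    // Compress with GZip, then encrypt with AES, streaming file-to-file.
    public static void CompressThenEncrypt(string inputPath, string outputPath,
                                           byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        using (FileStream outFile = File.Create(outputPath))
        using (CryptoStream crypto = new CryptoStream(
                   outFile, aes.CreateEncryptor(key, iv), CryptoStreamMode.Write))
        using (GZipStream gzip = new GZipStream(crypto, CompressionMode.Compress))
        using (FileStream inFile = File.OpenRead(inputPath))
        {
            inFile.CopyTo(gzip); // bytes are compressed, then encrypted on the way out
        }
    }

    // Reverse pipeline: decrypt with AES, then decompress with GZip.
    public static void DecryptThenDecompress(string inputPath, string outputPath,
                                             byte[] key, byte[] iv)
    {
        using (Aes aes = Aes.Create())
        using (FileStream inFile = File.OpenRead(inputPath))
        using (CryptoStream crypto = new CryptoStream(
                   inFile, aes.CreateDecryptor(key, iv), CryptoStreamMode.Read))
        using (GZipStream gzip = new GZipStream(crypto, CompressionMode.Decompress))
        using (FileStream outFile = File.Create(outputPath))
        {
            gzip.CopyTo(outFile);
        }
    }
}
```

Timing 30-100 of your real archive files through this pipeline should tell you quickly whether the overhead is acceptable.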

DaTribe
ASKER CERTIFIED SOLUTION
Alexandre Simões (accepted answer; member-only content not shown)
kingy0489 (ASKER) replied:
Alex,

The system in question is running on an embedded device, recording approx. 10,000 readings per day (not too many, I know) and storing them in a database (MSSQL).

The MSSQL database stores the last 30 days' worth of data, which is used for reports and queries.

After 30 days however, the data is archived off into these XML files for long term storage.

The data has to be recorded for legal purposes, for approx 30 years.

For this reason, it was decided that using a database in the traditional sense was not viable, as it would:

a) get huge (exceeding the limits of MSSQL within a few years)
b) be difficult to adapt in future when/if demands change
c) not be possible to normalise, due to the immutability required of this data.

Also, the data in XML files can easily be posted over a web connection for remote storage, which is something we do on a regular basis, posting to servers in 4 countries. This connection is of course encrypted to a high standard, as we use the inter-tubes over SSL, with additional Blowfish!

Originally the plan was to use a database, but it was deemed not to be as "future proof" as storing the data in XML.

Not my decisions either, but something I have to work with.
Avodah replied:

A practical and sensible choice for archiving the data. I don't see any problem with doing it this way; however, I would advise some performance testing to see how the system holds up under heavy load, so you can determine whether you need to implement anything special.

I believe your performance requirements will centre on the time patterns of when the data is accessed. That could lead you to develop a caching algorithm, something like how Windows does paging. Simply put: cache the resources which are accessed most often, restrict the resources the cache may use, and have a clever algorithm to work out which XML documents to cache; a rough sketch follows.
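
As an illustration only, here is a minimal LRU sketch of that idea; the XmlArchiveCache name, the capacity limit, and the use of LINQ to XML's XDocument are my assumptions rather than anything prescribed. The load step is where any decrypt/decompress work would go.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

class XmlArchiveCache
{
    private readonly int _capacity;
    // Most-recently-used documents sit at the front of the list.
    private readonly LinkedList<KeyValuePair<string, XDocument>> _order =
        new LinkedList<KeyValuePair<string, XDocument>>();
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, XDocument>>> _map =
        new Dictionary<string, LinkedListNode<KeyValuePair<string, XDocument>>>();

    public XmlArchiveCache(int capacity) { _capacity = capacity; }

    public XDocument Get(string path)
    {
        LinkedListNode<KeyValuePair<string, XDocument>> node;
        if (_map.TryGetValue(path, out node))
        {
            // Hit: promote to the front so frequently used files stay resident.
            _order.Remove(node);
            _order.AddFirst(node);
            return node.Value.Value;
        }

        // Miss: load the document (decrypt/decompress would happen here).
        XDocument doc = XDocument.Load(path);
        _map[path] = _order.AddFirst(new KeyValuePair<string, XDocument>(path, doc));

        // Evict the least-recently-used document once over capacity.
        if (_map.Count > _capacity)
        {
            _map.Remove(_order.Last.Value.Key);
            _order.RemoveLast();
        }
        return doc;
    }
}
```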

DaTribe

Alexandre Simões replied:

Ok, thanks for the explanation.

After that, I can only advise you to enable Windows compression & encryption on the folder where you keep the files.
This way only permitted users will have access to those files, and the whole compression/encryption process will be transparent.

As these are text files, the compression ratio will be very good.
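
If it helps, EFS can be applied from code as well as from Explorer; here's a small sketch, with the archive folder path as a placeholder. File.Encrypt is Windows/NTFS only and ties decryption to the account that encrypted the file. Two caveats: the NTFS compression flag isn't exposed through the File class (set it on the folder via Explorer or compact.exe), and NTFS won't compress and EFS-encrypt the same file at once, so you have to pick one per file.

```csharp
using System.IO;

class FolderProtection
{
    // Applies NTFS EFS encryption to every XML archive in the folder.
    // Only the Windows account that ran this (plus any configured
    // recovery agent) will be able to read the files afterwards.
    public static void EncryptArchives(string folder)
    {
        foreach (string file in Directory.GetFiles(folder, "*.xml"))
        {
            File.Encrypt(file);
        }
    }
}
```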
Avodah replied:

As noted above, the system where the documents are stored is not human accessible; also, transparent encryption will defeat the purpose of securing the documents against unauthorised access if the storage system is compromised.

DaTribe
Alexandre Simões replied:

It all depends on how much you can afford to spend on the encryption and compression processes when you only need it to query that data.

Keep in mind that if that data is too big for you to keep in a database, it is still big when persisted to XML (I think it actually gets bigger, because XML needs more text to store the same value).
Also note that you no longer have a database engine to optimize the queries... all you have is the filesystem and an XML query language to ease the pain.

My advice to you is: don't get too excited trying to build a super-secure and highly compressed system, as it will blow up in your hands when things get huge.

As you asked in your question:
1. "some sort of rudimentary tamper protection"
    I would stick with Windows filesystem encryption.

Same thing for compression... you'll notice a huge amount of freed space after compressing the XML files folder.
kingy0489 (ASKER) replied:

@DaTribe, correct, the file directory is not available via normal navigation, and the archive is visible only through a web-based interface, to which I limit access based on permissions. The requirements of CFR are a little bit hazy in this respect, as in reality it would be very difficult for a malicious user to gain access to the file structure in the first place... but you never know.

@AlexCode, one idea I did have was to use only compression on the files, which should be quite effective. I only really know about image compression, but I would imagine the idea is along the same lines, in that repeated items (such as the XML tags) get shortened down to short codes etc.

Based on what has been said here, I am thinking of just compressing using a built in algorithm, and then storing some sort of checksum elsewhere, so that I know if files have been tampered with. Sound sensible?
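
A minimal sketch of that plan, assuming GZip as the built-in algorithm and HMACSHA256 as the checksum; the class name and the key handling are illustrative. Using a keyed hash rather than a plain SHA-256 means someone who can rewrite an archive can't also recompute a valid checksum without the key.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

class ArchiveIntegrity
{
    // Compress the XML file and return a tamper-detection tag
    // to be stored somewhere other than alongside the archive.
    public static string CompressAndTag(string xmlPath, string gzPath, byte[] key)
    {
        using (FileStream input = File.OpenRead(xmlPath))
        using (FileStream output = File.Create(gzPath))
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }

        using (HMACSHA256 hmac = new HMACSHA256(key))
        using (FileStream compressed = File.OpenRead(gzPath))
        {
            return Convert.ToBase64String(hmac.ComputeHash(compressed));
        }
    }

    // Recompute the tag and compare it with the stored one.
    public static bool IsUntampered(string gzPath, byte[] key, string storedTag)
    {
        using (HMACSHA256 hmac = new HMACSHA256(key))
        using (FileStream compressed = File.OpenRead(gzPath))
        {
            return Convert.ToBase64String(hmac.ComputeHash(compressed)) == storedTag;
        }
    }
}
```

The tags could live in the MSSQL database, or on one of the remote servers the files are already posted to.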
Alexandre Simões replied:

Seems OK to me...
My main point is to keep the overhead as low as possible.
Even for compression: if it's not critical, let it look nasty, as long as it is (and remains for as long as possible) FAST! :)
kingy0489 (ASKER) replied:

My worry is how big my hard disk will have to be!
Alexandre Simões replied:

Storage is an '80s problem, not today's...
Don't jeopardize performance because of storage; you'll soon find that performance is way more expensive :)

Basic 2 TB SATA HDs are very cheap...

Or buy a NAS... they get cheaper every day. They range from home solutions to more advanced systems supporting multiple RAID combinations, letting you trade IO performance against redundancy for hardware faults.

It all depends on how critical that information really is and how fast you need to access it.
The solutions are available; pick the right hardware for you, but again, never jeopardize performance because of storage.

Cheers!
Alex