wsyy

asked on

MongoDB vs file system

Hi,

I need to handle millions of JSON files. The processing involves multiple steps, and I save a copy of the processed files at each step.

Originally I wanted to save those files in the file system, but now I am considering MongoDB. I would like to know what the differences are in terms of speed.

Thanks
lexlythius

Assuming you're working with Linux, I would use Git for storing the files and handling the revisions.

Also, Linux filesystems come with handy tools like find and, better yet, locate, which maintains its own Berkeley DB to keep an index of files. Couple them with grep and regular expressions and you already have a lot of power.
If you want, you can then add a DB or Sphinx for indexing on top of that.
wsyy

ASKER

Hi,

I am talking about Java programming.
I'm not knowledgeable in MongoDB, but you may want to check out Hadoop.
So, you are taking a single file, making changes, saving a new file, making more changes, saving another new file, and then moving on to the next file.

Having done that, what would you be doing next with the files?

Keeping them in the filesystem is the simplest option. Each file is saved with its name and a date-time stamp; e.g.
file_0001-2011-09-05_16_48_48.json


sort of thing.

Depending upon how the data is to be accessed in the future, maybe you need to organize them in a structure ...

\0\0\0\1\file_0001-2011-09-05_16_48_48.json
\0\0\0\1\file_0001-2011-09-05_16_49_53.json
\0\0\0\2\file_0002-2011-09-05_16_48_48.json
\0\0\0\2\file_0002-2011-09-05_16_49_53.json


maybe.
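As a rough illustration of that naming scheme, here is a minimal Java sketch that builds a timestamped filename and the digit-per-level directory layout shown above; the base directory and the four-digit ID width are assumptions for the example:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampedSave {

    // Builds e.g. data/0/0/0/1/file_0001-2011-09-05_16_48_48.json
    static Path buildPath(Path base, int fileId) {
        String id = String.format("%04d", fileId);
        String stamp = new SimpleDateFormat("yyyy-MM-dd_HH_mm_ss").format(new Date());
        Path dir = base;
        for (char c : id.toCharArray()) {       // one directory level per digit
            dir = dir.resolve(String.valueOf(c));
        }
        return dir.resolve("file_" + id + "-" + stamp + ".json");
    }

    public static void main(String[] args) throws IOException {
        Path target = buildPath(Paths.get("data"), 1);
        Files.createDirectories(target.getParent());
        Files.write(target, "{\"example\": true}".getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + target);
    }
}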


Putting the JSON into a database, with a couple of columns for the name and modification datetime, is another option.
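A minimal sketch of that option over JDBC; the connection URL, table, and column names are made up for the example, and any relational database with a JDBC driver on the classpath would do:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class JsonToDb {
    public static void main(String[] args) throws Exception {
        // Connection URL, table and column names are assumptions for the example.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/files", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO json_files (name, modified_at, body) VALUES (?, ?, ?)")) {
            ps.setString(1, "file_0001.json");
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.setString(3, "{\"example\": true}");
            ps.executeUpdate();
        }
    }
}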


I'm still learning MongoDB. Whilst its main point is that it is a schema-less data store, you still need to be able to key the documents in some form, so again a name and a modification key would be needed/useful.
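Purely for illustration, storing a document keyed by name and modification date might look roughly like this with the current MongoDB Java driver; the database, collection, and field names are assumptions, not anything from this thread:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Date;

public class JsonToMongo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> files =
                    client.getDatabase("pipeline").getCollection("json_files");
            // Parse the raw JSON and attach the "key" fields discussed above.
            Document doc = Document.parse("{\"example\": true}")
                    .append("name", "file_0001.json")
                    .append("modifiedAt", new Date());
            files.insertOne(doc);
        }
    }
}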


All of this though depends upon what you will be doing with the millions of json files once you've processed them.

A VCS is going to allow you to quickly compare the differences using a web browser. Any number of version control systems are available for local use (I use SVNServer with TortoiseSVN and SlikSVN for Windows, along with websvn, a PHP-based SVN repo viewer).
aikimark
Please give us more details about the requirements of your handling process.
wsyy

ASKER

I need to process the original HTML in a few steps. The first is parsing it and saving the results in a JSON file.

The next step is to evaluate the results by opening the JSON file and performing the evaluation. The evaluation step saves its results into another JSON file because I want to keep historical records of each step.

The remaining steps are similar to the second one, and each of them makes a copy of its results in a new JSON file.

So we have many interim results in different json files.
wsyy

ASKER

I want to know whether I can save disk space and gain speed by using MongoDB instead of the filesystem.

Some say MongoDB requires more space than the filesystem, and that the speed gain is not significant.
ASKER CERTIFIED SOLUTION
Richard Quadling

Right. I've just imported that data into MongoDB.

The resultant store is in 5 files totalling 268,435,456 bytes.

The drive is compressed and that reduces it to 155,148,288 bytes.

So, between MongoDB and the Filesystem, with compression, the difference is about 8%.

In the uncompressed state the difference is about 24%.


Out of interest, I RAR'd the files: 7,352,203 bytes (the compressed data only takes 6,403,992 bytes; the remainder is the filenames/times/etc.).

So, uncompressed text 347MB. Compressed RAR on NTFS 6.7MB.

Speed ... well. Reading RAR files is slow, as RAR compresses all the files into a single stream so that similar content can be reduced even if it lives in a different file. ZIP files compress each file individually, which creates larger archives but is faster when extracting a single file.
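To illustrate the per-file random access a ZIP archive gives you, here is a small Java sketch that pulls a single entry out of an archive without reading the rest; the archive and entry names are hypothetical:

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ReadOneEntry {
    public static void main(String[] args) throws Exception {
        // Open the archive and extract one entry without touching the others.
        try (ZipFile zip = new ZipFile("step1.zip")) {
            ZipEntry entry = zip.getEntry("file_0001-2011-09-05_16_48_48.json");
            try (InputStream in = zip.getInputStream(entry)) {
                String json = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                System.out.println(json);
            }
        }
    }
}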

That said, your data could be completely different in size.



And another point of interest: I just RAR'd the MongoDB store. 4,104,738 bytes for the data, with 4,105,238 bytes on disk. Completely useless for live use, mind, but squeezing a data store of nearly 350MB into just under 4MB could be useful for long-term storage.
I have so many questions.  Here are a few:
1. What is the ratio of read to write for each file?
2. What is the read:write ratio for each step?
3. What is the selection criteria for the files other than the name?
4. Once the process is finished, what do you need to do with the intermediate json files?
5. What is the read and write pattern for the files? (sequential or random)
6. What are your performance criteria?
7. Do all json files change between steps?  If not, what percentage of files change?
8. What kind of data are you storing in the json files?

=======
If you are just keeping the intermediate files around for step restarts, then the simplest configuration would be to have a directory for each step.  There would likely be sub-directories based on some logical content grouping.  Starting with the second step, once a step completes, the prior step's directory tree is compressed.  This compression can take place in the background in a separate process and can use a max/ultra-compression algorithm.  It is possible to do the compression to a different drive or different medium.  Once the compression step is finished, you can delete the source directory tree.

If you do need to restart, the (now) worthless content can be deleted and the starting point's content can be restored.
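One way that background compression could look in Java; a sketch only, with made-up directory and archive names, using a standard ZipOutputStream in a single background thread (a max/ultra-compression external tool could be substituted):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ArchivePriorStep {
    static final ExecutorService archiver = Executors.newSingleThreadExecutor();

    // Zip the previous step's directory tree in the background while the next step runs.
    static void archiveInBackground(Path stepDir, Path zipFile) {
        archiver.submit(() -> {
            try {
                List<Path> files;
                try (Stream<Path> walk = Files.walk(stepDir)) {
                    files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
                }
                try (ZipOutputStream zos =
                             new ZipOutputStream(Files.newOutputStream(zipFile))) {
                    for (Path p : files) {
                        zos.putNextEntry(new ZipEntry(stepDir.relativize(p).toString()));
                        Files.copy(p, zos);   // stream the file into the archive entry
                        zos.closeEntry();
                    }
                }
                // Once the archive is verified, the source tree could be deleted here.
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
    }

    public static void main(String[] args) {
        archiveInBackground(Paths.get("step1"), Paths.get("step1.zip"));
        archiver.shutdown();
    }
}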
@wsyy

On what OS is your Java code running?
wsyy

ASKER

I intend to run it on CentOS 5.5.
Without the 32-bit Windows memory restrictions, you should be able to hold two generations of the json data in separate data structures and archive them as part of your inter-step process.  The archive (compression) process can run asynchronously.

You can do this programmatically (list or collection), or with a simulated in-memory disk drive (ramdisk/ramdrive).  The Linux kernel provides you with this feature.  Even with the simulated directory activity, the performance runs circles around actual I/O to a hard drive.
http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
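A rough Java sketch of that idea, assuming /dev/shm (the tmpfs mount most Linux distributions, including CentOS, provide) is available and large enough; all paths and filenames here are examples only:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;
import java.util.zip.GZIPOutputStream;

public class RamdiskStep {
    public static void main(String[] args) throws IOException {
        // Work against tmpfs so the inter-step reads/writes never hit the hard drive.
        Path ram = Paths.get("/dev/shm/pipeline/step2");
        Files.createDirectories(ram);
        Path json = ram.resolve("file_0001.json");
        Files.write(json, "{\"example\": true}".getBytes(StandardCharsets.UTF_8));

        // Archive the previous generation asynchronously while the next step runs.
        Path archive = Paths.get("/data/archive/file_0001.json.gz");
        Files.createDirectories(archive.getParent());
        CompletableFuture.runAsync(() -> {
            try (GZIPOutputStream gz =
                         new GZIPOutputStream(Files.newOutputStream(archive))) {
                Files.copy(json, gz);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }).join();   // join() only so this demo waits before exiting
    }
}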
wsyy

ASKER

RQuadling:

Thanks a lot for the detailed explanation. What is your conclusion about choosing MongoDB vs the filesystem?
wsyy

ASKER

1. What is the ratio of read to write for each file?
--likely 50 to 50

2. What is the read:write ratio for each step?
--every JSON file will be processed at each step, and the ratio of read vs write is 5 to 2

3. What is the selection criteria for the files other than the name?
--no other criteria

4. Once the process is finished, what do you need to do with the intermediate json files?
--I would like to save them for a long while, say 60 days, then discard them

5. What is the read and write pattern for the files? (sequential or random)
--seemingly random. need to evaluate the contents before making decisions

6. What are your performance criteria?
--The faster the better. I hope we can handle thousands of files a second by using multithreading (see the sketch after this list).

7. Do all json files change between steps?  If not, what percentage of files change?
--Not really; we won't know until content evaluation. Likely 60-70% of the files will be changed.

8. What kind of data are you storing in the json files?
--Normal JSON content, mostly strings.
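On the "thousands of files a second with multithreading" point from question 6, here is a minimal sketch of the kind of worker pool that could drive one step; the processOne body is a placeholder for the real evaluation logic, and the directory names are examples:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StepRunner {
    // Placeholder for the real per-file evaluation logic.
    static void processOne(Path in, Path outDir) throws IOException {
        byte[] content = Files.readAllBytes(in);
        Files.write(outDir.resolve(in.getFileName()), content);
    }

    public static void main(String[] args) throws Exception {
        Path inDir = Paths.get("step1");
        Path outDir = Paths.get("step2");
        Files.createDirectories(outDir);

        List<Path> files;
        try (Stream<Path> s = Files.list(inDir)) {
            files = s.collect(Collectors.toList());
        }

        // One worker per core; each file is an independent task.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (Path p : files) {
            pool.submit(() -> {
                try {
                    processOne(p, outDir);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}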
For such a short-term cache (60 days), personally I'd stick with the file system.

I'm not seeing any realistic benefit of using MongoDB in this instance.

The other advantage of using the file system is that any file can be retrieved manually and instantly, without any specialist coding, from any language or even the OS's file browser.
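If you do stay with the filesystem, expiring the 60-day cache mentioned above can be a simple sweep. A sketch, assuming the files live under a single root directory (the path is an example):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

public class ExpireOldFiles {
    public static void main(String[] args) throws IOException {
        Instant cutoff = Instant.now().minus(60, ChronoUnit.DAYS);
        try (Stream<Path> walk = Files.walk(Paths.get("data"))) {
            walk.filter(Files::isRegularFile)
                .filter(p -> {
                    try {
                        return Files.getLastModifiedTime(p).toInstant().isBefore(cutoff);
                    } catch (IOException e) {
                        return false;
                    }
                })
                .forEach(p -> {
                    try {
                        Files.delete(p);   // drop anything older than 60 days
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
        }
    }
}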


--... the ratio of read vs write is 5 to 2
--no other criteria (selection criteria...other than the name)

5. What is the read and write pattern for the files? (sequential or random)
--seemingly random. need to evaluate the contents before making decisions

These three answers lead me to think that you are linking/joining/merging data or items.  Please supply more details about the processing you are doing and the "decisions" that need to be made.
wsyy

ASKER

For example, we need to remove data that was originally loaded into the JSON files, because at the time we didn't know the data was noise.

We also need to analyze some related content in the JSON files and save the analysis results as new items in the file.

Also, we need to analyze multiple files when they are related to each other, and come up with results which will be saved into new files.

That is pretty much what we are doing.
How are these files 'related' to each other?