Order of Magnitude of SELECT query.

Posted on 2013-02-01
Last Modified: 2013-05-07
Hey Experts,

I have a good algorithm question for you.

First off, I'm trying to implement incremental updating differently than before. At the moment, the code would walk over every file in a folder using depth-first search, and track where I was and what version of each file I had already documented using a hierarchy-based data structure (folder and file representations as objects).

I'll call this hierarchy-based data structure - i.e. the object-based folder and file representations - the "Directory Representation".

So, as the code stepped through the filesystem, it would also step through my Directory Representation, comparing files as I went - and picking out differences. This would allow me to determine what had changed, been added, or been deleted since the last sync once we were all done. Cool stuff.

a) So, in a worst case scenario, the directory is only 1 level deep with n files in it. This would mean that for each file, I would have to search the Directory Representation's ~n items, making the full comparison ~O(n^2).

b) This doesn't usually happen because usually there are a lot of folders, and I'm already in the correct folder in the Directory Representation, so I only need to search that specific directory's files in the Directory Representation to compare. I think this is what makes it MUCH more efficient. (It is very quick)

In reality, the Directory Representation uses a Sorted List, so each search is more like O(log n), and searching it n times would be more like O(n log n) for each directory's files. (nice)
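The sorted-list lookup described above can be sketched like this (file names are invented; the real Directory Representation holds objects, not strings):

```python
import bisect

# Hypothetical sketch: the sorted file list for one folder in the
# Directory Representation, searched once per file found on disk.
stored = sorted(["a.txt", "b.txt", "d.txt"])

def find(name):
    # Binary search: O(log n) per lookup instead of a linear scan.
    i = bisect.bisect_left(stored, name)
    return i < len(stored) and stored[i] == name

on_disk = ["a.txt", "c.txt", "d.txt"]
# n lookups at O(log n) each -> O(n log n) for the folder.
results = {name: find(name) for name in on_disk}
```

Each `False` result marks a file that exists on disk but not in the last-synced state.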

SO: Now to my question.

We are thinking about replacing this hierarchy based Directory Representation with a Database.

This is troublesome because I think of a database as a 1-D structure, so every time the code needs to check the previous state of a file, a SELECT query would be used, and the ENTIRE database of files would need to be searched. In my view, this would make the incremental updater much slower when there are large amounts of files involved - and it would make the problem more like the worst case scenario (a).

So, can anyone suggest ways to search a database n times when the database is of size n (the database represents a folder and file hierarchy) so that the overall comparison is an order of magnitude or two faster than O(n^2)?

So I guess this is a database sorting/optimization problem...

We're using SQLite for now...

Thanks for any help or opinions!

Question by:jeffiepoo
LVL 45

Assisted Solution

aikimark earned 300 total points
ID: 38846717
For a database table that is expected to see a lot of searching, or that needs to be joined with another table, you would normally create an index.  Index lookups are usually O(log N), depending on the number of columns and their types.

However, with a database, you can approach your problem differently and leverage the power of the database engine.  For instance, you can populate a table with data from the existing file system and join it with the table of prior (status) data, looking for differences.  This would normally require three queries (different size/date for same path; new files in path; deleted files in path).
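The three-query diff described above could be sketched like this in SQLite (table and column names here are invented for illustration; `previous` holds the last-synced state, `current` is freshly scanned from disk):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE previous (path TEXT PRIMARY KEY, size INT, mtime INT);
    CREATE TABLE current  (path TEXT PRIMARY KEY, size INT, mtime INT);
""")
con.executemany("INSERT INTO previous VALUES (?,?,?)",
                [("a.txt", 10, 1), ("b.txt", 20, 1), ("c.txt", 30, 1)])
con.executemany("INSERT INTO current VALUES (?,?,?)",
                [("a.txt", 10, 1), ("b.txt", 25, 2), ("d.txt", 40, 1)])

# 1) changed: same path, different size/date
changed = con.execute("""
    SELECT c.path FROM current c JOIN previous p ON c.path = p.path
    WHERE c.size != p.size OR c.mtime != p.mtime
""").fetchall()

# 2) new: in current, not in previous
new = con.execute("""
    SELECT c.path FROM current c LEFT JOIN previous p ON c.path = p.path
    WHERE p.path IS NULL
""").fetchall()

# 3) deleted: in previous, not in current
deleted = con.execute("""
    SELECT p.path FROM previous p LEFT JOIN current c ON p.path = c.path
    WHERE c.path IS NULL
""").fetchall()
```

Each query is one set operation over both tables, so the engine can use the primary-key index on `path` instead of scanning the whole table per file.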
LVL 51

Accepted Solution

Mark Wills earned 100 total points
ID: 38846852
I think I am misunderstanding the point...

You have a directory representation and you need to check for file changes, so what is the actual comparison? Sounds like you need to search every folder anyway, because that represents the "new".

Or, do you gather a new file, and then go searching for where it should now live?

There has to be a double-sided action ("what is" vs "something new"). So, the search is not the only consideration.

It started to sound like your DB approach was going to store a file as if it were a table, and that's when I got a bit confused by your scenario.

You need to design the database to best utilise the power of the database, so you add constructs like "date last changed" and "date synched" so you can do your searches using a "set" style approach rather than RBAR (row-by-agonising-row). I think the engine's search performance will be significant - simply because that is what database engines are designed to do, and their vendors spend vast amounts of effort and resources making sure it is as efficient as possible.

While you might need to visit a folder then scan for files, an index within the DB might contain several pointers on the one page (the building block of a DB), and the engine is already caching parts of the DB. So, I don't think it is a simple matter of how many lookups; it is how you structure your database that determines the (sometimes profoundly significant) efficiencies gained in moving to a database structure.

Author Comment

ID: 38847726
I like the index suggestion - this would probably be the best option with a database.

Say I have a database made up of a single table with 5 columns. How would I index the database on all 5 columns?

If I indexed the database this way, would a SELECT statement be faster in every circumstance?

Sorry I'm kinda new to this. Sorry I can't help you if you didn't understand my post - it's complicated and I read it 3 times to make sure it was clear and it wouldn't be worth the typing room to re-explain it a bunch of times.

Thanks for the help guys!

LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 100 total points
ID: 38847791
You can create multiple indexes, each over one or more columns. You would need to do a bit of analysis to make sure the indexes are actually worthwhile and to decide to what extent you actually index.
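To make this concrete, here is a sketch in SQLite (the asker's engine); the table and column names are invented. Rather than indexing all 5 columns blindly, index the columns your WHERE clauses actually filter on:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE files (path TEXT, name TEXT, size INT, mtime INT, hash TEXT)"
)
# The lookup key used on every comparison:
con.execute("CREATE UNIQUE INDEX idx_path ON files (path)")
# A secondary index for change detection by date:
con.execute("CREATE INDEX idx_mtime ON files (mtime)")

# The created indexes are visible in sqlite_master:
indexes = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

With `idx_path` in place, a `SELECT ... WHERE path = ?` becomes an index search rather than a full table scan.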

It also depends a bit on which database you use as to how best to structure the tables and subsequently the indexes. They all have very subtle variations (because they each reckon they know how to perform best).

So, you first need to try to understand (and it can be a bit of a black art) what the query optimizer is doing...

Also read up on how to create an index.

LVL 21

Assisted Solution

developmentguru earned 50 total points
ID: 38864496
There is a way to get rid of the log(N) factor in your current search.  If both lists are sorted, then you can simply walk the lists together.  Wherever the lists match, increment both.  When they do not match, the lowest is the "odd man out"; depending on which list it is in, you would know which way the file needs to move.  In the odd-man-out scenario you would only move to the next item on the list with the lesser file name value.

By walking the lists you never search for individual items.  This should address the speed issue of your worst case scenario especially well.
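The merge walk described above can be sketched like this (file names are invented; `old` is the last-synced state, `new` is the current scan, both sorted):

```python
def diff_sorted(old, new):
    added, deleted = [], []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:            # match: increment both
            i += 1
            j += 1
        elif old[i] < new[j]:           # odd man out in old -> deleted
            deleted.append(old[i])
            i += 1
        else:                           # odd man out in new -> added
            added.append(new[j])
            j += 1
    deleted += old[i:]                  # leftovers at the tail
    added += new[j:]
    return added, deleted

# One O(n + m) pass, no per-item searching at all.
added, deleted = diff_sorted(["a", "b", "d"], ["a", "c", "d", "e"])
```

This is the same two-pointer merge used in merge sort, which is why it beats n separate lookups even against an indexed table.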

Depending on the database you use, you may want to think about storing the list of files in a BLOB or CLOB field (some N sized database field for holding text).  This would allow you to pass your current list (read from the file system), and a user ID, to a stored procedure on the server which could do the same walking operation and produce normal database output... say one column for the operation (upload or download) and the second column for the file name.
LVL 29

Assisted Solution

fibo earned 50 total points
ID: 38904438
1 - You can build an index combining several fields, e.g. name, date, size
2 - In the case of MySQL, this index would also be usable for queries on name alone, or on name and date (the leftmost-prefix rule)
3 - If it proves efficient, you may choose to build the index not on the whole content of the field(s) but on the n leftmost characters (a prefix index)

You may want to experiment with different index strategies: create several candidates, then use the "EXPLAIN" command on a SELECT: it will tell you which indexes have been considered and which ones, if any, have been used (in the case of MySQL, it uses only one of the available indexes per table, this selection being made at run-time). If none are used... then you should reconsider, since you are in the worst case. Maybe try some other index?
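Since the question uses SQLite rather than MySQL, the equivalent tool is EXPLAIN QUERY PLAN. A sketch comparing the plan before and after adding an index (table and column names invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (name TEXT, mtime INT, size INT)")

# No index yet: the planner can only do a full table scan.
before = str(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE name = 'a'"
).fetchall())

con.execute("CREATE INDEX idx_name ON files (name, mtime)")

# Same query now uses the index.
after = str(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE name = 'a'"
).fetchall())
# 'before' reports a SCAN of files; 'after' reports a SEARCH using idx_name.
```

If the plan still reports a scan after you add an index, the optimizer has decided your index doesn't help that query, which is the signal to redesign it.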

If you want to detect whether a file's content has changed or is similar to some other file's content, you should also consider storing in your database a hash of the content, e.g. the MD5 (or another digest) of the content: if 2 files have 2 different hashes, they are different (but if they have the same hash, check their sizes, and if those are identical you will need to compare the actual content).
Indexing on the hash code might also prove a good idea, but again here some experimentation is needed.
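The hash-column idea above might look like this (a minimal sketch; in practice you would read real files in chunks from disk rather than from an in-memory byte string):

```python
import hashlib

def content_hash(data: bytes) -> str:
    h = hashlib.md5()
    # Feed in fixed-size chunks so large files never need to fit
    # in memory all at once.
    for i in range(0, len(data), 8192):
        h.update(data[i:i + 8192])
    return h.hexdigest()

old = content_hash(b"hello world")
new = content_hash(b"hello world!")
# Different hashes -> the contents definitely differ.
changed = old != new
```

Storing `content_hash` alongside size and mtime lets the diff queries catch content changes that a same-size, same-date edit would otherwise hide.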

One may be tempted to create lots of indexes, assuming that this increases performance. It does, but only at SELECT time, and it comes at the price of a performance drop at update time: the more indexes you have, the more must be updated with the new data.
LVL 29

Expert Comment

ID: 38948950

Where do you stand on this?

Author Comment

ID: 39085243
I need more time.

Author Closing Comment

ID: 39146473
All great suggestions thank you!
