Order of Magnitude of SELECT query.

Posted on 2013-02-01
Last Modified: 2013-05-07
Hey Experts,

I have a good algorithm question for you.

First off, I'm trying to implement incremental updating differently than before. At the moment, the code walks over every file in a folder using depth-first search, and tracks where I am and what version of each file I've already documented using a hierarchy-based data structure (folder and file representations as objects).

I'll call this hierarchy-based data structure - i.e. the object-based folder and file representations - the "Directory Representation".

So, as the code stepped through the filesystem, it would also step through my Directory Representation, comparing files as it went and picking out differences. This would let me determine what had changed, been added, or been deleted since the last sync once we were done. Cool stuff.

a) So, in the worst-case scenario, the directory is only 1 level deep with n files in it. This would mean that for each file, I would have to search my Directory Representation's ~n items, making the full comparison ~O(n^2).

b) This doesn't usually happen, because there are usually a lot of folders, and since I'm already in the correct folder of the Directory Representation, I only need to search that specific directory's files to compare. I think this is what makes it MUCH more efficient. (It is very quick.)

In reality, the Directory Representation uses a sorted list, so each search is more like O(log n), and searching it n times is more like O(n log n) per directory's files. (Nice.)
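A minimal sketch of that per-directory lookup, assuming each folder in the Directory Representation keeps its file names in a sorted list (the names and version numbers here are made up for illustration):

```python
import bisect

def find_version(sorted_names, versions, name):
    """Binary-search one directory's sorted file list: O(log n) per lookup."""
    i = bisect.bisect_left(sorted_names, name)
    if i < len(sorted_names) and sorted_names[i] == name:
        return versions[i]
    return None  # file was not present at the last sync

# One directory's prior state, kept sorted by name.
names = ["a.txt", "b.txt", "d.txt"]
vers  = [1, 3, 2]

print(find_version(names, vers, "b.txt"))  # 3
print(find_version(names, vers, "c.txt"))  # None (new file)
```

Doing this lookup once per file on disk gives the O(n log n) per-directory cost described above.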

SO: Now to my question.

We are thinking about replacing this hierarchy based Directory Representation with a Database.

This is troublesome because I think of a database as a 1-D structure: every time the code needs to check the previous state of a file, a SELECT query would be used, and the ENTIRE database of files would need to be searched. In my view, this would make the incremental updater much slower when large numbers of files are involved - it would make the problem more like worst-case scenario (a).

So, can anyone suggest ways to search a database n times, when the database itself is of size n (the database represents a folder and file hierarchy), so that the overall comparison is an order of magnitude or two faster than O(n^2)?

So I guess this is a database sorting/optimization problem...

We're using SQLite for now...

Thanks for any help or opinions!

Question by:jeffiepoo
LVL 45

Assisted Solution

aikimark earned 300 total points
ID: 38846717
With a database table that you expect to search heavily, or that needs to be joined with another table, you would normally create an index. Index lookups are usually O(log N) efficient, depending on the number of columns and their types.

However, with a database, you can approach your problem differently and leverage the power of the database engine.  For instance, you can populate a table with data from the existing file system and join it with the table of prior (status) data, looking for differences.  This would normally require three queries (different size/date for same path; new files in path; deleted files in path).
LVL 51

Accepted Solution

Mark Wills earned 100 total points
ID: 38846852
I think I am misunderstanding the point...

You have a Directory Representation and you need to check for file changes, so what is the actual comparison? It sounds like you need to scan every folder anyway, because that represents the "new".

Or, do you gather a new file, and then go searching for where it should now live ?

There has to be a double-sided action ("what is" vs "something new"). So, the search is not the only consideration.

It started to sound like your DB approach was going to store a file as if it were a table, and that's when I got a bit confused by your scenario.

You need to design the database to best utilise the power of the database, so you invent constructs like "date last changed" and "date synched" so you can do your searches using a "set"-style approach rather than RBAR (row-by-agonising-row). I think the engine's search response will be significant - simply because that is what engines are designed to do, and vendors spend vast amounts of effort and resource making sure it is as efficient as possible.

While you might need to visit a folder and then scan for files, an index within the DB might hold several pointers on the one page (the building block of a DB), and the engine is already caching parts of the DB. So I don't think it is simply a matter of how many lookups; how you structure your database determines the (sometimes profoundly significant) efficiencies you gain in moving to a database structure.
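A sketch of that "date last changed" vs "date synched" idea - the `files` table and its `changed`/`synced` timestamp columns are illustrative, not part of the asker's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT PRIMARY KEY, changed INTEGER, synced INTEGER)")
con.executemany("INSERT INTO files VALUES (?, ?, ?)",
                [("a.txt", 100, 150),   # unchanged since last sync
                 ("b.txt", 200, 150),   # modified after last sync
                 ("c.txt", 120, 150)])  # unchanged since last sync

# One set-based query finds everything touched since the last sync,
# instead of checking files row-by-agonising-row.
stale = con.execute(
    "SELECT path FROM files WHERE changed > synced ORDER BY path").fetchall()
print(stale)  # [('b.txt',)]
```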

Author Comment

ID: 38847726
I like the index suggestion - this would probably be the best option with a database.

Say I have a database made up of a single table with 5 columns. How would I index the table on all 5 columns?

If I indexed the database this way, would a SELECT statement be faster in every circumstance?

Sorry, I'm kinda new to this. And sorry, I can't help you if you didn't understand my post - it's complicated, and I read it 3 times to make sure it was clear; it wouldn't be worth the typing room to re-explain it a bunch of times.

Thanks for the help guys!


LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 100 total points
ID: 38847791
You can index multiple times, over multiple columns. You would need to do a bit of analysis to make sure the indexes are actually worthwhile, and to decide to what extent you actually index.

It also depends a bit on which database as to how best to structure the tables and, subsequently, the indexes. They all have subtle variations (because they each reckon they know how best to perform).

So, you first need to try to understand (and it can be a bit of a black art) what the query optimizer is doing...

Also read up on how to create an index.
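To make that concrete in SQLite: rather than one index "on all 5 columns", you usually create one composite index per access pattern. A sketch - the table and index names are my own illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files
               (path TEXT, name TEXT, size INTEGER, mtime INTEGER, hash TEXT)""")

# One composite index over the columns a query filters on together...
con.execute("CREATE INDEX idx_files_path_name ON files (path, name)")
# ...and a separate index for a query that filters on a different column.
con.execute("CREATE INDEX idx_files_mtime ON files (mtime)")

# SQLite records every index in sqlite_master.
idx = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(idx)
```

A SELECT is not faster "in every circumstance": an index only helps queries whose WHERE clause matches the indexed columns, and every extra index slows INSERT/UPDATE.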
LVL 21

Assisted Solution

developmentguru earned 50 total points
ID: 38864496
There is a way to get rid of the log(N) in your current search. If both lists are sorted, then you can simply walk the lists together. Wherever the lists match, increment both. When they do not match, the lesser entry is the "odd man out"; depending on which list it is in, you know which way the file needs to move. In the odd-man-out scenario, you only advance the list with the lesser file-name value.

By walking the lists you never search for individual items.  This should address the speed issue of your worst case scenario especially well.
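The walk described above might look like this - a sketch assuming each list holds sorted (name, version) pairs, with made-up data:

```python
def diff_sorted(prior, current):
    """Walk two name-sorted lists of (name, version) pairs in one O(n + m) pass."""
    added, deleted, changed = [], [], []
    i = j = 0
    while i < len(prior) and j < len(current):
        if prior[i][0] == current[j][0]:
            if prior[i][1] != current[j][1]:
                changed.append(current[j][0])
            i += 1; j += 1                   # names match: advance both
        elif prior[i][0] < current[j][0]:
            deleted.append(prior[i][0]); i += 1   # odd man out on the prior side
        else:
            added.append(current[j][0]); j += 1   # odd man out on the current side
    deleted += [n for n, _ in prior[i:]]     # anything left in prior was deleted
    added += [n for n, _ in current[j:]]     # anything left in current is new
    return added, deleted, changed

prior   = [("a.txt", 1), ("b.txt", 1), ("c.txt", 2)]
current = [("a.txt", 1), ("b.txt", 2), ("d.txt", 1)]
print(diff_sorted(prior, current))  # (['d.txt'], ['c.txt'], ['b.txt'])
```

No per-item search at all, so even the flat worst-case directory (a) becomes linear.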

Depending on the database you use, you may want to think about storing the list of files in a BLOB or CLOB field (some N sized database field for holding text).  This would allow you to pass your current list (read from the file system), and a user ID, to a stored procedure on the server which could do the same walking operation and produce normal database output... say one column for the operation (upload or download) and the second column for the file name.
LVL 29

Assisted Solution

fibo earned 50 total points
ID: 38904438
1 - You can build an index combining several fields, e.g. name, date, size
2 - In the case of MySQL, this index is also usable for queries on name alone, or on name, date (the leftmost-prefix rule)
3 - If it helps efficiency, you may build the index not from the whole content of the field(s) but from only the n leftmost characters

You may want to experiment with different index strategies: create several candidates, then use the EXPLAIN command on a SELECT: it tells you which indexes were considered and which ones, if any, were actually used (in the case of MySQL, only one of the available indexes is used per table, and the selection is made at run-time). If none was used, then you should reconsider, since you are in the worst case. Maybe try some other index?
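SQLite (which the asker is using) spells this EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN. A sketch with an illustrative schema - note the exact wording of the plan text varies between SQLite versions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (name TEXT, date INTEGER, size INTEGER)")
con.execute("CREATE INDEX idx_name_date ON files (name, date)")

# The composite index serves a query on its leftmost column alone...
plan1 = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE name = 'a.txt'").fetchall()
print(plan1[0][-1])  # plan text mentions "USING INDEX idx_name_date"

# ...but not a query on `date` alone, which falls back to a full table scan.
plan2 = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE date = 100").fetchall()
print(plan2[0][-1])  # plan text is a SCAN of the table
```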

If you want to detect whether a file's content has changed, or is identical to some other file's content, you should also consider storing in your database a hash of the content, e.g. the MD5 (or another digest): if 2 files have different hashes, they are different (but if they have the same hash, check their sizes, and if those are identical too you will need to compare the actual content).
Indexing on the hash code might also prove a good idea, but again, some experimentation is needed here.
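A chunked MD5 sketch using Python's hashlib (the helper name and temp-file demo are my own):

```python
import hashlib, os, tempfile

def file_hash(path, chunk=65536):
    """MD5 of a file's content, read in chunks so large files stay cheap."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Identical bytes hash identically; one changed byte does not.
with tempfile.TemporaryDirectory() as d:
    p1, p2 = os.path.join(d, "a"), os.path.join(d, "b")
    open(p1, "wb").write(b"hello")
    open(p2, "wb").write(b"hello!")
    print(file_hash(p1) == file_hash(p2))  # False
```

Storing this digest per row turns "has the content changed?" into a cheap indexed string comparison instead of a byte-by-byte file read.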

One may be tempted to create lots of indexes, assuming that this increases performance. It does, but only at SELECT time, and at the price of a performance drop at update time: the more indexes you have, the more of them need to be updated with the new data.
LVL 29

Expert Comment

ID: 38948950

Where do you stand on this?

Author Comment

ID: 39085243
I need more time

Author Closing Comment

ID: 39146473
All great suggestions thank you!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

These days, all we hear about hacktivists took down so and so websites and retrieved thousands of user’s data. One of the techniques to get unauthorized access to database is by performing SQL injection. This article is quite lengthy which gives bas…
This article will inform Clients about common and important expectations from the freelancers (Experts) who are looking at your Gig.

697 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question