
Solved

Order of Magnitude of SELECT query.

Posted on 2013-02-01
375 Views
Last Modified: 2013-05-07
Hey Experts,

I have a good algorithm question for you.

First off, I'm trying to implement incremental updating differently than before. At the moment, the code walks over every file in a folder using depth-first search, and tracks where I am and what version of each file I've already documented using a hierarchy-based data structure (folder and file representations as objects).

I'll call this hierarchy-based data structure (i.e., the object-based folder and file representations) the "Directory Representation".

So, as the code stepped through the filesystem, it would also step through my Directory Representation, comparing files as it went and picking out differences. This would allow me to determine, when we were all done, what had changed, been added, or been deleted since the last sync. Cool stuff.

a) So, in the worst-case scenario, the directory is only 1 level deep with n files in it. This would mean that for each file, I would have to search my Directory Representation's ~n items, making it an ~O(n^2) search to compare every file.

b) This doesn't usually happen, because there are usually a lot of folders and I'm already in the correct folder in the Directory Representation, so I only need to search that specific directory's files in the Directory Representation to compare. I think this is what makes it MUCH more efficient. (It is very quick.)

In reality, the Directory Representation uses a sorted list, so each search is more like O(log n), and searching it n times is more like O(n log n) for each directory's files. (Nice.)
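As a reference point, the per-directory sorted-list lookup could be sketched with Python's bisect module (a hypothetical simplification of the Directory Representation; the (name, mtime) entry format is an assumption, not the asker's actual structure):

```python
import bisect

def find_entry(entries, name):
    """Binary-search a list of (name, mtime) tuples sorted by name: O(log n)."""
    # (name,) compares less than (name, anything), so bisect_left lands on
    # the first entry whose name is >= the one we want.
    i = bisect.bisect_left(entries, (name,))
    if i < len(entries) and entries[i][0] == name:
        return entries[i]
    return None

# Snapshot of one directory, kept sorted by file name.
snapshot = sorted([("a.txt", 100), ("b.txt", 200), ("c.txt", 300)])
print(find_entry(snapshot, "b.txt"))   # -> ('b.txt', 200)
print(find_entry(snapshot, "z.txt"))   # -> None
```

Doing this lookup once per file on disk gives the O(n log n) per-directory cost described above.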

SO: Now to my question.

We are thinking about replacing this hierarchy based Directory Representation with a Database.

This is troublesome because I think of a database as a 1-D structure, so every time the code needs to check the previous state of a file, a SELECT query would be used, and the ENTIRE database of files would need to be searched. In my view, this would make the incremental updater much slower when large numbers of files are involved, and it would make the problem more like the worst-case scenario (a).

So, can anyone suggest ways to search a database n times, when the database is of size n (the database represents a folder and file hierarchy), so that the overall comparison is an order of magnitude or two faster than O(n^2)?

So I guess this is a database sorting/optimization problem...

We're using SQLite for now...

Thanks for any help or opinions!

-Jeff
Question by:jeffiepoo
9 Comments
 
LVL 46

Assisted Solution

by:aikimark
aikimark earned 1200 total points
ID: 38846717
With a database table that is expected to be searched heavily, or to be joined with another table, you would normally create an index. Index lookups are usually O(log N) efficient, depending on the number of columns and their types.

However, with a database, you can approach your problem differently and leverage the power of the database engine.  For instance, you can populate a table with data from the existing file system and join it with the table of prior (status) data, looking for differences.  This would normally require three queries (different size/date for same path; new files in path; deleted files in path).
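A minimal sketch of that join-based diff in SQLite, via Python's sqlite3 module (the `prior`/`current` table names and columns are assumptions, not the asker's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE prior   (path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER);
CREATE TABLE current (path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER);
""")
# prior = last sync's state; current = freshly scanned file system.
con.executemany("INSERT INTO prior VALUES (?,?,?)",
                [("a.txt", 10, 1), ("b.txt", 20, 2), ("c.txt", 30, 3)])
con.executemany("INSERT INTO current VALUES (?,?,?)",
                [("a.txt", 10, 1), ("b.txt", 25, 5), ("d.txt", 40, 4)])

# 1. Changed: same path, different size or date.
changed = con.execute("""
    SELECT c.path FROM current c JOIN prior p ON c.path = p.path
    WHERE c.size <> p.size OR c.mtime <> p.mtime""").fetchall()

# 2. New: in current but not in prior.
added = con.execute("""
    SELECT c.path FROM current c LEFT JOIN prior p ON c.path = p.path
    WHERE p.path IS NULL""").fetchall()

# 3. Deleted: in prior but not in current.
deleted = con.execute("""
    SELECT p.path FROM prior p LEFT JOIN current c ON p.path = c.path
    WHERE c.path IS NULL""").fetchall()

print(changed, added, deleted)
```

The engine does the whole comparison in three set operations instead of one lookup per file.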
 
LVL 51

Accepted Solution

by:
Mark Wills earned 400 total points
ID: 38846852
I think I am misunderstanding the point...

You have a directory representation and you need to check for file changes, so what is the actual comparison? It sounds like you need to search every folder anyway, because that represents the "new".

Or, do you gather a new file, and then go searching for where it should now live ?

There has to be a double-sided action ("what is" vs "something new"). So, the search is not the only consideration.

It started to sound like your DB approach was going to store a file as if it were a table, and that's when I got a bit confused by your scenario.

You need to design the database to best utilise the power of the database, so you invent constructs like "date last changed" and "date synched" so you can do your searches using a "set" style approach rather than RBAR (row-by-agonising-row). I think the engine's search response will be significant - simply because that is what engines are designed to do, and vendors spend vast amounts of effort and resource making sure they are as efficient as possible.

While you might need to visit a folder and then scan for files, an index within the DB might contain several pointers on the one page (the building block of a DB), and the engine is already caching parts of the DB. So, I don't think it is a simple matter of how many lookups; it is how you have structured your database that determines the (sometimes profoundly significant) efficiencies gained in moving to a database structure.
 
LVL 6

Author Comment

by:jeffiepoo
ID: 38847726
I like the index suggestion - this would probably be the best option with a database.

Say I have a database made up of a single table with 5 columns. How would I index the table on all 5 columns?

If I indexed the database this way, would a SELECT statement be faster in every circumstance?

Sorry, I'm kind of new to this. And sorry I can't help if you didn't understand my post - it's complicated, and I read it 3 times to make sure it was clear, so it wouldn't be worth the typing room to re-explain it a bunch of times.

Thanks for the help guys!

-Jeff
 
LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 400 total points
ID: 38847791
You can create multiple indexes, each over multiple columns. You would need to do a bit of analysis to make sure the indexes are actually worthwhile, and to decide to what extent you actually index.

It also depends a bit on which database you use as to how best to structure the tables and, subsequently, the indexes. They all have subtle variations (because they all reckon they know how best to perform).

So, you first need to try to understand (and it can be a bit of a black art) what the query optimizer is doing...

http://www.sqlite.org/optoverview.html

Also read how to create an index : http://www.sqlite.org/lang_createindex.html
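As a sketch of what those pages describe, SQLite's EXPLAIN QUERY PLAN will show whether a multi-column index is actually being used (the table and index names here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files
               (path TEXT, name TEXT, size INTEGER, mtime INTEGER)""")
# A composite index on the columns the WHERE clause filters on.
con.execute("CREATE INDEX idx_files_path_name ON files (path, name)")

# Ask the optimizer how it would run the lookup.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM files WHERE path = ? AND name = ?",
    ("/docs", "a.txt")).fetchall()
for row in plan:
    print(row[-1])  # detail column, e.g. "SEARCH ... USING INDEX ..."
```

If the plan says SCAN instead of SEARCH ... USING INDEX, the index is not helping and the query or index needs rethinking.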
 
LVL 21

Assisted Solution

by:developmentguru
developmentguru earned 200 total points
ID: 38864496
There is a way to get rid of the log(N) in your current search.  If both lists are sorted, then you can simply walk the lists together.  Wherever the lists match, increment both.  When they do not match, the lowest entry is the "odd man out"; depending on which list it is in, you would know which way the file needs to move.  In the odd-man-out scenario you would only move to the next item on the list with the lesser file-name value.

By walking the lists you never search for individual items.  This should address the speed issue of your worst case scenario especially well.
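The lockstep walk described above might be sketched like this (a hypothetical helper, assuming both inputs are lists of names sorted ascending):

```python
def diff_sorted(old, new):
    """Walk two sorted name lists together: O(n + m), no per-item search."""
    i = j = 0
    added, deleted = [], []
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            i += 1            # match: advance both lists
            j += 1
        elif old[i] < new[j]:
            deleted.append(old[i])  # odd man out in old -> file was deleted
            i += 1
        else:
            added.append(new[j])    # odd man out in new -> file was added
            j += 1
    deleted.extend(old[i:])   # leftovers in old were all deleted
    added.extend(new[j:])     # leftovers in new were all added
    return added, deleted

print(diff_sorted(["a", "b", "c"], ["a", "c", "d"]))  # -> (['d'], ['b'])
```

Even the flat worst-case directory (a) becomes linear this way, since neither list is ever searched.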

Depending on the database you use, you may want to think about storing the list of files in a BLOB or CLOB field (some N sized database field for holding text).  This would allow you to pass your current list (read from the file system), and a user ID, to a stored procedure on the server which could do the same walking operation and produce normal database output... say one column for the operation (upload or download) and the second column for the file name.
 
LVL 29

Assisted Solution

by:fibo
fibo earned 200 total points
ID: 38904438
1 - You can build an index combining several fields. eg name, date, size
2 - In the case of MySQL, this index would also be usable as name or name, date
3 - If it is more efficient, you may choose to build the index not from the whole content of the field(s) but from the n leftmost characters

You may want to experiment with different index strategies: create several candidates, then use the "EXPLAIN" command on a SELECT: it will tell you which indexes were considered and which ones, if any, were used (in the case of MySQL, it uses only one of the available indexes, the selection being made at run time). If none were used... then you should reconsider, since you are in the worst case. Maybe try some other index?

If you want to detect whether a file's content has changed, or is identical to some other file's content, you should also consider storing in your database a hash code of the content, e.g. the MD5 (or another hash): if 2 files have different hashes, they are definitely different (but if they have the same hash, then check their sizes, and if those are identical too you will need to compare the real content).
Indexing on the hash code might also prove a good idea, but again some experimentation is needed.
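The hash idea could be sketched like this with Python's hashlib (the function name is hypothetical, and MD5 is used here only as a fingerprint, not for security):

```python
import hashlib

def content_fingerprint(data: bytes) -> str:
    """MD5 of a file's bytes. Different hashes => definitely different
    content; equal hashes => almost certainly identical (verify size,
    then bytes, if you need certainty)."""
    return hashlib.md5(data).hexdigest()

old = content_fingerprint(b"hello world")
new = content_fingerprint(b"hello world!")
print(old == new)  # a one-byte change alters the fingerprint
```

Storing this fingerprint per row lets the diff query compare a short fixed-length string instead of re-reading file contents.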

One may be tempted to create lots of indexes, assuming that this increases performance. It does, but only at SELECT time, and at the price of a performance drop at update time: the more indexes you have, the more of them need to be updated with the new data.
 
LVL 29

Expert Comment

by:fibo
ID: 38948950
Hi,

Where do you stand on this?
 
LVL 6

Author Comment

by:jeffiepoo
ID: 39085243
I need more time
 
LVL 6

Author Closing Comment

by:jeffiepoo
ID: 39146473
All great suggestions thank you!
