Link to home
Start Free TrialLog in
Avatar of Markus Fischer
Markus FischerFlag for Switzerland

asked on

For Experts: Query Plan and Index Names

Hello experts,

I just had one of these maddening experiences... and thought I'd make it my next "expert series" question.

I was trying to optimize a very slow query (unnaturally slow, in fact) by adding a new index, and was pleased to see a tremendous increase in performance. So I made the change permanent in about half a dozen similar tables and started testing my application again. There was no change. Everything was just as slow as before.

Naturally, I thought I made a stupid error in propagating the change and checked everything. I deleted and recreated the indexes, tried various variations, wrote new test queries... all to no avail. The new index wasn't working anymore. I had seen it working once, but could not reproduce it any more. I hate it when that happens. Computer programming shouldn't involve magic.

The option "JetShowPlan" didn't help, because it "currently does not handle subqueries, vt parameters, and subqueries" (sic).

A couple of hours later, I tried to retrace my steps *exactly*. The first time, I had named my index "X", so I did that again, feeling a bit silly (magical thinking, huh?). But it did work. Hard to believe, isn't it? If the index is called "X", it's used by the query; if it's called "Optimize", "ndxLastValue", or "abc"... it's not!

I'll upload a test database to ee-stuff in the first comment. Check it out for yourself.

It's a table with a multi-field index, the last number being a "version number". The data in the table is never edited; instead, a new record is added with a higher version number whenever the data is updated. Naturally, I need a query to show the last value for each "field", identified by the main key fields.

    SELECT *
    FROM tblValues AS V
    WHERE Nb >= All (
        Select Nb From tblValues
        Where A=V.A And B=V.B And C=V.C And D=V.D
        );

The main key fields are A, B, C, and D; Nb being the version number. Simple enough, no? The added index to improve the performance is naturally a combined index on A, B, C, and D. But you need to name that index carefully or it will not be used. Also note that there are several links from other tables, and thus several other implicit indexes. Without these links, the bug disappears...

Similar experiences, ideas, additional tests, rants, and even simple "wow!" comments welcome.

Cheers all!

Markus
(°v°)
___________________________________________________
Previous question in the experts series: http:/Q_22763356.html
Avatar of Markus Fischer
Markus Fischer
Flag of Switzerland image

ASKER

And here is the database:
https://filedb.experts-exchange.com/incoming/ee-stuff/4778-IndeX.zip

The table is tblValues and the query qselLastValues. There is also a module to demonstrate the effect of the index name: run Benchmark from the immediate pane.

Have fun!

(°v°)
Avatar of DatabaseMX (Joe Anderson - Former Microsoft Access MVP)
Marcus ... how come you never closed the previous Q?  Doesn't appear any else is going to chime in :-(

mx
I just did. Sorry for the delay.
(°v°)
ok ... I downloaded ... with check it out in a bit ... thx for closing the other Q.

mx
Ok, Wow!
I am tempted to run the query with randomly selected indexes, storing name and results to a table, for a few thousand/million times.

I was tempted, too! At the moment, I believe that only the relative positions of the indexes in alphabetical order play a role (and the first query in the Sub is a tiny bit slower).

Since this behavior/bug disappears when removing the relationships, my current idea is that the query plan somehow uses some of these indexes (all starting with "tbl" in this case) in a horribly inefficient Rushmore set... and then stops looking for anything better.

(°v°)
ASKER CERTIFIED SOLUTION
Avatar of jerryb30
jerryb30
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I am fighting the urge to do a control-break to see full results. So, I should have a sample size of about 10k in about 3 hours.
I suppose I should have allowed for no index being set, too.
Maybe later.
I regularly deal with large, fairly static, recordsets,  in Access, and have learned that indexes can be a 2-edged sword in terms of spead and db size. But I have a lot more to learn, so your (as usual insidious) conumdrum may pop up later to make me think about how I do (or do not) use indexes.
Thanks for that.

I tried (a bit less than 10000, though) and get averages of about 2 seconds for indexes A--T, and 1.5 hundredth for indexes U--Z. I think it confirms the general assumption of alphabetical order of indexes.

Using your table:

    SELECT indexname,
        Count(*) AS n,
        Avg(runtime) AS [Avg],
        Min(runtime) AS [Min],
        Max(runtime) AS [Max]
    FROM Table1
    GROUP BY indexname;

The negative time *is* scary, though. Is there a jinx after all?

(°v°)
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
> Could this have something to do with the issue ??

Most definitely, yes. If you remove the relationships, the problem goes away (but then the query is two times slower, which is another mystery!).

In fact, I added ShowIndexes() to list these relationship indexes, which are obviously (mis-)used by the query engine. Their names all start with "tbl"...

(°v°)
"(but then the query is two times slower, which is another mystery"

Well, that would make sense to me because defining relationships between related tables is supposed to improve query performance when those fields are joined in queries, etc.

I need to study up on sub queries, etc. For example, qselLastValues has the sub query in the criteria cell ... but the  Show box is not checked, yet you get results when you run the query.  Why is that ?

Also, what is the purpose of the sub query here ?

mx
Having run about 2400 times (substantiating Markus's stats), and then trying 2 letter index names, as well as a range between 65 and 122 (both not exhibiting the U-Z times), I am required to repeat 'Wow!'.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi Leigh, up so early? Thanks for stepping by!

mx,

I really selected "display all fields" from the query properties, which -- as Leigh pointed out -- is the translation of "SELECT *" in SQL. For a single table query, this can also be achieved by dragging V.* into the query grid (there is a slight difference, but let's leave that out at this time).

> Also, what is the purpose of the sub query here ?

I need the latest data for any combination of A, B, C, and D. Thus I use a subquery saying basically: "show me all records having the highest Nb for any combination". In SQL this translates to:

Nb >=All (Select Nb From tblValues Where A=V.A And B=V.B And C=V.C And D=V.D)

Another way of looking at it is that the real key is the combination A, B, C, and D. The records with lower Nb's are in fact obsolete, although I need to retain them for other queries. The table tblANb is in real life a log of changes (i.e. addition of new records reflecting new values), with signature and date, which can thus be traced (and undone if needed).

I don't know of any way to obtain the latest values without using a subquery, by the way.

(°v°)
You gotta love the time zone difference.  It's 22:28 in San Diego at the moment.  Anyway ... you two guys are the cream of the crop on EE on this and other similar subjects as far as I'm concerned.   I already emailed 3 colleagues and suggested the at least 'monitor' this Q ... which will no doubt turn out to be a fascinating 'expert series' thread.

ok ... so what steps should I take here (running the Benchmark, etc) to see/test the performance issue ... so that I might be of some use on this thread?

It will be tomorrow however, ... time to chill out for a bit .... you guys have a good day.

mx

SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I can only agree.

Agree with Markus on his findings and "Wow!".
And agree with Jim and the concept "there's only so much we can control".
And that this is worth raising to MS.

The correlated subquery seems to play some part in this particular execution plan choice (which is hardly surprising - it's bound to be *the* predominant factor).
But not necessarily the presence of *any* subquery.
Consider, for example, a joined subquery instead of correlated.

SELECT V.*
FROM
  tblValues V
    INNER JOIN
  (SELECT A,B,C,D, Max(Nb) As MaxNb FROM tblValues GROUP BY A,B,C,D) X
    ON (X.D=V.D) AND (X.C=V.C) AND (X.B=V.B) AND (X.A=V.A) AND (V.Nb>=X.MaxNb);

(Or for a more explicit formatting...)
SELECT
  V.*
FROM
  tblValues V
    INNER JOIN
  (
   SELECT
     V2.A, V2.B, V2.C, V2.D, Max(Nb) As MaxNb
   FROM
     tblValues V2
   GROUP BY
     A,B,C,D
  ) X
    ON
      V.Nb>=X.MaxNb
    AND
      X.A=V.A
    AND
      X.B=V.B
    AND
      X.C=V.C
    AND
      X.D=V.D;

This (while being roughly of equal execution time - if anything perhaps a smidge slower) isn't dependent on the index name.
I'd presume because of the execution priority - with the subquery evaluated first, without outward optimization attempts to the parent query.

Consequently the circumstances in the earlier queries (which seem to imply a "tie" of plans - both Markus original and the Max() version - relying on index name to win out) aren't so present.

But the index issue clearly isn't with the All predicate.  But the correlated subquery (which is more of a concern - given how vital and widespread they are... obviously not every situation lends itself to a joined subquery!)
Just thought - for those that just want to run it directly without the effort... ;-)
https://filedb.experts-exchange.com/incoming/ee-stuff/4784-IndeX.zip

Has ever so slightly altered version of Markus' testing code so I could test for the different queries with just a few iterations (didn't feel the need to go to Jerry's lengths lol ;-).

BTW the negative times in testing with Timer - was the field "runtime" of single type?  (Or some other type supporting non integer values?  Otherwise of course rounding can produce some funny results).
I've seen as much before though I don't think there's a Jinx here ;-)
Thanks Leigh.

I downloaded your queries and they confirm earlier tests which lead me to choose the ">=All" subquery syntax over other variants. To be fair, we should also try the "Top 1" subquery, which is sometimes the winner on MS-SQL and Oracle, or so I'm told.

The subquery you shared in http:#19944771 does not use the new index at all. The execution time is exactly the one you get when removing all relationships and indexes but the primary key, which is two times slower than the ">=All" subquery using the "X" index. So I do believe that your solution forces the use of the primary key as sole index. Since we cannot see the plan, this is only conjecture, naturally.

On my computer, the orders of magnitude for the queries are:

2 sec for the "bug", i.e. my ">=All" query without a special index or with the special index coming before the relational "tbl" indexes in alphabetical order. This is where Jet chooses a very bad plan...

3 hundredth of a second for the table without relationships, without the special index, with "PrimaryKey" renamed to "zPK", or using one of the alternate subqueries you provided.

1.5 hudredth of a second for the ">=All" query, with an index name sorting after any "tbl" indexes created by the relationships. I.e. the magical "X" query...

JDettman,

Yes, this could be a valid bug report, albeit a rather technical one. However, I'm not a MVP, which means that there is no way I can file any *serious* bug report to MS. Anyway, I only intended to share it with you all. Feel free to send a bug report is you think it might make a difference, or provide a URL where a bug is actually worth reporting.

Your comment implies that Jet is able to consider several plans and evaluate their relative costs. The current matter suggests that it builds a plan by considering indexes in alphabetical order and stops looking as soon as all required fields have been found in any index.

In our example, this means that the indexes stemming from tblAB, tblAC, and tblCD -- or, worse, only tblAB and tblCD -- are fed into Rushmore for a horrible end result. I can't imagine that Jet acutally considers any alternatives. Incidentally, if you remove the relationships, only the Primary Key is used. Then nothing is gained from any additional indexes.

In an ideal world, the Primary Key is really all that is needed for this query. Since the Primary Key contains A, B, C, D, and Nb, it is ideal for this scenario, and most especially for the ">=All" case. Note that the sort order for Nb is descending, for optimization of the .Seek feature.

Strangely enough, if you rename the Primary Key to "uPK" or the like, forcing it to come after the "tbl" indexes in alphabetical order, Jet no longer considers the additional index. Execution times always lie in the 0.03 seconds range, regardless of the presence or name of any additional index.

To All,

At present, nothing beats the magical "X" index. This is a very sad day for computer science. Am I drunk or only depressed?

(°v°)
Hello all!

I ran some more tests and uploaded the current test database to:
https://filedb.experts-exchange.com/incoming/ee-stuff/4901-IndeX.zip

I tried with a slightly simpler table, having four byte key fields named A, B, C, and D, with a small set of queries showing the records for each combination of A, B, and C having the highest D. D is thus equivalent to Nb in the previous test. There might be other ways to obtain that answer, naturally.

I created a simple sub deleting all indexes and rebuilding them based on a passed array. For example Array("ABCD", "AB", "BC") means create the primary key on the four fields and two indexes each on a pair of fields. The indexes are simply named "ndx0", "ndx1", etc., thus preserving their order alphabetically.

Benchmarking these combinations led me to the following conclusions. Note: this is based on my interpretation of the query qtotTestResults, and probably doesn't make much sense without studying the test database...


1. The execution plan of a simple query can involve the creation of a temporary index. This is not the case for subqueries in the WHERE clause. Such subqueries on non-indexed tables produce totally horrible results.

It also appears that the temporary indexes are not deleted immediately, and can be reused by the next query running in the same context (which I find to be good news). This overestimates the performance of queries using them, as I normally ran each test 10 times.


2. Point 1 means that Leigh's suggestion using a subquers in the FROM clause shows good results regardless of the indexes. This query cannot be fooled into using a wrong set of indexes. Either all indexes present are considered or the creation of a temporary index is weighted against all other solutions.


3. The execution optimizer appears to collect indexes in reverse alphabetical order until all needed fields have been found. These are fed into rushmore. An ill-chosen set of indexes results in very poor performance. I guess this is the actual "bug" here. The names of indexes should have no impact on performance.


4. JetEngine appears to be unable to use the index "ABCD" to optimize a query needing the index "ABC". This is very annoying. I would tend to call that a bug as well, but I don't know enough about query optimization to be certain here. Note that DAO's Seek method allows using indexes in this manner.


5. My actual data shows a different behaviour regarding simple indexes and relationships. I'm somehow unable to force the use of the primary key, even if I name it "zPK" or the like. I was not able to reproduce this problem in the last test, so I'm probably missing something. In that test, it appears that relationships and simple indexes are used in exactly the same way with similar global performances.


6. The test data shows that the Exists() subquery performs better than the >=All() subquery which I had selected based on earlier performance tests. However, this seems to be dependant on the actual values in the key fields (I didn't run further tests to explore that).


At present, I will keep my current indexes, and document the fact that the naming of them does impact the performance.

Leigh has reported this to the private MVP newsgroups, so I will not attempt to create another bug report for this. He might report here before the question becomes locked.


Thank you all for participating. As there are no real solutions and since this was more a discussion than a question anyway, I will split the points.

Markus
(°v°)
Hello all!

I ran some more tests and uploaded the current test database to:
https://filedb.experts-exchange.com/incoming/ee-stuff/4901-IndeX.zip

I tried with a slightly simpler table, having four byte key fields named A, B, C, and D, with a small set of queries showing the records for each combination of A, B, and C having the highest D. D is thus equivalent to Nb in the previous test.

I created a simple sub deleting all indexes and rebuilding them based on a passed array. For example Array("ABCD", "AB", "BC") means create the primary key on the four fields and two indexes each on a pair of fields. The indexes are simply named "ndx0", "ndx1", etc., thus preserving the ordering of the indexes alphabetically. This allows for more intensive tests. I also have an option to implement the indexes as such (simple indexes on the table) or as relationships to a dummy master table.

Benchmarking these combinations led me to the following conclusions. Note: this is based on my interpretation of the query qtotTestResults, and probably doesn't make much sense without studying the test database at least a little.


1. The execution plan of a simple query can involve the creation of a temporary index. This is not the case for subqueries in the WHERE clause. Such subqueries on non-indexed tables produce totally horrible results.

It also appears that the temporary indexes are not deleted immediately, and can be reused by the next query running in the same context (which I find to be good news). This overrates queries using them, as I ran each ten times in a row.


2. Point 1 means that Leigh's suggestion using a subquers in the FROM clause shows good results regardless of the indexes. This query cannot be fooled into using a wrong set of indexes. Either all indexes present are considered or the creation of a temporary index is weighted against all other solutions.


3. The execution optimizer appears to collect indexes in reverse alphabetical order until all needed fields have been found. These are fed into rushmore. An ill-chosen set of indexes results in very poor performance. This is the actual bug here. The names of the indexes should have no repercussion on query performance.


4. JetEngine appears to be unable to use the index "ABCD" to optimize a query needing the index "ABC". This is very annoying. I would call this a bug also, but I don't know enough about query optimization to be certain here. Note that DAO's Seek method does allow using indexes in that manner.


5. My actual data shows a different behaviour regarding simple indexes and relationships. I was not able to reproduce this problem, so I'm probably missing something. In my test, it appears that relationships and simple indexes are used in exactly the same way with similar global performances.


6. The test data shows that the Exists() subquery performs better than the >=All() subquery which I had selected previously. However, this seems to be dependant on the actual values in the key fields (I didn't run further tests to explore that). This of course means that there is no "best" solution for any general problem.


At present, I will keep my current indexes, and document the fact that the naming of them does impact the performance.

Leigh has reported this to the private MVP newsgroups, so I will not attempt to create another bug report for this. He might report here if something turns up before the question becomes locked.


Thank you all for participating. As there are no real solutions and since this was more a discussion than a question anyway, I will split the points.

Markus
(°v°)
Rats. I thought I had lost that comment... -- (^v°)