Heavy indexing on Large Tables (Index Tables)


Thanks in advance for your help :)

I'm writing to ask about how to provide fast queries against very wide and also deep tables.

The 4 tables are large, ranging from 40cols/300m rows to 500cols/30m rows.
They are keyed on Customer ID.  These tables are used in real time by the call center software.

The Analysts also want to use these tables, but they query on a variety of fields other than Customer ID.  The analysts and various non-real-time apps share a copy of these large tables in the prod-copy environment.

We thought of indexing the various columns they query on, but when prod-copy is built each day, building these indexes would take forever.  Also, there is a strong preference that the tables in prod-copy stay the same as production.

A solution has been proposed - As the large tables are truncated & loaded daily by the ETL software, to have the ETL software also load "Index Tables".  Each index-table would contain one query-by column along with customer ID.  A view would bring all these index-tables together and join them to the base table, making them invisible to the analysts.  (these index tables could be IOT's, or whatever is fastest)

One drawback is that if they suddenly want to do a query on one of the other 500 columns (in a rapid way), that the ETL process needs to be changed to build an additional index-table.

(1) What is your opinion of this solution?  Do you know of any other solutions?

(2) What are these tables called?  I've tried to search the internet but I'm not aware of what this index-table strategy is known by.

Thank you very much,
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Franck PachotCommented:
>>building these indexes would take forever
If you can partition on a time column, then you have only to rebuild the last partition. But I'm not sure you can because you sait that primary key is customer id.
I suppose you know about bitmap indexes: you can index each column separately, and index can be combined if needed.

(1) and (2) In oracle this is called materialized view. Oracle can build them if you want (you define the view and just do a refresh. Again, if only one partition is touched, then Oracle can rebuild the materialized view with only that partition) and Oracle can use them even if you query the original table, so this is a bit transparent.

But I'm not sure about the need for materialized views in that case. Even if you can't partition on a date, you can hash partition on customer id. Then index rebuilds can be done in parallel query (using all the resources of you machine).

And there is no reason that refreshing a materialized view would be faster than an index rebuild.

If the queries uses different columns to filter (where clause) then the solution is bitmap indexes.
Have a few hash partitions on customer i) so you can use parallel query, and the sotrs that are needed for index rebuild will have smaller chunks. Few can be 8 or 16. Can be higher if the queries are using parallel query as well. Set manual workarea policy and bi sort area size, and rebuild indexes in parallel after each load. Rebuild them in nologging if you can afford an outage to rebuild them in case of maedia failure. That is what is commonly done in that case (DSS).

mj62mj62Author Commented:
Thanks for your feedback.  I can see how these would be considered materialized views - do you know if this strategy of using them as indexes is called something?

The tables are presently partitioned on Customer_ID so that the Call Center software can get the rows for a given customer quite quickly.  If they want say, 7 new indexes on each table, then I'm not sure I can partition it by each of the 7 different columns (or is that what you meant?).

I can see that creating indexes and building them using lots of parallel has potential, but with 30m rows and 500 columns, 7 new indexes (for example) is 7 full table scans, which is going to take time even in parallel, on a wide table.  Which is why we looked at 'index tables' as they can be populated at the exact same time that the base table is being truncated and loaded by the ETL software.  Just ask the ETL to populate those as well, perhaps as IOT's (do IOT's make sense for this?)

7 new indexes may not sound like much, but across 4 tables that's 28 large indexes...

As a side note, (although I'm interested in this problem both as a real-world and academic exercise), if I add indexes to Prod-copy for the Analysts, I have to add them to Prod as well, and is seems crazy having 28 unused indexes in prod which is only used by the Call Centre software on a customer-by-customer basis.

You've opened my eyes to bitmap indexes - I never realized they were so good for AND criteria in a where clause.  Although the columns in question have High cardinality, so normally bitmaps are not recommended for them...
mj62mj62Author Commented:
.. for others reading this I found this to be a good summary of bitmap indexes:


Based on the guide for number-of-distinct-values listed on that page, I can't see something like customer_start_date being a column that would work well with bitmap indexes (which is one of the query columns that analysts want to do on my table...)

The Ultimate Tool Kit for Technolgy Solution Provi

Broken down into practical pointers and step-by-step instructions, the IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy for valuable how-to assets including sample agreements, checklists, flowcharts, and more!

Franck PachotCommented:

Maybe you need to explain what you mean by 'index tables'.
I don't see how the ETL can populate them faster than an index. Because, after all, an index will be needed on those tables. And those index tables will be longer to query than an index that has directly the rowid. I don't think it is good solution, but I'll test if the build of an index can use a materialized view that has only the indexed columns and the rowid instead of the whole table.

I don't understant the reason why you need to have same structure and same indexes in prod and in the copy for analysts. Totally different needs. And you probably don't 4 tables and all the colums in the copy.

Do not put bitmap indexes on you production. They are not suited for transactional updates !

What is the amount of changed data at each copy ? If it is a small percentage, then you should refresh the copy with the difference only (and the indexes are maintained). If it is a high percentage, then a full refresh is a good option, inserting without any indexes, and rebuilding them afterwards.

mj62mj62Author Commented:
We have copies of the call-center software running on the prod-copy.  This allows us to see the exact system behavior and for the business-users to evaluate new policies/changes/etc.  So unfortunately yes, I must keep the tables the same.

You asked what these index-tables are, I'll go through a step-by-step example:

> A solution has been proposed - As the large tables are truncated & loaded daily by the ETL software, to have the ETL software also load "Index Tables".  Each index-table would contain one query-by column along with customer ID.  A view would bring all these index-tables together and join them to the base table, making them invisible to the analysts.  (these index tables could be IOT's, or whatever is fastest)

Example Situation:
- Existing table is partitioned and indexed on Customer ID
- Table is truncated and reloaded by an ETL Application every day
- Analyst wants to query on customer_bill_amount (which is not indexed)

2 possible solutions - Solution (A) normal indexes and Solution (B) index tables

Solution (A)
Step 1 - truncate and load table  (x hours)
Step 2 - when step 1 is finished, then create index on customer_bill_amount (x hours)
Total Time: x + x

Solution (B)
Step 1 & Step 2 concurrently - truncate and load table, whilst loading, also create and load an IOT in this format:
customer_bill_amount, customer_id  (x hours)
Step 3: Create view across IOT and original table which hides the IOT from the Analyst user (zero hours)
Total Time: x

You can see Solution (A) takes x+x hours, whereas Solution (B) only takes x hours.  Imagine then that we have 7 indexes to build instead of 1.  Solution A takes 7x hours and Solution B only takes x.

Imagine further, that we have to apply the solution across 4 tables.

Franck PachotCommented:
I tested. Even if a materialized view has rowid and index columns, it is not used to create an index (tested on 11g).
mj62mj62Author Commented:
Hi Franckpachot,

I do appreciate your efforts but I don't think you're understanding what I'm asking.

In Solution (B), the IOT (Index Organized Table) would be joined to the base table, on customer_id, using the view.  I'm not talking about materialized views because those take time to create, which is one of the benefits of solution (b)...

Franck PachotCommented:

You say x instead of x+x because in A you are doing table + indexes sequentially, and in B) you are doing them in parallel. But normal indexes can also be maintained in parallel:just create them before.
However, it costs more than creating them later.

Your solution B) has a cost as well. Inserting in an additional IOT has the same cost a maintaining an index on the table. So the second x can be 3 or 4 times the first x. And if you don't use IOT, yes it will eb faster, but do not help when analysts will make queries.

An index is column+rowid (direct access to physical location of the row in the table)
What you propose is: column + customerid and then customerid + rowid
That is longer to query, and that is longer to load. So I don't see that as a solution.

Of course, what you want to avoid is the many full scans when building indexes. But that is still faster than maintaining index (or 'table index') while loading. And it needs those full scans: they will be sorted differently, so even if Oracle allowed it, you cannot expect an improvement by full scanning only one time. Either you rebuild all from scratch, and that needs those full scans. Or you maintaining indexes while you load, but load is probably slower.

So, to summarize my opinion, your solution B cannot be faster and all you can do to improve solution B can be done on solution A.


Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Franck PachotCommented:
I think I understood the question, and replied to it...
If there are not a lot of answer, that is not that the question is difficult to understand, but because the idea behind that is not an usual one.
Franck PachotCommented:
(see previous comment)
mj62mj62Author Commented:
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Oracle Database

From novice to tech pro — start learning today.