Link to home
Start Free TrialLog in
Avatar of mj62mj62
mj62mj62

asked on

Heavy indexing on Large Tables (Index Tables)

Hello,

Thanks in advance for your help :)

I'm writing to ask about how to provide fast queries against very wide and also deep tables.

The 4 tables are large, ranging from 40cols/300m rows to 500cols/30m rows.
They are keyed on Customer ID.  These tables are used in real time by the call center software.

The Analysts also want to use these tables, but they query on a variety of fields other than Customer ID.  The analysts and various non-real-time apps share a copy of these large tables in the prod-copy environment.

We thought of indexing the various columns they query on, but when prod-copy is built each day, building these indexes would take forever.  Also, there is a strong preference that the tables in prod-copy stay the same as production.

A solution has been proposed - As the large tables are truncated & loaded daily by the ETL software, to have the ETL software also load "Index Tables".  Each index-table would contain one query-by column along with customer ID.  A view would bring all these index-tables together and join them to the base table, making them invisible to the analysts.  (these index tables could be IOT's, or whatever is fastest)

One drawback is that if they suddenly want to do a query on one of the other 500 columns (in a rapid way), that the ETL process needs to be changed to build an additional index-table.

Questions:
(1) What is your opinion of this solution?  Do you know of any other solutions?

(2) What are these tables called?  I've tried to search the internet but I'm not aware of what this index-table strategy is known by.

Thank you very much,
MJ
Avatar of Franck Pachot
Franck Pachot
Flag of Switzerland image

>>building these indexes would take forever
If you can partition on a time column, then you have only to rebuild the last partition. But I'm not sure you can because you sait that primary key is customer id.
I suppose you know about bitmap indexes: you can index each column separately, and index can be combined if needed.

(1) and (2) In oracle this is called materialized view. Oracle can build them if you want (you define the view and just do a refresh. Again, if only one partition is touched, then Oracle can rebuild the materialized view with only that partition) and Oracle can use them even if you query the original table, so this is a bit transparent.

But I'm not sure about the need for materialized views in that case. Even if you can't partition on a date, you can hash partition on customer id. Then index rebuilds can be done in parallel query (using all the resources of you machine).

And there is no reason that refreshing a materialized view would be faster than an index rebuild.

If the queries uses different columns to filter (where clause) then the solution is bitmap indexes.
Have a few hash partitions on customer i) so you can use parallel query, and the sotrs that are needed for index rebuild will have smaller chunks. Few can be 8 or 16. Can be higher if the queries are using parallel query as well. Set manual workarea policy and bi sort area size, and rebuild indexes in parallel after each load. Rebuild them in nologging if you can afford an outage to rebuild them in case of maedia failure. That is what is commonly done in that case (DSS).

Regards,
Franck.
Avatar of mj62mj62
mj62mj62

ASKER

Thanks for your feedback.  I can see how these would be considered materialized views - do you know if this strategy of using them as indexes is called something?

The tables are presently partitioned on Customer_ID so that the Call Center software can get the rows for a given customer quite quickly.  If they want say, 7 new indexes on each table, then I'm not sure I can partition it by each of the 7 different columns (or is that what you meant?).

I can see that creating indexes and building them using lots of parallel has potential, but with 30m rows and 500 columns, 7 new indexes (for example) is 7 full table scans, which is going to take time even in parallel, on a wide table.  Which is why we looked at 'index tables' as they can be populated at the exact same time that the base table is being truncated and loaded by the ETL software.  Just ask the ETL to populate those as well, perhaps as IOT's (do IOT's make sense for this?)

7 new indexes may not sound like much, but across 4 tables that's 28 large indexes...

As a side note, (although I'm interested in this problem both as a real-world and academic exercise), if I add indexes to Prod-copy for the Analysts, I have to add them to Prod as well, and is seems crazy having 28 unused indexes in prod which is only used by the Call Centre software on a customer-by-customer basis.

You've opened my eyes to bitmap indexes - I never realized they were so good for AND criteria in a where clause.  Although the columns in question have High cardinality, so normally bitmaps are not recommended for them...
.. for others reading this I found this to be a good summary of bitmap indexes:

http://www.dba-oracle.com/oracle_tips_bitmapped_indexes.htm

Based on the guide for number-of-distinct-values listed on that page, I can't see something like customer_start_date being a column that would work well with bitmap indexes (which is one of the query columns that analysts want to do on my table...)

Hi,

Maybe you need to explain what you mean by 'index tables'.
I don't see how the ETL can populate them faster than an index. Because, after all, an index will be needed on those tables. And those index tables will be longer to query than an index that has directly the rowid. I don't think it is good solution, but I'll test if the build of an index can use a materialized view that has only the indexed columns and the rowid instead of the whole table.

I don't understant the reason why you need to have same structure and same indexes in prod and in the copy for analysts. Totally different needs. And you probably don't 4 tables and all the colums in the copy.

Do not put bitmap indexes on you production. They are not suited for transactional updates !

What is the amount of changed data at each copy ? If it is a small percentage, then you should refresh the copy with the difference only (and the indexes are maintained). If it is a high percentage, then a full refresh is a good option, inserting without any indexes, and rebuilding them afterwards.

Regards,
Franck.
We have copies of the call-center software running on the prod-copy.  This allows us to see the exact system behavior and for the business-users to evaluate new policies/changes/etc.  So unfortunately yes, I must keep the tables the same.

You asked what these index-tables are, I'll go through a step-by-step example:

> A solution has been proposed - As the large tables are truncated & loaded daily by the ETL software, to have the ETL software also load "Index Tables".  Each index-table would contain one query-by column along with customer ID.  A view would bring all these index-tables together and join them to the base table, making them invisible to the analysts.  (these index tables could be IOT's, or whatever is fastest)


Example Situation:
- Existing table is partitioned and indexed on Customer ID
- Table is truncated and reloaded by an ETL Application every day
- Analyst wants to query on customer_bill_amount (which is not indexed)

2 possible solutions - Solution (A) normal indexes and Solution (B) index tables

Solution (A)
Step 1 - truncate and load table  (x hours)
Step 2 - when step 1 is finished, then create index on customer_bill_amount (x hours)
Total Time: x + x

Solution (B)
Step 1 & Step 2 concurrently - truncate and load table, whilst loading, also create and load an IOT in this format:
customer_bill_amount, customer_id  (x hours)
Step 3: Create view across IOT and original table which hides the IOT from the Analyst user (zero hours)
Total Time: x


You can see Solution (A) takes x+x hours, whereas Solution (B) only takes x hours.  Imagine then that we have 7 indexes to build instead of 1.  Solution A takes 7x hours and Solution B only takes x.

Imagine further, that we have to apply the solution across 4 tables.

Thanks,
MJ
I tested. Even if a materialized view has rowid and index columns, it is not used to create an index (tested on 11g).
Hi Franckpachot,

I do appreciate your efforts but I don't think you're understanding what I'm asking.

In Solution (B), the IOT (Index Organized Table) would be joined to the base table, on customer_id, using the view.  I'm not talking about materialized views because those take time to create, which is one of the benefits of solution (b)...

ASKER CERTIFIED SOLUTION
Avatar of Franck Pachot
Franck Pachot
Flag of Switzerland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I think I understood the question, and replied to it...
If there are not a lot of answer, that is not that the question is difficult to understand, but because the idea behind that is not an usual one.
(see previous comment)
.