gs79
asked on
loading huge data into a fact table from a huge source table
I have a source table xyz which contains around 23 million records with a lot of attributes. I need to load a fact table xyz_f, which is basically a select from xyz with some records filtered out by certain id_cds; the load inserts around 23 million records. The table xyz_f is truncated and its indexes are dropped before the load.
It is currently taking close to an hour to load the data; the task is to decrease the load time.
Please make any recommendations.
For now BULK COLLECT is being used to load the data, with a batch size of 10000.
The records that go into the target fact table are:
select * from xyz
where
id not in(124,234,2345,2345.....)
which results in close to 23 million records every day.
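Roughly, the current load loop looks like this (a simplified sketch; the real code uses the full column list and actual names):
declare
  cursor c_src is
    select * from xyz
    where id not in (124, 234, 2345 /* ... */);
  type t_rows is table of xyz%rowtype;
  l_rows t_rows;
begin
  open c_src;
  loop
    fetch c_src bulk collect into l_rows limit 10000;  -- 10000-row batches
    exit when l_rows.count = 0;
    forall i in 1 .. l_rows.count
      insert into xyz_f values l_rows(i);
    commit;                                            -- commit every batch
  end loop;
  close c_src;
end;
/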
Can you not set up xyz_f as a materialized view and never 'reload' it again?
Can't you just append the "new" data to the fact table? Of course it's not that easy if you have updates and deletes on the source table, but you may find a way around it.
In general, ALTER TABLE EXCHANGE PARTITION is what you want to use for these jobs. The fact table ought to be partitioned by date, with local indexes, so that new partitions can be added in a split second, with all their indexes pre-created and analyzed. If you can also get the source table partitioned by list (to isolate your IDs of interest), and then subpartitioned by date (or the other way round), even better, since then you scan only the interesting data.
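As a rough sketch (the staging table, partition, and index names here are invented), the daily cycle could look like:
-- load an index-free staging table via direct path
insert /*+ append */ into xyz_f_stage
select * from xyz
where id not in (124, 234, 2345 /* ... */);
commit;
-- pre-create the matching indexes and statistics on the staging table
create index xyz_f_stage_ix1 on xyz_f_stage (id);   -- assumed index column
execute dbms_stats.gather_table_stats(user, 'XYZ_F_STAGE')
-- swap the loaded segment in as the current partition: metadata only, effectively instant
alter table xyz_f
  exchange partition p_current with table xyz_f_stage
  including indexes without validation;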
ASKER
I just need to truncate the table xyz_f and reload it daily with the records returned by the query below:
select * from xyz
where
id not in(124,234,2345,2345.....)
Right now BULK COLLECT is being used with a commit frequency of 10000, which is taking 30-40 minutes to load.
Can parallel pipelined functions be used to insert records into the table through parallel slaves, after enabling the table for parallel DML?
Please advise if there are any other methods.
Thanks
As of now it is taking 30-40 minutes to insert.
>>Please advise if there are any other methods.
First post: Materialized views? http:#a35087080
With those, no need to truncate/reinsert; the view is kept up to date. Even if you cannot do a fast refresh during the day, have it updated during your current nightly maintenance window.
>>Can parallel pipelined functions be used
What are you thinking pipelined functions would add? Parallel is parallel. I'm not sure how a pipelined function would help speed things up, but you never know. I suggest you set up some tests and try out everything you can think of.
You need to describe the whole procedure to tell whether pipelined functions could play any role in this.
One way or another: if you can say it in plain SQL (i.e. NOT PL/SQL), it will be faster than manually looping through each record, even with bulk collect. Think of something like:
drop table target;
create table target parallel 4 as
  select /*+ parallel(xyz 4) */ *
  from xyz
  where id not in (124, 234, 2345, 2345.....);
create index target_ix1 on target (...) compute statistics;
create index target_ix2 on target (...) compute statistics;
...
execute dbms_stats.gather_table_stats(user, 'TARGET', cascade => false)
Dropping the table gets rid of the indexes, which is VERY advisable when you are INSERTing so many rows into an empty table, especially if you have many indexes.
For the SELECT part, you may want to see the reference for the PARALLEL hint. You can also do it without hinting if you set up your source table this way: ALTER TABLE XYZ PARALLEL 4;
Note you can also create the target table with 4 partitions, and optionally make its indexes local as well. Try and see what's faster in your specific scenario. Of course, the example degree of 4 must also be adjusted by trial and error.
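For example (again just a sketch, with an assumed partitioning key and index column):
create table target
  partition by hash (id) partitions 4
  parallel 4
as
select /*+ parallel(xyz 4) */ *
from xyz
where id not in (124, 234, 2345 /* ... */);
create index target_ix1 on target (id) local;   -- assumed index column
execute dbms_stats.gather_table_stats(user, 'TARGET', cascade => true)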
ASKER
Dropping the table is not an option as it will affect the dependencies. So I am using a procedure where I perform the following operations sequentially:
1. Drop the indexes
2. Truncate the table
3. ALTER SESSION ENABLE PARALLEL DML
4. Insert into the table (the table has a parallel degree of 8), also using a hint in the "select from source table" statement (see the sketch after this list)
5. ALTER SESSION DISABLE PARALLEL DML
6. Recreate the indexes
Still no performance improvement.
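The insert in step 4 is essentially this (a simplified sketch of what the procedure runs; the column list is omitted):
alter session enable parallel dml;
insert /*+ append parallel(xyz_f, 8) */ into xyz_f
select /*+ parallel(xyz, 8) */ *
from xyz
where id not in (124, 234, 2345 /* ... */);
commit;
alter session disable parallel dml;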
Another thing I tried is creating a parallel pipelined function to process the SQL cursor in parallel slaves and do parallel DML on the table, but I could not see any parallelism in action... I will provide the queries in a while, but any thoughts on this?
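In outline it is of this shape (a trimmed-down sketch; the column names are placeholders and the real types carry the full column list):
-- row/collection types for the piped rows (columns trimmed to two)
create or replace type xyz_row_t as object (id number, attr1 varchar2(100));
/
create or replace type xyz_tab_t as table of xyz_row_t;
/
-- parallel-enabled pipelined function reading from a ref cursor
create or replace function xyz_transform (p_cur sys_refcursor)
  return xyz_tab_t
  pipelined
  parallel_enable (partition p_cur by any)
is
  l_id    number;
  l_attr1 varchar2(100);
begin
  loop
    fetch p_cur into l_id, l_attr1;
    exit when p_cur%notfound;
    pipe row (xyz_row_t(l_id, l_attr1));
  end loop;
  return;
end;
/
-- parallelism only shows up when the feeding cursor runs in parallel and parallel dml is enabled
alter session enable parallel dml;
insert /*+ append parallel(xyz_f, 8) */ into xyz_f (id, attr1)
select id, attr1
from table(xyz_transform(cursor(
       select /*+ parallel(xyz, 8) */ id, attr1
       from xyz
       where id not in (124, 234, 2345 /* ... */))));
commit;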
thanks
Why haven't you commented on the materialized view suggestions?
ASKER
I am new here; I did suggest the materialized view approach and the solution was discarded. I may have to try this as well, and I will give it a try. Meanwhile, can you please help me with the script to create the MV for the table I described above, with the correct refresh option?
I still wonder why I am not able to see parallelism in action when I used parallel pipelined functions. There have been a lot of white papers written on this technique boasting its power in ETL, but I feel it's quite whimsical in the way it works.
any thoughts..
Thanks..
What are your refresh requirements and impact requirements? I would probably go with refresh on commit, but that depends on whether you can withstand the performance impact.
I'm on mobile so can't come up with any working script right now. All the options are in the docs along with the references for the refresh job.
If you can't get it by tomorrow, I'll try to create a working example.
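In outline, it would be something like this (just a sketch typed from memory, with made-up names; check the exact syntax and refresh options against the docs):
-- complete refresh once a night; adjust the schedule to your maintenance window
create materialized view xyz_f_mv
  build immediate
  refresh complete
  start with trunc(sysdate) + 1 + 1/24   -- first refresh tomorrow at 01:00
  next trunc(sysdate) + 1 + 1/24         -- then once a day
as
select * from xyz
where id not in (124, 234, 2345 /* ... */);
The start with / next dates drive the automatic refresh job; drop them and call dbms_mview.refresh yourself if you prefer to control the refresh from your own scheduler.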
ASKER
I want this to be refreshed only once a day, preferably at, let's say, 1 am. Please provide a working sample if you have one.
thanks
ASKER CERTIFIED SOLUTION
ASKER
Thanks everyone. The MVs work. I was hoping to solve this using pipelined functions.