Guy Hengel [angelIII / a3]
asked on
tuning a query over DBLINK
Input:
3 Servers, all with Oracle 9.2
* PROD --- live production server
* STANDBY --- standby database, updated every night by applying the logs, the rest of the day it is open as read_only
* DW --- a database that reports using data on the main database.
On PROD (and hence by STANDBY), there is a "big" table (10M rows), partitionned on a time field (creationtime).
+ index on some_field
I have a query that queries this table with a range condition on PROD/STANDBY:
select * from big_table where creationtime >= x and creationtime < y and some_field in ( list of values ) ;
returns in 8:35 minutes (0 rows)
I run the same query over DBLINK:
select * from big_table@PROD where creationtime >= x and creationtime < y and some_field in ( list of values ) ;
returns in 1:44:32 minutes (0 rows)
what could be the cause of this ?
what could be done to optimize this query over the dblink?
IMPORTANT NOTES:
* I cannot add any view, functions, MViews or anything else to the PROD database. It's a custom application database, I have no change permissions to it at all.
* I tried to add the following hints with no (positive) effect:
- driving_site(big_table)
- parallel
- all_rows
- index ( myindex )
Thanks
3 Servers, all with Oracle 9.2
* PROD --- live production server
* STANDBY --- standby database, updated every night by applying the logs, the rest of the day it is open as read_only
* DW --- a database that reports using data on the main database.
On PROD (and hence by STANDBY), there is a "big" table (10M rows), partitionned on a time field (creationtime).
+ index on some_field
I have a query that queries this table with a range condition on PROD/STANDBY:
select * from big_table where creationtime >= x and creationtime < y and some_field in ( list of values ) ;
returns in 8:35 minutes (0 rows)
I run the same query over DBLINK:
select * from big_table@PROD where creationtime >= x and creationtime < y and some_field in ( list of values ) ;
returns in 1:44:32 minutes (0 rows)
what could be the cause of this ?
what could be done to optimize this query over the dblink?
IMPORTANT NOTES:
* I cannot add any view, functions, MViews or anything else to the PROD database. It's a custom application database, I have no change permissions to it at all.
* I tried to add the following hints with no (positive) effect:
- driving_site(big_table)
- parallel
- all_rows
- index ( myindex )
Thanks
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Which user do you use in the DBLINK. Is it the same user that you use when running the query directly on PROD or a different user
ASKER
same user/schema.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
>Can you use following meothod in order to get explain plan?
nice information, good to know.
however, this output confirms that the good index (Index Range Scan) of the table is used.
>are the x & y in your query bind variables.
I use simple variables in the code, but for testing I use litteral values (both same execution time)
nice information, good to know.
however, this output confirms that the good index (Index Range Scan) of the table is used.
>are the x & y in your query bind variables.
I use simple variables in the code, but for testing I use litteral values (both same execution time)
If the explain plan is same, then the most probably the problem is in the underlying network. Check the wait events for the session. Check specifically for the waits events like 'SQL*Net .....' wait events. If you are getting high values for these events then the problem would be within the underlying network.
Another test that you can run is
On DW system
start sql*plus and connect to the PROD instance using the same tnsnames entry that your DBLINK uses.
Run the query and check the performance.
If it is a pure network problem, then this query performance should also be slow.
On DW system
start sql*plus and connect to the PROD instance using the same tnsnames entry that your DBLINK uses.
Run the query and check the performance.
If it is a pure network problem, then this query performance should also be slow.
ASKER
Here a report of the wait events, seems to confirm your last guess goes into the right direction:
now, what part to check first, the SQL*Net seconds waiting (1/820) seconds in average, or rather the latch free timeouts for example?
Ma Avr 04 page
System-wide Wait Analysis
for current wait events
Average
Event Total Seconds Total Wait
Name Waits Waiting Timeouts (in secs)
-------------------------- ---- ------------ -------- ------------ ---------
SQL*Net more data from dblink 82,052,571 115,298 0 .000
db file sequential read 61,054,126 74,980 0 .000
SQL*Net message to dblink 56,003,325 560 0 .000
SQL*Net message from dblink 56,003,324 453,138 0 .010
db file scattered read 52,383,908 93,782 0 .000
direct path read 11,853,067 417,496 0 .040
log file parallel write 3,022,317 51,753 0 .020
direct path write 1,422,830 310,557 0 .220
log file sync 1,319,468 36,425 7 .030
PX Deq: Table Q Normal 1,189,847 24,136 1,019 .020
latch free 764,165 29,369 673,500 .040
control file sequential read 341,706 209 0 .000
control file parallel write 308,090 28,788 0 .090
PX Deq: Execute Reply 270,819 6,080 937 .020
buffer busy waits 198,275 20,527 605 .100
PX Deq Credit: send blkd 142,000 5,506 511 .040
db file parallel write 89,905 124,229 0 1.380
free buffer waits 89,485 86,982 86,507 .970
PX qref latch 83,901 1,334 79,869 .020
direct path read (lob) 56,558 360 0 .010
PX Deq: Execution Msg 38,633 5,034 2,089 .130
wakeup time manager 30,527 887,645 30,507 29.080
LGWR wait for redo copy 27,439 121 6,094 .000
PX Deq Credit: need buffer 12,511 747 140 .060
PX Deq: Table Q qref 12,452 123 44 .010
rdbms ipc reply 11,562 4,304 1,400 .370
enqueue 9,742 24,246 8,021 2.490
write complete waits 8,326 8,133 8,122 .980
SQL*Net more data from client 7,636 19 0 .000
jobq slave wait 4,824 14,045 4,687 2.910
log buffer space 3,882 875 1 .230
library cache pin 2,167 5,243 1,773 2.420
SQL*Net more data to dblink 2,068 1 0 .000
PX Deq: Signal ACK 1,658 52 852 .030
single-task message 1,454 130 0 .090
library cache lock 1,358 3,959 1,343 2.920
PX Deq: Msg Fragment 1,165 16 3 .010
PX Deq: Parse Reply 1,010 110 44 .110
undo segment extension 498 0 498 .000
local write wait 421 195 153 .460
async disk IO 400 61 0 .150
process startup 275 32 0 .120
db file parallel read 239 32 0 .140
PX Deq: Join ACK 233 1 0 .000
SQL*Net break/reset to dblink 198 0 0 .000
wait for stopper event to be i 120 13 117 .100
ncreased
PX Deq: Table Q Sample 116 2 0 .010
log file switch completion 115 67 32 .590
row cache lock 89 181 59 2.030
db file single write 67 4 0 .060
log file sequential read 54 0 0 .010
log file single write 54 1 0 .020
wait list latch free 52 1 0 .020
inactive session 45 44 45 .990
PX Deq: Table Q Get Keys 42 1 0 .020
library cache load lock 20 2 0 .100
PX Deq: Txn Recovery Start 5 0 0 .010
reliable message 3 3 2 .930
buffer deadlock 2 0 2 .000
refresh controlfile command 1 0 0 .000
control file heartbeat 1 4 1 3.910
61 ligne(s) sΘlectionnΘe(s).
now, what part to check first, the SQL*Net seconds waiting (1/820) seconds in average, or rather the latch free timeouts for example?
Ma Avr 04 page
System-wide Wait Analysis
for current wait events
Average
Event Total Seconds Total Wait
Name Waits Waiting Timeouts (in secs)
--------------------------
SQL*Net more data from dblink 82,052,571 115,298 0 .000
db file sequential read 61,054,126 74,980 0 .000
SQL*Net message to dblink 56,003,325 560 0 .000
SQL*Net message from dblink 56,003,324 453,138 0 .010
db file scattered read 52,383,908 93,782 0 .000
direct path read 11,853,067 417,496 0 .040
log file parallel write 3,022,317 51,753 0 .020
direct path write 1,422,830 310,557 0 .220
log file sync 1,319,468 36,425 7 .030
PX Deq: Table Q Normal 1,189,847 24,136 1,019 .020
latch free 764,165 29,369 673,500 .040
control file sequential read 341,706 209 0 .000
control file parallel write 308,090 28,788 0 .090
PX Deq: Execute Reply 270,819 6,080 937 .020
buffer busy waits 198,275 20,527 605 .100
PX Deq Credit: send blkd 142,000 5,506 511 .040
db file parallel write 89,905 124,229 0 1.380
free buffer waits 89,485 86,982 86,507 .970
PX qref latch 83,901 1,334 79,869 .020
direct path read (lob) 56,558 360 0 .010
PX Deq: Execution Msg 38,633 5,034 2,089 .130
wakeup time manager 30,527 887,645 30,507 29.080
LGWR wait for redo copy 27,439 121 6,094 .000
PX Deq Credit: need buffer 12,511 747 140 .060
PX Deq: Table Q qref 12,452 123 44 .010
rdbms ipc reply 11,562 4,304 1,400 .370
enqueue 9,742 24,246 8,021 2.490
write complete waits 8,326 8,133 8,122 .980
SQL*Net more data from client 7,636 19 0 .000
jobq slave wait 4,824 14,045 4,687 2.910
log buffer space 3,882 875 1 .230
library cache pin 2,167 5,243 1,773 2.420
SQL*Net more data to dblink 2,068 1 0 .000
PX Deq: Signal ACK 1,658 52 852 .030
single-task message 1,454 130 0 .090
library cache lock 1,358 3,959 1,343 2.920
PX Deq: Msg Fragment 1,165 16 3 .010
PX Deq: Parse Reply 1,010 110 44 .110
undo segment extension 498 0 498 .000
local write wait 421 195 153 .460
async disk IO 400 61 0 .150
process startup 275 32 0 .120
db file parallel read 239 32 0 .140
PX Deq: Join ACK 233 1 0 .000
SQL*Net break/reset to dblink 198 0 0 .000
wait for stopper event to be i 120 13 117 .100
ncreased
PX Deq: Table Q Sample 116 2 0 .010
log file switch completion 115 67 32 .590
row cache lock 89 181 59 2.030
db file single write 67 4 0 .060
log file sequential read 54 0 0 .010
log file single write 54 1 0 .020
wait list latch free 52 1 0 .020
inactive session 45 44 45 .990
PX Deq: Table Q Get Keys 42 1 0 .020
library cache load lock 20 2 0 .100
PX Deq: Txn Recovery Start 5 0 0 .010
reliable message 3 3 2 .930
buffer deadlock 2 0 2 .000
refresh controlfile command 1 0 0 .000
control file heartbeat 1 4 1 3.910
61 ligne(s) sΘlectionnΘe(s).
Hi
following is the "SQL*Net message from dblink" Reference Note from Metalink. The note id is 34396.1. See if it can help you in isolating the problem
HTH
Vishal
_________START OF THE REFERENCE NOTE_________________-
This is a reference note for the wait event "SQL*Net message from dblink" which includes the following subsections:
Brief definition
Individual wait details (eg: For waits seen in <View:V$SESSION_WAIT>)
Systemwide wait details (eg: For waits seen in <View:V$SYSTEM_EVENT>)
Reducing waits / wait times
See Note 61998.1 for an introduction to Wait Events.
Definition:
Versions:7.0 - 9.2 Documentation: 9.0
The Oracle shadow process is waiting for a message over a database link from a remote process. Note that this wait is also used when waiting for data from "extproc" or from a remote gateway process.
Individual Waits:
Parameters:
P1 = driver id
P2 = bytes#
P3 = Not used
driver id
In Oracle8i onwards P1RAW can be decoded into ASCII characters to give a clue as to which Net driver is used.
Eg: P1RAW=0x62657100 = 'beq\0' , P1RAW=0x54435000 = 'TCP\0' etc.
In earlier releases the value here is the value of the disconnect function of the Net driver being used (which is not much use).
bytes# (number of bytes received)
The number of bytes we need to receive. (This is often just 1 even though the expected packet will be much larger)
Wait Time:
This wait blocks until a message is received from the remote connection (or until an abnormal end of file condition occurs on the underlying Net transport layer). There is no Oracle timeout on the wait.
Finding Blockers:
The blocker is the network plus the remote process. If the remote process is another database instance (accessed via a database link) then the information in <View:V$SESSION> on the REMOTE database will have the same client details as the V$SESSION entry on the local database for the blocked session. You need to look at the remote instance to determine where that session is spending time.
Systemwide Waits:
If systemwide waits for this event are significant it is best to determine where the remote connections are to and switch attention to the remote instance / instances to determine where they are spending time. One can also look at:
Sessions with high values in <View:V$SESSTAT> for:
SQL*Net roundtrips to/from dblink
bytes sent via SQL*Net to dblink
bytes received via SQL*Net from dblink
The Network between the local and remote systems (problems are usually related to time spent ON the remote instance rather than in the network but it is worth checking if the network between the instances is slow / not)
Reducing Waits / Wait times:
It is normally best to see what propertion of the time is actually spent on the remote instance as the most common cause of long waits over a DB link is that the time is actually all spent doing work on the remote instance (either waiting or working).
Trace the session locally and at the remote instance to see what work a user "transaction" actually consists of
Check the execution plans for any distributed queries
See if frequently accessed remote data can be kept in a local snapshot (Materialized View).
following is the "SQL*Net message from dblink" Reference Note from Metalink. The note id is 34396.1. See if it can help you in isolating the problem
HTH
Vishal
_________START OF THE REFERENCE NOTE_________________-
This is a reference note for the wait event "SQL*Net message from dblink" which includes the following subsections:
Brief definition
Individual wait details (eg: For waits seen in <View:V$SESSION_WAIT>)
Systemwide wait details (eg: For waits seen in <View:V$SYSTEM_EVENT>)
Reducing waits / wait times
See Note 61998.1 for an introduction to Wait Events.
Definition:
Versions:7.0 - 9.2 Documentation: 9.0
The Oracle shadow process is waiting for a message over a database link from a remote process. Note that this wait is also used when waiting for data from "extproc" or from a remote gateway process.
Individual Waits:
Parameters:
P1 = driver id
P2 = bytes#
P3 = Not used
driver id
In Oracle8i onwards P1RAW can be decoded into ASCII characters to give a clue as to which Net driver is used.
Eg: P1RAW=0x62657100 = 'beq\0' , P1RAW=0x54435000 = 'TCP\0' etc.
In earlier releases the value here is the value of the disconnect function of the Net driver being used (which is not much use).
bytes# (number of bytes received)
The number of bytes we need to receive. (This is often just 1 even though the expected packet will be much larger)
Wait Time:
This wait blocks until a message is received from the remote connection (or until an abnormal end of file condition occurs on the underlying Net transport layer). There is no Oracle timeout on the wait.
Finding Blockers:
The blocker is the network plus the remote process. If the remote process is another database instance (accessed via a database link) then the information in <View:V$SESSION> on the REMOTE database will have the same client details as the V$SESSION entry on the local database for the blocked session. You need to look at the remote instance to determine where that session is spending time.
Systemwide Waits:
If systemwide waits for this event are significant it is best to determine where the remote connections are to and switch attention to the remote instance / instances to determine where they are spending time. One can also look at:
Sessions with high values in <View:V$SESSTAT> for:
SQL*Net roundtrips to/from dblink
bytes sent via SQL*Net to dblink
bytes received via SQL*Net from dblink
The Network between the local and remote systems (problems are usually related to time spent ON the remote instance rather than in the network but it is worth checking if the network between the instances is slow / not)
Reducing Waits / Wait times:
It is normally best to see what propertion of the time is actually spent on the remote instance as the most common cause of long waits over a DB link is that the time is actually all spent doing work on the remote instance (either waiting or working).
Trace the session locally and at the remote instance to see what work a user "transaction" actually consists of
Check the execution plans for any distributed queries
See if frequently accessed remote data can be kept in a local snapshot (Materialized View).
ASKER
Ok guys, I found that my database suffered several problems:
TEMP tablespace close to full (and no autogrow)
DATA tablespace close to full , but with LOW file increase
jobs that failed and rerun automatically all the time
Problems all solved now, tips used, performance is up again.
Thanks for the support
I will distribute the points for some new things learned!
pbocanegra (temp tablespace)
paquicuba (DRIVING_SITE hint)
riazpk (dbms_xplan.display)
vishal68 (bind variables / DBLINK)
TEMP tablespace close to full (and no autogrow)
DATA tablespace close to full , but with LOW file increase
jobs that failed and rerun automatically all the time
Problems all solved now, tips used, performance is up again.
Thanks for the support
I will distribute the points for some new things learned!
pbocanegra (temp tablespace)
paquicuba (DRIVING_SITE hint)
riazpk (dbms_xplan.display)
vishal68 (bind variables / DBLINK)
ASKER
tnsping only reflects the time to connect, not the query performance as far as I know. if I am wrong there, let me read some article, please.
anyhow:
>The SORT Area
in how far the sort area is considered if there is no ORDER BY or GROUP BY in the query?
> Temporary tablespace
has 6 GB of which 1.5 are free. should be fine so far.
@paquicuba:
I tried with both with and without (additional) index hint, no change.
the explain plan, when run on the PROD, uses the INDEX
the explain plan, when run on the DW, only shows FILTER on REMOTE table