juliakir asked:
Sybase 12.5: Compare Table Load Performance between Dev and Prod box

Hi Experts,

I have a job that moves data from DB2 to Sybase (4 million records) using Informatica ETL. On the Dev box it takes 23 min (drop indexes, load, create indexes).
On the Prod box it takes around 2-3 hours. The DBAs say that because of the SRDF disks Prod will always be slower, but why is the performance SO much different? The table has ~10 columns and 2 indexes, which are dropped and recreated during the job run. Creating the indexes takes ~20 to 50 min on the Prod box. Why?

Appreciate your help
Julia
ASKER CERTIFIED SOLUTION
grant300
juliakir (ASKER):
Bill,
Thanks so much for your answer. I learned a lot.

I have a feeling our SRDF is set up in synchronous mode, as my company is big on fault-tolerance and recovery.
But even if this is the case, do you think it explains the following numbers I got from today's run?

Informatica runs in bulk mode. Here is an excerpt from the log file:
READER_1_1_1> BLKR_16003 Initialization completed successfully.
WRITER_1_1_1> WRT_8146 Writer: Target is database [NYDBNAME], user [username] bulk mode [ON]
WRITER_1_1_1> WRT_8106 Warning! Bulk Mode session - recovery is not guaranteed.
WRITER_1_1_1> CMN_1021 Database driver event...

The job inserts 100K rows at a time, and inserts exactly the same number of rows in QA and Prod.
QA today:
20 min to load the table
2 min to create the 2 indexes !!!
Prod today:
Load - 1 hour 12 min
Indexes - 52 min !!!

QA inserts rows at a rate of 100K per 30 sec
Prod inserts rows at a rate of 100K per 2-4 min

Your help is greatly appreciated
Julia


Bill's points are, as usual, spot on, and it would be good to confirm (via the DBAs and sys admins) that nothing is going wrong at the hardware or Sybase configuration level.

I am pretty sure your problem will be the SRDF in synchronous mode.

I've been part of a large SRDF deployment in an extremely risk-averse financial client. All of their initial testing of the synchronous performance hit had been on OLTP loads. They were rather dismayed to find - as you seem to be - a slowdown of 4-5 times for major batch jobs. (Which is when they called me in. 8-) )

Synchronous SRDF seems to be designed for, and can handle, a constant trickle of intermittent I/Os, and it gasps and dies when it's hit with a serious flood. Bulk loads and create indexes are certainly causing plenty of writes in your database.

Some specific suggestions for you, based on what I found at that client:

- when creating indexes, make sure you create any clustered index first. If you have any non-clustered indexes, (re)creating the clustered index afterwards forces them all to be rebuilt.

- check with the sys admins and storage admins to make sure the SRDF cluster isn't undergoing any other major scheduled operations (e.g. backups) at the same time as your bulk load

- in this scenario (if I'm right about it being SRDF), a lot of Sybase memory (with an otherwise default configuration) can actually be a bad thing. With a huge Sybase cache, writes pile up in memory and then get flushed all at once, overwhelming the I/O systems. Don't decrease the memory; instead, even out the cache flushes by decreasing the Sybase parameter "recovery interval in minutes" and *cautiously* increasing "housekeeper free write percent" (a sketch of these commands follows this list)

- follow all of Bill's suggestions above, they make good sense whether or not SRDF is the culprit

- lastly, there is another very simple explanation - what is the main difference between DEV and PROD?... the users! Ordinary user activity could be blocking these load and index jobs. Depending on the applications and front-end tools, even something as simple as a user leaving a session open when they go home for the day could be part of the problem. (I am presuming you're running your load out of hours.) Check for out-of-hours reporting jobs, and DBA housekeeping jobs like update statistics. In a large organisation it is surprisingly easy for the main DBA team not to know about all of these! They will know how to check for blocking processes during your jobs.
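To illustrate the cache-flush tuning in the third point, here is a minimal sketch. The values are illustrative starting points only, not recommendations, and these commands need sa_role:

-- Check the current settings first
sp_configure "recovery interval in minutes"
go
sp_configure "housekeeper free write percent"
go

-- Flush dirty pages more often so writes trickle out instead of arriving in floods
-- (the ASE default is 5 minutes; 3 is just an illustrative value)
sp_configure "recovery interval in minutes", 3
go

-- Let the housekeeper write dirty pages a little more aggressively during idle cycles
-- (the ASE default is 1 percent; increase cautiously and watch the I/O load)
sp_configure "housekeeper free write percent", 2
go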
I had one other thought on the clustered index build, though it may be a long shot.

If the data is loaded into your table already in sorted order, you can use an option on the index create to tell it so. This might be the case if you are adding transactions in sequence to the end of the table. The clustered index build is a lot faster and does a fraction of the I/O if it does not have to reorder the base table.
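If your data really does arrive in clustered-key order, the option looks like this - a sketch only, with made-up table and index names:

-- sorted_data skips the sort/reorder step; ASE aborts the build
-- if it finds a row out of index-key order
create clustered index trans_table_ci
on trans_table (trade_date, account_id, seq_no)
with sorted_data
go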

Joe's suggestion of looking for competing jobs is a great one. I assumed somebody had taken that into account already but, of course, it is not an ideal world. I thought that a bulk load operation puts a lock on the table which would prevent other writes, but I may be wrong or remembering obsolete info from prior versions. If it does write-lock the table, you need only look for jobs that are doing a lot of database writing and clogging up the SRDF channel. All reads are satisfied from the local disk array and are not affected by SRDF.

Best of luck,
Bill
One thing to note: I revoke select permissions on the table before the job starts and grant them back after the index rebuild. Will this eliminate the impact of user activity during the load?
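For reference, the statements described above would look something like this (table name hypothetical):

-- Before the load: block ordinary readers
revoke select on load_table from public
go
-- ... drop indexes, bulk load, recreate indexes ...
-- After the index rebuild: restore access
grant select on load_table to public
go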

Another thing to note: when the job was first created, the load ran fine in Dev with the same performance without dropping indexes. After we moved to Prod, the same job took around 5 hours without dropping indexes. That is why I changed it to drop and recreate the indexes.

Also, recently the time for index creation has not been consistent. Some days it takes 17 min, some days 1 hour.

In your opinion, how long should it take to create 2 indexes on a 4 million row table (it takes a couple of minutes in Dev)? One index covers 4 columns, the second covers 3.

Thanks a lot
Julia
It would seem your DEV box timings put a floor under what it should take.  SRDF is going to slow production down from that.

The inconsistencies probably stem not only from the load on the database server but also from the aggregate load on the entire disk array that implements SRDF. You could have a lot of other activity on entirely different servers causing a flood of writes that SRDF has to deal with. It may well be that the link between the two arrays gets saturated by processing that is not even on the server you are using.

I don't know how far apart the arrays are or what kind of MAN/WAN is implemented, but if this is a chronic problem for a bunch of applications/servers, your DBAs/sys admins may want to look at either increasing the bandwidth of the SRDF connection or adding an additional link.

Another option is to have the sys admins balance the write load on the array by looking at all of the write-intensive batch processing array-wide. Most nightly batch cycles have a pretty fair share of read-intensive jobs, e.g. reporting, extraction, update statistics, etc., that won't affect the SRDF. Spreading the write-intensive jobs out as much as possible during the batch window should improve everyone's lot.

Regards,
Bill
What Bill said. 8-)

Revoking permissions is not enough to guarantee zero user activity. If nothing else, DBA housekeeping jobs are probably running with DBA privileges, which you cannot revoke. When you consider how many canned reports, scripts in cron jobs, etc., might also be running as "dbo", then no, this is definitely not enough to assume nothing else is happening within the database.

Even if it did, other activity in the same ASE could still be harming the performance of these jobs.

Even if there wasn't anything else in the same ASE, other activity on the same host could still be harming the performance of these jobs.

Even if there's nothing else running on the host, other activity on the same network switch, and/or disk systems, and/or network infrastructure used by the SRDF could still be harming the performance of these jobs.

... I think you get the idea. 8-)

The very fact that you can run the same job and have its timings vary so much tells us right away that there is inconsistent but interfering load occurring somewhere where it hurts. Your DBAs should be able to tell you if that's happening in the same tables, databases or ASE. They can probably tell you if it's happening on the same host. Your sys admins should be able to tell you about the host or network switch. Your storage people should be able to tell you about the SRDF and all the systems it needs to touch.
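A quick way for the DBAs to check for blockers while your job is running - standard ASE tools, nothing exotic:

-- A non-zero value in the "blk" column means that spid is blocked by another process
sp_who
go

-- Shows who holds which locks; look for exclusive locks on your target table
sp_lock
go

-- Or query it directly: list only the blocked sessions
select spid, blocked, cmd, db_name(dbid), program_name
from master..sysprocesses
where blocked != 0
go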

I think you have two problems:

1) synchronous SRDF is slowing things down by a factor of 4-5. This, unfortunately, is the true price of block-level disk replication, which the storage vendors do not want you to know. Their benchmarks are typically based on reads or OLTP workloads; batch jobs with many writes kill this technology, *particularly* in synchronous mode.

2) that said, you also have other factors intruding. You need to isolate what else is clashing. This will not always be obvious! My advice is to ask about DBA housekeeping (including database backups) and storage housekeeping (SAN backups to tape, say), as those are usually the I/O killers. Maybe another host is also doing major batch writes to the same SAN while your job is running?


> In your opinion, how long should it take to create 2 indexes on a 4 million row table (it takes a couple of minutes in Dev)? One index covers 4 columns, the second covers 3.

Very difficult to say without knowing more. Is the hardware identical between DEV and PROD? Are the Sybase environments identical? Are the tables the same size (# of rows of data) with the *same* data in them? But since we would generally expect any differences to favour PROD over DEV, we would expect things to be at least as fast in PROD as in DEV.

... although that said there are some shortcuts sometimes taken in DEV which aren't appropriate in PROD, like setting the DSYNC option to false for DEV devices but true for PROD, and that could slow things down in PROD. I've seen that happen. I've also seen someone accidentally leave a PROD tempdb on DSYNC=true and that made PROD slower than DEV. Another thing to ask your DBAs about. 8-)
Bill and Joe,

Thank you VERY much. My husband (leonstryker) tells me you are the best!

In response to Joe's last posting: the tables in Dev and Prod are exactly the same. Same number of rows and columns.
            NYDBNAME (Sybase Adaptive Server Enterprise/12.5.4/EBF 14080 ESD#3/P/x86_64/Enterprise Linux/ase1254/2030/64-bit/OPT/Fri Oct 13 21:22:28 2006)

Sorry if this is a silly question, but can you explain what DSYNC=true means?

My poor DBAs... I got some training this weekend :)

Thank you very much
Julia
A little bit of context:

A device is a bit of disk that's been reserved for one particular ASE instance. Creating a device means that amount of space can no longer be used by anything else, but nothing in Sybase is using it yet. We create and alter (extend) databases onto one or more devices - it's common, though not necessary, for a database to span multiple devices, and for a device to contain more than one piece of a database (same or different databases).
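To make that concrete, a sketch with invented names and sizes:

-- Reserve 2 GB of disk as a Sybase device (a "cooked" file device here)
disk init
name = "data_dev1",
physname = "/sybase/devices/data_dev1.dat",
size = "2G"
go

-- Put a database (or a piece of one) on that device; sizes are in MB
-- (log_dev1 assumed to already exist, created the same way)
create database mydb
on data_dev1 = 1024
log on log_dev1 = 512
go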

In "the old days" there was a long battle about what type of disk was best to use for Sybase devices - ordinary (UNIX) files, or "raw" partitions (unformatted disk that didn't have a file system on it). I won't revive this battle now! DSYNC is an option that was introduced in ASE 12.0 to allow "cooked" (or file) devices under UNIX to have some of the same protections & guarantees that raw partitions offer.

DSYNC is an attribute that you can set per device. You can see its current setting by running "sp_helpdevice" (optionally with the name of a device; without one, it gives you a summary of all devices in the ASE). By default it is ON (TRUE), which is to say, "make files as safe as raw partitions". However, in this case "safe" also means "slower", so it's not unusual for DSYNC to be manually set to false for all devices in a DEV server, since you don't usually care so much about recoverability there. This is one way DEV could be faster than PROD even without SRDF as a factor.

The other way I mentioned is because DSYNC is now true by default, that means you have it all the time even when (in PROD) you might not want to. Tempdb is a special system database used for strictly temporary operations. Because it is recreated any time ASE is started, its contents are lost in every shutdown, both planned and unplanned. Generally this means we don't have to protect it as vigorously as the other databases since it will be lost no matter what we do anyway, right? So we would happily sacrifice the recoverability given by DSYNC=true for tempdb in exchange for more performance in tempdb via DSYNC=false for all tempdb devices.

But remember TRUE is the default, so it's entirely possible someone created or extended tempdb in PROD and forgot to set its device(s) to DSYNC=false. If tempdb was DSYNC false in DEV but not in PROD, we would expect many things (including some create index commands) to be slower in PROD than in DEV. It's easy to overlook; I'd guess every DBA has done it at some point. I sure have. 8-)
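Checking and fixing it looks like this (device name is just an example; the change may not take effect until ASE is restarted, since tempdb devices are always in use):

-- See the dsync setting for every device (or name one device to narrow it down)
sp_helpdevice
go

-- Trade recoverability for speed on a tempdb device
sp_deviceattr "tempdb_data1", "dsync", "false"
go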
Thanks Joe. I ran sp_helpdevice. Looks like it's set to ON in Dev and OFF in Prod.

Prod
tempdb_data1      /tempdblocal/NYDBNAME_tempdb_data1.dat            special, dsync off, physical disk, 1000.00 MB      2      0      2      33554432      34066431
tempdb_log1      /tempdbfs/NYDBNAME_tempdb_log1.dat                special, dsync off, physical disk, 500.00 MB      2      0      3      50331648      50587647
tempdbsa_data1      /tempdblocal/NYDBNAME_tempdbsa_data1.dat          special, dsync off, physical disk, 100.00 MB      2      0      4      67108864      67160063
tempdbsa_log1      /tempdbfs/NYDBNAME_tempdbsa_log1.dat              special, dsync off, physical disk, 25.00 MB      2      0      5      83886080      83898879

thanks a lot
That's interesting. It looks as if the DBAs have chosen to split the log and data portions of tempdb onto different devices. Even more curious, the log and data segments are on different physical roots, /tempdblocal versus /tempdbfs.

I wonder if /tempdbfs is actually a tmpfs (ramdisk) and they decided to speed things up by putting the tempdb log on it. That makes a certain amount of sense, since only the logging is written to disk synchronously, assuming there is enough regular database cache to keep everything else in memory. Most tempdb objects live only a short time, and their dirty pages never actually get written to disk.

Hopefully /tempdblocal is just that, local, and is not SRDFed.

BTW, there is an additional option for devices starting with Sybase ASE 15: DIRECT IO. It allows you to use file system devices but, once they are created, essentially treats them as raw devices. It gives you all the benefits of file system devices with the reliability and speed of raw.
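A sketch of what that looks like in ASE 15 (names invented; note that dsync and directio are mutually exclusive on a device):

-- ASE 15+: a file system device that bypasses the OS buffer cache
disk init
name = "data_dev2",
physname = "/sybase/devices/data_dev2.dat",
size = "2G",
directio = true
go

-- Or switch an existing device over (dsync must be off first)
sp_deviceattr "data_dev2", "directio", "true"
go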

Regards,
Bill
Julia: OK, DSYNC OFF for tempdb, which is what we want. But the fact that it's ON in DEV and OFF in PROD says it should be faster in PROD, not slower. Hmm. As Bill says, we should make sure none of those tempdb devices are included in the SRDF, though.

Bill, separating data & log in tempdb made a lot of sense before ASE 11.9.2, because ULC behaviour was disabled in any database with data & log mixed - including tempdb! - until 11.9.2. These days the need is less urgent, but I still routinely do it at my sites, as it makes it easier to keep track of tempdb space usage when you can report separately on data & log.