Sybase 12.5: Compare Table Load Performance between Dev and Prod box

Hi Experts,

I have a job that moves data from DB2 to Sybase (4 million records) using Informatica ETL.  On the Dev box it takes 23 min (drop indexes, load, create indexes).
On the Prod box it takes around 2-3 hours. The DBAs say Prod will always be slower because of the SRDF disks, but why is the performance SO much different? The table has ~10 columns and 2 indexes, which are recreated during the job run. Creating the indexes takes ~20 to 50 min on the Prod box. Why?

Appreciate your help
Julia
juliakir Asked:
 
grant300 Commented:
Something is very wrong; the problem is that there are a bunch of possibilities.  In a generic sense there are three:
 - Insufficient memory, either available on the machine or configured (properly) in Sybase
 - Paging of memory
 - High disk I/O latency and/or low disk bandwidth -- the most likely cause with SRDF.

For the purposes of this discussion, I am going to assume that the PROD box is equal to or greater than the DEV box in capability.  I will also assume you are using the Bulk Load facility from Informatica.  Since you are dropping and recreating the indexes, I will also assume you are getting fast bulk copy to work.  If, in fact, you are doing INSERTs from Informatica, you will be beating the heck out of the Log, and that narrows down your problem space.

Unfortunately, all but one of the things you might try fall squarely in DBA/Sys Admin territory, so there may not be much you can do yourself.

 - Start by comparing the Sybase server configurations.  In particular, make certain the memory configuration is set properly on PROD.  Not only do you have to set the memory that Sybase can use, you have to increase the size of the default buffer cache as well.  I have seen this mistake more times than I can count.
 - Make certain that the PROD server actually has enough memory to prevent paging of Sybase.  Paging kills the database.  I have seen a box where a very smart BIOS disabled a bad memory chip (and its pair) and no one noticed.  It may also be that other processes running on the PROD box are squeezing Sybase out of RAM.  Sybase should be configured to allocate all of its memory at startup and lock it into RAM; it will throw an error at startup if it cannot do what you ask of it.
 - Now run TOP and compare the two machines under load.  In particular, look at the amount of "I/O WAIT" in the CPU times.  If you are badly disk bound, as I suspect you are, you will see it here.
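
For the configuration comparison, a minimal sketch of what to run in isql on each server; the cache name is the ASE default, and the 1600M figure is purely a placeholder to fit to your machine:

```sql
-- Total memory ASE is allowed to allocate:
sp_configure "max memory"
go
-- Size and buffer-pool layout of the default data cache:
sp_cacheconfig "default data cache"
go
-- If PROD's default data cache turns out undersized, grow it
-- (example size only -- fit it to the machine):
sp_cacheconfig "default data cache", "1600M"
go
```

Run the first two on both boxes and diff the output; any large gap in "max memory" or the default data cache size is a prime suspect.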
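
And to put a number on the "I/O WAIT" comparison without staring at top, a rough Linux-only sketch; it assumes the standard /proc/stat first-line field order (user, nice, system, idle, iowait):

```shell
# Sample the aggregate CPU counters twice from /proc/stat and report
# what share of the interval was spent in iowait.
read -r cpu u n s idle iow rest < /proc/stat
sleep 1
read -r cpu u2 n2 s2 idle2 iow2 rest < /proc/stat
total=$(( (u2 + n2 + s2 + idle2 + iow2) - (u + n + s + idle + iow) ))
wait_ticks=$(( iow2 - iow ))
if [ "$total" -gt 0 ]; then
    echo "iowait share: $(( 100 * wait_ticks / total ))%"
fi
```

Run it on DEV and PROD while the load is executing; a PROD figure dramatically higher than DEV points straight at the disk subsystem.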

Since you are using EMC SRDF, one possibility is that your DBAs have it set up in synchronous mode.  That means that when the server issues an I/O, it is not reported as complete until the write occurs on the far end and an acknowledgment is returned.  This adds the network latency for transmitting the data blocks and receiving the ACK to every single I/O.  Could that slow things down by a factor of 5?  Absolutely.

Most of the time, you probably don't see the SRDF impact too badly because the system is not loaded to 100 percent, at least not in a sustained fashion.  Most small writes (with the exception of LOGs) are cached and don't actually hit the disk until the checkpoint writer process pushes dirty pages to disk.  Since this happens in the background, you don't see how long it is taking.

Your load process has a boat load of data, fills up the cache, and the dirty pages need to get flushed to disk before you can put more data in.  That puts the large SRDF latencies in the execution path for your job.  You may be able to mitigate this by increasing the amount of memory available to Sybase.  You are probably only loading 400-800MB of data so even a 32 bit installation should be able to assign that much cache to the database server.

If you are doing INSERTs from Informatica, your Log writes are the limiting factor.  If you have a large enough installation, that is, at least 4 CPUs with 4 Sybase engines lit up, you can turn on asynchronous log writes.  This will decouple your process from the disk latencies to some extent.  You might also consider increasing the size of your user log I/Os to 8 or 16KB.  SRDF adds more latency per I/O than it does per KB transferred, so making fewer, larger I/Os should help throughput.
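
The log I/O size change might look like this; the pool size is a placeholder, and sp_logiosize needs a matching buffer pool in whatever cache the database's log uses (the default data cache here). Database name is taken from your log excerpt:

```sql
-- Create a 16K buffer pool in the default data cache (size illustrative):
sp_poolconfig "default data cache", "50M", "16K"
go
-- From inside the target database, switch its log I/O size to 16K:
use NYDBNAME
go
sp_logiosize "16"
go
```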

While it won't help your INDEX builds, you might see significant benefit on the PROD box by modifying the Informatica job to use the Sybase Bulk facility.  I don't know which version of Informatica you are running, but I thought (I could be wrong) that later versions support the bulk interface directly.  If not, you can change the job to write out a delimited flat file and fire off BCP (the Sybase bulk copy utility) to actually load the data.  With no indexes or triggers, the load is "minimally logged": the rows are formatted and written directly to table pages, and the only logging is that required to keep track of space utilization.  Bulk loading, along with having enough memory/cache configured to contain the majority of the new data, should alleviate the SRDF bottleneck.  It is also the one thing you have complete control over.
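
If you end up scripting it yourself, the BCP step might look roughly like this; server name, login, table name, delimiter, and batch size are all placeholders to adapt:

```shell
# Character-mode bulk load of a pipe-delimited flat file.
# -c character mode, -t field terminator, -b rows per batch commit.
bcp NYDBNAME..target_table in target_table.dat -c -t '|' \
    -b 10000 -S PRODSERVER -U load_user -P load_password
```

With no indexes or triggers on the table, this takes the minimally logged fast path described above.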

Another thing you might want to talk to the DBAs about next time they do an upgrade is increasing the Sybase page size.  The default of 2KB is way too small for the way disk hardware works these days.  Even on small machines, I always set the page size to 8KB.  Again, fewer, larger I/Os will help in your situation.

BTW, the tempdb devices should not be SRDFed at all.  tempdb is recreated each time the database server starts up.  In addition, many systems pound the heck out of tempdb, so the last thing you want is to put it on a slow device.  I have actually had good success putting tempdb on tmpfs (ramdisk) file systems.

You can ask the DBAs/Sys Admins to take the volumes Sybase lives on and change the SRDF setup to run asynchronously; however, that brings its own set of issues.  It will mitigate your performance problems but may result in an unrecoverable database at the far end under certain failure modes.  Databases like to control, or at least know, the order in which things are written to disk.  I would call Sybase tech support and see what they say about SRDF and similar remote disk mirroring, and whether there is a recommended (or at least relatively safe) way to configure it for asynchronous operation.

A better solution for remote hot standby of a Sybase database is to use Replication Server.  Rep Server is very unobtrusive to the PROD server because it reads the transaction log rather than adding triggers to tables.  It reconstructs the operations that took place from the log and transmits them to the remote server, where an agent applies them.  Rep Server also has an internal store-and-forward component that ensures nothing gets dropped on the ground in any failure mode of server or network.  It also has other clever optimizations, such as replicating stored procedure calls rather than all of the changed data the stored procedure might produce.  Depending on your application, this can be quite an advantage.

Hope this helps out.

Regards,
Bill
 
juliakir (Author) Commented:
Bill,
Thanks so much for your answer. I learned a lot.

I have a feeling our SRDF is set up in synchronous mode, as my company is big on fault-tolerance and recovery.
But even if this is the case, do you think it explains the following numbers I got from today's run?

Informatica runs in bulk mode. Here is an excerpt from the log file:
READER_1_1_1> BLKR_16003 Initialization completed successfully.
WRITER_1_1_1> WRT_8146 Writer: Target is database [NYDBNAME], user [username] bulk mode [ON]
WRITER_1_1_1> WRT_8106 Warning! Bulk Mode session - recovery is not guaranteed.
WRITER_1_1_1> CMN_1021 Database driver event...

The job inserts 100K rows at a time, and inserts exactly the same number of rows in QA and Prod.
QA today:
Took 20 min today to load table
2 min to create 2 indexes !!!
Prod today:
Load - 1:12 min
Indexes 52 min !!!

QA inserts rows at a rate of 100K per 30 sec.
Prod inserts rows at a rate of 100K per 2-4 min.

Your help is greatly appreciated
Julia


 
juliakir (Author) Commented:
small correction

The Prod load time above should read 1 hour 12 min, not 1:12 min.
 
Joe Woodhouse (Principal Consultant) Commented:
Bill's points are, as usual, spot on, and it would be good to confirm (via the DBAs and sys admins) that nothing is going wrong at the hardware or Sybase configuration level.

I am pretty sure your problem will be the SRDF in synchronous mode.

I've been part of a large SRDF deployment in an extremely risk-averse financial client. All of their initial testing of the synchronous performance hit had been on OLTP loads. They were rather dismayed to find - as you seem to be - a slowdown of 4-5 times for major batch jobs. (Which is when they called me in. 8-) )

Synchronous SRDF seems to be designed for, and can handle, a constant trickle of intermittent I/Os, but it gasps and dies when it is hit with a serious flood. Bulk loads and index creates certainly cause plenty of writes in your database.

Some specific suggestions for you, based on what I found at that client:

- when creating indexes, make sure you create any clustered indexes first. If you have any non-clustered indexes, (re)creating the clustered index forces them all to be rebuilt.
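
For example (table and index names invented for illustration):

```sql
-- Clustered index first: building it physically re-sorts the table,
-- which would force any existing nonclustered indexes to be rebuilt.
create clustered index ix_trades_clu on trades (trade_date, account_id)
go
-- Nonclustered indexes only after the clustered one exists:
create nonclustered index ix_trades_nc on trades (account_id, symbol, qty)
go
```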

- check with the sys admins and storage admins to make sure the SRDF cluster isn't undergoing any other major scheduled operations (e.g. backups) at the same time as your bulk load

- in this scenario (if I'm right about it being SRDF), having a lot of Sybase memory (with an otherwise default configuration) can actually be a bad thing. With huge Sybase memory, writes go into cache and are then flushed all at once, overwhelming the I/O systems. Don't decrease the memory; instead, even out the cache flushes by decreasing the Sybase parameter "recovery interval in minutes" and *cautiously* increasing "housekeeper free write percent"
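
A sketch of those two changes; the values are illustrative starting points to tune, not recommendations:

```sql
-- Checkpoint more often so dirty pages drain steadily instead of in bursts:
sp_configure "recovery interval in minutes", 3
go
-- Let the housekeeper task write dirty pages a bit more aggressively
-- (the default is 1; increase cautiously):
sp_configure "housekeeper free write percent", 2
go
```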

- follow all of Bill's suggestions above, they make good sense whether or not SRDF is the culprit

- lastly, there is another very simple explanation - what is the main difference between DEV and PROD? The users! Ordinary user activity could be blocking these load and index jobs. Depending on the applications and front-end tools, even something as simple as a user leaving a session open when they go home for the day could be part of the problem. (I am presuming you run your load out of hours.) Check for out-of-hours reporting jobs and DBA housekeeping jobs such as update statistics. In a large organisation it is surprisingly easy for the main DBA team not to know about all of these! They will know how to check for blocking processes during your jobs.
 
grant300 Commented:
I had one other thought on the clustered index build, though it may be a long shot.

If the data is loaded into your table already in sorted order, you can use an option on the index create to tell it so.  This might be the case if you are adding transactions in sequence to the end of the table.  The clustered index build is a lot faster, and does a fraction of the I/O, if it does not have to reorder the base table.
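
The option Bill means is the sorted_data clause on create index; a sketch with invented names:

```sql
-- If the rows are already in clustered-key order, this skips the sort and
-- the base-table rewrite; ASE raises an error if it finds unsorted rows.
create clustered index ix_trades_clu on trades (trade_date, account_id)
with sorted_data
go
```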

Joe's suggestion of looking for competing jobs is a good one.  I assumed somebody had taken that into account already but, of course, it is not an ideal world.  I thought that a bulk load operation puts a lock on the table, which would prevent other writes, but I may be wrong or remembering obsolete info from prior versions.  If it does write-lock the table, you need only look for jobs that are doing a lot of database writing and clogging up the SRDF channel.  All reads are satisfied from the local disk array and are not affected by SRDF.

Best of luck,
Bill
 
juliakir (Author) Commented:
One thing to note: I revoke select permissions on the table before the job starts and grant them back after the index rebuild. Will this eliminate the impact of user activity during the load?

Another thing to note: when the job was created initially, the load ran fine in Dev without dropping indexes. After we moved to Prod, the same job took around 5 hours without dropping indexes. That is why I changed it to drop and recreate the indexes.

Also, recently the time for index creation has not been consistent. Some days it takes 17 min, some days 1 hour.

In your opinion, how long should it take to create 2 indexes on a 4-million-row table (it takes a couple of minutes in Dev)? One index has 4 columns, the second has 3.

Thanks a lot
Julia
 
grant300 Commented:
It would seem your DEV box timings put a floor under what it should take.  SRDF is going to slow production down from that.

The inconsistencies probably stem not only from the load on the database server but from the aggregate load on the entire disk array that is implementing SRDF.  You could have a lot of other activity on entirely different servers causing a flood of writes that SRDF needs to deal with.  It may well be that the link between the two arrays gets saturated by processing that is not even on the server you are using.

I don't know how far apart the arrays are or what kind of MAN/WAN is implemented, but if this is a chronic problem for a bunch of applications/servers, your DBAs/Sys Admins may want to look at either increasing the bandwidth of the SRDF connection or adding an additional link.

Another option is to have the Sys Admins balance the write load on the array by looking at all of the write-intensive batch processing array-wide.  Most nightly batch cycles have a fair share of read-intensive jobs, e.g. reporting, extraction, update statistics, etc., that won't generate SRDF traffic.  Spreading the write-intensive jobs out as much as possible during the batch window should improve everyone's lot.

Regards,
Bill
 
Joe Woodhouse (Principal Consultant) Commented:
What Bill said. 8-)

Revoking permissions is not enough to guarantee zero user activity. If nothing else, DBA housekeeping jobs are probably running with DBA privileges, which you cannot revoke. When you consider how many canned reports, scripts in cron jobs, etc., might also be running as "dbo", then no, this is definitely not enough to assume nothing else is happening within the database.

Even if it did, other activity in the same ASE could still be harming the performance of these jobs.

Even if there wasn't anything else in the same ASE, other activity on the same host could still be harming the performance of these jobs.

Even if there's nothing else running on the host, other activity on the same network switch, and/or disk systems, and/or network infrastructure used by the SRDF could still be harming the performance of these jobs.

... I think you get the idea. 8-)

The very fact that you can run the same job and have it vary so much in timings tells us right away that there is inconsistent but interfering load occurring somewhere where it hurts. Your DBAs should be able to tell you if that's happening in the same tables, databases or ASE. They can probably tell you if it's happening on the same host. Your sys admins should be able to tell you about the host or network switch. Your storage people should be able to tell you about the SRDF and all the systems it needs to touch.

I think you have two problems:

1) synchronous SRDF is slowing things down by a factor of 4-5. This unfortunately is the true price of block-level disk replication that the storage vendors do not want you to know. Their benchmarks are typically based on reads or OLTP workloads; batch jobs with many writes kill this technology *particularly* in synchronous mode.

2) that said, you also have other factors intruding. You need to isolate what else is clashing. This will not always be obvious! My advice is to ask about DBA housekeeping (including database backups) and storage housekeeping (SAN backups to tape, say), as those are usually the I/O killers. Maybe another host is also doing major batch writes to the same SAN when your job is running?


> In you opinion how long should take to create 2 indexes on 4 mil rows table(takes couple min in Dev) ?. One indexes 4 columns . Second 3 columns.

Very difficult to say without knowing more. Are the DEV and PROD hardware identical? Are the Sybase environments identical? Are the tables the same size (# of rows of data) with the *same* data in them? Since any differences would generally favour PROD over DEV, we would expect PROD to be at least as fast as DEV.

... although that said there are some shortcuts sometimes taken in DEV which aren't appropriate in PROD, like setting the DSYNC option to false for DEV devices but true for PROD, and that could slow things down in PROD. I've seen that happen. I've also seen someone accidentally leave a PROD tempdb on DSYNC=true and that made PROD slower than DEV. Another thing to ask your DBAs about. 8-)
 
juliakir (Author) Commented:
Bill and Joe ,

Thank you VERY much. My husband (leonstryker) tells me you are the best!

In response to Joe's last posting: the table in Dev and Prod is exactly the same. Same number of rows and columns.
            NYDBNAME (Sybase Adaptive Server Enterprise/12.5.4/EBF 14080 ESD#3/P/x86_64/Enterprise Linux/ase1254/2030/64-bit/OPT/Fri Oct 13 21:22:28 2006)

Sorry if this is maybe a silly question, but can you explain what DSYNC=true means?

My poor DBAs .. I got some training this weekend :)

Thank you very much
Julia
 
Joe Woodhouse (Principal Consultant) Commented:
A little bit of context:

A device is a bit of disk that's been reserved for one particular ASE instance. Creating a device means that amount of space now can't be used by anything else, but nothing in Sybase is using it yet. We create and alter (extend) databases onto one or more devices - it's common but not necessary for a database to span multiple devices, and for a device to contain more than one piece of a database on it (same or different databases).

In "the old days" there was a long battle about what type of disk was best to use for Sybase devices - ordinary (UNIX) files, or "raw" partitions (unformatted disk that didn't have a file system on it). I won't revive this battle now! DSYNC is an option that was introduced in ASE 12.0 to allow "cooked" (or file) devices under UNIX to have some of the same protections & guarantees that raw partitions offer.

DSYNC is an attribute that you can set per device. You can see its current setting by running "sp_helpdevice" (optionally with the name of a device; without it gives you a summary of all devices in this ASE). By default it is ON (TRUE), which is to say, "make files as safe as raw partitions". However in this case "safe" also means "slower", so it's not unusual for DSYNC to be set manually to be false for all devices in a DEV server since you don't usually care so much about recoverability there. This is one way DEV could be faster than PROD even without SRDF as a factor.

The other way I mentioned is because DSYNC is now true by default, that means you have it all the time even when (in PROD) you might not want to. Tempdb is a special system database used for strictly temporary operations. Because it is recreated any time ASE is started, its contents are lost in every shutdown, both planned and unplanned. Generally this means we don't have to protect it as vigorously as the other databases since it will be lost no matter what we do anyway, right? So we would happily sacrifice the recoverability given by DSYNC=true for tempdb in exchange for more performance in tempdb via DSYNC=false for all tempdb devices.

But remember TRUE is the default, so it's entirely possible someone created or extended tempdb in PROD and forgot to set its device(s) to DSYNC=false. If tempdb was DSYNC false in DEV but not in PROD we would expect many things (including some create index commands) to be slower in PROD than in DEV. It's easy to overlook, I'd guess every DBA has done it at some point. I sure have. 8-)
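
Checking and fixing it is quick; the device name below is the tempdb device name used in this thread, and note that ASE generally needs a restart for a dsync change to take effect:

```sql
-- Show the current settings for one device (no argument lists them all):
sp_helpdevice tempdb_data1
go
-- Turn dsync off for a tempdb device:
sp_deviceattr "tempdb_data1", "dsync", "false"
go
```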
 
juliakir (Author) Commented:
Thanks Joe. I ran sp_helpdevice. Looks like it is set to ON in Dev and OFF in Prod.

Prod
tempdb_data1      /tempdblocal/NYDBNAME_tempdb_data1.dat            special, dsync off, physical disk, 1000.00 MB      2      0      2      33554432      34066431
tempdb_log1      /tempdbfs/NYDBNAME_tempdb_log1.dat                special, dsync off, physical disk, 500.00 MB      2      0      3      50331648      50587647
tempdbsa_data1      /tempdblocal/NYDBNAME_tempdbsa_data1.dat          special, dsync off, physical disk, 100.00 MB      2      0      4      67108864      67160063
tempdbsa_log1      /tempdbfs/NYDBNAME_tempdbsa_log1.dat              special, dsync off, physical disk, 25.00 MB      2      0      5      83886080      83898879

thanks a lot
 
grant300 Commented:
That's interesting.  It looks as if the DBAs have chosen to split the Log and Data portions of TEMPDB onto different devices.  Even more curious, the Log and Data segments are on different physical roots, /tempdblocal versus /tempdbfs.

I wonder if /tempdbfs is not actually a tmpfs (ramdisk) and that they decided to speed things up by putting the tempdb logs on it.  That makes a certain amount of sense since only the logging is written to disk synchronously assuming there is enough regular database cache to keep it all in memory anyway.  Most tempdb objects live only a short time and the dirty pages never actually get written to disk.

Hopefully /tempdblocal is just that, local, and is not SRDFed.

BTW, there is an additional option for devices starting with Sybase ASE 15: DIRECTIO.  It allows you to use file system devices but, once they are created, essentially treats them as raw devices.  It gives you all the benefits of file system devices with the reliability and speed of raw.

Regards,
Bill
 
Joe Woodhouse (Principal Consultant) Commented:
Julia: OK, DSYNC is OFF for tempdb, which is what we wanted. But the fact that it's ON in DEV and OFF in PROD says tempdb should be faster in PROD, not slower. Hmm. As Bill says, we should make sure none of those tempdb devices are included in the SRDF, though.

Bill, data & log separated in tempdb made a lot of sense before ASE 11.9.2, because ULC behaviour was disabled in any database with data & log mixed - including tempdb! - until 11.9.2. These days there is less urgent a need for it but I still routinely do it at my sites as it makes it easier to keep track of tempdb space usage when you can separately report on data & log.