Update vs. Select Into

A large temp table (1M+ rows) needs to have a column updated from another large table (1M+ rows) inside a stored procedure. Which is better: an Update, or a Select Into another temp table followed by dropping the first? Why?
leonstrykerAsked:
 
grant300Commented:
Try it both ways (perhaps with a subset of the data) and see what really happens.  Do a
Select * Into #Tab2 From #Tab1
and time it.  That is the floor under whatever operation you are going to do.  I think you will be surprised how fast it is.

Then do a simple update of #Tab1; something like
Update #Tab1 Set Col8 = 45.123, Col10 = 'This is a test string'
and time that.  That puts the floor under your update operations.
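Something like this, for instance (times in milliseconds; a minimal sketch, so adjust the column names to your own table):

declare @start datetime

-- Floor under the Select/Into: raw cost of copying every row
select @start = getdate()
select * into #Tab2 from #Tab1
select datediff(ms, @start, getdate()) as copy_ms

-- Floor under the Update: raw cost of touching every row, no join
select @start = getdate()
update #Tab1 set Col8 = 45.123, Col10 = 'This is a test string'
select datediff(ms, @start, getdate()) as update_ms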

Either way you have to pay for the Join; these two tests just pull that out of the equation.  My suspicion is that for a relatively narrow table, the Select/Into wins.  If you have a bunch of columns, then an update might, under just the right circumstances, be better.

Creating a TEMP table and extending it with Select/Into is a very fast operation.  It is highly optimized because it is a very close sibling of the operation that the query processor does when it creates internal work tables.  There are some system table updates and space allocation is logged but for 1M+ rows, that overhead gets lost in the noise.

As far as optimizing the updates goes, you will take it on the nose with the logging and will probably be limited by the log writes to disk.  That assumes you can formulate things to do what are called in-place updates: updates where no index columns are changed and the new column values are no longer than the old ones.  If those conditions are not met, Sybase does the equivalent of a delete and insert.  If you are updating every row and you cannot guarantee in-place updates (floats work but varchars do not), you are essentially writing all new rows anyway, on top of paying for the logging.

Bottom line, in most instances you are better off with a Select/Into than an Update of all rows in a temp table.
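For example, the Select/Into route might look something like this (table and column names are invented for illustration):

select t.*, s.new_value
into #Tab2
from #Tab1 t, BigSourceTable s
where t.key_col = s.key_col
-- note: rows in #Tab1 with no match in BigSourceTable are dropped by the join

drop table #Tab1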

One thing to watch out for with temp tables and indexes in stored procedures is that, in general, an index created on a temp table in a stored procedure is not available to the optimizer within that same stored procedure.  One workaround is to create the table and index in Procedure A, then do the join in Procedure B, which is called from A.  The added complication is that when compiling Procedure B, you have to create the temp table and index in the script before the CREATE PROC statement so that it has something to compile against.
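Roughly, the pattern looks like this (procedure, table, and column names are just placeholders):

-- In the install script, before creating proc_b: a stub so it has something to compile against
create table #work (trade_id int not null, px float null)
create index work_idx on #work (trade_id)
go

create proc proc_b
as
    -- Compiled against the stub above, so the optimizer knows about work_idx
    select w.trade_id, w.px, t.price
    from #work w, trades t
    where w.trade_id = t.trade_id
go

drop table #work
go

create proc proc_a
as
    -- Build the real table and its index, then hand off to the inner proc
    select trade_id, px into #work from position_staging
    create index work_idx on #work (trade_id)
    exec proc_b
go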

One more thing you might do to speed up your temp table operations, particularly if they are logged, is to put TEMPDB on a TEMPFS.  A TEMPFS is a memory resident file system available in one form or another on Linux and Unix O/Ss.  The advantage of a TEMPDB on TEMPFS is that it has the potential to speed up any number of TEMPDB operations, not just the one you are looking at.  It also allows you to effectively use more memory on a 32 bit O/S than you can configure with just the Sybase Cache.

One more thought.  You say you have to update the temp table from another very large table which, I presume, is a regular table in a user database.  If that is the case, you are probably better off structuring your Join to leverage an index on the permanent table rather than trying to index temp tables.

Best of luck,
Bill
 
dqmqCommented:
Select Into almost certainly has more overhead: creating the table, acquiring the space, etc.  If the temp table is proactively designed so that the update doesn't cause a lot of row expansion, the update will be faster.
 
gjutrasCommented:
The usual SQL answer to any performance question is "it depends", most often followed by "try it both ways and see what is faster".
Are you updating 1, some, or all rows?
How many columns, and of what type and size?
Are the join columns indexed on the temp table and the other large table?
 
leonstrykerAuthor Commented:
But Select Into is not a logged operation while an Update is. Won't that make a big difference?
 
leonstrykerAuthor Commented:
All rows will be updated.
One or two columns are updated at a time.
The datatype is usually float or varchar.
Indexes are used, but creating them on the temp tables takes a while in itself.
 
gjutrasCommented:
I'd suggest the update would be faster in your scenario.  The large table indexes are already built and only a couple of columns need updating, versus the overhead of having to recreate a table, copy all the data again, and then recreate the indexes.
 
dqmqCommented:
>Won't that make a big difference?

That's a good point.  But logging is a sequential operation and usually pretty well optimized.  At least logging overhead would be somewhat constant.  Space allocation overhead could vary from run to run, depending on the condition of the database.  

I did some testing and found that UPDATE time was fairly stable, but SELECT INTO time was all over the place--sometimes slower; sometimes significantly faster.

I guess I'll recant my answer and side with the "it depends" contingent.
 
grant300Commented:
Unless you have a very busy system with a lot of users, space allocation in TEMPDB is not an issue.  Remember that the space underneath temp tables is freed up when the table disappears, which is pretty quickly.

As far as your timing results go, I find them very interesting.  I am curious about what other operations are going on on the system at the same time.  The SELECT INTO is probably CPU bound, so if there are other things going on in the database or on the server, you will see a slowdown.  The UPDATE is probably disk bound, which means that, unless the TEMPDB device is getting hit heavily by other operations, you will see a slower but consistent speed.  The fact that the SELECT INTO is sometimes "significantly faster" means it is using fewer resources overall and should probably be your choice, all else being equal.

Bill
 
leonstrykerAuthor Commented:
The stored proc is run as part of an EOD process and generates a Risk report. On the production server it has been known to run for an hour, but in all fairness it actually runs twice (to get the difference between the two days).

Tempdb gets hit very heavily by this procedure. A number of large temp tables are created and exist for the duration of the procedure. There may also be other reports, though somewhat smaller, running at the same time.
 
grant300Commented:
Either way you go, you are hitting tempdb pretty hard.  I would have a serious discussion with your DBA about putting tempdb on a tempfs.  It has tremendous potential to speed things up, particularly if they are logged operations.

He/she does not have to move the existing tempdb.  Sybase, as of version 12.5.x, supports multiple tempdbs.  You can create an additional tempdb on a tempfs just for this stuff and then steer these big batch operations toward that one.  The easiest way is to bind a login to the new tempdb.

I think I would try this before modifying code as you might find you can live with it just by putting tempdb on a tempfs.
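Roughly, the setup might look like this (device path, sizes, and login name are placeholders; check the exact syntax against your ASE version):

-- O/S side (Linux): a memory-backed file system, e.g.
--   mount -t tmpfs -o size=4g tmpfs /sybtmp

-- ASE side: put a device on it and create a separate temporary database
disk init
    name = 'batch_tempdb_dev',
    physname = '/sybtmp/batch_tempdb.dat',
    size = '2G'
go

create temporary database batch_tempdb on batch_tempdb_dev = '2G'
go

-- Steer the batch login at the new tempdb
sp_tempdb 'bind', 'lg', 'batch_login', 'DB', 'batch_tempdb'
go

-- caveat: the tmpfs file vanishes on reboot, so the device file has to be
-- recreated before ASE restarts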
 
Joe WoodhousePrincipal ConsultantCommented:
Sorry all, I can't resist weighing in since I have to nitpick a few things that have been said here.

Firstly, SELECT INTO is *not* unlogged. It is *minimally* logged. There is no such thing in Sybase ASE as an unlogged write to data. (Shush in the back row, I know about top level actions, and even those are still logged.) Minimally logged means all we log are the page allocations and deallocations. That includes all index pages, temp table pages, worktable pages and other system pages that are written to as part of the operation.

Secondly, I haven't seen anyone talking about the different UPDATE types. All UPDATES are definitely not created equal! There is a tremendous difference in total I/O, log records and end to end time to complete a direct-in-place update vs. a deferred update. These are written up well in the Sybase manuals and I won't fully describe each of them here, save some highlights:

If the column(s) you're updating are part of an index key, or for some reason you've built referential integrity into the temp table, SELECT INTO is probably going to be faster. (You are unlikely to have replication or triggers on a temp table.) Even if the updated column(s) are not part of an index, if they are variable length, it may be that a SELECT INTO will be faster.

I find it helpful in these sorts of scenarios to think about how many times any given row has to be handled.  If we're doing a deferred update (updating a column that's part of a primary key or unique index), then effectively each update is processed as a DELETE followed by an INSERT, not to mention that we have to read the row in the first place.  So that's one read and two writes of data.  Each data write involves an index write, which in turn needs an index read.  We're up to 1 read + 2 writes, plus 1R+2W *per index*.

Each of those writes has to be fully logged, so for each row in a deferred update, we're looking at 1R+2W+2W, *plus* 1R+2W+2W per index.

(I am slightly cheating, since it isn't quite that bad, but it makes for a good story.)

Ok, maybe we're lucky and it isn't a deferred update.  Let's say it's just a direct-expensive update instead, say a row length is changing and we don't have room for the new row on the same page.  It has to move to a different page, which means changing a row ID in an index entry, which means 1R+2W for data (we have to write the old page to update the row offset table and remove the RID, then insert the row on the new page).  Fortunately it's only 1R+1W per index, as we don't have to move the index row, just update the RID.  With a clustered index we don't need to update every index page either, but NC indexes store the final RID on every page in the index tree where a particular row is referenced.  (If that's confusing, I'm just saying it's better than a deferred update, where every row has 2W associated per index.)  Of course every write is still logged, for a total of 1R+2W+2W, and 1R+1W+1W per index.

The absolute best case for an update is a direct-in-place. This is where every row stays where it is, no rows move on the page or to a different page. You can forget this if you have any variable length columns. Remember any nullable column is variable length, too. With this kind of update we're looking at 1R+1W+1W per row. (No indexes to update.)
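For example (a hypothetical table), only the first update below even has a chance of qualifying as direct-in-place:

create table #risk (
    trade_id   int          not null,
    pv         float        not null,   -- fixed length, not nullable
    curve_name varchar(30)  null        -- variable length (nullable columns are too)
)

-- candidate for in-place: fixed-length column, not part of any index
update #risk set pv = pv * 1.01

-- cannot be in-place if the new value is longer than the old one
update #risk set curve_name = curve_name + '_SHIFTED'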

Whereas for SELECT INTO, we're looking at one read of the original data, and one write to the new table. That write is still logged, but minimally (so less to write), but even so we'll call it 1R+1W+1W. Then if building new indexes that would be another 1R+1.2W per index (rough overhead for intermediate pages in the index tree), for a total of 2R+3.2W. Then we drop the old table, which is 1R+1W per index, but I'm not sure we should count that here as it can be done separately later. Or we could just truncate it then drop it. 8-)

That's worse than the best-case UPDATE, but it's a lot better than the worst case.  It compares favourably to the average-case UPDATE, which was 1R+2W+2W, and 1R+1W+1W per index.
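To put rough numbers on that for the 1M-row case in the question, assuming, say, two nonclustered indexes on the temp table: the deferred UPDATE costs roughly (1R+2W+2W) + 2 x (1R+2W+2W) = 15 page touches per row, on the order of 15 million in total, while the SELECT INTO with both indexes rebuilt costs roughly (1R+1W+1W) + 2 x (1R+1.2W) = about 7.4 per row.  The exact figures will shift with your schema, but that ratio is why SELECT INTO usually comes out ahead unless the UPDATE qualifies as direct-in-place.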

Summary: The best possible UPDATE will beat SELECT INTO, but SELECT INTO beats pretty much everything else.  And it's rare you'll be doing direct-in-place updates; the restrictions and requirements are pretty fierce.

Now those were the raw I/O figures. Other factors could still distort the results - how big and how full the data caches (and any log caches, if any) are (although you can't bind tempdb's logs to its own cache, only for user databases); whether asynchronous prefetch is being used effectively for the reads; whether async logging is being used for the log writes, and so on.

But I've been distracted by the technology. I think we should instead ask if it even makes sense to be using temp tables at all. If this is an EOD process, are you really regenerating those 1M-row tables from scratch each day? Is there any opportunity to maintain them persistently and refresh as required? This may be the same or even more work overall, but then it gets distributed through your entire day rather than trying to fit it into a batch window. It would also make EOD run a *lot* faster.
 
leonstrykerAuthor Commented:
>If this is an EOD process, are you really regenerating those 1M-row tables from scratch each day?

Yes and no. The majority of the items (trades) are the same, but their attributes (credit curves, risks, fx rates, etc.) will change.  It is in fact the joining of the persistent data with the new attributes that is the issue.

 
Joe WoodhousePrincipal ConsultantCommented:
Got it. I'm familiar with Market Risk and Liquidity EOD with new curves & rates. There's no fast way of doing it, it has to be batch, and it has to wait until the new curves & rates are released.

While tempdb can be put into a named cache, and sometimes that's helpful, another thing to consider would be a dedicated scratch database for this work. Then you'd be able to build separate caches for log and data, and you'd have the async log service option available (assuming you meet the other prereqs).

Remember any indexing work is done in the data cache the underlying table is bound to, unless you've bound the indexes separately to a different cache. This affects proper sizing for a named cache.
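Roughly (cache and database names and sizes are invented; check the prerequisites for your ASE version):

-- Named caches for the scratch database's data and its log
sp_cacheconfig 'scratch_data_cache', '1G'
go
sp_cacheconfig 'scratch_log_cache', '200M', 'logonly'
go

-- Bind the database to the data cache, and its log (syslogs) to the log cache
-- (binding syslogs generally requires the database to be in single-user mode)
sp_bindcache 'scratch_data_cache', 'scratch_db'
go
sp_bindcache 'scratch_log_cache', 'scratch_db', 'syslogs'
go

-- Turn on the asynchronous log service for the scratch database
sp_dboption 'scratch_db', 'async log service', true
go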
 
leonstrykerAuthor Commented:
>dedicated scratch database for this work

Actually, our long-term solution is a dedicated DB server just for the batch reports, as opposed to just a database, since tempdb is shared.
 
Joe WoodhousePrincipal ConsultantCommented:
That works too. 8-)
 
leonstrykerAuthor Commented:
Thank you folks, it is always surprising to learn just how much you still do not know.

Leon