
Solved

Update vs. Select Into

Posted on 2007-08-01
Medium Priority
953 Views
Last Modified: 2008-01-09
A large temp table (1M+ rows) needs to have a column updated from another large table (1M+ rows) inside a stored procedure. Which is better: an Update, or a Select Into another temp table followed by dropping the first? Why?
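For concreteness, a rough sketch of the two alternatives being compared (the table and column names #risk_temp, #attribs, trade_id, notional and credit_spread are invented for illustration):

-- Option 1: update the existing temp table in place
UPDATE #risk_temp
SET    credit_spread = a.credit_spread
FROM   #risk_temp, #attribs a
WHERE  #risk_temp.trade_id = a.trade_id

-- Option 2: build a second temp table with Select Into, then drop the original
SELECT r.trade_id, r.notional, a.credit_spread
INTO   #risk_temp2
FROM   #risk_temp r, #attribs a
WHERE  r.trade_id = a.trade_id

DROP TABLE #risk_temp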
Question by:leonstryker
16 Comments
 
LVL 42

Expert Comment

by:dqmq
ID: 19611505
Select Into almost certainly has more overhead: it has to create the table, acquire the space, etc. If the temp table is proactively designed so that the update doesn't cause a lot of row expansion, the update will be faster.
 
LVL 6

Assisted Solution

by:gjutras
gjutras earned 100 total points
ID: 19611516
The usual SQL answer to any performance question is "it depends," most often followed by "try it both ways and see which is faster."
Are you updating one, some, or all rows?
How many columns are involved, and of what type and size?
Are the join columns indexed on both the temp table and the other large table?
 
LVL 29

Author Comment

by:leonstryker
ID: 19611518
But Select Into is not a logged operation while an Update is. Won't that make a big difference?
 
LVL 29

Author Comment

by:leonstryker
ID: 19611534
All rows will be updated.
One or two columns are updated at a time.
The datatypes are usually float or varchar.
Indexes are used, but creating them on the temp tables takes a while in itself.
 
LVL 6

Expert Comment

by:gjutras
ID: 19611612
I'd suggest the update would be faster in your scenario: the large table's indexes are already built and only a couple of columns need updating, versus the overhead of recreating a table, copying all the data again, and then recreating the indexes.
 
LVL 19

Accepted Solution

by:grant300
grant300 earned 400 total points
ID: 19612058
Try it both ways (perhaps with a subset of the data) and see what really happens.  Do a
Select * Into #Tab2 From #Tab1 and time it.  That is the floor under whatever operation you are going to do.  I think you will be surprised how fast it is.

Then do a simple update of #Tab1; something like
Update #Tab1 Set Col8 = 45.123, Col10 = 'This is a test string'
and time that.  That puts the floor under your update operations.

Either way you have to pay for the Join; these two tests just pull that out of the equation.  My suspicion is that for a relatively narrow table, the Select/Into wins.  If you have a bunch of columns, then an update might, under just the right circumstances, be better.
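As a sketch of how those two floor measurements could be taken in isql (using the #Tab1/Col8/Col10 names from above, which are of course placeholders), set statistics time/io will report the elapsed time and I/O for each statement:

set statistics time on
set statistics io on
go

-- Floor for the Select/Into path
select * into #Tab2 from #Tab1
go
drop table #Tab2
go

-- Floor for the Update path (constant values, no join)
update #Tab1 set Col8 = 45.123, Col10 = 'This is a test string'
go

set statistics time off
set statistics io off
go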

Creating a TEMP table and extending it with Select/Into is a very fast operation.  It is highly optimized because it is a very close sibling of the operation that the query processor does when it creates internal work tables.  There are some system table updates and space allocation is logged but for 1M+ rows, that overhead gets lost in the noise.

As far as optimizing the updates goes, you will take it on the nose with the logging and will probably be limited by the log writes to disk.  That assumes you can formulate things to do what are called in-place updates.  Those are updates where no index columns are changed and the length of the columns is less than or equal to the old length.  If those conditions are not met, Sybase does the equivalent of a delete and insert.  If you are updating every row and you cannot guarantee in-place updates (floats work but varchars do not), you are essentially writing all new rows anyway.  That's in addition to paying for the logging.
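To make the in-place distinction concrete, a hypothetical pair of statements (the table and column names are invented):

-- Likely eligible for an in-place update: fixed-length float column,
-- not part of any index, new value occupies the same number of bytes
update #risk_temp set pv01 = pv01 * 1.01

-- Likely NOT in-place: a varchar column where the new value can be longer
-- than the old one, so rows can grow and may have to move
update #risk_temp set curve_name = curve_name + '_EOD'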

Bottom line, in most instances you are better off with a Select/Into than an Update of all rows in a temp table.

One thing to watch out for with temp tables and indexes in stored procedures is that, in general, an index created on a temp table in a stored procedure is not available to the optimizer within that same stored procedure.  You have a couple of choices.  The first is to create the table and index in Procedure A, then do the join in Procedure B, which is called from A.  The added complication is that when compiling Procedure B, you have to create the temp table and index in the script before the CREATE PROC statement so that it has something to compile against.
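A sketch of that two-procedure pattern (proc_A, proc_B, #work and the source tables are all invented names); the dummy #work created before CREATE PROC proc_B exists only so the procedure has something to compile against:

-- Script for Procedure B
create table #work (trade_id int, credit_spread float)
create index ix_work on #work (trade_id)
go
create proc proc_B
as
    update #work
    set    credit_spread = a.credit_spread
    from   #work, attribs a
    where  #work.trade_id = a.trade_id
go
drop table #work
go

-- Procedure A creates the real table and index, then calls B
create proc proc_A
as
    select trade_id, credit_spread = convert(float, 0)
    into   #work
    from   trades
    create index ix_work on #work (trade_id)
    exec proc_B
go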

One more thing you might do to speed up your temp table operations, particularly if they are logged, is to put TEMPDB on a TEMPFS.  A TEMPFS is a memory resident file system available in one form or another on Linux and Unix O/Ss.  The advantage of a TEMPDB on TEMPFS is that it has the potential to speed up any number of TEMPDB operations, not just the one you are looking at.  It also allows you to effectively use more memory on a 32 bit O/S than you can configure with just the Sybase Cache.

One more thought.  You say you have to update the temp table from another very large table which, I presume, is a regular table in a user database.  If that is the case, you are probably better off structuring your Join to leverage an index on the permanent table rather than trying to index temp tables.

Best of luck,
Bill
 
LVL 42

Assisted Solution

by:dqmq
dqmq earned 100 total points
ID: 19612487
>Won't that make a big difference?

That's a good point.  But logging is a sequential operation and usually pretty well optimized.  At least logging overhead would be somewhat constant.  Space allocation overhead could vary from run to run, depending on the condition of the database.  

I did some testing and found that UPDATE time was fairly stable, but SELECT INTO time was all over the place--sometimes slower; sometimes significantly faster.

I guess I'll recant my answer and side with the "it depends" contingent.
 
LVL 19

Expert Comment

by:grant300
ID: 19612869
Unless you have a very busy system with a lot of users, space allocation in TEMPDB is not an issue.  Remember that the space underneath temp tables is freed up when the table disappears, which is pretty quickly.

As far as your timing results go, I find them very interesting.  I am curious about what other operations are going on in the system at the same time.  The SELECT INTO is probably CPU bound, so if there are other things going on in the database or on the server, you will see a slowdown.  The UPDATE is probably disk bound, which means that, unless the TEMPDB device is getting hit heavily by other operations, you will see a slower but consistent speed.  The fact that the SELECT INTO is sometimes "significantly faster" means it is using fewer resources overall and should probably be your choice, all else being equal.

Bill
 
LVL 29

Author Comment

by:leonstryker
ID: 19612918
The stored proc is run as part of an EOD process and generates a Risk report. On the production server it has been known to run for an hour, but in all fairness it actually runs twice (to get the difference between the two days).

TempDb gets hit very heavily by this procedure. A number of large temp tables are created and exist for the duration of the procedure. There may also be other, somewhat smaller reports running at the same time.
 
LVL 19

Expert Comment

by:grant300
ID: 19613037
Either way you go, you are hitting tempdb pretty hard.  I would have a serious discussion with your DBA about putting tempdb on a tempfs.  It has tremendous potential to speed things up, particularly if they are logged operations.

He/she does not have to move the existing tempdb.  Sybase, as of version 12.5.x, supports multiple tempdbs.  You can create an additional tempdb on a tempfs just for this stuff and then steer these big batch operations toward that one.  The easiest way is to bind a login to the new tempdb.

I think I would try this before modifying code as you might find you can live with it just by putting tempdb on a tempfs.
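Purely as an illustration of the moving parts (the mount point, device name, sizes, login name and database name below are invented, and the exact size and option syntax varies with ASE version, so treat this as a sketch to adapt with your DBA):

-- 1. OS side (Linux, as root): create the memory-resident file system, e.g.
--      mount -t tmpfs -o size=4g tmpfs /sybase/tempfs

-- 2. Create a Sybase device on it (size syntax depends on ASE version)
disk init
    name     = 'tempfs_dev',
    physname = '/sybase/tempfs/tempdb_batch.dat',
    size     = '2G'
go

-- 3. Create an additional temporary database on that device
create temporary database tempdb_batch on tempfs_dev = 2048
go

-- 4. Bind the batch login to the new tempdb so its temp tables land there
sp_tempdb 'bind', 'lg', 'batch_login', 'DB', 'tempdb_batch'
go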
 
LVL 24

Assisted Solution

by:Joe Woodhouse
Joe Woodhouse earned 400 total points
ID: 19619417
Sorry all, I can't resist weighing in since I have to nitpick a few things that have been said here.

Firstly, SELECT INTO is *not* unlogged. It is *minimally* logged. There is no such thing in Sybase ASE as an unlogged write to data. (Shush in the back row, I know about top level actions, and even those are still logged.) Minimally logged means all we log are the page allocations and deallocations. That includes all index pages, temp table pages, worktable pages and other system pages that are written to as part of the operation.

Secondly, I haven't seen anyone talking about the different UPDATE types. All UPDATES are definitely not created equal! There is a tremendous difference in total I/O, log records and end to end time to complete a direct-in-place update vs. a deferred update. These are written up well in the Sybase manuals and I won't fully describe each of them here, save some highlights:

If the column(s) you're updating are part of an index key, or for some reason you've built referential integrity into the temp table, SELECT INTO is probably going to be faster. (You are unlikely to have replication or triggers on a temp table.) Even if the updated column(s) are not part of an index, if they are variable length, it may be that a SELECT INTO will be faster.

I find it helpful in these sorts of scenarios to think about how many times any given row has to be handled. If we're doing a deferred update (updating a column that's part of a primary key or unique index) then effectively each update is processed as a DELETE followed by an INSERT, and on top of that we have to read the row in the first place. So that's one read and two writes of data. Each data write involves an index write, which in turn needs an index read. We're up to 1 read + 2 writes, plus 1R+2W *per index*.

Each of those writes has to be fully logged, so for each row in a deferred update, we're looking at 1R+2W+2W, *plus* 1R+2W+2W per index.

(I am slightly cheating, since it isn't quite that bad, but it makes for a good story.)

Ok, maybe we're lucky and it isn't a deferred update. Let's say it's just a direct-expensive update instead, say the row length is changing and we don't have room for the new row on the same page. It has to move to a different page, which means changing a rowID in an index entry, which means 1R+2W for data (we have to write the old page to update the row offset table to remove the RID, then insert the row on the new page). Fortunately it's only 1R+1W per index as we don't have to move the index row, just update the RID. With a clustered index we don't need to update every index page either, but NC indexes store the final RID on every page in the index tree where a particular row is referenced. (If that's confusing, let's just say it's better than a deferred update, where every row has 2W associated per index.) Of course every write is still logged, for a total of 1R+2W+2W, and 1R+1W+1W per index.

The absolute best case for an update is a direct-in-place. This is where every row stays where it is, no rows move on the page or to a different page. You can forget this if you have any variable length columns. Remember any nullable column is variable length, too. With this kind of update we're looking at 1R+1W+1W per row. (No indexes to update.)

Whereas for SELECT INTO, we're looking at one read of the original data and one write to the new table. That write is still logged, but minimally (so there is less to write); even so we'll call it 1R+1W+1W. Then, if building new indexes, that would be another 1R+1.2W per index (rough overhead for intermediate pages in the index tree), for a total of 2R+3.2W. Then we drop the old table, which is 1R+1W per index, but I'm not sure we should count that here as it can be done separately later. Or we could just truncate it and then drop it. 8-)

That's worse than the best-case UPDATE, but it's a lot better than the worst case. It compares favourably to the average-case UPDATE, which was 1R+2W+2W, and 1R+1W+1W per index.

Summary: The best possible UPDATE will beat SELECT INTO, but SELECT INTO beats pretty much everything else. And it's rare that you'll be doing direct-in-place updates; the restrictions and requirements are pretty fierce.
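One way to see which of these update modes you are actually getting is showplan; the table and column here are placeholders, but the "update mode" line is what to look for:

set showplan on
go
update #risk_temp set pv01 = pv01 * 1.01
go
-- The showplan output for the update step includes a line such as
--   "The update mode is direct."   or   "The update mode is deferred."
set showplan off
go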

Now those were the raw I/O figures. Other factors could still distort the results - how big and how full the data caches (and any log caches, if any) are (although you can't bind tempdb's logs to its own cache, only for user databases); whether asynchronous prefetch is being used effectively for the reads; whether async logging is being used for the log writes, and so on.

But I've been distracted by the technology. I think we should instead ask if it even makes sense to be using temp tables at all. If this is an EOD process, are you really regenerating those 1M-row tables from scratch each day? Is there any opportunity to maintain them persistently and refresh as required? This may be the same or even more work overall, but then it gets distributed through your entire day rather than having to fit into a batch window. It would also make EOD run a *lot* faster.
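One possible shape for that kind of persistent refresh, purely as a sketch (risk_base, new_attribs and the columns are invented names): keep the 1M-row table permanent and touch only the attribute columns each day.

-- Refresh attributes for trades that already exist
update risk_base
set    credit_curve = n.credit_curve,
       fx_rate      = n.fx_rate
from   risk_base, new_attribs n
where  risk_base.trade_id = n.trade_id

-- Add only the genuinely new trades
insert risk_base (trade_id, credit_curve, fx_rate)
select n.trade_id, n.credit_curve, n.fx_rate
from   new_attribs n
where  not exists (select 1 from risk_base r
                   where  r.trade_id = n.trade_id)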
 
LVL 29

Author Comment

by:leonstryker
ID: 19619488
>If this is an EOD process, are you really regenerating those 1M-row tables from scratch each day?

Yes and no. The majority of the items (trades) are the same, but their attributes (credit curves, risks, fx rates, etc.) will change.  It is in fact the joining of the persistent data with the new attributes that is the issue.

 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 19619532
Got it. I'm familiar with Market Risk and Liquidity EOD with new curves & rates. There's no fast way of doing it, it has to be batch, and it has to wait until the new curves & rates are released.

While tempdb can be put into a named cache, and sometimes that's helpful, another thing to consider would be a dedicated scratch database for this work. Then you'd be able to build separate caches for log and data, and you'd also have the async log database option available (assuming you meet the other prereqs).

Remember any indexing work is done in the data cache the underlying table is bound to, unless you've bound the indexes separately to a different cache. This affects proper sizing for a named cache.
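For illustration, the kind of commands involved (cache names, sizes and the database name are invented; check your ASE version's prerequisites, particularly for the async log service option, before using any of this):

-- Named caches for the scratch database's data and log
sp_cacheconfig 'scratch_data_cache', '512M'
go
sp_cacheconfig 'scratch_log_cache', '64M', 'logonly'
go

-- Bind the database to the data cache and its log to the log cache
sp_bindcache 'scratch_data_cache', 'scratchdb'
go
sp_bindcache 'scratch_log_cache', 'scratchdb', 'syslogs'
go

-- Optionally enable the asynchronous log service for that database
sp_dboption 'scratchdb', 'async log service', true
go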
 
LVL 29

Author Comment

by:leonstryker
ID: 19619618
>dedicated scratch database for this work

Actually, our long-term solution is a dedicated DB server just for the batch reports, as opposed to just a database, since tempdb is shared.
 
LVL 24

Expert Comment

by:Joe Woodhouse
ID: 19619813
That works too. 8-)
 
LVL 29

Author Comment

by:leonstryker
ID: 19619904
Thank you folks, it is always surprising to learn just how much you still do not know.

Leon
