Solved

improving with a simple query which handles lot of data

Posted on 2011-09-28
12
237 Views
Last Modified: 2012-05-12

The base query is: Update A Set A.col1 = B.col2 from A join B on A.col1 = B.col1

Ultimately what I will doing it putting this in a loop, because there are too many records, and I would like to do it in a batch, so it is committed quicker

DECLARE      @Increment tinyint, @Iterate bigint
SELECT      @Iterate = CEILING(COUNT(*) / 5000.0) FROM A JOIN B ON A.Col1 = B.Col1
SET @Increment = 1
WHILE @Increment <= @Iterate
      BEGIN
UPDATE A SET A.col1 = B.col2 FROM A join B ON A.col1 = B.col1
JOIN (SELECT TOP 5000 A.Col1 FROM A INNER JOIN B ON A.Col1 = B.Col1 WHERE A.Col1 <> B.Col2) C
ON A.Col1 = C.Col1
SET @Increment = @Increment + 1
      END

But to test the efficieny of the index on the join, I am testing on a SELECT and What is being run is:

SELECT A.Col1 , B.Col2
FROM A
INNER JOIN B
ON A.Col1 = B.Col1
JOIN
(SELECT TOP 1 A.Col1
FROM A
INNER JOIN B
ON A.Col1 = B.Col1
WHERE A.Col1 <> B.Col2
) C
ON A.Col1 = C.Col1

IO Statistics for the above shows
Table 'A'. Scan count 789628, logical reads 3562058, physical reads 491, read-ahead reads 128015.
Table 'B'. Scan count 1, logical reads 2540, physical reads 0, read-ahead reads 32.

A has 20M records
B has 500k records.

Both A.col1 & B.col1 are indexed. (Non clustered index with only one column in it)

Am I doing the best with the index or is there something more than be done to improve the query?


0
Comment
Question by:25112
  • 7
  • 4
12 Comments
 
LVL 25

Accepted Solution

by:
TempDBA earned 100 total points
Comment Utility
You need to have index on (B.col1,B.col2)
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 400 total points
Comment Utility
yes have the index on col1,col2 for table b...

you may find that

doing the update in 2 phases is better

e,g.

1. where a.col1 < b.col2
2. where a.col1 > b.col2

rather than a.col1 <> b.col2

after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?
not sure why you bother with the initial count query ... wouldn't it be safer to loop around until your update
statement don't update anything? (by doing a select @@rowcount check after the update...)
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 400 total points
Comment Utility
A has 20M records
B has 500k records.

so how many a rows are you going to have updated in the end?
is the distribution of b to a fairly even or can it vary widely?
what datatype is col1 (if its variable length will the update significantly increase/decrease the data stored?) ?

does col1 appear in any other table A indexes?
is the col1 index just for this processing...
what is the clustering index on table A?

if a large proportion of the 20million rows are being affected , and there aren't too many constraints defined on the table
then bcp unloading the table via a select to substitute the new col2 value , then drop/recreate table, bcp load new data, recreate any additional indexes maybe a better strategy...

0
 
LVL 5

Author Comment

by:25112
Comment Utility
>>You need to have index on (B.col1,B.col2)

do you mean B.COl2 as included column?
0
 
LVL 5

Author Comment

by:25112
Comment Utility
1. where a.col1 < b.col2
2. where a.col1 > b.col2

rather than a.col1 <> b.col2

good idea- thanks.
0
 
LVL 5

Author Comment

by:25112
Comment Utility

>>after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?

perhaps.. (possible).. but that satisfied business need, right? any alternatives?
0
How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

 
LVL 5

Author Comment

by:25112
Comment Utility

>>not sure why you bother with the initial count query

which column do you think we can get away with? @Iterate tells us how many times we need to update, if we update 5000 at a time.. @Increment is compared against @Iterate .. can you illustrate how you would use @@rowcount here..
0
 
LVL 5

Author Comment

by:25112
Comment Utility

>>so how many a rows are you going to have updated in the end?
18.5 M is what I see.. will be updated...

>>what datatype is col1?
all columns involved are 'int'

>>does col1 appear in any other table A indexes?
one more index.. where it is the first column, and then there are few more columns after that. The index exists regardless of this operation. I am willing to make more index if needed. The PK is non clustered unique index. there is not Clustered index at this time on this table.

>>there aren't too many constraints defined on the table
do you FKs?

>>then bcp unloading the table via a select to substitute the new col2 value , then drop/recreate table, bcp load new data bcp loads data into
i know bcp loads data in bulk to textfiles.. how can it do compare and replace.. do we do copy/paste in notepad?
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 400 total points
Comment Utility
given this is possible
>>after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?
perhaps.. (possible).. but that satisfied business need, right? any alternatives?



THEN
>>not sure why you bother with the initial count query
which column do you think we can get away with?[ @Iterate tells us how many times we need to update, if we update 5000 at a time.. @Increment is compared against @Iterate .. can you illustrate how you would use @@rowcount here..


the @iterate calculation is no longer valid...

DECLARE      @rows int,@modified bigint

SElect @rows=-1,@modified=0
wHILE @rows<>0
      BEGIN
UPDATE A SET A.col1 = B.col2
  FROM A join B 
   ON A.col1 = B.col1 
JOIN (SELECT TOP 5000 A.Col1 
            FROM A INNER JOIN B ON A.Col1 = B.Col1 WHERE A.Col1 <> B.Col2) C
ON A.Col1 = C.Col1
SElect @rows=@@rowcount,@modified=@modified+@rowcount
print getdate()
print @modified
end

Open in new window

   

>>there aren't too many constraints defined on the table
do you FKs?

Yes foreign keys

Bcp...

Not you output the table via a query which substitutes the desired value of the column... basically your select statement
then you drop the table
then use bcp to reload the data...

since 90%+ of the table is being altered this is the method i would go with... it will avoid a lot of logging...
however can you construct the correct "output" query... goinf back to your first response about it being a business requirement to suport multiple transfers that implies you need additional information on table b to indicate the order in which the transfers should be applied (or else you potentially have a never ending loop a-b,b-c,c-a...)

even if multiple transfers are not intended ... you probably still need some ordering information on table B if it is valid
for transfers into and out of the same values...
you need to do some validation / analysis of the data and discuss with the business what the intended result of the
"transfer process is to be"

hth

ps these considerations should be basic practice for a DBA so please involve your know before you progress further with the design/implementation
0
 
LVL 5

Author Comment

by:25112
Comment Utility
i am following you.

could you also tel the rationale for the index (B.col1,B.col2)
0
 
LVL 50

Assisted Solution

by:Lowfatspread
Lowfatspread earned 400 total points
Comment Utility
because then the index is a covering index and sql does not have to access the row in table b if it decides to use the index....
0
 
LVL 5

Author Comment

by:25112
Comment Utility
thanks Sir
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within…
Using examples as well as descriptions, and references to Books Online, show the documentation available for datatypes, explain the available data types and show how data can be passed into and out of variables.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

763 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now