x
• Status: Solved
• Priority: Medium
• Security: Public
• Views: 253

# improving with a simple query which handles lot of data

The base query is: Update A Set A.col1 = B.col2 from A join B on A.col1 = B.col1

Ultimately what I will doing it putting this in a loop, because there are too many records, and I would like to do it in a batch, so it is committed quicker

DECLARE      @Increment tinyint, @Iterate bigint
SELECT      @Iterate = CEILING(COUNT(*) / 5000.0) FROM A JOIN B ON A.Col1 = B.Col1
SET @Increment = 1
WHILE @Increment <= @Iterate
BEGIN
UPDATE A SET A.col1 = B.col2 FROM A join B ON A.col1 = B.col1
JOIN (SELECT TOP 5000 A.Col1 FROM A INNER JOIN B ON A.Col1 = B.Col1 WHERE A.Col1 <> B.Col2) C
ON A.Col1 = C.Col1
SET @Increment = @Increment + 1
END

But to test the efficieny of the index on the join, I am testing on a SELECT and What is being run is:

SELECT A.Col1 , B.Col2
FROM A
INNER JOIN B
ON A.Col1 = B.Col1
JOIN
(SELECT TOP 1 A.Col1
FROM A
INNER JOIN B
ON A.Col1 = B.Col1
WHERE A.Col1 <> B.Col2
) C
ON A.Col1 = C.Col1

IO Statistics for the above shows

A has 20M records
B has 500k records.

Both A.col1 & B.col1 are indexed. (Non clustered index with only one column in it)

Am I doing the best with the index or is there something more than be done to improve the query?

0
25112
• 7
• 4
5 Solutions

Commented:
You need to have index on (B.col1,B.col2)
0

Commented:
yes have the index on col1,col2 for table b...

you may find that

doing the update in 2 phases is better

e,g.

1. where a.col1 < b.col2
2. where a.col1 > b.col2

rather than a.col1 <> b.col2

after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?
not sure why you bother with the initial count query ... wouldn't it be safer to loop around until your update
statement don't update anything? (by doing a select @@rowcount check after the update...)
0

Commented:
A has 20M records
B has 500k records.

so how many a rows are you going to have updated in the end?
is the distribution of b to a fairly even or can it vary widely?
what datatype is col1 (if its variable length will the update significantly increase/decrease the data stored?) ?

does col1 appear in any other table A indexes?
is the col1 index just for this processing...
what is the clustering index on table A?

if a large proportion of the 20million rows are being affected , and there aren't too many constraints defined on the table
then bcp unloading the table via a select to substitute the new col2 value , then drop/recreate table, bcp load new data, recreate any additional indexes maybe a better strategy...

0

Author Commented:
>>You need to have index on (B.col1,B.col2)

do you mean B.COl2 as included column?
0

Author Commented:
1. where a.col1 < b.col2
2. where a.col1 > b.col2

rather than a.col1 <> b.col2

good idea- thanks.
0

Author Commented:

>>after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?

perhaps.. (possible).. but that satisfied business need, right? any alternatives?
0

Author Commented:

>>not sure why you bother with the initial count query

which column do you think we can get away with? @Iterate tells us how many times we need to update, if we update 5000 at a time.. @Increment is compared against @Iterate .. can you illustrate how you would use @@rowcount here..
0

Author Commented:

>>so how many a rows are you going to have updated in the end?
18.5 M is what I see.. will be updated...

>>what datatype is col1?
all columns involved are 'int'

>>does col1 appear in any other table A indexes?
one more index.. where it is the first column, and then there are few more columns after that. The index exists regardless of this operation. I am willing to make more index if needed. The PK is non clustered unique index. there is not Clustered index at this time on this table.

>>there aren't too many constraints defined on the table
do you FKs?

>>then bcp unloading the table via a select to substitute the new col2 value , then drop/recreate table, bcp load new data bcp loads data into
i know bcp loads data in bulk to textfiles.. how can it do compare and replace.. do we do copy/paste in notepad?
0

Commented:
given this is possible
>>after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?
perhaps.. (possible).. but that satisfied business need, right? any alternatives?

THEN
>>not sure why you bother with the initial count query
which column do you think we can get away with?[ @Iterate tells us how many times we need to update, if we update 5000 at a time.. @Increment is compared against @Iterate .. can you illustrate how you would use @@rowcount here..

the @iterate calculation is no longer valid...

``````DECLARE      @rows int,@modified bigint

SElect @rows=-1,@modified=0
wHILE @rows<>0
BEGIN
UPDATE A SET A.col1 = B.col2
FROM A join B
ON A.col1 = B.col1
JOIN (SELECT TOP 5000 A.Col1
FROM A INNER JOIN B ON A.Col1 = B.Col1 WHERE A.Col1 <> B.Col2) C
ON A.Col1 = C.Col1
SElect @rows=@@rowcount,@modified=@modified+@rowcount
print getdate()
print @modified
end
``````

>>there aren't too many constraints defined on the table
do you FKs?

Yes foreign keys

Bcp...

Not you output the table via a query which substitutes the desired value of the column... basically your select statement
then you drop the table
then use bcp to reload the data...

since 90%+ of the table is being altered this is the method i would go with... it will avoid a lot of logging...
however can you construct the correct "output" query... goinf back to your first response about it being a business requirement to suport multiple transfers that implies you need additional information on table b to indicate the order in which the transfers should be applied (or else you potentially have a never ending loop a-b,b-c,c-a...)

even if multiple transfers are not intended ... you probably still need some ordering information on table B if it is valid
for transfers into and out of the same values...
you need to do some validation / analysis of the data and discuss with the business what the intended result of the
"transfer process is to be"

hth

ps these considerations should be basic practice for a DBA so please involve your know before you progress further with the design/implementation
0

Author Commented:
i am following you.

could you also tel the rationale for the index (B.col1,B.col2)
0

Commented:
because then the index is a covering index and sql does not have to access the row in table b if it decides to use the index....
0

Author Commented:
thanks Sir
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

## Featured Post

• 7
• 4
Tackle projects and never again get stuck behind a technical roadblock.