Solved

# improving with a simple query which handles lot of data

Posted on 2011-09-28
237 Views

The base query is: Update A Set A.col1 = B.col2 from A join B on A.col1 = B.col1

Ultimately what I will doing it putting this in a loop, because there are too many records, and I would like to do it in a batch, so it is committed quicker

DECLARE      @Increment tinyint, @Iterate bigint
SELECT      @Iterate = CEILING(COUNT(*) / 5000.0) FROM A JOIN B ON A.Col1 = B.Col1
SET @Increment = 1
WHILE @Increment <= @Iterate
BEGIN
UPDATE A SET A.col1 = B.col2 FROM A join B ON A.col1 = B.col1
JOIN (SELECT TOP 5000 A.Col1 FROM A INNER JOIN B ON A.Col1 = B.Col1 WHERE A.Col1 <> B.Col2) C
ON A.Col1 = C.Col1
SET @Increment = @Increment + 1
END

But to test the efficieny of the index on the join, I am testing on a SELECT and What is being run is:

SELECT A.Col1 , B.Col2
FROM A
INNER JOIN B
ON A.Col1 = B.Col1
JOIN
(SELECT TOP 1 A.Col1
FROM A
INNER JOIN B
ON A.Col1 = B.Col1
WHERE A.Col1 <> B.Col2
) C
ON A.Col1 = C.Col1

IO Statistics for the above shows

A has 20M records
B has 500k records.

Both A.col1 & B.col1 are indexed. (Non clustered index with only one column in it)

Am I doing the best with the index or is there something more than be done to improve the query?

0
Question by:25112
• 7
• 4

LVL 25

Accepted Solution

TempDBA earned 100 total points
You need to have index on (B.col1,B.col2)
0

LVL 50

Assisted Solution

yes have the index on col1,col2 for table b...

you may find that

doing the update in 2 phases is better

e,g.

1. where a.col1 < b.col2
2. where a.col1 > b.col2

rather than a.col1 <> b.col2

after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?
not sure why you bother with the initial count query ... wouldn't it be safer to loop around until your update
statement don't update anything? (by doing a select @@rowcount check after the update...)
0

LVL 50

Assisted Solution

A has 20M records
B has 500k records.

so how many a rows are you going to have updated in the end?
is the distribution of b to a fairly even or can it vary widely?
what datatype is col1 (if its variable length will the update significantly increase/decrease the data stored?) ?

does col1 appear in any other table A indexes?
is the col1 index just for this processing...
what is the clustering index on table A?

if a large proportion of the 20million rows are being affected , and there aren't too many constraints defined on the table
then bcp unloading the table via a select to substitute the new col2 value , then drop/recreate table, bcp load new data, recreate any additional indexes maybe a better strategy...

0

LVL 5

Author Comment

>>You need to have index on (B.col1,B.col2)

do you mean B.COl2 as included column?
0

LVL 5

Author Comment

1. where a.col1 < b.col2
2. where a.col1 > b.col2

rather than a.col1 <> b.col2

good idea- thanks.
0

LVL 5

Author Comment

>>after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?

perhaps.. (possible).. but that satisfied business need, right? any alternatives?
0

LVL 5

Author Comment

>>not sure why you bother with the initial count query

which column do you think we can get away with? @Iterate tells us how many times we need to update, if we update 5000 at a time.. @Increment is compared against @Iterate .. can you illustrate how you would use @@rowcount here..
0

LVL 5

Author Comment

>>so how many a rows are you going to have updated in the end?
18.5 M is what I see.. will be updated...

>>what datatype is col1?
all columns involved are 'int'

>>does col1 appear in any other table A indexes?
one more index.. where it is the first column, and then there are few more columns after that. The index exists regardless of this operation. I am willing to make more index if needed. The PK is non clustered unique index. there is not Clustered index at this time on this table.

>>there aren't too many constraints defined on the table
do you FKs?

>>then bcp unloading the table via a select to substitute the new col2 value , then drop/recreate table, bcp load new data bcp loads data into
i know bcp loads data in bulk to textfiles.. how can it do compare and replace.. do we do copy/paste in notepad?
0

LVL 50

Assisted Solution

given this is possible
>>after you change a,col1 to b.col2  is it possible that that introduces further matches  again in col1?
perhaps.. (possible).. but that satisfied business need, right? any alternatives?

THEN
>>not sure why you bother with the initial count query
which column do you think we can get away with?[ @Iterate tells us how many times we need to update, if we update 5000 at a time.. @Increment is compared against @Iterate .. can you illustrate how you would use @@rowcount here..

the @iterate calculation is no longer valid...

``````DECLARE      @rows int,@modified bigint

SElect @rows=-1,@modified=0
wHILE @rows<>0
BEGIN
UPDATE A SET A.col1 = B.col2
FROM A join B
ON A.col1 = B.col1
JOIN (SELECT TOP 5000 A.Col1
FROM A INNER JOIN B ON A.Col1 = B.Col1 WHERE A.Col1 <> B.Col2) C
ON A.Col1 = C.Col1
SElect @rows=@@rowcount,@modified=@modified+@rowcount
print getdate()
print @modified
end
``````

>>there aren't too many constraints defined on the table
do you FKs?

Yes foreign keys

Bcp...

Not you output the table via a query which substitutes the desired value of the column... basically your select statement
then you drop the table
then use bcp to reload the data...

since 90%+ of the table is being altered this is the method i would go with... it will avoid a lot of logging...
however can you construct the correct "output" query... goinf back to your first response about it being a business requirement to suport multiple transfers that implies you need additional information on table b to indicate the order in which the transfers should be applied (or else you potentially have a never ending loop a-b,b-c,c-a...)

even if multiple transfers are not intended ... you probably still need some ordering information on table B if it is valid
for transfers into and out of the same values...
you need to do some validation / analysis of the data and discuss with the business what the intended result of the
"transfer process is to be"

hth

ps these considerations should be basic practice for a DBA so please involve your know before you progress further with the design/implementation
0

LVL 5

Author Comment

i am following you.

could you also tel the rationale for the index (B.col1,B.col2)
0

LVL 50

Assisted Solution

because then the index is a covering index and sql does not have to access the row in table b if it decides to use the index....
0

LVL 5

Author Comment

thanks Sir
0

## Featured Post

Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within…
Using examples as well as descriptions, and references to Books Online, show the documentation available for datatypes, explain the available data types and show how data can be passed into and out of variables.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.