Avatar of D B
D BFlag for United States of America

asked on 

UDATE Statement with JOIN to Table with Duplicate Rows

Running MS SQL Server 2017. I've come across a 'partial' answer to this question but want to go into a bit more detail. I believe, from what I've read, the following code statement is non-deterministic as to what table1.col2 will be set to when table1 has a value of 1 for col1 and table2 has 3 rows with a value of 1 for col1 and 3 different values for col2.

I believe that the answer for my question is, if there is a many-to-many join (e.g. table1 has multiple rows with a value of 1 for col1, will the value of col2 be consistent? I believe it would (e.g. if one row with a value of 1 for col1 gets col2 set to a value of 4, then all rows with a value of 1 for col1 would get col2 set to a value of 4, since the update will be set based).

Neither of these statements are not something that can be easily 'proved' since you need to prove repeatability and how many times do you need to execute code with the same results before you can declare repeatibility? 1,000? 1,000,000? With my luck, the 1,000,001st time would give a different result.

Any internals people out there that would have a definitive answer?
UPDATE a
SET col2 = b.col2
FROM table1 a
INNER JOIN table2 b
ON b.col1 = a.col1;

Open in new window

Microsoft SQL Server

Avatar of undefined
Last Comment
D B
Avatar of _agx_
_agx_
Flag of United States of America image

No points...

AFAIK, that type of update is always non-deterministic and whatever order the optimizer chooses isn't something you should bank on. Changing the sql so it's deterministic statement is the only way to ensure a consistent result.

Did you see this thread?
https://www.experts-exchange.com/questions/28704278/UPDATE-Based-on-Join-nondeterministic-example.html 


Avatar of D B
D B
Flag of United States of America image

ASKER

That addresses my first part of the question in which I knew a many-to-one join was non-deterministic. My main question is the second part, where a many-to-many joinis used. I would think, as I stated, that if table1 had 4 rows with a value of 1 for col1 and table2 had 4 rows with a value of 1 for col1, although which row is joined on is non-deterministic, once one row is chosen as the candidate, will that row's col2 value be used to update all rows in table1 that have a value of 1 for col1?
Avatar of ste5an
ste5an
Flag of Germany image

The answer is simple:

It is non-deterministic.  The optimizer may chose under certain circumstances the same execution plan every time, this behavior is not guaranteed.
Avatar of _agx_
_agx_
Flag of United States of America image

@DB - Just to echo what Ste5an already said, I think the point is with non-deterministic behavior you can't make that assumption. The optimizer might produce that result, but there's no guarantee of that - which is what you're asking. If you need a predictable and consistent result, you must use a deterministic statement.

Avatar of D B
D B
Flag of United States of America image

ASKER

Well, I'm not 100% convinced I got a definitive answer. The issue is not repeatability, as I'm not looking for that.  
The question was, given the following data:
  table1          table2
col1  col2      col1  col2
  1     2         1     5
  1     3         1     6 
  1     4         2     8
  2     1         3     0         
  3     6         4     7 

Open in new window

Using this query:
UPDATE t1
SET    col2 = t2.col2
FROM   table1 t1
INNER JOIN table2 t2
ON     t1.col1 = t2.col1

Open in new window

I understand that which value is chosen from table2 to update table1 (for col1 value = 1) and that it isn't repeatable (e.g. if it updates with 5 on one run, it might update with 6 the next run). My primary question is, if it selects to update table1.col2 with 6 for the first row, will it use 6 for the 2nd and 3rd rows, or is is possible to have it update row 1 and 3 with 5 and row 2 with 6?

Further comment: I know I can make this deterministic using a window function. I am asking because this is existing production code and changing it is a big deal and can be done if necessary but so far it has 'behaved.' I would prefer a "yes" or "no" answer and not an "AFAIK" or "I believe" or "it should."

I wrote a query to load the tables, run the update, write the results of table1 to a temp table, then reset all the data in table1. I put it in a loop and ran it 100K times and it was consistent, but like I said, I could run it 10M times and it might be different once.

If the first run updates the columns with 5 and another run updates with 6, that is not what I'm concerned about. If it updates with 5, will all rows that have a col1 value of 1 be updated with the same value for col2?

Based on the link that _agx_ posted, it states, "SQL Server silently performs a nondeterministic UPDATE where one of the source rows arbitrarily “wins.” Does the one row that "wins" win for all rows that get updated with the matching JOIN?

ASKER CERTIFIED SOLUTION
Avatar of Scott Pletcher
Scott Pletcher
Flag of United States of America image

Blurred text
THIS SOLUTION IS ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
Avatar of ste5an
ste5an
Flag of Germany image

While @Scott is right, that the query plan is compiled and then used, this does not mean that the same value is used.

A simple sample (CIX update) shows that already with a few rows:

USE [ExpertsExchange]
GO

DROP TABLE IF EXISTS T1;
DROP TABLE IF EXISTS T2;
GO

CREATE TABLE T1 ( C1 INT NOT NULL, C2 INT NOT NULL ) ;
CREATE CLUSTERED INDEX CIX_T1 ON T1 ( C2 ASC );

CREATE TABLE T2 ( C1 INT NOT NULL, C2 INT NOT NULL ) ;
CREATE CLUSTERED INDEX CIX_T2 ON T2 ( C1 ASC );
GO

INSERT INTO T1 
VALUES	( 1, 1 ),
		( 2, 1 ),
		( 3, 1 ),
		( 4, 1 ),
		( 1, 2 ),
		( 2, 2 ),
		( 3, 2 ),
		( 4, 2 ),
		( 1, 3 ),
		( 2, 3 ),
		( 3, 3 ),
		( 4, 3 ),
		( 1, 4 ),
		( 2, 4 ),
		( 3, 4 ),
		( 4, 4 );

INSERT INTO T2
VALUES	( 1, 10 ),
		( 1, 20 ),
		( 1, 30 ),
		( 1, 40 ),
		( 2, 2 ),
		( 3, 3 ),
		( 4, 4 );
GO

UPDATE T1
SET    T1.C2 = T2.C2
FROM   T1
	INNER JOIN T2 ON T1.C1 = T2.C1;
GO

SELECT	*
FROM	T1;
GO

Open in new window


User generated image
User generated image
Thus for your questions: The query plan, once compiled, can be called deterministic, but NO, the same value does NOT have to be used by it.

Test instance at hands: Microsoft SQL Server 2019 (RTM-GDR) (KB4583458) - 15.0.2080.9 (X64)   Nov  6 2020 16:50:01   Copyright (C) 2019 Microsoft Corporation  Developer Edition (64-bit) on Windows 10 Enterprise 10.0 <X64> (Build 19042: ) (Hypervisor)
I said that you should see either a stream aggregate operation or a spool operation in the plan, and, IF you do, then the plan will be consistent.  

If you create clus indexes on the values, you give SQL the false impression that the values are unique, and then you may have issues because you will get a plan w/o one of those correction operations.

For example, in your code above, drop the word CLUSTERED out of the index creation.  Then the query plan contains a Stream Aggregate operation.


Avatar of ste5an
ste5an
Flag of Germany image

It's just a sample, showing that you don't see either a stream aggregate or spool operation. Besides that, the plan is always "consistent" or "deterministic" and these terms make no sense here.

The question was, whether the update uses the same value or not. Nothing more. The physical structure of the data, aka clustered indices is not defined and may change.

Thus, due to the scope of the question without making assumptions: Yes, the value can be the same or not. The query alone does not determine this.

EDIT.
I said that you should see either a stream aggregate operation or a spool operation in the plan, and, IF you do, then the plan will be consistent.
I don't see the necessity that parallel streams use the same spool value in the case e.g. of a large partitioned table.
Avatar of D B
D B
Flag of United States of America image

ASKER

Stan and Scott, thank you for the clarifications. In this case, there are clustered indexes in place on col1 on both tables (the JOINed columns). I will play around with it and see what I can determine. Believe it or not, our DBAs have the systems (even DEV) locked down so tight that we don't have SHOWPLAN access or access to any DMVs or other tools that would make our jobs easier. If I want to see a query plan I have to submit the query to our DBAs and ask them to run the script. :-(
Several people had already stated that it was not a consistent UPDATE in SQL Server.

Therefore, I was trying to point out the exceptions as to when it would be consistent.

Of course the best thing would be to use some method to ensure consistency.

Btw, your DBAs are total tools for not allowing you to see a query plan.  They must have Oracle DBAs in the past, who tend to super-lock everything because of how fragile Oracle plans are.
You'd certainly want to add OPTION (MAXDOP 1) to the query:

UPDATE a SET col2 = b.col2 FROM table1 a INNER JOIN table2 b ON b.col1 = a.col1 OPTION (MAXDOP 1);
Avatar of D B
D B
Flag of United States of America image

ASKER

Thanks all. Like I said, coming up with a solution to fix it isn't the issue. Getting the change made and pushed to production is. What I was trying to determine is, is it worth the effort to try to get it updated or just leave as-is. I think at this point, I'm going to leave as-is and hopefully we will have some other change in the near future and I can push in a fix then.
Microsoft SQL Server
Microsoft SQL Server

Microsoft SQL Server is a suite of relational database management system (RDBMS) products providing multi-user database access functionality.SQL Server is available in multiple versions, typically identified by release year, and versions are subdivided into editions to distinguish between product functionality. Component services include integration (SSIS), reporting (SSRS), analysis (SSAS), data quality, master data, T-SQL and performance tuning.

171K
Questions
--
Followers
--
Top Experts
Get a personalized solution from industry experts
Ask the experts
Read over 600 more reviews

TRUSTED BY

IBM logoIntel logoMicrosoft logoUbisoft logoSAP logo
Qualcomm logoCitrix Systems logoWorkday logoErnst & Young logo
High performer badgeUsers love us badge
LinkedIn logoFacebook logoX logoInstagram logoTikTok logoYouTube logo