rowansmith asked:
Optimal way to insert a unique record into table

What is the correct way to insert a value into a table where that value must be unique? There seem to be two ways of doing this:

Assume myValue has a unique constraint/index. Of course this is not real T-SQL, just pseudocode, but you'll get the drift. Also assume that the database is extremely busy, with one of the following two pieces of code being executed continuously by many users and processes.


1:

select  count (*) from mytable where myValue = '1234567890'
if count = 0 then
 insert mytable(myvalue) values('1234567890')
 return (insert-succeeded)
end if
return (insert-failed)

2:
insert mytable(myvalue) values ('1234567890')
if index-collision-error then
 return(insert-failed)
else
 return(insert-succeeded)
end if

My concern with option 1 is that it might be possible for a competing process/user to insert the value '1234567890' into myTable after the select statement is executed but before the insert statement is executed. Is this possible, or does SQL Server prevent it from occurring?

As for option 2 - well it just seems inefficient to throw exceptions all over the place.

Perhaps there is another (correct) way to do this.
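For concreteness, the two options above might be sketched in real T-SQL like this (the table definition is illustrative, not from the original post):

```sql
-- Illustrative schema: myValue carries a unique constraint
CREATE TABLE mytable (
    id      int IDENTITY(1,1) PRIMARY KEY,
    myValue varchar(20) NOT NULL UNIQUE
);

-- Option 1: check first, then insert (not atomic by itself; another
-- session can insert between the SELECT and the INSERT)
IF NOT EXISTS (SELECT 1 FROM mytable WHERE myValue = '1234567890')
    INSERT mytable (myValue) VALUES ('1234567890');

-- Option 2: insert unconditionally and let the unique constraint
-- reject duplicates (error 2627 for a constraint violation,
-- 2601 for a unique index violation)
INSERT mytable (myValue) VALUES ('1234567890');
```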
SOLUTION
Daniel Reynolds (United States)

This solution is only available to members of Experts Exchange.
rowansmith (ASKER)
The table has a unique index on the field myValue, so the database will prevent a duplicate entry from being made. The question is: what is the best (guaranteed) way to detect that a collision is about to happen and report to the user that the insert failed?

1. Query for the value, and if it does not exist, do the insert and advise the user they succeeded.

2. Try to do the insert, and if it fails (due to the unique index) tell the user they failed; otherwise tell them they succeeded.

Imagine 1000 users all trying to insert the same value into a table at the same time; only one of them can win the race. It seems to me that option 1 would potentially generate the same errors, or is some sort of locking going on that prevents another user from executing the same select statement until the transaction in option 1 is completed?

I cannot believe that there is not a correct and documented way to insert a unique value into a table and guarantee that it is atomic.
If your application will allow it, you might be better off using newid() rather than your own unique generated value. newid() will generate a unique GUID every time it is called.


Alternatively, if myValue is unique, why not make it an identity that SQL Server will autogenerate for each row? You can retrieve the value just inserted by selecting SCOPE_IDENTITY() after your insert if you need to pass it back to the caller.
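A sketch of both suggestions together (table and column names are made up for illustration):

```sql
-- Sketch: letting SQL Server generate the unique values itself
CREATE TABLE mytable2 (
    id      int IDENTITY(1,1) PRIMARY KEY,              -- auto-incrementing key
    guid    uniqueidentifier NOT NULL DEFAULT NEWID(),  -- server-generated GUID
    payload varchar(100) NULL
);

INSERT mytable2 (payload) VALUES ('some data');

-- SCOPE_IDENTITY() returns the identity value generated by the last
-- insert in the current scope (unlike @@IDENTITY, it is not affected
-- by inserts performed inside triggers)
SELECT SCOPE_IDENTITY() AS new_id;
```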
kselvia:
---------
A good idea, but in this case the unique identity cannot be generated by the SQL Server, as it is generated by remote program(s) that are not using this SQL Server.

Basically, remote nodes are collecting similar information about a unique IP address. If the IP address already exists in the database then it's a simple matter of recording that information using the master record as a foreign key; however, if the IP address does not exist then the master record must be created so that the child information can be stored. The issue I have is multiple remote nodes attempting to create the master record simultaneously. At the moment the best way I have found to get around this is to try the insert; if the insert fails due to the unique key violation, then look up the master record and record the appropriate information in another table using the master record's index as a foreign key.

It just seems a very lengthy/inefficient way of doing this, when what I am trying to achieve is ensuring that a select statement and an insert statement (if required) can be run atomically on a table, without locking the entire table, as that would severely bottleneck the remote nodes.

The other way I have thought of doing this is to remove the unique index constraint and then do the following:

insert unique value into table
select count of unique values
if count > 1 then -- the value already existed
  rollback insert statement
  get primary key of the existing value and use it
else -- the value did not exist this is the first insert
  use @@identity from initial insert
end if

This doesn't raise any errors, and means I am not using the database to enforce the uniqueness; instead the T-SQL is enforcing it.

Thoughts?
I'm sure I will get flamed for this, but you could always just ignore the duplicates

create unique clustered index mytable_myval_idx on mytable (myvalue) with IGNORE_DUPS

SQL will do nothing if the row already exists.  Do you care who inserted the master record?
Do you mean IGNORE_DUP_KEY = ON

BOL says this raises a warning and the insert statement fails where there is a collision.

In Query Analyzer I get the message "duplicate row was ignored"; @@identity does not increase, but the actual identity counter does increase, so the next time I do a valid insert a few primary keys are skipped.

Can I programmatically read the warning? Is it in @@ERROR? Is there a @@WARNING equivalent?

This seems to give me what I want:

insert the unique value - I don't care whether it fails or succeeds, and no errors are raised, so no fancy error handling is required
select the record with the unique value - to find the primary key
use the primary key as needed in the other table(s)

There is something I don't quite like about this approach, but it seems to be the best one so far....

Any other ideas?
I think it is in @@ERROR, but you can also check @@ROWCOUNT for 0 rows if it failed. Yes, the identity is incremented even if the insert fails. Identity also increases if a transaction is rolled back after an insert. In general, it's not a good idea to depend on identity values being in perfect sequence.
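Putting that together, a sketch of the @@ROWCOUNT approach (assuming the mytable/myValue schema from the question, with an identity primary key and a unique index created WITH IGNORE_DUP_KEY = ON):

```sql
-- With IGNORE_DUP_KEY = ON, a duplicate insert raises only the warning
-- "Duplicate key was ignored." and affects 0 rows, so @@ROWCOUNT tells
-- us whether this session actually created the row.
INSERT mytable (myValue) VALUES ('1234567890');

IF @@ROWCOUNT = 0
    -- the row already existed: look up its primary key
    SELECT id FROM mytable WHERE myValue = '1234567890';
ELSE
    -- we inserted it: SCOPE_IDENTITY() holds the new key
    SELECT SCOPE_IDENTITY() AS id;
```

Note that @@ROWCOUNT must be read immediately after the INSERT, since almost any intervening statement resets it.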

If you are going to capture and process the error event, you might as well use TRY/CATCH since you are on SQL 2005:

BEGIN TRY
      insert mytable(myvalue) values ('1234567890')
      return(insert-succeeded)
END TRY
BEGIN CATCH
      return(insert-failed)
END CATCH
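A sketch fleshing that out so that only duplicate-key errors are treated as "insert failed" and anything else is re-raised (error 2627 is a unique constraint violation, 2601 a unique index violation; the message text is my own):

```sql
BEGIN TRY
    INSERT mytable (myValue) VALUES ('1234567890');
    -- reaching this point means the insert succeeded
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() IN (2601, 2627)
        -- duplicate: the value already exists, report insert-failed
        SELECT 'insert-failed' AS result;
    ELSE
        -- any other error should not be silently swallowed
        -- (THROW is only available from SQL 2012 on, so re-raise manually)
        RAISERROR ('unexpected error during insert', 16, 1);
END CATCH
```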
BTW it seems to me your idea of

insert unique value into table
select count of unique values
if count > 1 then -- the value already existed
  rollback insert statement
  get primary key of the existing value and use it
else -- the value did not exist this is the first insert
  use @@identity from initial insert
end if

would suffer from the same concurrency issues as the original. There could still be a race condition with two clients storing the same value at the same time. I wouldn't depend on your code to ensure unique values.


I experimented with this and found that the second (and all subsequent) inserts would block until the transaction of the first insert was either committed or rolled back, at which point the subsequent inserts would then be allowed to succeed or fail.

My understanding is that the insert to an index is an atomic action and two processes cannot insert the same value independently of each other. I tested this with two open SQL sessions. In the first I executed:

begin transaction
insert record A

In the second I executed the same. The second window did not return anything to the user until the first window was committed or rolled back, while the first window allowed the user to select, insert more records, whatever; effectively the second user was waiting for the first user to either commit his change to the index or roll it back.

So:

time   user 1              user 2
1      begin transaction   begin transaction
2      insert recordA
3      do some stuff()     insert recordA  <-- blocks until user 1 commits or rolls back
4      do more stuff()
5      commit
6                          record is inserted

So it seemed to work, but that said, I like the IGNORE_DUP_KEY solution better as it allows one to maintain database integrity through the database and not through SQL code that can be changed erroneously!
ASKER CERTIFIED SOLUTION

This solution is only available to members of Experts Exchange.
With the IF EXISTS approach the question remains: are the select and the insert atomic? I doubt they are. That would mean a competing process could insert after the select has returned.
SOLUTION

This solution is only available to members of Experts Exchange.
As promised. I've made the tests a bit more realistic, in that they now try to insert different myValues, so some of them cause an insert while others don't. I'm assuming that the sheer number of loops will even out the differences between the test situations.

time needed to create     records
--------------------- -----------
                 8140     1048576

time needed to index
--------------------
                4046
  TRY/CATCH
-----------
       6610

where not exists()
------------------
              3030

if not exists() <unsafe>
------------------------
                    3190

if not exists() <SAFE>
----------------------
                  1796



Surprising that the 'safe' method seems faster than the 'unsafe' method. It might be because internally it takes a lock only once, when there is a need to insert.
SET NOCOUNT ON 
 
-- create some test data, fast
IF OBJECT_ID('tempdb..##start') IS NOT NULL
    DROP TABLE ##start
 
CREATE TABLE ##start (id int NOT NULL)
 
INSERT ##start (id) VALUES (1)
 
DECLARE @loop int,
        @offset int,
        @start_time datetime
 
SELECT @loop   = 1,
       @offset = 1,
       @start_time = GetDate()
 
WHILE @loop <= 20 -- 2^20 = errr, a lot
    BEGIN
        INSERT ##start
        SELECT id + @offset
          FROM ##start
 
        SELECT @loop   = @loop + 1,
               @offset = @offset * 2
 
    END
 
SELECT 'time needed to create' = DateDiff (ms, @start_time, GetDate()), 'records' = Power(2, 20)
SELECT @start_time = GetDate()
 
CREATE UNIQUE CLUSTERED INDEX uq0_start ON ##start (id)
 
SELECT 'time needed to index' = DateDiff (ms, @start_time, GetDate())
 
GO
 
CREATE TABLE #t (id INT NOT NULL)
 
INSERT #t (id)
SELECT id FROM ##start
 
CREATE UNIQUE CLUSTERED INDEX tididxtmp ON #t ( id ) WITH (IGNORE_DUP_KEY = ON)
 
-- using IGNORE_DUP_KEY
-- generates "Duplicate key was ignored." warnings. These cannot be suppressed and will be returned to the client in SQLError
 
DECLARE @counter int,
        @start_time datetime,
        @ceiling int,
        @id int
 
SELECT @counter    = 100000,
       @start_time = GetDate(),
       @ceiling    = Power(2, 20),
       @id         = 0
 
WHILE @counter > 0
    BEGIN
        INSERT #t (id) VALUES (@id)
 
        SELECT @counter = @counter - 1,
               @id      = Rand() * @ceiling
    END
 
SELECT 'using IGNORE_DUP_KEY' = DateDiff (ms, @start_time, GetDate())
 
DROP TABLE #t
GO
 
CREATE TABLE #t (id INT NOT NULL)
 
INSERT #t (id)
SELECT id FROM ##start
 
CREATE UNIQUE CLUSTERED INDEX tididxtmp ON #t ( id ) WITH (IGNORE_DUP_KEY = ON)
 
-- using TRY CATCH
DECLARE @counter int,
        @start_time datetime,
        @ceiling int,
        @id int
 
SELECT @counter    = 100000,
       @start_time = GetDate(),
       @ceiling    = Power(2, 20),
       @id         = 0
 
WHILE @counter > 0
    BEGIN
        BEGIN TRY
            INSERT #t (id) VALUES (@id)
        END TRY
        BEGIN CATCH
        END CATCH
 
        SELECT @counter = @counter - 1,
               @id      = Rand() * @ceiling
    END
 
SELECT 'TRY/CATCH' = DateDiff (ms, @start_time, GetDate())
 
DROP TABLE #t
GO
 
 
CREATE TABLE #t (id INT NOT NULL)
 
INSERT #t (id)
SELECT id FROM ##start
 
CREATE UNIQUE CLUSTERED INDEX tididxtmp ON #t ( id ) WITH (IGNORE_DUP_KEY = ON)
 
-- using WHERE NOT EXISTS()
DECLARE @counter int,
        @start_time datetime,
        @ceiling int,
        @id int
 
SELECT @counter    = 100000,
       @start_time = GetDate(),
       @ceiling    = Power(2, 20),
       @id         = 0
 
WHILE @counter > 0
    BEGIN
        INSERT #t (id) 
        SELECT @id
         WHERE NOT EXISTS ( SELECT 1 FROM #t WHERE id = @id)
 
        SELECT @counter = @counter - 1,
               @id      = Rand() * @ceiling
    END
 
SELECT 'where not exists()' = DateDiff (ms, @start_time, GetDate())
 
DROP TABLE #t
GO
CREATE TABLE #t (id INT NOT NULL)
 
INSERT #t (id)
SELECT id FROM ##start
 
CREATE UNIQUE CLUSTERED INDEX tididxtmp ON #t ( id ) WITH (IGNORE_DUP_KEY = ON)
 
-- using IF NOT EXISTS() <unsafe>
DECLARE @counter int,
        @start_time datetime,
        @ceiling int,
        @id int
 
SELECT @counter    = 100000,
       @start_time = GetDate(),
       @ceiling    = Power(2, 20),
       @id         = 0
 
WHILE @counter > 0
    BEGIN
 
        IF NOT EXISTS(SELECT 1 FROM #t WHERE id = @id)
            BEGIN
                INSERT #t (id) VALUES (@id)
            END
 
        SELECT @counter = @counter - 1,
               @id      = Rand() * @ceiling
    END
 
SELECT 'if not exists() <unsafe>' = DateDiff (ms, @start_time, GetDate())
 
DROP TABLE #t
GO
 
CREATE TABLE #t (id INT NOT NULL)
 
INSERT #t (id)
SELECT id FROM ##start
 
CREATE UNIQUE CLUSTERED INDEX tididxtmp ON #t ( id ) WITH (IGNORE_DUP_KEY = ON)
 
-- using IF NOT EXISTS() with (UPDLOCK, SERIALIZABLE) <SAFE>
DECLARE @counter int,
        @start_time datetime,
        @ceiling int,
        @id int
 
SELECT @counter    = 100000,
       @start_time = GetDate(),
       @ceiling    = Power(2, 20),
       @id         = 0
 
WHILE @counter > 0
    BEGIN
 
        BEGIN TRANSACTION
            IF NOT EXISTS (SELECT NULL FROM #t WITH (UPDLOCK, SERIALIZABLE) WHERE id = @id )
                INSERT #t (id) VALUES (@id)
 
            SELECT @counter = @counter - 1,
                   @id      = Rand() * @ceiling
 
        COMMIT TRANSACTION
 
        SELECT @counter = @counter - 1,
               @id      = Rand() * @ceiling
    END
 
SELECT 'if not exists() <SAFE>' = DateDiff (ms, @start_time, GetDate())
 
DROP TABLE #t
 
go


Very nice, deroby. I didn't know you could do this:

(SELECT NULL FROM #t WITH (UPDLOCK, SERIALIZABLE) WHERE id = @id )

Sadly, I just found a rather startling 'bug' in my test code: the SELECT @counter = @counter - 1 is done TWICE in the <SAFE> loop, hence you'll need to double that timing (more or less), sigh...

Fixing that brings me to :

time needed to create     records
--------------------- -----------
                 7966     1048576

time needed to index
--------------------
                3313
  TRY/CATCH
-----------
      10143

where not exists()
------------------
              2470

if not exists() <unsafe>
------------------------
                    1750

if not exists() <SAFE>
----------------------
                  2546


So that's more "as expected" (one would expect that transactions have some overhead); too bad the penalty is that high =(
Still, I think the <SAFE> method is the way to go, or the WHERE NOT EXISTS() method, which is just as safe IMHO and actually easier to code.
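For the IP-address use case described earlier, the recommended pattern could be combined into a single get-or-create routine. This is only a sketch; the ip_master table and its columns are assumptions for illustration:

```sql
-- Sketch: atomically get-or-create a master record and return its key.
-- Assumes a table ip_master (id int IDENTITY(1,1) PRIMARY KEY,
--                            ip varchar(15) NOT NULL UNIQUE).
DECLARE @ip varchar(15), @master_id int;
SET @ip = '192.0.2.17';

BEGIN TRANSACTION;

    -- UPDLOCK, SERIALIZABLE take a key-range lock even when the row
    -- does not exist yet, so no other session can insert @ip between
    -- our check and our insert
    SELECT @master_id = id
      FROM ip_master WITH (UPDLOCK, SERIALIZABLE)
     WHERE ip = @ip;

    IF @master_id IS NULL
        BEGIN
            INSERT ip_master (ip) VALUES (@ip);
            SET @master_id = SCOPE_IDENTITY();
        END

COMMIT TRANSACTION;

-- @master_id can now be used as the foreign key for the child rows
SELECT @master_id AS master_id;
```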
Brilliant. So reading BOL, SERIALIZABLE states that it locks the table or data page; am I right that the whole table will only be locked where the table is smaller than a data page? And does the appropriate data page where the information would exist get locked even if the information doesn't exist?

What is the purpose of UPDLOCK? BOL says that an update lock is taken and held until the transaction completes, but what exactly is an update lock?

And why, therefore, are both lock types (hints) necessary?

Thanks.
So is the WHERE NOT EXISTS atomic? I am guessing that, as it is one single command, it is treated as a single transaction?

To be on the safe side, would it not make sense to add the UPDLOCK and SERIALIZABLE hints to the WHERE NOT EXISTS clause?

INSERT #t (id)
        SELECT @id
          WHERE NOT EXISTS (
                SELECT 1 FROM #t
                WITH (UPDLOCK, SERIALIZABLE)
                WHERE id = @id
           )

I'll let deroby answer, but I did test UPDLOCK to prove to myself it worked. I don't know exactly what it locked, but I was able to insert id 2 from another session while the lock intention was held on id 1 (I put a WAITFOR DELAY in there to give me time to test it), and an insert of id 1 failed from the other session, so it worked just like you would want it to. I'm not sure how it manages that. It certainly didn't take a PAGE or TABLE lock (there are fewer than 50 bytes in the whole table, so the insert of id 2 would have failed if it had), and there was no row where id = 1 to lock. I should research it, I guess :)
I think the WHERE NOT EXISTS() construction is treated as one single query, so that should make things atomic, yes.
Adding the lock won't hurt, but I doubt it's really needed. Not sure how you would be able to test this =/

Not sure if you could add ROWLOCK to the locking hints (feel free to try) in order to minimize the impact of the lock, but as we are locking just one single record at a time, I doubt MSSQL will escalate the (row or page) lock into a full-blown table lock anyway. So yes, I presume it will only take a page lock. (If that page is the entire table, well, then it's the same as a table lock indeed =)

I'm no guru when it comes to locks, but from my understanding UPDLOCK prevents other processes from making any changes, but does not prohibit them from reading the data. This is in contrast to XLOCK, which takes an exclusive lock on the object and stops all other processes from reading or writing to it.

(I'm sure someone will correct me if I'm blatantly wrong here... =)

@kselvia: Yeah, I've been stumped by this too; I used to take a table lock in order to 'lock what's not there', but I was told about this behavior some time ago and it opened up a variety of possibilities.

You can take a lock using "WHERE row_id = 100", and even when that record does not exist, it still locks out all other processes from "manipulating" that record (manipulating in this case being INSERTing).

Black magic if you ask me, very counter-intuitive too, but one hell of a feature =)
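The "lock what's not there" behavior can be observed with two sessions; a sketch along the lines of kselvia's WAITFOR experiment (table and value assumed):

```sql
-- Session 1: take a key-range lock on a key that does not exist yet
BEGIN TRANSACTION;
SELECT 1 FROM mytable WITH (UPDLOCK, SERIALIZABLE) WHERE myValue = 'X';
WAITFOR DELAY '00:00:30';   -- hold the lock for 30 seconds
COMMIT TRANSACTION;

-- Session 2 (run meanwhile): this insert blocks until session 1 commits,
-- because the key range covering 'X' is locked even though no such row exists
INSERT mytable (myValue) VALUES ('X');
```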
Isolate and get rid of your duplicates first, then implement the unique constraint: they should not be there in the first place... Duplicates = incorrect count results...

My two cents...
It is amusing that the first answer was the correct answer, but deroby and kselvia went a long way toward making me happy that it was the right answer, as even the author of the first answer was not 100% certain!