• Status: Solved

Ensuring unique records upon insertion

The code belongs to a stored proc that gets called by a scheduled job at regular intervals.
Despite the WHERE (NOT EXISTS ...) clause, I'm getting records inserted that have duplicate ReadingTimestamp and Reference pairings.
Can someone suggest the cleanest method of preventing duplicate records? I'm thinking either (A) having the table discard (ignore) duplicates, (B) using a COMMIT TRANSACTION after one final check right before committing, which seems a bit redundant, or (C) locking the cus_GasMeterReadings table. Another application has to read this intermediate table, so (C) might interfere with it.
Also, since this is the only SP that inserts into the table, could one scheduled call of the SP fail to finish before the next call to it starts?
Thanks!
INSERT Montage.dbo.Cus_GasMeterReadings
SELECT dls.TimestampUTC, dl.QuantityID, dl.[Value], 0
FROM ION_Data.dbo.DataLog dl WITH (NOLOCK)
INNER JOIN ION_Data.dbo.DataLogStamp dls WITH (NOLOCK)
    ON dl.DataLogStampID = dls.ID
WHERE NOT EXISTS (
        SELECT 1
        FROM Montage.dbo.Cus_GasMeterReadings gmr
        WHERE dls.TimestampUTC = gmr.ReadingTimestamp
          AND dl.QuantityID = gmr.Reference)
  AND dls.TimestampUTC >= @StartDate
  AND dls.TimestampUTC < @EndDate
GROUP BY dls.TimestampUTC, dl.QuantityID, dl.[Value]
ORDER BY dls.TimestampUTC, dl.QuantityID


Asked by jdallen75

1 Solution
8080_Diver commented:
> Also, since this is the only SP that inserts to the table, one scheduled call of the SP may not finish before another call to it?
Now that is a problem!  If you are getting overruns (i.e., call 2 comes in before call 1 finishes), then you are going to have a real problem.
> gets called by a scheduled job at regular intervals.

Okay, define "regular intervals": does this mean every 5 minutes, every 5 seconds, 5 times every second, once every 5 milliseconds, or what?
Have you thought about inserting the data into a staging table whose unique index / primary key is an Identity column, and then periodically running a task that selects a set of distinct records to insert into the destination table? The task would record the most recent timestamp it processed in a control record, so that it knows where to start with the next set of distinct records. You could even delete the staging records that match the ones you have inserted.
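A minimal sketch of that staging approach; the staging table, control table, and their column names are illustrative (not from this thread):

```sql
-- Hypothetical staging table: the Identity primary key accepts every
-- incoming row, duplicates and all.
CREATE TABLE dbo.Cus_GasMeterReadings_Stage (
    StageID          INT IDENTITY(1,1) PRIMARY KEY,
    ReadingTimestamp DATETIME NOT NULL,
    Reference        INT      NOT NULL,
    [Value]          FLOAT    NOT NULL
);

-- Hypothetical one-row control table remembering the watermark.
CREATE TABLE dbo.Cus_GasMeterReadings_Control (
    LastTimestamp DATETIME NOT NULL
);

-- Periodic task: copy the distinct rows newer than the watermark,
-- then advance the watermark.
DECLARE @LastProcessed DATETIME;
SELECT @LastProcessed = LastTimestamp
FROM dbo.Cus_GasMeterReadings_Control;

INSERT INTO Montage.dbo.Cus_GasMeterReadings
SELECT DISTINCT s.ReadingTimestamp, s.Reference, s.[Value], 0
FROM dbo.Cus_GasMeterReadings_Stage s
WHERE s.ReadingTimestamp > @LastProcessed;

UPDATE dbo.Cus_GasMeterReadings_Control
SET LastTimestamp = (SELECT MAX(ReadingTimestamp)
                     FROM dbo.Cus_GasMeterReadings_Stage);
```

Because only this task writes to the destination table, duplicates can only arrive via the staging table, where the Identity key tolerates them.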
Using a NOT EXISTS is very possibly part of the problem (that is a fairly expensive way to find out if things aren't in the table).
Also, the GROUP BY is not at all necessary and, really, just clutters things up.  GROUP BY should only be used when you are doing things like COUNT, MAX, MIN, or SUM.
Do you have a Primary Key on the target table?  If so, is it just an Identity column?  If so, do you have any Unique Indexes?  
See if the following works any better:
 
 

INSERT INTO Montage.dbo.Cus_GasMeterReadings
SELECT dls.TimestampUTC,
       dl.QuantityID,
       dl.[Value],
       0
FROM ION_Data.dbo.DataLog dl WITH (NOLOCK)
INNER JOIN ION_Data.dbo.DataLogStamp dls WITH (NOLOCK)
  ON dl.DataLogStampID = dls.ID
LEFT OUTER JOIN Montage.dbo.Cus_GasMeterReadings Z
  ON  dls.TimestampUTC = Z.ReadingTimestamp
  AND dl.QuantityID = Z.Reference
WHERE Z.ReadingTimestamp IS NULL
  AND dls.TimestampUTC >= @StartDate
  AND dls.TimestampUTC < @EndDate
ORDER BY dls.TimestampUTC, dl.QuantityID

 
BrandonGalderisi commented:
If you need unique records, why not use a unique constraint across the two columns?  That IS their purpose.
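A composite unique constraint on those two columns is one line of DDL; the constraint name here is illustrative:

```sql
-- Reject any second row with the same (ReadingTimestamp, Reference) pair.
ALTER TABLE Montage.dbo.Cus_GasMeterReadings
ADD CONSTRAINT UQ_GasMeterReadings_Reading_Reference
    UNIQUE (ReadingTimestamp, Reference);
```

With this in place, an INSERT that would create a duplicate pair fails with a unique-key violation error instead of silently storing the duplicate.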

As 8080 has pointed out, you will have problems with concurrency.  Where are your @startdate and @enddate coming from?  If you don't attempt to process overlapping ranges, then you will limit your concurrency issues.


> Using a NOT EXISTS is very possibly part of the problem (that is a fairly expensive way to find out if things aren't in the table)
Not if you are doing it on properly indexed tables.  


> GROUP BY should only be used when you are doing things like COUNT, MAX, MIN, SUM.

GROUP BY can also be used as a DISTINCT of sorts.  It doesn't HAVE to be used with aggregates.
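For example, these two queries return the same set of rows, with no aggregate anywhere:

```sql
SELECT DISTINCT QuantityID
FROM ION_Data.dbo.DataLog;

SELECT QuantityID
FROM ION_Data.dbo.DataLog
GROUP BY QuantityID;
```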

 
jdallen75 (Author) commented:
8080:
Regular in this case is every 15 minutes, though what you are looking at is Step 1 of 6 very similar steps (electricity, etc.) in a scheduled job.
I thought the GROUP BY was a little odd, but it was provided in a sample from a colleague.
Cus_GasMeterReadings has a PKIdent column to serve as a unique primary key - just an auto-incremented number. I can't check at the moment, but I believe I set up non-unique indexes on ReadingTimestamp and Status (since they are used in WHERE and ORDER BY clauses elsewhere).
If I were to set up a unique index on ReadingTimestamp + Reference together - I remember seeing an "ignore duplicates" setting - does that mean duplicate rows are ignored altogether, or just for the purpose of indexing (if that's possible)?
Thanks for the input.
PS: Why "WHERE Z.ReadingTimestamp IS NULL"?


 
jdallen75 (Author) commented:
Brandon:
What happens when the stored proc tries to write duplicate records to the table if those two columns have a unique constraint? Does it just ignore them and move on, produce a message, or throw an error?
Thanks.
 
BrandonGalderisi commented:
ignore duplicates means that if you attempt to insert a duplicate, the record is ignored and NO error is generated.  The duplicate WILL NOT be inserted.

It will throw an error, unless, as I explained above, the ignore-duplicates option for the index is set.
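On SQL Server, that option is spelled IGNORE_DUP_KEY on a unique index; the index name below is illustrative:

```sql
-- Attempts to insert a duplicate (ReadingTimestamp, Reference) pair are
-- silently skipped; the INSERT succeeds with only the warning
-- "Duplicate key was ignored."
CREATE UNIQUE NONCLUSTERED INDEX UX_GasMeterReadings_Reading_Reference
ON Montage.dbo.Cus_GasMeterReadings (ReadingTimestamp, Reference)
WITH (IGNORE_DUP_KEY = ON);
```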
 
8080_Diver commented:
> PS: Why "WHERE Z.ReadingTimestamp IS NULL"?
As a passing bit of information, that makes sure there is no matching record in the table aliased as Z.  Since Z is a LEFT OUTER JOIN to the table into which you are inserting the data, keeping only the rows where Z.ReadingTimestamp IS NULL excludes every row that already exists in that table.
However, the IGNORE DUPLICATES option is probably the better choice in this instance. ;-)
