I am getting duplicate records because of NOLOCk , I understand that this is doing a uncommitted (or) dirty reads and ending up with duplicates.
Could you please suggest the Isloation level to be used as I am querying the OLTP Source table below (BASE_ASSIGNMENTS) which is continuously being updated all the time. I am reading this table and perform the ETL operations and store the aggregated data in the warehouse . Before I store in the warehouse , I do the ETL staging operation where I encounter the duplicate issue .
Please correct this query (used to load Staging from source table ) and help me with the solution.
SELECT DISTINCT 21 DATASOURCE_ID, B.KVISUMMARYINTID KVISUMMARYINTID, CAST(B.WAREHOUSE_ID AS NVARCHAR(32)) WH_ID,
FROM BASE_ASSIGNMENTS B WITH (NOLOCK)
LEFT OUTER JOIN
(SELECT ASSIGN_NUM, MIN(TRNDTE) TRNDTE FROM KVI_SUMMARY_HST WITH (NOLOCK) GROUP BY ASSIGN_NUM
) H
ON B.ASSIGN_NUM = H.ASSIGN_NUM WHERE B.REPORT_DATE BETWEEN '2016-01-24T00:00:00' AND '2016-01-26T23:59:59'
Thanks,
Sreekanth.
DatabasesMicrosoft SQL ServerMicrosoft SQL Server 2005Microsoft SQL Server 2008SQL
Last Comment
n_srikanth4
8/22/2022 - Mon
ste5an
So you're getting those dups while reading the staging table?
In the ETL process there should be no need for NOLOCK at all, cause the E-T-L is an ordered process with discrete steps.
n_srikanth4
ASKER
Hi Ste5an ,
I am getting the duplicate records while reading the data from the source and loading to the stage table using the query below :
Query :
SELECT DISTINCT 21 DATASOURCE_ID, B.KVISUMMARYINTID KVISUMMARYINTID, CAST(B.WAREHOUSE_ID AS NVARCHAR(32)) WH_ID,
FROM BASE_ASSIGNMENTS B WITH (NOLOCK)
LEFT OUTER JOIN
(SELECT ASSIGN_NUM, MIN(TRNDTE) TRNDTE FROM KVI_SUMMARY_HST WITH (NOLOCK) GROUP BY ASSIGN_NUM
) H
ON B.ASSIGN_NUM = H.ASSIGN_NUM WHERE B.REPORT_DATE BETWEEN '2016-01-24T00:00:00' AND '2016-01-26T23:59:59'
Could you please provide me an example queries for READ COMMITED including the alternatives SNAPSHOT and RCSI . What is the best one to use in this scenario where the source tables are always updating and I am reading the data from the source at the same time and I want to ensure that I want to avoid reading the same data twice to avoid duplicates while reading the data from the source .
Thanks,
Sreekanth.
ste5an
What does "source tables are always updating" mean?
n_srikanth4
ASKER
Ste5an , source tables are always updating means they are transactional OLTP systems which are updated all the time and I am the consumer of the data and I read the transactional tables while they are being updated .
For a normal OLTP using READ COMMITED is imho sufficient.
The problem is your definition of "always" and "all the time".
Without a concrete analysis of your system it's hard to tell. There is for sure a tipping point, where using log shipping to a second instance or stream processing must be considered.
n_srikanth4
ASKER
ste5an, Could you please give a simple example(query) with regards to my scenario how the READ COMMITED should work (or) considering any other aletrnatives like using sharedlocks\log shipping\stream processing , if so how to use them ?
ste5an
Don't use NOLOCK and check your used isolation level. When you don't have specified one, then it's READ COMMITED. You can run DBCC USEROPTIONS; to check it.
For the usage of different isolation levels or strategies an concrete analysis is required, but this is out-of-scope for at least of me.
In the ETL process there should be no need for NOLOCK at all, cause the E-T-L is an ordered process with discrete steps.