Having duplicates with NOLOCK

n_srikanth4
n_srikanth4 used Ask the Experts™
on
Hi Team,

     I am getting duplicate records because of NOLOCk , I understand that this is doing a uncommitted (or) dirty reads and ending up with duplicates.
Could you please suggest the Isloation level to be used as I am querying the OLTP Source  table below (BASE_ASSIGNMENTS) which is continuously being updated all the time. I am reading this table and perform the ETL operations and store the aggregated data in the warehouse . Before I store in the warehouse , I do the ETL staging operation where I encounter the duplicate issue .

Please correct this query (used to load Staging  from source table  ) and help me with the solution.

SELECT DISTINCT 21 DATASOURCE_ID, B.KVISUMMARYINTID KVISUMMARYINTID, CAST(B.WAREHOUSE_ID AS NVARCHAR(32)) WH_ID,
FROM BASE_ASSIGNMENTS B WITH (NOLOCK)
LEFT OUTER JOIN
(SELECT ASSIGN_NUM, MIN(TRNDTE) TRNDTE FROM KVI_SUMMARY_HST WITH (NOLOCK) GROUP BY ASSIGN_NUM
) H
 ON B.ASSIGN_NUM = H.ASSIGN_NUM WHERE B.REPORT_DATE BETWEEN '2016-01-24T00:00:00' AND '2016-01-26T23:59:59'

Thanks,
Sreekanth.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
ste5anSenior Developer

Commented:
So you're getting those dups while reading the staging table?

In the ETL process there should be no need for NOLOCK at all, cause the E-T-L is an ordered process with discrete steps.

Author

Commented:
Hi Ste5an ,

   I am getting the duplicate records while reading the data from the source and loading to the stage table using the query below :

Query :

SELECT DISTINCT 21 DATASOURCE_ID, B.KVISUMMARYINTID KVISUMMARYINTID, CAST(B.WAREHOUSE_ID AS NVARCHAR(32)) WH_ID,
FROM BASE_ASSIGNMENTS B WITH (NOLOCK)
LEFT OUTER JOIN
(SELECT ASSIGN_NUM, MIN(TRNDTE) TRNDTE FROM KVI_SUMMARY_HST WITH (NOLOCK) GROUP BY ASSIGN_NUM
) H
 ON B.ASSIGN_NUM = H.ASSIGN_NUM WHERE B.REPORT_DATE BETWEEN '2016-01-24T00:00:00' AND '2016-01-26T23:59:59'


Thanks,
Sreekanth.
EE Topic Advisor
Most Valuable Expert 2014
Awarded 2013
Commented:
Do NOT use "with (nolock)", just remove it from the query.

Also, you should NOT use "between" for your date range. The way you are currently using it creates a one second gap. Instead of
BETWEEN '2016-01-24T00:00:00' AND '2016-01-26T23:59:59'
use
column >= '20160124'  AND column < '20160127' --YYYYMMDD is the safest date format in SQL Server
SELECT DISTINCT
      21 DATASOURCE_ID
    , B.KVISUMMARYINTID KVISUMMARYINTID
    , CAST(B.WAREHOUSE_ID AS nvarchar(32)) WH_ID
   ....
   , H.TRNDTE
FROM BASE_ASSIGNMENTS B
      LEFT OUTER JOIN (
            SELECT
                  ASSIGN_NUM
                , MIN(TRNDTE) TRNDTE
            FROM KVI_SUMMARY_HST WITH (NOLOCK)
            GROUP BY
                  ASSIGN_NUM
      ) H ON B.ASSIGN_NUM = H.ASSIGN_NUM
WHERE B.REPORT_DATE >= '20160124'
      AND B.REPORT_DATE < '20160127'

Open in new window

for more on the between topic please see: "Beware of Between"
I assume H.TRNDTE is actually used in the query otherwise you should just stop using the derived table
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

ste5anSenior Developer

Commented:
So the source is a different system. In this case don't use NOLOCK or READ UNCOMMITED.

What's wrong with READ COMMITED?

Alternatives are SNAPSHOT and RCSI.

Author

Commented:
Hi Ste5an ,

 Could you please provide me an example queries for READ COMMITED including the alternatives SNAPSHOT and RCSI . What is the best one to use in this scenario where the source tables are always updating and I am reading the data from the source at the same time  and I want to ensure that I want to avoid reading the same data twice  to avoid duplicates while reading the data from the source .

Thanks,
Sreekanth.
ste5anSenior Developer

Commented:
What does "source tables are always updating" mean?

Author

Commented:
Ste5an , source tables are always updating means they are transactional OLTP systems which are updated all the time and I am the consumer of the data and I read the transactional tables while they are being updated .
ste5anSenior Developer

Commented:
For a normal OLTP using READ COMMITED is imho sufficient.

The problem is your definition of "always" and "all the time".

Without a concrete analysis of your system it's hard to tell. There is for sure a tipping point, where using log shipping to a second instance or stream processing must be considered.

Author

Commented:
ste5an, Could you please give a simple example(query) with regards to my scenario how the READ COMMITED should work (or) considering any other aletrnatives like using sharedlocks\log shipping\stream processing , if so how to use them ?
ste5anSenior Developer

Commented:
Don't use NOLOCK and check your used isolation level. When you don't have specified one, then it's READ COMMITED. You can run DBCC USEROPTIONS; to check it.

For the usage of different isolation levels or strategies an concrete analysis is required, but this is out-of-scope for at least of me.
Vitor MontalvãoIT Engineer
Distinguished Expert 2017

Commented:
n_srikanth4, did you try to run the process without the NOLOCK hint as others Experts already suggested?

Author

Commented:
good

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial