• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 112
  • Last Modified:

Deterimine old rows to delete

Hi, I am currently building a data warehouse and have an issue with records to delete from the fact table.
The source data does not update records but rather creates a new row for the change and uses an audit number together with other columns to form a composite key. I am using the following code to get the most recent record as the row may have been changed several times since the last load:

SELECT       * FROM (
 SELECT      *  ROW_NUMBER() OVER (PARTITION BY [SampleCode] ORDER BY [AuditNumber] desc) rn
 FROM [DataMart].[Dim_Sample]
) tmp WHERE rn = 1
AND AuditNumber IS NOT NULL;

(In the live code the column are listed)

If this returns the 4th version of the row, I need to be able to identify the previous 3 and remove them from the fact table.

Anyone any ideas?

Many thanks
0
Morpheus7
Asked:
Morpheus7
  • 2
  • 2
3 Solutions
 
Máté FarkasDatabase Developer and AdministratorCommented:
Use CTE:

WITH CTE AS (
        SELECT      *, ROW_NUMBER() OVER (PARTITION BY [SampleCode] ORDER BY [AuditNumber] desc) rn
        FROM [DataMart].[Dim_Sample]
        AND AuditNumber IS NOT NULL
)
DELETE FROM CTE WHERE Rn > 1

Open in new window


However you want to delete from Fact table but you did not provide the structure and relation between your Dim and Fact table.
0
 
Olaf DoschkeSoftware DeveloperCommented:
The obvious first: Since you only query for rn=1, you always only get one row (per partition), that's also what you're saying anyway. You can't detect how many audit rows there are from this result, also not from the value of AuditNumber.

You'll not really want to delete 3 of 4 accumulated rows, you'll want to keep 3 rows at max, or your audit history  jumps between 0,1,2,and 3 rows. Du you really want that?

Your idea seems to be to react to AuditNumber passing over 4. But you should instead just do a similar DELETE query filtering for rn>3, as simple as that:

Declare @auditdata as Table ([SampleCode] int, data char(3), auditnumber int null);

Insert Into @auditdata values (1,'foo',1),(1,'bar',2),(1,'nul',3),(1,'dev',4);

With old as (
 SELECT ROW_NUMBER() OVER (PARTITION BY [SampleCode] ORDER BY [AuditNumber] desc) rn 
 FROM @auditdata WHERE auditnumber IS NOT NULL
)

DELETE FROM old WHERE rn>3;

--expected result: row (1,'foo',1) is deleted, 3 rows remaining.

Select * from @auditdata

Open in new window


That said, I'd perhaps also make it depend on the real age of the audit data, not only how many rows are in the audit.

PS: When you work with such queries to pick out the newest record it makes sense to keep audit data shorter, so the query for rn=1 remains performant, but I wouldn't make it too tight, because too short audit data makes it less valuable to keep at all.

Bye, Olaf.
0
 
Olaf DoschkeSoftware DeveloperCommented:
A completely different idea is to keep audit data separate anyway, have one main table with current data and a history table with audit data. That way you don't need the initial query for rn=1 at all, you have a table with current rows and no others. For whatever reason you choose to not go that route, it'd enable you to keep much more audit data, as that will have no effect on the performance of queries of the main table with only current data.

I can understand a single audit table if you also sometimes want to query older states. SQL Server also offers a new feature of temporal tables since version 2016, they work with that very concept of a separate history table: https://www.mssqltips.com/sqlservertip/4674/managing-temporal-table-history-in-sql-server-2016/

You may consider using that replacing your own concept.

And last not least,you might like some new query clauses for temporal tables,especially querying data AS OF a certain date:
https://docs.microsoft.com/en-us/sql/relational-databases/tables/querying-data-in-a-system-versioned-temporal-table

Bye, Olaf.
0
 
Pawan KumarDatabase ExpertCommented:
>>If this returns the 4th version of the row, I need to be able to identify the previous 3 and remove them from the fact table.
Anyone any ideas?
Yes there are many options for this. First option is given by Máté Farkas using ROW_NUMBER AND CTE. This is one of the mostly used method to remove the duplicate rows from the table leaving one behind.

Another method below. It will also perform the same way. It will leave the latest audit number for each sample code and removes the other ones.

--
DELETE tbl1 FROM [DataMart].[Dim_Sample] tbl1
CROSS APPLY 
(
       SELECT TOP 1 MAX([AuditNumber]) [AuditNumber] from [DataMart].[Dim_Sample] tbl 
	   WHERE tbl.[SampleCode] = tbl1.[SampleCode]  
	   AND AuditNumber IS NOT NULL     
) tbl
WHERE tbl.[AuditNumber] <> tbl1.[AuditNumber]
--

Open in new window


If you want to read more about how we can remove duplicate please use my blog given below. Here I have given 5 methods to achieve this.
https://msbiskills.com/2016/06/03/sql-puzzle-delete-duplicate-data-from-table-without-using-aggregate-ranking-functions/
0
 
Pawan KumarDatabase ExpertCommented:
Question abandoned
Provided solution
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Cloud Class® Course: SQL Server Core 2016

This course will introduce you to SQL Server Core 2016, as well as teach you about SSMS, data tools, installation, server configuration, using Management Studio, and writing and executing queries.

  • 2
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now