Deterimine old rows to delete

Hi, I am currently building a data warehouse and have an issue with records to delete from the fact table.
The source data does not update records but rather creates a new row for the change and uses an audit number together with other columns to form a composite key. I am using the following code to get the most recent record as the row may have been changed several times since the last load:

SELECT       * FROM (
 SELECT      *  ROW_NUMBER() OVER (PARTITION BY [SampleCode] ORDER BY [AuditNumber] desc) rn
 FROM [DataMart].[Dim_Sample]
) tmp WHERE rn = 1
AND AuditNumber IS NOT NULL;

(In the live code the column are listed)

If this returns the 4th version of the row, I need to be able to identify the previous 3 and remove them from the fact table.

Anyone any ideas?

Many thanks
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Máté FarkasDatabase Developer and AdministratorCommented:
Use CTE:

        SELECT      *, ROW_NUMBER() OVER (PARTITION BY [SampleCode] ORDER BY [AuditNumber] desc) rn
        FROM [DataMart].[Dim_Sample]
        AND AuditNumber IS NOT NULL

Open in new window

However you want to delete from Fact table but you did not provide the structure and relation between your Dim and Fact table.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Olaf DoschkeSoftware DeveloperCommented:
The obvious first: Since you only query for rn=1, you always only get one row (per partition), that's also what you're saying anyway. You can't detect how many audit rows there are from this result, also not from the value of AuditNumber.

You'll not really want to delete 3 of 4 accumulated rows, you'll want to keep 3 rows at max, or your audit history  jumps between 0,1,2,and 3 rows. Du you really want that?

Your idea seems to be to react to AuditNumber passing over 4. But you should instead just do a similar DELETE query filtering for rn>3, as simple as that:

Declare @auditdata as Table ([SampleCode] int, data char(3), auditnumber int null);

Insert Into @auditdata values (1,'foo',1),(1,'bar',2),(1,'nul',3),(1,'dev',4);

With old as (
 SELECT ROW_NUMBER() OVER (PARTITION BY [SampleCode] ORDER BY [AuditNumber] desc) rn 
 FROM @auditdata WHERE auditnumber IS NOT NULL


--expected result: row (1,'foo',1) is deleted, 3 rows remaining.

Select * from @auditdata

Open in new window

That said, I'd perhaps also make it depend on the real age of the audit data, not only how many rows are in the audit.

PS: When you work with such queries to pick out the newest record it makes sense to keep audit data shorter, so the query for rn=1 remains performant, but I wouldn't make it too tight, because too short audit data makes it less valuable to keep at all.

Bye, Olaf.
Olaf DoschkeSoftware DeveloperCommented:
A completely different idea is to keep audit data separate anyway, have one main table with current data and a history table with audit data. That way you don't need the initial query for rn=1 at all, you have a table with current rows and no others. For whatever reason you choose to not go that route, it'd enable you to keep much more audit data, as that will have no effect on the performance of queries of the main table with only current data.

I can understand a single audit table if you also sometimes want to query older states. SQL Server also offers a new feature of temporal tables since version 2016, they work with that very concept of a separate history table:

You may consider using that replacing your own concept.

And last not least,you might like some new query clauses for temporal tables,especially querying data AS OF a certain date:

Bye, Olaf.
Pawan KumarDatabase ExpertCommented:
>>If this returns the 4th version of the row, I need to be able to identify the previous 3 and remove them from the fact table.
Anyone any ideas?
Yes there are many options for this. First option is given by Máté Farkas using ROW_NUMBER AND CTE. This is one of the mostly used method to remove the duplicate rows from the table leaving one behind.

Another method below. It will also perform the same way. It will leave the latest audit number for each sample code and removes the other ones.

DELETE tbl1 FROM [DataMart].[Dim_Sample] tbl1
       SELECT TOP 1 MAX([AuditNumber]) [AuditNumber] from [DataMart].[Dim_Sample] tbl 
	   WHERE tbl.[SampleCode] = tbl1.[SampleCode]  
	   AND AuditNumber IS NOT NULL     
) tbl
WHERE tbl.[AuditNumber] <> tbl1.[AuditNumber]

Open in new window

If you want to read more about how we can remove duplicate please use my blog given below. Here I have given 5 methods to achieve this.
Pawan KumarDatabase ExpertCommented:
Question abandoned
Provided solution
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server

From novice to tech pro — start learning today.