Link to home
Start Free TrialLog in
Avatar of Rupert Eghardt
Rupert EghardtFlag for South Africa

asked on

Delete Statement with Select

Hi Guys

I have a SQL select query which are producing the correct results.

Is there a simple Delete statement that I could add to the Select query to delete all result rows from the select?
Avatar of Norie
Norie

Rupert

Does  the SELECT query return the primary key of the records you want to delete?

P.S Have you tried simply changing SELECT to DELETE?
Avatar of Rupert Eghardt

ASKER

The results does not currently return the primary key, but I could include this column in the result set?

The Select statement is a little complex, and simply replacing Select with Delete does not seem to work.

I was hoping for a post-delete statement .. combined with my current Select
SELECT PKcol, someOthercols
  FROM YourTable
 WHERE SomeCondition

DELETE FROM YourTable WHERE PKcol IN (
SELECT PKcol
  FROM YourTable
 WHERE SomeCondition
)
-- or you may leave the original query unchanged
DELETE FROM YourTable WHERE PKcol IN (SELECT PKcol FROM (
SELECT PKcol, someOthercols
  FROM YourTable
 WHERE SomeCondition
) )

Open in new window

My current Select statement includes "group by" and count(), such as below:

SELECT Col1, Col2, Col3, count(*)
   FROM MyTable
   GROUP BY Col1, Col2, Col3
   HAVING count(*) > 1;

* The statement needs to identify duplicate records by "Col1", which is not my primary column

How do I combine Delete statement with Select statement as above?
Are you sure you want to delete ALL rows where col1 value occurrence is > 1?  Normally one row should remain in the table.

But if you really need to delete all rows simply use col1 in the place of PKcol.

UPDATE: col2 and col3 represent some description of col1 value hopefully.
Hi Pcelba,

This statement is giving me all the duplicate rows in the result set.

In other words, record 1 will remain and any other "duplicate" records (> 1) with the same data will be deleted.
OK, then don't use the above DELETE commands.
Let me put it this way ...

When I take a record from the above Select result-set and run a query on one of the row results (Col1)
I pick up both records in the database,

Whereas my result set is only showing 1, which is (> 1) ... and this is the record I wish to delete.
First of all, there is no post-delete statement.

The simple idea is as Pavel Celba outlines, modify your query, so it only returns the primary key of records you want to delete, that way you can use the WHERE key IN (query) approach, as that requires the query to only contain the key and no other fields.

If you are like me you use a syntax like DELETE d FROM table as d inner (left/right/outer) join anothertable as a .... more complex subquery.

With that approach it only matters you get the correct records fetched, you don't specify a select_list of columns, but your starting point for that also can be any complex query. with any number of joins, so in general, this is mainly replacing the SELECT select_list FROM maintable with DELETE d FROM maintaible d, i.e.:

Select col1,col2, col3 FROM maintable WHERE condition
 -- becomes
DELETE d FROM maintable d WHERE condition

- but also in more complex queries

Select col1,col2, col3 FROM maintable LEFT JOIN detailtable ON condition -...further joins... WHERE condition
 -- becomes
DELETE d FROM maintable d LEFT JOIN detailtable ON condition ...further join... WHERE condition

Open in new window


Bye, Olaf.
It would be better to forget the original query and delete based on row numbers:
DELETE FROM MyTable WHERE PKcol IN (SELECT PKcol FROM (
       SELECT PKcol, ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol) rn
         FROM MyTable) dupl
        WHERE rn > 1 )

Open in new window

Of course you should test it first:
SELECT PKcol, col1, col2, col3, ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol) rn
  FROM MyTable 
 ORDER BY col1, col2, col3, 5

Open in new window

Thank you Pcelba,

Thus your suggested query is replacing my Select statement altogether?
I will delete all duplicate results?  Based on value of Col1?
The SELECT query is still used as a subquery in a slightly different format. The important thing is DELETE command deletes based on PK and row number. PKs are selected based on value col1, col2, and col3. You may use just col1 if necessary.
In the end your question is more about the necessary SELECT to get rows of a table, not GROUPs, a GROUP BY query groups by some head data not a PK (otherwise groups would only consist of single rows), so such queries are no candidate for deletes. You only know the group you want to modify, still not which records. A query with PARTITION keeps single records single records. Now you can also number them, as Pavel did with ROW_NUMBER() and then also order them to tell which is first of more important the one to keep and which are second or higher duplicates to delete.

So it's also important which ORDER you specify. ORDER BY PKcol, as Pavel suggests, and DELETEing WHERE rn>1 later means with an IDENTITY integer key you keep the oldest record, is that really what you want to keep? That's not working good in tables having a uniqueidentifier as pk, as those are not ascending. You will likely want to order by some datetime column.

Bye, Olaf.
Hi Pcelba,

When I run this query to test, I get a couple of million records, which is the total records in the database.
I should only get the ones to be deleted - right?

SELECT PKcol, col1, col2, col3, ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol) rn
  FROM MyTable
 ORDER BY col1, col2, col3, 5

The column on which I want to "filter" duplicates is for example "col1", not the PK.
Col1 holds the "student number", and I need to remove all duplicate student numbers.
PK is a sequential number for each row and is unique.
I've noticed that the duplicate records are indicated as rn = 2
Is this correct?

Can I only show the rows where rn = 2?
>The column on which I want to "filter" duplicates is for example "col1", not the PK.
>Col1 holds the "student number", and I need to remove all duplicate student numbers.

That's understood, but it matters which record you want to keep, this can't be just any of those. If you'd delete where col1=seom value, you'd delete all, therefore you number the records per group or better said per partition, as groups always aggregate but you need individual rows with PK for deletion. You can't say you want to depete dupllicates of some value in col1, you need to furhter specify which of the two (or even more) to delete and which to keep.

As always, a computer doesn't make a "sensible" choice for you, you have to do that.

>Can I only show the rows where rn = 2?
WHERE rn=2
If that tells you rn is no valid column name, use the full expression:
WHERE ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol) = 2

Even better would be a query showing you both records for the partitions having rn=1 and rn=2, so you cannot only  see which records are deleted, but also whether the one with rn=1 are the ones better to keep.

Unless it doesn't matter and the whole row is the same in all its columns, also col4 and more.

Bye, Olaf.
Thanks Olaf,

It makes sense to keep the newer record.
Thus, because PK is sequential record numbers, sorting by PK should work?

Thus rn=2 can be deleted in all cases?

Unless I misunderstood?

We are talking of a couple of hundred thousand duplicate records to be deleted, thus it will unfortunately be an automated process with little intervension.
As last check before deleting I'd just make sure records with rn=2 have a different PKcol value tahn records with rn=1, because the complicated query you did and we never fully saw might just duplicate your rows because of having two related detail records in child detail data.

So if you have rn=1 and rn=2 with the same PkCol value you'd end up deleting "both" records, or - in other words - there would be no real duplicates, at least not in the one table you "found" them.

On the other hand I have to assume doublee values in a column of a single table were the root of going for their deletion. It's obviously easy to find out whether the original table has the double rows and not just the result of your query. Take one of the double values in col1 and do a simple SELECT count(*) as thecount from thetable WHERE col1='thatvalue' and see whether you reallget a 2 at that pure level . too.

Bye, Olaf-
ASKER CERTIFIED SOLUTION
Avatar of Pavel Celba
Pavel Celba
Flag of Czechia image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you Pcelba,

SELECT PKcol, COUNT(*) FROM MyTable GROUP BY PKcol HAVING COUNT(*) > 1
* I get no results


SELECT PKcol, col1, col2, col3, ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol DESC) rn
  FROM MyTable
 WHERE ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol DESC) > 1
 ORDER BY col1, col2, col3, 5


* I get:  Windowed functions can only appear in the SELECT or ORDER BY clauses.  Problem with "Where" statement.
LOL, it needs a slight update...
SELECT * FROM (
   SELECT PKcol, col1, col2, col3, ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol DESC) rn
    FROM MyTable ) allRows
   WHERE rn > 1
 ORDER BY col1, col2, col3, 5

Open in new window

Thank you Pcelba,

I am picking up 170,000 rows, which seems correct
All of them have rn = 2

I assume these are all the duplicated rows, which were added first?
As I would like to keep the latest added record?

I made a backup, will now try the delete statement

Would you mind giving me a short idea of what this part of the Select statement means?

OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol DESC) rn

Over?
Partition by?
Order by?  (as another order statement follows)?
DESC?
Yes, the above 170,000 rows are the ones entered first.

ROW_NUMBER() can calculate row order number for the whole data set or for defined data groups inside the data set.

The "OVER (PARTITION BY Col1, Col2, Col3 ORDER BY PKcol DESC) rn"  is integral part of the ROW_NUMBER() windowing function. It defines how to number the results, what groups to use, and what is the sort order inside the group.

In particular "PARTITION BY Col1, Col2, Col3" is similar to GROUP BY used at the SELECT command level and ORDER BY PKcol DESC sorts the rows inside each group. The DESC clause ensures the newest PKs are sorted at the top (with lower row numbers).

"rn"  is just the column name on output.

Additional ORDER BY takes into effect at the whole command level and defines the row order on output which can be different from the order used for row numbering.

More info about ROW_NUMBER() function: https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-2017
This is the major point of that query. Funny that you're asking NOW at last.

I already hinted on the major difference of GROUP BY vs PARTITION. Both group data, but GROUP BY aggregates it, PARTITION creates partitions. As said your DELETE needs the PK of specific rows to delete them, so you can't GROUP BY and then know the rows to delete, you lose the most important information by grouping, the key of the row to delete.

PARITION on the other hand doies NOT aggregate data, rows are kept single rows, but things like ROW_NUMBER() are reset at each new partition, ecah now group of records with same values Col1,2,3. And that's what you want.

Earlier on you said:
It makes sense to keep the newer record.
Thus, because PK is sequential record numbers, sorting by PK should work?

Thus rn=2 can be deleted in all cases?

Sorting by PK should work with sequential record numbers. rn=2 ( Pavel is correct, better talk of rn>1 for triple and more records) is the newer record in case you order by PK ASCENDING (the default ORDER), so you already were OK with deleting the newer instead of older records. You have to understand and know wjat this means to be able to decide, therefore I put up some warnings and tests.

Again, pick out some specific Col1 values, those are the double values you want to remove, right? And then check, whether the table you want to delete from reallys has double values, or whether they only come from joining data.

Because the simple normal JOIn can duplicate your parent data in 1:n relationships, and if you end up with same PK value in the partitions they still will be numbered rn=1,2,3 and rn>1 still can have the same PK value as rn=1. So, DANGERZONE. Well, since you have a backup, you can restore, but it'll get cumbersome, so you better ensure deleting the right portion of data.

Make sure, you donÄt just have duplicates from your complex query and not from the original data, or you don't just delete the duplicates but all data about some student.

So again, take some value you see reported double from the group by query and then go back to the pure simple single table you want to delete it in, whether it really is double in there or just get's doubled with your SELECT query:

SELECT * FROM yourtable Where col1='some specific value' -quotes or not depend on data type.

Open in new window


Do you really see that value the number of times the COUNT(*) result tells? Then you can go for deleting.

Whether you want to keep the newest or oldest records also depends on the case. If you say data came in double by accident, you rather want to purge that latest data, don't you?
Take a look at some examples and you can better judge that, obviously you can't look at all 170,000 cases, but at 2 or 3 should be managable, shouldn't it? Just don't pick the first and last case, pick some values from the middle.

Bye, Olaf.