mariita
asked on
SQL query that sums the number of duplicates
I want to return the number of duplicate records per institution. My query looks something like the snippet below. I'm getting an error message that there is a missing right parenthesis, even though the number of left and right parentheses are equal. Even if the parenthesis issue is resolved, I'm not sure that query below will give me the results I want.
SELECT year, institution, SUM(COUNT(student_ID) > 1) AS duplicate_record_count
FROM enrolment
GROUP BY year, institution
HAVING COUNT(student_ID) > 1
SELECT year, institution, SUM(COUNT(student_ID) > 1) AS duplicate_record_count
FROM enrolment
GROUP BY year, institution
HAVING COUNT(student_ID) > 1
ASKER
My original query had HAVING COUNT(*) > 1 and this did not give an error. But I don't want to list duplicate records individually; I want to know how many sets of duplicates each institution has.
Try this:
SELECT year, institution, duplicate_record_count
FROM (
SELECT year, institution, COUNT(student_ID) duplicate_record_count
FROM enrolment
GROUP BY year, institution
)
where duplicate_record_count > 1
/
SELECT year, institution, duplicate_record_count
FROM (
SELECT year, institution, COUNT(student_ID) duplicate_record_count
FROM enrolment
GROUP BY year, institution
)
where duplicate_record_count > 1
/
ASKER
SELECT year, institution, COUNT(student_ID) - 1 AS duplicate_record_count
FROM enrolment
GROUP BY year, institution
HAVING COUNT(student_ID) > 1
FROM enrolment
GROUP BY year, institution
HAVING COUNT(student_ID) > 1
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
If you need 'C' try this:
select year, institution, sum( case when duplicate_record_count > 1 then 1 else 0 end) total_dupes
from
(
select year, institution, student_id, count(*) duplicate_record_count
from enrollment
group by year, institution, student_id
)
group by year, institution
/
This should also work and may be more efficient:
select year, institution, sum(case when student_id=next_student_id then 1 else 0 end)
from (
select year, institution, student_id, lead(student_id) over(partition by year, institution order by year, institution) next_student_id
from enrollment
)
group by year, institution
/
| INSTITUTION | ALL_IDS | UNIQUE_IDS | DIFFERENCE |
|-------------|---------|------------|------------|
| A | 4 | 2 | 2 |
| B | 2 | 1 | 1 |
Produced by comparing count(*) to count(distinct ...)Like this:
SELECT
Institution
, COUNT(Student_ID) ALL_IDS
, COUNT(DISTINCT Student_ID) UNIQUE_IDS
, COUNT(Student_ID) - COUNT(DISTINCT Student_ID) DIFFERENCE
FROM enrolment
GROUP BY
Institution
HAVING
COUNT(Student_ID) <> COUNT(DISTINCT Student_ID)
;
Further details:
CREATE TABLE ENROLMENT
(INSTITUTION VARCHAR2(1), STUDENT_ID INT)
;
INSERT ALL
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('A', 123)
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('A', 123)
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('A', 456)
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('A', 456)
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('B', 789)
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('B', 789)
INTO ENROLMENT (INSTITUTION, STUDENT_ID)
VALUES ('C', 147)
SELECT * FROM DUAL
;
see: http://sqlfiddle.com/#!4/048b4/3
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thanks for all these responses! It's going to take me a while to go through them and figure out which ones work best.
Remove the '>1'
SUM(COUNT(student_ID))
Why do you think you need to SUM the COUNT?