Link to home
Start Free TrialLog in
Avatar of mariita
mariitaFlag for Canada

asked on

SQL query that sums the number of duplicates

I want to return the number of duplicate records per institution. My query looks something like the snippet below. I'm getting an error message that there is a missing right parenthesis, even though the number of left and right parentheses are equal. Even if the parenthesis issue is resolved, I'm not sure that query below will give me the results I want.

SELECT year, institution, SUM(COUNT(student_ID) > 1) AS duplicate_record_count
FROM enrolment
GROUP BY year, institution
HAVING COUNT(student_ID) > 1
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

>>SUM(COUNT(student_ID) > 1)

Remove the '>1'

SUM(COUNT(student_ID))


Why do you think you need to SUM the COUNT?
Avatar of mariita

ASKER

My original query had HAVING COUNT(*) > 1 and this did not give an error.  But I don't want to list duplicate records individually; I want to know how many sets of duplicates each institution has.
Try this:
SELECT year, institution, duplicate_record_count
FROM (
   SELECT year, institution, COUNT(student_ID) duplicate_record_count
    FROM enrolment
    GROUP BY year, institution
)
where duplicate_record_count > 1
/
Avatar of mariita

ASKER

That doesn't return an error, but I don't think that will return the number of duplicate records. To use the example below, I want to show that Institution A has 2 duplicates, Institution B has 1, while Institution C has none.

User generated image
SELECT year, institution, COUNT(student_ID) - 1 AS duplicate_record_count
FROM enrolment
GROUP BY year, institution
HAVING COUNT(student_ID) > 1
ASKER CERTIFIED SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
If you need 'C' try this:
select year, institution, sum( case when duplicate_record_count > 1 then 1 else 0 end) total_dupes
from
(
select year, institution, student_id,  count(*) duplicate_record_count
from enrollment
group by year, institution, student_id
)
group by year, institution
/

Open in new window

This should also work and may be more efficient:
select year, institution, sum(case when student_id=next_student_id then 1 else 0 end)
from (
	select year, institution, student_id, lead(student_id) over(partition by year, institution order by year, institution) next_student_id
	from enrollment
)
group by year, institution
/

Open in new window

| INSTITUTION | ALL_IDS | UNIQUE_IDS | DIFFERENCE |
|-------------|---------|------------|------------|
|           A |       4 |          2 |          2 |
|           B |       2 |          1 |          1 |

Open in new window

Produced by comparing count(*) to count(distinct ...)
Like this:
SELECT
      Institution
    , COUNT(Student_ID)                              ALL_IDS
    , COUNT(DISTINCT Student_ID)                     UNIQUE_IDS
    , COUNT(Student_ID) - COUNT(DISTINCT Student_ID) DIFFERENCE
FROM enrolment
GROUP BY 
      Institution
HAVING
       COUNT(Student_ID) <> COUNT(DISTINCT Student_ID)
;

Open in new window

Further details:
CREATE TABLE ENROLMENT
	(INSTITUTION VARCHAR2(1), STUDENT_ID INT)
;

INSERT ALL 
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('A', 123)
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('A', 123)
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('A', 456)
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('A', 456)
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('B', 789)
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('B', 789)
	INTO ENROLMENT (INSTITUTION, STUDENT_ID)
		 VALUES ('C', 147)
SELECT * FROM DUAL
;

Open in new window

see: http://sqlfiddle.com/#!4/048b4/3
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of mariita

ASKER

Thanks for all these responses! It's going to take me a while to go through them and figure out which ones work best.