asked on

Problem with cte (executing slowly)

Hi everyone

I have attached a piece of code which using a cte list all the students who have missed 2 consecutive classes.

There are about 2 million records in the table eRegisters_allstudents_tb

The problem is it takes the code about 14 minutes to execute and I was wondering would there be a way of amending this code so that it executes quicker

Using sql server 2005

Help appreciated

thanks

;with cte as 
 (SELECT 
    
    acad_period, student_id, student_name, student_surname, stage_code, campus_desc, dept_code, dept_desc, 
	c_title, c_initial, c_surname, course_code, course_period, course_desc, register_id, register_group,
	module_code, module_period, module_desc, day_num, day, week_no, start_time, end_time, 
	date as 'date_of_first_absence', absence_code, moa,

	ROW_NUMBER() OVER (PARTITION BY register_id, register_group, student_id 
        ORDER BY date,start_time) AS rownum
FROM eRegisters_allstudents_tb

where
stage_code like 'ENR'

)

, XX as (select a.register_id,a.register_group,a.student_id,a.rownum,count(*) as Missed
          From Cte as a
          Inner join cte as B
            on a.rownum<=b.rownum
           and a.register_id=b.register_id
           and a.register_group=b.register_group
           and a.student_id=b.student_id
           and a.absence_code=b.absence_code
          Where A.absence_code='O'
           and not exists (select x.student_id from cte as x
                            where a.student_id=x.student_id
                             and a.register_id=x.register_id
                             and a.register_group=x.register_group
                             and x.rownum between a.rownum and b.rownum
                             and ( (x.absence_code <> 'O') or (x.absence_code is NULL))  )
            group by a.register_id,a.register_group,a.student_id,a.rownum
            having count(*) >= 2
         )

, yy as (select a.register_id,a.register_group,a.student_id,a.rownum,a.missed
          From xx as a
          Where not exists (select x.student_id from xx as x
                              where a.student_id=x.student_id  
                                and a.register_id=x.register_id
                                and a.register_group=x.register_group
                                and X.rownum < a.rownum  
                                and a.rownum <=x.rownum + x.missed) 
         )

Select A.*,b.missed 

  from Cte as A
  Inner Join YY as b
   on a.register_id=b.register_id
   and a.register_group=b.register_group
   and a.student_id=b.student_id
   and a.rownum=b.rownum

Open in new window

Brendt Hess

Two major issues can crop up with using a CTE in the way you are using it - lack of indexes, and data memory overload. Both have the same cure - Reducing memory usage at various stages

Normally, I have found that using a temp table instead of a CTE in one or more stages can alleviate this issue. However, as things are configured for this query, that doesn't seem to be a useful tactic unless absolutely necessary. It is possible to turn the CTE named 'cte' into a table, then correct the saved row numbers with an UPDATE statement. With proper indexing, this could be what is needed. But let's start with a simpler option - modifying 'cte'.

Change the query as written so that (1) 'cte' only has the essential data elements for processing the intermediate XX and yy CTE queries + a unique pointer back to the eRegisters_allstudents_tb table, and that the final query joins back to the eRegisters_allstudents_tb table for all other data. This will remove the extra data from caching and manipulation when it is not needed. If this change doesn't suffice, then we need to change the 'cte' to write to a table.

lisa_mc

ASKER

hi

Thanks for reply I will try that now so will reply asap

Alpesh Patel

;with cte as
(SELECT

acad_period, student_id, student_name, student_surname, stage_code, campus_desc, dept_code, dept_desc,
c_title, c_initial, c_surname, course_code, course_period, course_desc, register_id, register_group,
module_code, module_period, module_desc, day_num, day, week_no, start_time, end_time,
date as 'date_of_first_absence', absence_code, moa,

ROW_NUMBER() OVER (PARTITION BY register_id, register_group, student_id
ORDER BY date,start_time) AS rownum
FROM eRegisters_allstudents_tb

where
stage_code like 'ENR'

)

, XX as (select a.register_id,a.register_group,a.student_id,a.rownum,count(1) as Missed
From Cte as a
Inner join cte as B
on a.rownum<=b.rownum
and a.register_id=b.register_id
and a.register_group=b.register_group
and a.student_id=b.student_id
and a.absence_code=b.absence_code
Where A.absence_code='O'
and not exists (select 1 from cte as x
where a.student_id=x.student_id
and a.register_id=x.register_id
and a.register_group=x.register_group
and x.rownum between a.rownum and b.rownum
and ( (x.absence_code <> 'O') or (x.absence_code is NULL)) )
group by a.register_id,a.register_group,a.student_id,a.rownum
having count(1) >= 2
)

, yy as (select a.register_id,a.register_group,a.student_id,a.rownum,a.missed
From xx as a
Where not exists (select 1 from xx as x
where a.student_id=x.student_id
and a.register_id=x.register_id
and a.register_group=x.register_group
and X.rownum < a.rownum
and a.rownum <=x.rownum + x.missed)
)

Select A.*,b.missed

from Cte as A
Inner Join YY as b
on a.register_id=b.register_id
and a.register_group=b.register_group
and a.student_id=b.student_id
and a.rownum=b.rownum

lisa_mc

ASKER

hi PatelAlpesh

That didnt work still took over 14 minutes to run

lisa_mc

ASKER

hi bhess1

When I use the information required to go into the cte I get 5738 records and it now runs in approx 5 minutes

when I try to link my eRegisters table in using an inner join I get 6012 records, I am trying to figure out why this is

Is 5 minutes the best I am going to get or could it be faster

lisa_mc

ASKER

hi bhess

sorry to say I have encountered a problem

I have worked out why the data has to go through the cte is because this only holds the first record of when the student was absent

eg if student missed all of the same class on
missed mon week 1
missed mon week 2
missed mon week 3
missed mon week 4

then by using the cte i only get monday week 1 details as I only want to display the first date when the student was absent the first class of 4 consec classes absent

I cant do an inner join because some students have the same class timetabelled twice in same week so I will still get two records in this case whereas I only want one record

So any suggestions on how to move forward

David Todd

Hi,

I'm curious - what does the execution plan look like? Does it suggest any missing indexes?

HTH
David

8080_Diver

Something else to remember is that virtually any time you see the word NOT in a query, you can bet that there will be Table Scans involved. Since there are 2 million rows involved and you are, apparently, doing several self joins , both XX and yy are liable to be doing multiple Table Scans on the CTE results.

Also, unless the eRegisters_allstudents_tb already has an index that consists of date, start_time, and Stage_code, the results of the CTE are likely to be sorted prior to being used and, lacking that index, you will be doing a Table Scan on the eRegisters_allstudents_tb table.

Mark Wills

Well, having a compond CTE like that where XX is selecting from CTE and YY is slecting from XX and then CTE is joining to YY it will be fairly intense.

Can have a look, but first have to get my head around what you are actually doing.

despite the 14 minutes, are you getting the correct results ?

lisa_mc

ASKER

Hi guys thanks for the replies much appreciated

dtodd - I dont have any indexes - will the execution plan still show any indexes missing

8080 Diver - eRegisters_allstudents_tb doesn't have any indexes. I created this table by linking all the db tables together to get all the information I need. The original tables that are linked cannot be changed or modified in any way

mark willis - Remeber the question you helped me with before to find 3 weeks absent you had to check each day absent. This question is very similar only we are now checking a particular class 2 weeks in a row, row partion is used to sort the information in class order. If a student is off 2 classes in a row then they would be shown in my report. Yeah 14 minutes but I am getting the correct results I just would like to speed up the execution time

example 1

tues 2-5 computing week 1 - absent
tues 2-5 computing week 2 - absent
tues 2-5 computing week 3 - absent

the student would be displayed

thurs 1-3 databases week 1 absent
thurs 1-3 databases week 2 present
thurs 1-3 databases week 3 absent

the student would not be displayed as there are not off 2 classes in a row

Also mark is there any chance that you could come back to my other question students absent >=3 weeks but < 4 weeks as I really need your help

thanks very much

Brendt Hess

A couple of questions:

1) is eRegisters_allstudents_tb a physical table, a view, or something else
a) if it is a physical table, please add an index on stage_code, register_id, register_group, student_id, date, start_time.

2) How long does the initial query (the one referenced as 'cte' take to run as a standalone?

I'm glad to see the high degree of speedup from placing the execution of cte into a temp table. Did you think to index the temp table after creating it? Looking, you would need an index on register_id, register_group, student_id, rownum, absence_code (in that order). You can also include stage_code in the index as well - probably as the first field if you decide to include it. You would also want to insert the output from the query into the temp table in order for ease of searching. That way, a SQL Statement like...
SELECT ... JOIN
a.register_id=b.register_id
and a.register_group=b.register_group
and a.student_id=b.student_id
and a.rownum = b.rownum - 1
and a.absence_code=b.absence_code

Will automatically associate record 1 with record 2, and not record 4 (because the rownum values don't match correctly). Start thinking in this pattern, and you'll get away from a lot of false matches in your working set - index

Optimizing large queries that resemble cross-joins is a matter of thinking from the result set you want back to the original data most efficiently. If you want row 2 for a student to always be matched only with row 1 or row 3, then make your join do that explicitly and save a lot of extra calculation time.

lisa_mc

ASKER

Hi bhess1

1) I use a stored procedure to link all the tables. I pick out all the data I need from each table and store this information in the eRegisters_allStudents_tb.
Would I put the index in this stored procedure when I am creating this table

ie
select a.name, b.name2

from table a
inner join table1 b on
a.id = b.id

where a.add = 1

-- is this where I put the index on my eRegisters_allStudents table ??

2) the query takes 1min 52 sec and there are 1873874 records

I think you took me up wrong I didnt create a temp table in my last post I just used the essential data to go through cte then I linked back to the eRegisters table to get the rest of my data.

This scenario doesn't work as I only want one record for each student

How would I change this code to incorporate a temporary table - I could try that to see if it executes any quicker

Thanks

David Todd

Hi,

>> dtodd - I dont have any indexes - will the execution plan still show any indexes missing

Okay, so the indexes you should start with are:
Clustered index on primary key
Each column in a foreign key constraint - its just a constraint and not an index, unlike the primary key constraint which is an index.
Each column involved in a join clause if not already indexed above.

I forget if 2005 will show missing indexes, but definitely 2008 query plans will.

If no idexes forget about speed as everything becomes table scans vs index seeks. The difference in speed is several orders of magnitude. That is, 15 minutes down to seconds.

HTH
David

8080_Diver

Okay, so the indexes you should start with are:
Clustered index on primary key

lisa_mc,

Since you don't have a Primary Key (remember, David, the table has no indexes ;-), you will need to figure out which column or columns will uniquely identify a given row. Once you do that, you will need to set that as the Primary Key. If all lse fails, add an Identity Column (e.g. ID [int] and then make sure that Identity Specification is set to Yes.

Also, please make a note of the fact that, if you do not have any indexes on a table, then any query on that table will involve a table scan. (That may not be a problem if the table only has a few rows; however, if you have even a few hundred rows, it will rapidly degrade performance.) If you ever want to be able to delete one of a set of duplicate rows, you need to have a unique index on the table. So, as a general rule of thumb, ALWAYS put at least one index on any table and make sure, if you only have one index, that it is a unique index.

David Todd

Hi Diver,

I was assuming that Lisa knew how to select a primary key. If not that should have been the next question/comment.

Yeah, I know what assume means, but you have to assume something ...

Regards
David

lisa_mc

ASKER

hi guys

Im not sure if you seen my last post

>> I use a stored procedure to link all the tables. I pick out all the data I need from each table and store this information in the eRegisters_allStudents_tb.
Would I put the index in this stored procedure when I am creating this table

ie
select a.name, b.name2

from table a
inner join table1 b on
a.id = b.id

where a.add = 1

-- is this where I put the index on my eRegisters_allStudents table <<

maybe someone could help me with this???

I can modify the eRegisters tables or add anything I want I just cant modify or change any of the tables that I link to create the table eRegisters.

The unique identifier for each row would be student_id

Help appreciated guys

8080_Diver

Just out of curiosity, is eRegisters_allStudents a table or a view? (It makes BIG difference, by the way.)

Honestly, I thought that the response that suggested the appropriate indexes answered you basic question. However, now that you bring it up again and I realize that you are trying to put it in the SP, I am wondering whether you may actually be working with views If you aren't, then why are you trying to pout the indexes in the SP? (You create an index on a table one time and then it is there.)

ASKER CERTIFIED SOLUTION

lisa_mc

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

lisa_mc

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

lisa_mc

ASKER

question has been neglected with no answer to any of my problems

eventually worked it out myself so have decided to accept my own answer