Karl_mark asked:
SQL Server Performance - Physical Table vs Temp Table vs CTE
I run the database systems in a further education college. On August 15th the A Level results are released and I'm rolling out a new system for delivering the results which will be entirely web-based. Given the nature of the release, I'm wondering what will be the best way of delivering them in terms of performance.
Basically, I process the results and they are available online from 8am. At this time I would expect a huge rush as all the students view their results, and the teaching staff review their class results.
The actual process for working out the results is fairly complex; I receive a bunch of flat files from the exam boards which are processed using a stored procedure. These results are then linked to individual course modules, showing the individual module results and an overall grade for each subject.
The question I have is whether it is best to display the results using an on-demand method where the calculation occurs each time someone views them and then filters are applied, or whether to dump the whole lot into two tables (one for students, one for teachers) and then apply a filter at that point. I suspect that for speed and availability using physical tables would be the best option, but does anyone have any particular views on using temp tables or even CTEs to produce the data?
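To make the two options concrete, here's roughly what I mean - all the table and column names below are made up for illustration, and the AVG is just a stand-in for the real (much more complex) grade calculation:

DECLARE @AdmissionNumber int = 12345;   -- example value; in practice a stored procedure parameter

-- Option 1: calculate on demand each time someone views their results
;WITH ModuleMarks AS
(
    SELECT AdmissionNumber, Course, ModuleCode, Mark
    FROM dbo.RawExamResults             -- hypothetical staging table loaded from the board files
)
SELECT AdmissionNumber,
       Course,
       AVG(Mark) AS OverallMark         -- stand-in for the real grade calculation
FROM ModuleMarks
WHERE AdmissionNumber = @AdmissionNumber
GROUP BY AdmissionNumber, Course;

-- Option 2: run the same calculation once, up front, into a physical table
SELECT AdmissionNumber,
       Course,
       AVG(Mark) AS OverallMark
INTO dbo.StudentResults                 -- the web pages would then just filter this table
FROM dbo.RawExamResults
GROUP BY AdmissionNumber, Course;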
ASKER CERTIFIED SOLUTION
for a peak load as you describe: tables
assuming you have time to prepare them in advance of that peak load.
it also reduces the risk of calculation bugs in dynamic code, as you should be able to verify the results - and know with certainty that these are the ones to be displayed.
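for example (table names invented) - build and check the day before, then the peak-time query is trivial:

-- build once on the 14th (illustrative names only)
SELECT AdmissionNumber, Course, OverallGrade
INTO dbo.StudentResults
FROM dbo.CalculatedResults;             -- the output of your existing processing

-- verify before go-live: every enrolled student should have a row
SELECT e.AdmissionNumber, e.Course
FROM dbo.Enrolments e
LEFT JOIN dbo.StudentResults r
       ON r.AdmissionNumber = e.AdmissionNumber
      AND r.Course = e.Course
WHERE r.AdmissionNumber IS NULL;        -- should return zero rows

-- the peak-time query on the 15th is then a simple filtered read
DECLARE @AdmissionNumber int = 12345;   -- example value
SELECT Course, OverallGrade
FROM dbo.StudentResults
WHERE AdmissionNumber = @AdmissionNumber;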
ASKER
Thanks.
The speed is required when they view the results. They all know that the results are released at 8am, so the vast majority will log in to view them within about 20 minutes (I guess).
I'm actually looking at setting up a data warehouse by next year (I've just started in this position recently, so just getting to grips with current systems and when the peak periods are) which obviously involves processing the data up front, so it makes sense to do as much processing as possible.
The actual timeline is:
1. Receive results files on morning of August 14th. These cannot be visible to anyone other than myself at that time.
2. Process results files and then insert raw data into database tables
3. Further processing to break results down into student and class level
4. Further analysis for performance tables etc.
Steps 1 and 2 can be done on the day before release (14th) as only myself and the IT team have access to the database tables. Step 3 used to be done on the morning of the results (15th this year) in a mad panic (see below) and step 4 would follow soon after. We are looking at redirecting the results webpage to a standard "Results will appear here at 8am on August 15th" so that gives me time to process and have it all ready the day before.
In previous years, my understanding is that the results were printed using Access Reports, all done on the morning of the results. This used many reams of paper (I would estimate not far short of six reams!) so I have pushed for an online system, and it is vital that it works!
In terms of the table, I've set up an index on AdmissionNumber (identifies the student) and course. The data is presented as:
1. Students - Course, Overall Grade
2. Staff - Class, Individual exam marks, Overall Grade
The class is not currently held in the table I've created; given that it will be used extensively during querying, I'm assuming it would be a good idea to add this to the index?
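For reference, the index I've set up at the moment is along these lines (table and column names simplified):

-- existing index on the pre-computed results table (names illustrative)
CREATE NONCLUSTERED INDEX IX_Results_Admission_Course
    ON dbo.Results (AdmissionNumber, Course);
-- Class would either be added here or get its own index once it's on the table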
I'm assuming that it would be a good idea to add this to the index?
Yes, but take a look at the execution plan before adding new indexes or adding columns to existing indexes. That will give you a more precise picture of how your queries are actually working.
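For example, once the class column is on the table, run the staff-side query with statistics on and compare before and after any index change (table and column names below are assumed from your description):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- the sort of query the staff pages will run (names assumed)
SELECT AdmissionNumber, Course, OverallGrade
FROM dbo.Results
WHERE Class = 'A2-MATHS-01';            -- example class code

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- In SSMS, also turn on "Include Actual Execution Plan" (Ctrl+M) and check
-- whether this produces an index seek or a scan of dbo.Results.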
sounds like a good candidate for an index.
I'd suggest running the planned SQL queries against the tables and inspecting the execution plans.
Given that these tables won't be subject to frequent updates/inserts, having indexes in place for speed during the peak load will be damn handy. You might apply the non-clustered indexes after the bulk load/update - if time permits.
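something like this, if the names roughly fit your schema (purely illustrative) - load first, index afterwards:

-- 1. bulk load / final processing into the reporting table first
INSERT INTO dbo.StudentResults (AdmissionNumber, Course, Class, OverallGrade)
SELECT AdmissionNumber, Course, Class, OverallGrade
FROM dbo.CalculatedResults;             -- hypothetical source of the crunched numbers

-- 2. then create the read-only indexes for the peak load
CREATE NONCLUSTERED INDEX IX_StudentResults_Admission
    ON dbo.StudentResults (AdmissionNumber, Course)
    INCLUDE (OverallGrade);             -- covering: the student query never touches the base rows

CREATE NONCLUSTERED INDEX IX_StudentResults_Class
    ON dbo.StudentResults (Class)
    INCLUDE (AdmissionNumber, OverallGrade);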
sounds like you'll be under some time pressure - you may want to consider doing a "dummy run" prior to D-day.
not sure if this is going to be relevant, but here it is anyway:
SQL Server Backup, Integrity Check, and Index and Statistics Maintenance
The SQL Server Maintenance Solution comprises scripts for running backups, integrity checks, and index and statistics maintenance on all editions of Microsoft SQL Server 2005, SQL Server 2008, SQL Server 2008 R2, and SQL Server 2012. The solution is based on stored procedures, the sqlcmd utility, and SQL Server Agent jobs. I designed the solution for the most mission-critical environments, and it is used in many organizations around the world. Numerous SQL Server community experts recommend the SQL Server Maintenance Solution, which has been a Gold and Silver winner in the 2012, 2011, and 2010 SQL Server Magazine Awards. The SQL Server Maintenance Solution is free.
You are describing a problem, but you need to be more specific about the level of performance you expect the target platform to provide, in order to determine its limits:
> How many users can connect simultaneously?
> How big do you expect your database to be?
> What kind of physical reads will viewing the results trigger?
> What kind of budget do you have at your disposal to build the architecture?
> Do you require high availability?
A stress test run is definitely a good idea.
Hope this helps.
ASKER
Hi Racimo,
Up to 2000 could (in theory) connect simultaneously. I would expect that up to 1000 will connect within a 20 minute period, with the remainder connecting over the following hours.
As far as architecture goes, that's out of my hands and fixed. I'm not responsible for the hardware aspects, although I may have some input when I start pushing the Data Warehouse solution.
High availability is key, as we do not want the system to go down when students/staff are accessing results. I suspect the risk is greater on the web delivery side rather than the database side, but again that's a problem for the IT team.
The crux of the question (rather than problem to be honest) is to do with the most efficient way to extract the result set from the database and hand it over to the web developers who will be doing the actual presentation of the data. I'm using stored procedures to do this and am working on the performance aspects as I expect a lot of connections requesting relatively small amounts of data, but which requires fairly complex queries to provide.
I've already managed to speed up one part of the process by using a table-valued function and then incorporating the results into a physical table. It's all about crunching the numbers prior to delivery rather than crunching on demand (hoping that makes sense)...
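Roughly what I did, heavily simplified and with made-up names - the table-valued function does the crunching once and the output goes into a permanent table:

-- hypothetical inline table-valued function doing the module-level crunching
CREATE FUNCTION dbo.fnModuleResults (@ExamSeries varchar(10))
RETURNS TABLE
AS
RETURN
(
    SELECT AdmissionNumber, Course, ModuleCode, Mark
    FROM dbo.RawExamResults
    WHERE ExamSeries = @ExamSeries
);
GO

-- run once on the 14th and store the output physically
INSERT INTO dbo.StudentResults (AdmissionNumber, Course, OverallGrade)
SELECT AdmissionNumber, Course, AVG(Mark)   -- stand-in for the real grade calculation
FROM dbo.fnModuleResults('2013-06')         -- example parameter value
GROUP BY AdmissionNumber, Course;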
I would expect that up to 1000 will connect within a 20 minute period, with the remainder connecting over the following hours.
This is not huge. I have seen and worked with systems where 200,000 users would connect in the same timeframe. Please take into consideration the following aspects:
> Make sure that server-side code is correctly cached. For higher concurrency, make sure all the code is wrapped into stored procedures (see the sketch after this list). If you do not have control over this, make sure you have plenty of RAM.
> Make sure you have sufficient and expandable CPU power available. When there are rushes, it is important that you can increase CPU power on demand, or else you will have to migrate your application to a more powerful box.
> Make sure your application has sufficient RAM to process all the data sent to it by SQL Server.
> Make sure your IO subsystem can cope with the peak IO requests during the timeframe. IO contention will create locking that will make users wait.
> Make sure no network congestion prevents the server from sending all the data requested.
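For instance, a minimal sketch (names invented) of wrapping the per-student lookup in a stored procedure, so the plan is compiled once and reused across the rush:

CREATE PROCEDURE dbo.GetStudentResults
    @AdmissionNumber int
AS
BEGIN
    SET NOCOUNT ON;

    -- one small, indexed read per request; the cached plan is reused for every student
    SELECT Course, OverallGrade
    FROM dbo.StudentResults
    WHERE AdmissionNumber = @AdmissionNumber;
END;
GO

-- the web tier then simply calls:
-- EXEC dbo.GetStudentResults @AdmissionNumber = 12345;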
As far as architecture goes, that's out of my hands and fixed.
Infrastructure adequacy is essential to sizing a system. Make sure you know the level of disk striping you can rely on, so that you know how much IO power you have available. Also make sure you know what kind of processing power and clock frequency you have. Performance and hardware simply cannot be dissociated.
I expect a lot of connections requesting relatively small amounts of data, but which requires fairly complex queries to provide.
Then make sure you have higher frequency processors.
It's all about crunching the numbers prior to delivery rather than crunching on demand
Then make sure that you can fit all calculations into specific timeslots.
Hope this helps.
Best would likely be indexed views. That is, pre-compute and physically store all the common expected query results. Then the users, whether 1000 or 5000, would receive almost instant results, with no waiting for any joins, etc.
Indexed views also need to be created WITH SCHEMABINDING.
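A hedged sketch of the pattern (the real view will be more involved, and all the names here are assumed):

-- the view must be schema-bound, use two-part table names, and include COUNT_BIG(*)
-- when it aggregates; Mark is assumed NOT NULL
CREATE VIEW dbo.vStudentOverall
WITH SCHEMABINDING
AS
SELECT AdmissionNumber,
       Course,
       SUM(Mark)    AS TotalMark,
       COUNT_BIG(*) AS ModuleCount
FROM dbo.ModuleResults
GROUP BY AdmissionNumber, Course;
GO

-- the unique clustered index is what physically materialises the view
CREATE UNIQUE CLUSTERED INDEX IX_vStudentOverall
    ON dbo.vStudentOverall (AdmissionNumber, Course);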
If the queries are really complex, then I think that creating new tables with plain "ready" data should be the fastest solution.
ASKER
Hi Racimo,
I think you've misunderstood my question! I'm not involved at all in the server-side architecture or hardware. This is purely an SQL question as to how to best optimise the delivery of data; the actual processing of the delivery and the DBA side are handled elsewhere. I know that 1000 users isn't necessarily a lot (I've also worked on sites with millions of rows of data and thousands of users), but bear in mind that this is an education establishment, so expenditure on high-spec hardware is not the highest priority and we have to work with what we are given!
As far as the queries go, yes they are complex. The nature of the data passed from exam boards is such that it reminds me of the old days of extracting text from a mainframe and trying to manipulate it into a usable table format. Once this is done, there are a multitude of joins to perform along with complex criteria and calculations. I've managed to strip out all of the subqueries and replace them with CTEs, which makes quite a difference, and using physical tables definitely seems to be the way to go. Now it's just a question of trying to use SSRS to develop reports for the staff!
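As an example of the kind of rewrite (heavily simplified, names made up): a correlated subquery per row became a CTE that's joined once.

-- before: correlated subquery evaluated for every result row
SELECT r.AdmissionNumber, r.Course,
       (SELECT AVG(m.Mark)
        FROM dbo.ModuleResults m
        WHERE m.AdmissionNumber = r.AdmissionNumber
          AND m.Course = r.Course) AS AvgMark
FROM dbo.Results r;

-- after: the same calculation expressed once in a CTE and joined in
;WITH CourseAverages AS
(
    SELECT AdmissionNumber, Course, AVG(Mark) AS AvgMark
    FROM dbo.ModuleResults
    GROUP BY AdmissionNumber, Course
)
SELECT r.AdmissionNumber, r.Course, ca.AvgMark
FROM dbo.Results r
LEFT JOIN CourseAverages ca
       ON ca.AdmissionNumber = r.AdmissionNumber
      AND ca.Course = r.Course;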
If you want a fast response when each student tries to view their own results, then do as much of the processing as possible in advance. (And ensure that the data is placed in tables with suitable indexes to support the type of searches that you expect to be carried out - presumably by candidate/course, by course/candidate, and maybe others depending on how people can search it.)
If the time you need it to be fast is when you are preparing the underlying data, so that you can publish the results quickly, you might choose to just defer any calculation and processing until each student requests it. It probably takes longer overall, but allows you to get the results up faster.