evanburen
asked on
Variable in TABLESAMPLE Clause
Hi
I am trying to base the size of my table sample based on the number of records in the table. In this example, if my record count is between 2 and 8, then I want my sample size to be 2 records from the all of the records in the table. If the record count is between 9 and 15, then the sample size is 3. The error message I get is "Variables are not allowed in the TABLESAMPLE or REPEATABLE clauses"
CREATE PROCEDURE p_CashSample
@RecordCount Int = NULL OUTPUT
AS
BEGIN
SET @RecordCount = (SELECT COUNT(CashID) FROM PopulationBondsCash)
IF (@RecordCount BETWEEN 2 AND 8)
SELECT TOP 2 * FROM PopulationBondsCash TABLESAMPLE (@RecordCount ROWS)
ELSE IF (@RecordCount BETWEEN 9 AND 15)
SELECT TOP 3 * FROM PopulationBondsCash TABLESAMPLE (@RecordCount ROWS)
ELSE IF (@RecordCount BETWEEN 16 AND 25)
SELECT TOP 5 * FROM PopulationBondsCash TABLESAMPLE (@RecordCount ROWS)
END
GO
I am trying to base the size of my table sample based on the number of records in the table. In this example, if my record count is between 2 and 8, then I want my sample size to be 2 records from the all of the records in the table. If the record count is between 9 and 15, then the sample size is 3. The error message I get is "Variables are not allowed in the TABLESAMPLE or REPEATABLE clauses"
CREATE PROCEDURE p_CashSample
@RecordCount Int = NULL OUTPUT
AS
BEGIN
SET @RecordCount = (SELECT COUNT(CashID) FROM PopulationBondsCash)
IF (@RecordCount BETWEEN 2 AND 8)
SELECT TOP 2 * FROM PopulationBondsCash TABLESAMPLE (@RecordCount ROWS)
ELSE IF (@RecordCount BETWEEN 9 AND 15)
SELECT TOP 3 * FROM PopulationBondsCash TABLESAMPLE (@RecordCount ROWS)
ELSE IF (@RecordCount BETWEEN 16 AND 25)
SELECT TOP 5 * FROM PopulationBondsCash TABLESAMPLE (@RecordCount ROWS)
END
GO
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
I think using the CASE statement with the NEWID() function is easier to read which is the one I'm going with although using dynamic SQL will also work. Using TOP instead of TABLESAMPLE or NEWID() does not give you a random sample of records. It just gives you the first x records.
@acperkins
Ordering by NEWID() is brilliant! Thank you for posting that. I learned something new.
Greg
Ordering by NEWID() is brilliant! Thank you for posting that. I learned something new.
Greg
Greg,
Thanks. As it turns out I learned something too, as I had to read up on TABLESAMPLE clause. I am not sure if it was just me, but I was less than impressed by the results produced by it and in fact SQL Server BOL seemed to agree with me:
TABLESAMPLE SYSTEM returns an approximate percentage of rows and generates a random value for each physical 8-KB page in the table. Based on the random value for a page and the percentage specified in the query, a page is either included in the sample or excluded. Each page that is included returns all rows in the sample result set. For example, if you specify TABLESAMPLE SYSTEM 10 PERCENT, SQL Server returns all the rows on approximately 10 percent of the specified data pages of the table. If the rows are evenly distributed on the pages of the table, and if there is a sufficient number of pages in the table, the number of rows returned should approximate the sample size that is requested. However, because the random value that is generated for each page is independent of the values that are generated for any other page, a larger, or smaller, percentage of pages than have been requested might be returned. The TOP(n) operator can be used to limit the number of rows to a specified maximum.
...
The TABLESAMPLE SYSTEM clause should be used with some caution, and with some understanding of some of the implications of using sampling. For example, a join of two tables is likely to return a match for each row in both tables; however, if TABLESAMPLE SYSTEM is specified for either of the two tables, some rows returned from the unsampled table are unlikely to have a matching row in the sampled table. This behavior might lead you to suspect that a data consistency problem exists in the underlying tables, when the data is actually valid. Similarly, if TABLESAMPLE SYSTEM is specified for both tables that are joined, the perceived problem could be even worse.
Anthony
Thanks. As it turns out I learned something too, as I had to read up on TABLESAMPLE clause. I am not sure if it was just me, but I was less than impressed by the results produced by it and in fact SQL Server BOL seemed to agree with me:
TABLESAMPLE SYSTEM returns an approximate percentage of rows and generates a random value for each physical 8-KB page in the table. Based on the random value for a page and the percentage specified in the query, a page is either included in the sample or excluded. Each page that is included returns all rows in the sample result set. For example, if you specify TABLESAMPLE SYSTEM 10 PERCENT, SQL Server returns all the rows on approximately 10 percent of the specified data pages of the table. If the rows are evenly distributed on the pages of the table, and if there is a sufficient number of pages in the table, the number of rows returned should approximate the sample size that is requested. However, because the random value that is generated for each page is independent of the values that are generated for any other page, a larger, or smaller, percentage of pages than have been requested might be returned. The TOP(n) operator can be used to limit the number of rows to a specified maximum.
...
The TABLESAMPLE SYSTEM clause should be used with some caution, and with some understanding of some of the implications of using sampling. For example, a join of two tables is likely to return a match for each row in both tables; however, if TABLESAMPLE SYSTEM is specified for either of the two tables, some rows returned from the unsampled table are unlikely to have a matching row in the sampled table. This behavior might lead you to suspect that a data consistency problem exists in the underlying tables, when the data is actually valid. Similarly, if TABLESAMPLE SYSTEM is specified for both tables that are joined, the perceived problem could be even worse.
Anthony
I think you are using 2 techniques simultaneously to narrow down the resultset. Since you are already using TOP keyword with values like 2,3,5 etc, you do not need to use TABLESAMPLE keyword as it would also do the same as TOP keyword. So simply remove the TABLESAMPLE clause and your results will still be good as you want them to narrow down to.
Regards
Kamal