creating a stratified systematic sample of a table

Posted on 2006-05-18
Medium Priority
Last Modified: 2008-03-04
I have a large table from which i write-out a text file so that users of the data can read the text file into the statistical software of their choice (e.g. SAS, SPSS, Stata, etc).  The dataset i have is rather large (3.1 million records; 145 columns), and, in testing, i found the statistical software to be very slow.  I'd like to create a sample dataset that has a thousandth of the records (3.1 K records), but is still representative of the table as a whole.  This would allow users to more easily develop sytax in the language of their stat software and test it in a reasonable amount of time, then when they're confident that it works, run it on the whole dataset.

I want to create a query that orders the table by state, then county, select a random sample from the first 1000 of these ordered records, then take every thousandth record starting from the randomly-selected record.  I want write these data to a text file, either directly or from a temporary table or a view.

Any ideas on how to write the query that selects a random sample of the first 1000 ordered records, then selects every thousandth record after that?

thanks in advance,

Question by:med39
  • 2
  • 2
LVL 50

Expert Comment

ID: 16709717
which version of sql server are you using,...

using the NEWID() function to order the table and taking the top 3100 rows maybe the cheapest "random" method

select top 3100 *
  from yourtable
 order by newid()

if you really want to start at a particular row and then tale every 1000th then we need to determine what we
can use for sequencing...

you can use the bcp utility   to create the actual file from the eventual query..

Author Comment

ID: 16710610

I'm using vesion 8.0 of the Query Analyzer and the Enterprise Manager.

LVL 50

Accepted Solution

Lowfatspread earned 1000 total points
ID: 16717668
if you must go the random start then every 1000th.
then this is probably one of those times i'd actually recommend using a cursor..

so declare a cursor for the select statement

open the cursor

and then use
to get the random start position

and then in a while loop

to get the other 3099ish rows..


declare cursor samplecur for
  select * from yourtable
     order by state,county,...

open samplecur

(Declare @startpos int
 set @startpos = rand(datepart(ms,getdate())* 1000

Fetch Absolute @startpos
  into .... (list of bvariables)

while @@fetch_status = 0
Insert into SampleTable
  values (list of variables)

   fetch relative 1000
     into ... list of variable

then you can use the BCP command to output the sample table to a text file..


Author Comment

ID: 16734542

Thanks, LowFatSpread!  (love the name!)


Featured Post

What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In the first part of this tutorial we will cover the prerequisites for installing SQL Server vNext on Linux.
Ready to get certified? Check out some courses that help you prepare for third-party exams.
Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…
Suggested Courses

809 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question