• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 400
  • Last Modified:

creating a stratified systematic sample of a table

I have a large table from which i write-out a text file so that users of the data can read the text file into the statistical software of their choice (e.g. SAS, SPSS, Stata, etc).  The dataset i have is rather large (3.1 million records; 145 columns), and, in testing, i found the statistical software to be very slow.  I'd like to create a sample dataset that has a thousandth of the records (3.1 K records), but is still representative of the table as a whole.  This would allow users to more easily develop sytax in the language of their stat software and test it in a reasonable amount of time, then when they're confident that it works, run it on the whole dataset.

I want to create a query that orders the table by state, then county, select a random sample from the first 1000 of these ordered records, then take every thousandth record starting from the randomly-selected record.  I want write these data to a text file, either directly or from a temporary table or a view.

Any ideas on how to write the query that selects a random sample of the first 1000 ordered records, then selects every thousandth record after that?

thanks in advance,

-med29
0
med39
Asked:
med39
  • 2
  • 2
1 Solution
 
LowfatspreadCommented:
which version of sql server are you using,...


using the NEWID() function to order the table and taking the top 3100 rows maybe the cheapest "random" method

select top 3100 *
  from yourtable
 order by newid()


if you really want to start at a particular row and then tale every 1000th then we need to determine what we
can use for sequencing...


you can use the bcp utility   to create the actual file from the eventual query..
0
 
med39Author Commented:
Lowfatspread,

I'm using vesion 8.0 of the Query Analyzer and the Enterprise Manager.

-med39
0
 
LowfatspreadCommented:
if you must go the random start then every 1000th.
then this is probably one of those times i'd actually recommend using a cursor..

so declare a cursor for the select statement

open the cursor

and then use
 
FETCH ABSOLUTE number
to get the random start position

and then in a while loop

FETCH RELATIVE 1000
 
to get the other 3099ish rows..

hth

declare cursor samplecur for
  select * from yourtable
     order by state,county,...

open samplecur

(Declare @startpos int
 set @startpos = rand(datepart(ms,getdate())* 1000

Fetch Absolute @startpos
  into .... (list of bvariables)


while @@fetch_status = 0
begin
Insert into SampleTable
  values (list of variables)

   fetch relative 1000
     into ... list of variable
End

then you can use the BCP command to output the sample table to a text file..



0
 
med39Author Commented:
Perfect!

Thanks, LowFatSpread!  (love the name!)

-med39
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

  • 2
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now