Random Samples of a Universe

Nick Wolf
I am pulling a random sample from a list of prescription drugs and I want to make sure that my process is logical/valid.

I have a spreadsheet with 8100 rows of unique NDC numbers. Sometimes, in this list, more than one NDC number corresponds to the same prescription drug. (For example, NDC #s 0093-4127-74 and 0093-4127-73 together count as 1 Penicillin.)

I am thinking of using RAT-STATS to generate 50 numbers in sequential order (Samples) and 10 numbers in Random Order (spares) from the sampling frame 1 (low number) to 8100 (high number).

However, if by chance the sample of 50 includes more than one NDC number for the same drug, can I then replace the duplicate with a spare that corresponds to a different drug, so that I end up with a total of 50 unique drugs sampled from the universe?

This is to avoid having to go through 8100 rows of NDC numbers and remove those that mean the same drug, and THEN run RAT-STATS. I hope I am making sense. Thank you in advance for the help.
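For comparison, the clean-up step the question hopes to avoid can be automated rather than done by hand. The sketch below is only an illustration and rests on an assumption: that "the same drug" means the same labeler and product segments of a hyphenated NDC (the first two of its three parts), as in the 0093-4127-74 / 0093-4127-73 example. The function names `drug_key` and `sample_unique_drugs` are made up for this sketch and are not part of RAT-STATS.

```python
import random

def drug_key(ndc: str) -> str:
    """Return the labeler-product part of a hyphenated NDC (drops the package segment).

    Assumption: two NDCs with the same first two segments are the same drug.
    """
    labeler, product, _package = ndc.split("-")
    return f"{labeler}-{product}"

def sample_unique_drugs(ndcs, n, seed=None):
    """Collapse the NDC list to one representative per drug, then sample n of them."""
    representatives = {}
    for ndc in ndcs:
        # Keep the first NDC seen for each drug key; later ones are ignored.
        representatives.setdefault(drug_key(ndc), ndc)
    rng = random.Random(seed)
    # Sampling without replacement from the de-duplicated list.
    return rng.sample(sorted(representatives.values()), n)
```

If the real definition of "same drug" is broader (for example, the same product from different labelers), `drug_key` would need a lookup table instead of a string split.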

I don't know if there are other correlations between sequential numbers.  Why not generate 60 random numbers or 60 sequential numbers for your samples and spares?
If you are asking for the probability that 50+10 will be enough, that would depend on how often different numbers correspond to the same drug.
It depends on whether you want a random sample of NDC numbers or of drugs. I gather the latter.
Also I am unsure of the difference between 50 numbers in sequential order and 10 numbers in Random Order.
Is there any order to the NDC numbers?  Why are your first 50 numbers not random?
When you draw your sample, can you not check for duplication then (and discard the duplicate)?
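The check-and-discard idea could look roughly like the following. This is a sketch under stated assumptions, not the RAT-STATS procedure: it draws all 60 row numbers at once without replacement, walks them in order of selection, and skips any row whose drug has already been taken, so the spares are used only when a duplicate forces it. The `drug_of` argument is a hypothetical lookup (row number to drug identifier) that the user would supply from the spreadsheet.

```python
import random

def draw_with_spares(frame_size, n_sample, n_spares, drug_of, seed=None):
    """Draw n_sample + n_spares rows without replacement and keep the first
    n_sample rows that belong to distinct drugs.

    drug_of: hypothetical callable mapping a row number to a drug identifier.
    Returns fewer than n_sample rows if the spares run out.
    """
    rng = random.Random(seed)
    draws = rng.sample(range(1, frame_size + 1), n_sample + n_spares)
    chosen, seen = [], set()
    for row in draws:              # order of selection
        drug = drug_of(row)
        if drug in seen:           # duplicate drug: discard this row
            continue
        seen.add(drug)
        chosen.append(row)
        if len(chosen) == n_sample:
            break
    return chosen
```

A call such as `draw_with_spares(8100, 50, 10, drug_of)` returns up to 50 row numbers; if it returns fewer, the 10 spares were not enough to replace the duplicates.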
Nick Wolf


ozo and aburr,
   Thank you for the prompt responses. Sorry for the confusion.

Apparently what the program does is create two separate sheets of numbers. The first "sequential" sheet has 50 "random numbers to be generated in sequential order" from the sampling frame, labeled 1-50, and sorted from smallest number to largest. The second "spares" sheet lists 10 samples, labeled 51-60, in random order (not smallest to largest).

Now I am confused. Here is what the RAT-STATS manual says:

"Enter the quantity of numbers to be generated in:

Sequential Order
The quantity of random numbers to be generated in sequential order should be entered in this
box.  After the quantity indicated has been generated by the program, the random numbers will
be sorted and the output will be arranged in ascending order to assist the user in retrieving the
sample items.  The order of selection will be printed with the random numbers.  If the quantity
desired is zero, then this box can be left blank or a “0” (zero) can be entered.

Spares in Random Order
The quantity of numbers to be generated in random order should be entered in this box.  The
random numbers will be displayed in the order selected.  If the quantity desired is zero, then this
box can be left blank or a “0” (zero) can be entered."
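To make the two output sheets concrete, here is a rough emulation of the layout the manual describes. It is not RAT-STATS's actual generator, and it assumes the numbers are drawn without replacement; the function name `ratstats_style_draw` is invented for this sketch. The first group is sorted ascending but keeps its order-of-selection label; the spares stay in the order they were drawn.

```python
import random

def ratstats_style_draw(low, high, n_sequential, n_spares, seed=None):
    """Return (sequential, spares) as lists of (order_of_selection, value) pairs.

    sequential: the first n_sequential draws, sorted by value (ascending).
    spares:     the remaining draws, left in the order they were selected.
    """
    rng = random.Random(seed)
    draws = rng.sample(range(low, high + 1), n_sequential + n_spares)
    primary = list(enumerate(draws[:n_sequential], start=1))
    sequential = sorted(primary, key=lambda pair: pair[1])
    spares = list(enumerate(draws[n_sequential:], start=n_sequential + 1))
    return sequential, spares
```

Under this reading, both sheets are equally random; sorting the first 50 only makes the rows easier to find in the spreadsheet.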

Nick Wolf


I do want a random sample of drugs, not NDC numbers. Yes, I can check for duplicates and then use one of the "spares" in place of the duplicate sample. I believe I am over-thinking this. I just don't want to unknowingly manipulate the random selection and end up without a truly random sample.

Can you tell I don't do this every day? :-/ Thanks for hanging in there with me...
It seems to me that you are “resampling”. Your initial list is a sample, and you sample the sample. In bootstrapping, for example, it is usual to create a sample with replacement, meaning that the same item (drug or number) could be selected more than once. In that case, having the same drug under different numbers will raise the relative weight of that drug in the results. In resampling with replacement, you need to clean up the list first; you cannot reject any item after sampling.

From the description of the random generator you are using, it would appear to be resampling without replacement, which explains why you need to generate all numbers at once. In that case, it is valid to reject duplicates after sampling (using the “spares”). You are basically selecting different drugs each with the same probability, accepting any representative (any of its numbers).

Whether any inference based on unweighted sampling is valid is another question. In any case, you are assuming the universe (your list) to be a valid representation of the reality you are studying, or you are studying the list itself.
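As a rough check on the weighting question, one can compute how likely a drug is to appear in the initial draw depending on how many NDC rows it has. The sketch below uses the standard formula for sampling without replacement, 1 - C(N-k, n) / C(N, n), where N is the frame size, n the number of rows drawn, and k the number of rows belonging to the drug. It covers only the initial draw of n rows, not the later replacement with spares, and the function name is made up for illustration.

```python
from math import comb

def inclusion_probability(frame_size, sample_size, rows_for_drug):
    """Probability that at least one of a drug's rows is among the rows drawn,
    when sample_size rows are drawn without replacement from frame_size rows."""
    missed = comb(frame_size - rows_for_drug, sample_size)
    total = comb(frame_size, sample_size)
    return 1 - missed / total

p_one_row = inclusion_probability(8100, 50, 1)   # drug listed under one NDC
p_two_rows = inclusion_probability(8100, 50, 2)  # drug listed under two NDCs
```

With these figures, a drug listed under two NDC numbers is roughly twice as likely to be drawn as a drug listed under one, so whether that matters depends on how the results will be weighted and interpreted.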

If I understand what you are saying, I think the simple answer to your question is
