'randomly' distributing elements into buckets

ugeb
ugeb used Ask the Experts™
on
Hi,

I have the task of placing users of a game into buckets based on their user ID.  The number of buckets is set for a given trial, but could vary from trial to trial.

For example, let's say I have 1000 users of a game, and I would like to place them randomly but evenly into 5 buckets, 1 through 5. However, whenever I compute the bucket for a given user, I always need to place them in the exact same bucket, so I need a deterministic way of 'randomly' placing a user in one of N buckets.  I would like each bucket to be representational of the whole data set.

Another follow on issue is that I have a certain % chance that they won't be placed in ANY bucket, but be left out of the experiment.  For example, maybe there's a 70% chance they will be placed in a bucket, and 30% chance they won't. If they ARE placed in a bucket, then I have to know which bucket.  It has to be repeatable for a given user.

Any suggestions on how I can approach this problem or on solutions already available?

Thanks!





Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
ozo
Most Valuable Expert 2014
Top Expert 2015

Commented:
you might use a hash function

Author

Commented:
that's not really a possibility in my situation
ozo
Most Valuable Expert 2014
Top Expert 2015

Commented:
How does your situation prevent that possibility?
Why Diversity in Tech Matters

Kesha Williams, certified professional and software developer, explores the imbalance of diversity in the world of technology -- especially when it comes to hiring women. She showcases ways she's making a difference through the Colors of STEM program.

Awarded 2010
Top Expert 2013

Commented:
If you want an exact number in each bucket (say 150 in each bucket and 250 not in any), then the best way I can think of is to shuffle the list of users. Then just put the first 150 in the first bucket, etc. If you manually choose a random number to seed the generator with, then it will shuffle it the same way every time or you can maintain the list of shuffled users.

Quick reminder of the O(n) shuffle algorithm:
Shuffle
for i = 2 to 1000
  j = rand(from 1 to i)
  swap user i with user j

If the number of users in each bucket shoud be an approximate, not fixed, number, then the hash codes will work just fine.

Author

Commented:
Hi,

Sorry, I read "hash function" as "hash table".  Yes, a hash function (or algorithm) is precisely what I am searching for.   I know hashes of the sort "a mod n" are often used, but I have two components.  The first component is determining whether something will be placed in the bucket or not, and that is an uneven distribution.  The second component determines which bucket.  Do you know of hash functions that can assign based on uneven distributions?

Tommy:  Unfortunately I don't think that solution will work as I do not have the entire data set available at a given time.  Numbers (buckets) need to be assigned dynamically.

Thanks!
Most Valuable Expert 2014
Top Expert 2015
Commented:
a hash of the sort a mod n would distribute as evenly as a shuffle without having  the entire data set available.
Perhaps that is too even and regular.
If you know at a given time what the eventual size of the  dataset will be you can still get the effect of a shuffle.
Or if the distribution does not have to be perfectly even at a given time, you might use a more random hash function,
For the two components, you could either use two hash functions, or treat non-in-a-bucket as another big bucket that you never put anything in.
use the digits of pi as a random number sequence - you can download it somewhere, (google it) - then 0,1 is bucket 1, 2,3 bucket 2 etc

where you have 6 choices you'd do something different, like ignore any digits not 1-6

since pi is constant, you are producing the same choices each time.

pi was tested for 'non randomness' and I believe that no measure of its digits not appearing random could be shown (although obviously it is not actually random as such)
Awarded 2010
Top Expert 2013

Commented:
You know that the random number generator requires a seed to work and that if you give it the same seed every time, it gives you the same 'random' numbers.
So for each item, take the item's unique ID (or whatever) and seed the random number generator with it. Then it can generate the same random number for the same item each time and it doesn't matter how the data is distributed.

I would just do it with one random number using Ozo's idea of the non-bucket as a sixth bucket but if it makes it easier, you could also get two random numbers. One for in/out of the buckets and one for which bucket if in.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial