generates a random integer satisfying a discrete probability distribution

Assume you have access to a function that, in constant time, generates a real number uniformly at random between 0 and 1. Suppose you are given a discrete probability distribution, that is: A set of n real numbers p1,p2,...,pn such that sigma from i=1 to n of p(i) = 1.
Give an algorithm that generates a random integer x between 1 and n satisfying the condition that Pr(x = i) = p(i), and runs in time O(n + log n) (the log n is left in there as a hint). So, for example, if n = 3 with p1 = 0.3, p2 = 0.5, p3 = 0.2 then it should generate 1 with probability 0.3, 2 with probability 0.5, and 3 with probability 0.2.
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

mccarlIT Business Systems Analyst / Software DeveloperCommented:
I assume that this is an academic question, and so I will ask: What in particular are you having difficulty with? You should show that you have at least made an attempt at this and ask a specific question about a specific aspect that you are needing help with. We are not allowed (and it is immoral) to just give you answers to academic questions.
Mr_LeeAuthor Commented:
logn appears in the running time, so I thought about divide and conquer. But I don't know where to start because I am not familiar with probability distribution. I just want a more understandable hint.
O(n + log n) = O(n) but I presume the hint is meant to suggest a linear scan followed by a binary search.
OWASP: Forgery and Phishing

Learn the techniques to avoid forgery and phishing attacks and the types of attacks an application or network may face.

I for one don't get why the logn is there. You can do it with one search through the P(i)s.
The 'probability distribution' part isn't as confusing as it may sound. The p(1) = 0.3, p(2) = 0.5, p(3) = 0.2 thing is the entire probabilty distribution (or an example of one at least).
If you have the probability distribution in the given example and your random number from 0 to 1 that you generate is .25 what do you change it to? What if it's .35?
That's my hint.
I don't get what the log(n) shall hint to either.
The method that appears obvious and naive consists of a single linear scan ... ??
Better than naive methods are only needed if one wants to generate many numbers according to the given distribution
You build an array in linear time PArray = [0, .3, .8] and then use a binary search on that array for the random number.
That makes sense because if you want to generate k random numbers, then the time complexity is O(n + klogn) instead of O(kn) which is obviosly much better. So that may be what was intended.
ok my solution involves some C
  1  2  3 ...N    	possible outcomes
 P1 P2 P3   PN		discrete probability

let’s consider 

R()  //random generator 0-1

R1=R() 				//returns 0 - 1 	equiprobable
R2=R()+R()<<1			//returns 0 - 3 	equiprobable
R3=R()+R()<<1+R()<<2		//returns 0 - 7 	equiprobable
Rn=R()+R()<<1+R()<<2+...+R()<<n	//returns 0 - 2^n -1 	equiprobable

int RN(void) //this function returns a number form 1 to N equiprobable
unsigned int i;

  i=R()+R()<<1+R()<<2+...+R()<<n      //we chose n verifying that  N <= 2^n -1   
 }while (i< 1 || i >N)

return i;

char Prob[N]{ 3,5,6,0,..,5}; //every number represents “proportionally” P1,P2…PN that 
                             //mean that the sum of them does not have to equal 1…
char Trigger[N]{0,0,0,0..0}; //auxiliary array

//this function returns a number form 1 to N weighted with the discrete prob distribution
int WRN(void) {
int i;

  if (Prob[i-1]>0 ) Trigger[i-1]++;  //increment the i appearance counter
  if (Prob[i-1]>0 && Trigger[i-1]==Prob[i-1])
    Trigger[i]=0; //reset trigger
    return i; //time tu return a value

Open in new window

It could have some errors, I hope you understand the idea...
No offense, but that's fairly inefficient and only works with certain distributions. What if I said p1=.11111, p2 = .22222. p3 = .33333, and p4 = 33334?
The solution we are discussing is something like
Build an array [0, .11111, .33333, .66666]
Get R()
If R() is between 0 and .11111 then it's p1
If R() is between .11111 and .33333 then it's p2
If R() is between .33333 and .66666 then it's p3
If R() > .66666 then it's p4

This could all be done in C with a couple for loops, but no one really asked for a code solution.
no offense at all Genius.

My mistake then, sorry I thought it  has been asked a code algorithm ..

About inefficiencies well your example talks about a discreet distribution with a 0.0001 resolution…
About your example with p1=.11111, p2 = .22222. p3 = .33333, and p4 = 33334  My algorithm expressly deals with proportional integer values of P; it speeds things up…
About not working for every discreet distribution; I don’t quite get why not?
About your example I think your R() is not what we are looking for… we need a number pump from 1 to n where Pj=P[j]

The worst problem is that your method for getting a random number from 1 to n isn't efficient and it doesn't give the same weight to all possible values. All you need to do is
(int)(R()*n) + 1
R() gives a random number between 0 <R() < 1. R()*n gives a random real number between 0 < R()*n < n. Casting it to an int (truncation) gives a uniform random integer from 0 to n-1. And adding 1 gives you 1 to n. So you don't need all those definitions and loop (the ... on line 21 implies a loop or a hard coded n)

Actually, now that I reread your method I think I see what you are trying to do. For discrete distributions you've traded time for storage. The storage complexity is proportional to the precision. So a very precise distribution would have a very high memory cost, but the generation of the random numbers is constant.

Tell me if this is what you tried to do. For the original p(1), p(2), p(3) = .3, .5, .2 you build an array
{1,1,1,2,2,2,2,2,3,3} then you generate r = random from 0 to 9 and the random number is array[r]. This is actually a cool approach for any distribution with low precision. What you would really be concerned with would be to express each probability as a fraction and find the least common multiple of all the denominators. Then the complexity of creation is O(n + LCM + generatingLCM) But the complexity of generating a random number is O(1). So for many applications, it's much better than my O(n) for creation and O(lg(n)) for generation.

Of course, the original question asked for only getting one random number so mine is O(n + lg(n)) and yours is O(n + LCM + generatingLCM + 1). But who only gets one number from a distribution?

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
1)For the original p(1), p(2), p(3) = .3, .5, .2 I build the best array that it is
[7,5,8] that are the smallest integer numbers that are proportional to (1 – component of original distribution)

I2)f p(1), p(2), p(3) = .2, .4, . 4
then the best array is [8,6,6] reduced to [4,3,3]

3)if p(1), p(2), p(3) = .333333, .333333, .333333
then the best array is [66666, 66666, 66666] reduced to [1,1,1]

4)if p(1), p(2), p(3) = .9999, .00005, . 00005
then the best array is [.0001, 0.99995, 0.99995] reduced to [1,99995,99995]

these array says that
I need to count 7 randomly generated  1s to pump a weighted 1
I need to count 5 randomly generated  2s to pump a weighted 2
I need to count 5 randomly generated  3s to pump a weighted 3
I need to count 4 randomly generated  1s to pump a weighted 1
I need to count 3 randomly generated  2s to pump a weighted 2
I need to count 3 randomly generated  3s to pump a weighted 3
I need to count 1 randomly generated  1s to pump a weighted 1
I need to count 1 randomly generated  2s to pump a weighted 2
I need to count 1 randomly generated  3s to pump a weighted 3
I need to count 1 randomly generated  1s to pump a weighted 1
I need to count 99995 randomly generated  2s to pump a weighted 2
I need to count 99995 randomly generated  3s to pump a weighted 3

the size of the array is not a problem
Oh. In that case, I'll take credit for my second solution (which I thought was yours).
Yours now doesn't make sense. How did .2  .4  .4 get to be 4 3 3 (shouldn't it be 2 1 1?)
On 4) how come the smallest number isn't the largest this time? Anyway your time complexity for generation (if it works, which I am not seeing) is then O(n*m) where m is the max proportion thing. Mine is O(lg(n)) or O(1) both of which are much better. Do you see how they work?
.2,.4.4 ->.8,.6,.6->8,6,6->4,3,3
Pn->1-Pn-> the smallest proportional iteger serie...
Before you say that does not make sense, why don't you try to understand what I did? (if you want ;-) )

on 4 the smalles number P(1) is very high thats why the counter is the minimum 1, the counter for the very infrecuent 2 & 3 is very high... what is what you do not understand???

if you do not see how it works please say nothing at all until you can say why it fails or why it works...It's amazing you are giving equations for code that you do not understand...
It is not O(n*m): see the case 4, the pump produce a valid weighted random number on average almost every 3 loop of the WRN function and that it is not O(n*m) as you said

bottom line, take the time to understand what I did and we can keep analyzing the thing later ;-)
on average almost every 3 loop
big O notation is for worst case. I can understand "I need to count 99995 randomly generated  2s to pump a weighted 2".Which means in the worst case you need 99995 random numbers at least. The other approaches all use 1 random number. If you want to discuss this further, perhaps we should open a new thread.
I see how you got the proportions now, but it's still not clear why you are using 1-p in each case instead of p. In the .2 .4 .4 one, the 2 should appear twice as often as the 1 (as should the 3) so why use 4 3 3 and not 2 1 1?
yes, you need to count 99995 randomly generated  2s to pump a weighted 2 cause that represents the probability of pumping out a "2", but that DOES NOT mean that the pump is not pumping out the very frequent (high prob) "1" at the same time.
The algorithm worst scenario is how many iterations maximum takes to get "ANY" number out of the pump... in this case that is not 99995 but 3 (average)instead. I hope you can see it now.

I use 1-P cause the counter aproach requieres it, can't you see that the most frequent number is the one that has the higest prob but that requieres the smallest counter trigger?
If I use P the pump would be holding longer the most frequent number when what I want is just the oposite. that's why I use the complemnet of P
I had this same homework assignment and this thread was very helpful!

What I wrote is effectively the same as what TommySzalapski posted, but generalized to work in cases in which the values are not sorted in order, such as the one in the example. Just posting it in case someone else comes along looking for this same solution.

1) Generate a random number R and build an array a
   such that a[0] = 0 and a[i+1] = (sum from n=0 -> i) of P(j).
2) Find a[i] such that a[i-1]<R<a[i].
3) Return P(i-1).

Open in new window

It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Programming Theory

From novice to tech pro — start learning today.