generates a random integer satisfying a discrete probability distribution

Assume you have access to a function that, in constant time, generates a real number uniformly at random between 0 and 1. Suppose you are given a discrete probability distribution, that is: A set of n real numbers p1,p2,...,pn such that sigma from i=1 to n of p(i) = 1.
Give an algorithm that generates a random integer x between 1 and n satisfying the condition that Pr(x = i) = p(i), and runs in time O(n + log n) (the log n is left in there as a hint). So, for example, if n = 3 with p1 = 0.3, p2 = 0.5, p3 = 0.2 then it should generate 1 with probability 0.3, 2 with probability 0.5, and 3 with probability 0.2.
Who is Participating?

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

IT Business Systems Analyst / Software DeveloperCommented:
I assume that this is an academic question, and so I will ask: What in particular are you having difficulty with? You should show that you have at least made an attempt at this and ask a specific question about a specific aspect that you are needing help with. We are not allowed (and it is immoral) to just give you answers to academic questions.
Author Commented:
logn appears in the running time, so I thought about divide and conquer. But I don't know where to start because I am not familiar with probability distribution. I just want a more understandable hint.
Commented:
O(n + log n) = O(n) but I presume the hint is meant to suggest a linear scan followed by a binary search.
Commented:
I for one don't get why the logn is there. You can do it with one search through the P(i)s.
The 'probability distribution' part isn't as confusing as it may sound. The p(1) = 0.3, p(2) = 0.5, p(3) = 0.2 thing is the entire probabilty distribution (or an example of one at least).
Commented:
If you have the probability distribution in the given example and your random number from 0 to 1 that you generate is .25 what do you change it to? What if it's .35?
That's my hint.
Commented:
I don't get what the log(n) shall hint to either.
The method that appears obvious and naive consists of a single linear scan ... ??
Better than naive methods are only needed if one wants to generate many numbers according to the given distribution
Commented:
You build an array in linear time PArray = [0, .3, .8] and then use a binary search on that array for the random number.
That makes sense because if you want to generate k random numbers, then the time complexity is O(n + klogn) instead of O(kn) which is obviosly much better. So that may be what was intended.
Commented:
``````ok my solution involves some C
1  2  3 ...N    	possible outcomes
P1 P2 P3   PN		discrete probability

let’s consider

R()  //random generator 0-1

R1=R() 				//returns 0 - 1 	equiprobable
R2=R()+R()<<1			//returns 0 - 3 	equiprobable
R3=R()+R()<<1+R()<<2		//returns 0 - 7 	equiprobable
Rn=R()+R()<<1+R()<<2+...+R()<<n	//returns 0 - 2^n -1 	equiprobable

int RN(void) //this function returns a number form 1 to N equiprobable
{
unsigned int i;

do
{
i=R()+R()<<1+R()<<2+...+R()<<n      //we chose n verifying that  N <= 2^n -1
}while (i< 1 || i >N)

return i;
}

char Prob[N]{ 3,5,6,0,..,5}; //every number represents “proportionally” P1,P2…PN that
//mean that the sum of them does not have to equal 1…
char Trigger[N]{0,0,0,0..0}; //auxiliary array

//this function returns a number form 1 to N weighted with the discrete prob distribution
int WRN(void) {
int i;

While(1)
{
i=RN();
if (Prob[i-1]>0 ) Trigger[i-1]++;  //increment the i appearance counter
if (Prob[i-1]>0 && Trigger[i-1]==Prob[i-1])
{
Trigger[i]=0; //reset trigger
return i; //time tu return a value
}
}
``````

It could have some errors, I hope you understand the idea...
Commented:
No offense, but that's fairly inefficient and only works with certain distributions. What if I said p1=.11111, p2 = .22222. p3 = .33333, and p4 = 33334?
The solution we are discussing is something like
Build an array [0, .11111, .33333, .66666]
Get R()
If R() is between 0 and .11111 then it's p1
If R() is between .11111 and .33333 then it's p2
If R() is between .33333 and .66666 then it's p3
If R() > .66666 then it's p4

This could all be done in C with a couple for loops, but no one really asked for a code solution.
Commented:
no offense at all Genius.

My mistake then, sorry I thought it  has been asked a code algorithm ..

About your example with p1=.11111, p2 = .22222. p3 = .33333, and p4 = 33334  My algorithm expressly deals with proportional integer values of P; it speeds things up…
About not working for every discreet distribution; I don’t quite get why not?
About your example I think your R() is not what we are looking for… we need a number pump from 1 to n where Pj=P[j]

Commented:
The worst problem is that your method for getting a random number from 1 to n isn't efficient and it doesn't give the same weight to all possible values. All you need to do is
(int)(R()*n) + 1
R() gives a random number between 0 <R() < 1. R()*n gives a random real number between 0 < R()*n < n. Casting it to an int (truncation) gives a uniform random integer from 0 to n-1. And adding 1 gives you 1 to n. So you don't need all those definitions and loop (the ... on line 21 implies a loop or a hard coded n)

Actually, now that I reread your method I think I see what you are trying to do. For discrete distributions you've traded time for storage. The storage complexity is proportional to the precision. So a very precise distribution would have a very high memory cost, but the generation of the random numbers is constant.

Tell me if this is what you tried to do. For the original p(1), p(2), p(3) = .3, .5, .2 you build an array
{1,1,1,2,2,2,2,2,3,3} then you generate r = random from 0 to 9 and the random number is array[r]. This is actually a cool approach for any distribution with low precision. What you would really be concerned with would be to express each probability as a fraction and find the least common multiple of all the denominators. Then the complexity of creation is O(n + LCM + generatingLCM) But the complexity of generating a random number is O(1). So for many applications, it's much better than my O(n) for creation and O(lg(n)) for generation.

Of course, the original question asked for only getting one random number so mine is O(n + lg(n)) and yours is O(n + LCM + generatingLCM + 1). But who only gets one number from a distribution?

Experts Exchange Solution brought to you by

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Commented:
1)For the original p(1), p(2), p(3) = .3, .5, .2 I build the best array that it is
[7,5,8] that are the smallest integer numbers that are proportional to (1 – component of original distribution)

I2)f p(1), p(2), p(3) = .2, .4, . 4
then the best array is [8,6,6] reduced to [4,3,3]

3)if p(1), p(2), p(3) = .333333, .333333, .333333
then the best array is [66666, 66666, 66666] reduced to [1,1,1]

4)if p(1), p(2), p(3) = .9999, .00005, . 00005
then the best array is [.0001, 0.99995, 0.99995] reduced to [1,99995,99995]

these array says that
1)
I need to count 7 randomly generated  1s to pump a weighted 1
I need to count 5 randomly generated  2s to pump a weighted 2
I need to count 5 randomly generated  3s to pump a weighted 3
2)
I need to count 4 randomly generated  1s to pump a weighted 1
I need to count 3 randomly generated  2s to pump a weighted 2
I need to count 3 randomly generated  3s to pump a weighted 3
3)
I need to count 1 randomly generated  1s to pump a weighted 1
I need to count 1 randomly generated  2s to pump a weighted 2
I need to count 1 randomly generated  3s to pump a weighted 3
4)
I need to count 1 randomly generated  1s to pump a weighted 1
I need to count 99995 randomly generated  2s to pump a weighted 2
I need to count 99995 randomly generated  3s to pump a weighted 3

the size of the array is not a problem
Commented:
Oh. In that case, I'll take credit for my second solution (which I thought was yours).
Yours now doesn't make sense. How did .2  .4  .4 get to be 4 3 3 (shouldn't it be 2 1 1?)
On 4) how come the smallest number isn't the largest this time? Anyway your time complexity for generation (if it works, which I am not seeing) is then O(n*m) where m is the max proportion thing. Mine is O(lg(n)) or O(1) both of which are much better. Do you see how they work?
Commented:
.2,.4.4 ->.8,.6,.6->8,6,6->4,3,3
Pn->1-Pn-> the smallest proportional iteger serie...
Before you say that does not make sense, why don't you try to understand what I did? (if you want ;-) )

on 4 the smalles number P(1) is very high thats why the counter is the minimum 1, the counter for the very infrecuent 2 & 3 is very high... what is what you do not understand???

if you do not see how it works please say nothing at all until you can say why it fails or why it works...It's amazing you are giving equations for code that you do not understand...
It is not O(n*m): see the case 4, the pump produce a valid weighted random number on average almost every 3 loop of the WRN function and that it is not O(n*m) as you said

bottom line, take the time to understand what I did and we can keep analyzing the thing later ;-)
Commented:
on average almost every 3 loop
big O notation is for worst case. I can understand "I need to count 99995 randomly generated  2s to pump a weighted 2".Which means in the worst case you need 99995 random numbers at least. The other approaches all use 1 random number. If you want to discuss this further, perhaps we should open a new thread.
I see how you got the proportions now, but it's still not clear why you are using 1-p in each case instead of p. In the .2 .4 .4 one, the 2 should appear twice as often as the 1 (as should the 3) so why use 4 3 3 and not 2 1 1?
Commented:
yes, you need to count 99995 randomly generated  2s to pump a weighted 2 cause that represents the probability of pumping out a "2", but that DOES NOT mean that the pump is not pumping out the very frequent (high prob) "1" at the same time.
The algorithm worst scenario is how many iterations maximum takes to get "ANY" number out of the pump... in this case that is not 99995 but 3 (average)instead. I hope you can see it now.

I use 1-P cause the counter aproach requieres it, can't you see that the most frequent number is the one that has the higest prob but that requieres the smallest counter trigger?
If I use P the pump would be holding longer the most frequent number when what I want is just the oposite. that's why I use the complemnet of P
Commented:
``````1) Generate a random number R and build an array a