Link to home
Start Free TrialLog in
Avatar of Mr_Lee
Mr_Lee

asked on

generates a random integer satisfying a discrete probability distribution

Assume you have access to a function that, in constant time, generates a real number uniformly at random between 0 and 1. Suppose you are given a discrete probability distribution, that is: A set of n real numbers p1,p2,...,pn such that sigma from i=1 to n of p(i) = 1.
Give an algorithm that generates a random integer x between 1 and n satisfying the condition that Pr(x = i) = p(i), and runs in time O(n + log n) (the log n is left in there as a hint). So, for example, if n = 3 with p1 = 0.3, p2 = 0.5, p3 = 0.2 then it should generate 1 with probability 0.3, 2 with probability 0.5, and 3 with probability 0.2.
Avatar of mccarl
mccarl
Flag of Australia image

I assume that this is an academic question, and so I will ask: What in particular are you having difficulty with? You should show that you have at least made an attempt at this and ask a specific question about a specific aspect that you are needing help with. We are not allowed (and it is immoral) to just give you answers to academic questions.
Avatar of Mr_Lee
Mr_Lee

ASKER

logn appears in the running time, so I thought about divide and conquer. But I don't know where to start because I am not familiar with probability distribution. I just want a more understandable hint.
O(n + log n) = O(n) but I presume the hint is meant to suggest a linear scan followed by a binary search.
I for one don't get why the logn is there. You can do it with one search through the P(i)s.
The 'probability distribution' part isn't as confusing as it may sound. The p(1) = 0.3, p(2) = 0.5, p(3) = 0.2 thing is the entire probabilty distribution (or an example of one at least).
If you have the probability distribution in the given example and your random number from 0 to 1 that you generate is .25 what do you change it to? What if it's .35?
That's my hint.
I don't get what the log(n) shall hint to either.
The method that appears obvious and naive consists of a single linear scan ... ??
Better than naive methods are only needed if one wants to generate many numbers according to the given distribution
You build an array in linear time PArray = [0, .3, .8] and then use a binary search on that array for the random number.
That makes sense because if you want to generate k random numbers, then the time complexity is O(n + klogn) instead of O(kn) which is obviosly much better. So that may be what was intended.
ok my solution involves some C
  1  2  3 ...N    	possible outcomes
 P1 P2 P3   PN		discrete probability

let’s consider 

R()  //random generator 0-1

R1=R() 				//returns 0 - 1 	equiprobable
R2=R()+R()<<1			//returns 0 - 3 	equiprobable
R3=R()+R()<<1+R()<<2		//returns 0 - 7 	equiprobable
Rn=R()+R()<<1+R()<<2+...+R()<<n	//returns 0 - 2^n -1 	equiprobable


int RN(void) //this function returns a number form 1 to N equiprobable
{
unsigned int i;

do
 {
  i=R()+R()<<1+R()<<2+...+R()<<n      //we chose n verifying that  N <= 2^n -1   
 }while (i< 1 || i >N)

return i;
}


char Prob[N]{ 3,5,6,0,..,5}; //every number represents “proportionally” P1,P2…PN that 
                             //mean that the sum of them does not have to equal 1…
char Trigger[N]{0,0,0,0..0}; //auxiliary array

//this function returns a number form 1 to N weighted with the discrete prob distribution
int WRN(void) {
int i;

While(1)
 {
  i=RN();
  if (Prob[i-1]>0 ) Trigger[i-1]++;  //increment the i appearance counter
  if (Prob[i-1]>0 && Trigger[i-1]==Prob[i-1])
   {
    Trigger[i]=0; //reset trigger
    return i; //time tu return a value
   }
 } 

Open in new window


It could have some errors, I hope you understand the idea...
No offense, but that's fairly inefficient and only works with certain distributions. What if I said p1=.11111, p2 = .22222. p3 = .33333, and p4 = 33334?
The solution we are discussing is something like
Build an array [0, .11111, .33333, .66666]
Get R()
If R() is between 0 and .11111 then it's p1
If R() is between .11111 and .33333 then it's p2
If R() is between .33333 and .66666 then it's p3
If R() > .66666 then it's p4

This could all be done in C with a couple for loops, but no one really asked for a code solution.
no offense at all Genius.

My mistake then, sorry I thought it  has been asked a code algorithm ..

About inefficiencies well your example talks about a discreet distribution with a 0.0001 resolution…
About your example with p1=.11111, p2 = .22222. p3 = .33333, and p4 = 33334  My algorithm expressly deals with proportional integer values of P; it speeds things up…
About not working for every discreet distribution; I don’t quite get why not?
About your example I think your R() is not what we are looking for… we need a number pump from 1 to n where Pj=P[j]

ASKER CERTIFIED SOLUTION
Avatar of TommySzalapski
TommySzalapski
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
1)For the original p(1), p(2), p(3) = .3, .5, .2 I build the best array that it is
[7,5,8] that are the smallest integer numbers that are proportional to (1 – component of original distribution)

I2)f p(1), p(2), p(3) = .2, .4, . 4
then the best array is [8,6,6] reduced to [4,3,3]

3)if p(1), p(2), p(3) = .333333, .333333, .333333
then the best array is [66666, 66666, 66666] reduced to [1,1,1]

4)if p(1), p(2), p(3) = .9999, .00005, . 00005
then the best array is [.0001, 0.99995, 0.99995] reduced to [1,99995,99995]


these array says that
1)      
I need to count 7 randomly generated  1s to pump a weighted 1
I need to count 5 randomly generated  2s to pump a weighted 2
I need to count 5 randomly generated  3s to pump a weighted 3
2)
I need to count 4 randomly generated  1s to pump a weighted 1
I need to count 3 randomly generated  2s to pump a weighted 2
I need to count 3 randomly generated  3s to pump a weighted 3
3)
I need to count 1 randomly generated  1s to pump a weighted 1
I need to count 1 randomly generated  2s to pump a weighted 2
I need to count 1 randomly generated  3s to pump a weighted 3
4)
I need to count 1 randomly generated  1s to pump a weighted 1
I need to count 99995 randomly generated  2s to pump a weighted 2
I need to count 99995 randomly generated  3s to pump a weighted 3

the size of the array is not a problem
Oh. In that case, I'll take credit for my second solution (which I thought was yours).
Yours now doesn't make sense. How did .2  .4  .4 get to be 4 3 3 (shouldn't it be 2 1 1?)
On 4) how come the smallest number isn't the largest this time? Anyway your time complexity for generation (if it works, which I am not seeing) is then O(n*m) where m is the max proportion thing. Mine is O(lg(n)) or O(1) both of which are much better. Do you see how they work?
.2,.4.4 ->.8,.6,.6->8,6,6->4,3,3
Pn->1-Pn-> the smallest proportional iteger serie...
Before you say that does not make sense, why don't you try to understand what I did? (if you want ;-) )

on 4 the smalles number P(1) is very high thats why the counter is the minimum 1, the counter for the very infrecuent 2 & 3 is very high... what is what you do not understand???

if you do not see how it works please say nothing at all until you can say why it fails or why it works...It's amazing you are giving equations for code that you do not understand...
It is not O(n*m): see the case 4, the pump produce a valid weighted random number on average almost every 3 loop of the WRN function and that it is not O(n*m) as you said

bottom line, take the time to understand what I did and we can keep analyzing the thing later ;-)
on average almost every 3 loop
big O notation is for worst case. I can understand "I need to count 99995 randomly generated  2s to pump a weighted 2".Which means in the worst case you need 99995 random numbers at least. The other approaches all use 1 random number. If you want to discuss this further, perhaps we should open a new thread.
I see how you got the proportions now, but it's still not clear why you are using 1-p in each case instead of p. In the .2 .4 .4 one, the 2 should appear twice as often as the 1 (as should the 3) so why use 4 3 3 and not 2 1 1?
yes, you need to count 99995 randomly generated  2s to pump a weighted 2 cause that represents the probability of pumping out a "2", but that DOES NOT mean that the pump is not pumping out the very frequent (high prob) "1" at the same time.
The algorithm worst scenario is how many iterations maximum takes to get "ANY" number out of the pump... in this case that is not 99995 but 3 (average)instead. I hope you can see it now.

I use 1-P cause the counter aproach requieres it, can't you see that the most frequent number is the one that has the higest prob but that requieres the smallest counter trigger?
If I use P the pump would be holding longer the most frequent number when what I want is just the oposite. that's why I use the complemnet of P
I had this same homework assignment and this thread was very helpful!

What I wrote is effectively the same as what TommySzalapski posted, but generalized to work in cases in which the values are not sorted in order, such as the one in the example. Just posting it in case someone else comes along looking for this same solution.

1) Generate a random number R and build an array a
   such that a[0] = 0 and a[i+1] = (sum from n=0 -> i) of P(j).
2) Find a[i] such that a[i-1]<R<a[i].
3) Return P(i-1).

Open in new window