# How many different colors in sample

Dear experts,

first of all I should point out, that this is not homework. I rephrased a problem I have at work in classical analogies of probability theory.

Suppose you have an urn with balls of different colors. There are n colors and k balls of each color. You draw m (m << n) balls. Lets say with replacement, but it is not really important.

Question: What is the probability that the sample contains balls of exactly c different colors (not matter which ones).

Given:
n = number of colors
k = number of balls for each color
m = number of balls drawn
c = number of different colors in the sample

Probability of c given n,k and m = P(c | n,k,m)

I am really looking for a closed form,like a formula/method which takes n,k,m and c.

I am not interested in specific examples like

Math / Science

Last Comment
derduff

8/22/2022 - Mon
Dave Baldwin

If there was a "random" distribution, you might be able to get a formula.  But with a real urn full of balls, you can't reach past the balls on top so it is not random but is dependent on the order that the urn is filled.
derduff

The urn is merely a theoretical construct. The probability to be drawn is the same for each ball in the urn: 1/(n*k)
TommySzalapski

If it is with replacement, or n is so much greater than m that it approximates replacement, then n and k don't matter. Only the number of different colors (n/k) which I shall refer to as d.

So you have a probability of 1/d that any selected ball is of a certain color.

I'm going to post some initial observations and then work on it some more later.

If m = c = d (you want 1 of each color), then
p = 1 * (d-1)/d * (d-2)/d * ... * 2/d * 1/d= d!/d^d

If m = c < d (you want 1 per color, but not all colors will be there)
p = 1*(d-1)/d * (d-m+1)/d = d!/((m+1)! * d^m)
Experts Exchange is like having an extremely knowledgeable team sitting and waiting for your call. Couldn't do my job half as well as I do without it!
James Murphy
TommySzalapski

Oops. For m=c<d it's really d!/( (d-m)! * d^m ) and if you set d=m, you get the same equation as the one for m=c=d
derduff

Hi Tommy, thank you for working on this.  Just a note: n already denominates the number of colors. The total number of ball is n*k.

TommySzalapski

Oops. Sorry. I'm so used to n meaning the total number of balls that I read it that way. Hah!

So you also look at this as a state diagram (the number by the arrow is the probability of moving to that state from the current one).

with any generic portion in this form  (where x is the number of different colors so far)
And you want the probability of m transitions landing you on the C state.
Get an unlimited membership to EE for less than \$4 a week.
Unlimited question asking, solutions, articles and more.
derduff

Hmm I see how you think. I see one problem though. There is a large number of ways to achieve m transitions. For c=4 m=8 you could have

1=>1=>1=>1=>1=>2=>3=>4

1=>1=>1=>1=>2=>2=>3=>4

1=>2=>2=>2=>2=>3=>4=>4

These are basically integer partitions (the number of ways you can write an positive integer n as the sum of k other positive integers). The number of this partitions can be calculated with the partition function.

http://en.wikipedia.org/wiki/Partition_(number_theory)#Partition_function

At this point I have to admit that I already have a solution, which goes in this direction but requires the explicit enumeration of the partitions. It works, however it is insanely complicated. I didn't want to bring it up before, in order not to steer the discussion to much in that path.
TommySzalapski

Yes, but the beauty of the state diagram is that you don't need to worry about what order the states happen in. You just multiply out the probabilities for each step and after m matrix multiplications you get the probability of being in each state. To get it into a closed form solution would take some clever algebra (which I will have time for tomorrow I hope). If you actually have numbers for m, c, and n, then it's just plug and chug into MATLAB or an Excel sheet.
derduff

No, the order is important, because the probabilities change, depending how long one remains in one state.

1=>1=>1=>1=>1=>2=>3=>4

= (1/n)^5 * (n-1)/n *(n-2)/n * (n-3)/n

1=>1=>1=>1=>2=>2=>3=>4

= (1/n)^4 * (n-1)/n * (2/n)^1 * (n-2)/n * (n-3)/n

1=>2=>2=>2=>2=>3=>4=>4

= (1/n)^1 * (n-1)/n * (2/n)^3 *  (n-2)/n * (n-3)/n * (4/n)^1

Same path length , different results.
All of life is about relationships, and EE has made a viirtual community a real community. It lifts everyone's boat
William Peck
TommySzalapski

Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
derduff

You are right, as far I see it! I didn't look at that way. Please let me check it more thoroughly today, before I finally accept it. But it is looking very good! Thanks!
derduff

I accept the solution and I thank you for removing this giant knot in my brain.

P(c,m|n) =  (n-c+1)/n * P(c-1,m-1|n) + c/n *  P(c,m-1|n)

P(1,1|n) = 1

for c <= m <= n

So recursion is key :D. I have attached a simplified version of the table. Yours is very nice and sophisticated, but in my version it easier to track the formula ;). You just enter n at the top, and you can lookup the value for m and c at the respective position.

Thanks again!

simpletablecolors.xlsx
derduff