Probability problem

This "Probability" problem has stumped me for several weeks:

If I randomly select a 20 character password from a set of 256 characters composed of:
   4 sets of lower case letters (104),
   3 sets of Upper case letters  (78),
   7 of each number in 0..5      (42) and
   8 of each number in 6..9      (32)
                        total   (256)

What are the chances a second random 20 character selection will match the first?

  P(M,N) means the probability of M items taken N at a time, order counts.
  C(M,N) means the # of combinations of M items taken N at a time (order is n/a)

Many thanks . . .

- Ed
Ed CovneyRetiredAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

My first guess is zero.

My first estimate is odds of 1 in

(4!)^26 * (3!)^26 * (7!)^6 * (8!)^4 * (236!)

Open in new window

1 in 256^20

i.e. 1 in 1.4615016373309029182036848327163e+48
or approximately 1 in 1,461,501,637,330,900,000,000,000,000,000,000,000,000,000,000,000
4 lower case letters is 26^4 not 104, right?

   4 sets of lower case letters 26^4,
   3 sets of Upper case letters  26^3,
   7 of each number in 0..5    6^7
   8 of each number in 6..9    4^8

So multiply all those together to get the total number of possible sets of characters.

Then you can interleave them many ways.

Total possible ordering of the characters is 20! but you can divide that by 4!3!7!8!.

So 20! *26^4*26^7*6^7*4^8

Note that you could simplify the two 26^ terms to 26^11, but it's easier to see what went where above.
Cloud Class® Course: Microsoft Azure 2017

Azure has a changed a lot since it was originally introduce by adding new services and features. Do you know everything you need to about Azure? This course will teach you about the Azure App Service, monitoring and application insights, DevOps, and Team Services.

Note that that is the total possible combinations, so flip the fraction to get the odds of a match with one try.
Sorry, I didn't understand the question. Please ignore my answer.
Well, my first estimate was way off, but I think I know why.

256!/236! is the number of passwords for 256 unique characters.
Next you have to calculate the probable number of repeated characters and the probable number of repetitions for each.

Commuting time has arrived.  Hope to do more before ozo makes me irrelevant.
Oh, I understand the problem more fully now. My answer was for if you had a password composed of 4 lowercase characters, 3 uppercase, etc.

   4 sets of lower case letters (104),
   3 sets of Upper case letters  (78),
   7 of each number in 0..5      (42)
   8 of each number in 6..9      (32)

So your character set to choose from is aaaaAAAbbbbBBB etc.

So there are C(256,20) different collections of characters you could use but some are repeats since 'g' is the same as 'g' regardless of order. So d-glitch's first suggestion with a 20! term added to the denominator is the total possible sets of characters you could use.

Making order matter is tricky though for the 20 since you don't know how many repeats there are.
Your problem (Case 1) is tricky, and an exact solution will be hard to find.

Try simplifying it a little.
For Case 2, use four each of [a..z] [A..Z] [0..9]
So we have the same 62 symbols, 248 elements to choose from.
This is very close to Case 1, but still hard to solve.

So for Case 3, use the same 62 symbols, but unlimited repetition.
Now it is easy to calculate the exact number of possible passwords.
     (62)^20 = 7.04 x 10^35

But some of these are not possible in Case 2.  
The most common violation (by far) will be 5 of a kind.
An upper bound on the number of ways to draw a set of 5 can be calculated as well.
   62 ways to pick the symbol.
   C(20, 5) = 15504 ways to distribute them among 20 slots.
   (61)^15 = 6.02 x 10^28 ways to fill the rest of the slots.

Multiplying these factors together gives 5.79 x 10^32, which I believe substantially over counts all possible violations.

     7.0442 x 10^35  -  5.79 x 10^32  = 7.0384 x 10^35  which is on the order of a 0.1% correction.

Since your alphabet is so much larger than your password, you can justify using (62)^20 as well.

You could also calculate the most common violation for Case 1, which would be 4 of the UC letters.  
But I will leave that  as an exercise to the reader.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
I solved it for a smaller problem [aaaabcc] with a state diagram. I was hoping something would pop. It didn't.

It can be solved like that, but it would take a long time to do on paper.

I don't know if there is a closed form solution.
Ed CovneyRetiredAuthor Commented:
Tommy -  
In reality . . .
Enumerating possibilities > 2^32  (4,294,967,296) is a data base nightmare and  
Enumerating possibilities > 2^64  is a physical impossibility given CPU and storage (DB) limitations.
In theory, it IS enumerable which means the answer is (a very large) integer number.

Your  ".. exact solution will be hard to find" may be correct but I think you're bordering on near perfect in both your analysis and the answer that you provided 25 hours ago.

I approached from a boundaries point of view: Imagine a set of 62 different characters and no mater what character you draw first, you put it back and shuffle again. So your choice for every draw is from 62 elements.  Hence exactly 62^20 possibilities. Of course my problem set isn't that "rich", so  (the real answer)  X must be < 62^20, the upper boundary.

For a lower boundary, allow the set of 62 to deplete itself as you draw and now there are exactly P(62,20) possibilities. My problem set is a little richer than this one so this time X is > P(62,20).

   62^20  = 7.04 x 10^35 = 704423425546998022968330264616370176  
 P(62,20) = 2.24 x 10^34 =  22398459951704670955315895500800000    and
 P(62,20) <   X   <   62^20

Had I realized the bounds were this close 3 days ago, I'd have never bother posting the question. Your correction to the upper limit is genius (so I'm glad I did post it) although I don't like the 61^15 (?) thingy.  In any event, your answer is in "range" and very reasonable.

Thank you.

- Ed
Thanks for the points and your vote of confidence.
I don't like the 61^15 factor very much either.

I have since realized one way to approach the problem is with Monte Carlo methods.
Generate random strings using the Case 3 (unlimited repetition) alphabet and keep track of the stats on the violations.  
This would work with your original (or any arbitrary) alphabet restrictions.

Python can run 1 or 10 or 100 million cases pretty quickly.
This will give a lot better precision that my 0.1% estimate of invalid strings.

This continues to be an interesting problem.
I will keep you posted.
Ed CovneyRetiredAuthor Commented:
Estimate instead of enumerate?
A very clever concept.
Thanks again d-g.

( I tend to get stuck on enumerations too large to be . . enumerated. )
Python Output -- 1 Million Random StringsHello Ed,

I have revised this post.  My first attempt was with 19-character strings instead of 20.  The differences are small but noticeable.  Embarrassing.

Looking at the last line in the screen shot:
   31383  is the number of strings with not even one pair
 743661  have one or more pairs but nothing better
 208137  have one or more sets of three but nothing better
   15665  have one or more sets of four but nothing better [Some of these will be invalid.]
   and so on . . . .

Back to your original question:

If you use a 62 symbol alphabet with unlimited repetition, there are
     (62)^20 = 7.04 x 10^35 possible strings.

If you allow only 3 copies of UC letters, and 4 copies of LC letters, the number of invalid strings in 1M random samples will be:
                                      (78/256)*15665   =  4773   <= Sets of 4
    ((78+104)/256)*(832 + 20 + 1)  =  (182/256)*853    =   606   <= Sets of 5, 6, and 7
                                                TOTAL  =  5379
    99.462% of the strings are valid. 

Open in new window

The correction due to the limits on the number of copies of the integers is less than 1 ppm.

I will post my Python code if any one is interested.

And I did insert one non-random string with 20 of the same character to test the distribution calculations.
Ed CovneyRetiredAuthor Commented:

I remember doing Monte Carlo estimates of Pi 25 years ago - DOS, turbo pascal. If I recall, it produced an excellent estimation of Pi but was able to determine the quality of the random number generator, in sort of a back-handed manner.

Regarding your python code - I'd love to play with it although I don't (currently) have any way to run it. Hopefully by days end, I will. What version do you use? If otherwise allowed by EE, can you sign i.e. post credits "By X, @y"  (email) . If I pass it to my brother, I don't want to take credit.

I do have a very nice "test" PC. Lots of RAM and easily configurable RAM Drive.
I am running Python 3.3.0 on a Win7 64-bit machine.  
The RandString program is as vanilla as you can get, and should run fine on any installation.

The current Python random number generator is based on the Mersenne Twister.
I don't even know what that means, but I believe it is world-class and state-of-the-art.

I have cleaned up the program, taken out the tedious poker hand analysis, and added some comments.
It runs 1M strings in 138 seconds.
I have another correction to make.

I still trust my Python program code, but my last manual calculations were off.

I should have been considering fractions of the 62 char alphabet, not the 256.
                                      (26/62)*15665  =  6569   <= Sets of 4
    ((78+104)/256)*(832 + 20 + 1)  =  (52/62)*853    =   715   <= Sets of 5, 6, and 7
                                              TOTAL  =  7284
    99.271% of the strings are valid.  

Open in new window

That is recalculating from the old data.

I also changed the program to validate strings directly.  The results agree to 4 decimal places.Screen
Ed CovneyRetiredAuthor Commented:
d-glitch  -
This part I understood: ( "Looking at the last line in the screen shot: ")
   31383  is the number of strings with not even one pair
 743661  have one or more pairs but nothing better
 208137  have one or more sets of three but nothing better
   15665  have one or more sets of four but nothing better [Some of these will be invalid.]

Does the 0 before 31383 mean every random assembly had at least one pair?  And what is the 21st entry, a "1"?
My remaining concern is that our generated "random" results look suspiciously alike, perhaps they're supposed to?

In the second script you provided, could you explain the 2nd and 3rd numbers of "string" and also
the "Values", "Sorted" and "Pattern" arrays? Again, I got very similar results.

You can't imagine how much I appreciate all of this - thank you so much - again!!
The 0 before the 31383 is a place holder.
Python lists start at 0, but I wanted the index to refer to set size.
I wanted index values from 1 to 20, but I had to include the zero as well.

pattern[ 0, 14, 1, 0, 1, 0, 0, 0, . . . ]  refers to a particular string.
       pattern[  0] = 0   means nothing
       pattern[ 1] = 14   means 14 unmatched characters or singletons
       pattern[ 2] = 1   means 1 pair
       pattern[ 3] = 0   means no sets of 3
       pattern[ 4] = 1   means 1 set of 4   ==>   14*1 + 1*2 + 0*3 + 1*4 = 20   So the rest of the terms are all 0's

  dist[ 0,  31683,  743661,  208137, . . .]   refers to all the strings generated and analyzed so far.
          dist[ 0] = 0   means nothing
          dist[ 1]  =   31683   means that many strings have 20 singletons, not even 1 pair.
          dist[ 2]  = 743661   means that many strings have one or more pairs, but no sets of 3 or greater.
          dist[20] = 1              means that one string had a set of 20,  but I forced that as a test.  It is noted in the code.

If you add up ( 31683 + 743661 + 208137 + . . .   you will get the number of strings.
I actually ran 100M strings, and got exactly one set of 8 (none larger), and no occurrences of zero singletons.
If you add up the distribution numbers  31683 + 743661 + 208137 + . . .  you will get the number of strings (1M).

Values  [ 47, 13, 13, 31, 5, 26,  . . .  ]    are the twenty random numbers in the order they were generated.
Sorted  [ 5, 6, 1, 13, 13, 13, 13, . . . ]    are the same numbers sorted.  You can check.  You have to sort to do the analysis.

Pattern [ 0, 14,  1,  0,  1,  0,  0,  . .  . ]   describes the sorted list:
        14 singletons     one pair ( 31, 31)     no sets of 3     one set of 4 ( 13, 13, 13, 13)

String   1000247   992999     99.2753
     1000247  is the number of strings generated and analyzed so far.
       992999  is the number that are valid.
             99.2753  is the percentage of strings that are valid.

I wanted the program print out intermediate results every 100K strings, but it set it up so that it would print out the next bad string after 100K, 200K . . .   In this case, it overran the 1M mark by 247.

We are using random numbers to simulate a complex but well defined problem.
We are using an infinite depth alphabet to model a finite one.
The result:  that 99.275% of the strings in a 1M random sample could come from the limited alphabet is persuasive.
I believe it is the correct answer to four or five significant figures.
Ed CovneyRetiredAuthor Commented:
Adding your descriptions as comments to your code.  

Thanks again.

- Ed
>>  Adding your descriptions as comments to your code.  
Thanks.  Data description is really a critical part of the documentation.

Sounds like you have the programs up an running, which is great.  
Let me know if you have any more interesting problems.
Ed CovneyRetiredAuthor Commented:
If I have a shuffled deck of 256 (numbers) cards, how can I test if they are in fact "randomly" ordered?

I'll pose this as a new question for you experts this aft.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Math / Science

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.