My first estimate is odds of 1 in

```
256!
--------------------------------------------
(4!)^26 * (3!)^26 * (7!)^6 * (8!)^4 * (236!)
```

Solved

Posted on 2014-08-08

This "Probability" problem has stumped me for several weeks:

If I randomly select a 20 character password from a set of 256 characters composed of:

4 sets of lower case letters (104),

3 sets of Upper case letters (78),

7 of each number in 0..5 (42) and

8 of each number in 6..9 (32)

total (256)

What are the chances a second random 20 character selection will match the first?

Notes:

P(M,N) means the probability of M items taken N at a time, order counts.

C(M,N) means the # of combinations of M items taken N at a time (order is n/a)

Many thanks . . .

- Ed

If I randomly select a 20 character password from a set of 256 characters composed of:

4 sets of lower case letters (104),

3 sets of Upper case letters (78),

7 of each number in 0..5 (42) and

8 of each number in 6..9 (32)

total (256)

What are the chances a second random 20 character selection will match the first?

Notes:

P(M,N) means the probability of M items taken N at a time, order counts.

C(M,N) means the # of combinations of M items taken N at a time (order is n/a)

Many thanks . . .

- Ed

21 Comments

My first estimate is odds of 1 in

```
256!
--------------------------------------------
(4!)^26 * (3!)^26 * (7!)^6 * (8!)^4 * (236!)
```

i.e. 1 in 1.461501637330902918203684

or approximately 1 in 1,461,501,637,330,900,000,

4 sets of lower case letters 26^4,

3 sets of Upper case letters 26^3,

7 of each number in 0..5 6^7

8 of each number in 6..9 4^8

So multiply all those together to get the total number of possible sets of characters.

Then you can interleave them many ways.

Total possible ordering of the characters is 20! but you can divide that by 4!3!7!8!.

So 20! *26^4*26^7*6^7*4^8

--------------------------

4!*3!*7!*8!

Note that you could simplify the two 26^ terms to 26^11, but it's easier to see what went where above.

256!/236! is the number of passwords for 256 unique characters.

Next you have to calculate the probable number of repeated characters and the probable number of repetitions for each.

Commuting time has arrived. Hope to do more before ozo makes me irrelevant.

Probability-for-ExEx.PNG

4 sets of lower case letters (104),

3 sets of Upper case letters (78),

7 of each number in 0..5 (42)

8 of each number in 6..9 (32)

So your character set to choose from is aaaaAAAbbbbBBB etc.

So there are C(256,20) different collections of characters you could use but some are repeats since 'g' is the same as 'g' regardless of order. So d-glitch's first suggestion with a 20! term added to the denominator is the total possible sets of characters you could use.

Making order matter is tricky though for the 20 since you don't know how many repeats there are.

Try simplifying it a little.

For Case 2, use four each of [a..z] [A..Z] [0..9]

So we have the same 62 symbols, 248 elements to choose from.

This is very close to Case 1, but still hard to solve.

So for Case 3, use the same 62 symbols, but unlimited repetition.

Now it is easy to calculate the exact number of possible passwords.

(62)^20 = 7.04 x 10^35

But some of these are not possible in Case 2.

The most common violation (by far) will be 5 of a kind.

An upper bound on the number of ways to draw a set of 5 can be calculated as well.

62 ways to pick the symbol.

C(20, 5) = 15504 ways to distribute them among 20 slots.

(61)^15 = 6.02 x 10^28 ways to fill the rest of the slots.

Multiplying these factors together gives 5.79 x 10^32, which I believe substantially over counts all possible violations.

7.0442 x 10^35 - 5.79 x 10^32 = 7.0384 x 10^35 which is on the order of a 0.1% correction.

Since your alphabet is so much larger than your password, you can justify using (62)^20 as well.

You could also calculate the most common violation for Case 1, which would be 4 of the UC letters.

But I will leave that as an exercise to the reader.

It can be solved like that, but it would take a long time to do on paper.

I don't know if there is a closed form solution.

In reality . . .

Enumerating possibilities > 2^32 (4,294,967,296) is a data base nightmare and

Enumerating possibilities > 2^64 is a physical impossibility given CPU and storage (DB) limitations.

In theory, it IS enumerable which means the answer is (a very large) integer number.

d-glitch:

Your ".. exact solution will be hard to find" may be correct but I think you're bordering on near perfect in both your analysis and the answer that you provided 25 hours ago.

I approached from a boundaries point of view: Imagine a set of 62 different characters and no mater what character you draw first, you put it back and shuffle again. So your choice for every draw is from 62 elements. Hence exactly 62^20 possibilities. Of course my problem set isn't that "rich", so (the real answer) X must be < 62^20, the upper boundary.

For a lower boundary, allow the set of 62 to deplete itself as you draw and now there are exactly P(62,20) possibilities. My problem set is a little richer than this one so this time X is > P(62,20).

62^20 = 7.04 x 10^35 = 70442342554699802296833026

P(62,20) = 2.24 x 10^34 = 22398459951704670955315895

P(62,20) < X < 62^20

Had I realized the bounds were this close 3 days ago, I'd have never bother posting the question. Your correction to the upper limit is genius (so I'm glad I did post it) although I don't like the 61^15 (?) thingy. In any event, your answer is in "range" and very reasonable.

Thank you.

- Ed

I don't like the 61^15 factor very much either.

I have since realized one way to approach the problem is with Monte Carlo methods.

Generate random strings using the Case 3 (unlimited repetition) alphabet and keep track of the stats on the violations.

This would work with your original (or any arbitrary) alphabet restrictions.

Python can run 1 or 10 or 100 million cases pretty quickly.

This will give a lot better precision that my 0.1% estimate of invalid strings.

This continues to be an interesting problem.

I will keep you posted.

A very clever concept.

Thanks again d-g.

( I tend to get stuck on enumerations too large to be . . enumerated. )

I have revised this post. My first attempt was with 19-character strings instead of 20. The differences are small but noticeable. Embarrassing.

Looking at the last line in the screen shot:

31383 is the number of strings with not even one pair

743661 have one or more pairs but nothing better

208137 have one or more sets of three but nothing better

15665 have one or more sets of four but nothing better [Some of these will be invalid.]

and so on . . . .

==========================

Back to your original question:

If you use a 62 symbol alphabet with unlimited repetition, there are

(62)^20 = 7.04 x 10^35 possible strings.

If you allow only 3 copies of UC letters, and 4 copies of LC letters, the number of invalid strings in 1M random samples will be:

```
(78/256)*15665 = 4773 <= Sets of 4
((78+104)/256)*(832 + 20 + 1) = (182/256)*853 = 606 <= Sets of 5, 6, and 7
-----------------------
TOTAL = 5379
99.462% of the strings are valid.
```

The correction due to the limits on the number of copies of the integers is less than 1 ppm.

I will post my Python code if any one is interested.

And I did insert one non-random string with 20 of the same character to test the distribution calculations.

I remember doing Monte Carlo estimates of Pi 25 years ago - DOS, turbo pascal. If I recall, it produced an excellent estimation of Pi but was able to determine the quality of the random number generator, in sort of a back-handed manner.

Regarding your python code - I'd love to play with it although I don't (currently) have any way to run it. Hopefully by days end, I will. What version do you use? If otherwise allowed by EE, can you sign i.e. post credits "By X, @y" (email) . If I pass it to my brother, I don't want to take credit.

I do have a very nice "test" PC. Lots of RAM and easily configurable RAM Drive.

The RandString program is as vanilla as you can get, and should run fine on any installation.

The current Python random number generator is based on the Mersenne Twister.

I don't even know what that means, but I believe it is world-class and state-of-the-art.

I have cleaned up the program, taken out the tedious poker hand analysis, and added some comments.

It runs 1M strings in 138 seconds.

RandStrings.py

I still trust my Python program code, but my last manual calculations were off.

I should have been considering fractions of the 62 char alphabet, not the 256.

```
(26/62)*15665 = 6569 <= Sets of 4
((78+104)/256)*(832 + 20 + 1) = (52/62)*853 = 715 <= Sets of 5, 6, and 7
----------------------
TOTAL = 7284
99.271% of the strings are valid.
```

That is recalculating from the old data. I also changed the program to validate strings directly. The results agree to 4 decimal places.RandString-Validation.py

This part I understood: ( "Looking at the last line in the screen shot: ")

31383 is the number of strings with not even one pair

743661 have one or more pairs but nothing better

208137 have one or more sets of three but nothing better

15665 have one or more sets of four but nothing better [Some of these will be invalid.]

Does the 0 before 31383 mean every random assembly had at least one pair? And what is the 21st entry, a "1"?

My remaining concern is that our generated "random" results look suspiciously alike, perhaps they're supposed to?

In the second script you provided, could you explain the 2nd and 3rd numbers of "string" and also

the "Values", "Sorted" and "Pattern" arrays? Again, I got very similar results.

You can't imagine how much I appreciate all of this - thank you so much - again!!

Out-2.txt

Python lists start at 0, but I wanted the index to refer to set size.

I wanted index values from 1 to 20, but I had to include the zero as well.

pattern[ 0] = 0 means nothing

pattern[ 1] = 14 means 14 unmatched characters or singletons

pattern[ 2] = 1 means 1 pair

pattern[ 3] = 0 means no sets of 3

pattern[ 4] = 1 means 1 set of 4 ==> 14*1 + 1*2 + 0*3 + 1*4 = 20 So the rest of the terms are all 0's

dist[ 0] = 0 means nothing

dist[ 1] = 31683 means that many strings have 20 singletons, not even 1 pair.

dist[ 2] = 743661 means that many strings have one or more pairs, but no sets of 3 or greater.

dist[20] = 1 means that one string had a set of 20, but I forced that as a test. It is noted in the code.

If you add up ( 31683 + 743661 + 208137 + . . . you will get the number of strings.

I actually ran 100M strings, and got exactly one set of 8 (none larger), and no occurrences of zero singletons.

If you add up the distribution numbers

Values [ 47, 13, 13, 31, 5, 26, . . . ] are the twenty random numbers in the order they were generated.

Sorted [ 5, 6, 1, 13, 13, 13, 13, . . . ] are the same numbers sorted. You can check. You have to sort to do the analysis.

Pattern [ 0, 14, 1, 0, 1, 0, 0, . . . ] describes the sorted list:

14 singletons one pair ( 31, 31) no sets of 3 one set of 4 ( 13, 13, 13, 13)

1000247 is the number of strings generated and analyzed so far.

992999 is the number that are valid.

99.2753 is the percentage of strings that are valid.

I wanted the program print out intermediate results every 100K strings, but it set it up so that it would print out the next bad string after 100K, 200K . . . In this case, it overran the 1M mark by 247.

We are using random numbers to simulate a complex but well defined problem.

We are using an infinite depth alphabet to model a finite one.

The result: that 99.275% of the strings in a 1M random sample could come from the limited alphabet is persuasive.

I believe it is the correct answer to four or five significant figures.

By clicking you are agreeing to Experts Exchange's Terms of Use.

Title | # Comments | Views | Activity |
---|---|---|---|

sending sweets to asia!! | 5 | 57 | |

Math Stumper | 6 | 32 | |

Graph Function | 6 | 48 | |

Interest rate formula | 7 | 42 |

Join the community of 500,000 technology professionals and ask your questions.

Connect with top rated Experts

**17** Experts available now in Live!