We help IT Professionals succeed at work.

Probability problem

Ed Covney
Ed Covney asked
on
288 Views
Last Modified: 2014-08-15
This "Probability" problem has stumped me for several weeks:

If I randomly select a 20 character password from a set of 256 characters composed of:
   4 sets of lower case letters (104),
   3 sets of Upper case letters  (78),
   7 of each number in 0..5      (42) and
   8 of each number in 6..9      (32)
                        total   (256)

What are the chances a second random 20 character selection will match the first?

Notes:
  P(M,N) means the probability of M items taken N at a time, order counts.
  C(M,N) means the # of combinations of M items taken N at a time (order is n/a)

Many thanks . . .

- Ed
Comment
Watch Question

CERTIFIED EXPERT

Commented:
My first guess is zero.

My first estimate is odds of 1 in

                      256!
--------------------------------------------
(4!)^26 * (3!)^26 * (7!)^6 * (8!)^4 * (236!)

Open in new window

akb
CERTIFIED EXPERT

Commented:
1 in 256^20

i.e. 1 in 1.4615016373309029182036848327163e+48
or approximately 1 in 1,461,501,637,330,900,000,000,000,000,000,000,000,000,000,000,000
CERTIFIED EXPERT
Awarded 2010
Top Expert 2013

Commented:
4 lower case letters is 26^4 not 104, right?

   4 sets of lower case letters 26^4,
   3 sets of Upper case letters  26^3,
   7 of each number in 0..5    6^7
   8 of each number in 6..9    4^8

So multiply all those together to get the total number of possible sets of characters.

Then you can interleave them many ways.

Total possible ordering of the characters is 20! but you can divide that by 4!3!7!8!.

So 20! *26^4*26^7*6^7*4^8
---------------------------------------
              4!*3!*7!*8!

Note that you could simplify the two 26^ terms to 26^11, but it's easier to see what went where above.
CERTIFIED EXPERT
Awarded 2010
Top Expert 2013

Commented:
Note that that is the total possible combinations, so flip the fraction to get the odds of a match with one try.
akb
CERTIFIED EXPERT

Commented:
Sorry, I didn't understand the question. Please ignore my answer.
CERTIFIED EXPERT

Commented:
Well, my first estimate was way off, but I think I know why.

256!/236! is the number of passwords for 256 unique characters.
Next you have to calculate the probable number of repeated characters and the probable number of repetitions for each.

Commuting time has arrived.  Hope to do more before ozo makes me irrelevant.
Probability-for-ExEx.PNG
CERTIFIED EXPERT
Awarded 2010
Top Expert 2013

Commented:
Oh, I understand the problem more fully now. My answer was for if you had a password composed of 4 lowercase characters, 3 uppercase, etc.

   4 sets of lower case letters (104),
   3 sets of Upper case letters  (78),
   7 of each number in 0..5      (42)
   8 of each number in 6..9      (32)

So your character set to choose from is aaaaAAAbbbbBBB etc.

So there are C(256,20) different collections of characters you could use but some are repeats since 'g' is the same as 'g' regardless of order. So d-glitch's first suggestion with a 20! term added to the denominator is the total possible sets of characters you could use.

Making order matter is tricky though for the 20 since you don't know how many repeats there are.
CERTIFIED EXPERT
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION
CERTIFIED EXPERT
Awarded 2010
Top Expert 2013

Commented:
I solved it for a smaller problem [aaaabcc] with a state diagram. I was hoping something would pop. It didn't.

It can be solved like that, but it would take a long time to do on paper.

I don't know if there is a closed form solution.
Ed CovneyRetired

Author

Commented:
Tommy -  
In reality . . .
Enumerating possibilities > 2^32  (4,294,967,296) is a data base nightmare and  
Enumerating possibilities > 2^64  is a physical impossibility given CPU and storage (DB) limitations.
In theory, it IS enumerable which means the answer is (a very large) integer number.

d-glitch:
Your  ".. exact solution will be hard to find" may be correct but I think you're bordering on near perfect in both your analysis and the answer that you provided 25 hours ago.

I approached from a boundaries point of view: Imagine a set of 62 different characters and no mater what character you draw first, you put it back and shuffle again. So your choice for every draw is from 62 elements.  Hence exactly 62^20 possibilities. Of course my problem set isn't that "rich", so  (the real answer)  X must be < 62^20, the upper boundary.

For a lower boundary, allow the set of 62 to deplete itself as you draw and now there are exactly P(62,20) possibilities. My problem set is a little richer than this one so this time X is > P(62,20).

   62^20  = 7.04 x 10^35 = 704423425546998022968330264616370176  
 P(62,20) = 2.24 x 10^34 =  22398459951704670955315895500800000    and
 P(62,20) <   X   <   62^20

Had I realized the bounds were this close 3 days ago, I'd have never bother posting the question. Your correction to the upper limit is genius (so I'm glad I did post it) although I don't like the 61^15 (?) thingy.  In any event, your answer is in "range" and very reasonable.

Thank you.

- Ed
CERTIFIED EXPERT

Commented:
Thanks for the points and your vote of confidence.
I don't like the 61^15 factor very much either.

I have since realized one way to approach the problem is with Monte Carlo methods.
Generate random strings using the Case 3 (unlimited repetition) alphabet and keep track of the stats on the violations.  
This would work with your original (or any arbitrary) alphabet restrictions.

Python can run 1 or 10 or 100 million cases pretty quickly.
This will give a lot better precision that my 0.1% estimate of invalid strings.

This continues to be an interesting problem.
I will keep you posted.
Ed CovneyRetired

Author

Commented:
Estimate instead of enumerate?
A very clever concept.
Thanks again d-g.

( I tend to get stuck on enumerations too large to be . . enumerated. )
CERTIFIED EXPERT

Commented:
Python Output -- 1 Million Random StringsHello Ed,

I have revised this post.  My first attempt was with 19-character strings instead of 20.  The differences are small but noticeable.  Embarrassing.

Looking at the last line in the screen shot:
   31383  is the number of strings with not even one pair
 743661  have one or more pairs but nothing better
 208137  have one or more sets of three but nothing better
   15665  have one or more sets of four but nothing better [Some of these will be invalid.]
   and so on . . . .

======================================================
Back to your original question:

If you use a 62 symbol alphabet with unlimited repetition, there are
     (62)^20 = 7.04 x 10^35 possible strings.

If you allow only 3 copies of UC letters, and 4 copies of LC letters, the number of invalid strings in 1M random samples will be:
                                      (78/256)*15665   =  4773   <= Sets of 4
    ((78+104)/256)*(832 + 20 + 1)  =  (182/256)*853    =   606   <= Sets of 5, 6, and 7
                                       -----------------------
                                                TOTAL  =  5379
    99.462% of the strings are valid. 

Open in new window


The correction due to the limits on the number of copies of the integers is less than 1 ppm.

I will post my Python code if any one is interested.

And I did insert one non-random string with 20 of the same character to test the distribution calculations.
Ed CovneyRetired

Author

Commented:
d-g,

I remember doing Monte Carlo estimates of Pi 25 years ago - DOS, turbo pascal. If I recall, it produced an excellent estimation of Pi but was able to determine the quality of the random number generator, in sort of a back-handed manner.

Regarding your python code - I'd love to play with it although I don't (currently) have any way to run it. Hopefully by days end, I will. What version do you use? If otherwise allowed by EE, can you sign i.e. post credits "By X, @y"  (email) . If I pass it to my brother, I don't want to take credit.

I do have a very nice "test" PC. Lots of RAM and easily configurable RAM Drive.
CERTIFIED EXPERT

Commented:
I am running Python 3.3.0 on a Win7 64-bit machine.  
The RandString program is as vanilla as you can get, and should run fine on any installation.

The current Python random number generator is based on the Mersenne Twister.
I don't even know what that means, but I believe it is world-class and state-of-the-art.

I have cleaned up the program, taken out the tedious poker hand analysis, and added some comments.
It runs 1M strings in 138 seconds.
RandStrings.py
CERTIFIED EXPERT

Commented:
I have another correction to make.

I still trust my Python program code, but my last manual calculations were off.

I should have been considering fractions of the 62 char alphabet, not the 256.
                                      (26/62)*15665  =  6569   <= Sets of 4
    ((78+104)/256)*(832 + 20 + 1)  =  (52/62)*853    =   715   <= Sets of 5, 6, and 7
                                       ----------------------
                                              TOTAL  =  7284
    99.271% of the strings are valid.  

Open in new window

That is recalculating from the old data.

I also changed the program to validate strings directly.  The results agree to 4 decimal places.Screen shotRandString-Validation.py
Ed CovneyRetired

Author

Commented:
d-glitch  -
This part I understood: ( "Looking at the last line in the screen shot: ")
   31383  is the number of strings with not even one pair
 743661  have one or more pairs but nothing better
 208137  have one or more sets of three but nothing better
   15665  have one or more sets of four but nothing better [Some of these will be invalid.]

Does the 0 before 31383 mean every random assembly had at least one pair?  And what is the 21st entry, a "1"?
My remaining concern is that our generated "random" results look suspiciously alike, perhaps they're supposed to?

In the second script you provided, could you explain the 2nd and 3rd numbers of "string" and also
the "Values", "Sorted" and "Pattern" arrays? Again, I got very similar results.

You can't imagine how much I appreciate all of this - thank you so much - again!!
Out-2.txt
CERTIFIED EXPERT

Commented:
The 0 before the 31383 is a place holder.
Python lists start at 0, but I wanted the index to refer to set size.
I wanted index values from 1 to 20, but I had to include the zero as well.

pattern[ 0, 14, 1, 0, 1, 0, 0, 0, . . . ]  refers to a particular string.
       pattern[  0] = 0   means nothing
       pattern[ 1] = 14   means 14 unmatched characters or singletons
       pattern[ 2] = 1   means 1 pair
       pattern[ 3] = 0   means no sets of 3
       pattern[ 4] = 1   means 1 set of 4   ==>   14*1 + 1*2 + 0*3 + 1*4 = 20   So the rest of the terms are all 0's

  dist[ 0,  31683,  743661,  208137, . . .]   refers to all the strings generated and analyzed so far.
          dist[ 0] = 0   means nothing
          dist[ 1]  =   31683   means that many strings have 20 singletons, not even 1 pair.
          dist[ 2]  = 743661   means that many strings have one or more pairs, but no sets of 3 or greater.
          dist[20] = 1              means that one string had a set of 20,  but I forced that as a test.  It is noted in the code.

If you add up ( 31683 + 743661 + 208137 + . . .   you will get the number of strings.
I actually ran 100M strings, and got exactly one set of 8 (none larger), and no occurrences of zero singletons.
If you add up the distribution numbers  31683 + 743661 + 208137 + . . .  you will get the number of strings (1M).

Values  [ 47, 13, 13, 31, 5, 26,  . . .  ]    are the twenty random numbers in the order they were generated.
Sorted  [ 5, 6, 1, 13, 13, 13, 13, . . . ]    are the same numbers sorted.  You can check.  You have to sort to do the analysis.

Pattern [ 0, 14,  1,  0,  1,  0,  0,  . .  . ]   describes the sorted list:
        14 singletons     one pair ( 31, 31)     no sets of 3     one set of 4 ( 13, 13, 13, 13)

String   1000247   992999     99.2753
     1000247  is the number of strings generated and analyzed so far.
       992999  is the number that are valid.
             99.2753  is the percentage of strings that are valid.

I wanted the program print out intermediate results every 100K strings, but it set it up so that it would print out the next bad string after 100K, 200K . . .   In this case, it overran the 1M mark by 247.

We are using random numbers to simulate a complex but well defined problem.
We are using an infinite depth alphabet to model a finite one.
The result:  that 99.275% of the strings in a 1M random sample could come from the limited alphabet is persuasive.
I believe it is the correct answer to four or five significant figures.
Ed CovneyRetired

Author

Commented:
Adding your descriptions as comments to your code.  

Thanks again.

- Ed
CERTIFIED EXPERT

Commented:
>>  Adding your descriptions as comments to your code.  
Thanks.  Data description is really a critical part of the documentation.

Sounds like you have the programs up an running, which is great.  
Let me know if you have any more interesting problems.
Ed CovneyRetired

Author

Commented:
If I have a shuffled deck of 256 (numbers) cards, how can I test if they are in fact "randomly" ordered?

I'll pose this as a new question for you experts this aft.

Gain unlimited access to on-demand training courses with an Experts Exchange subscription.

Get Access
Why Experts Exchange?

Experts Exchange always has the answer, or at the least points me in the correct direction! It is like having another employee that is extremely experienced.

Jim Murphy
Programmer at Smart IT Solutions

When asked, what has been your best career decision?

Deciding to stick with EE.

Mohamed Asif
Technical Department Head

Being involved with EE helped me to grow personally and professionally.

Carl Webster
CTP, Sr Infrastructure Consultant
Empower Your Career
Did You Know?

We've partnered with two important charities to provide clean water and computer science education to those who need it most. READ MORE

Ask ANY Question

Connect with Certified Experts to gain insight and support on specific technology challenges including:

  • Troubleshooting
  • Research
  • Professional Opinions
Unlock the solution to this question.
Join our community and discover your potential

Experts Exchange is the only place where you can interact directly with leading experts in the technology field. Become a member today and access the collective knowledge of thousands of technology experts.

*This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

OR

Please enter a first name

Please enter a last name

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the Terms of Use and Privacy Policy.