krakatoa

asked on

Random search space philosophy.

This has almost certainly been raised before by others, but there seems to be something uncomfortable about using .size() (of a List or Set, for example) as the bound parameter in Random.nextInt(l.size()).

This approach binds the search space for a random int to the length of the array/list/set, which is not the same volume of search space as on a set such as, for example: 4, 36, 90, 203, 480017. Here, nextInt(size()) means there is an equal chance of any one of the five index positions being returned, whereas the true search space ought to be 480017, i.e. nextInt on the *largest* value in the list. IOW, the two approaches are conceptually / philosophically different - size() by itself never needs a re-iteration to obtain the int, whereas nextInt on the largest value in a non-contiguous set (or any set whose values don't match its indexing) will require (many) more calls, and is therefore nominally more random than nextInt(size()).
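Roughly, the two approaches I mean (a minimal sketch; pickByIndex and pickByValueRange are names I've made up for illustration, using the example values above):

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SearchSpaceDemo {
    static final List<Integer> VALUES = Arrays.asList(4, 36, 90, 203, 480017);
    static final Random RND = new Random();

    // Approach 1: the search space is just the five index positions.
    static int pickByIndex() {
        return VALUES.get(RND.nextInt(VALUES.size()));
    }

    // Approach 2: the search space is [0, 480017]; keep drawing until a draw
    // happens to land on a value that is actually in the list.
    static int pickByValueRange() {
        while (true) {
            int candidate = RND.nextInt(480017 + 1);
            if (VALUES.contains(candidate)) return candidate;
        }
    }

    public static void main(String[] args) {
        System.out.println(pickByIndex() + " / " + pickByValueRange());
    }
}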

There are plenty of guru sites where .size() is used, of course: i.a. here or here.

Perhaps, in encryption terms, the potentially far smaller search space of nextInt(size()) will have an impact on the crackability of the algorithm, and hence the time to crack it?

Any thoughts ?
David Johnson, CD

The nextInt(int n) method is used to get a pseudorandom, uniformly distributed int value between 0 (inclusive) and the specified value (exclusive), drawn from this random number generator sequence.
so the returned value will be between 0 (inclusive) and n (exclusive)
krakatoa (ASKER)

Yes, I understand that 100%. ;)
dpearson

If the goal is to select one of the values from the list, and we did this by the less intuitive method of generating random values between the smallest value and the largest, repeatedly, until one matched an entry in the list, that would indeed still produce a random selection from the list.

The question is - would it be more random than selecting a random index into the list?

I think the answer is no.  We can see that by working back from the probabilities.  If we generate random values until we find one that is in the list, the probability of hitting any given value is still the same - 1/n if there are n items.  This is the same probability as we get by selecting a random index.

The only way that would not be true is if nextInt(10), say, is less random than nextInt(100000).

If you or I wrote the random class, there's a decent chance nextInt(10) would be less random because we might do

int value = random.nextInt();
int result = value % n ;

to generate [0, n), but this method is actually incorrect: nextInt() can return negative values, and Java's % then keeps the sign, so the result can be negative (and there's also a slight bias whenever n doesn't divide 2^32 evenly).
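A minimal sketch of the failure mode (Math.floorMod is just one way to patch the sign; a slight bias remains either way):

import java.util.Random;

public class ModBiasDemo {
    public static void main(String[] args) {
        Random random = new Random();
        int n = 10;
        int value = random.nextInt();   // full int range, so often negative
        int result = value % n;         // Java's % keeps the sign: -7 % 10 == -7
        System.out.println(value + " % " + n + " = " + result); // may be negative!
        // Math.floorMod(value, n) fixes the sign, but values are still very
        // slightly biased whenever n does not divide 2^32 evenly.
        System.out.println(Math.floorMod(value, n));
    }
}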

But the guys in Java land know this, so their random function is OK, and so both methods will get the same degree of randomness.  Just choosing a random index is much faster.

Doug
Doug, thanks for coming in on this. :) A couple of thoughts on what you've said:

by the less intuitive method of

It's surely only less 'intuitive' when, as a programmer, you know that your container can be dealt with using its length. But in the 'pure' sense, if you had darts to throw randomly at an array of numbers at a fairground, its length would do you no good.

The question is - would it be more random than selecting a random index into the list?

It certainly would be, I'd think. Because if you write code based on .size() (and then remove one element from the list to avoid a duplicate reselection), then a hit per iteration is guaranteed, whereas obtaining a hit from a range over the list's values does not guarantee a hit per op.

Yes, both approaches are in the end a form or implementation of randomness, but the .size() approach is a presentation of order of appearance, whereas a ranged approach is an outright search through the field.

the probability of hitting any given value is still the same - 1/n if there are n items.
But not if the search is done on a range something like: 1, 2, 4001, 88937, 4000023, 4000027

The only way that would not be true is if nextInt(10), say, is less random than nextInt(100000).

That's just a request for any old int less than 10, or less than 100000. It's not random against an appearance in another range.
the probability of hitting any given value is still the same - 1/n if there are n items.
But not if the search is done on a range something like: 1, 2, 4001, 88937, 4000023, 4000027
Certainly if you scan a range, each random value has a very low probability of hitting an item in the list.

I meant the probability of the algorithm returning the value from the list using either algorithm:
a) int index = random.nextInt(list.size()) ;         // Small range used here
     return list.get(index) ;

b) boolean found = false ;
    while (!found) {
        int possibleValue = random.nextInt(min,max) ;             // Wide range used here to sample from
        if (list.contains(possibleValue))
           return possibleValue ;
    }


With either of these, the probability distribution of the returned values should always be "1/n" - but with the assumption that random.nextInt(size()) is indeed correctly implemented (see the comment about mod).  Which it is for Java's random class, but might not be for other implementations (e.g. old C++ code used to contain bugs like this all the time).
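A quick empirical check of method (a) (just a sketch; the trial count is arbitrary, and (b) would show the same 1/n shape, only far more slowly):

import java.util.*;

public class DistributionCheck {
    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 4001, 88937, 4000023, 4000027);
        Random random = new Random();
        int trials = 600_000;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < trials; i++) {
            int picked = list.get(random.nextInt(list.size()));  // method (a)
            counts.merge(picked, 1, Integer::sum);
        }
        // Each of the 6 values should land near trials/6 = 100,000.
        System.out.println(counts);
    }
}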

Doug
 int possibleValue = random.nextInt(min,max) ; 

. . . not quite with you here . . . there is no nextInt method that takes two arguments, is there ...?

With either of these, the probability distribution of the returned values should always be "1/n" -

Right, but the "n" is very different for the two coding approaches, is it not ? In the case of your

a) above, over my range (/ list impl) of 1, 2, 4001, 88937, 4000023, 4000027,
"n" is 6,

whereas in your

b)
"n" is 4000027,

is that not so ?

-----------------------------------

If the list were comprised of *contiguous* values 1 thru' n (where n is 4000027 for example), then a call of

random.nextInt(n) + 1

would be guaranteed (probability 1) to return a value from the list, would it not?
Oops - sorry, our local codebase has a wrapped version of Random :)

Should have been:
int possibleValue = random.nextInt(max-min+1)+min ; 

And yes when the list is:
1, 2, 4001, 88937, 4000023, 4000027

then max will be 4000027
so, as you say, the probability of a single call to random.nextInt() returning a value in the list is 6/4000027 (i.e. very, very low).  But we're not just making one call to random.nextInt() in the second method.

The distribution of the values returned from the method I wrote, which generates random numbers until it finds one in the list, will still be 1/n.  It will just take on average around 4000027/6 ≈ 667,000 passes through the loop before it returns.

But the value it returns will still be evenly distributed over the values in the original list:
1, 2, 4001, 88937, 4000023, 4000027
just as will be the values returned by the other method where we pick a random index, not a random value.

So, given that the two methods both produce a random value from the list with the same probability distribution, we can say that both methods are equally random, even though they use very different selection methods.

And given that the first is much much faster, it's the one to prefer (the one where we randomly pick an index).
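The loop count, by the way, is geometric: with hit probability p = 6/4000027 per draw, the expected number of draws is 1/p.  A sketch that measures it (the numbers are just the running example):

import java.util.*;

public class LoopCountDemo {
    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 4001, 88937, 4000023, 4000027);
        Set<Integer> set = new HashSet<>(list);   // O(1) membership tests
        Random random = new Random();
        int min = 1, max = 4000027, picks = 100;
        long totalDraws = 0;
        for (int i = 0; i < picks; i++) {
            while (true) {
                totalDraws++;
                int candidate = random.nextInt(max - min + 1) + min;
                if (set.contains(candidate)) break;
            }
        }
        // Expect roughly 4000027/6 ≈ 667,000 draws per pick.
        System.out.println("average draws per pick: " + totalDraws / picks);
    }
}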
The distribution of probability can be the same, but the enactment of the result - any result - can't be, because if they were equal probabilistically, then they'd take the same amount of time, which can't be true.
because if they were equal probabilistically, then they'd take the same amount of time

Now I'm the one not really following.  Why would the same probability distribution imply they should take the same time to compute?

FYI, in general "SecureRandom", which is a "better" random (it's seeded from real entropy rather than being purely pseudorandom), can take an arbitrary amount of time to respond while it waits for sufficient "entropy".  Fun stuff.  But anyway, it's good at producing actual random values on a computer, but does so at the cost of unpredictable response times, unless you buy an "entropy source" for your machine... which means something like a radioactive isotope which decays somewhere and gets measured.
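A minimal usage sketch (the byte-array size here is arbitrary; new SecureRandom() seeds itself from the OS):

import java.security.SecureRandom;

public class SecureRandomDemo {
    public static void main(String[] args) {
        SecureRandom secure = new SecureRandom();  // seeded from OS entropy
        int index = secure.nextInt(6);             // drop-in replacement for Random here
        byte[] key = new byte[16];
        secure.nextBytes(key);                     // e.g. raw key material
        System.out.println("index: " + index);
    }
}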
Why would the same probability distribution imply they should take the same time to compute?

I'll add more confusion :)

Your code a) takes less time to return a valid value (by valid, meaning a term from the list, not just a random int in itself). Code b) will take longer to do that (as you have noted yourself in the comments alongside). So the longer time taken by algo b) is symptomatic of more *entropy, isn't it? :)


* At least, a factor of entropy relative to the two algos.
if you want to return one element of your array, i.e. 4,36,90,203,480017,
all you need to do is select a random index from 0 to (array.length - 1) and then use it to index into your array, i.e. array[random.nextInt(array.length)]; so if the generator returns 3 you would return the element at index 3 (the fourth element).
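In plain java.util.Random terms, a minimal sketch of that:

import java.util.Random;

public class IndexPick {
    public static void main(String[] args) {
        int[] array = {4, 36, 90, 203, 480017};
        Random random = new Random();
        int index = random.nextInt(array.length); // uniform over 0..4
        System.out.println(array[index]);
    }
}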
So the longer time taken by algo b) is symptomatic of more *entropy, isn't it? :)
* At least, a factor of entropy relative to the two algos.

It could be, but I don't think it is :)

Running a pseudorandom generator many times in a row (which is what we're doing in (b)) doesn't increase the randomness of the final value.  If you knew the first value and how the generator works, you'd also know the 1000th value - it's determined by the first value exactly.  Hence it's not "more random" just because we ran it over and over.
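You can see that determinism directly (a minimal sketch; the seed value is arbitrary):

import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        Random a = new Random(42);   // same seed...
        Random b = new Random(42);
        for (int i = 0; i < 1000; i++) a.nextInt();
        for (int i = 0; i < 1000; i++) b.nextInt();
        // ...same 1001st value, every time.
        System.out.println(a.nextInt() == b.nextInt());  // true
    }
}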

(Side note - this is actually how most Two Factor authentication algorithms work - you just need to know the initial seed and how to generate the sequence to produce the latest value, so all you're really proving is that you had the seed.  And although the values in the sequence look random to an observer - say, somebody looking over my shoulder - it's critical that they're not actually random, or there would be no way for the server to verify them.)

Also, we can easily make a slower variant (b) with no change in randomness:

a) int index = random.nextInt(list.size()) ;         // Small range used here
     return list.get(index) ;

b) int index = random.nextInt(list.size()) ;         // Small range used here
     Thread.sleep(2000) ;                            // artificial two-second delay
     return list.get(index) ;

So it's dangerous to assume that taking longer or doing more operations somehow increases randomness.

That's why if you need "truly random" you need some sort of external source of that randomness - something which can't be inferred from the internal machine state (since CPUs should always be deterministic).  You may have seen some security programs asking you to move your mouse around for a bit before they generate a key.  They're using those mouse moves as a weak source of "true randomness".
As I say, there are better sources of "true randomness", but that generally means buying an extra PCI card with a radioactive element inside it and plugging it in.  The sort of thing you would find in the back of a modern slot machine.
David

Yes, as I answered to your first comment, I understand this totally, completely and fully. But that's not my point. It's not really about the result as such - it's about the method of arriving at a result. One method is to predicate the search on the array length - which is the shortest way and is bound to produce a result.

The other (or another) approach is to check the required value against *any* possible returned value from a range, the ubound of which is the largest value in the array. What's important here is that if the array is not contiguous, there's no guarantee that a value actually contained in the array will be returned. This means that x number of searches will fail, making the time taken to find any or all values longer. So the approach using the length of the array is nowhere near as random as the other approach, and this is the issue.
The irony about radioactive decay as a source of randomness could be said to be the fact that these materials' half-lives are known quite accurately. Tellurium-128 has a half-life of 2.2×10^24 (2.2 septillion) years, whereas Hydrogen-7 has a half-life of 23 yoctoseconds (2.3×10^-23 seconds). So the statistical rate of decay is known - only the order in which the atoms decay is not, in both cases. The size() approach to picking an int from an array (like asking which atom will decay next) is, however, deterministic, and therefore in itself not random. If radioisotopes behaved in this way, you'd need to exhume Newton, Einstein, Pauli, Feynman, Gauss, Fermi, Turing, Teller, Oppenheimer and quite a few others to help re-write the laws of physics.

;)
Random numbers have always been the bane of computing. Computers generate pseudorandom numbers.  For program testing we need the same set of random numbers, so we can repeat the tests and get the same results. In production we don't want that repeatability, so we have to seed the random number generator with some form of entropy.  This is normally done by measuring something that occurs very slowly using a counter that counts really fast, and using the smallest part - like the mouse moves described above.
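A sketch of that test-vs-production split (the seed value is arbitrary):

import java.util.Random;

public class Seeding {
    public static void main(String[] args) {
        Random forTests = new Random(12345L);           // fixed seed: repeatable runs
        Random forProduction = new Random();            // self-seeded (time/counter based)
        System.out.println(forTests.nextInt(100));      // same value every run
        System.out.println(forProduction.nextInt(100)); // varies run to run
    }
}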

The random implementation in java is not really random.
public int nextInt(int n)
Returns a pseudorandom, uniformly distributed int value between 0 (inclusive) and the specified value (exclusive), drawn from this random number generator's sequence. The general contract of nextInt is that one int value in the specified range is pseudorandomly generated and returned. All n possible int values are produced with (approximately) equal probability. The method nextInt(int n) is implemented by class Random.
Java 7 Docs
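The "(approximately) equal probability" comes from rejecting draws that would otherwise bias the result toward low values - a sketch of that idea (not the actual JDK source):

import java.util.Random;

public class UnbiasedBound {
    // Reject draws from the "tail" of the generator's range that would
    // otherwise make some results slightly more likely than others.
    static int boundedInt(Random rnd, int bound) {
        int limit = (Integer.MAX_VALUE / bound) * bound; // largest multiple of bound
        int r;
        do {
            r = rnd.nextInt() & Integer.MAX_VALUE;       // 31 non-negative bits
        } while (r >= limit);                            // reject the biased tail
        return r % bound;                                // now exactly uniform
    }

    public static void main(String[] args) {
        System.out.println(boundedInt(new Random(), 6));
    }
}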

I would not use it for cryptography; you'd be better off using one of the publicly available random number generators on the internet: https://en.wikipedia.org/wiki/List_of_random_number_generators#Random_number_servers
ASKER CERTIFIED SOLUTION
dpearson
Thanks - great input. No gratuitously random comments in sight. ;)