krakatoa asked:
Random search space philosophy.
This has almost certainly been raised before by others, but there seems to be something uncomfortable about using .size() (of a List or Set, for example) as the bound parameter, as in Random.nextInt(l.size()).
This approach will bind the search space for a random int to the length of the array/list/set, which is not going to be the same volume of search space on a set such as, for example: 4, 36, 90, 203, 480017. Here, nextInt(size()) means there is an equal chance of any one of the five index positions being returned, whereas the true search space ought to be 480017, i.e. nextInt on the *largest* value in the list. IOW, the two approaches are conceptually/philosophically different - size() by itself never needing a re-iteration to obtain the int, whereas nextInt on the largest value in a non-contiguous set, or any set not having values matching its indexing, will require (many) more calls and therefore be nominally more random than nextInt(size()).
There are plenty of guru sites where .size() is used of course : i.a. here or here.
Perhaps in encryption terms, the potentially far smaller search space of nextInt(size()) will have an impact on the crackability algo, and hence time ?
Any thoughts ?
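To make the contrast concrete, here is a minimal sketch of my own (not code from the thread) of the two selection strategies on the example set; the class and method names are my invention:

```java
import java.util.List;
import java.util.Random;

public class TwoStrategies {
    static final List<Integer> VALUES = List.of(4, 36, 90, 203, 480017);

    // Strategy 1: bound the search space by the list's size and pick an index.
    static int byIndex(Random r) {
        return VALUES.get(r.nextInt(VALUES.size()));
    }

    // Strategy 2: draw ints up to the largest value, retrying until one
    // happens to be a member of the list (rejection sampling).
    static int byValueRange(Random r) {
        int max = 480017;
        while (true) {
            int candidate = r.nextInt(max + 1); // uniform in [0, max]
            if (VALUES.contains(candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        Random r = new Random();
        System.out.println("by index:       " + byIndex(r));
        System.out.println("by value range: " + byValueRange(r)); // slow: ~96,000 draws on average
    }
}
```

Both return a member of the list; the question the thread debates is whether the second is in any sense "more random".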
ASKER
Yes, I understand that 100%. ;)
If the goal is to select one of the values from the list and we did this by the less intuitive method of generating random values between the smallest value and the largest repeatedly until one matched an entry in the list, that would indeed still produce a random selection from the list.
The question is - would it be more random than selecting a random index into the list?
I think the answer is no. We can see that by working back from the probabilities. If we generate random values until we find one that is in the list, the probability of hitting any given value is still the same - 1/n if there are n items. This is the same probability as we get by selecting a random index.
The only way that would not be true is if nextInt(10), say, is less random than nextInt(100000).
If you or I wrote the random class, there's a decent chance nextInt(10) would be less random, because we might do

    int value = random.nextInt() ;
    int result = value % n ;

to generate [0, n), but this method is actually incorrect because of how the modulus operator works around 0.
But the guys in Java land know this so their random function is ok and so both methods will get the same degree of randomness. Just choosing a random index is much faster.
Doug
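Doug's point about the modulus pitfall can be seen directly; this is a small illustration of mine, not code from the thread:

```java
public class ModPitfall {
    public static void main(String[] args) {
        int n = 10;
        int raw = -7; // java.util.Random.nextInt() can return negative values

        // Naive reduction: Java's % keeps the sign of the dividend,
        // so the result falls outside [0, n) for negative inputs.
        System.out.println(raw % n);               // prints -7

        // Math.floorMod folds negatives back into [0, n).
        System.out.println(Math.floorMod(raw, n)); // prints 3

        // Even with signs handled, reducing a full-width random int with %
        // is slightly biased whenever n does not divide 2^32 evenly;
        // java.util.Random.nextInt(n) rejects and redraws to stay uniform.
    }
}
```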
ASKER
Doug, thanks for coming in on this. :) A couple of thoughts on what you've said:

> by the less intuitive method of

It's surely only less 'intuitive' when, as a programmer, you know that your container can be dealt with using its length. But in the 'pure' sense, if you had darts to throw randomly at an array of numbers at a fairground, its length would do you no good.

> The question is - would it be more random than selecting a random index into the list?

It certainly would be, I'd think. Because if you write code based on .size() (and then remove one element from the list to avoid a duplicate reselection), then a hit per iteration is guaranteed, whereas to obtain a hit from a range over the list's values does not guarantee a hit per op.

Yes, both approaches are in the end a form or implementation of randomness, but the .size() approach is a presentation of order of appearance, whereas a ranged approach is an outright search through the field.

> the probability of hitting any given value is still the same - 1/n if there are n items.

But not if the search is done on a range something like: 1, 2, 4001, 88937, 4000023, 4000027.

> The only way that would not be true is if nextInt(10), say, is less random than nextInt(100000).

That's just a request for any old int less than 10, or less than 100000. It's not random against an appearance in another range.
> the probability of hitting any given value is still the same - 1/n if there are n items.
> But not if the search is done on a range something like: 1, 2, 4001, 88937, 4000023, 4000027

Certainly if you scan a range, each random value has a very low probability of hitting an item in the list. I meant the probability of the algorithm returning the value from the list using either algorithm:

a)  int index = random.nextInt(list.size()) ; // Small range used here
    return list.get(index) ;

b)  boolean found = false ;
    while (!found) {
        int possibleValue = random.nextInt(min, max) ; // Wide range used here to sample from
        if (list.contains(possibleValue))
            return possibleValue ;
    }

With either of these, the probability distribution of the returned values should always be "1/n" - but with the assumption that random.nextInt(size()) is indeed correctly implemented (see the comment about mod). Which it is for Java's Random class, but might not be for other implementations (e.g. old C++ code used to contain bugs like this all the time).

Doug
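Doug's "same 1/n distribution" claim is easy to check empirically. Here is a sketch of mine (not from the thread) that runs both methods many times; I've used a small value range so the rejection loop in method (b) finishes quickly:

```java
import java.util.*;

public class SameDistribution {
    static final List<Integer> VALUES = List.of(1, 2, 7, 9, 15);

    public static void main(String[] args) {
        Random r = new Random();
        int trials = 100_000;
        Map<Integer, Integer> indexCounts = new TreeMap<>();
        Map<Integer, Integer> rangeCounts = new TreeMap<>();

        for (int i = 0; i < trials; i++) {
            // Method (a): pick a random index.
            indexCounts.merge(VALUES.get(r.nextInt(VALUES.size())), 1, Integer::sum);

            // Method (b): pick a random value in [1, 15], redrawing until it's in the list.
            int v;
            do {
                v = 1 + r.nextInt(15);
            } while (!VALUES.contains(v));
            rangeCounts.merge(v, 1, Integer::sum);
        }

        // Both histograms should come out close to trials/5 = 20,000 per value.
        System.out.println("by index: " + indexCounts);
        System.out.println("by range: " + rangeCounts);
    }
}
```

Method (b) does more work per pick, but the histograms it produces are statistically indistinguishable from method (a)'s.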
ASKER
> int possibleValue = random.nextInt(min,max) ;
. . . not quite with you here . . . there is no nextInt method that takes two arguments, is there ...?
> With either of these, the probability distribution of the returned values should always be "1/n" -
Right, but the "n" is very different for the two coding approaches, is it not ? In the case of your
a) above, over my (range ( / list impl)) of 1, 2, 4001, 88937, 4000023, 4000027
"n" is 6,
whereas in your
b)
"n" is 4000027,
is that not so ?
--------------------------
If the list were comprised of *contiguous* values 1 thru' n (where n is 4000027 for example), then a call of
random.nextInt(n)
would have probability of 1 of returning a value would it not?
Oops - sorry, our local codebase has a wrapped version of Random :)
Should have been:
int possibleValue = random.nextInt(max-min+1)+min ;
And yes when the list is:
1, 2, 4001, 88937, 4000023, 4000027
then max will be 4000027
so as you say the probability of random.nextInt() returning a value in the list is 6/4000027 (i.e. very very low). But we're not just making one call to random.nextInt() in the second method.
The distribution of the values returned from the method I wrote which generates random numbers until it finds one in the list will still be 1/n. It will just take on average around 4000027/6 ≈ 666,671 times through the loop before it returns.
But the value it returns will still be evenly distributed within the list of values in the original list:
1, 2, 4001, 88937, 4000023, 4000027
just as will be the values returned by the other method where we pick a random index, not a random value.
So given that the two methods both produce a random value from the list with the same probability distribution we can say that both methods are equally random even though they use very different selection methods.
And given that the first is much much faster, it's the one to prefer (the one where we randomly pick an index).
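The expected loop count Doug mentions follows from the geometric distribution: with 6 matching values in a range of 4,000,027, each draw succeeds with probability 6/4000027, so the expected number of draws is the reciprocal, about 666,671. A small simulation of mine (names are my own invention) confirms this:

```java
import java.util.*;

public class ExpectedDraws {
    static final Set<Integer> VALUES =
            Set.of(1, 2, 4001, 88937, 4000023, 4000027);
    static final int MIN = 1, MAX = 4000027;

    // Average number of random draws needed per successful pick,
    // using the rejection-sampling method (b) from above.
    static long averageDraws(int picks, Random r) {
        long total = 0;
        for (int i = 0; i < picks; i++) {
            int candidate;
            do {
                candidate = MIN + r.nextInt(MAX - MIN + 1);
                total++;
            } while (!VALUES.contains(candidate));
        }
        return total / picks;
    }

    public static void main(String[] args) {
        // Geometric distribution: expected draws = range/|list| = 4000027/6 ≈ 666,671.
        System.out.println("average draws per pick: " + averageDraws(20, new Random()));
    }
}
```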
ASKER
The distribution of probability can be the same, but the enactment of the result - any result - can't be, because if they were equal probabilistically, then they'd take the same amount of time, which can't be true.
> because if they were equal probabilistically, then they'd take the same amount of time
Now I'm the one not really following. Why would the same probability distribution imply they should take the same time to compute?
FYI in general "SecureRandom" is a "better" random (it's a cryptographically strong generator, seeded from real entropy rather than a predictable value) and it can take an arbitrary amount of time to respond while it waits for sufficient "entropy". Fun stuff. But anyway it's good at producing hard-to-predict values on a computer, but does so at the cost of unpredictable response times, unless you buy an "entropy source" for your machine... which means something like a radioactive isotope which decays somewhere and gets measured.
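For what it's worth, SecureRandom is a drop-in replacement here, since it extends java.util.Random; a minimal sketch of mine:

```java
import java.security.SecureRandom;
import java.util.List;

public class SecurePick {
    public static void main(String[] args) {
        List<Integer> values = List.of(4, 36, 90, 203, 480017);

        // SecureRandom extends java.util.Random, so nextInt(bound) works the
        // same way; the difference is that its output is cryptographically
        // strong and its seed comes from OS-level entropy.
        SecureRandom sr = new SecureRandom();
        int pick = values.get(sr.nextInt(values.size()));
        System.out.println("secure pick: " + pick);
    }
}
```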
ASKER
> Why would the same probability distribution imply they should take the same time to compute?
I'll add more confusion :)
Your code a) takes less time to return a valid value (by valid, meaning a term from the list, not just a random int in itself). Code b) will take longer to do that (as you have noted yourself in the comments alongside). So the longer time taken by algo b) is symptomatic of more *entropy*, isn't it? :)
* At least, a factor of entropy relative to the two algos.
If you want to return one element of your array, i.e. 4, 36, 90, 203, 480017,
all you need to do is select a random number from 1 to (number of elements of array) and then use the returned number to index into your array, i.e. array[random(1, array.length)], so if random returns 3 you would return element 3 of your array.
> So the longer time taken by algo b) is symptomatic of more *entropy*, isn't it? :)
> * At least, a factor of entropy relative to the two algos.
It could be, but I don't think it is :)
Running a pseudo random generator many times in a row (which is what we're doing in (b)) doesn't increase the randomness of the final value. If you knew the first value and how the generator works, you'd also know the 1000th value - it's determined by the first value exactly. Hence it's not "more random" just because we ran it over and over.
(Side note - this is actually how most Two Factor authentication algorithms work - you just need to know the initial seed and how to generate the sequence to generate the latest value, so all you're really proving is that you had the seed. And although the values in the sequence look random to an observer, say somebody looking over my shoulder, it's critical that they're not actually random, or there'd be no way for the server to verify them.)
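The determinism being described is easy to demonstrate; this sketch (my own, not from the thread) shows that two generators sharing a seed agree even 1000 values deep into the sequence:

```java
import java.util.Random;

public class Deterministic {
    public static void main(String[] args) {
        // Two generators with the same seed produce identical sequences:
        // knowing the seed (or one internal state) determines every later value.
        Random a = new Random(12345);
        Random b = new Random(12345);
        for (int i = 0; i < 1000; i++) {
            a.nextInt(); // burn 1000 values on each
            b.nextInt();
        }
        // The 1001st values still agree exactly.
        System.out.println(a.nextInt() == b.nextInt()); // prints true
    }
}
```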
Also, we can easily make a slower version with no change in randomness:

a)  int index = random.nextInt(list.size()) ; // Small range used here
    return list.get(index) ;

b)  int index = random.nextInt(list.size()) ; // Small range used here
    Thread.sleep(2000) ;
    return list.get(index) ;

So it's dangerous to assume that taking longer or doing more operations somehow increases randomness.
That's why if you need "truly random" you need some sort of external source of that randomness - something which can't be inferred from the internal machine state (since CPUs should always be deterministic). You may have seen some security programs asking you to move your mouse around for a bit before they generate a key. They're using those mouse moves as a weak source of "true randomness".
As I say there are better sources of "true randomness", but that generally means buying an extra PCI card with a radioactive element inside it and plugging it in. Sort of thing you would find in the back of a modern slot machine.
ASKER
David
Yes, as I answered to your first comment, I understand this totally, completely and fully. But that's not my point. It's not really about the result as such - it's about the method of arriving at a result. One method is to predicate the search on the array length - which is the shortest way and is bound to produce a result.
The other (or another) approach is to check the required value against *any* possible returned value from a range, the upper bound of which is the largest value in the array. What's important here is that if the array is not contiguous, there's no guarantee that a value actually contained in the array will be returned. This means that x number of searches will fail, making the time taken to find any or all values longer. So the approach using the length of the array is nowhere near as random as the other approach, and this is the issue.
ASKER
The irony about radioactive decay as a source of randomness could be said to be the fact that these materials' half-lives are known quite accurately. Tellurium-128 has a half-life of 2.2×10^24 (2.2 septillion) years, whereas Hydrogen-7 has a half-life of 23 yoctoseconds (2.3×10^-23 seconds). So the decay timescale is known - only the order in which the atoms decay is not, in both cases. In the case of the size() approach, picking an int from an array (like which atom will decay) is however deterministic, and therefore in itself not random. If radioisotopes behaved in this way, you'd need to exhume Newton, Einstein, Pauli, Feynman, Gauss, Fermi, Turing, Teller, Oppenheimer and quite a few others to help re-write the laws of physics.
;)
Random numbers have always been the bane of computing. Computers generate pseudorandom numbers. For program testing we need the same set of random numbers so we can repeat the tests and get the same results. In production we don't want the repeatability, so we have to seed the random number generator with some form of entropy. This is normally done by measuring something that occurs very slowly using a counter that counts really fast, and using the smallest part. Like the mouse moves described above.
The random implementation in java is not really random.
I would not use it for cryptography; you'd be better off using one of the publicly available random number generators on the internet: https://en.wikipedia.org/wiki/List_of_random_number_generators#Random_number_servers
> The random implementation in java is not really random.

From the Java 7 docs for public int nextInt(int n):

"Returns a pseudorandom, uniformly distributed int value between 0 (inclusive) and the specified value (exclusive), drawn from this random number generator's sequence. The general contract of nextInt is that one int value in the specified range is pseudorandomly generated and returned. All n possible int values are produced with (approximately) equal probability."

So the value will be a value between 0 (inclusive) and n (exclusive).
ASKER
Thanks - great input. No gratuitously random comments in sight. ;)