Uniform selection from unknown range
June 30th, 2012 § 6 Comments
I have heard the following puzzle from a colleague I work with. It should be fairly simple, but is still pretty interesting to solve. The solution technique might come in handy for solving various other puzzles, and even for real-world problems. But enough with the introduction, let us cut to the chase.
You are given a black box that is able to generate numbers. You know nothing about the range of these numbers.
At some point, this black box will be done generating the numbers and will not be able to produce any more of them. You obviously can not know in advance how many numbers are going to be generated.
Once all numbers have been generated, the black box can no longer be used. It can not re-generate the same sequence nor does it have any other use.
You are allowed to count the generated elements, and only store a few of them (you may use sufficient memory to store a constant, O(1), number of elements).
Your mission, should you choose to accept it, is to propose an algorithm which will be able to make a uniform selection in the given scenario. That is, you will have to provide an algorithm which satisfies the requirement that every generated element has the same chance to get picked as the output of the algorithm.
For a sample size
k, create an arrayresultsof sizek. Each generated number isj, and is theith generated number.resultsis 1-indexed for simplicity.For the first
kelements generated, setresults[i] = j. For every element generated wherei > k, accept the element with a probability ofk / i. If accepted, replace a random element inresultswithj.The math:
For the first
kelements, the probability that any element from the black box is in your result set is easily calculated ask / i. For each elementjwherei > k, the probability thatjshould be kept isk / i; thus, the probability that any element seen before elementiis in the result set isk / (i - 1).The probability that any given existing element will be replaced at this point is the chance of
jbeing chosen times the chance of said element getting picked for removal (obviously1 / k), which is(k / i) * (1 / k) = 1 / i. The probability that the element is not chosen is one minus that, or(i - 1) / i.Thus, the probably for each element that it is in the result set after
irounds is(k / (i - 1)) * ((i - 1) / i), which is againk / i.This technique is known as “reservoir sampling.” You may be interested in “Random Sampling with a Reservoir”, a paper by Jeffrey Vitter written in 1985: http://www.cs.umd.edu/~samir/498/vitter.pdf
This looks interesting, thanks!
There’s a much simpler algorithm as well
Definitely
You can generate a random number
rfor every elementj, and keepjif itsris in the topkof rs generated so far. However, I think this would be slower than the original method due to having to manage thers, and it requires that you're allowed/able to store some extra data perj.If you're very interested in math, the PDF I linked above describes a reservoir sampling algorithm that runs in
O(n(1 + log(N/k)))time (whereNis the number of elements generated by the black box).Can you clarify what you mean by uniform selection? Do you mean on a range of [0,1)? If not, why cant you just output the random numbers as generated?
A uniform selection in this case means that every number generated by the given black box has the same probability to be selected. Keep in mind that you can only output (select) a single number from the generated sequence.
Using similar notation to that in the reddit thread:
Random rand = new Random()
// fails if box is empty
int current_choice = blackBox.GetNext();
int n_seen = 1;
while( blackBox.hasMore() )
{
int next_generated = blackBox.GetNext();
count++;
if( rand.GetNext() < 1.0/(count) ) {
current_choice = next_generated;
}
}
// current_choice is the final choice
It's pretty spiffy – you only need to remember one number.
To see why it works, think about a proof by induction. Imagine that we have generated 4 numbers, and each of them had a 1/4 chance of being the value we saved in current_choice. Then we generate value #5. There's a 1/5 chance it becomes the new value of current_choice, and there's a 4/5 chance we keep the old one. So the overall chance of one of the first four values being the current_choice at the end of the fifth pick is
(chance of being pick at end of 4) * (chance of stayin the picked value)
= 1/4 * 4/5 = 1/5
So the loop maintains an equal distribution at each step.
(I think I'd seen something like this before but wanted to try writing it up to see if I actually understood it. I.e. I'm not that clever.)