I have heard the following puzzle from a colleague I work with. It should be fairly simple, but is still pretty interesting to solve. The solution technique might come in handy for solving various other puzzles, and even for real-world problems. But enough with the introduction, let us cut to the chase.

You are given a black box that is able to generate numbers. You know nothing about the range of these numbers.

At some point, this black box will be done generating the numbers and will not be able to produce any more of them. You obviously can not know in advance how many numbers are going to be generated.

Once all numbers have been generated, the black box can no longer be used. It can not re-generate the same sequence nor does it have any other use.

You are allowed to count the generated elements, and only store a few of them (you may use sufficient memory to store a constant, O(1), number of elements).

Your mission, should you choose to accept it, is to propose an algorithm which will be able to make a uniform selection in the given scenario. That is, you will have to provide an algorithm which satisfies the requirement that every generated element has the same chance to get picked as the output of the algorithm.

For a sample size

`k`

, create an array`results`

of size`k`

. Each generated number is`j`

, and is the`i`

th generated number.`results`

is 1-indexed for simplicity.For the first

`k`

elements generated, set`results[i] = j`

. For every element generated where`i > k`

, accept the element with a probability of`k / i`

. If accepted, replace a random element in`results`

with`j`

.The math:

For the first

`k`

elements, the probability that any element from the black box is in your result set is easily calculated as`k / i`

. For each element`j`

where`i > k`

, the probability that`j`

should be kept is`k / i`

; thus, the probability that any element seenbeforeelement`i`

is in the result set is`k / (i - 1)`

.The probability that any given existing element will be replaced at this point is the chance of

`j`

being chosen times the chance of said element getting picked for removal (obviously`1 / k`

), which is`(k / i) * (1 / k) = 1 / i`

. The probability that the element isnotchosen is one minus that, or`(i - 1) / i`

.Thus, the probably for each element that it is in the result set after

`i`

rounds is`(k / (i - 1)) * ((i - 1) / i)`

, which is again`k / i`

.This technique is known as “reservoir sampling.” You may be interested in “Random Sampling with a Reservoir”, a paper by Jeffrey Vitter written in 1985: http://www.cs.umd.edu/~samir/498/vitter.pdf

This looks interesting, thanks!

There’s a much simpler algorithm as well

Definitely You can generate a random number

`r`

for every element`j`

, and keep`j`

if its`r`

is in the top`k`

of rs generated so far. However, I think this would be slower than the original method due to having to manage the`r`

s, and it requires that you're allowed/able to store some extra data per`j`

.If you're very interested in math, the PDF I linked above describes a reservoir sampling algorithm that runs in

`O(n(1 + log(N/k)))`

time (where`N`

is the number of elements generated by the black box).Can you clarify what you mean by uniform selection? Do you mean on a range of [0,1)? If not, why cant you just output the random numbers as generated?

A uniform selection in this case means that every number generated by the given black box has the same probability to be selected. Keep in mind that you can only output (select) a single number from the generated sequence.

Using similar notation to that in the reddit thread:

Random rand = new Random()

// fails if box is empty

int current_choice = blackBox.GetNext();

int n_seen = 1;

while( blackBox.hasMore() )

{

int next_generated = blackBox.GetNext();

count++;

if( rand.GetNext() < 1.0/(count) ) {

current_choice = next_generated;

}

}

// current_choice is the final choice

It's pretty spiffy – you only need to remember one number.

To see why it works, think about a proof by induction. Imagine that we have generated 4 numbers, and each of them had a 1/4 chance of being the value we saved in current_choice. Then we generate value #5. There's a 1/5 chance it becomes the new value of current_choice, and there's a 4/5 chance we keep the old one. So the overall chance of one of the first four values being the current_choice at the end of the fifth pick is

(chance of being pick at end of 4) * (chance of stayin the picked value)

= 1/4 * 4/5 = 1/5

So the loop maintains an equal distribution at each step.

(I think I'd seen something like this before but wanted to try writing it up to see if I actually understood it. I.e. I'm not that clever.)