Suppose that you have to sample a student at random in a school. However, you cannot go into a classroom and just pick a student. All you are allowed to do is pick a classroom, and then let the teacher pick a student at random. The student is then removed from the classroom. You may then have to pick another student at random, this time with one classroom missing a student. Importantly, I care about the order in which the students are picked. But I also have limited memory: there are too many students for me to remember them individually. I can barely remember how many students I have per classroom.
If the classrooms contain variable numbers of students, you cannot just pick classrooms uniformly at random: students in crowded classrooms would be less likely to be picked than students in small classrooms.
So instead, you can do it as follows. If there are N students in total, pick a number x between 1 and N. If the number x is smaller than or equal to the number of students in the first classroom, pick the first classroom and stop. If the number is smaller than or equal to the number of students in the first two classrooms, pick the second classroom, and so forth. When you remove the student, you have to account for the fact that the total (N) decreases and that the classroom's count must be updated.
while (N > 0) {
  int y = r.nextInt(N); // random integer in [0, N), where N counts the remaining students
  int runningsum = 0;
  int z = 0;
  for (; z < runninghisto.length; z++) {
    runningsum += runninghisto[z];
    if (y < runningsum) {
      break;
    }
  }
  // found z: classroom z gives up a student
  runninghisto[z] -= 1;
  N -= 1;
}
If you have many classrooms, this algorithm is naive: each time you pick a student, you may need to iterate over all of the classroom counts. Suppose you have K classrooms: you may need on the order of N K operations to solve the problem.
Importantly, the problem is dynamic: the number of students in each classroom is not fixed. You cannot simply precompute some kind of map once and for all.
Can you do better than the naive algorithm? I can craft a “logarithmic” algorithm, one where the number of operations per pick grows with the logarithm of the number of classrooms. I believe that my algorithm is efficient in practice.
So divide your classrooms into two sets containing R and S students respectively, so that R + S = N. Pick an integer in [0,N). If it is smaller than R, decrement R and pick a classroom in the corresponding set. If it is larger than or equal to R, then go with the other set of classrooms. We have thus halved the size of the problem. We can repeat the same trick recursively: divide the two subsets of classrooms and so forth. Instead of merely maintaining the total number of students in all of the classrooms, you keep track of the counts in a tree-like manner.
We can operationalize this strategy efficiently in software:
- Build an array containing the number of students in each classroom.
- Replace every two counts by their sum: instead of storing the number of students in the second classroom, store the number of students in the first two classrooms.
- Replace every four counts by their sum: where the fourth classroom's count was, store the sum of the number of students in the first four classrooms.
- And so forth.
In Java, it might look like this:
int level = 0;
for (; (1 << (level + 1)) <= runninghisto.length; level++) {
  for (int z = (1 << (level + 1)) - 1; z < runninghisto.length; z += (1 << (level + 1))) {
    runninghisto[z] += runninghisto[z - (1 << level)];
  }
}
int maxlevel = level - 1;
You effectively have a tree of prefix counts which you can then use to find the right classroom without visiting all of the classroom counts. You descend a tree whose height is the logarithm of the number of classrooms, and at each level you may have to decrement a counter.
while (N > 0) {
  int y = r.nextInt(N);
  // select in logarithmic time by walking down the tree
  int level = maxlevel;
  int offset = 0;
  for (; level >= 0; level -= 1) {
    int leftcount = runninghisto[offset + (1 << level) - 1]; // students in the left half
    if (y < leftcount) {
      // the student is in the left half: its count shrinks by one
      runninghisto[offset + (1 << level) - 1] -= 1;
    } else {
      // skip over the left half and continue in the right half
      y -= leftcount;
      offset += (1 << level);
    }
  }
  // found offset: classroom offset is selected
  N -= 1;
}
I have written a Java demonstration of this idea. For simplicity, I assume that the number of classrooms is a power of two, but this limitation could be lifted by doing a bit more work. Instead of N K operations, you can now do only N log K operations.
Importantly, the second approach uses no more memory than the first one. All you need is enough memory to maintain the histogram.
My Java code includes a benchmark.
Update. If you are willing to use batch processing and a little bit of extra memory, you might be able to do even better in some cases. Go back to the original algorithm. Instead of picking one integer in [0,N), pick C of them, for some batch size C. Then find out in which room each of these C integers fits, as before, decrementing the counts. If you pick the C integers in sorted order, a single sequential scan is enough. For C small compared to N, you can pick the C sorted integers in linear time using efficient reservoir sampling. Then you can shuffle the result using Knuth’s random shuffle, in linear time, to restore a random order. So you get to do about N K / C operations. (Credit: Travis Downs and Richard Startin.)
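Here is a minimal sketch of one batch of this idea, in Java. It is an illustration rather than benchmarked code: the names are mine, it assumes C is small compared to N, and it picks the C distinct positions by rejection plus an explicit sort instead of the linear-time method mentioned above.

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch: pick C students (C <= N) without replacement in one batch,
// writing the classroom of each picked student into output, in random order.
// The caller is responsible for reducing N by C before the next batch.
static void sampleBatch(int[] histo, int N, int C, int[] output, Random r) {
  // 1. Pick C distinct positions in [0, N), sorted. (A linear-time
  //    sorted sample is possible; we keep it simple here.)
  Set<Integer> chosen = new HashSet<>();
  while (chosen.size() < C) {
    chosen.add(r.nextInt(N));
  }
  int[] positions = chosen.stream().mapToInt(Integer::intValue).sorted().toArray();
  // 2. A single sequential scan maps the sorted positions to classrooms.
  int room = 0;
  int cumulative = histo[0];
  for (int i = 0; i < C; i++) {
    while (positions[i] >= cumulative) {
      room += 1;
      cumulative += histo[room];
    }
    output[i] = room;
    histo[room] -= 1; // remove the student, for subsequent batches
  }
  // 3. Sorting destroyed the random order; a Knuth shuffle restores it.
  for (int i = C - 1; i > 0; i--) {
    int j = r.nextInt(i + 1);
    int tmp = output[i]; output[i] = output[j]; output[j] = tmp;
  }
}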
Further reading: Hierarchical bin buffering: Online local moments for dynamic external memory arrays, ACM Transactions on Algorithms.
Follow-up: Sampling Whisky by Richard Startin
Vose’s Alias method (see https://www.keithschwarz.com/darts-dice-coins/) is another approach. It works in constant time with linear time preprocessing using only simple data structures.
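(For context, here is a minimal sketch of the alias method for the with-replacement case; the class name and structure are mine, not taken from the linked page.)

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Minimal sketch of Vose's alias method, for sampling WITH replacement.
// Weights are the classroom counts; preprocessing is linear, each sample
// takes constant time.
final class AliasSampler {
  private final double[] prob; // probability of keeping the slot itself
  private final int[] alias;   // fallback classroom for each slot
  private final Random rng = new Random();

  AliasSampler(int[] counts) {
    int k = counts.length;
    long total = 0;
    for (int c : counts) { total += c; }
    prob = new double[k];
    alias = new int[k];
    double[] scaled = new double[k];
    Deque<Integer> small = new ArrayDeque<>();
    Deque<Integer> large = new ArrayDeque<>();
    for (int i = 0; i < k; i++) {
      scaled[i] = (double) counts[i] * k / total;
      if (scaled[i] < 1.0) { small.push(i); } else { large.push(i); }
    }
    while (!small.isEmpty() && !large.isEmpty()) {
      int s = small.pop();
      int l = large.pop();
      prob[s] = scaled[s];
      alias[s] = l; // slot s redirects its leftover mass to classroom l
      scaled[l] = (scaled[l] + scaled[s]) - 1.0;
      if (scaled[l] < 1.0) { small.push(l); } else { large.push(l); }
    }
    while (!large.isEmpty()) { prob[large.pop()] = 1.0; }
    while (!small.isEmpty()) { prob[small.pop()] = 1.0; }
  }

  int sample() { // constant time: one slot pick plus one biased coin flip
    int slot = rng.nextInt(prob.length);
    return rng.nextDouble() < prob[slot] ? slot : alias[slot];
  }
}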
As you may notice in my blog post, I remove the students from the classrooms (without replacement). Do you know how to apply the Alias method to this scenario?
My expectation is that the Alias method is not applicable to the scenario I describe.
Apologies for the confusion, I did indeed miss the fact that you are sampling without replacement. The dynamic case is an interesting one, thanks for introducing it.
I was aware of the Alias method: I should have included it in the post to avoid confusion.
Thanks as usual for the informative blog post! Very minor point, but I believe you can remove these lines 34+35 from your sample Java program here
int l = 0;
while((1<<l) < histo.length) { l++; }
I fixed it.
This problem sounds related to adaptive arithmetic coding, and the newer asymmetric numeral systems (ANS).
Perhaps I am missing something fundamental in the problem description, but it seems to me that (a) repeatedly selecting students at random from classrooms, each of whose probability of being chosen is proportional to the number of students remaining in it, is identical to (b) simply making a list of all students, regardless of classroom, and shuffling that list randomly.
That is, it looks like the probabilities have been carefully defined so that each student always has a 1/k chance of being chosen, where k is total number of unchosen students at each iteration. How is that different from a straight shuffle or permutation operation?
“simply making a list of all students, regardless of classroom, and shuffling that list randomly” => This requires N units of storage.
“How is that different from a straight shuffle or permutation operation?” => You are not allowed to do memory allocation beyond the maintenance of the histogram.
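(For contrast, here is a sketch of the straight shuffle being ruled out; note the array with one entry per student, i.e., the N units of storage.)

import java.util.Random;

// Sketch of the straight shuffle: it needs an N-entry array (one slot per
// student) whereas the histogram-based approach needs only K counters.
static int[] shuffleAllStudents(int[] histo, Random r) {
  int N = 0;
  for (int c : histo) { N += c; }
  int[] students = new int[N]; // N units of storage
  int pos = 0;
  for (int room = 0; room < histo.length; room++) {
    for (int j = 0; j < histo[room]; j++) {
      students[pos++] = room; // label each student by its classroom
    }
  }
  for (int i = N - 1; i > 0; i--) { // Knuth (Fisher-Yates) shuffle
    int j = r.nextInt(i + 1);
    int tmp = students[i]; students[i] = students[j]; students[j] = tmp;
  }
  return students;
}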
Aha! I see. So the classrooms must be treated as black boxes. That is, we don’t know the names or identities of the students in the rooms until after they are chosen; at the outset, we only know the number of students in each room, or the cardinality of each set we’re drawing from. It is then up to the teacher (or some agent of the room) to choose a student at random when we choose a room.
So the goal is to use at most O(K) space, where K is the number of classrooms?
And the solution presented in the post runs in O(N log K) time, and we seek to know whether an alternative solution runs in less time or if this is already optimal?
We know that if you are given a bit more storage, you can do better than O(N log K) time. It seems that O(N) is entirely within reach, but with extra storage. I’d be interested to know whether one can do better using only the K units of storage (please don’t turn it into O(K) units).
Here is a thought: If we let S[k] be the number of students in classroom k (1 ≤ k ≤ K), then depending on the shape of the curve traced by the tuples (k,S[k]), it may be possible to model the curve in such a way that S[k] can be roughly predicted from k, and vice-versa.
That is, given a random number n (1 ≤ n ≤ N), we can compute two boundaries k₁ and k₂ from which to encapsulate a binary search for the exact k from which the cumulative sum of S[k] terms represent the value n.
Ultimately, however, this is still a binary search for the ordinal number of classroom implied by n—and thus still O(N log K)—and is probably not likely to be faster in practice, unless the shape of the curve of classroom sizes is ideally modelable.
Actually, Todd, I think that this may well work in practice.
https://en.wikipedia.org/wiki/Interpolation_search
I am just not quite sure how to adapt it.
Do you have, by any chance, workable pseudocode?
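(One hypothetical shape the idea could take, assuming classroom sizes stay close to uniform; the names and the slack parameter are illustrative, not from the thread.)

// Hypothetical sketch: predict the classroom from n by assuming roughly
// uniform sizes, widen the bracket if the prediction is off, then binary
// search inside the bracket. cumulative[k] holds the number of students
// in classrooms 0..k-1, so cumulative[0] = 0 and cumulative[K] = N.
// slack >= 1 bounds how far the true curve may deviate from the model.
static int findRoom(int[] cumulative, int n, int slack) {
  int K = cumulative.length - 1;
  int N = cumulative[K];
  int mid = (int) ((long) n * K / N);    // predicted classroom
  int lo = Math.max(0, mid - slack);     // boundary k1
  int hi = Math.min(K - 1, mid + slack); // boundary k2
  while (cumulative[lo] > n) { lo = Math.max(0, lo - slack); }          // widen left
  while (cumulative[hi + 1] <= n) { hi = Math.min(K - 1, hi + slack); } // widen right
  while (lo < hi) { // find k with cumulative[k] <= n < cumulative[k+1]
    int m = (lo + hi) >>> 1;
    if (cumulative[m + 1] <= n) { lo = m + 1; } else { hi = m; }
  }
  return lo;
}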
Your benchmark is a bit misleading: if you want all the students, just put all of them in the output and shuffle.
If you want only a fraction of the students (say, 3 out of 4), and the distribution is not too skewed, a simple linear search with rejection sampling works well in practice.
public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
  int[] currentHisto = Arrays.copyOf(histo, histo.length);
  int[] runningHisto = new int[histo.length + 1];
  System.arraycopy(histo, 0, runningHisto, 1, histo.length);
  for (int i = 2; i <= histo.length; i++) {
    runningHisto[i] += runningHisto[i - 1];
  }
  int sum = runningHisto[histo.length];
  int averageSize = (sum / histo.length) + 1; // +1 keeps the guess in bounds
  for (int pos = 0; pos < outputSize; ++pos) {
    int i = -1;
    while (true) {
      int y = r.nextInt(sum);
      i = y / averageSize; // initial guess; hope the data is not too skewed
      while (runningHisto[i] > y) i -= 1;
      while (runningHisto[i + 1] <= y) i += 1;
      if (currentHisto[i] > y - runningHisto[i]) { // student still present?
        currentHisto[i] -= 1;
        break; // accept
      } // otherwise reject and retry
    }
    output[pos] = i;
  }
  return outputSize;
}
I obtain the following benchmarks on my machine:
N=1024, M=100000
outputSize = s / 4
naive 260.04551556738215
smarter 74.68069439333325
reject 40.423464682798844
outputSize = 3 * s / 4
naive 258.509821210682
smarter 76.47785267880008
reject 65.39425675586646
You may want all students but may not be able, at any one time, to allocate memory. Thanks for your code.
To avoid the penalty when currSum is low, you can also recompute the runningHisto regularly. Something like:
public static int rejectRandomSample(int[] histo, int[] output, int outputSize) {
  int[] currentHisto = Arrays.copyOf(histo, histo.length);
  int[] runningHisto = new int[histo.length + 1];
  int sum = 1;
  int currSum = 0;
  int averageSize = 0;
  for (int pos = 0; pos < outputSize; ++pos) {
    if (sum * .95 > currSum) { // too many removals: rebuild the prefix sums
      System.arraycopy(currentHisto, 0, runningHisto, 1, histo.length);
      for (int i = 2; i <= histo.length; i++) {
        runningHisto[i] += runningHisto[i - 1];
      }
      sum = runningHisto[histo.length];
      currSum = sum;
      averageSize = (sum / histo.length) + 1;
    }
    int i = -1;
    while (true) {
      int y = r.nextInt(sum);
      i = y / averageSize; // hope the data is not too skewed
      while (runningHisto[i] > y) i -= 1;
      while (runningHisto[i + 1] <= y) i += 1;
      if (currentHisto[i] > y - runningHisto[i]) {
        currentHisto[i] -= 1;
        break;
      }
    }
    output[pos] = i;
    currSum -= 1;
  }
  return outputSize;
}
gives me
naive 250.25992953739302
smarter 74.97430178092274
reject 42.24443463267574
Just FTR, the last timings are for “outputSize = s” which means get all students.