Suppose that you are given two arrays. Maybe you have a list of cities from the USA and a list of cities from Europe. You want to generate a new list which mixes the two lists, taking a sample from one array (say 50%), and a sample from the other array (say 50%). So if you have 50 cities from the USA and 50 cities from Europe, you want a new array that contains, in random order, 25 cities from the USA and 25 cities from Europe.

We need this kind of mixed sampling all the time in machine learning or data science. This summer, I was running simulations and the bulk of the time was spent mixing arrays. I need to pick, say, 25% of all elements from one array and combine them with, say, 75% of all elements from another array.

There are many bad ways to solve this problem. But here is a reasonable one. First you pick a sample from the first array using reservoir sampling; then you pick a sample from the other array (again using reservoir sampling), and you finally apply a random shuffle to the result.

Reservoir sampling is an efficient way to sample N values from an array:

for (i = 0; i < N; i++) { output[i] = source[i]; } for (; i < ...; i++) { r = random_bounded(i);// value in [0,i) if (r < howmany) { output[r] = source[i]; } }

Knuth shuffling is an efficient way to randomly shuffle the elements in an array:

for (i = size; i > 1; i--) { r = random_bounded(i);// value in [0,i) swap(array[i-1], array[r]); }

With these two algorithms in place, I can sample from two source arrays using three function calls:

reservoirsampling(output, N1, source1, length1); reservoirsampling(output + N1, N2, source2, length2); shuffle(output, N1 + N2);

So how efficient is it? Suppose that I have two arrays made of a million elements each and I want to sample half a million elements from each. On my iMac, I use a bit over 12 CPU cycles per input element (so about 24 million cycles in total). You probably can go even faster, but this approach has the benefit of being both simple and efficient.

I don’t think the reservoir sampling is suitable here: the technique was explicitly designed to handle situations where the exact population size, N, is not known upfront. In particular it requires N steps in every N-choose-K scenario, even when only K steps are necessary. I believe Floyd’s sampling algorithm should work better, since it is explicitly O(K).

You are correct that reservoir sampling is applicable even when the size of the array is unknown. I don’t think that it makes it unadvisable in other instances.

In my context, I am interested in sampling a fixed fraction of the input array (e.g., 50%). In such cases, the best possible complexity is linear in the size of the input.

My choice of algorithms and my implementation is surely not optimal, and certainly not in all cases. I’d be very interested in a new, better implementation that can beat the 12 CPU cycles per input element I get. Can you please provide a code sample that I can benchmark?

I apologize for jumping to conclusions way too quickly and for shouting too loudly. 🙁

You are certainly correct that changing the sampling algorithm would buy absolutely nothing in the 50% case. Specifically, reservoir sampling requires N-K RNG calls while Floyd’s requires K. Obviously, these are exactly same numbers.

I am very sorry for not paying attention.

Your comment was interesting and forced me to elaborate on the context.

When sampling small numbers of values, you are quite correct that my approach would be inefficient.

Thanks for commenting.

This seems like a good opportunity for the

`Merge`

algorithm from the merge shuffle paper. It merges two shuffled arrays into a single shuffled array. (It can also be vectorized with AVX and likely quite well with AVX512).Unfortunately it requires N-permutations of the two source arrays. Vectorized shuffling would help but with large arrays random memory access might slow things down.

A quick impl of

`Merge`

+`sse41_xorshift128plus_partialshuffle32`

shows ~3.8 speedup, using your code.Turns out a vectorized reservoir sampling (with

`_mm_maskstore_ps`

) is even faster, by about 2.5x.