Seeking an efficient algorithm to group identical values

In the past, I have had luck with my requests for help, so here is another one.

Suppose you have a large array made of a large number of distinct values ({A,B,A,B,A,C,C}) and you want to group the identical values like so {A,A,A,B,B,C,C} or like so {C,C, A,A,A,B,B}. That is, you do not care about the order, you just want identical values to be clustered. How do you do it?

  • You could sort the array assuming that there is an ordering between the objects. There are highly efficient external-memory sorting routines. They mostly rely on sequential IO and are amazingly fast.
  • You can build a hash table. Because it is a linear time operation, for very large arrays, it should be faster than sorting, in theory. However, the catch is that external-memory hash tables are not very efficient because they rely on random IO and are prone to cache misses. Remember kids that 100 n > n log n despite what your math. teacher taught you.
  • We can mix hashing and sorting. Scan the array, and randomly hash each value into one of L bins. You know that if the value x appears in bin i, then all values x are in the same bin. So, you can simply sort each bin and concatenate the bins in O(n log n/L) time, assuming your hashing is good enough.
  • One last possible trick might be to adapt fast duplicate detection algorithms such as the Teuhola-Wegner algorithm: J. Teuhola, L. Wegner, Minimal space, average linear time duplicate deletion, Communications of the ACM, 1991.

So, what do you think?

14 thoughts on “Seeking an efficient algorithm to group identical values”

  1. How large is the array, and are you interested in the algorithm or a solution? Either Hadoop or Pig would knock this out easily for very large arrays, not particularly efficiently, but quite scalable-y. In Pig this is is a GROUP BY: http://wiki.apache.org/pig/PigLatin

    Your algorithm above is the same as counting the number of occurrences of each item: A:3, B:2, C:2 . Seems like that should be doable with a simple hash with fairly minimal memory requirements. If it’s larger than fits in memory you can fairly easily use memcached to have a distributed hash. If it’s larger than that you probably want Hadoop.

    Btw, the roman numeral spam thing is getting old, how about something more exciting? A nice game of chess or a short essay on the fall of Rome? Here’s one I’ve heard works well: create a form field called “email” (name=”email”), with a label that reads “leave this blank”. Bots will fill it in with an email address, humans will leave it alone.

  2. Yes, I am overdue to update the captcha. But I am lazy and it works ok.

    No, the hash table cannot fit in memory (otherwise, the problem is trivial).

    Daniel: it needs to be exact.

  3. Here’s an idea based on counting sort / radix sort. Suppose that the values are W bits long, and that your internal memory is composed of N blocks of B values. Let K ~ log_2 N.

    Choose a hash function h(x) mapping the W bits of a data value to smaller K bit numbers. Scan the data values, counting how many times each value of h(x) occurs, and use these counts to allocate chunks of external memory to move the items to. Then scan the data values again, adding each value to a block of internal memory associated with its value of h(x); when one of these internal-memory blocks fills, send x to the corresponding chunk of external memory and start a new internal-memory block for the same value. Recurse (with a new hash function) within each chunk until you get down to chunks small enough to fit into internal memory at which point you can switch to hashing.

    This will run into trouble if any single value has more copies than will fit into main memory, because as described above it will recurse forever. But a simple modification fixes that: when you scan and count values of h(x), also check whether the data are all the same as each other (a trivial algorithm: store the first data item and compare it to all successive ones) and if so stop recursing.

  4. I think that David’s algo is similar to my “mix hashing and sorting” approach.

    Kevembuangga : I think it is cool that people try to design algorithms on a blog.

  5. Yeah, I think it’s pretty much the same idea as yours, changed only in that it does multiple levels of hashing instead of immediately switching to sorting, and with a little more detail in how you do the hashing part I/O-efficiently.

    I think the analysis should work out to something like log_{M/B} K passes over the data, where M is internal memory size, B is I/O block size, and K is the number of distinct keys. Probably improvable to log_{M/B}(K/M) with a better check for having a small number of distinct keys and an appropriate tall-cache assumption. Which is not much better than the log_{M/B}(n/B) that you get for external-memory sorting (achieved by M/B-way mergesort), but it’s interesting that you can do better at all — sorting is a bottleneck for a lot of external memory problems much more than it is for in-memory computation.

    I don’t see what’s wrong with brainstorming on a blog, by the way. It’s a little risky if one wants to hoard one’s intellectual property in order to get proper academic credit for publishing first, but I don’t especially care about that for this problem. As for the literature search, thanks for the pointer — my usual strategy is brainstorm first, figure out whether it was already known second, but I recognize that reasonable people may disagree on that prioritization.

  6. I’m not really an algorithms person, but…

    I probably would have just built a hash table, but if I can’t do that…

    I would simply treat the array as a string, start with the first character P, and use a regular expression to count the number n of occurrences of that character P , and then delete those occurrences from the string. Then I’d save $vals{P}=n;

    Then I would move to the next character and do the same thing, repeating the cycle until the string is empty.

    Then, for each character in %vals, I’d print n instances.

  7. It seems there are plenty of good enough algorithms, so it’s a question of whether we’re looking for the optimal algorithm or a workable solution.

    Memcached is easily distributable over several machines, and you can easily get 8G per box these days, so you can fit a lot in memory. If it’s larger than that Hadoop is what you want.

  8. It would be nice to have some more informations on the problem, approximate size of the key in bytes, total number of keys, acceptable RAM size.

    Parand: It seems there are plenty of good enough algorithms

    Mmmmm, no, there are plenty of close enough algorithms, AFAIK none matching the exact requirements of Daniel.

    D. Eppstein: I don’t see what’s wrong with brainstorming on a blog, by the way. It’s a little risky if one wants to hoard one’s intellectual property in order to get proper academic credit for publishing first,

    That’s not what I meant. I am not an academic, furthermore I am retired, so what do I care? 🙂

    It’s that the volume of messages exchange may be a bit large for a comments thread and that the true “sausage making” process doesn’t look pretty like in published papers or course lectures.
    You have to spout out a lot of silly ideas to possibly (but not eventually…) put your finger on the real nugget.

    To give you the flavor I will give it a try below.

    D. Eppstein: As for the literature search, thanks for the pointer — my usual strategy is brainstorm first, figure out whether it was already known second, but I recognize that reasonable people may disagree on that prioritization.

    Not really disagreeing, reinventing the wheel feels good because then, it’s your wheel (even if very similar to previously known ones), however to have more chances to come up with a better wheel it’s good to have a look at the existing ones.

    Trying a first draft, something along the lines of bucket sort could probably do, the trick being that most of the data should stay on disk rather than in RAM.
    I will assume that it’s acceptable to use an amount of disk workspace slightly larger than the total size of the keys (keysize x N).
    Depending on the available RAM choose a chunk size of 2 or 3 bytes (such that each chunk value can be used as an index into a RAM array) and split the keys into column slices 2 or 3 bytes wide, copying each slice into a distinct file for further processing.
    Unless the key is unusually long there should not be too many files.

    On the first pass (which can be done along the splitting of keys) keep in the RAM array a count of the number of hits for the chunk values of slice 0 and an “id” for this stream of values (allocated whenever the first such value appear), this makes for a 2 ints entry per chunk value, i.e.2 x 2**16 or 2**24 ints.
    Along with this write out for each key the id number of the stream it belongs to, allowing to retrieve that id later just from the rank of the key in the file.
    BTW, since this is all what is required for pass 1 slice 0 need not be written to disk.
    At the end of any pass if the count for a given stream id is 1 the key is unique and need no further consideration.

    Succeeding passes over key slices 1 to N are a bit more complicated and each may even need to be iterated into sub-passes.
    This is where memory constraints show up.
    [ I will continue (later) into another comment since the HTML textfield begins to screw up weirdly, too much text likely…]

  9. Continued…

    In the first pass there were only one stream as input: all yet undiscriminated keys.
    In succeeding passes the keys from each stream must be delt with separately from other streams (qualified by their already assigned stream number).
    That would mean as many count/id arrays as there are distinct streams, this is where memory constraints show up.
    Do you really expect such a large number of distinct keys that you could not host an id/count pair + a few housekeeping bytes for each group in RAM?
    If so even resorting to clever sparse tables won’t do.(like Tarjan’s or Fredman, Komlós, Szemerédi).

    Anyway, the base idea (from bucket sort) is that when processing the next key slice you read along the currently assigned key stream id from the id’s file created in the previous pass to discriminate the key chunk counting/stamping by previous id (major).
    This way all I/O is sequential, not random.
    If there is really no way to stuff all the new id/count arrays in RAM things get really ugly.
    You have to cram as many arrays as possible in RAM, then skip the processing of keys which don’t belong to the stream ids (from previous pass) that you have in core, create the new stream ids file by random access (though still in increasing disk locations thus not the very worst case) and make as many “sub-passes” as to process all key chunks.
    In the end the last stream id file will contain the id for each key as a plain integer and you still have to count the id occurrences for each class (this amount to having translated the arbitrary key sequence to a “clean” sequence of ids).
    If even the counts array cannot fit in RAM 🙁 , in the last pass (slice N) you will have to output the counts in a separate file at the end of each sub-pass.

    Further suggestions from the crowd?
    (asking for the famous “wisdom of crowds 😉 )

Leave a Reply

Your email address will not be published. Required fields are marked *