A Blog Less Ordinary

The blog of Dave Ingram

In-place array uniq in C

I’ve been developing Insight even though the uni project has come to an end, because it’s fun! I also want to make it more stable and eventually release it under an open-source licence of some kind. There will be an update coming soon, I promise! I now have Internet, so I can write up some things… Anyway, one of the interesting things I wanted to do for Insight was an in-place form of uniq for an array, ideally without any additional memory allocation. It seems that this is something nobody else has done yet! So I set about doing it myself… For those of you who are unfamiliar with the Linux/UNIX command uniq, it takes a sorted list and removes any duplicates. This is almost exactly what I’m trying to do, with one caveat: I need to keep the “discarded” duplicates. What happens is that I have an array containing a number of strings, and these have all been dynamically allocated via malloc() or calloc(). If I just remove or overwrite their pointers, they’ll vanish and cause a memory leak. While I’ve now fixed a large number of leaks thanks to Valgrind, I’m trying to actively avoid any possibility of adding them. Read on for the details…

The Idea

The basic idea is that we have a sorted list of items that may have some duplicates, say:

{A0, A1, B0, C0, C1, E0, F0, F1, F2, G0, H0, H1}

and we want to remove all of the duplicates, so we get two sets:

{A0, B0, C0, E0, F0, G0, H0} {A1, C1, F1, F2, H1}

Note that it doesn’t matter which of the duplicates ends up in which set, just that the items in the first are all unique and that we don’t lose any. Also, A0 and A1 have the same value but the subscript will help to distinguish exactly which of the As I’m talking about. Now we have the problem, let’s look at some solutions.

Lossless solution – additional lists

So the first solution uses two temporary lists, unique and duplicate. We start with two pointers into the original list: p and q. We set p to point to the start of the list, and q to point to the element after p. We then copy the item p points at to our unique list. Then, while we still have more items to examine:

We keep doing this until we get to the end of the array, and then copy the unique list to the start of the array, the duplicate list after that (both overwriting the previous contents) and set the number of unique items to the length of the unique

list.

Lossy solution – one list

We can do a similar thing in-place, although this will result in data loss. The essential idea is:

Now this is quite simple, and only requires one pass through the list, but it results in the loss of information, which is unacceptable in this case.

Lossless solution – one list

For this solution, we can also do it with one pass through the list and two pointers. The basic intuition is that we’re going through the array, accumulating a rotating block of stuff that we don’t want. At any given point:

At the end, we return the number of unique items at the start of the list; everything above that is a duplicate.

For integers

Here is some pseudo-Java that performs my algorithm for a set of integers.


int set_uniq(int set[]) {
  int count = set.length();
  int tmp, p=0, q=1;
  while (p < set.length && q < set.length) {
    while (p < set.length && set[p] != set[q]) {
      p = p + 1;
      if (q < set.length) {
        if (p < q) {
          tmp = set[p];
          set[p] = set[q];
          set[q] = tmp;
        }
        q = q + 1;
      }
      else if (count > p) {
        count = p;
      }
    }
    while (q < max && set[p] == set[q]) {
      q = q + 1;
    }
  }
  // q hit the end and is still the same as p
  if (count > p)
    count = p + 1;

  return count;
}

The general code

The arguments to the general set_uniq() function are very similar to the arguments to standard qsort():

So, without further ado, my code:


int set_uniq(void *set,
    size_t count,
    size_t elem_size,
    int (*cmp)(const void*, const void*)) {
  void *p, *q, *max;
  void *tmp=calloc(1, elem_size); /* TODO: check for failure */
  p=set;
  q=set+elem_size;
  max=set+(count*elem_size);
  while (p< p-set)
      count = (p-set)/elem_size;
    }
    while (q p-set)
    count = 1 + (p-set)/elem_size;
  ifree(tmp);
  return count;
}

Conclusion

This code is released under a Creative Commons licence (see below). A similar idea will work well for other solutions, like set difference. I’ll post those later if anyone really cares. Also, there is a better way of swapping the items by using the fact that:


a ^= b;
b ^= a;
a ^= b;

swaps a and b without needing a temporary variable. If you just go along the data in 32-bit chunks, performing those XOR operations to swap the data. This actually works out to be 30% or so faster (if I remember correctly). Finally – general code like this is always going to be slower than code that’s been specialised for a particular purpose. Calling this on integers, for example, will be much slower than coding an integer-specific version. I should also point out that you can just overwrite integers, as you don’t need to free them 😉

One Response to In-place array uniq in C

Jakk Bloggs says: December 9, 2009 at 15:18

Awesome thanks 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

*

*

 

GitHub Google+ Twitter