Sorting things out

— August 17, 2008 at 10:49 PDT

I recently packed up everything I own and moved. I'd lived in my old place for about nine years and I have the packrat gene on both sides of the family tree, so I had a lot of crap to sort through to figure out what to move and what to trash, as well as which box what should go in. Now that I'm here in the new place, I've had to sort through the remaining stuff to figure out where it all goes. So you might appreciate that sorting has been on my mind a lot lately. (See? It's a topical tie-in. I don't do those often, so I hope it wasn't too awkward.)

I've also been working on an application that does a lot of sorting to prioritize tasks in a workflow. These items often need to be sorted based on multiple criteria, such as how long an application has been waiting for approval, how many times a customer has been called recently, etc. We also have to sort names that have anywhere from two to four components (Hispanic names can have both paternal and maternal names instead of just a surname).

Ruby enumerables can be sorted using the #sort method, which orders contents using the <=> trinary comparison operator. It's pretty simple to override <=> if instances of a class are only ever sorted one way, but when you get to a situation where you need to sort things in different ways depending on the context, you need a more flexible solution

The next obvious thing to try is passing a block to the #sort method. This works great, but it has two drawbacks. One, if the sort criteria are complex, you can end up with some ugly looking code. And B, it's not very DRY for re-using sort criteria.

The approach I came up with for my application makes use of the sorting properties of Ruby arrays. Ruby arrays sort quite nicely using the <=> operator. <=> compares two arrays by comparing all their values until a non-equal result is found for any pairings of values in the two arrays. Here are some example comparisons:

>> [1, 2, 3] <=> [1, 2, 3]
=> 0
>> [0, 2, 3] <=> [1, 2, 3]
=> -1
>> [1, 2, 3] <=> [0, 2, 3]
=> 1
>> [1, 2, [4, 5]] <=> [1, 2, [6, 7]]
=> -1
>> [1, 2, [4, 5]] <=> [1, 2, [2, 3]]
=> 1

There are some things to watch out for, like don't have nil anywhere in the array contents, and shorter arrays sort less than longer arrays where all else is equal. But this is a great feature to build upon.

Based on the array sorting functionality, I defined a concept I call a sorter. A sorter is an array that holds multiple values to use to sort the collection by in order of precedence. For example, if you want to sort people by name using the precedence paternal, maternal, first, middle, you can construct an array of those values and sort the list using that.

class Person < ActiveRecord::Base
  def name_sorter
    [(paternal_name || ""), (maternal_name || ""), (first_name || ""), (middle_name || "")]

@people = Person.find(:all).sort_by { |p| p.name_sorter }

Notice I'm making sure that I never have a nil as a value in the sorter, cause that will make things blow up like the ending of a Die Hard movie. If you want to avoid that verbiage, set up the model so that the names default to empty strings. Also notice that I'm using the #sort_by method, instead of the #sort method with a 2-arg block. Actually, let's look at all our options. The following three forms will give equivalent results:

@people = Person.find(:all).sort_by { |p| p.name_sorter }
@people = Person.find(:all).sort_by(&:name_sorter)
@people = Person.find(:all).sort { |a,b| a.name_sorter <=> b.name_sorter }

The first line is the one I'd use in most situations. The second line is slightly more compact, but the symbol-to-proc coercion should be avoided in any code where you care about performance (at least until Ruby 1.9 is ready for prime time). The third line is usually going to be a bad choice, because a sorter array would be created for every time its person was compared, which will be roughly ln(n) times if you are willing to believe my naive assumptions about Ruby's sorting algorithm. If you don't want to trust naiveté, check out the simple performance comparison I ran. This benchmark sorts an array of 60 items using a sorter on 4 values.

                user     system      total        real
sort        6.430000   0.000000   6.430000 (  6.466645)
sort_by     1.510000   0.000000   1.510000 (  1.509017)

That kind of difference is pretty significant to me. Of course, if your sorter can be cached so you can reuse it for multiple comparisons, a #sort might be better. As always, measure it if you care and you're not sure.

One last thing. What if we just rolled the sort by hand using a block? (The code below assumes names will default to an empty string.)

@people = Person.find(:all).sort do |a,b|
  if    0 != (comp = a.last_name <=> b.last_name)
  elsif 0 != (comp = a.maternal_name <=> b.maternal_name)
  elsif 0 != (comp = a.first_name <=> b.first_name)
    a.middle_name <=> b.middle_name

That's fairly ugly and it's actually slower than the #sort_by approach too.

And because I don't believe in secret benchmarks, here is a gist of the benchmark code I used.

Now if you'll excuse me, I have a lot of empty cardboard boxes to toss in the recycling.

10 commentsruby, sorting

  1. Ryan McGeary2008-08-17 14:22:32

    Minor nitpick, Symbol#to_proc is already native in ruby 1.8.7. Good tip on the possible performance differences between #sort and #sort_by.

  2. Eric Mill2008-08-17 20:23:05

    That was straightforward and pretty. Those caveats (no nils, same length) are pretty significant, but if they're known and handled, that looks pretty great. I'm willing to bet you can do that last if/elsif block in a cleaner fashion, but I still doubt it can touch the sorter. Good idea.

    The topical tie-in wasn't awkward - after all, you discussed whether it was awkward as you went along, and that's known to diffuse awkwardness. I hope to continue the awkwardness diffusion by discussing the discussion of the awkwardness.

  3. Ray Baxter2008-08-17 21:33:04

    How about using a string instead of an array?

    class Person < ActiveRecord::Base
      def sort_string
        @sort_string ||= (paternal_name || "") + (maternal_name || "") + (first_name || "") + (middle_name || "")
    @people = Person.find(:all).sort { |a,b| a.sort_string <=> b.sort_string }
  4. Josh Susser2008-08-17 22:50:37

    @Ryan: Fair point. I wasn't sure if the 1.8.7 implementation was the same as the 1.9 version. Is it as performant? And I think even the 1.9 version isn't quite as fast as doing the block by hand.

    @Ray: That would work for string data, but not for numbers, dates, etc.

    @Eric: Don't get meta-recursive on me or I'll sic the moving drama demons on you.

  5. Aleksandr2008-08-17 23:18:22

    Are the parentalname, maternalname, firstname and middlename stored in the database? If yes, why don't you sort with sql?

  6. Mark Wilden2008-08-19 14:42:55

    1) I'm interested in the performance impact of

    @people = Person.find(:all).sort_by(&:name_sorter)

    Would the proc only be created once? If so, this seems clearer than the alternatives.

    2) The concatenation suggestion would not work on strings, unless each of the components were right-padded.

  7. Scott2008-08-20 22:27:13

    Assuming ruby is like perl (I'm a ruby newbie), you don't need the if-ladder or the temporary variable comp by taking advantage of the short-circuiting nature of ||:

    @people = Person.find(:all).sort do |a,b|
      a.last_name     <=> b.last_name         ||
      a.maternal_name <=> b.maternal_name     ||
      a.first_name    <=> b.first_name        ||
      a.middle_name   <=> b.middle_name

    That's not so ugly.

  8. Josh Susser2008-08-21 12:35:18

    @Scott: Ruby is unlike Perl in that regard. <=> returns -1, 0 or 1, and 0 is a trueish value, not falseish. Only false and nil are falseish values.

  9. Yossef2008-08-28 21:42:59

    I'm not sure 'trinary' is the right word there. That sounds like it compare three things rather than returns one of three values.

    Other than that and the horribly awkward topical tie-in (heh), nice article. Using Ruby's array sorting behavior for multi-attribute sorting is a great trick.

  10. Josh Susser2008-09-04 09:54:09

    Yossef: I used the word "trinary" in the sense of trinary logic for the return value. Ruby already has a "ternary operator" (boolean ? true_value : false_value), so I couldn't say that. I guess if I were a Ruby purist I'd call it the "spaceship operator", but I think that looks more like a flying saucer. Spaceships should have a front and a back, like the Enterprise or Moya. So until there is some kind of official name for <=> I'll have to muddle through and hope that the terminology police don't get me.

Sorry, comments for this article are closed.