When associations aren't enough, part 2

— June 12, 2006 at 15:40 PDT

After I posted part 1 this morning, the inimitable Peter Cooper left a comment of a very different way to get the same results. I'd been thinking about using my approach to do tagging, and wasn't too happy with it. Doing two or three joins is one thing, but doing a dozen or so is another matter and likely to make a database cranky. And with a spiffy tagging UI, you can expect users to be filtering by a bunch of tags. Looks like my many-joins approach won't scale well.

Peter's approach uses grouping and then counting the number of matches. Pretty slick trick. Here's how the (ANSIfied) SQL for his approach looks:

SELECT * FROM movies
  INNER JOIN appearances a ON movies.id = a.movie_id
  WHERE (a.actor_id IN (17, 23, 39, 42))
  GROUP BY movies.id HAVING COUNT(movies.id) = 4

The join finds all the appearances for any actor in a movie. The WHERE filters that down to just the actors we care about, so our result set is now all the movies any of the Marx brothers appeared in. Then we group all the movies by their id, and count how many of our specified actors appeared in each movie. Only the movies that have the right number of actors are selected.

Here's the method for generating that query:

def self.find_with_all_actors(*actors)
  return [] if actors.empty?
  actors = actors.flatten
  find(:all, :readonly => false,
       :joins => "INNER JOIN appearances a ON movies.id = a.movie_id",
       :conditions => "a.actor_id IN (#{actors.map(&:id).join(', ')})",
       :group => "movies.id HAVING COUNT(movies.id) = #{actors.size}")

Note that if you're running edge, you can use this for the conditions:

        :conditions => ["a.actor_id IN (?)", actors],

I haven't had time to do any sort of performance comparison against the two approaches, but I'm pretty sure that for more than a few actors, Peter's approach will usually win. For a small number of actors, it may depend on a lot of factors.

The other nice thing Peter points out is that his approach lets you match most of the actors. If you change the HAVING to COUNT(movies.id) > 1, you'll find all the movies where at least two of the brothers appear together, not not necessarily all four. Or you can order them by number of actors appearing, etc.

Very nice. I'll probably use this technique for a tagging system I'm about to implement.

By the way, SF Ruby Meetup tonight. And just 10 days to RailsConf!

7 commentsassociations, rails

  1. keithm@infused.org2006-06-12 18:02:11

    This is much better than the first one, but I believe you need to flatten actors before checking to see if it's empty. Even if you pass an empty array into the method you've got an array containing an empty array.

  2. jaq2006-06-13 01:11:46

    Just in time to help with a design problem I was trying to figure out for Rails Day. Thanks!

  3. Peter Cooper2006-06-13 08:21:58

    Hey, thanks for the nod, Josh :) I didn't know if my solution would be to taste, but glad it's of some use. It took forever to come up with it (and mostly by accident) and has sat in Snippets since then as a nasty find_by_sql.. but your much cleaner way will certainly provide great inspiration for the next time I rewrite that code. Thanks!

  4. Peter Cooper2006-06-13 08:23:24

    One other (minor) thing I should point out.. someone told me ages ago that "IN" is often not optimized properly by MySQL (this may be a 4.x issue, though) and that doing WHERE (x = 1 OR x = 2 OR x = 3) is quicker than WHERE x IN (1,2,3).. but I'm only providing this info now in anticipation of someone bringing it up ;-)

  5. Mike2006-06-15 19:43:39

    I was going to say a similar thing as Carl, but with one sql query

    movieidswithallactors = Appearances.find(:all, :conditions => ["actorid in (?)",actors]).groupby { |x| x.movie_id }.select { |k,v| v.size == actors.size }.map { |x| x[0] }

    If you need the actual movie object(s), just throw an :include in the sql query.

    You can do actual set intersection with one similar sql query on loading all the actors' movies at once with an :include => movies and set intersecting the already loaded movies

    If i were going this frequently, i would pretty it up by replacing much of it with a method like Enumerable#elementsoccuringn_times or something in that vein, but you get the idea

  6. Pete Yandell2006-06-16 19:39:16

    Is there a way to make eager loading work with this approach?

    To illustrate the problem, let's say I have a HABTM relationships between 'tags' and 'things', and I have this code in Thing.rb:

    def self.find_by_tags(*tags)
      return [] if tags.empty?
      tags = tags.flatten
        :joins      => "INNER JOIN tags_things AS t ON t.thing_id = things.id",
        :conditions => ["t.tag_id IN (?)", tags.map(&:id)],
        :group      => "things.id HAVING COUNT(things.id) = #{tags.size}",
        :order      => "things.updated_at DESC"
        # :include  => :tags

    This produces the correct SQL:

    SELECT *
      FROM things INNER JOIN tags_things AS t ON t.thing_id = things.id
      WHERE (t.tag_id IN (16,29))
      GROUP BY things.id HAVING COUNT(things.id) = 2
      ORDER BY things.updated_at DESC

    But, if I uncomment the :include above in an attempt to retrieve all the tags associated with each thing, the following (paraphrased) SQL is generated:

    SELECT things.`id` AS t0_r0, ..., things.`updated_at` AS t0_r6, tags.`id` AS t1_r0, tags.`name` AS t1_r1
      FROM things LEFT OUTER JOIN tags_things ON tags_things.thing_id = things.id
                  LEFT OUTER JOIN tags ON tags.id = tags_things.tag_id
                  INNER JOIN tags_things AS t ON t.thing_id = things.id
      WHERE (t.tag_id IN (16,29))
      ORDER BY things.updated_at DESC

    Note the lack of a GROUP BY clause. Also note that simply adding the GROUP BY clause to this query will produce incorrect results.

    Anyone think of a good way to get around this, or should I simply revert to the approach described in part 1?

  7. AaronT2006-06-17 02:19:49

    had an issue with the returning an empty array if the array pased in was empty.....

    solved it with

    projects = projects.flatten unless projects.empty?
    return [] if projects.empty?
    projects = projects.flatten

Sorry, comments for this article are closed.