When associations aren't enough

— June 12, 2006 at 09:55 PDT


ActiveRecord associations are the neatest thing since sliced bread. They are great for simplifying a common set of relational database access patterns, something even sliced bread can't do. ActiveRecord associations provide a lot of flexibility, but there comes a point there they run out of steam and you have to roll up your sleeves and write some custom SQL.

Last week I read a question on the rails email list that had, I think, a rather interesting solution. I think it's a good example of how to use custom SQL without throwing all of ActiveRecord out the window and resorting to find_by_sql. I'm going to answer that question again here because it deserves more attention than I was able to give it in just a quick email. I also wanted to test my answer and make sure everything works correctly (which it does now).

IMDb is one of my best friends on the internet. One of its nifty features is the ability to find all the movies in which several actors appear together. How would we go about doing something like that using ActiveRecord associations? Actors appearing in movies is quite similar to the example of dancers in movies that I used in ye olde Many-to-many Dance-off, so we can start with pretty much the same tables I used there, just turning dancers into actors (like that would be the first time that ever happened).

With either of has_and_belongs_to_many or has_many :through you can find all the movies for a given actor simply by using the magic accessor actor.movies, and all the actors for a movie with movie.actors. But there's no simple way to use the standard association machinery to find the set of movies in which several actors appear. The naive approach would be to use a condition that compares the actor_id against the ids of several actors, but that just won't work.

movies = Movie.find(:all, :conditions => "actor_id = 17 AND actor_id = 23")

First off, there is no actor_id in the Movie model - it's in the join table! (Or in the join model's table if using has_many :through.) So if we mess around and get the join table into the SQL, we'll just be looking for rows in the join table where the actor_id is both 17 and 23 at the same time, and that only happens in Bizarro world. What we need to do is to join with the join table multiple times. If you know SQL well then this won't be at all impressive. But if you're a SQL novice it might seem like some kind of black magic. Don't worry, it's not that difficult.

Say we want to find all the Marx Brothers' movies.

movies = Movie.find_with_all_actors([groucho, harpo, chico, zeppo])

Time to write some custom SQL. We can build a query using SQL's ability to join to the same table more than once in a single query. You have to alias the table to a different name for each join, but other than that the rest is straightforward.

SELECT * FROM movies
  INNER JOIN appearances ap0 ON movies.id = ap0.movie_id
  INNER JOIN appearances ap1 ON movies.id = ap1.movie_id
  INNER JOIN appearances ap2 ON movies.id = ap2.movie_id
  INNER JOIN appearances ap3 ON movies.id = ap3.movie_id
  WHERE (ap0.actor_id = 17 AND ap1.actor_id = 23 AND
         ap2.actor_id = 39 AND ap3.actor_id = 42)

I'm using the join model table appearances in that SQL, but for a habtm join table you could just change that to actors_movies and it would work the same. No need to join to the actors table because we aren't searching on any of those attributes, just the actor_id in the join model. If we run that query we'll get back all the movies that all four of the Marx Brothers appeared in. Duck soup!

Let's write the method to generate that SQL. But wait, we're going to take a slight detour and before we write the method and write some simple tests for it. We'd like the method to work correctly whether we give it no arguments, a single actor, an array of actors, or just a bunch of actors not in an array.

assert_equal [], Movie.find_with_all_actors()
assert_equal 12, Movie.find_with_all_actors(groucho).length
assert_equal  7, Movie.find_with_all_actors([groucho, harpo, chico, zeppo]).length
assert_equal  7, Movie.find_with_all_actors(groucho, harpo, chico, zeppo).length

Alright, now let's write the method to generate that SQL. Since we're looking for movies, we'll create a class method in the Movie model:

def self.find_with_all_actors(*actors)
  return [] if actors.empty?
  join_frags, condition_frags = [], []
  actors.flatten.each_with_index do |actor,i|
    join_frags << "INNER JOIN appearances ap#{i} ON movies.id = ap#{i}.movie_id"
    condition_frags << "ap#{i}.actor_id = #{actor.id}"
  end
  find(:all, :readonly => false,
       :joins => join_frags.join(" "),
       :conditions => condition_frags.join(" AND "))
end

That method isn't all that complex, but I'll break it down and walk through it so it's clear. Note that if actors is empty we just return an empty Array. After that we create some arrays to collect the SQL fragments in as they are generated.

  return [] if actors.empty?
  join_frags, condition_frags = [], []

Then comes the meat of the method. We enumerate the actors and create both the JOIN clause and WHERE condition we'll need for that actor. Each join aliases the appearances table to a unique name that is used only for that join to the table.

  actors.flatten.each_with_index do |actor,i|
    join_frags << "INNER JOIN appearances ap#{i} ON movies.id = ap#{i}.movie_id"
    condition_frags << "ap#{i}.actor_id = #{actor.id}"
  end

Then we put the fragments together into the JOIN and WHERE clauses and do the find.

  find(:all, :readonly => false,
       :joins => join_frags.join(" "),
       :conditions => condition_frags.join(" AND "))

We set :readonly to false to override the default behavior where ActiveRecord assumes that joined results can't be written back to a single table. Since we're selecting only the columns from movies, our records are writable.

As I said, we're not joining with the actors table. In this contrived example there aren't a lot of reasons to do that. If you want to find all the movies starring Kevin Bacon and any one of the Baldwins, you might be tempted to write a join against the actors table to search on the last name "Baldwin". However it's simpler to use one query to fetch all the Baldwins then use the ids of those records to create a query that uses OR and AND to find your movies. It takes two queries instead of just one, but odds are that's not going to be a significant performance hit. If it is, then you can sweat out the nasty join.

UPDATE: The story continues in part 2.

13 commentsassociations, rails

Comments
  1. Matt Todd2006-06-12 10:36:22

    This is really great. I'm still learning the associations stuff, but this will be helpful for the harder stuff I'm sure I will eventually have to battle (or not, now)!

    Thanks,

    M.T.

  2. Ted2006-06-12 10:46:15

    But wait, we're going to take a slight detour and before we write the method and write some simple tests for it.

    God bless you!

    Now put that in a plug-in that adds "find_with_all_#{model}s" to all our has_many relationships ;-)

    P. S. Nine more days!

  3. Peter Cooper2006-06-12 11:39:10

    The joins can get pretty intense when you need to do big matches. To do 'which posts have all these tags' in Snippets a year ago, I came up with the equivalent of this SQL:

    SELECT p.* FROM movies m, appearance a 
    WHERE a.movie_id = m.id
      AND a.actor_id IN (17, 23, 39, 42) 
    GROUP BY p.id HAVING COUNT(m.id) = 4
    

    Haven't tested this in this model's case, but should generally work. This SQL only requires a single join, no matter how many actors you specify.

    It also lets you tweak it a little to get 'closest matches' rather than exact matches (if you'd like).. you'd simply order by COUNT(m.id) descending. Movies with all the actors would be at the top (with a count of 4), and then descending through number of common actors.

  4. Josh Susser2006-06-12 12:10:25

    @Peter: Hey, that's pretty slick. Now why didn't you mention that trick yesterday? I'm guessing that "p" is a copy/paste error (used to be posts) and you mean "m", right?

    Actually a few minutes ago I was thinking about using my approach for tagging, and it wasn't looking very clean to me. I think I'll code up your approach too and do some performance comparisons, though I bet yours will win for cases with more than a small number of actors.

  5. keithm@infused.org2006-06-12 13:15:07

    My testing shows that you can replace all that with just this simple method:

    def self.findwithall_actors(actors) actorids = actors.isa?(Array) ? actors.collect{|a| a.id} : [actors.id] find :all, :include => :actors, :conditions => ["actorid IN (?)", actorids] end

  6. keithm@infused.org2006-06-12 13:16:25

    My testing shows that you can replace all that with just this simple method:

    
    def self.find_with_all_actors(actors)
        actor_ids = actors.is_a?(Array) ? actors.collect{|a| a.id} : [actors.id]
        find :all, :include => :actors, :conditions => ["actor_id IN (?)", actor_ids]
      end
    
  7. Peter Cooper2006-06-12 13:26:16

    keithm: Surely that finds all movies with any of those actors, but not necessarily all movies with ALL the actors (in each)? That's the point of Josh's post. Open to correction though..

  8. Josh Susser2006-06-12 13:30:59

    @keithm: I just tried that out. As Peter said, it finds all the movies with ANY of the actors, not the movies with ALL the actors.

  9. Carl-Johan Kihlbom2006-06-12 15:46:21

    If you can live with the performance hit, you can just do:

    actor1.movies & actor2.movies
    

    ...which returns the intersection of the actors' movies.

  10. Josh Susser2006-06-12 16:04:15

    @Carl-Johan: It's not so much the performance hit per se, but the fact that it takes a database query for every actor. That sort of thing makes databases very cranky.

  11. keithm@infused.org2006-06-12 18:03:33

    Yes, indeed. My mistake... I didn't read carefully enough.

  12. kentaro2006-06-12 21:01:48

    How about this? I'm not sure which is better in performance. Many joins seem to cost more.

    def self.find_with_all_actors(*actors)
      return [] if actors.empty?
      find_by_sql(<<-SQL, actors.map { |a| a.id }, actors.size)
        SELECT * FROM movies
        WHERE id in (
          SELECT movie_id
          FROM appearances
          WHERE actor_id in (?)
          GROUP BY movie_id
          HAVING count(*) >= ?
        )
      SQL
    end
    
  13. Mike2006-06-15 19:43:02

    I was going to say a similar thing as Carl, but with one sql query

    movieidswithallactors = Appearances.find(:all, :conditions => ["actorid in (?)",actors]).groupby { |x| x.movie_id }.select { |k,v| v.size == actors.size }.map { |x| x[0] }

    If you need the actual movie object(s), just throw an :include in the sql query.

    You can do actual set intersection with one similar sql query on loading all the actors' movies at once with an :include => movies and set intersecting the already loaded movies

    If i were going this frequently, i would pretty it up by replacing much of it with a method like Enumerable#elementsoccuringn_times or something in that vein, but you get the idea

Sorry, comments for this article are closed.