ActiveRecord associations are the neatest thing since sliced bread. They are great for simplifying a common set of relational database access patterns, something even sliced bread can't do. ActiveRecord associations provide a lot of flexibility, but there comes a point there they run out of steam and you have to roll up your sleeves and write some custom SQL.
Last week I read a question on the rails email list that had, I think, a rather interesting solution. I think it's a good example of how to use custom SQL without throwing all of ActiveRecord out the window and resorting to find_by_sql
. I'm going to answer that question again here because it deserves more attention than I was able to give it in just a quick email. I also wanted to test my answer and make sure everything works correctly (which it does now).
IMDb is one of my best friends on the internet. One of its nifty features is the ability to find all the movies in which several actors appear together. How would we go about doing something like that using ActiveRecord associations? Actors appearing in movies is quite similar to the example of dancers in movies that I used in ye olde Many-to-many Dance-off, so we can start with pretty much the same tables I used there, just turning dancers into actors (like that would be the first time that ever happened).
With either of has_and_belongs_to_many
or has_many :through
you can find all the movies for a given actor simply by using the magic accessor actor.movies
, and all the actors for a movie with movie.actors
. But there's no simple way to use the standard association machinery to find the set of movies in which several actors appear. The naive approach would be to use a condition that compares the actor_id
against the ids of several actors, but that just won't work.
movies = Movie.find(:all, :conditions => "actor_id = 17 AND actor_id = 23")
First off, there is no actor_id
in the Movie model - it's in the join table! (Or in the join model's table if using has_many :through
.) So if we mess around and get the join table into the SQL, we'll just be looking for rows in the join table where the actor_id
is both 17 and 23 at the same time, and that only happens in Bizarro world. What we need to do is to join with the join table multiple times. If you know SQL well then this won't be at all impressive. But if you're a SQL novice it might seem like some kind of black magic. Don't worry, it's not that difficult.
Say we want to find all the Marx Brothers' movies.
movies = Movie.find_with_all_actors([groucho, harpo, chico, zeppo])
Time to write some custom SQL. We can build a query using SQL's ability to join to the same table more than once in a single query. You have to alias the table to a different name for each join, but other than that the rest is straightforward.
SELECT * FROM movies
INNER JOIN appearances ap0 ON movies.id = ap0.movie_id
INNER JOIN appearances ap1 ON movies.id = ap1.movie_id
INNER JOIN appearances ap2 ON movies.id = ap2.movie_id
INNER JOIN appearances ap3 ON movies.id = ap3.movie_id
WHERE (ap0.actor_id = 17 AND ap1.actor_id = 23 AND
ap2.actor_id = 39 AND ap3.actor_id = 42)
I'm using the join model table appearances
in that SQL, but for a habtm join table you could just change that to actors_movies
and it would work the same. No need to join to the actors table because we aren't searching on any of those attributes, just the actor_id
in the join model. If we run that query we'll get back all the movies that all four of the Marx Brothers appeared in. Duck soup!
Let's write the method to generate that SQL. But wait, we're going to take a slight detour and before we write the method and write some simple tests for it. We'd like the method to work correctly whether we give it no arguments, a single actor, an array of actors, or just a bunch of actors not in an array.
assert_equal [], Movie.find_with_all_actors()
assert_equal 12, Movie.find_with_all_actors(groucho).length
assert_equal 7, Movie.find_with_all_actors([groucho, harpo, chico, zeppo]).length
assert_equal 7, Movie.find_with_all_actors(groucho, harpo, chico, zeppo).length
Alright, now let's write the method to generate that SQL. Since we're looking for movies, we'll create a class method in the Movie model:
def self.find_with_all_actors(*actors)
return [] if actors.empty?
join_frags, condition_frags = [], []
actors.flatten.each_with_index do |actor,i|
join_frags << "INNER JOIN appearances ap#{i} ON movies.id = ap#{i}.movie_id"
condition_frags << "ap#{i}.actor_id = #{actor.id}"
end
find(:all, :readonly => false,
:joins => join_frags.join(" "),
:conditions => condition_frags.join(" AND "))
end
That method isn't all that complex, but I'll break it down and walk through it so it's clear. Note that if actors
is empty we just return an empty Array. After that we create some arrays to collect the SQL fragments in as they are generated.
return [] if actors.empty?
join_frags, condition_frags = [], []
Then comes the meat of the method. We enumerate the actors and create both the JOIN clause and WHERE condition we'll need for that actor. Each join aliases the appearances
table to a unique name that is used only for that join to the table.
actors.flatten.each_with_index do |actor,i|
join_frags << "INNER JOIN appearances ap#{i} ON movies.id = ap#{i}.movie_id"
condition_frags << "ap#{i}.actor_id = #{actor.id}"
end
Then we put the fragments together into the JOIN and WHERE clauses and do the find.
find(:all, :readonly => false,
:joins => join_frags.join(" "),
:conditions => condition_frags.join(" AND "))
We set :readonly
to false to override the default behavior where ActiveRecord assumes that joined results can't be written back to a single table. Since we're selecting only the columns from movies, our records are writable.
As I said, we're not joining with the actors
table. In this contrived example there aren't a lot of reasons to do that. If you want to find all the movies starring Kevin Bacon and any one of the Baldwins, you might be tempted to write a join against the actors table to search on the last name "Baldwin". However it's simpler to use one query to fetch all the Baldwins then use the ids of those records to create a query that uses OR and AND to find your movies. It takes two queries instead of just one, but odds are that's not going to be a significant performance hit. If it is, then you can sweat out the nasty join.
UPDATE: The story continues in part 2.
This is really great. I'm still learning the associations stuff, but this will be helpful for the harder stuff I'm sure I will eventually have to battle (or not, now)!
Thanks,
M.T.
God bless you!
Now put that in a plug-in that adds "find_with_all_#{model}s" to all our has_many relationships ;-)
P. S. Nine more days!
The joins can get pretty intense when you need to do big matches. To do 'which posts have all these tags' in Snippets a year ago, I came up with the equivalent of this SQL:
Haven't tested this in this model's case, but should generally work. This SQL only requires a single join, no matter how many actors you specify.
It also lets you tweak it a little to get 'closest matches' rather than exact matches (if you'd like).. you'd simply order by COUNT(m.id) descending. Movies with all the actors would be at the top (with a count of 4), and then descending through number of common actors.
@Peter: Hey, that's pretty slick. Now why didn't you mention that trick yesterday? I'm guessing that "p" is a copy/paste error (used to be posts) and you mean "m", right?
Actually a few minutes ago I was thinking about using my approach for tagging, and it wasn't looking very clean to me. I think I'll code up your approach too and do some performance comparisons, though I bet yours will win for cases with more than a small number of actors.
My testing shows that you can replace all that with just this simple method:
def self.findwithall_actors(actors) actorids = actors.isa?(Array) ? actors.collect{|a| a.id} : [actors.id] find :all, :include => :actors, :conditions => ["actorid IN (?)", actorids] end
My testing shows that you can replace all that with just this simple method:
keithm: Surely that finds all movies with any of those actors, but not necessarily all movies with ALL the actors (in each)? That's the point of Josh's post. Open to correction though..
@keithm: I just tried that out. As Peter said, it finds all the movies with ANY of the actors, not the movies with ALL the actors.
If you can live with the performance hit, you can just do:
...which returns the intersection of the actors' movies.
@Carl-Johan: It's not so much the performance hit per se, but the fact that it takes a database query for every actor. That sort of thing makes databases very cranky.
Yes, indeed. My mistake... I didn't read carefully enough.
How about this? I'm not sure which is better in performance. Many joins seem to cost more.
I was going to say a similar thing as Carl, but with one sql query
movieidswithallactors = Appearances.find(:all, :conditions => ["actorid in (?)",actors]).groupby { |x| x.movie_id }.select { |k,v| v.size == actors.size }.map { |x| x[0] }
If you need the actual movie object(s), just throw an :include in the sql query.
You can do actual set intersection with one similar sql query on loading all the actors' movies at once with an :include => movies and set intersecting the already loaded movies
If i were going this frequently, i would pretty it up by replacing much of it with a method like Enumerable#elementsoccuringn_times or something in that vein, but you get the idea