Symbols are not pretty strings

— April 19, 2008 at 09:25 PDT


Symbols are one of the basic features of Ruby that give it that certain charm we all love. They aren't unique to Ruby (look at Smalltalk or Lisp), but they are a fundamental piece of the language. I'm not going to review what symbols are in this article since there are plenty of other explanations a short google away. However, I do want to say a few words about what I consider a common misuse of symbols.

The way I see it, symbols are great for naming things in your code, but bad for using as domain data. Over the last year or so I've seen a growing number of cases where symbols are used as an alternate syntax for plain old strings. I guess some people like to see :thing instead of "thing" in their code. Well, I don't like it. Sure you get to save a character, but at what cost? Won't somebody think of the children?

My rule of thumb is that if you want an object where all you care about is its identity, then use a symbol. If what you care about is how it's spelled, use a string. If you just want to compare a thing to another thing to see if they are the same thing, go symbol. If you just need to output it in some way (puts, to_s, interpolated value, etc.), it's string all the way. In particular, symbols are awesome as keys in hashes, names of states, and names of methods.
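To make the identity-versus-spelling distinction concrete, here is a minimal sketch. The variable names are made up for illustration, and the strings are built with String.new so the result does not depend on any frozen-string-literal setting.

```ruby
# Two symbols with the same spelling are always the same object.
a = :thing
b = "thing".to_sym
same_symbol = a.equal?(b)        # equal? checks object identity

# Two strings with the same spelling compare equal, but are separate objects.
s1 = String.new("thing")
s2 = String.new("thing")
equal_spelling = (s1 == s2)      # == compares the characters
same_string = s1.equal?(s2)      # identity check on the two string objects

puts same_symbol      # true
puts equal_spelling   # true
puts same_string      # false
```

If identity is all you ever ask about, the symbol gives it to you for free. If you are going to print the thing, the string was already what you needed.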

A great example of this anti-pattern is the newish (but pre-sexy) way to write migrations. Consider the old-school style (this is still how schema dump works):

create_table "tags" do |t|
  t.column "name",           :string
  t.column "taggings_count", :integer, :default => 0, :null => false
  t.column "created_at",     :datetime
end

You can see that the names of the table and fields are strings, while the names of the field options are symbols. This makes total sense to me. The migration code uses the table and field names only to generate the DDL statement that creates the table - basically a big, formatted print. On the other hand, that code uses option names as keys to look up values in a hash, a proper use of symbols. Something to notice: table/field names usually occur just once in a migration, while options can occur repeatedly (e.g. maybe all your fields are set to :null => false).
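Here is a rough sketch of that split. The helpers column_ddl and create_table_ddl and the type mapping are hypothetical and are not how ActiveRecord actually builds DDL. The point is only that table and column names get printed, while type and option names get looked up.

```ruby
# Hypothetical helpers for illustration, not the ActiveRecord implementation.
# name is only ever printed; type and option names are used as hash keys.
def column_ddl(name, type, options = {})
  sql_types = { :string => "VARCHAR(255)", :integer => "INTEGER", :datetime => "DATETIME" }
  ddl = "#{name} #{sql_types.fetch(type)}"
  ddl += " DEFAULT #{options[:default]}" if options.key?(:default)
  ddl += " NOT NULL" if options[:null] == false
  ddl
end

def create_table_ddl(table, columns)
  body = columns.map { |args| "  " + column_ddl(*args) }.join(",\n")
  "CREATE TABLE #{table} (\n#{body}\n)"
end

ddl = create_table_ddl("tags", [
  ["name", :string],
  ["taggings_count", :integer, { :default => 0, :null => false }],
  ["created_at", :datetime]
])
puts ddl
```

Each column name shows up once and goes straight into the output string, while :default and :null are compared against on every column.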

Now, the new way:

create_table :tags do |t|
  t.column :name,           :string
  t.column :taggings_count, :integer, :default => 0, :null => false
  t.column :created_at,     :datetime
end

In this example, all names are symbols. While the code does what you'd want it to, I find it a bit harder to read. I have to take time to think about what is a code name versus what is a data name, whereas in the previous example it's obvious just by looking. Even with the new sexy migrations you have the same problem, just not as much. Hmmm, can I really call sexy migrations new when they've been around for a year already?

The cost of a colon

I wailed about the cost before, but what was I talking about? Besides the cognitive cost of muddling code and data names, there are some costs in resource usage. The two basic resource costs are memory and processing cycles, and using symbols hits you in both places.

Let's start off with memory. Every symbol is a memory leak. I don't know who said that first, but it's a great way to think about symbols. Every symbol lives forever, whether created as a literal in source code or programmatically with #intern or its alias #to_sym. Ruby's garbage collector will never reclaim symbols, since they have to stay around in the symbol table. (There are ways to avoid that situation by having the symbol table hang onto symbols using weak references or table compaction, but so far Ruby doesn't do that.) On the other hand, strings are garbage collected like any other normal object. Symbols need to have an enduring identity that is reused every time you ask for a symbol with the same spelling, but strings do not. So if your names are ephemeral, using strings instead of symbols will save you memory in the long run. On the other hand, if you are using the same name over and over, symbols can be a big win. Why allocate the string "integer" hundreds of times when you can use a single :integer symbol instead?
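You can watch the symbol table grow with Symbol.all_symbols. This is a small sketch with made-up names, and the exact counts will vary by interpreter. Note that later Ruby releases (2.2 onward) can collect dynamically created symbols, so the growth is only permanent on interpreters like the ones this article was written against. The array below keeps the symbols referenced either way.

```ruby
before = Symbol.all_symbols.size

# Intern 1,000 spellings the interpreter has not seen before. Holding them in
# an array keeps them alive even on interpreters that can collect symbols.
fresh = (1..1000).map { |i| "ephemeral_name_#{i}".to_sym }

after = Symbol.all_symbols.size
puts "symbol table grew by #{after - before}"

# Asking for an existing spelling again hands back the same symbol.
again = "ephemeral_name_1".to_sym
puts again.equal?(fresh.first)
```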

What about the processing cost? Fundamentally, every creation of a symbol involves doing something called interning a string. That process includes doing a string comparison to see if the symbol is a new one, and allocating an entry in the symbol table if it is, or finding and using the existing symbol with that spelling if it is not. This amount of overhead is not onerous for proper use of symbols. In fact, it saves you processing in the long run because after a symbol has been interned you can compare symbols with a simple integer comparison, which is much faster than doing a full string comparison. But if you are misusing symbols as pretty strings, you've paid the cost of interning for no benefit, and then you're wasting cycles for the sake of using a single colon instead of two quotation marks.
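As a mental model only, interning can be sketched with a toy table that maps each spelling to a small integer. ToySymbolTable is invented for this illustration; Ruby's real symbol table lives inside the interpreter and is not implemented this way.

```ruby
# A toy intern table, for illustration only.
class ToySymbolTable
  def initialize
    @ids = {}   # spelling => integer id
  end

  # Look the spelling up (this is where the string hashing and comparison
  # happen) and allocate a new entry only when it is missing.
  def intern(spelling)
    @ids[spelling] ||= @ids.size
  end

  def size
    @ids.size
  end
end

table = ToySymbolTable.new
a = table.intern("integer")
b = table.intern("integer")   # existing entry is reused
c = table.intern("string")    # new entry is allocated

puts a == b       # true: comparing interned names is an integer comparison
puts a == c       # false
puts table.size   # 2
```

The string work happens once per intern call. After that, every comparison is between two integers, which is where the payoff for heavily reused names comes from.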

If you've ever written a program you've probably heard that "premature optimization is the root of all evil." Well, that doesn't mean you should ignore performance considerations entirely. As Ezra Zygmuntowicz says, "Postmature optimization is the root of all hosting bills." If cutting corners means you can get your code written faster, then by all means throw performance to the wind and write code at warp speed. But why incur a performance penalty when you get nothing for it? It's like putting snow chains on your tires in the summer.

Indifferent Access

But all is not joyful in Symbolville. As I said above, every symbol is a memory leak. Usually that's not a huge problem, but what happens when you lose control of creation of symbols? Consider how Rails manages request parameters. Request parameters are made available in the params hash, and you can access them using the param's name in the form of a symbol as the key. Take a look at a completely contrived example:

authenticate_user(params[:username], params[:password])

This is a pretty awesome way to get to the request parameters, but there's a problem. Think about what happens if your app gets a series of requests to the following URIs:

/books?title=Nine+Princes+in+Amber
/books?author=Roger+Zelazny
/books?genre=fantasy
/books?series=The+Chronicles+of+Amber
/books?this=1&is=2&a=3&surprise=4&attack=5

If params were a plain hash keyed by symbols, things would get off to a good start but quickly go awry. After those five requests, you'd have added nine symbols to the symbol table: :title, :author, :genre, :series, :this, :is, :a, :surprise, :attack. Every query param name gets converted to a symbol! Hello, DoS attack! Any old dork can generate random URLs and use up your memory a few bytes at a time. This is what we call eventual certain doom.
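To see how that would happen, here is a sketch of a naive parser that interns every param name. Both naive_symbol_params and string_params are hypothetical and are not how Rails builds params; URI.decode_www_form comes from Ruby's standard library.

```ruby
require "uri"

# Hypothetical parser: every param name from the URL is interned, so the
# client decides which symbols get created.
def naive_symbol_params(query)
  URI.decode_www_form(query).each_with_object({}) do |(name, value), params|
    params[name.to_sym] = value
  end
end

# The same parse with string keys: the names stay ordinary strings.
def string_params(query)
  URI.decode_www_form(query).to_h
end

query = "this=1&is=2&a=3&surprise=4&attack=5"
p naive_symbol_params(query).keys   # [:this, :is, :a, :surprise, :attack]
p string_params(query).keys         # ["this", "is", "a", "surprise", "attack"]
```

With the first version, anyone who can send a request can add entries to your symbol table. With the second, the names are just strings that go away with the request.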

Clearly, it's a bad idea to generate a symbol for every param name in every URL when you can't control what names you'll have to deal with. This is why we have HashWithIndifferentAccess. In most ways this object acts like a normal hash, but it doesn't distinguish between symbols and strings as keys, and stores keys internally as strings. Rails can store all the request params using string keys (that can be garbage collected after the request completes), but your controller gets to access them using symbols. You get the appearance of the efficiency of using symbols, with all of the overhead of both symbol interning and string comparisons at the same time! But on the other hand, you don't have to worry about a deluge of random query params using up all your memory.
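The core idea fits in a few lines. IndifferentHash below is a hypothetical stand-in written for this illustration; ActiveSupport's HashWithIndifferentAccess is considerably more complete.

```ruby
# Hypothetical stand-in for illustration, not ActiveSupport's class.
class IndifferentHash
  def initialize
    @store = {}   # keys are always strings internally
  end

  def []=(key, value)
    @store[normalize(key)] = value
  end

  def [](key)
    @store[normalize(key)]
  end

  def keys
    @store.keys
  end

  private

  # Symbols are converted to strings on the way in, so names arriving as
  # strings are never interned.
  def normalize(key)
    key.is_a?(Symbol) ? key.to_s : key
  end
end

params = IndifferentHash.new
params["username"] = "corwin"   # stored by the framework under a string key
puts params[:username]          # read by the controller with a symbol
puts params["username"]         # or with a string
p params.keys                   # ["username"]
```

The only symbols involved are the ones written in your own source, which is a finite set. The price is a to_s on every symbol lookup.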

HashWithIndifferentAccess is the sort of solution that drives me slightly insane. While I can understand the reasoning behind it and I haven't come up with a better solution that retains compatibility and usability as well, there's something about it that just smells wrong. It's the worst of both worlds, but probably the best way to go. Someday symbols will get garbage collected, then we'll all be happy we've been using symbols as keys all along. (Evan Phoenix says Rubinius will support some sort of reclamation of symbols sooner or later, so "someday" will probably arrive before the end of the year.)

Association names

ActiveRecord associations are created using macro-style class methods that specify the name of the association. For example:

class Post < ActiveRecord::Base
  belongs_to :user
  has_many :comments
end

Comparing this example to the migration example I used above, you might think I'd say symbols were the wrong choice here. After all, the names :user and :comments get transformed into class names User and Comment. That's the sort of thing you should use a string for, right? But in this case, I think symbols are a fine choice. That's because those names are also used to name association methods. The association macros define #user and #comments methods (and several others). Those names are in the world of source code instead of domain data, and that's a good use of symbols. Also, the symbols will be allocated anyway when the methods are defined, so there's no extraneous use of symbols. It's the right concept with no extra cost, so no worries.
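A stripped-down sketch shows why the symbol costs nothing extra here. ToyAssociations and belongs_to_sketch are invented for this example and do nothing like what ActiveRecord really does; they only show the same name being used to define a method and to derive a class name.

```ruby
# Hypothetical macro for illustration, not ActiveRecord's belongs_to.
module ToyAssociations
  def belongs_to_sketch(name)
    class_name = name.to_s.split("_").map(&:capitalize).join   # :user => "User"
    # Defining the method interns the name anyway, so passing a symbol is free.
    define_method(name) do
      "would load a #{class_name}"
    end
  end
end

class Post
  extend ToyAssociations
  belongs_to_sketch :user
end

post = Post.new
puts post.respond_to?(:user)   # true
puts post.user                 # would load a User
```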

Doing the right thing

When thinking about using a symbol or a string, I consider two things.

Conceptually, if it's a name for something in code, it should be a symbol. If it's domain data, it should be a string. If it's both, it's probably better to go with a symbol.

For efficiency, it's a matter of cost versus payoff. If a name is only going to be used once, a string will be more efficient. If it's going to be used many times, then the overhead of creating a symbol will be paid back in reuse of a single symbol instead of multiple strings and doing integer comparisons instead of string comparisons.

Sometimes, it's really easy to see whether you should use a symbol or a string. Other times, it's not so simple. Rules of thumb are great, but there's that gray area where names are used as both code names and domain data. That's when you start earning your keep as a programmer.

25 comments
Tags: ruby, symbols

Comments
  1. mikong, 2008-04-19 10:10:14

    I have to admit I use symbols for table names and column names in my migrations. I don't find it harder to read. And since we use symbols in declaring attributes, say in attr_accessor, I don't see a problem with column names being declared as symbols. My previous rule of thumb for this is if it doesn't change its value, like a column name, a symbol should be used. But after reading your article, I have to reevaluate this. I didn't know about the garbage collection and potential DoS issues before.

    Thanks for this thought-provoking article.

  2. Green Rails, 2008-04-19 10:20:22

    Josh -

    As I become a less bad Ruby programmer, stuff like this is really helpful.

    I think you know you have used a language improperly when the syntax highlighting in your editor looks wrong :-) Hashes look great when it's :key => 'string', or :key => false, but look weird when it's :key => :value.

    In fact, I think the notion that good code also looks good comes from the original source of your "premature optimization" comment; in the book Programming Pearls (which it goes without saying has nothing to do with that wretched programming language) his quote is "Code first, optimize later".

    But I digress. Thanks for helping.

    Tom

  3. Dirk, 2008-04-19 10:29:06

    Helpful post! Thank you!

  4. Rick DeNatale, 2008-04-19 10:45:45

    Josh,

    Finally, an explanation about why HashWithIndifferentAccess uses strings instead of symbols as the keys.

    The appearance of vs. actual efficiency always drove me crazy as well. I hadn't considered the problem of arbitrary query parameter names.

    good article

  5. Andre Lewis, 2008-04-19 11:50:50

    Great article Josh ... it helped solidify my understanding of string vs. symbol tradeoffs. I think there's an aesthetic attraction to symbols, and there's also something about the fluency of typing them on the keyboard.

  6. rick, 2008-04-19 12:00:33

    It'll be nice when we can replace HWIA with a normal symbol hash. I agree with this article a lot, and feel like symbols for request parameter keys fit with the defined use case.

  7. Phil, 2008-04-19 12:06:08

    if you want an object where all you care about is its identity, then use a symbol.

    I strongly agree. However, I think you could have chosen a better example than the use of symbols as column names in migrations:

    In particular, symbols are awesome as keys in hashes, names of states, and names of methods.

    In this example, all names are symbols. [... and this is bad.]

    Well, one of the points of having an ORM is that you can treat database tables like objects. This implies column names become object methods. So in a way, the migration definition *is* using symbols as method names, just in a roundabout fashion. If you peek under the covers, you discover that they aren't handled symbolically the whole time, but if you look at it as an end-to-end abstraction, you're defining methods that you will want to use on your AR objects later.

    Also, I have no idea how Ezra's last name is spelled, but I'm pretty sure it's not right above. =)

  8. Tim Harper, 2008-04-19 12:06:25

    Very eye opening - I suspected such was the case, but was not aware of the permanent allocation of symbols during the life of a ruby process

    I thought ruby would garbage collect them if there were no more references.

    This is something every ruby programmer should be aware of.

  9. Josh Susser, 2008-04-19 12:24:40

    @Phil: Good point about the ORM mapping the table name to a method. It's definitely an overly simplistic example, but it's hard to find one that isn't simplistic but is easy enough to understand.

    And D'oh! over Ezra's name. I fixed it. Shame on me for copying it from someplace else that got it wrong, thinking that would keep me from messing it up. At least I know how to pronounce it!

  10. roger, 2008-04-19 12:32:19

    you should submit this to core :) ask them to help

  11. bronson, 2008-04-19 12:37:28

    This is one thing that's driven me a little batty with Ruby. :sym is just another way to quote strings. It has nothing to do with actual symbols.

    If Ruby had gotten symbols right, :sym.to_s would have returned "10891" or whatever unique ID the symbol has. The symbol would be the same regardless of whether I refer to it as "green", "vert" or "녹색". That's why turning a string into a symbol should require a dictionary, and it's impossible to convert a symbol unambiguously into a string.

    Ah well. Everyone enjoys quoting strings with :. Ruby should probably just treat : as yet another string quoting method and add proper symbols in a future release.

  12. Robby Russell, 2008-04-19 13:19:56

    Josh,

    This was a well-written article. Like a few other comments, I'm not sure that the migration example is the best to chime in on. I'm a fan of using symbols in this case and wouldn't fault running migrations as huge memory leak problems.

    However, with controller actions, this is something that definitely deserves some more thought by Rails developers. I use symbols all the time to access stuff from params, session, etc. In my opinion, it's easier to read, but what is the cost behind the scenes?

  13. wuputah, 2008-04-19 15:40:48

    Not to nitpick, but the "new" way to write (sexy) migrations is:

    t.string :name, :address, :city, :bio # etc
    t.integer :comments_count, :taggings_count # etc

    Also, as has been said, migrations are not really a problem as they are run-once (then Ruby quits), but it is a symptom of a larger epidemic.

    Additionally, your comment on "Indifferent Access"--a clever reference to HashWithIndifferentAccess, I presume--is interesting but not entirely correct. HashWithIndifferentAccess actually converts all its keys to strings, not symbols. (params uses this special ActiveSupport Hash class). If you pass it a symbol, it uses #to_s. The parameters are thus never symbols unless someone converts them to symbols (e.g., by iterating over them and running #to_sym). So, if you use a symbol it is only in your code, there is no real limitless memory leak. It may use more memory, but it also saves new object creation and destruction if you, alternately, used strings all the time.

    I still agree with the premise, one should definitely be aware that a symbol is basically a reference to a global constant string, and you should use them with care. I completely agree with your rule about when to use symbols vs strings, but I'd also say that I doubt there are many instances where people really need to change what they're doing right now. If you do not dynamically create symbols from user data, you should be okay.

  14. ste, 2008-04-19 15:52:25

    Interesting article, but I'm having some trouble understanding this claim:

    "After those five requests, you'll have added nine symbols to the symbol table: :title, :author, :genre, :series, :this, :is, :a, :surprise, :attack. Every query param gets converted to a symbol!".

    I created a test application with a single action which outputs the value of params[:foo] and the total number of symbols (Symbol.all_symbols.size). No matter how many requests with different query strings I make, the number always stays the same... Am I missing something?

  15. Josh Susser, 2008-04-19 16:59:16

    @wuputah: Looks like you didn't see where I wrote, "Even with the new sexy migrations you have the same problem, just not as much. Hmmm, can I really call sexy migrations new when they've been around for a year already?" And I know the migration example is not perfect, but whatever. If I think of a better one I'll update the article. Also, you've basically reiterated the whole point of my section on Indifferent Access, that HWIA stores the keys internally as strings. Maybe that wasn't clear enough, so I made it more explicit.

    @ste: Since Rails is using HashWithIndifferentAccess, you'll never see the symbol leak from making requests. That's the point of it. If you try the same thing with a regular Hash object, you will see the number of Symbols rise.

  16. wuputah, 2008-04-19 20:02:18

    I think your post was just confusing to someone who already knows how it works and, thus, it looked like you were implying there was a memory problem with using symbols with params hash or HashWithIndifferentAccess (besides making you (and me) cringe, but I don't see a better way either). Maybe I just read it wrong, but it looks good now. :)

    Either way, good write-up, as always.

  17. RD, 2008-04-19 20:45:19

    I knew that symbols cost you extra memory, but was never sure how the string vs. symbol thing worked. This write-up makes it clear, though. Thanks!

  18. Jason Watkins, 2008-04-20 00:42:31

    @bronson

    Symbols are not just a different way to quote strings. They are stored in a symbol table.

    irb(main):001:0> :foo.object_id
    => 145858

    irb(main):002:0> 'foo'.to_sym.object_id
    => 145858

    to_s means "the string representation of this value". The value isn't the exact integer itself... that should largely be opaque to the application. I think "sym" for :sym.to_s is the reasonable and correct choice for those semantics.

    @josh

    Good article. In a proper implementation, there'd be no important difference between :sym and 'sym', or said a different way, all interned strings would be treated the same. It'll be nice to have the efficiency concern removed so that the semantic distinction is all we care about.

  19. Sam Smoot, 2008-04-20 15:15:21

    Nice article. I think the examples unfortunately detract from the message. :-(

    In DataMapper we map Property#name to Symbols. For obvious performance reasons. Column names are effectively constants. Theoretically they could change during the life of a process, but it's so rare that it should be a non-issue. So the migrations example wasn't perhaps the best one.

    The Hash/Mash example is perfect. And it's why during the process of symbolizing finder options, we check strings against the option#to_s to ensure we're not generating new symbols when doing something like: User.first(params[:user])

    Symbols have their use. The bigger problem I think is the idea that your public API should be able to use Strings interchangeably. It makes a certain amount of sense with a HWIA/Mash, but in general I think people just need to learn the API and everyone's life will be easier. :p

    One way to get there is a nice "slap on the wrist" with a descriptive ArgumentError (at the lower levels of the stack at-least). I mean, you want Integer() to raise an error if you pass it a Person object. You want Object#blank? to just "do the right thing". Everything in moderation, Duck-Typing included.

  20. Koz, 2008-04-20 21:32:41

    Nice article Josh, I appreciate your documenting why we have HashWithIndifferentAccess as it does come up from time to time. Lord knows other libraries have hit the same problem and used the same solution.

    @Robby: However, with controller actions, this is something that definitely deserves some more thought by Rails developers. I use symbols all the time to access stuff from params, session, etc. In my opinion, it's easier to read, but what is the cost behind the scenes?

    There's no thought needed, you have a finite number of symbols in your action source, and HWIA normalises to strings internally. The per-symbol memory usage is tiny relative to the total usage you'll see in your application.

    Basically, if you avoid calling .intern or .to_sym on user-provided data, you'll be fine. Indifference in the API itself is another matter entirely and I don't have strong feelings either way.

  21. Clemens Kofler, 2008-04-21 03:08:50

    Great article that gives deep insight into the deeper workings of Ruby and Rails! ;-) However, I have to agree with the argument that having different syntax highlighting for hash keys and the like is pretty neat ...

  22. Igor Minar, 2008-04-29 10:44:36

    Very well written! Nice post. Thanks.

  23. Scott Ballantyne, 2008-04-30 21:38:07

    Thanks, you always have nice posts.

  24. Chris, 2008-05-16 21:07:53

    Great article. I'm new to both Ruby and Rails and I was unclear as to the difference between strings and symbols. Though this article didn't set out to define either, reading it helped me to make the differentiation.

    Thanks much

  25. WTN, 2008-06-03 16:47:58

    I think using double quotes for strings that are never to be interpolated is almost as bad a sin as using :symbol needlessly.
