<h1>Symbols are not pretty strings</h1>
<p>Josh Susser, has_many :through (blog.hasmanythrough.com), 2008-04-19</p>
<p>Symbols are one of the basic features of Ruby that give it that certain charm we all love. They aren't unique to Ruby (look at Smalltalk or Lisp), but they are a fundamental piece of the language. I'm not going to review what symbols are in this article since there are plenty of other explanations a short google away. However, I do want to say a few words about what I consider a common misuse of symbols.</p>
<p>The way I see it, symbols are great for naming things in your code, but bad for using as domain data. Over the last year or so I've seen a growing number of cases where symbols are used as an alternate syntax for plain old strings. I guess some people like to see <code>:thing</code> instead of <code>"thing"</code> in their code. Well, I don't like it. Sure you get to save a character, but at what cost? Won't somebody think of the children?</p>
<p>My rule of thumb is that if you want an object where all you care about is its identity, then use a symbol. If what you care about is how it's spelled, use a string. If you just want to compare a thing to another thing to see if they are the same thing, go symbol. If you just need to output it in some way (puts, to_s, interpolated value, etc.), it's string all the way. In particular, symbols are awesome as keys in hashes, names of states, and names of methods.</p>
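<p>Here's a minimal sketch of that rule of thumb. The method names <code>shippable?</code> and <code>greeting</code> and the <code>:paid</code> state are made up for illustration; the only point is that the state gets compared while the name gets printed.</p>
<pre><code># Hypothetical example: a state is only ever compared, so a symbol fits.
def shippable?(state)
  state == :paid
end

# Hypothetical example: a name is only ever printed, so a string fits.
def greeting(name)
  "Hello, #{name}!"
end

shippable?(:paid)      # true
shippable?(:pending)   # false
greeting("Josh")       # "Hello, Josh!"

# The same symbol literal is the same object every time...
:paid.equal?(:paid)                             # true
# ...while two strings with the same spelling are separate objects.
String.new("paid").equal?(String.new("paid"))   # false
</code></pre>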
<p>A great example of this anti-pattern is the newish (but pre-sexy) way to write migrations. Consider the old-school style (this is still how schema dump works):</p>
<pre><code>create_table "tags" do |t|
  t.column "name", :string
  t.column "taggings_count", :integer, :default => 0, :null => false
  t.column "created_at", :datetime
end
</code></pre>
<p>You can see that the names of the table and fields are strings, while the names of the field options are symbols. This makes total sense to me. The migration code uses the table and field names only to generate the DDL statement that creates the table - basically a big, formatted print. On the other hand, that code uses option names as keys to look up values in a hash, a proper use of symbols. Something to notice: table/field names usually occur just once in a migration, while options can occur repeatedly (e.g. maybe all your fields are set to <code>:null => false</code>).</p>
<p>Now, the new way:</p>
<pre><code>create_table :tags do |t|
  t.column :name, :string
  t.column :taggings_count, :integer, :default => 0, :null => false
  t.column :created_at, :datetime
end
</code></pre>
<p>In this example, all names are symbols. While the code does what you'd want it to, I find it a bit harder to read. I have to take time to think about what is a code name versus what is a data name, whereas in the previous example it's obvious just by looking. Even with the new sexy migrations you have the same problem, just not as much. Hmmm, can I really call sexy migrations new when they've been around for a year already?</p>
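<p>To make the code-name versus data-name split concrete, here's a rough sketch of a DDL helper. <code>column_ddl</code> is a hypothetical helper, not how ActiveRecord actually builds its SQL, and the type mapping is simplified to an upcased name. The column name is only ever printed, while the options are looked up by key.</p>
<pre><code># Hypothetical helper, not the ActiveRecord implementation.
def column_ddl(name, type, options = {})
  # The column name and type are only formatted into the output text.
  parts = ["#{name} #{type.to_s.upcase}"]
  # The option names are hash keys that the code looks up.
  parts.push("DEFAULT #{options[:default]}") if options.key?(:default)
  parts.push("NOT NULL") if options[:null] == false
  parts.join(" ")
end

column_ddl("name", :string)
# "name STRING"
column_ddl("taggings_count", :integer, :default => 0, :null => false)
# "taggings_count INTEGER DEFAULT 0 NOT NULL"
</code></pre>
<p>Passing <code>:name</code> instead of <code>"name"</code> produces the same text, because interpolation just converts the symbol back to a string.</p>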
<h2>The cost of a colon</h2>
<p>I wailed about the cost before, but what was I talking about? Besides the cognitive cost of muddling code and data names, there are some costs in resource usage. The two basic resource costs are memory and processing cycles, and using symbols hits you in both places.</p>
<p>Let's start off with memory. Every symbol is a memory leak. I don't know who said that first, but it's a great way to think about symbols. Every symbol lives forever, whether created as a literal in source code or programmatically with <code>#intern</code> or its alias <code>#to_sym</code>. Ruby's garbage collector will never reclaim symbols, since they have to stay around in the symbol table. (There are ways to avoid that situation by having the symbol table hang onto symbols using <a href="http://en.wikipedia.org/wiki/Weak_reference">weak references</a> or table compaction, but so far Ruby doesn't do that.) On the other hand, strings are garbage collected as any other normal object. Symbols need to have an enduring identity that is reused every time you ask for a symbol with the same spelling, but strings do not. So if your names are ephemeral, using strings instead of symbols will save you memory in the long run. On the other hand, if you are using the same name over and over, symbols can be a big win. Why allocate the string "integer" hundreds of times when you can use a single :integer symbol instead?</p>
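<p>A small sketch of the reuse argument, assuming nothing beyond core Ruby. <code>distinct_object_count</code> is a helper invented here; it keeps every object alive in an array and counts how many different objects it was handed. It says nothing about what the garbage collector does with the symbol table, which depends on the Ruby implementation and version.</p>
<pre><code># Hypothetical helper: call the block `count` times, keep every result alive,
# and report how many distinct objects came back.
def distinct_object_count(count)
  objects = Array.new(count) { yield }
  objects.map { |object| object.object_id }.uniq.size
end

# Every String.new allocates a fresh object, even for the same spelling.
distinct_object_count(100) { String.new("integer") }   # 100

# Asking for the symbol always hands back the one interned object.
distinct_object_count(100) { :integer }                # 1
distinct_object_count(100) { "integer".to_sym }        # 1
</code></pre>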
<p>What about the processing cost? Fundamentally, every creation of a symbol involves doing something called <em>interning</em> a string. That process includes doing a string comparison to see if the symbol is a new one, and allocating an entry in the symbol table if it is, or finding and using the existing symbol with that spelling if it is not. This amount of overhead is not onerous for proper use of symbols. In fact, it saves you processing in the long run because after a symbol has been interned you can compare symbols with a simple integer comparison, which is much faster than doing a full string comparison. But if you are misusing symbols as pretty strings, you've paid the cost of interning for no benefit, and then you're wasting cycles for the sake of using a single colon instead of two quotation marks.</p>
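<p>If interning sounds mysterious, here's a toy model of it in plain Ruby. <code>TinySymbolTable</code> is entirely hypothetical and is not how the interpreter implements its symbol table; it only illustrates the shape of the work: a lookup by spelling, a new entry on first sight, and cheap integer comparisons afterwards.</p>
<pre><code># Hypothetical toy intern table, for illustration only.
class TinySymbolTable
  def initialize
    @ids = {}   # spelling (String) maps to a small integer id
  end

  # Look the spelling up (hashing and comparing strings); allocate a new
  # entry only the first time this spelling is seen.
  def intern(spelling)
    @ids[spelling] ||= @ids.size
  end

  def size
    @ids.size
  end
end

table = TinySymbolTable.new
a = table.intern("integer")   # first sight: new entry, id 0
b = table.intern("string")    # first sight: new entry, id 1
c = table.intern("integer")   # already known: same id as before, 0
a == c                        # true, and it's just an integer comparison
table.size                    # 2
</code></pre>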
<p>If you've ever written a program you've probably heard that "premature optimization is the root of all evil." Well, that doesn't mean you should ignore performance considerations entirely. As Ezra Zygmuntowicz says, "Postmature optimization is the root of all hosting bills." If cutting corners means you can get your code written faster, then by all means throw performance to the wind and write code at warp speed. But why incur a performance penalty when you get nothing for it? It's like putting snow chains on your tires in the summer.</p>
<h2>Indifferent Access</h2>
<p>But all is not joyful in Symbolville. As I said above, <em>every symbol is a memory leak</em>. Usually that's not a huge problem, but what happens when you lose control of creation of symbols? Consider how Rails manages request parameters. Request parameters are made available in the params hash, and you can access them using the param's name in the form of a symbol as the key. Take a look at a completely contrived example:</p>
<pre><code>authenticate_user(params[:username], params[:password])
</code></pre>
<p>This is a pretty awesome way to get to the request parameters, but there's a problem. Think about what happens if your app gets a series of requests to the following URIs:</p>
<pre><code>/books?title=Nine+Princes+in+Amber
/books?author=Roger+Zelazny
/books?genre=fantasy
/books?series=The+Chronicles+of+Amber
/books?this=1&is=2&a=3&surprise=4&attack=5
</code></pre>
<p>Things get off to a good start but quickly go awry. After those <strong>five</strong> requests, you'll have added <strong>nine</strong> symbols to the symbol table: :title, :author, :genre, :series, :this, :is, :a, :surprise, :attack. Every query param gets converted to a symbol! Hello, DoS attack! Any old dork can generate random URLs and use up your memory a few bytes at a time. This is what we call <em>eventual certain doom</em>.</p>
<p>Clearly, it's a bad idea to generate a symbol for every param name in every URL when you can't control what names you'll have to deal with. This is why we have HashWithIndifferentAccess. In most ways this object acts like a normal hash, but it doesn't distinguish between symbols and strings as keys, and stores keys internally as strings. Rails can store all the request params using string keys (that can be garbage collected after the request completes), but your controller gets to access them using symbols. You get the appearance of the efficiency of using symbols, with all of the overhead of both symbol interning and string comparisons at the same time! But on the other hand, you don't have to worry about a deluge of random query params using up all your memory.</p>
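<p>Here's a stripped-down sketch of the idea. <code>TinyIndifferentHash</code> is a hypothetical stand-in written for this article, not Rails' actual HashWithIndifferentAccess; it only shows keys being normalized to strings on the way in and on the way out.</p>
<pre><code># Hypothetical, simplified stand-in; not the Rails implementation.
class TinyIndifferentHash
  def initialize
    @data = {}
  end

  # Keys are always stored as strings, whatever the caller passes in.
  def []=(key, value)
    @data[key.to_s] = value
  end

  # Lookups convert the key too, so symbols and strings both work.
  # Note the cost: a symbol key is turned into a string on every access.
  def [](key)
    @data[key.to_s]
  end

  def keys
    @data.keys
  end
end

params = TinyIndifferentHash.new
params["username"] = "josh"   # stored under the string "username"
params[:username]             # "josh"
params["username"]            # "josh"
params.keys                   # ["username"]
</code></pre>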
<p>HashWithIndifferentAccess is the sort of solution that drives me slightly insane. While I can understand the reasoning behind it and I haven't come up with a better solution that retains compatibility and usability as well, there's something about it that just smells wrong. It's the worst of both worlds, but probably the best way to go. Someday symbols will get garbage collected, then we'll all be happy we've been using symbols as keys all along. (Evan Phoenix says Rubinius will support some sort of reclamation of symbols sooner or later, so "someday" will probably arrive before the end of the year.)</p>
<h2>Association names</h2>
<p>ActiveRecord associations are created using macro-style class methods that specify the name of the association. For example:</p>
<pre><code>class Post < ActiveRecord::Base
  belongs_to :user
  has_many :comments
end
</code></pre>
<p>Comparing this example to the migration example I used above, you might think I'd say symbols were the wrong choice here. After all, the names :user and :comments get transformed into class names User and Comment. That's the sort of thing you should use a string for, right? But in this case, I think symbols are a fine choice. That's because those names are also used to name association methods. The association macros define #user and #comments methods (and several others). Those names are in the world of source code instead of domain data, and that's a good use of symbols. Also, the symbols will be allocated anyway when the methods are defined, so there's no extraneous use of symbols. It's the right concept with no extra cost, so no worries.</p>
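<p>To show the name doing double duty, here's a bare-bones sketch of an association-style macro. <code>TinyAssociations</code> is hypothetical and does nothing like the real work of ActiveRecord's <code>belongs_to</code>; the class lookup is simplified to capitalizing the name, and the generated method just builds a fresh object.</p>
<pre><code># Hypothetical mini-macro, for illustration only.
module TinyAssociations
  def belongs_to(name)
    # Data use of the name: :user is spelled out as the class name "User".
    class_name = name.to_s.capitalize
    # Code use of the name: :user becomes the #user method.
    define_method(name) do
      Object.const_get(class_name).new
    end
  end
end

class User
end

class Post
  extend TinyAssociations
  belongs_to :user
end

Post.new.user.class            # User
Post.instance_methods(false)   # [:user]
</code></pre>
<p>The method name is a symbol either way, so passing <code>:user</code> to the macro doesn't create anything that wouldn't exist already.</p>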
<h2>Doing the right thing</h2>
<p>When thinking about using a symbol or a string, I consider two things.</p>
<p>Conceptually, if it's a name for something in code, it should be a symbol. If it's domain data, it should be a string. If it's both, it's probably better to go with a symbol.</p>
<p>For efficiency, it's a matter of cost versus payoff. If a name is only going to be used once, a string will be more efficient. If it's going to be used many times, then the overhead of creating a symbol will be paid back by reusing a single symbol instead of allocating multiple strings, and by doing integer comparisons instead of string comparisons.</p>
<p>Sometimes, it's really easy to see whether you should use a symbol or a string. Other times, it's not so simple. Rules of thumb are great, but there's that gray area where names are used as both code names and domain data. That's when you start earning your keep as a programmer.</p>