infundibulum

Getting Unicode, MySql, and Rails to Cooperate

June 23rd, 2005

In a post about Ruby and Unicode a while back, I mentioned the page at the Rails wiki called How To Use Unicode Strings in Rails. (Btw, check out Why the Lucky Stiff’s response to my post, some useful code there.)

It turns out that there were a few more mostly MySQL-specific steps involved in getting Unicode to work correctly with Rails. So I thought I’d describe all the steps we went through to get it set up in one place. This has only been tested with MySQL 4.1.

In MySQL: Set the Encoding when you Create Tables

You need to explicity tell MySQL that you want your tables to be encoded in UTF-8. Here’s a sample table with 3 columns, id, foo, and bar:

create table samples (
    id int not null auto_increment,
    foo varchar(100) not null,
    bar text not null,
    primary key (id)
) Type=MyISAM CHARACTER SET utf8;

The line Type=MyISAM CHARACTER SET utf8; is where the action is. The table type has to be MyISAM, not innoDB, because unfortunately innoDB tables don’t support full text searching of UTF-8 encoded content (details here). Apparently innoDB tables are more flexible in general, but if full-text search is crucial for you, you’ll have to go with MyISAM.

Bummer, that.

In any case, then add the CHARACTER SET utf8 directive, as shown.

(I wonder if there is some way to set this as a default, without adding the line to the DDL for every table?)

Set the “charset” and “Content-type” in the Application Controller

This is also described at How To Use Unicode Strings in Rails.

class ApplicationController < ActionController::Base
  before_filter :set_charset

  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end
end

The previous steps were enough to get Unicode showing up in a little test app I generated with scaffold (the app just consisted of an input field and a textarea).

To test it, I pasted in some sample text in various languages. It worked okay for text containing only the characters found in ASCII or latin1, but among other characters there were weird cases of random characters being removed or added. The problem seemed not to have anything to do with the script (i.e., the Unicode block). For instance, in an Esperanto text, “ĉ” (U+0109 LATIN SMALL LETTER C WITH CIRCUMFLEX) came out fine, but “ĝ” (U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX) was borked. Go figure.

It took some digging to find the next bit — thanks to Ben Jackson of INCOMUM Design & Conceito for getting the straight story on the Rails list. The solution is…

Tell Rails to tell MySQL to Use UTF-8. (Got that?)

It came down to a MySQL configuration option: You have to tell MySQL to SET NAMES UTF8, as DHH pointed out in the previous link. You can either do it in the source to ActiveRecord, in mysql_adapter.rb, or you can just make the change in your own application. We chose the latter route.

So, here’s our app/controllers/application.rb as it stands:

class ApplicationController < ActionController::Base
  before_filter :set_charset
  before_filter :configure_charsets

  def set_charset
        @headers["Content-Type"] = "text/html; charset=utf-8"
  end

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
        suppress(ActiveRecord::StatementInvalid) do
          ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
        end
   end
end

(In case you’re wondering, yes, you can have more than one before_filter.)

UPDATE Er, but you can get away with just one: Jonas refactored it. Use this version instead. :)

class ApplicationController < ActionController::Base
  before_filter :configure_charsets

  def configure_charsets
    @response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Set connection charset. MySQL 4.0 doesn't support this so it
    # will throw an error, MySQL 4.1 needs this
        suppress(ActiveRecord::StatementInvalid) do
          ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
        end
   end
end

UPDATED AGAIN Argh, I forgot to remove the set_charset definition before. It should be correct now…

Anyway, now our Rails installation seems to handle (almost) any nutty writing system that we throw at it.

I’m not sure these are all the Right Way™, I’m just saying it’s worked for us (so far).

For one thing, it seems like it might make sense to just go ahead and set the default character set and collation for MySQL to UTF-8, independently of any Rails stuff at all. Aren’t all character sets supposed to be slouching toward Unicode, anyway? But that sounds like a rather apocalyptic measure, for some reason… I guess I’ll hold off on the rapture of the character sets.

Collation is it’s own topic — but then we’ve not gotten to sorting anything yet. Here’s a recommendation for building the tables with:

CHARACTER SET utf8 COLLATE utf8_general_ci;

To set the collation order. Wading through the MySQL docs on that is next on the agenda.

Comments

  1. 1

    Thanks, very helpful and worked a charm.

    - rg @
  2. 2

    Great, glad to hear it!

    By the way, since writing this I ran across:

    Using Gettext To Translate Your Rails Application by Sascha Ebach. (I wonder if he has a blog…)

    Which explains everything I covered plus how to use Gettext. It’s far more extensive than this little writeup, but hopefully this will remain a good brief intro on MySQL + Rails.

    - pat @
  3. 3

    Pat,

    Reading this post is better than waking up and finding out you have a super model in your bed (wait a minute I take that back it’s good but not that good), it RULES and you rule for writing it. I’m doing the two-handed sultan salute for helping me with what was turning into one ugly unicode Sunday. Thanks!

    suppress(ActiveRecord::StatementInvalid) do
    ActiveRecord::Base.connection.execute ‘SET NAMES UTF8′
    end

    WTF?!? Brilliant.

    - Tim Case @
  4. 4

    > (I wonder if there is some way to set this as a default, without adding the line to the DDL for every table?)
    You can set it up also on database level and server level and also at column level. Good intro at http://www.php-editors.com/mysql_manual/p_manual_Charset.html

    - David @
  5. 5

    Thanks for the article! Very helpful.

    The first subsection makes it sound like you can use innoDB, if you’re willing to sacrifice full-text searching. I just tried it on MySQL 4.1.12, and it only works as MyISAM. With innoDB, I got each byte displayed as an ASCII character, with all the other settings (before filter) exactly the same.

    - wzph @
  6. 6

    Thanks very much for the configure_charsets code! I want to say that I have tried it with InnoDB tables on MySQL 5.0.x and it seems to work fine! So for those who use foreign keys, no need to panic!

    Cheers

    - telos @
  7. 7

    telos, do you know if full-text searching works in 5.0?

    - pat @
  8. 8

    Actually, if you put this line:
    encoding: utf8
    in your database.yml configuration file, it is
    equivalent to: SET NAMES UTF8.

    - sebastien @
  9. 9

    Sebastien’s last suggestion works great here with Mysql 5.0 too. I even removed content-type header and it’s still fine! Thanks for a great write-up!

    - Metin Amiroff @
  10. 10

    […] @headers["Content-Type"] = "text/html; charset=utf-8" end As explained here, MySQL must use MyISAM table type, not innoDB, so make sure […]

    - My Other Blog is a Ferrari » Blog Archive » To make apps Japanese compatible @
  11. 11

    Hi,
    How about dumping unicode utf8 database from mysql 4.1 and uploading it to mySQL 4.0.
    Can I somehow read unicode utf8 records from uploaded tables?

    - EchoOff @
  12. 12

    hello there,
    i am trying to reproduce a data server that a company uses to present information to its customers via the web. No problem. What they are using right now is Microsoft Access, connecting via ODBC to a server on line. The cool thing that they can do is have several tables linked together. Like if a certain field is updated in one table, it updates the same info automatically in another table. So, i want to reproduce this in MySQL, but i can’t seem to find out how. Is there a way to pull this off on the server side ? or does that have to be done on the client side ?
    thanks for any tips.

    - jeff @
  13. 13

    @EchoOff: Oops, sorry I missed this question, but I’m afraid I really don’t know the answer. A quick glance at this page:

    http://dev.mysql.com/doc/refman/4.1/en/charset-unicode.html

    You might be out of luck; it suggests that Unicode (utf-8) support was added in 4.1.

    @jeff: Sorry, I have no experience with ODBC or Access.

    Good luck, thanks for the comments.

    - pat @
  14. 14

    @jeff: what you are describing could be accomplished by a trigger, which is now supported in mysql 5… see http://dev.mysql.com/doc/refman/5.0/en/using-triggers.html for more info.

    - tim @
  15. 15

    There’s a simplest way. Just add the following into your config/database.yml:

    encoding: utf8

    - nullie @
  16. 16

    Great article! I also faced similar issues. My thoughts here:

    http://www.gaugeus.com/ramblings/2006/11/15/unicode-support-in-ruby-on-rails-with-mysql-and-microsoft-sql-server

    - Andres @
  17. 17

    […] (参考) Getting Unicode, MySql, and Rails to Cooperate http://ruphus.com/blog/2005/06/23/getting-unicode-mysql-and-rails-to-cooperate/ この記事のタグ:Rails Ruby プログラム 覚書 […]

    - Rails UTF-8文字化け対策 » TMJ style @
  18. 18

    On personal opinion, I find this very helpful.
    Guys, I have also posted some more relevant info further on this, not sure if you find it useful: http://www.bidmaxhost.com/forum/

    - ocnsss @
  19. 19

    […] Sveiki, štai ir vėl sulaukėm savaitgalio skaitinių rubrikos, po šiokios tokios pertraukos. Šią savaitę buvo publikuotas NePo straipsnis Kompiuterių virusų istorija. Tikriausiai pastebėjot, jog Pixel.lt aktyvumas truputį sumažėjo, tačiau pavasaris sunkus metas, vieniems darbai, kitiems mokslai ir t.t. Linkime sau pasitempti. Šio savaitgalio skaitiniai: Getting Unicode, MySql, and Rails to Cooperate Perl, the first postmodern computer language Top 5 javascript frameworks Susiję straipsniai:Savaitgalio skaitiniai #24Savaitgalio skaitiniai #22Savaitgalio skaitiniai #21 […]

    - » Savaitgalio skaitiniai #25 Archyvas » Pixel.lt @

Leave a Reply