#48 ✓invalid
qoobaa

Rack::Utils.unescape problems in Ruby 1.9.1

Reported by qoobaa | May 12th, 2009 @ 07:12 PM | in 1.0

The problem is related to the "incompatible character encodings: ASCII-8BIT and UTF-8" exception in Rails. It is caused by the unescape function used to parse query params:


Rack::Utils.unescape("%C4%85")
#=> "\xC4\x85"
Rack::Utils.unescape("%C4%85").encoding
#=> #<Encoding:ASCII-8BIT>

It should detect the string encoding (somehow). If we pass params encoded with ASCII-8BIT to ActiveRecord classes, we gain nothing from using Ruby 1.9.1.
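The exception can be reproduced without Rails or Rack. The sketch below is only an illustration: `raw_unescape` is a hypothetical stand-in that percent-decodes the way the example above does and leaves the result tagged as ASCII-8BIT. It assumes Ruby 1.9 or later.

```ruby
# raw_unescape is a hypothetical helper (not Rack's own code): it
# percent-decodes the input and tags the resulting bytes as ASCII-8BIT.
def raw_unescape(s)
  decoded = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/) do
    [$1.delete('%')].pack('H*')
  end
  decoded.force_encoding(Encoding::BINARY)
end

param = raw_unescape("%C4%85")   # the two bytes of U+0105, tagged ASCII-8BIT
label = "Name: \u0105"           # a UTF-8 string with a non-ASCII character

begin
  label + param                  # mixing the two raises
rescue Encoding::CompatibilityError => e
  puts e.class
end

# Relabeling the bytes (no transcoding) makes the strings compatible.
fixed = param.dup.force_encoding(Encoding::UTF_8)
puts((label + fixed).encoding)
```

Concatenation only fails when both strings contain non-ASCII bytes, which is why the problem tends to surface late, deep inside a template or a database adapter.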

Edit:

Can we use charset from Content-Type header? (e.g. application/x-www-form-urlencoded; charset=utf-8)

Comments and changes to this ticket

  • Ryan Tomayko

    Ryan Tomayko May 12th, 2009 @ 11:16 PM

    My understanding is that all browsers use the Content-Type/charset response header of the page submitting the form (or that contains the link) when possible and fall back to ASCII. That's always seemed like a sane thing to do but I can't think of a way to automatically detect it on the request side. I assume the Content-Type request header includes the charset on POST but I'm not sure how to handle GET.

  • qoobaa

    qoobaa May 12th, 2009 @ 11:42 PM

    • Tag set to patch, ruby-1.9, utils

    I suggest a solution similar to rails/actionpack/lib/action_view/template/handlers/erb.rb. It's the easiest way to fix it, using UTF-8 as the default encoding.

  • manveru

    manveru May 16th, 2009 @ 05:11 PM

    • Assigned user set to “manveru”

    It may be the easiest way, but it is also wrong if you want to support another encoding, not to mention that force_encoding is not very pretty and might hide encoding conflicts.
    There is also the possibility of not touching the encoding in the unescape method and leaving it to wrappers to change it. That gives frameworks the possibility to add configuration for the behaviour with a reasonable default, instead of having UTF-8 hardcoded inside Rack (which is encoding-agnostic everywhere else).
    I'd also suggest making this capability-based instead of version-based, asking .respond_to?(:encode), if we tackle encoding in this method at all.
    I'm taking responsibility for this change for now, as I'm quite interested in the outcome and do a lot of i18n stuff anyway.
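    A minimal sketch of what such a capability-based, opt-in variant could look like. The optional `encoding` argument is hypothetical and not part of Rack; the check uses respond_to?(:force_encoding) because that is the method actually called here.

```ruby
# Hypothetical sketch: unescape makes no encoding decision of its own.
# A wrapper (for example a framework) may pass an encoding explicitly.
def unescape(s, encoding = nil)
  bytes = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/) do
    [$1.delete('%')].pack('H*')
  end
  # Capability check instead of a RUBY_VERSION comparison:
  # Ruby 1.8 strings have no force_encoding method.
  if encoding && bytes.respond_to?(:force_encoding)
    bytes.force_encoding(encoding)
  else
    bytes
  end
end

unescape("%C4%85")                     # bytes only, no label applied here
unescape("%C4%85", Encoding::UTF_8)    # caller opted in to UTF-8
```

    With this shape the default stays encoding-agnostic, and any UTF-8 default would live in the framework's configuration rather than in Rack.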

  • qoobaa

    qoobaa May 16th, 2009 @ 09:11 PM

    I agree with you. Obviously there must be some way to change the encoding. I needed a quick fix for the issue; if anybody wants a simple temporary solution, they can use the patch. Anyway, RoR and the DB adapters are also broken, and the easiest way to fix them is to enforce ASCII-8BIT encoding everywhere for now.

  • josh

    josh August 3rd, 2009 @ 04:19 PM

    • State changed from “new” to “open”
    • Milestone set to 1.0
  • Jeremy Kemper

    Jeremy Kemper August 4th, 2009 @ 02:13 AM

    This is addressing a symptom. The root cause is that params parsing doesn't respect string encodings.

    Params parsed from the query string should always be UTF-8, per RFC.

    Params parsed from the body should have encoding set according to the Content-Type header.

    Rather than shift the burden of setting encoding to every middleware and app which reads rack.input, this should be an additional spec requirement for the rack adapters:
    1. must have correct encoding for PATH_INFO and friends
    2. must have rack.input external_encoding reflect the Content-Type

    And this ticket may be closed.

  • josh

    josh August 4th, 2009 @ 02:14 AM

    • State changed from “open” to “invalid”
  • James Healy

    James Healy August 4th, 2009 @ 02:23 AM

    @Jeremy I agree with your comments.

    Out of interest, which RFC says query strings are always encoded as UTF-8? I spent some time looking for an RFC that provided guidance on the encoding of HTTP requests and failed. I'm also keen to read up on how to interpret the encoding in the very common case of an HTTP request with no Content-Type header.

  • Jeremy Kemper

    Jeremy Kemper August 4th, 2009 @ 02:37 AM

    Missing Content-Type could be treated as binary/octet-stream with Encoding::BINARY.

    I'm wrong about URIs: it's just a strong suggestion.

    From http://www.ietf.org/rfc/rfc2718.txt:

    2.2.5 Character encoding
    
    
      When describing URL schemes in which (some of) the elements of the
      URL are actually representations of sequences of characters, care
      should be taken not to introduce unnecessary variety in the ways
      in which characters are encoded into octets and then into URL
      characters.  Unless there is some compelling reason for a
      particular scheme to do otherwise, translating character sequences
      into UTF-8 (RFC 2279) [3] and then subsequently using the %HH
      encoding for unsafe octets is recommended.
    
    
    
    

    So it's actually a recommendation only for URIs. For the sake of usability, and with an eye forward to IRIs, we should probably enforce it (the IRI <-> URI conversions detailed in http://www.ietf.org/rfc/rfc3987 use hex-encoded UTF-8).

    From http://www.ietf.org/rfc/rfc3987:

    Appendix A.4.  Indicating Character Encodings in the URI/IRI
    
    
    Some proposals suggested indicating the character encodings used in an URI or IRI with some new syntactic convention in the URI itself, similar to the "charset" parameter for e-mails and Web pages. As an example, the label in square brackets in "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the following "&#xE9;" had to be interpreted as iso-8859-1.
    If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed. It avoids potentially multiple labels that have to be copied correctly in all cases, even on the side of a bus or on a napkin, leading to usability problems (and being prohibitively annoying). Exclusively using UTF-8 also reduces transcoding errors and confusion.

    BTW, the IRI spec is driven by the same guy behind Ruby 1.9's encoding effort: Martin Dürst. Cool!

  • James Healy

    James Healy August 4th, 2009 @ 03:01 AM

    Interesting, I hadn't come across the IRI RFC before.

    WRT HTTP requests with no Content-Type, using BINARY as an encoding may not be a workable solution. In my testing with Firefox 3.0-3.5, standard POST requests have a Content-Type with no charset segment. Like so:

    "Content-Type: application/x-www-form-urlencoded"
    

    I'm not sure how other browsers act, but in this case it would appear form submissions from FF would end up marked as BINARY encoded and therefore trigger incompatible encoding exceptions.

    The XmlHttpRequest spec requires the request to include a charset segment in the Content-Type, so those appear to be OK.

  • Jeremy Kemper

    Jeremy Kemper August 4th, 2009 @ 03:10 AM

    That's a different scenario: you're missing the charset parameter of Content-Type, not missing the whole Content-Type header.

    When the charset is missing, ISO-8859-1 (latin1) is assumed per rfc2616 3.7.1.

  • Jeremy Kemper

    Jeremy Kemper August 4th, 2009 @ 03:13 AM

    BTW, the Content-Type fallback to application/octet-stream is also in rfc2616 7.2.1.
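    The fallbacks cited in the last few comments can be sketched as a single lookup. `body_encoding` is a hypothetical helper, not a Rack API, and it takes the rules exactly as stated in this thread: no header means binary, a header without a charset means ISO-8859-1.

```ruby
# Hypothetical helper: pick an Encoding for a request body from the
# Content-Type header value, using the fallbacks quoted in the thread.
def body_encoding(content_type)
  # No Content-Type at all: treat the body as application/octet-stream.
  return Encoding::BINARY if content_type.nil? || content_type.strip.empty?

  charset = content_type[/;\s*charset\s*=\s*"?([^";\s]+)"?/i, 1]
  # Content-Type present but no charset parameter: assume ISO-8859-1.
  return Encoding::ISO_8859_1 if charset.nil?

  begin
    Encoding.find(charset)
  rescue ArgumentError
    Encoding::BINARY # unknown charset name: make no claim about the bytes
  end
end

body_encoding(nil)
body_encoding("application/x-www-form-urlencoded")
body_encoding("application/x-www-form-urlencoded; charset=utf-8")
```

    Applied to the Firefox requests described above, the second call would label UTF-8 form data as ISO-8859-1, which is the mismatch James points out.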

  • James Healy

    James Healy August 4th, 2009 @ 03:18 AM

    Assuming ISO-8859-1 is nice in theory, but if I submit a form with UTF-8 data from FF 3.5, the Content-Type header of my request is still missing a charset.

  • Jeremy Kemper

    Jeremy Kemper August 4th, 2009 @ 03:38 AM

    Submit a form on a page served with what Content-Type?

  • James Healy

    James Healy August 4th, 2009 @ 03:44 AM

    "Content-Type: text/html; charset=utf-8"

    I'm aware that common wisdom is that browsers will submit data back in the character set the page was served. However such behaviour doesn't seem to gel with the RFC.

    Thanks for your quick responses, I'm genuinely interested in learning the most appropriate way to interpret HTTP requests.

  • Jeremy Kemper

    Jeremy Kemper August 4th, 2009 @ 03:53 AM

    I suppose this discussion should move to the rack list :)

  • Yugui (Yuki Sonoda)

    Yugui (Yuki Sonoda) August 4th, 2009 @ 08:08 AM

    Ruby's cgi.rb had the same problem. xibbar, the maintainer of cgi.rb, decided as
    http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=rev&revision=2... .

    In summary,
    * CGI::unescape takes an encoding as an optional parameter.
    * The default value is cgi.rb's default value of ACCEPT_CHARSET. This is UTF-8 unless a programmer overwrites the value.
    * This is because unescaped values are typically used for building a response.
    * And RFC3986 recommends UTF-8 as an encoding for a URI component.
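    The cgi.rb behaviour summarized above can be tried directly. This short example assumes a Ruby (1.9 or later) whose standard library provides CGI.unescape with the optional encoding argument.

```ruby
require 'cgi'

# Default: the result is labeled with cgi.rb's accept charset (UTF-8).
default = CGI.unescape("%C4%85")
puts default.encoding

# Explicit encoding: the same call, with the caller naming the charset.
latin2 = CGI.unescape("%B1", "ISO-8859-2")
puts latin2.encoding
```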
    
  • naruse

    naruse August 4th, 2009 @ 11:30 PM

    First of all, we should know that the standards around this are broken.
    We can't rely on any standards or theories, so we must decide everything based on real-world behaviour.


    1. Default encoding of application/x-www-form-urlencoded

    The answer is ASCII. (This may surprise you: not ISO-8859-1.)
    HTML 4.01, 17.13.4 Form content types, application/x-www-form-urlencoded

    RFC3986 and the other URL, URI, or IRI RFCs don't define application/x-www-form-urlencoded.
    Neither do the HTTP RFCs, of course.
    They give no hints about this problem.

    Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.

    I know none of you will be satisfied with this. This is why I said "broken".

    HTML5 tries to correct this.
    http://www.w3.org/TR/html5/forms.html#application-x-www-form-urlenc...
    If you make a web application, this may help you. (charset hack)

    2. QUERY_STRING
      CGI/1.1 defines this. But it collides with HTML4's ASCII and HTML5's UTF-8.

    When parsing and decoding the query string, the details of the parsing, reserved characters and support for non US-ASCII characters depends on the context. For example, form submission from an HTML document [18] uses application/x-www-form-urlencoded encoding, in which the characters "+", "&" and "=" are reserved, and the ISO 8859-1 encoding may be used for non US-ASCII characters.

    3. Browsers
      As you know, browsers often send no hints about the encoding of QUERY_STRING.

    But you also know that many browsers send data in the document's character encoding.
    And the creator of the document (= the user of Rack) knows the encoding of the document.

    4. Conclusion
      As I wrote above, a request itself doesn't carry the encoding information.
      The creator of the document is the only person who knows the request's encoding.
      So cgi.rb chose to give the encoding to CGI objects through CGI.accept_charset=.

    Some may worry about applications that receive requests from documents in different encodings.
    If so, use charset hacks and inspect the result.

    Others may worry about browsers that send data in an encoding different from the document's. But we have no information about those requests. We can only remember those browsers' User-Agent strings and run special logic. If you want to handle those browsers, Rack needs a way to accept a proc that decides the encoding of the request from some information (User-Agent, hidden input data and so on).

  • naruse

    naruse August 5th, 2009 @ 07:46 AM

    I forgot to write about another problem:
    the HTTP charset parameter is not reliable.

    In Japan, there are 3 charsets for networking: Shift_JIS, EUC-JP, and ISO-2022-JP.
    But many people use Windows-specific characters.
    So the real data is encoded in CP932, CP51932, or CP50221.

    Moreover, strictly speaking, Shift_JIS's 0x00-0x7F is not ASCII but JIS X 0201.
    This means Shift_JIS is strictly ASCII-incompatible; it's nonsense.

    So, for example, Firefox 3 uses CP932, a mixed EUC-JP+CP51932 encoding, and ISO-2022-JP as its real tables.

    I heard Taiwan also has a sad situation like this around Big5 and Big5-UAO.
    People in the EU may also suffer from ISO-8859-X versus Windows-125X.

    This difference affects conversion and validation (whether a character exists or not, whether a byte sequence is valid or not).

    These real encodings are not mentioned in the HTTP charset parameter.
    So the HTTP charset parameter is not reliable.

  • Jeremy Kemper

    Jeremy Kemper August 5th, 2009 @ 08:00 AM

    Wow! head explodes

    I wish we could make better sense of this all. Perhaps with a HTML 5 mode?

    Otherwise, we have no sane option for dealing with incoming charsets.

  • James Healy

    James Healy August 5th, 2009 @ 08:35 AM

    So.... the answer is you can never reliably trust the encoding of an HTTP request and you're best to assume it's the same as the last served page?

    Aren't web standards a beautiful thing? :)

  • Jeremy Kemper

    Jeremy Kemper August 5th, 2009 @ 05:34 PM

    Seems you can't even assume it's the same as the last-served page!

  • naruse

    naruse August 6th, 2009 @ 06:05 PM

    "we have no sane option for dealing with incoming charsets."

    Yes, there is no silver bullet as far as I know.

    "I wish we could make better sense of this all. Perhaps with a HTML 5 mode?"

    An HTML 5 mode (accept_charset, the document's encoding, UTF-8, charset) will work with modern browsers.
    But if you can't use UTF-8, or must handle browsers that send requests in an encoding different from the document's, you can't use this.


    A simple and basic solution for Rack is to assume all data is ASCII-8BIT.
    Throw away the idea that the body is a character string; it's a byte stream.

    You may think this only passes the problem on to others built on Rack.
    But middlewares can embed the document's encoding name in a hidden input element and read it back from the request. (Yes, you can't assume the last-served page.)
    If you do so, you can handle browsers that send requests in the document's encoding even if you send documents in different encodings to various browsers.
    Hard workers may inspect user agents and handle, one by one, the browsers that don't send requests in the document's encoding.
    At this point, you can support almost all browsers.
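    The hidden-input approach can be sketched as follows. Everything here is an assumption made for illustration: the helper name, the field name "_charset_", and the premise that params arrive as ASCII-8BIT strings.

```ruby
# Hypothetical helper: params are assumed to be ASCII-8BIT strings, with a
# hidden form field (here named "_charset_") holding the name of the
# encoding the document was served in.
def label_with_hidden_charset(params, field = "_charset_", default = Encoding::BINARY)
  name = params[field]
  encoding = begin
    name ? Encoding.find(name) : default
  rescue ArgumentError
    default # unknown name: leave the bytes unlabeled
  end

  labeled = {}
  params.each do |key, value|
    # Relabel only; the bytes themselves are not transcoded.
    labeled[key] = value.dup.force_encoding(encoding)
  end
  labeled
end

raw = { "_charset_" => "Windows-31J".b, "foo" => "\x82\xA0".b }
params = label_with_hidden_charset(raw)
puts params["foo"].encoding
```

    The field value is an ASCII encoding name, so it can be read safely before any decision about the other values has been made.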


    Rack can choose a kinder way.
    This is based on the first way but adds helpers.

    Rack separates the runner from the request parser. So for example,
    * add an optional argument like Rack::Request#new(env, :encoding => 'UTF-8')
    * env is ASCII-8BIT
    * strings in req.params are labeled with the given encoding.

    When some browsers send requests in Windows-31J and others send them in UTF-8, users can write the following:

    #!/usr/bin/env ruby
    require 'rubygems'
    require 'rack'
    
    class Rackapp
      def call(env)
        mobile_p = /^(?:DoCoMo|KDDI|SoftBank)/ =~ env['HTTP_USER_AGENT']
        encoding = mobile_p ? Encoding::WINDOWS_31J : Encoding::UTF_8
        req = Rack::Request.new(env, :encoding => encoding)
        [200, {"Content-Type" => "text/plain"},
          ["Hi!\n"+req.params["foo"]+req.params["foo"].encoding.to_s]]
      end
    end
    
    Rack::Handler::CGI.run Rackapp.new
    

    Note that if you want to get information from POSTed data, you can't get it from env. So you may pass :encoding => proc{|env,params| ..} and work with params which are still labeled as ASCII-8BIT.
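    A rough sketch of how such an option could be resolved. `apply_encoding` is a hypothetical helper and not part of Rack; it accepts either a fixed Encoding or a callable that receives env and the still-binary params.

```ruby
# Hypothetical helper, not part of Rack: `encoding` is either an Encoding
# or a callable taking (env, raw_params) and returning an Encoding.
def apply_encoding(env, raw_params, encoding)
  chosen = encoding.respond_to?(:call) ? encoding.call(env, raw_params) : encoding
  raw_params.each_with_object({}) do |(key, value), labeled|
    labeled[key] = value.dup.force_encoding(chosen)
  end
end

# A chooser may look at env, at the binary params, or at both.
chooser = proc do |env, _params|
  mobile = /\A(?:DoCoMo|KDDI|SoftBank)/ =~ env['HTTP_USER_AGENT'].to_s
  mobile ? Encoding::Windows_31J : Encoding::UTF_8
end

env = { 'HTTP_USER_AGENT' => 'DoCoMo/2.0' }
params = apply_encoding(env, { 'foo' => "\x82\xA0".b }, chooser)
puts params['foo'].encoding
```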
    If you want to get information from the charset parameter or a hidden charset field, you should use a table that maps charset names to Ruby encodings, like this (but I recommend not trying to get information from charsets):

    table_from_charset_to_encoding = {
      'shift_jis' => Encoding::WINDOWS_31J,
      'euc-jp' => Encoding::CP51932,
      'eucjp' => Encoding::CP51932,
      'iso-2022-jp' => Encoding::CP50221,
    }
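    A lookup built on such a table might look like the sketch below. The constant and method names are hypothetical; names missing from the table fall through to Encoding.find, and unknown names yield a caller-supplied default.

```ruby
# Hypothetical names; the table entries are the ones listed above.
CHARSET_TO_ENCODING = {
  'shift_jis'   => Encoding::WINDOWS_31J,
  'euc-jp'      => Encoding::CP51932,
  'eucjp'       => Encoding::CP51932,
  'iso-2022-jp' => Encoding::CP50221,
}.freeze

# Map a charset label to a Ruby Encoding: table first, then Ruby's own
# registry, then the default when the name is unknown.
def encoding_for_charset(charset, default = Encoding::BINARY)
  key = charset.to_s.strip.downcase
  CHARSET_TO_ENCODING.fetch(key) { Encoding.find(key) }
rescue ArgumentError
  default
end

puts encoding_for_charset('Shift_JIS')
puts encoding_for_charset('utf-8')
```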
    
  • naruse

    naruse August 6th, 2009 @ 06:18 PM

    "Aren't web standards a beautiful thing? :)"

    The real world is dirty and the web is a mirror of it. Software and standards that handle dirty things will be dirty too. Only restructuring prevents this, but web standards have been left as they are for 10 years.

    Ruby is also one of them. So we work on Ruby 1.9.

  • naruse

    naruse August 27th, 2009 @ 04:42 AM

    For your information, recent HTML5 has a chapter for this problem; "2.7 Character encodings".
    http://www.whatwg.org/specs/web-apps/current-work/multipage/infrast...

    This describes Character encoding overrides for compatibility.

  • Ronie Uliana

    Ronie Uliana February 13th, 2010 @ 02:18 PM

    I know this is an old thread, but just for the sake of completeness:

    I found this link very useful: http://www.crazysquirrel.com/computing/general/form-encoding.jspx

    It seems the "hack" way to get the form encoding is to add a hidden field named "charset" to the form. I tested it on Firefox (1.3.6 pre) and it works; on Chrome I tested it and it didn't work; on IE I didn't test.

    Also, searching a lot on the web (sorry, no specific links), I found that IE seems to assume cp1252 as the default and other browsers ISO8859-1 (this information needs to be checked, of course).

    On the practical side, as I'm using UTF-8 all over the place, I wrote a little rack to force the encoding.

  • Paul

    Paul April 26th, 2010 @ 08:38 AM

    an email from me at rails core in response to Cezary Baginski "Overview of Ruby 1.9 encoding problem tickets"

    • just for the sake of completeness - and maybe copy"n"paste for others willing to enter the 1.9 adventure outside the US !

    well - i "upgraded" our site running in germany to ruby1.9.1, unicorn and rails 2.3.6
    even with using utf-8 as a default i had to make various patches within rack to get it up and running.

    rack: utils

    # Unescapes a URI escaped string. (Stolen from Camping).
    def unescape(s)
      result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
        [$1.delete('%')].pack('H*')
      }               
      RUBY_VERSION >= "1.9" ? result.force_encoding(Encoding::UTF_8) : result     
    end
    module_function :unescape
    

    found at lighthouse...

    the next one is horrible - i know, but it works for now:

    def parse_query(qs, d = nil)
      params = {}
    
      (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
        k, v = p.split('=', 2).map { |x| unescape(x) }
        begin
          if v =~ /^("|')(.*)\1$/
            v = $2.gsub('\\'+$1, $1)
          end
        rescue
          v.force_encoding('ISO-8859-1')
          v.encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')          
          if v =~ /^("|')(.*)\1$/
            v = $2.gsub('\\'+$1, $1)
          end
        end
    

    (we use analytics at the site - analytics stores the last search query within a cookie. If a user browses google and finds the site with an umlaut query, this query will be stored within the cookie. parse_query is used by rack to parse cookies too. guess what - it will go boom if you use utf-8 as a default and get an incoming cookie with a different encoding.)
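    The same fallback can be written without relying on an exception. This is a sketch only: `to_utf8` is a hypothetical helper, and treating every non-UTF-8 value as ISO-8859-1 is an assumption carried over from the hack above.

```ruby
# Hypothetical helper: label the bytes as UTF-8 when they are valid UTF-8,
# otherwise assume a fallback charset and transcode to UTF-8.
def to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  s = bytes.dup.force_encoding(Encoding::UTF_8)
  return s if s.valid_encoding?

  s.force_encoding(fallback)
   .encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')
end

to_utf8("\xC3\xBC".b)   # already UTF-8 bytes, only relabeled
to_utf8("\xFC".b)       # a lone Latin-1 byte, transcoded to UTF-8
```

    Checking valid_encoding? up front keeps the quote-stripping regexp from ever seeing a string with a broken UTF-8 label.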

    the next ugly thing :)

    def normalize_params(params, name, v = nil)
      if v and v =~ /^("|')(.*)\1$/
        v = $2.gsub('\\'+$1, $1)
      end
      name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
      k = $1 || ''
      after = $' || ''
    
      return if k.empty?
    
    
      if after == "" 
        params[k] = (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
        # params[k] = v
      elsif after == "[]"
        params[k] ||= []
        raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
        params[k] << (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
        # params[k] << v
      elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)
    

    all patches i found did not include the multipart solution ... this hack makes sure that multipart variables will be utf-8 forced too ...

    Yes / i am glad and thank you that you made this overdue summary!
    i hope others will have a better start into the ruby1.9 rails 2.3 world than me.
    In fact there were times i really wondered why someone dares to state that rails is 1.9 compatible for a real world (not real US) app!

    Thanks a lot!

  • Serge Balyuk

    Serge Balyuk June 28th, 2010 @ 10:22 AM

    • Milestone order changed from “0” to “0”

    I've just added a patch #100 that should be addressing some of these problems without hardcoding UTF-8 into rack core.
