#48 Rack::Utils.unescape problems in Ruby 1.9.1

#48 ✓invalid

Reported by qoobaa | May 12th, 2009 @ 07:12 PM | in 1.0

The problem is related to the "incompatible character encodings: ASCII-8BIT and UTF-8" exception in Rails. It is caused by the unescape function used to parse query params:


```
Rack::Utils.unescape("%C4%85")
#=> "\xC4\x85"
Rack::Utils.unescape("%C4%85").encoding
#=> #<Encoding:ASCII-8BIT>
```

It should detect the string encoding (somehow). If we pass params encoded with ASCII-8BIT to ActiveRecord classes, we gain no profit from using Ruby 1.9.1.
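For readers who want to reproduce the exception without Rails or Rack, here is a minimal sketch. It assumes only core Ruby 1.9+; `raw_bytes` and `concat_error` are hypothetical helper names made up for this illustration. It builds the same two bytes that the unescape call above returns and joins them with a UTF-8 string that contains non-ASCII characters.

```ruby
# The two bytes of "%C4%85", tagged ASCII-8BIT (binary) like the result above.
def raw_bytes
  ["C485"].pack("H*")
end

# Returns the exception raised by a + b, or nil when the concatenation works.
def concat_error(a, b)
  a + b
  nil
rescue Encoding::CompatibilityError => e
  e
end

utf8_text = "Za\u017C\u00F3\u0142\u0107 " # UTF-8 text with non-ASCII characters

# Binary bytes with the high bit set cannot be joined with non-ASCII UTF-8 text.
p concat_error(utf8_text, raw_bytes).class

# Relabelling the same bytes as UTF-8 lets the concatenation succeed.
p concat_error(utf8_text, raw_bytes.force_encoding(Encoding::UTF_8))
```

The first call reports an Encoding::CompatibilityError and the second returns nil. Nothing is transcoded; only the label on the bytes changes.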

Edit:

Can we use charset from Content-Type header? (e.g. application/x-www-form-urlencoded; charset=utf-8)

Comments and changes to this ticket

Ryan Tomayko May 12th, 2009 @ 11:16 PM
My understanding is that all browsers use the Content-Type/charset response header of the page submitting the form (or that contains the link) when possible and fall back to ASCII. That's always seemed like a sane thing to do but I can't think of a way to automatically detect it on the request side. I assume the Content-Type request header includes the charset on POST but I'm not sure how to handle GET.
qoobaa May 12th, 2009 @ 11:42 PM
- Tag set to “patch, ruby-1.9, utils”
I suggest a solution similar to rails/actionpack/lib/action_view/template/handlers/erb.rb. It's the easiest way to fix it, using UTF-8 encoding as the default.
- unescape_utf8_ruby_1_9_1.diff 1.5 KB
manveru May 16th, 2009 @ 05:11 PM
- Assigned user set to “manveru”
May be the easiest way, but also wrong if you want to support another encoding, not to mention that force_encoding is not very pretty and might hide encoding-conflicts.
There is also the possibility of not touching the encoding in the unescape method, and leaving it to wrappers to change it, that gives frameworks the possibility to add configuration for the behaviour with reasonable default instead of having UTF-8 hardcoded inside Rack (which is encoding-agnostic everywhere else).
I'd also suggest to make this capability-based instead of version-based, asking .respond_to?(:encode) if we tackle encoding in this method at all.
I'm taking responsibility for this change for now, as I'm quite interested in the outcome and do a lot of i18n stuff anyway.
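A minimal sketch of the layering described in this comment, under stated assumptions: `unescape_bytes` and `unescape_as` are hypothetical names, not Rack API. The check uses `respond_to?(:force_encoding)` because that is the method actually called; the comment suggests `:encode`, which exists on the same Ruby versions. The low-level function leaves the bytes untagged, and the wrapper is where a framework could plug in its configured default.

```ruby
# Low level: decode percent-escapes into raw bytes and leave them binary.
def unescape_bytes(s)
  s.b.tr('+', ' ').gsub(/(?:%\h{2})+/n) { |m| [m.delete('%')].pack('H*') }
end

# Wrapper level: tag the bytes with a configurable encoding.
# Capability-based: skipped on Rubies whose strings carry no encoding.
def unescape_as(s, encoding = Encoding::UTF_8)
  bytes = unescape_bytes(s)
  return bytes unless bytes.respond_to?(:force_encoding)
  bytes.force_encoding(encoding)
end

p unescape_as("%C4%85").valid_encoding? # bytes form valid UTF-8
p unescape_as("%E9").valid_encoding?    # a lone Latin-1 byte does not
```

The second call illustrates the concern about hidden conflicts: force_encoding never raises, so a caller has to ask valid_encoding? to notice that the label does not fit the bytes.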
qoobaa May 16th, 2009 @ 09:11 PM
I agree with you. Obviously there must be some way to change the encoding. I needed a quick fix for the issue; anybody who wants a simple temporary solution can use the patch. Anyway, RoR and the DB adapters are also broken; the easiest way to fix them is to enforce ASCII-8BIT encoding everywhere for now.
josh August 3rd, 2009 @ 04:19 PM
- State changed from “new” to “open”
- Milestone set to “1.0”
Jeremy Kemper August 4th, 2009 @ 02:13 AM
This is addressing a symptom. The root cause is that params parsing doesn't respect string encodings.

Params parsed from the query string should always be UTF-8, per RFC.

Params parsed from the body should have encoding set according to the Content-Type header.

Rather than shift the burden of setting encoding to every middleware and app which reads rack.input, this should be an additional spec requirement for the rack adapters:
1. must have correct encoding for PATH_INFO and friends
2. must have rack.input external_encoding reflect the Content-Type

And this ticket may be closed.
josh August 4th, 2009 @ 02:14 AM
- State changed from “open” to “invalid”
James Healy August 4th, 2009 @ 02:23 AM
@Jeremy I agree with your comments.

Out of interest, which RFC says query strings are always encoded as UTF-8? I spent some time looking for an RFC that provided guidance on the encoding of HTTP requests and failed. I'm also keen to read up on how to interpret the encoding in the very common case of a HTTP request with no Content-Type header.

Jeremy Kemper August 4th, 2009 @ 02:37 AM

Missing Content-Type could be treated as binary/octet-stream with Encoding::BINARY.

I'm wrong about URIs: it's just a strong suggestion.

From http://www.ietf.org/rfc/rfc2718.txt:

2.2.5 Character encoding

  When describing URL schemes in which (some of) the elements of the
  URL are actually representations of sequences of characters, care
  should be taken not to introduce unnecessary variety in the ways
  in which characters are encoded into octets and then into URL
  characters.  Unless there is some compelling reason for a
  particular scheme to do otherwise, translating character sequences
  into UTF-8 (RFC 2279) [3] and then subsequently using the %HH
  encoding for unsafe octets is recommended.

So it's actually a recommendation only for URIs. For the sake of usability, and with an eye forward to IRIs, we should probably enforce it (the IRI <-> URI conversions detailed in http://www.ietf.org/rfc/rfc3987 use hex-encoded UTF-8).
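To make the recommendation quoted above concrete, here is a small sketch of "translate to UTF-8, then %HH-escape". `percent_encode_utf8` and `UNRESERVED` are hypothetical names for this illustration. Leaving only the RFC 3986 unreserved characters unescaped is an assumption made here; which octets count as unsafe depends on the URI component.

```ruby
# Characters left as-is (the RFC 3986 "unreserved" set).
UNRESERVED = /\A[A-Za-z0-9\-._~]\z/

# Transcode to UTF-8 first, then %HH-escape every other octet.
def percent_encode_utf8(text)
  text.encode(Encoding::UTF_8).each_byte.map { |b|
    b < 128 && UNRESERVED.match?(b.chr) ? b.chr : format('%%%02X', b)
  }.join
end

latin1 = "ros\xE9".force_encoding(Encoding::ISO_8859_1)

puts percent_encode_utf8("\u0105") # the character from this ticket's report
puts percent_encode_utf8(latin1)   # Latin-1 input is transcoded before escaping
```

The first line prints %C4%85, the escaped form from the original report. The Latin-1 input prints ros%C3%A9 rather than ros%E9, because the text is converted to UTF-8 before its octets are escaped.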

From http://www.ietf.org/rfc/rfc3987:

Appendix A.4.  Indicating Character Encodings in the URI/IRI

   Some proposals suggested indicating the character encodings used in
   an URI or IRI with some new syntactic convention in the URI itself,
   similar to the "charset" parameter for e-mails and Web pages.  As an
   example, the label in square brackets in
   "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
   following "&#xE9;" had to be interpreted as iso-8859-1.

   If UTF-8 is used exclusively, an upgrade to the URI syntax is not
   needed.  It avoids potentially multiple labels that have to be copied
   correctly in all cases, even on the side of a bus or on a napkin,
   leading to usability problems (and being prohibitively annoying).
   Exclusively using UTF-8 also reduces transcoding errors and
   confusion.

BTW, the IRI spec is driven by the same guy behind Ruby 1.9's encoding effort: Martin Dürst. Cool!

James Healy August 4th, 2009 @ 03:01 AM
Interesting, I hadn't come across the IRI RFC before.

WRT HTTP requests with no Content-Type, using BINARY as an encoding may not be a workable solution. In my testing with Firefox 3.0-3.5, standard POST requests have a Content-Type with no charset segment. Like so:
```
"Content-Type: application/x-www-form-urlencoded"
```
I'm not sure how other browsers act, but in this case it would appear form submissions from FF would end up marked as BINARY encoded and therefore trigger incompatible encoding exceptions.

The XmlHttpRequest spec requires the request to include a charset segment in the Content-Type, so those appear to be OK.
Jeremy Kemper August 4th, 2009 @ 03:10 AM
That's a different scenario: you're missing the charset parameter of Content-Type, not missing the whole Content-Type header.

When the charset is missing, ISO-8859-1 (latin1) is assumed per rfc2616 3.7.1.
Jeremy Kemper August 4th, 2009 @ 03:13 AM
BTW, the Content-Type fallback to application/octet-stream is also in rfc2616 7.2.1.
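The fallbacks named in the last few comments can be written down as a small decision function. This is only a sketch of the rules as stated in this thread: `body_encoding_for` is a hypothetical name, the regular expression is a simplification of Content-Type parsing, and treating an unknown charset name as binary is an assumption of this example.

```ruby
# Pick an encoding for a request body from its Content-Type header value.
def body_encoding_for(content_type)
  # No Content-Type at all: treat the body as opaque bytes.
  return Encoding::BINARY if content_type.nil? || content_type.strip.empty?

  charset = content_type[/;\s*charset\s*=\s*"?([^";\s]+)"?/i, 1]
  # Content-Type present but no charset parameter: the ISO-8859-1 default.
  return Encoding::ISO_8859_1 if charset.nil?

  Encoding.find(charset)
rescue ArgumentError
  # Charset name unknown to Ruby: fall back to bytes (assumption).
  Encoding::BINARY
end

p body_encoding_for(nil)
p body_encoding_for("application/x-www-form-urlencoded")
p body_encoding_for("application/x-www-form-urlencoded; charset=utf-8")
```

The middle case is the one the Firefox report above runs into: a form posted from a UTF-8 page with no charset parameter would be labelled ISO-8859-1 by this rule.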
James Healy August 4th, 2009 @ 03:18 AM
Assuming ISO-8859-1 is nice in theory, but if I submit a form with UTF-8 data from FF 3.5, the Content-Type header of my request is still missing a charset.
Jeremy Kemper August 4th, 2009 @ 03:38 AM
Submit a form on a page served with what Content-Type?
James Healy August 4th, 2009 @ 03:44 AM
"Content-Type: text/html; charset=utf-8"

I'm aware that common wisdom is that browsers will submit data back in the character set the page was served. However such behaviour doesn't seem to gel with the RFC.

Thanks for your quick responses, I'm genuinely interested in learning the most appropriate way to interpret HTTP requests.
Jeremy Kemper August 4th, 2009 @ 03:53 AM
I suppose this discussion should move to the rack list :)
Yugui (Yuki Sonoda) August 4th, 2009 @ 08:08 AM
Ruby's cgi.rb had the same problem. xibbar, the maintainer of cgi.rb, settled it in
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=rev&revision=2... .

In summary,
* CGI::unescape takes an encoding as an optional parameter.
* The default value is cgi.rb's default value of ACCEPT_CHARSET. This is UTF-8 unless a programmer overwrites the value.
* This is because unescaped values are typically used for building a response.
* And RFC3986 recommends UTF-8 as an encoding for a URI component.
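As an illustration of the "optional encoding parameter with a UTF-8 default" shape summarized above, Ruby's standard library offers a call with the same signature style: URI.decode_www_form_component(str, enc = Encoding::UTF_8). The sketch below uses that method, not cgi.rb itself. The helper name `decode_component` and the EUC-JP sample bytes are chosen only for this example.

```ruby
require 'uri'

# Thin wrapper so the default is visible at the call site.
def decode_component(escaped, encoding = Encoding::UTF_8)
  URI.decode_www_form_component(escaped, encoding)
end

utf8  = decode_component("%C4%85")                   # default: tagged UTF-8
eucjp = decode_component("%A4%A2", Encoding::EUC_JP) # caller names the encoding

p utf8.encoding
p eucjp.encoding
p eucjp.encode(Encoding::UTF_8) == "\u3042" # same character after transcoding
```

The bytes are only labelled, never guessed: the caller (or a configured default) supplies the encoding, which matches the conclusion that the request alone cannot.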
naruse August 4th, 2009 @ 11:30 PM
First of all, we should know that the standards around this are broken.
We can't rely on any standards or theories, so we must decide everything based on real-world behavior.
1. Default encoding of application/x-www-form-urlencoded
The answer is ASCII. (This may surprise you: not ISO-8859-1.)
17.13.4 Form content types application/x-www-form-urlencoded

RFC3986 and the other URL, URI and IRI RFCs don't define application/x-www-form-urlencoded.
Neither, of course, do the HTTP RFCs.
They give no hints about this problem.

Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.

I know none of you will be satisfied with this. This is why I said "broken".

HTML5 tries to correct this.
http://www.w3.org/TR/html5/forms.html#application-x-www-form-urlenc...
If you make a web application, this may help you. (_charset_ hack)
2. QUERY_STRING
  CGI/1.1 defines this. But it collides with HTML4's ASCII and HTML5's UTF-8.
When parsing and decoding the query string, the details of the parsing, reserved characters and support for non US-ASCII characters depends on the context. For example, form submission from an HTML document [18] uses application/x-www-form-urlencoded encoding, in which the characters "+", "&" and "=" are reserved, and the ISO 8859-1 encoding may be used for non US-ASCII characters.
3. Browsers
  As you know, browsers often send no hint of the encoding of QUERY_STRING.
But you also know that many browsers send it in the document's character encoding.
And the creator of the document (= the user of Rack) knows the encoding of the document.
4. Conclusion
  As I wrote above, a request itself doesn't carry encoding information.
  The creator of the document is the only person who knows the request's encoding.
  So cgi.rb chose to give the encoding to CGI objects through CGI.accept_charset=.
Someone may worry about applications which receive requests from documents in different encodings.
If so, use the _charset_ hack and check it.

And others may worry about browsers which send data in an encoding different from the document's. But we have no information about those requests. We can only remember those browsers' User Agents and run special logic. If you want to handle those browsers, Rack would need a way to accept a proc which decides the encoding of the request from some information (User Agent, hidden input data and so on).
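A sketch of how the hidden-field hint mentioned in this comment could be consumed, under assumptions: the field is taken to be the `_charset_` hidden input that HTML5-era browsers fill in with the encoding they used, the params have already been parsed into a flat hash of binary strings, and `apply_charset_hint` is a hypothetical helper name.

```ruby
# Tag binary param values with the encoding named in the "_charset_" field.
# Falls back to the given default when the field is missing or unknown.
def apply_charset_hint(params, fallback = Encoding::UTF_8)
  name = params["_charset_"]
  encoding =
    begin
      name ? Encoding.find(name) : fallback
    rescue ArgumentError
      fallback # browser sent a name Ruby does not know
    end

  params.each_with_object({}) do |(key, value), tagged|
    tagged[key] = value.dup.force_encoding(encoding)
  end
end

latin1_form = { "_charset_" => "iso-8859-1", "q" => "ros\xE9".b }
tagged = apply_charset_hint(latin1_form)

p tagged["q"].encoding
p tagged["q"].encode(Encoding::UTF_8) == "ros\u00E9"
```

The hint is only as trustworthy as the browser sending it, which is the caveat raised about charset names elsewhere in this thread.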
naruse August 5th, 2009 @ 07:46 AM
I forgot to write another problem:
HTTP charset parameter is not reliable

In Japan, there are 3 charsets for networking: Shift_JIS, EUC-JP, and ISO-2022-JP.
But many people use Windows-specific characters.
So the real data is encoded in CP932, CP51932, CP50221.

Moreover, strictly speaking, Shift_JIS's 0x00-0x7F is not ASCII but JIS X 0201.
This means Shift_JIS is strictly ASCII-incompatible; it's nonsense.

So for example, Firefox 3 uses CP932, an EUC-JP+CP51932 mixed encoding, and ISO-2022-JP as its real tables.

I heard Taiwan also has a sad situation like this around Big5 and Big5-UAO.
People in the EU may also suffer from ISO-8859-X versus Windows-125X.

This difference affects conversion and validation (whether a character exists or not, whether a byte sequence is valid or not).

These real encodings are not expressed by the HTTP charset parameter.
So the HTTP charset parameter is not reliable.
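The gap between a declared charset and the table actually in use can be observed from Ruby. The sketch below assumes Ruby's built-in transcoders, where Windows-31J is the name for CP932. `encodable?` is a hypothetical helper, and the circled digit one (U+2460) is picked here as one example of a Windows-specific character.

```ruby
# True when every character of text exists in the named charset's table.
def encodable?(text, charset)
  text.encode(charset)
  true
rescue Encoding::UndefinedConversionError
  false
end

CIRCLED_ONE = "\u2460" # a character commonly typed on Windows in Japan

p encodable?(CIRCLED_ONE, "Windows-31J") # present in the CP932 table
p encodable?(CIRCLED_ONE, "Shift_JIS")   # absent from Ruby's Shift_JIS table
```

A request labelled charset=Shift_JIS that contains this character is therefore only convertible when it is treated as Windows-31J, which is why the mapping table proposed later in the thread points 'shift_jis' at Encoding::WINDOWS_31J.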
Jeremy Kemper August 5th, 2009 @ 08:00 AM
Wow! head explodes

I wish we could make better sense of this all. Perhaps with a HTML 5 mode?

Otherwise, we have no sane option for dealing with incoming charsets.
James Healy August 5th, 2009 @ 08:35 AM
So.... the answer is you can never reliably trust the encoding of a HTTP request and you're best to assume it's the same as the last served page?

Aren't web standards a beautiful thing? :)
Jeremy Kemper August 5th, 2009 @ 05:34 PM
Seems you can't even assume it's the same as the last-served page!
naruse August 6th, 2009 @ 06:05 PM
we have no sane option for dealing with incoming charsets.

Yes, there is no silver bullet as far as I know.

I wish we could make better sense of this all. Perhaps with a HTML 5 mode?

HTML 5 mode (accept_charset, the document's encoding, UTF-8, _charset_) will work with modern browsers.
But if you can't use UTF-8, or must handle browsers which send requests in an encoding different from the document's, you can't use this.

A simple and basic solution for Rack is to assume all data is ASCII-8BIT.
Throw away the idea that the body is a character string; it's a byte stream.

You may think this only passes the problem on to others on top of Rack.
But middlewares can embed the document's encoding name in a hidden input element and read it from the request. (Yes, you can't assume the last served page.)
If you do so, you can handle browsers that send requests in the document's encoding even if you send documents in different encodings to various browsers.
Hard workers may look at user agents and handle, one by one, the browsers which don't send requests in the document's encoding.
At this point, you can support almost all browsers.

Rack can choose a kinder way.
This is based on the first way but with helpers.

Rack splits a runner and a request parser. So for example,
* add an optional argument like Rack::Request#new(env, :encoding => 'UTF-8')
* env is ASCII-8BIT
* strings in req.params are encoded in the given encoding.

When some browsers send requests in Windows-31J and others send them in UTF-8, users can write the following:
```
#!/usr/bin/env ruby
require 'rubygems'
require 'rack'

class Rackapp
  def call(env)
    mobile_p = /^(?:DoCoMo|KDDI|SoftBank)/ =~ env['HTTP_USER_AGENT']
    encoding = mobile_p ? Encoding::WINDOWS_31J : Encoding::UTF_8
    req = Rack::Request.new(env, :encoding => encoding)
    [200, {"Content-Type" => "text/plain"},
      ["Hi!\n"+req.params["foo"]+req.params["foo"].encoding.to_s]]
  end
end

Rack::Handler::CGI.run Rackapp.new
```
Note that if you want to get information from POSTed data, you can't get it from env. So you may give :encoding => proc{|env,params| ..} and work with params which are still labeled as ASCII-8BIT.
If you want to get information from the charset parameter or a hidden _charset_ field, you should use a table that maps charset names to Ruby encodings, like the following (but I recommend not trying to get information from charsets):
```
table_from_charset_to_encoding = {
  'shift_jis' => Encoding::WINDOWS_31J,
  'euc-jp' => Encoding::CP51932,
  'eucjp' => Encoding::CP51932,
  'iso-2022-jp' => Encoding::CP50221,
}
```
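One way the table above might be consulted, as a sketch only: `encoding_for_charset` is a hypothetical helper, and both the binary fallback and the choice to pass unlisted names to Encoding.find are assumptions of this example rather than part of the proposal.

```ruby
# The mapping from the comment above: declared charset => table really in use.
CHARSET_TO_ENCODING = {
  'shift_jis'   => Encoding::WINDOWS_31J,
  'euc-jp'      => Encoding::CP51932,
  'eucjp'       => Encoding::CP51932,
  'iso-2022-jp' => Encoding::CP50221,
}.freeze

# Look a charset name up case-insensitively; names outside the table are
# resolved by Ruby itself, and unknown names yield the fallback.
def encoding_for_charset(name, fallback = Encoding::BINARY)
  return fallback if name.nil?

  CHARSET_TO_ENCODING.fetch(name.strip.downcase) do |key|
    begin
      Encoding.find(key)
    rescue ArgumentError
      fallback
    end
  end
end

p encoding_for_charset("Shift_JIS")
p encoding_for_charset("UTF-8")
p encoding_for_charset("no-such-charset")
```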
naruse August 6th, 2009 @ 06:18 PM
Aren't web standards a beautiful thing? :)

The real world is dirty and the web is a mirror of it. Software and standards which handle dirty things will be dirty. Only restructuring prevents this, but web standards have been left as they are for 10 years.

Ruby is also one of them. So we work on Ruby 1.9.
naruse August 27th, 2009 @ 04:42 AM
For your information, recent HTML5 has a chapter for this problem; "2.7 Character encodings".
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrast...

This describes Character encoding overrides for compatibility.
Ronie Uliana February 13th, 2010 @ 02:18 PM
I know this is an old thread, but just for the sake of completeness:

I found this link very useful: http://www.crazysquirrel.com/computing/general/form-encoding.jspx

It seems the "hack" way to get the form encoding is to add a hidden field named "_charset_" to the form. I tested it on Firefox (1.3.6 pre) and it works; on Chrome I tested it and it didn't work; on IE I didn't test.

Also, searching a lot on the web (sorry, no specific links), I found that IE seems to assume cp1252 as the default and other browsers ISO8859-1 (this information needs to be checked, of course).

On the practical side, as I'm using UTF-8 all over the place, I wrote a little rack to force the encoding.
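The code mentioned here was not posted, so the following is only a guess at the kind of helper such a piece might contain: a recursive relabel of already-parsed params. `deep_force_encoding` is a hypothetical name, and the nested hash stands in for whatever a params parser would produce.

```ruby
# Return a copy of a params structure with every string relabelled.
# Strings are duplicated, so the input is left untouched.
def deep_force_encoding(value, encoding = Encoding::UTF_8)
  case value
  when String
    value.dup.force_encoding(encoding)
  when Array
    value.map { |item| deep_force_encoding(item, encoding) }
  when Hash
    value.each_with_object({}) do |(key, item), out|
      out[deep_force_encoding(key, encoding)] = deep_force_encoding(item, encoding)
    end
  else
    value
  end
end

parsed = { "user".b => { "name".b => "\xC4\x85".b, "tags".b => ["\xC3\xBC".b] } }
utf8 = deep_force_encoding(parsed)

p utf8["user"]["name"].encoding
p utf8["user"]["tags"].first.valid_encoding?
```

This only works when every client really sends UTF-8; bytes in another encoding come out mislabelled rather than rejected.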

Paul April 26th, 2010 @ 08:38 AM

an email from me at rails core in response to Cezary Baginski "Overview of Ruby 1.9 encoding problem tickets"

just for the sake of completeness - and maybe copy"n"paste for others willing to enter the 1.9 adventure outside the US !

well - i "upgraded" our site running in germany to ruby1.9.1, unicorn and rails 2.3.6
even with using utf-8 as a default i had to make various patches within rack to get it up and running.

rack: utils

```
# Unescapes a URI escaped string. (Stolen from Camping).
def unescape(s)
  result = s.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n){
    [$1.delete('%')].pack('H*')
  }
  RUBY_VERSION >= "1.9" ? result.force_encoding(Encoding::UTF_8) : result
end
module_function :unescape
```

found at lighthouse...

the next one is horrible - i know, but it works for now:

```
def parse_query(qs, d = nil)
  params = {}

  (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
    k, v = p.split('=', 2).map { |x| unescape(x) }
    begin
      if v =~ /^("|')(.*)\1$/
        v = $2.gsub('\\'+$1, $1)
      end
    rescue
      v.force_encoding('ISO-8859-1')
      v.encode!('UTF-8',:invalid => :replace, :undef => :replace, :replace => '')
      if v =~ /^("|')(.*)\1$/
        v = $2.gsub('\\'+$1, $1)
      end
    end
```

(we use analytics at the site - analytics stores the last search query within a cookie. If a user browses google and finds the site with an umlaut query, this query will be stored within the cookie. parse_query is used by rack to parse cookies too. guess what - it will go boom if you use utf-8 as a default and get an incoming cookie with a different encoding.)
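The fallback that the rescue branch above reaches for can be isolated into a few lines. This is a sketch, not the patch itself: `utf8_or_latin1` is a hypothetical name. It tests the bytes with valid_encoding? instead of waiting for an exception, and it assumes ISO-8859-1 is the only other encoding that shows up, as in the cookie case described.

```ruby
# Label bytes as UTF-8 when they form valid UTF-8; otherwise treat them as
# ISO-8859-1 and transcode, so the caller always receives UTF-8 text.
def utf8_or_latin1(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  text.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

from_utf8_page   = "m\xC3\xBCnchen".b # umlaut sent as UTF-8
from_latin1_page = "m\xFCnchen".b     # same word sent as ISO-8859-1

p utf8_or_latin1(from_utf8_page) == utf8_or_latin1(from_latin1_page)
```

Both inputs end up as the same UTF-8 string. The heuristic can misfire when Latin-1 bytes happen to form valid UTF-8, so it is a stopgap rather than detection.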

the next ugly thing :)

```
def normalize_params(params, name, v = nil)
  if v and v =~ /^("|')(.*)\1$/
    v = $2.gsub('\\'+$1, $1)
  end
  name =~ %r(\A[\[\]]*([^\[\]]+)\]*)
  k = $1 || ''
  after = $' || ''

  return if k.empty?

  if after == ""
    params[k] = (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
    # params[k] = v
  elsif after == "[]"
    params[k] ||= []
    raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
    params[k] << (RUBY_VERSION >= "1.9" && v.is_a?(String) ? v.force_encoding(Encoding::UTF_8) : v)
    # params[k] << v
  elsif after =~ %r(^\[\]\[([^\[\]]+)\]$) || after =~ %r(^\[\](.+)$)
```

all the patches i found did not include the multipart solution ... this hack makes sure that multipart variables will be forced to utf-8 too ...

Yes / i am glad and thank you that you made this overdue summary!
i hope others will have a better start into the ruby1.9 rails 2.3 world than me.
In fact there were times i really wondered why someone dares to state that rails is 1.9 compatible for a real world (not real US) app!

Thanks a lot!

Serge Balyuk June 28th, 2010 @ 10:22 AM
- Milestone order changed from “0” to “0”
I've just added a patch #100 that should be addressing some of these problems without hardcoding UTF-8 into rack core.
