A friend who runs shortext.com pinged me with an interesting question one evening. People posting messages to his website via a Firefox plugin that he had written for Twitter were reporting that all accented characters in their posts (like éåüç) were being stripped out. For instance, each time someone posted café, all that came through was caf. I asked him for the problematic query string/post data, and here is what it looked like (server name changed to localhost):
http://localhost/default.aspx?message=caf%E9
Doing a Response.Write(Request.QueryString["message"]); got us:
caf
(Page’s encoding set to utf8, ASP.NET 2.0, IIS 6 on Windows Server 2003).
The result was consistent across all browsers.
Response.Write(Request.RawUrl);
got us:
/wwwroot/Default.aspx?message=caf%E9
Clearly, the browser was leaving the query string untouched; it was the ASP.NET infrastructure that was intervening while parsing it. How about Server.UrlDecode, we thought. This is what we got:
Response.Write(Server.UrlDecode(Request.RawUrl));
/wwwroot/Default.aspx?message=caf
Time for some extreme measures. We added a reference to Microsoft.JScript and used
Response.Write(Microsoft.JScript.GlobalObject.unescape(Request.RawUrl)); which got us:
/wwwroot/Default.aspx?message=café
While this solved the problem, it meant that we would need to parse the QueryString by hand. And I had a nagging feeling that we were solving the wrong problem here. It then dawned on me that our URL was not encoded as UTF-8! Every character outside the 128-character ASCII range takes up at least two bytes when encoded as UTF-8. Our é was encoded as %E9, which is just one byte: the raw Unicode code point U+00E9 (also the Latin-1 byte for é), not its UTF-8 encoding. On its own, the byte E9 is not a valid UTF-8 sequence, which would explain why ASP.NET dropped it when decoding the query string as UTF-8.
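To make the byte counts concrete, here is a minimal JavaScript sketch (not something we ran at the time) that prints the UTF-8 bytes of é and tries to decode both forms of the query value as UTF-8. The helper name tryDecode is made up for illustration, and the snippet assumes an environment with TextEncoder available, such as a current browser or Node.js.

```javascript
// "\u00e9" is é, written as an escape so the file's own encoding does not matter.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00e9"));
const utf8Hex = utf8Bytes.map((b) => b.toString(16).toUpperCase());

// Hypothetical helper: decode a percent-encoded value as UTF-8,
// returning null instead of throwing when the bytes are malformed.
function tryDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    return null; // decodeURIComponent throws URIError on invalid UTF-8
  }
}

const decodedUtf8 = tryDecode("caf%C3%A9"); // valid two-byte sequence
const decodedLatin1 = tryDecode("caf%E9");  // lone E9 byte, not valid UTF-8

console.log(utf8Hex.join(" ")); // the two bytes that UTF-8 uses for é
console.log(decodedUtf8, decodedLatin1);
```

JavaScript's decoder rejects the lone E9 byte outright, whereas ASP.NET silently dropped it in our case; either way, the byte never comes back as é.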
The next logical step was to find out how we got this URL in the first place. It turned out that we were using JavaScript's escape() function on the client to encode our URL. Changing that to encodeURIComponent() gave us caf%C3%A9.
http://localhost/default.aspx?message=caf%C3%A9
Doing a Response.Write(Request.QueryString["message"]); now got us:
café
and all was well again with our world. Moral of the story: if you find that ASP.NET is stripping accented characters from your query string, check that your data is correctly encoded as UTF-8.
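For anyone who wants to see the client-side difference side by side, here is a rough sketch comparing the two functions on the same input. The variable names are only illustrative, the URL is the localhost one from above, and the last step uses the standard URL API as a stand-in for a server that decodes the query string as UTF-8 (it is not ASP.NET itself).

```javascript
const message = "caf\u00e9"; // "café", with é written as an escape

// escape() is a legacy function: characters below U+0100 become a single
// %XX holding the code point, which is not UTF-8.
const legacyEncoded = escape(message);

// encodeURIComponent() percent-encodes the UTF-8 bytes of each character.
const utf8Encoded = encodeURIComponent(message);

// Build the request URL with the UTF-8 form and read the parameter back.
const url = "http://localhost/default.aspx?message=" + utf8Encoded;
const roundTrip = new URL(url).searchParams.get("message");

console.log(legacyEncoded); // caf%E9
console.log(utf8Encoded);   // caf%C3%A9
console.log(roundTrip === message);
```

Since escape() is deprecated in current JavaScript, encodeURIComponent() is the safer default for query string values in any case.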