Technical Leader

Character Sets Having Trouble With An Apostrophe

In the past I had read Joel Spolsky’s article on character sets. Occasionally I run into a problem where certain characters do not display correctly, but for the most part I do nothing about it, as I can work out what the word or letter is supposed to be.

At one of my previous client locations, I was lucky enough to have a broken test. This struck me as odd, as it was only one test and not many. It was on a fresh install of Ubuntu, and I assumed I had something out of place in my configuration. I spent roughly four hours tracking down the problem and finally determined that Ubuntu was set to use UTF-8, while the development and deployment environments were meant to run as 8859, the Windows default encoding.

So when will you change? Dave Ramsey said, “Most people won’t change until the pain of where they are exceeds the pain of the change.” Well, I hit that point. I have begun to learn a little more about the funny-looking ‘?’ symbols shown when I expect something else.

The article I was reading happened to be something referenced from Wikipedia. I’m not sure how it will show up now, as I would expect someone to correct the issue, but the following are some cropped images of how it looked on a Windows XP box (it looks the same on a Mac using Firefox, Chrome, and Safari):

[Screenshots: Safari, Internet Explorer, Firefox, Chrome]

The character is obviously an apostrophe, but saving the source out and viewing the value shows 0222/0x92/146 (octal, hex, and decimal; 1001 0010 in binary, take your pick). The page declares its encoding as 8859-2, and Firefox correctly picks that up … but of course this shows the question mark in the diamond shape. When I override the default encoding and change it to 8859-1, it shows fine.
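To double-check that those notations all describe the same byte, here is a small Python sketch. The variable names are mine and purely illustrative; the only input taken from the page is the value 146.

```python
# The byte value found in the saved page source.
value = 146

octal_form = format(value, "o")      # base 8
hex_form = format(value, "x")        # base 16
binary_form = format(value, "08b")   # base 2, padded to 8 bits

print(octal_form, hex_form, value, binary_form)
```

All four spellings are the same single byte, so whichever one your hex viewer prefers, it is the value the encoding has to account for.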

I thought that I had it all figured out at this point. This is an easy case, right? The encoding is just wrong: it shouldn’t be 8859-2, but 8859-1. Well, if you look through the 8859-1 character set, there is no apostrophe anywhere near 146 (side note: if you don’t know how to use an apostrophe, or are unsure on some parts, The Oatmeal has an awesome explanation). I tried many combinations to get something to work out, such as dropping the most significant bit, which would leave me with 18 decimal. But I could never get close to the value an apostrophe should have.
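The bit-dropping attempt can be reproduced in a few lines of Python. This is only a sketch of the arithmetic, with names of my own choosing; it also prints the plain ASCII apostrophe for comparison.

```python
value = 0x92                  # 146, the byte from the page
ascii_apostrophe = ord("'")   # the plain ASCII apostrophe

# Clear the most significant bit of the 8-bit value:
# 1001 0010 & 0111 1111 = 0001 0010
without_msb = value & 0x7F

print(without_msb)        # 18
print(ascii_apostrophe)   # 39
```

Eighteen is a control code in ASCII, nowhere near 39, so simple bit tricks do not turn this byte into an apostrophe.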

As it turns out, according to this article:

ISO-8859-1 explicitly does not define displayable characters for positions 0-31 and 127-159, and the HTML standard does not allow those to be used for displayable characters.

So it seems like no browser should show anything for that character.
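Python's own ISO-8859-1 codec (named "latin-1") agrees with that reading. The sketch below decodes the byte strictly as 8859-1 and asks the standard unicodedata module what kind of character comes out. This shows Python's behavior only, not what any browser does.

```python
import unicodedata

raw = bytes([0x92])

# Python's "latin-1" codec maps each byte to the code point with the same number.
as_latin1 = raw.decode("latin-1")

code_point = ord(as_latin1)                    # still 0x92
category = unicodedata.category(as_latin1)     # "Cc" means control character
is_printable = as_latin1.isprintable()

print(hex(code_point), category, is_printable)
```

Decoded strictly as 8859-1, the byte lands on a control character with nothing to draw.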

So the question still remains: why does switching the encoding to 8859-1 make Firefox show it correctly?

Shockingly, Wikipedia to the rescue:

It is very common to mislabel text data with the charset label ISO-8859-1, even though the data is really Windows-1252 encoded. In Windows-1252, codes between 0x80 and 0x9F are used for letters and punctuation, whereas they are control codes in ISO-8859-1. Many web browsers and e-mail clients will interpret ISO-8859-1 control codes as Windows-1252 characters in order to accommodate such mislabeling but it is not standard behaviour and care should be taken to avoid generating these characters in ISO-8859-1 labeled content. However, the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.

So Firefox, when told to use the 8859-1 character set instead of the declared 8859-2, accommodates the mislabeling and decodes the page as Windows-1252. With that, the apostrophe finally shows up correctly.
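The same effect can be imitated in Python, where Windows-1252 is available as the "cp1252" codec. The helper decode_like_browser below is hypothetical, written only to illustrate the substitution the Wikipedia quote describes. It is not how Firefox is implemented, and the short list of labels it recognizes is my own assumption.

```python
import unicodedata

raw = bytes([0x92])

# Decoded as Windows-1252, the byte becomes a typographic apostrophe.
as_cp1252 = raw.decode("cp1252")
name = unicodedata.name(as_cp1252)


def decode_like_browser(data: bytes, label: str) -> str:
    """Hypothetical helper: treat an ISO-8859-1 label as Windows-1252."""
    normalized = label.strip().lower()
    if normalized in {"iso-8859-1", "latin-1", "latin1"}:
        # A few bytes are undefined in cp1252, so replace instead of raising.
        return data.decode("cp1252", errors="replace")
    return data.decode(normalized, errors="replace")


print(hex(ord(as_cp1252)), name)
print(decode_like_browser(b"don\x92t", "iso-8859-1"))
```

So 0x92 is U+2019, the right single quotation mark, which is exactly the curly apostrophe the page's author typed on a Windows machine.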

I believe I’m going to enjoy this journey down the Character Set Path.

I must acknowledge Paul Carter, who pointed out the Wikipedia article.