Beyonc□, or How We Can All Learn From Other Developers’ Character Encoding Mistakes
I’m sure everyone who reads this blog has noticed, at some point, the result of another developer’s mistake in dealing with Unicode or other character encodings. I’ve had a few issues myself. To the left, you can see how my Napster displayed a Beyoncé song as Beyonc□. You may have seen a black diamond symbol with a question mark in it while browsing a web page, or perhaps other strange symbols when interacting with programs or web pages.
Most, if not all, of these are caused by inconsistent handling of character encodings, usually somewhere in the conversion to or from Unicode. Hazarding a guess, Beyoncé’s name is likely stored as Unicode (UTF-16) in Napster’s database, but somewhere on its way to the screen it is converted down to a narrower encoding such as ASCII, or its bytes are read back with the wrong encoding. (A correct conversion to UTF-8 would be lossless, since UTF-8 can represent é.) Either way, the é can’t be represented, so a placeholder glyph is displayed instead. There is even a T-shirt memorializing the problem:
My issues have been even more low-level than this. I deal with a lot of EDI interaction with older computer systems running UNIX or some IBM mainframe OS. None of them use Unicode for their medical claims adjudication; they use either ASCII or EBCDIC. Yes, EBCDIC; I have to program using EBCDIC in 2009.
I have to be very careful when I’m converting between Unicode (UTF-16 is the native format of the string class in .NET) and other character encodings such as ASCII and EBCDIC, and so should you.
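To make the failure modes concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the .NET code I work with, and it is not how Napster does anything). It assumes EBCDIC code page 037, which is just one of many EBCDIC code pages, and the helper name `to_ascii_strict` is made up for this example.

```python
# Illustrative sketch only; the variable and function names are hypothetical.
name = "Beyoncé"

# Lossy down-conversion: ASCII has no é, so the encoder substitutes "?".
ascii_lossy = name.encode("ascii", errors="replace")

# Decoder mismatch: Latin-1 bytes read as UTF-8 produce U+FFFD, the
# replacement character often drawn as a black diamond with a question mark.
latin1_bytes = name.encode("latin-1")
misread = latin1_bytes.decode("utf-8", errors="replace")

# EBCDIC round trip using code page 037 (an assumption; confirm which
# code page the system on the other end actually expects).
ebcdic_bytes = name.encode("cp037")
round_tripped = ebcdic_bytes.decode("cp037")


def to_ascii_strict(text: str) -> bytes:
    """Encode to ASCII, raising instead of silently replacing characters."""
    # The default error handler is "strict", which raises UnicodeEncodeError.
    return text.encode("ascii")


if __name__ == "__main__":
    print(ascii_lossy)
    print(misread)
    print(ebcdic_bytes.hex(), round_tripped)
```

The strict helper is the design choice I would lean toward for something like claims data: a loud exception at the conversion boundary is easier to deal with than a silently mangled name discovered weeks later.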