2012-06-03

文字化け (mojibake) and three little piggies.

The Swedish alphabet has three national characters: Å, Ä, and Ö.

Å = the vowel in ‘your’, Ä = the vowel in ‘bear’ (the animal), Ö = the vowel in ‘Sir’.

These three characters have haunted me throughout my IT career. I cannot recall how many times I created the 256-byte string with all hex combinations from ‘00’ to ‘ff’ and sent it across the lines between computers, to verify the string is the same when it arrives at the receiving computer. In the IBM mainframe EBCDIC world you may think it will always work, but no, it does not. There are often nodes in between that do intermediate ‘translations’, and applications that have their own character translations. When character translations went wrong it was mostly due to ‘my’ three national characters Å, Ä, Ö, or the three little piggies as I call them. But still, it was relatively simple: only 256 characters to keep track of, and you could easily implement your own translate tables in between. In 1985, with the help of converted telex machines, we transferred data (program-to-program communication) between an IBM mainframe in Stockholm and one in Antwerp, and of course the three little piggies showed their ugly faces.
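For what it is worth, that old round-trip test is easy to reproduce today. Here is a minimal Python sketch of it; the EBCDIC-to-Latin-1 mix-up in the middle is my own assumption, just to simulate one badly configured node along the way:

```python
# The round-trip test: a 256-byte string with every value from
# 0x00 to 0xff, sent through a 'translating' hop and compared on arrival.
test_string = bytes(range(256))

# Simulated mix-up: one node translates from EBCDIC (cp500) to text,
# the next writes the text out as Latin-1 instead of translating back.
arrived = test_string.decode("cp500").encode("latin-1")

if arrived == test_string:
    print("all 256 byte values survived the trip")
else:
    bad = sum(1 for a, b in zip(test_string, arrived) if a != b)
    print(bad, "of 256 byte values were corrupted in transit")

# And of course the three little piggies sit at different code points:
for ch in "ÅÄÖ":
    print(ch, "Latin-1:", hex(ch.encode("latin-1")[0]),
              "EBCDIC cp500:", hex(ch.encode("cp500")[0]))
```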

Today we have better ways to communicate data than converted telex machines. (In Unix they probably already had better communication in the 19th century. According to Unix guys, Unix is always better.) But even if today’s communication facilities are better, they are not less complicated: today we communicate not only in western Latin alphabets but in Cyrillic, Kanji, Katakana, Hanzi etc., and we have Unicode to make this simple. Yet converting old encodings into Unicode is not simple, and Unicode is not one single encoding: there are UTF-8, UTF-16 and a variety of other Unicode encodings. And how should you encode and decode character strings? I think we all have seen, and still see, double Unicode encodings on the web and in mails. If we in western Europe have problems with character encoding, that is nothing compared with the problems others have. The Japanese have several scripts, Kanji, Hiragana and Katakana (plus the romanized Romaji), and Kanji characters come in at least two forms, new and old. It must be hell to convert between these character representations. Consequently the Japanese have a special word for character conversion problems, mojibake (文字化け), which as far as I understand means ‘transformed characters’ or something similar.
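The double encoding is easy to demonstrate. In this Python sketch (my assumption: UTF-8 bytes misread as windows-1252, which is how it usually happens on the web) the three little piggies turn into the familiar garbage:

```python
original = "ÅÄÖ"
utf8_bytes = original.encode("utf-8")     # b'\xc3\x85\xc3\x84\xc3\x96'

# A program wrongly assumes the bytes are windows-1252 ...
misread = utf8_bytes.decode("cp1252")     # 'Ã…Ã„Ã–' (garbage text)

# ... and re-encodes its garbage as UTF-8: the string is now
# double encoded and displays as mojibake everywhere.
double_encoded = misread.encode("utf-8")
print(double_encoded.decode("utf-8"))     # Ã…Ã„Ã–

# Undoing the mistake in reverse order recovers the original:
print(double_encoded.decode("utf-8").encode("cp1252").decode("utf-8"))  # ÅÄÖ
```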

Today our Business Intelligence application, the Data Warehouse, only supports ASCII, but we will convert our data to UTF-8 to support other alphabets, since we are spreading the Data Warehouse to Japan and China. Russians complain they cannot use the Data Warehouse since Cyrillic shows up as ‘X’. I became aware of this embarrassing fact just a few months ago. In theory it looks simple to convert from ASCII to Unicode, but I fear it is very complex. We have to change the whole chain of tools, everything from extracting data from the source systems to our data storage, which is MySQL. To me it does not look like MySQL excels in character encoding conversions: there are myriads of parameters involved, and the UTF-8 encoding which we will use is not standard UTF-8 (MySQL’s utf8 charset stores at most three bytes per character, so it cannot hold all of Unicode). The next MySQL release, 5.6, will introduce standard four-byte UTF-8 encoding. I fear I will end up with many unexpected ‘challenges’ or ‘possibilities’. If you, dear reader, have advice to give me, please tell.
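I do not have the conversion worked out yet, but one link in the chain will look something like this sketch: re-encoding extracted flat files to UTF-8 before they are loaded into MySQL. The directory name and the Latin-1 source encoding are assumptions for illustration only:

```python
from pathlib import Path

SOURCE_ENCODING = "latin-1"   # assumed encoding of today's extract files

def convert_to_utf8(path: Path) -> Path:
    """Rewrite one extract file as UTF-8, keeping the original untouched."""
    text = path.read_text(encoding=SOURCE_ENCODING)
    out = path.with_suffix(path.suffix + ".utf8")
    out.write_text(text, encoding="utf-8")
    return out

for extract in Path("extracts").glob("*.txt"):   # hypothetical directory
    print("converted", convert_to_utf8(extract))
```

A round-trip check in the spirit of the 256-byte test above, run on a file of known multi-byte characters, should tell us whether the whole chain, from extract to MySQL column, really preserves the data.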
