
Unicode sucks, and Unicode in .NET is even worse

2015-07-08

Unicode sucks, and its implementation in .NET is even worse. Back before Unicode existed, and in its early days, I had high hopes for it, but in many ways modern Unicode is even worse than the multi-byte character sets that came before it.

The main problems with Unicode itself are that some characters require multiple code points and that one character can be represented in multiple ways. The character Ǭ can be represented by the code point sequences "\u01EC", "\u01EA\u0304", "\u014C\u0328", "\u004F\u0328\u0304", "\u004F\u0304\u0328", "\u004F\u0328\u034F\u0304", "\u004F\u0304\u034F\u0328", and maybe others. As a programmer, this is a huge pain and a frequent source of bugs: these strings, despite representing exactly the same text, are not considered equal, don't map to the same entries in hash tables, and so on, and string processing becomes very inconvenient when there's no one-to-one mapping between code points and characters.

Another annoyance is that combining marks come after the code points they modify. This means that whenever you see a letter, you can't operate on it immediately; you have to scan ahead an arbitrary distance looking for combining marks. It would have been more convenient if the modifiers came before the code points being modified. Then you could process a string in a strict left-to-right fashion. These are all problems with Unicode itself, but .NET's implementation of Unicode makes them even worse.
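
As a concrete illustration, here's a minimal sketch using two of the spellings of Ǭ listed above; they compare unequal until both are normalized to the same form:

    using System;
    using System.Text;

    class NormalizationDemo
    {
        static void Main()
        {
            string precomposed = "\u01EC";             // Ǭ as a single code point
            string decomposed  = "\u004F\u0328\u0304"; // O + combining ogonek + combining macron

            Console.WriteLine(precomposed == decomposed);      // False
            Console.WriteLine(precomposed.Equals(decomposed)); // False

            // Normalizing both to form C makes the comparison succeed.
            Console.WriteLine(precomposed.Normalize(NormalizationForm.FormC) ==
                              decomposed.Normalize(NormalizationForm.FormC)); // True
        }
    }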

Modern .NET represents strings in UTF-16, which means that a single Unicode code point may require multiple UTF-16 code units (misleadingly called chars in .NET). A .NET char is not a character, it is a UTF-16 code unit. A single code point may require multiple code units, and a single character may require multiple code points. If you ask a .NET programmer what string.Length represents, 99% of them will say it's the number of characters in the string. It's not! It's not even the number of Unicode code points! We're two levels of encoding away from actual characters, and each level of encoding is variable-width.
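
For instance, here's a minimal sketch counting both levels for a short string built from 𝄞 (U+1D11E, which lies outside plane 0) followed by a decomposed Ǭ:

    using System;

    class LengthDemo
    {
        static void Main()
        {
            // 𝄞 takes two UTF-16 code units; the decomposed Ǭ takes three code points.
            string s = "\U0001D11E" + "\u004F\u0328\u0304";

            Console.WriteLine(s.Length);    // 5: UTF-16 code units, not characters

            int codePoints = 0;
            for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
                codePoints++;
            Console.WriteLine(codePoints);  // 4: code points, yet the user sees only 2 characters
        }
    }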

The result is that most .NET string-handling code is incorrect. How many people compare strings without normalizing the various code point sequences or specifying a culture-specific comparison? Most of them. Both str1 == str2 and str1.Equals(str2) will return false for textually equal strings that use different code points. More seriously, by default string-based hash tables will not consider keys equal if they are textually equal but binary-unequal. Who normalizes strings before using them in a hash table? Approximately nobody. Who specifies a culture-specific comparer for their hash tables? Only a few.

Even the C# compiler team got this wrong: when they made string switches compile into code that uses a hash table, they didn't account for it. So a C# string switch will not match strings that are textually equal but not binary-equal, and you may have to manually normalize your strings before giving them to a string switch. And who even knows about that? Essentially nobody. (No surprise. The official documentation says nothing about it.) Many LINQ functions internally use hash tables, and they get it wrong too. In fact, generic code frequently gets it wrong because the default equality comparer for strings gets it wrong.

How many people write 'for' loops over a string, thinking they're enumerating characters? Almost everyone. I've only seen a couple of people write loops that were aware that there may be multiple code units per code point, but even they failed to consider that there can be multiple code points per character! Nobody gets it right. Not even me. Sure, the code seems to work, but only because it's not being tested with code points outside plane 0 or characters that use multiple code points. It's too hard and too painful to get it right, and .NET is largely to blame.
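
To make the hash table and switch behavior concrete, here's a minimal sketch using the same two spellings of Ǭ as before. (A switch this small won't actually compile to a hash table lookup, but the comparison it does use is ordinal too, so the result is the same.)

    using System;
    using System.Collections.Generic;

    class HashDemo
    {
        static void Main()
        {
            string precomposed = "\u01EC";             // Ǭ, one code point
            string decomposed  = "\u004F\u0328\u0304"; // the same text, three code points

            var map = new Dictionary<string, int> { { precomposed, 1 } };
            Console.WriteLine(map.ContainsKey(decomposed)); // False: the default comparer is ordinal

            switch (decomposed)
            {
                case "\u01EC": Console.WriteLine("matched");  break;
                default:       Console.WriteLine("no match"); break; // this is what runs
            }
        }
    }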

First of all, .NET provides almost no tools to help with these problems. Yes, string.Normalize exists, but it only solves a small part of the problem. There's still no way to iterate through the characters in a string. There's still no way to get the length of a string in characters. There's still no way to know how many code units you have to advance to move from the current character to the next one. Yes, you can write these functions yourself and then use them, but that's not an easy task and nobody does it. And even if you did, you might find it difficult to interoperate with other code, including the rest of the .NET framework, which isn't as careful as you are when dealing with strings.
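
If you do roll your own, it ends up looking something like the rough sketch below. TextUtil.NextCharacter is a hypothetical helper, not something in the framework, and it only approximates real character boundaries: it treats a character as a base code point plus any trailing combining marks and ignores the rest of Unicode's grapheme cluster rules, which is exactly the kind of corner-cutting that keeps this code subtly wrong.

    using System;
    using System.Globalization;

    static class TextUtil
    {
        // Returns the index of the start of the next character, or s.Length at the end.
        public static int NextCharacter(string s, int index)
        {
            // Skip the base code point (one or two UTF-16 code units).
            index += char.IsSurrogatePair(s, index) ? 2 : 1;

            // Skip any combining marks that follow it.
            while (index < s.Length)
            {
                UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(s, index);
                if (cat != UnicodeCategory.NonSpacingMark &&
                    cat != UnicodeCategory.SpacingCombiningMark &&
                    cat != UnicodeCategory.EnclosingMark)
                {
                    break;
                }
                index += char.IsSurrogatePair(s, index) ? 2 : 1;
            }
            return index;
        }
    }

    class Program
    {
        static void Main()
        {
            string s = "\U0001D11E\u004F\u0328\u0304"; // 𝄞 followed by a decomposed Ǭ
            int count = 0;
            for (int i = 0; i < s.Length; i = TextUtil.NextCharacter(s, i)) count++;
            Console.WriteLine(count); // 2
        }
    }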

At least back when we had to deal with multi-byte character sets, the library provided tools to count the actual number of characters in a string and to iterate through them. And there was only one way to represent any given character. And the library documentation provided examples of how to properly deal with these strings, so that programmers knew the right way to do it. By contrast, .NET provides no such tools, almost all .NET code samples, even many official ones, get string handling wrong, and .NET has two levels of encoding rather than one. And that's how Unicode in .NET is even worse than the old multi-byte character sets that came before. It's good that multilingual programs don't have to deal with multiple character sets anymore, but things could be so much better than they are now.

I'm not saying other programming languages do better. For the most part, they don't. A few use UTF-32, which eliminates one of the encodings, but I'm not aware of any that provides much assistance in dealing with the other encoding. It's bad everywhere, and that is largely Unicode's fault. I hate that Unicode went down the path of having a variable number of code points per character, with each character encoding to potentially many different sequences, and I hate that we'll be stuck with this crap for the rest of my life. .NET just makes it worse by adding a second variable-length encoding on top of that while providing almost no tools or guidance to help programmers properly deal with strings. What a mess.
