[erlang-questions] Strings - deprecated functions
Lloyd R. Prentice
lloyd@REDACTED
Thu Nov 23 04:54:19 CET 2017
A big hearty thanks to ok and Fred for the terrific clarifications.
I guess I'll just have to suck it up and convert all Latin-1 functions that I've written so far to Unicode functions. If I wait until later I may not be among the living. Hate to foist that off on some unsuspecting soul.
Meanwhile, I've just pushed my release target out another month (or two).
And thanks to all.
Lloyd
> On Nov 22, 2017, at 9:45 PM, Fred Hebert <mononcqc@REDACTED> wrote:
>
>> On 11/22, lloyd@REDACTED wrote:
>> I read further in the Erlang docs and I see "grapheme cluster." WHAT THE ... IS GRAPHEME CLUSTER?
>>
>
> A quick run-through. In ASCII and latin-1 you mostly can deal with the following words, which are all synonymous:
>
> - character
> - letter
> - symbol
>
> In some variants, you also have to add the word 'diacritic' or 'accent', which lets you modify a character in terms of linguistics:
>
> a + ` = à
>
> Fortunately, in latin1, most of these naughty diacritics have been bundled into specific characters. In French, for example, this final 'letter' can be represented under a single code (224).
>
> There are however complications coming from that. One of them is 'collation' (the sorting order of letters). For example, a and à in French ought to sort in the same portion of the alphabet (before 'b'), but by default (comparing raw character codes), 'à' ends up sorting after 'z'.
>
> In Danish, A is the first letter of the alphabet, but Å is last. Also Å is seen as a ligature of Aa; Aa is sorted like Å rather than two letters 'Aa' one after the other. Swedish has different diacritics with different orders: Å, Ä, Ö.
>
> So uh, until now, Erlang did not even do a great job at Latin-1 because there was nothing to handle 'collations' (string comparisons to know what is equal or not, and what sorts before what).
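To make the collation problem concrete, here is a minimal sketch (the module name collation_demo is made up for illustration). It sorts three words with Erlang's default term ordering, which only compares character codes, so the word starting with 'à' (code 224) lands after "zebra".

```erlang
-module(collation_demo).
-export([naive_sort/0]).

%% Sorts three words with the default term ordering, which compares
%% raw character codes and knows nothing about any language's alphabet.
%% 224 is the Latin-1 / Unicode code for 'à'.
naive_sort() ->
    Words = ["zebra", [224 | " la carte"], "apple"],
    lists:sort(Words).
```

A French reader would expect the 'à' entry next to "apple"; getting that order requires a real collation algorithm, which this sketch does not attempt.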
>
>
> Enter UNICODE. To make a long story short (and I hope Mr. ROK won't be too mad at my gross oversimplifications), we have the following terms in vocabulary:
>
> - character: smallest representable unit in a language, in the abstract. '`' is a character, so is 'a', and so is 'à'
> - glyph: the visual representation of a character. Think of it as a character from the point of view of the font or typeface designer. For example, the same glyph may be used for the capital letter 'pi' and the mathematical symbol for a product: ∏. Similarly, capital 'Sigma' and the mathematical 'sum' may have different character representation, but the same ∑ glyph.
> - letter: an element of an alphabet
> - codepoint: A given value in the unicode space. There's a big table with a crapload of characters in them, and every character is assigned a codepoint, as a unique identifier for it.
> - code unit: a specific encoding of a given code point. This refers to bits, not just the big table. The same code point may have different code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of unicode.
> - grapheme: what the user thinks of a 'character'
> - grapheme cluster: what you want to think of as a 'character' for your user's sake. Basically, 'a' and '`' can be two graphemes, but if I combine them together as 'à', I want to be able to say that a single 'delete' key press will remove both the '`' and the 'a' at once from my text, and not be left with one or the other.
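The difference between code points, code units, and grapheme clusters can be seen by counting the same string three ways. This is a small sketch, assuming OTP 20 or later for string:length/1; the module and function names are invented for the example.

```erlang
-module(unit_counts).
-export([counts/0]).

%% Counts one user-visible 'à' three different ways.
counts() ->
    S = [$a, 16#300],                        % 'a' + U+0300 COMBINING GRAVE ACCENT
    Utf8 = unicode:characters_to_binary(S),  % the same text encoded as UTF-8
    #{code_points       => length(S),        % entries in the big Unicode table
      utf8_code_units   => byte_size(Utf8),  % bytes in the UTF-8 encoding
      grapheme_clusters => string:length(S)}.  % what the user would call characters
```

The same text is 2 code points, 3 UTF-8 code units, and 1 grapheme cluster.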
>
> We're left with the word 'lexeme' which is not really defined in the unicode glossary. Linguists will treat it as a lexical unit (word or term of vocabulary). In computer talk, you'd just define it as an arbitrary string, or maybe token (it appears some people use them interchangeably).
>
> The big fun bit is that unicode takes all these really shitty complicated linguistic things and specifies how they should be handled.
>
> Like, what makes two strings equal? I understand it's of little importance in English, but the french 'é' can be represented both as a single é or as e+´. It would be good, when you deal with say JSON or maybe my username, that you don't end up having 'Frédéric' as 4 different people depending on which form was used. JSON, by the way, specifies 'unicode' as an encoding!
>
> In any case, these equivalence rules are specified in normalization forms (http://unicode.org/reports/tr15/). The new interface lets you compare strings with 'string:equal(A, B, IgnoreCase, none | nfc | nfd | nfkc | nfkd)' which is good, because the rules for changing case are also language- or alphabet-specific.
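As an illustration of why normalization matters for equality, the sketch below builds the same visible text in precomposed and decomposed form and compares the two with and without normalization. It assumes OTP 20 or later; the module name norm_demo is hypothetical.

```erlang
-module(norm_demo).
-export([check/0]).

%% Compares a precomposed and a decomposed spelling of "Fréd".
check() ->
    Composed   = [$F, $r, 233, $d],          % 233 = U+00E9, precomposed e-acute
    Decomposed = [$F, $r, $e, 16#301, $d],   % 'e' + U+0301 COMBINING ACUTE ACCENT
    {string:equal(Composed, Decomposed),              % plain comparison
     string:equal(Composed, Decomposed, false, nfc),  % compare after NFC normalization
     unicode:characters_to_nfc_list(Decomposed) =:= Composed}.
```

The plain comparison reports the two spellings as different, while the NFC-normalized comparison treats them as the same name.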
>
> So when you look at functions like 'string:next_grapheme/1' and 'string:next_codepoint/1', they're related to whether you want to consume the data in terms of user-observable 'characters' or in terms of unicode-specific 'characters'. Because they're not the same, and depending on what you want to do, this is important.
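A quick sketch of the difference between the two iteration functions, assuming OTP 20 or later (the module name iter_demo is made up). Both are asked for the first 'character' of the same string; only the grapheme version keeps the combining ring attached to its base letter.

```erlang
-module(iter_demo).
-export([heads/0]).

%% Takes the first element of the same string by grapheme cluster
%% and by code point.
heads() ->
    S = [$e, 778, $x],   % 'e' + U+030A COMBINING RING ABOVE, then 'x'
    [Grapheme | _]  = string:next_grapheme(S),
    [CodePoint | _] = string:next_codepoint(S),
    {Grapheme, CodePoint}.
```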
>
> You could call 'string:to_graphemes' and get an iterable list the way you could use them before:
>
> 1> string:to_graphemes("ß↑e̊").
> [223,8593,[101,778]]
> 2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
> [223,8593,[101,778]]
>
> But now it's working regardless of the initial format! This is really freaking cool.
>
>> SOLUTION
>>
>
> Translation!
>
> centre/2-3 ==> pad/2-4
> Same thing, except pad is more generic and accepts a direction
> chars/2 ==> lists:duplicate/2
> Same thing, except the 2 arguments are flipped.
> chars/3 ==> ???
> No direct match, but just call lists:duplicate(N, Elem) ++ Tail to get an equivalent flat list, or [lists:duplicate(N, Elem)|Tail] if nested chardata is good enough
> chr/2 ==> find/2-3 (with 3rd argument 'leading')
> whereas chr/2 returns a position, find/2-3 returns the rest of the string starting at the match. This leaves a bit of a gap if you're looking to take everything *until* a given character (look at take/3-4 if you need a single character, or maybe string:split/2-3), or if you really want the position, but in Unicode the concept of a position is vague: is it based on code units, codepoints, grapheme clusters, or what?
> concat/2 ==> ???
> You can concatenate strings by using iolists: [A,B]. If you need to flatten the string, use unicode:characters_to_[list|binary].
> copies/2 ==> lists:duplicate/2
> Same thing, except the two arguments are flipped
> cspan/2 ==> take/3-4
> specifically, cspan(Str, Chars) corresponds to take(Str, Chars, true, leading), where 'true' asks for the complement of Chars. Returns a pair of {Before, After} strings rather than a length.
> join/2 ==> lists:join/2
> same thing, but the arguments are flipped
> left/2-3 ==> pad/2-4
> same thing, except pad is more generic and accepts a direction
> len/1 ==> length/1
> returns grapheme cluster counts rather than 'characters'.
> rchr/2 ==> find/2-3 (with 3rd argument 'trailing')
> see chr/2 conversion for description.
> right/2-3 ==> pad/2-4
> same as centre/2-3 and left/2-3.
> rstr/2 ==> find/3
> use 'trailing' as third argument for similar semantics. Drops characters before the match and returns the leftover string rather than just an index. A bit of a gap if you want the opposite, maybe use string:split/2-3
> span/2 ==> take/2
> no modifications required for arguments, but take/2 returns a {Before, After} pair of strings rather than a length.
> str/2 ==> find/2
> use 'leading' as a third argument (it is the default). Drops characters before the match and returns the leftover string rather than just an index. Maybe string:split/2-3 is nicer there?
> strip/1-3 ==> trim/1-3
> Same, aside from the direction. strip/2-3 accepted 'left | right | both' whereas trim/2-3 accepts 'leading | trailing | both'. Be careful. Oh also strip/3 takes a single character as an argument and trim/3 takes a list of characters.
> sub_string/2-3 ==> slice/2-3
> (String, Start, Stop) is changed for (String, Start, Length), and Start is now 0-based rather than 1-based. This reflects the idea that grapheme clusters make it a lot harder to know where in a string a specific position is. The length is in grapheme clusters.
> substr/2-3 ==> slice/2-3
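> same (String, Start, Length) arguments, but Start is 1-based in substr and 0-based in slice, and positions count grapheme clusters.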
> sub_word/2-3 ==> nth_lexeme/3
> Same, except rather than a single character in the last position, it now takes a list of separators (grapheme clusters). So ".e" is actually the list of [$., $e], two distinct separators.
> to_lower/1 ==> lowercase/1
> Same
> to_upper/1 ==> uppercase/1
> Same
> tokens/2 ==> lexemes/2
> Same
> words/2 ==> lexemes/2
> words/2 returned a count, so wrap the result in length/1. Also, lexemes/2 accepts a list of 'characters' (grapheme clusters) instead of a single one of them.
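A few of the translations above, written out as a sketch. The module name is invented, the deprecated call is shown only as a comment above each replacement, and everything assumes OTP 20 or later. Note that pad/4 returns nested chardata, so the example flattens it before use.

```erlang
-module(string_translation_demo).
-export([examples/0]).

%% Each entry shows the deprecated call as a comment, followed by
%% one possible replacement from the new string API.
examples() ->
    Digits = lists:seq($0, $9),
    [
     %% string:centre("ab", 6, $.)
     unicode:characters_to_list(string:pad("ab", 6, both, $.)),
     %% string:cspan("abc0z123", Digits) returned the length 3
     string:take("abc0z123", Digits, true, leading),
     %% string:substr("Hello", 2, 3): Start was 1-based, slice is 0-based
     string:slice("Hello", 1, 3),
     %% string:tokens("a.b,c", ".,")
     string:lexemes("a.b,c", ".,"),
     %% string:sub_word("a.b.c", 2, $.): separator is now a list
     string:nth_lexeme("a.b.c", 2, ".")
    ].
```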
>
>
> The biggest annoyance I have had converting so far was handling find/2-3; in a lot of places in code, I had patterns where the objective was to drop what happened *after* a given character, and the function does just the opposite. You can take a look at string:split/2-3 there.
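One way to get the "keep what comes before the separator" behaviour with string:split/2-3 is sketched below. The helper names before_first/2 and before_last/2 are hypothetical, and the code assumes OTP 20 or later.

```erlang
-module(split_demo).
-export([before_first/2, before_last/2]).

%% Returns the part of Str before the first occurrence of Sep.
%% If Sep does not occur, split/2 returns [Str], so Str comes back whole.
before_first(Str, Sep) ->
    hd(string:split(Str, Sep)).

%% Same idea, but cuts at the last occurrence of Sep.
before_last(Str, Sep) ->
    hd(string:split(Str, Sep, trailing)).
```

For example, split_demo:before_first("user@example.org", "@") gives "user", where string:find/2 would have returned "@example.org".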
>
> The second biggest annoyance is making sure that functions that used to take just a single character now may take more than one of them. It makes compatibility a bit weird.
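For the single-character case, a thin wrapper can keep old call sites unchanged. The sketch below is one hypothetical way to do it for strip/3: it maps the old direction atoms to the new ones and wraps the single character in a list before calling string:trim/3 (OTP 20 or later).

```erlang
-module(strip_compat).
-export([strip/3]).

%% Hypothetical wrapper keeping the old strip/3 calling convention
%% (direction left | right | both, a single separator character)
%% on top of string:trim/3.
strip(Str, Dir, Char) when is_integer(Char) ->
    string:trim(Str, direction(Dir), [Char]).

%% Old direction names translated to the new ones.
direction(left)  -> leading;
direction(right) -> trailing;
direction(both)  -> both.
```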
>
>> Keep the Latin-1 string functions. Put them in a separate library if necessary. Or put the new Unicode functions in a separate library. But don't arbitrarily drop them.
>>
>> Some folks have suggested that I maintain my own library of the deprecated Latin1 functions. But why should I have to do that? How does that help other folks with the same issue?
>>
>
> I had a few problems with it myself; I just finished updating rebar3 and dependencies to run on both >OTP-21 releases and stuff dating back to OTP-16. The problem we have is that we run on a backwards compat schedule that is stricter and longer than the OTP team.
>
> For example, string:join/2 is being replaced with lists:join/2, but lists:join/2 did not exist in R16 and string:join/2 is deprecated in OTP-21. So we needed to extract that function from OTP into custom modules everywhere, and replace old usage with new one.
>
> I was forced to add translation functions and modules like https://github.com/erlang/rebar3/blob/master/src/rebar_string.erl to the code base, along with this conditional define: https://github.com/erlang/rebar3/blob/master/rebar.config#L33
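The general shape of such a compatibility module might look like the sketch below. This is not the content of rebar_string.erl; the module name compat_string and the macro have_lists_join are assumptions made up for the example, and the macro would have to be set by the build (for instance through a conditional define like the one linked above).

```erlang
-module(compat_string).
-export([join/2]).

%% join/2 keeps the old string:join/2 argument order and flat result.
-ifdef(have_lists_join).
%% Newer OTP: lists:join/2 exists, but it returns a list of strings,
%% so flatten one level to get a single string back.
join(Strings, Sep) ->
    lists:append(lists:join(Sep, Strings)).
-else.
%% Older OTP (macro not defined): hand-rolled equivalent.
join([], _Sep) ->
    [];
join([H | T], Sep) ->
    H ++ lists:append([Sep ++ X || X <- T]).
-endif.
```

Callers use compat_string:join(["a", "b", "c"], ", ") everywhere and get "a, b, c" on either branch.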
>
> It's a bit painful, but it ends up working quite alright. Frankly it's nicer if you can adopt OTP's deprecation pace; it's quite painful for us being on a longer schedule.
>
>> Bottom line: please please please do not drop the existing Latin-1 string functions.
>>
>> Please don't.
>>
>> Best wishes,
>>
>> LRP
>>
>
> It's probably alright if they keep warnings for a good period of time before dropping everything. OTP-21 starts warning, so people who want to keep 'warnings_as_errors' as an option will suffer the most.
>
> But overall, you can't escape Unicode. Work with JSON? There it is. HTTP? There again. URLs? You bet. File systems? Hell yes! Erlang modules and app files: also yes!
>
> The one place I've seen a deliberate attempt at not doing it was with host names in certificate validation (or DNS), since there, making distinct domain names compare the same could be an attack vector. There you get to deal with the magic of punycode (https://en.wikipedia.org/wiki/Punycode) if you want to be safer.
>
> - Fred.