[erlang-questions] Strings - deprecated functions

Fred Hebert mononcqc@REDACTED
Thu Nov 23 03:45:55 CET 2017


On 11/22, lloyd@REDACTED wrote:
>I read further in the Erlang docs and I see "grapheme cluster."  WHAT 
>THE ... IS GRAPHEME CLUSTER?
>

A quick run-through. In ASCII and latin-1, you can mostly treat the 
following words as synonymous:

- character
- letter
- symbol

In some variants, you also have to add the word 'diacritic' or 
'accent', which lets you modify a character in linguistic terms:

a + ` = à

Fortunately, in latin1, most of these naughty diacritics have been 
bundled into specific characters. In French, for example, this final 
'letter' can be represented under a single code (224).

There are however complications coming from that. One of them is 
'collation' (the sorting order of letters). For example, a and à in 
French ought to sort in the same portion of the alphabet (before 'b'), 
but by default, they end up sorting after 'z'.

In Danish, A is the first letter of the alphabet, but Å is the last. 
Also, Å is seen as a ligature of Aa; Aa is sorted like Å rather than as 
two letters 'A' and 'a' one after the other. Swedish has different 
diacritics with a different order: Å, Ä, Ö.

So uh, until now, Erlang did not even do a great job at Latin-1, 
because there was nothing to handle 'collations' (locale-aware 
comparisons to know how strings sort or compare as equal).
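
A quick illustration (output from a unicode-capable shell; in latin-1, 
'é' is codepoint 233 while 'z' is 122):

1> lists:sort(["été", "zut"]).
["zut","été"]

'été' sorts after 'zut', even though a French dictionary would put it 
near the front of the alphabet.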


Enter UNICODE. To make a long story short (and I hope Mr. ROK won't be 
too mad at my gross oversimplifications), we have the following terms 
in vocabulary:

- character: smallest representable unit in a language, in the abstract.  
  '`' is a character, so is 'a', and so is 'à'
- glyph: the visual representation of a character. Think of it as a 
  character from the point of view of the font or typeface designer. For 
  example, the same glyph may be used for the capital letter 'pi' and 
  the mathematical symbol for a product: ∏. Similarly, capital 'Sigma' 
  and the mathematical 'sum' may have different character 
  representations, but the same ∑ glyph.
- letter: an element of an alphabet
- codepoint: a given value in the unicode space. There's a big table 
  with a crapload of characters in it, and every character is assigned 
  a codepoint, as a unique identifier for it.
- code unit: a specific encoding of a given code point. This refers to 
  bits, not just the big table. The same code point may have different 
  code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of 
  unicode.
- grapheme: what the user thinks of as a 'character'
- grapheme cluster: what you want to think of as a 'character' for your 
  user's sake. Basically, 'a' and '`' can be two graphemes, but if I 
  combine them together as 'à', I want to be able to say that a single 
  'delete' key press will remove both the '`' and the 'a' at once from 
  my text, and not be left with one or the other.

We're left with the word 'lexeme' which is not really defined in the 
unicode glossary. Linguists will treat it as a lexical unit (word or 
term of vocabulary). In computer talk, you'd just define it as an 
arbitrary string, or maybe token (it appears some people use them 
interchangeably).

The big fun bit is that unicode takes all these really shitty 
complicated linguistic things and specifies how they should be handled.

Like, what makes two strings equal? I understand it's of little 
importance in English, but the French 'é' can be represented both as a 
single é or as e+´. It would be good, when you deal with say JSON or 
maybe my username, that you don't end up having 'Frédéric' as 4 
different people depending on which form was used. JSON, by the way, 
specifies 'unicode' as an encoding!

In any case, these equivalence rules are specified in normalization 
forms (http://unicode.org/reports/tr15/). The new interface lets you 
compare strings with 'string:equal(A, B, IgnoreCase, nfc | nfd | nfkc 
| nfkd)', which is good, because the rules for changing case are also 
language- or alphabet-specific.
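
For instance, here 16#E9 is the precomposed 'é' and 16#301 the 
combining acute accent:

1> string:equal([16#E9], [$e, 16#301]).
false
2> string:equal([16#E9], [$e, 16#301], false, nfc).
true

equal/2 compares codepoints as-is; passing a normalization form makes 
both spellings of 'é' compare equal.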

So when you look at functions like 'string:next_grapheme/1' and 
'string:next_codepoint/1', they're related to whether you want to 
consume the data in terms of user-observable 'characters' or in terms of 
unicode-specific 'characters'. Because they're not the same, and 
depending on what you want to do, this is important.
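
To see the difference, take the list [101,778,102,103] (an 'e', the 
combining ring 778, then 'f' and 'g' -- it renders as "e̊fg"):

1> length(string:to_graphemes([101,778,102,103])).
3
2> length([101,778,102,103]).
4

Three user-visible 'characters', four codepoints.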

You could call 'string:to_graphemes/1' and get back a list you can 
iterate over the way you did before:

1> string:to_graphemes("ß↑e̊").
[223,8593,[101,778]]
2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]

But now it's working regardless of the initial format! This is really 
freaking cool.

>SOLUTION
>

Translation!

centre/2-3 ==> pad/2-4
    Same thing, except pad is more generic and accepts a direction
chars/2    ==> lists:duplicate/2
    Same thing, except the 2 arguments are flipped.
chars/3    ==> ???
    No direct match, but lists:duplicate(N, Char) ++ Tail gets you an 
    equivalent
chr/2      ==> find/2-3 (with 3rd argument 'leading')
    whereas chr/2 returns a position, find/2-3 returns the string after 
    the match. This leaves a bit of a gap if you're looking to take 
    everything *until* a given character (look at take/3-4 if you need a 
    single character, or maybe string:split/2-3), or really the 
    position, but in Unicode the concept of a position is vague: is it 
    based on code units, codepoints, grapheme clusters, or what?
concat/2   ==> ???
    You can concatenate strings by using iolists: [A,B]. If you need a 
    flat string, use unicode:characters_to_[list|binary]/1.
copies/2   ==> lists:duplicate/2
    Same thing, except the two arguments are flipped
cspan/2    ==> take/3-4
    specifically, cspan(Str, Chars) is equivalent to take(Str, Chars, 
    true, leading). Returns a pair of {Before, After} strings rather 
    than a length.
join/2     ==> lists:join/2
    same thing, but the arguments are flipped
left/2-3   ==> pad/2-4
    same thing, except pad is more generic and accepts a direction
len/1      ==> length/1
    returns a grapheme cluster count rather than a 'character' count.
rchr/2     ==> find/2-3 (with 3rd argument 'trailing')
    see chr/2 conversion for description.
right/2-3  ==> pad/2-4
    same as center/2-3 and left/2-3.
rstr/2     ==> find/3
    use 'trailing' as third argument for similar semantics. Drops 
    characters before the match and returns the leftover string rather 
    than just an index. A bit of a gap if you want the opposite, maybe 
    use string:split/2-3
span/2     ==> take/2
    no modifications required for arguments, but take/2 returns a 
    {Before, After} pair of strings rather than a length.
str/2      ==> find/2
    use 'leading' as a third argument. Drops characters before the match 
    rather than just an index. Maybe string:split/2-3 is nicer there?
strip/1-3  ==> trim/1-3
    Same, aside from the direction. strip/2-3 accepted 'left | right | 
    both' whereas trim/2-3 accepts 'leading | trailing | both'. Be 
    careful. Oh also strip/3 takes a single character as an argument and 
    trim/3 takes a list of characters.
sub_string/2-3 ==> slice/2-3
    (String, Start, Stop) is changed to (String, Start, Length), and 
    positions now count from 0 rather than 1. This reflects the idea 
    that grapheme clusters make it a lot harder to know where in a 
    string a specific position is. The length is in grapheme clusters.
substr/2-3 ==> slice/2-3
    same argument order, but positions count from 0 rather than 1
sub_word/2-3 ==> nth_lexeme/3
    Same, except rather than a single character in the last position, it 
    now takes a list of separators (grapheme clusters). So ".e" is 
    actually the list of [$., $e], two distinct separators.
to_lower/1 ==> lowercase/1
    Same
to_upper/1 ==> uppercase/1
    Same
tokens/2   ==> lexemes/2
    Same
words/2    ==> lexemes/2
    Same, but lexemes/2 accepts a list of 'characters' (grapheme 
    clusters) instead of a single one of them.
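
To give a feel for the new functions, a few shell examples of my own 
(not from the docs):

1> unicode:characters_to_list(string:pad("abc", 5, trailing, $*)).
"abc**"
2> string:trim("  hello  ", both).
"hello"
3> string:lexemes("abc de;fg", "; ").
["abc","de","fg"]
4> string:slice("hello world", 0, 5).
"hello"

Note that pad returns chardata (a deep list), hence the 
unicode:characters_to_list/1 call to flatten it for display.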


The biggest annoyance I have had converting so far was handling 
find/2-3; in a lot of places in code, I had patterns where the 
objective was to drop what happened *after* a given character, and the 
function does just the opposite. You can take a look at 
string:split/2-3 there.
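
For example, to keep what comes before the separator:

1> string:split("key=value", "=").
["key","value"]
2> hd(string:split("key=value", "=")).
"key"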

The second biggest annoyance is making sure that functions that used to 
take just a single character now may take more than one of them. It 
makes compatibility a bit weird.

>Keep the Latin-1 string functions. Put them in a separate library if 
>necessary. Or put the new Unicode functions in a separate library. But 
>don't arbitrarily drop them.
>
>Some folks have suggested that I maintain my own library of the 
>deprecated Latin1 functions. But why should I have to do that? How does 
>that help other folks with the same issue?
>

I had a few problems with it myself; I just finished updating rebar3 
and its dependencies to run on everything from OTP-21 back to R16. The 
problem we have is that we run on a backwards compatibility schedule 
that is stricter and longer than the OTP team's.

For example, string:join/2 is being replaced with lists:join/2, but 
lists:join/2 did not exist in R16 and string:join/2 is deprecated in 
OTP-21. So we needed to extract that function from OTP into custom 
modules everywhere, and replace the old usage with the new one.

I was forced to add translation functions and modules like 
https://github.com/erlang/rebar3/blob/master/src/rebar_string.erl to the 
code base, along with this conditional define: https://github.com/erlang/rebar3/blob/master/rebar.config#L33
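
For illustration, such a compat module boils down to something like 
this (a hand-rolled sketch, not the actual rebar_string code; the 
'no_lists_join' macro name is hypothetical and would come from a 
conditional define like the one above):

-module(compat_string).
-export([join/2]).

-ifdef(no_lists_join).
%% fallback for releases predating lists:join/2
join(_Sep, []) -> [];
join(_Sep, [H]) -> [H];
join(Sep, [H|T]) -> [H, Sep | join(Sep, T)].
-else.
join(Sep, List) -> lists:join(Sep, List).
-endif.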

It's a bit painful, but it ends up working alright. Frankly, it's 
easier if you can just adopt OTP's deprecation pace; it's quite 
painful for us being on a longer cycle.

>Bottom line: please please please do not drop the existing Latin-1 
>string functions.
>
>Please don't.
>
>Best wishes,
>
>LRP
>

It's probably alright if they keep warnings around for a good period of 
time before dropping everything. OTP-21 starts warning, so people who 
want to keep 'warnings_as_errors' as an option will suffer the most.

But overall, you can't escape Unicode. Work with JSON? There it is.  
HTTP? There again. URLs? You bet. File systems? Hell yes! Erlang modules 
and app files: also yes!

The one place I've seen a deliberate attempt at not doing it was with 
host names in certificate validation (or DNS), since there, making 
distinct domain names compare the same could be an attack vector. There 
you get to deal with the magic of punycode 
(https://en.wikipedia.org/wiki/Punycode) if you want to be safer.

- Fred.


