[erlang-questions] Strings - deprecated functions
Fred Hebert
mononcqc@REDACTED
Thu Nov 23 03:45:55 CET 2017
On 11/22, lloyd@REDACTED wrote:
>I read further in the Erlang docs and I see "grapheme cluster." WHAT
>THE ... IS GRAPHEME CLUSTER?
>
A quick run-through. In ASCII and latin-1, you can mostly deal with the
following words, which are all synonymous:
- character
- letter
- symbol
In some variants, you also have to add the word 'diacritic' or 'accent',
which lets you modify a character in terms of linguistics:
a + ` = à
Fortunately, in latin-1, most of these naughty diacritics have been
bundled into specific characters. In French, for example, this final
'letter' can be represented by a single code (224).
There are however complications coming from that. One of them is
'collation' (the sorting order of letters). For example, a and à in
French ought to sort in the same portion of the alphabet (before 'b'),
but by default, they end up sorting after 'z'.
In Danish, A is the first letter of the alphabet, but Å is last. Also Å
is seen as a ligature of Aa; Aa is sorted like Å rather than two letters
'Aa' one after the other. Swedish has different diacritics with
different orders: Å, Ä, Ö.
So uh, up to now, Erlang did not even do a great job at Latin-1, because
there was nothing to handle 'collation' (language-aware comparison and
sorting of strings, to know what is equal and what sorts where).
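You can see that in the shell: the default term order compares purely by
numeric code, so à (224) lands after z (122):

```erlang
1> "à" > "z".
true
2> lists:sort(["zebra", "à la carte", "apple"]).
["apple","zebra","à la carte"]
```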
Enter UNICODE. To make a long story short (and I hope Mr. ROK won't be
too mad at my gross oversimplifications), we have the following terms in
the vocabulary:
- character: smallest representable unit in a language, in the abstract.
'`' is a character, so is 'a', and so is 'à'
- glyph: the visual representation of a character. Think of it as a
character from the point of view of the font or typeface designer. For
example, the same glyph may be used for the capital letter 'pi' and
the mathematical symbol for a product: ∏. Similarly, capital 'Sigma'
and the mathematical 'sum' may have different character
representations, but the same ∑ glyph.
- letter: an element of an alphabet
- codepoint: A given value in the unicode space. There's a big table
with a crapload of characters in them, and every character is assigned
a codepoint, as a unique identifier for it.
- code unit: a specific encoding of a given code point. This refers to
bits, not just the big table. The same code point may have different
code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of
unicode.
- grapheme: what the user thinks of a 'character'
- grapheme cluster: what you want to think of as a 'character' for your
user's sake. Basically, 'a' and '`' can be two graphemes, but if I
combine them together as 'à', I want to be able to say that a single
'delete' key press will remove both the '`' and the 'a' at once from
my text, and not be left with one or the other.
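The 'code unit' distinction above is easy to see in the shell: the same
code point (224, our 'à') becomes a different byte sequence under each
encoding, and Erlang's binary syntax spells that out directly:

```erlang
1> binary_to_list(<<224/utf8>>).
[195,160]
2> binary_to_list(<<224/utf16>>).
[0,224]
3> binary_to_list(<<224/utf32>>).
[0,0,0,224]
```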
We're left with the word 'lexeme' which is not really defined in the
unicode glossary. Linguists will treat it as a lexical unit (word or
term of vocabulary). In computer talk, you'd just define it as an
arbitrary string, or maybe token (it appears some people use them
interchangeably).
The big fun bit is that unicode takes all these really shitty
complicated linguistic things and specifies how they should be handled.
Like, what makes two strings equal? I understand it's of little
importance in English, but the french 'é' can be represented both as a
single é or as e+´. It would be good, when you deal with say JSON or
maybe my username, that you don't end up having 'Frédéric' as 4
different people depending on which form was used. JSON, by the way,
specifies 'unicode' as an encoding!
In any case, these encoding rules are specified in normalization forms
(http://unicode.org/reports/tr15/). The new interface lets you compare
strings with 'string:equal(A, B, IgnoreCase, nfc | nfd | nfkc | nfkd)',
which is good, because the rules for changing case are also language- or
alphabet-specific.
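For instance (assuming you're on a release with the new string module,
where the normalizing equality check is string:equal/4), the two
representations of 'é' compare unequal as raw lists, but equal once
normalized:

```erlang
1> "é" =:= [$e, 769].   % precomposed [233] vs e + combining acute
false
2> string:equal("é", [$e, 769], false, nfc).
true
```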
So when you look at functions like 'string:next_grapheme/1' and
'string:next_codepoint/1', they're related to whether you want to
consume the data in terms of user-observable 'characters' or in terms of
unicode-specific 'characters'. Because they're not the same, and
depending on what you want to do, this is important.
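A quick shell illustration with e + combining ring above (integer-list
forms shown; your shell may render them as strings depending on its
Unicode settings):

```erlang
1> string:next_grapheme([101,778,102]).   % "e̊f"
[[101,778],102]
2> string:next_codepoint([101,778,102]).
[101,778,102]
```

The grapheme version hands you [101,778] as one unit; the codepoint
version peels off 101 alone and leaves the combining mark behind.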
You could call 'string:to_graphemes' and get a list you can iterate
over the way you did before:
1> string:to_graphemes("ß↑e̊").
[223,8593,[101,778]]
2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]
But now it's working regardless of the initial format! This is really
freaking cool.
>SOLUTION
>
Translation!
centre/2-3 ==> pad/2-4
Same thing, except pad is more generic and accepts a direction
chars/2 ==> lists:duplicate/2
Same thing, except the 2 arguments are flipped.
chars/3 ==> ???
No direct match, but just call [lists:duplicate(N, Elem)|Tail] to
get an equivalence
chr/2 ==> find/2-3 (with 3rd argument 'leading')
whereas chr/2 returns a position, find/2-3 returns the string after
the match. This leaves a bit of a gap if you're looking to take
everything *until* a given character (look at take/3-4 if you need a
single character, or maybe string:split/2-3), or really the
position, but in Unicode the concept of a position is vague: is it
based on code units, codepoints, grapheme clusters, or what?
concat/2 ==> ???
You can concatenate strings by using iolists: [A,B]. If you need a
flat result, use unicode:characters_to_list/1 or
unicode:characters_to_binary/1.
copies/2 ==> lists:duplicate/2
Same thing, except the two arguments are flipped
cspan/2 ==> take/3-4
specifically, cspan(Str, Chars) is equivalent to take(Str, Chars,
false, leading). Returns a pair of {Before, After} strings rather
than a length.
join/2 ==> lists:join/2
same thing, but the arguments are flipped
left/2-3 ==> pad/2-4
same thing, except pad is more generic and accepts a direction
len/1 ==> length/1
returns grapheme cluster counts rather than 'characters'.
rchr/2 ==> find/2-3 (with 3rd argument 'trailing')
see chr/2 conversion for description.
right/2-3 ==> pad/2-4
same as centre/2-3 and left/2-3.
rstr/2 ==> find/3
use 'trailing' as third argument for similar semantics. Drops
characters before the match and returns the leftover string rather
than just an index. A bit of a gap if you want the opposite, maybe
use string:split/2-3
span/2 ==> take/2
no modifications required for arguments, but take/2 returns a
{Before, After} pair of strings rather than a length.
str/2 ==> find/2
use 'leading' as a third argument. Drops characters before the match
rather than just an index. Maybe string:split/2-3 is nicer there?
strip/1-3 ==> trim/1-3
Same, aside from the direction. strip/2-3 accepted 'left | right |
both' whereas trim/2-3 accepts 'leading | trailing | both'. Be
careful. Oh also strip/3 takes a single character as an argument and
trim/3 takes a list of characters.
sub_string/2-3 ==> slice/2-3
(String, Start, Stop) is changed to (String, Start, Length), and
Start is now 0-based rather than 1-based. This reflects the idea
that grapheme clusters make it a lot harder to know where in a
string a specific position is. The length is in grapheme clusters.
substr/2-3 ==> slice/2-3
same arguments, except Start is now 0-based rather than 1-based.
sub_word/2-3 ==> nth_lexeme/3
Same, except rather than a single character in the last position, it
now takes a list of separators (grapheme clusters). So ".e" is
actually the list of [$., $e], two distinct separators.
to_lower/1 ==> lowercase/1
Same
to_upper/1 ==> uppercase/1
Same
tokens/2 ==> lexemes/2
Same
words/2 ==> lexemes/2
Same, but lexemes/2 accepts a list of 'characters' (grapheme
clusters) instead of a single one of them.
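To make a few of these mappings concrete, here's roughly what the new
calls look like in the shell (worth double-checking against your own
release's docs):

```erlang
1> string:take("abc.def", ".", true).   % everything until the dot
{"abc",".def"}
2> string:split("abc.def", ".").
["abc","def"]
3> string:sub_string("hello world", 7, 11).   % old, 1-based Start..Stop
"world"
4> string:slice("hello world", 6, 5).         % new, 0-based Start + Length
"world"
5> string:lexemes("foo bar;baz", " ;").       % a list of separators
["foo","bar","baz"]
```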
The biggest annoyance I have had converting so far was handling
find/2-3; in a lot of places in code, I had patterns where the objective
was to drop what happened *after* a given character, and the function
does just the opposite. You can take a look at string:split/2-3 there.
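Concretely, the gap looks like this: find/2 keeps the match and
everything after it, while split/2 gives you the part before it:

```erlang
1> string:find("key=value", "=").
"=value"
2> string:split("key=value", "=").
["key","value"]
```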
The second biggest annoyance is making sure that functions that used to
take just a single character now may take more than one of them. It
makes compatibility a bit weird.
>Keep the Latin-1 string functions. Put them in a separate library if
>necessary. Or put the new Unicode functions in a separate library. But
>don't arbitrarily drop them.
>
>Some folks have suggested that I maintain my own library of the
>deprecated Latin1 functions. But why should I have to do that? How does
>that help other folks with the same issue?
>
I had a few problems with it myself; I just finished updating rebar3
and its dependencies to run on both OTP-21+ releases and stuff dating
back to R16. The problem we have is that we run on a
backwards-compatibility schedule that is stricter and longer than the
OTP team's.
For example, string:join/2 is being replaced with lists:join/2, but
lists:join/2 did not exist in R16 and string:join/2 is deprecated in
OTP-21. So we needed to extract that function from OTP into custom
modules everywhere, and replace the old usage with the new one.
I was forced to add translation functions and modules like
https://github.com/erlang/rebar3/blob/master/src/rebar_string.erl to the
code base, along with this conditional define: https://github.com/erlang/rebar3/blob/master/rebar.config#L33
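For the curious, such a shim boils down to something like the following
sketch (the macro name here is hypothetical; in practice it would be
set through a platform_define in rebar.config based on the OTP version,
and note that lists:join/2 interleaves without flattening):

```erlang
%% Sketch of a compat module in the spirit of rebar_string.
%% The macro 'no_lists_join' is an assumption for illustration.
-module(string_compat).
-export([join/2]).

-ifdef(no_lists_join).
%% Old releases (R16/17): lists:join/2 does not exist yet, but
%% string:join/2 is not deprecated there either.
join(Sep, Strings) -> string:join(Strings, Sep).
-else.
%% Newer releases: lists:join/2 interleaves the separator without
%% flattening, so flatten to get a plain string back.
join(Sep, Strings) -> lists:flatten(lists:join(Sep, Strings)).
-endif.
```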
It's a bit painful, but it ends up working alright. Frankly, it's
easier if you can just adopt OTP's deprecation pace; it's quite painful
for us, being on a longer support cycle.
>Bottom line: please please please do not drop the existing Latin-1
>string functions.
>
>Please don't.
>
>Best wishes,
>
>LRP
>
It's probably alright if they keep warnings for a good period of time
before dropping everything. OTP-21 starts warning, so people who want
to keep 'warnings_as_errors' as an option will suffer the most.
But overall, you can't escape Unicode. Work with JSON? There it is.
HTTP? There again. URLs? You bet. File systems? Hell yes! Erlang modules
and app files: also yes!
The one place I've seen a deliberate attempt at not doing it was with
host names in certificate validation (or DNS), since there, making
distinct domain names compare the same could be an attack vector. There
you get to deal with the magic of punycode
(https://en.wikipedia.org/wiki/Punycode) if you want to be safer.
- Fred.