[erlang-questions] Strings - deprecated functions

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Wed Nov 22 21:26:24 CET 2017


In this case, the words come partly from terms you would find in
linguistics, and partly from words which have a specific meaning in the
Unicode standard.

The problem with Latin-1 (ISO8859-1) and ISO8859-15 is that they work
somewhat well for Western Latin languages, but fall short on almost
everything else. If your only concern is truly English text, then there
should be no worry at all, since that uses ASCII, and the predominant
Unicode encoding, UTF-8, was chosen such that there is a 1-1 overlap
between its first 128 characters and ASCII.

However, Unicode imposes some difficulties. The most notable one is that
you have several ways of writing symbols such as the Danish Ø and Å: either
as a specific character, or as a combination, for instance an A with a
small ring on top.

In languages the written symbols are graphemes, and collections of symbols
forming tokens or words are lexemes. However, because one grapheme can be
represented as one or several characters, the notion of a grapheme cluster
arises: several code-points which form a single grapheme. It is of utmost
importance for certain Asian writing systems in which a single grapheme is
composed out of several smaller ones.
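
As a rough sketch of the difference between code points and grapheme
clusters (assuming Erlang/OTP 20 or later; the module name grapheme_demo
is made up for illustration):

```erlang
-module(grapheme_demo).
-export([run/0]).

%% Compare the code-point count with the grapheme-cluster count
%% for the two ways of writing "Å".
run() ->
    Precomposed = [16#00C5],      % U+00C5, the single character "Å"
    Combined    = [$A, 16#030A],  % "A" followed by U+030A, combining ring above
    %% length/1 counts list elements, i.e. code points.
    1 = length(Precomposed),
    2 = length(Combined),
    %% string:length/1 counts grapheme clusters, so both forms are one symbol.
    1 = string:length(Precomposed),
    1 = string:length(Combined),
    ok.
```

Calling grapheme_demo:run() should return ok if every match holds.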

For ASCII, however, string:lexemes/2 would work exactly like
string:tokens/2. Yet it will handle far more cases.
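
A small sketch of both claims, assuming Erlang/OTP 20 or later (the
module and function names are invented for this example, and the
combining-character input is my own choice of test case):

```erlang
-module(lexemes_demo).
-export([ascii/0, combining/0]).

%% For plain ASCII input the two functions give the same result.
ascii() ->
    Text = "this and that",
    ["this", "and", "that"] = string:tokens(Text, " "),
    ["this", "and", "that"] = string:lexemes(Text, " "),
    ok.

%% With a combining character the results differ.
combining() ->
    Text = [$a, $e, 16#030A, $b],  % "a", then "e" + combining ring, then "b"
    %% tokens/2 works on single code points, so it splits at $e and
    %% leaves the combining ring stranded at the front of the next token.
    ["a", [16#030A, $b]] = string:tokens(Text, "e"),
    %% lexemes/2 compares whole grapheme clusters: "e" + ring is not the
    %% separator "e", so the text comes back as a single lexeme.
    [Lexeme] = string:lexemes(Text, "e"),
    true = string:equal(Text, Lexeme),
    ok.
```

Both lexemes_demo:ascii() and lexemes_demo:combining() should return ok.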

Unicode presents its own set of complexities. There are several ways of
writing a Unicode string which is "the same" string in that it renders
equally to the human eye. Hence, there are some routines for handling
normalization, canonization and collation which by no means are easy to
handle.
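
To give an idea of what normalization buys you, here is a minimal
sketch, again assuming Erlang/OTP 20 or later and using a made-up module
name. It only covers the NFC and NFD forms and says nothing about
collation:

```erlang
-module(normalize_demo).
-export([run/0]).

%% Two renderings of "Å" that look identical but differ as data.
run() ->
    Precomposed = [16#00C5],      % single code point U+00C5
    Combined    = [$A, 16#030A],  % "A" + U+030A, combining ring above
    %% A plain term comparison sees two different lists.
    false = (Precomposed =:= Combined),
    %% NFC composes and NFD decomposes, so each form maps onto the other.
    Precomposed = unicode:characters_to_nfc_list(Combined),
    Combined    = unicode:characters_to_nfd_list(Precomposed),
    %% string:equal/2 does not normalize; string:equal/4 can be told to.
    false = string:equal(Precomposed, Combined),
    true  = string:equal(Precomposed, Combined, false, nfc),
    ok.
```

The third argument to string:equal/4 is the ignore-case flag and the
fourth selects the normalization form applied before comparing.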

And finally, it would probably be good to define those terms in the
documentation. I don't think they are well-known to most people.

On Wed, Nov 22, 2017 at 8:59 PM Grzegorz Junka <list1@REDACTED> wrote:

> Dear Lloyd,
>
> Isn't this more about documentation than the code? What I am reading is
> that you want to keep the old functions because you don't understand how
> the new functions work. Shouldn't you rather ask for clearer
> documentation? Is there anything in the old functions that is not supported
> in the new functions?
>
> GrzegorzJ
>
> On 22/11/2017 19:43, lloyd@REDACTED wrote:
>
> Dear Gods of Erlang,
>
>
>
> "This module has been reworked in Erlang/OTP 20 to handle
> unicode:chardata() <http://erlang.org/doc/man/unicode.html#type-chardata>
> and operate on grapheme clusters. The old functions
> <http://erlang.org/doc/man/string.html#oldapi> that only work on Latin-1
> lists as input are still available but should not be used. They will be
> deprecated in Erlang/OTP 21."
>
>
>
> I'm sorry. I've brought up this issue before and got lots of push back.
>
>
>
> But every time I look up tried and true and long-used string functions to
> find that they are deprecated and will be dropped in future Erlang releases
> my blood pressure soars. Both my wife and my doctor tell me that at my age
> this is a dangerous thing.
>
>
>
> I do understand the importance and necessity of Unicode. And applaud the
> addition of Unicode functions.
>
>
>
> But the deprecated string functions have a long history. The English
> language and Latin-1 characters are widely used around the world.
>
>
>
> Yes, it should be easy for programmers to translate code from one user
> language to another. But I'm not convinced that the Gods of Erlang have
> found the optimal solution by dropping all Latin-1 string functions.
>
>
>
> My particular application is directed toward English speakers. So, until
> further notice, I have no use for Unicode.
>
>
>
> I don't want to sound like a nationalist pig, but I think dropping the
> Latin-1 string functions from future Erlang releases is a BIG mistake.
>
>
>
> I look up tokens/2, a function that I use fairly frequently, and I see
> that it's deprecated. I look up the suggested replacement and I see
> lexemes/2.
>
>
>
> So I ask, what the ... is a lexeme? I look it up in Merriam-Webster and I
> see that a lexeme is  "a meaningful linguistic unit."
>
>
>
> Meaning what? I just want to turn "this and that" into "This And That."
>
>
>
> I read further in the Erlang docs and I see "grapheme cluster."  WHAT THE
> ... IS A GRAPHEME CLUSTER?
>
>
>
> I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a
> unit of a writing system."
>
>
>
> Ah yes, grapheme is defined in the docs. But I have to read and re-read
> the definition to understand what the Gods of Erlang mean by a "grapheme
> cluster." And I'm still not sure I get it.
>
>
>
> It sounds like someone took a linguistics class and is trying to show off.
>
>
>
> But now I've spent 30 minutes--- time that I don't have to waste trying to
> figure out how to do a simple manipulation of "this and that." Recurse the
> next time I want to look up a string function in the Erlang docs.
>
>
>
> SOLUTION
>
>
>
> Keep the Latin-1 string functions. Put them in a separate library if
> necessary. Or put the new Unicode functions in a separate library. But
> don't arbitrarily drop them.
>
>
>
> Some folks have suggested that I maintain my own library of the deprecated
> Latin1 functions. But why should I have to do that? How does that help
> other folks with the same issue?
>
>
>
> Bottom line: please please please do not drop the existing Latin-1 string
> functions.
>
>
>
> Please don't.
>
>
>
> Best wishes,
>
>
>
> LRP
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>