[erlang-questions] string:lexemes/2 - an old man's rant

Wed May 8 06:27:41 CEST 2019

Hi Richard,

My head spins.

I’m ashamed to say that I’m functionally illiterate in every natural language of the world except English, and I’m still working at mastering that. And I program for English speakers.

So, I guess I’ll stick with re:split/3 until I have a pressing need for Unicode. Maybe by then the issues will be ironed out and, better yet, well tucked under the hood.

Richard, you’re a star in the Erlang firmament. 

Thank you,

Lloyd

> On May 7, 2019, at 8:39 PM, Richard O'Keefe <raoknz@REDACTED> wrote:
> 
> Let's look at the documentation for tokens/2:
> 
> http://erlang.org/doc/man/string.html#tokens-2
> 
> The first thing I notice is that we are told *that*
> the function is obsolete but not *why* it is, and
> that's important.
> 
> The second thing I notice is that we are told
> to use lexemes/2 instead, but we are not told *how*
> to do that.  An example showing an old call and its
> new equivalent would do wonders.
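
The kind of side-by-side example asked for here might look like the following minimal sketch. The module and function names are hypothetical, and it assumes flat character lists as input; it only shows that, for such input, the two calls return the same pieces.

```erlang
-module(tokens_vs_lexemes).
-export([old_style/2, new_style/2]).
%% string:tokens/2 is documented as obsolete; this option only silences
%% a deprecation warning if the compiler emits one.
-compile(nowarn_deprecated_function).

%% The obsolete call.
old_style(String, Separators) ->
    string:tokens(String, Separators).

%% The replacement the documentation points to.
new_style(String, Separators) ->
    string:lexemes(String, Separators).
```

With these definitions, both `old_style("a b  c", " ")` and `new_style("a b  c", " ")` give `["a","b","c"]`; adjacent separators are coalesced in both cases.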
> 
> The third thing I notice is the reason that the
> second thing matters.  Consider the following
> examples:
>   tokens("aaa", "x") => ["aaa"]
>   tokens("aa", "x")  => ["aa"]
>   tokens("a", "x")   => ["a"]
> so by continuity we expect
>   tokens("", "x")    => [""]
> BUT the result is actually [].  True, the
> description says that the result is a list
> of non-empty strings, but I don't really see
> why that is so important that our natural
> expectation that tokens(S, [X]) => [S]
> whenever S is *any* string not containing X
> should be violated, and if it is, then I
> would definitely expect an exception.
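
The discontinuity described above can be checked with a small sketch like this one. The module name and `demo` functions are hypothetical; the second function runs the same inputs through string:lexemes/2 for comparison.

```erlang
-module(tokens_empty).
-export([demo_tokens/0, demo_lexemes/0]).
%% string:tokens/2 is obsolete; silence any deprecation warning.
-compile(nowarn_deprecated_function).

%% Apply tokens/2 to ever shorter strings that contain no separator.
demo_tokens() ->
    [string:tokens(S, "x") || S <- ["aaa", "aa", "a", ""]].

%% The same inputs through lexemes/2.
demo_lexemes() ->
    [string:lexemes(S, "x") || S <- ["aaa", "aa", "a", ""]].
```

Both functions return `[["aaa"],["aa"],["a"],[]]`: the last element is the empty list rather than `[""]`.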
> 
> The fourth thing I notice is that the treatment
> of multi-element separator lists is odd.  I have
> had occasion to use separators with more than
> one code-point, and for Unicode that could be
> essential.  I have also had occasion to
> split at C1, then at C2, then at C3, then at C4, ...
> I've also had occasion to split on one separator
> and then split the pieces into smaller pieces,
> so multiple levels of splitting.  (Think of
> /etc/passwd for a simple example.)  But the only
> time I ever want multiple *alternative* separators
> is when asking for white-space separation, and
> *that* is when I want non-empty pieces.  It is
> also the only time I ever want separators coalesced.
> Given a string like "||x|yy||w" and the separator
> "|", I've always wanted ["","","x","yy","","w"]
> as the answer.  But there's a particular point
> here:  which of us knows off-hand just what all
> the Zs, Zl, and Zp characters of Unicode actually
> are?  It would make a *lot* of sense to have
>    tokens(String) -> list of non-empty pieces
>    tokens(String, Sep) -> list of possibly empty
>      pieces separated by the non-empty substring Sep.
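
The proposed pair of functions could be approximated on top of the existing string module roughly as follows. This is only a sketch under stated assumptions: the module name `split_sketch` is hypothetical, and the one-argument version handles only ASCII white space, not the full Unicode Zs, Zl, and Zp sets.

```erlang
-module(split_sketch).
-export([tokens/1, tokens/2]).

%% Non-empty pieces separated by runs of white space.
%% ASSUMPTION: ASCII white space only. "\r\n" is listed separately
%% because it forms a single grapheme cluster.
tokens(String) ->
    string:lexemes(String, [$\s, $\t, $\n, $\r, [$\r, $\n]]).

%% Possibly empty pieces separated by the non-empty substring Sep.
%% An empty Sep fails with function_clause instead of being guessed at.
tokens(String, Sep) when Sep =/= [], Sep =/= <<>> ->
    string:split(String, Sep, all).
```

Here `tokens("||x|yy||w", "|")` returns `["","","x","yy","","w"]`, and one line of /etc/passwd splits into its fields with `tokens(Line, ":")`, empty fields included.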
> 
> The fifth thing I notice is that there is no
> specification of what happens if SeparatorList is
> empty.
> 
> All things considered, this is a function I am never
> going to use, because it is less work to write my own
> than to try to figure out this documentation.  And I
> had to look at the code to figure some of it out.
> 
> I get seriously confused by some of the code in
> string.erl.  We find
> %% Fetch first grapheme cluster ..
> next_grapheme(CD) -> ..
> Which is it?  Grapheme or grapheme cluster?  These
> are *different* (but overlapping) things!  And
> where is the locale argument so that the function
> knows what a "user-perceived character" actually *is*?
> How come an empty list counts as a grapheme_cluster()?
> 
> What if I have something like
> "foo:bar::uggle::zoosh" and I want to split it at
> "::" but NOT at ":"?  "::" is not a grapheme cluster,
> so it looks like neither of these functions will help
> me.
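
For this particular case, string:split/3 takes a search pattern rather than a list of grapheme clusters, so a sketch along these lines may do what is wanted. The module and function names are hypothetical.

```erlang
-module(double_colon).
-export([split_pairs/1, split_singles/1]).

%% Split at every "::" but leave single ":" characters alone.
split_pairs(String) ->
    string:split(String, "::", all).

%% For contrast: lexemes/2 treats ":" as a one-cluster separator
%% and coalesces adjacent occurrences.
split_singles(String) ->
    string:lexemes(String, ":").
```

`split_pairs("foo:bar::uggle::zoosh")` returns `["foo:bar","uggle","zoosh"]`, while `split_singles/1` on the same input returns `["foo","bar","uggle","zoosh"]`.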
> 
> Writing good documentation is HARD.  At dear departed
> Quintus, we started with a full time technical writer
> and expanded to three, nearly as many as developers.
> 
> The *name* 'lexemes' is arguably the *least* confusing
> thing in the documentation.  If it were called z3k_u4y/2
> that would increase my confusion very little.