[erlang-questions] string:lexemes/2 - an old man's rant

Richard O'Keefe raoknz@REDACTED
Wed May 8 00:53:25 CEST 2019

For what it's worth, in Unicode, Line Separator (U+2028) and Paragraph
Separator (U+2029) are the recommended characters, with CR, LF, CR+LF,
and arguably NEL (U+0085) being "legacy".

Again for what it's worth, Unicode defines an algorithm for
breaking text into word( token)s.
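
As a rough illustration of the line terminators named above, here is a
minimal sketch that hands them to string:lexemes/2. The module name
line_breaks and the function lines/1 are made up for this example and
are not part of OTP. It assumes OTP 20 or later, where the string module
works on grapheme clusters, which is why CR+LF has to be passed as the
nested list [$\r,$\n].

```erlang
-module(line_breaks).
-export([lines/1]).

%% Separators for string:lexemes/2. Each element is one grapheme
%% cluster; CR+LF must be given as the nested list [$\r,$\n]
%% because the string module treats it as a single cluster.
separators() ->
    [[$\r, $\n], $\n, $\r,
     16#0085,    % NEL
     16#2028,    % LINE SEPARATOR
     16#2029].   % PARAGRAPH SEPARATOR

%% Split Text into its non-empty lines.
%% Note: lexemes/2 treats adjacent separators as one, so blank
%% lines are dropped.
lines(Text) ->
    string:lexemes(Text, separators()).
```

With this, line_breaks:lines("one\r\ntwo" ++ [16#2028] ++ "three\n\nfour")
should give ["one","two","three","four"]. It does not implement the
Unicode word-breaking algorithm; it only splits at line terminators.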

On Wed, 8 May 2019 at 08:17, <empro2@REDACTED> wrote:

> On Tue, 7 May 2019 12:04:46 -0400
> "Lloyd R. Prentice" <lloyd@REDACTED> wrote:
> > language conventions. But I would much prefer keeping
> > string:tokens/2 in the Erlang string library and renaming
> > string:lexemes/2 to something like
> > string:unicode_tokens/2. If nothing else, this would take
> > a considerable burden off the documentation.
> Both names force meaning onto mere substrings plucked from
> some string argument chopped to pieces at some separator
> characters (with `tokens`) or substrings (`lexemes`).
> The author cannot know what the resulting substrings mean to
> the user, may be tokens, may be lexemes, may simply be
> substrings for whatever use substrings might be useful
> for; the "key=value" strings from a query-string chopped at
> "&" are neither tokens nor lexemes.
> I would provide an option to return empty substrings (for
> counting) or not, instead of imperative `split` and
> foisting `token` and `lexeme`.
> This is a good example of things I have been collecting
> about the documentation (so far I have been brainstorming
> in my chamber without you):
>         "two or more adjacent separator graphemes clusters
>         in String are treated as one."
> No-one cares: users looking up the spec want to know
> whether they get empty substrings or not -- and how.
> With such an option to one `substring` function they
> get to know immediately; as things are, one needs to guess
> from `split` to `token` to `lexeme` and back. Quizzy! Or they
> employ a text search for "substring" to end up where I
> would have begun ... (I hope :-)
> Moreover: why at all treat two adjacent separators
> specially? And more thinking takes the users further away
> from whatever they were trying to accomplish ...
>         "Notice that [$\r,$\n] is one grapheme cluster."
> as is any character list = string? This note drives me to
> confusion, requires me to step one meta-layer further away
> from whatever I was trying to implement or design. Did I
> misunderstand something? Up until here I thought
> `Newline_separators = [[$\n], [$\r], [$\r, $\n]]`, but
> now, why mention the obvious ...?
>         "Where, default leading, indicates whether the
>         leading, the trailing or all encounters of
>         SearchPattern will split String."
> Leading and trailing separators do not really separate, the
> example (I love examples at specs :-) shows that more
> probably "first" and "last" are meant. Now who would want
> to dig up the documentation out of the repo, change, index,
> commit, push (or whatever), set up a pull request and ...
> try to remember what they were doing? Reference, with
> examples (and possibly some *distinct* implementation
> rationale, if that is the right place) and User Guide are
> not code (nor code comments, I am starting to grow doubts
> about all these JavaDoc, erldoc, ...doc thingies).
> (*Note: "spec" above means 'reference', not the module
> attribute, I know I should change them ...*)
> Reference and Guide are to be read by people who do not
> know what is meant -- code, comments and implementation
> details are for those who know too much; it is too much
> effort to change one's view from implementing know-it-all
> to unknowing user. So some know too little and others too
> much. Of course, ex nihilo nihil fit, so the documentation
> needs to be prepared by those who know too much and some
> wiki could be a good way to get improvements by those who
> know too little.
> My! such loads of prose to lay down what is some simple
> thought in my head ... Sorry! (somehow :-)
> Now I can possibly throw away the other draft in which I
> have been collecting and trying to arrange many (all?) of
> those things mentioned above over the previous
> months ... :-)
> ~Michael
> --
> Time is not money, but money is time: life-time people have
> spent transforming their environment.
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
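
To make the quoted proposal concrete, here is a minimal sketch of a
single splitting function with an explicit option for empty substrings.
The module substr_demo, the function substrings/3 and the option atoms
keep_empty and drop_empty are hypothetical names invented for this
illustration; nothing like them exists in OTP. The sketch simply wraps
string:split/3 and string:is_empty/1 and takes one search pattern, as
string:split/3 does, rather than a list of separators as
string:lexemes/2 does.

```erlang
-module(substr_demo).
-export([substrings/3]).

%% Hypothetical helper (not part of OTP): one splitting function
%% with an explicit option for empty substrings.
%%   keep_empty -> every occurrence of Separator splits String
%%   drop_empty -> empty substrings are filtered out afterwards
%% Separator is a single search pattern, as for string:split/3.
substrings(String, Separator, keep_empty) ->
    string:split(String, Separator, all);
substrings(String, Separator, drop_empty) ->
    [S || S <- string:split(String, Separator, all),
          not string:is_empty(S)].
```

For the query-string case, substr_demo:substrings("a=1&&b=2", "&", keep_empty)
should return ["a=1",[],"b=2"], while drop_empty should return
["a=1","b=2"], the same result as string:tokens("a=1&&b=2", "&") and
string:lexemes("a=1&&b=2", "&"). On the leading/trailing point:
string:split("a=1&b=2&c=3", "&", leading) splits only at the first "&"
and gives ["a=1","b=2&c=3"]; with trailing it splits only at the last
one and gives ["a=1&b=2","c=3"].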