[erlang-questions] string:lexemes/2 - an old man's rant

Wed May 8 01:40:07 CEST 2019

Hi Richard,

Thanks for clarifying the inner workings of Unicode. 

Which makes me wonder: if string:tokens/2 and string:lexemes/2 are functionally identical, or at least substitutable, why not change the implementation of string:tokens/2 to accommodate Unicode, leave the function name alone, and announce to the world that, as of Erlang version XX, string:tokens/2 handles Unicode?

Then we don’t have to worry about revising legacy code at some point in the future. Yes, I understand that legacy code might have to be recompiled under the new version of Erlang if Unicode becomes universal. But that seems to me a smaller price than revising source code.
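
For what it is worth, here is a small sketch of where the two functions agree and where they part ways, as far as I read the string module documentation. The module name tokens_vs_lexemes is made up for illustration, and the CR+LF case is the one the documentation itself calls out, so treat this as a check to run rather than as the final word.

```erlang
-module(tokens_vs_lexemes).
-export([demo/0]).

%% Hypothetical demo module: each line is a pattern match that
%% crashes with badmatch if the functions do not behave as described.
demo() ->
    %% Plain ASCII separators: both functions return the same parts,
    %% and both treat adjacent separators as one.
    ["a","b","c"] = string:tokens("a b  c", " "),
    ["a","b","c"] = string:lexemes("a b  c", " "),

    %% CR+LF: tokens/2 works on characters, so it sees two separators.
    ["jkl","foo"] = string:tokens("jkl\r\nfoo", "\r\n"),

    %% lexemes/2 works on grapheme clusters; "\r\n" in the input is one
    %% cluster and matches neither $\r nor $\n on its own.
    ["jkl\r\nfoo"] = string:lexemes("jkl\r\nfoo", "\r\n"),

    %% Listing the cluster itself as a separator splits again.
    ["jkl","foo"] = string:lexemes("jkl\r\nfoo", [[$\r,$\n]]),
    ok.
```

So for ordinary single-character separators the swap looks safe; line endings are where a silent change of implementation could surprise legacy code.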

A simple example that I ran into yesterday while proofreading Build It with Nitrogen, the book that Jesse Gumm and I have been working on for far too long now:

We used string:tokens/2 many moons ago to parse a date string of the form “04/07/19”. string:tokens/2 was in good standing when we wrote the chapter. Had we published the book in a timely fashion, our readers today might think: oh, this book is no good; it uses obsolete functions.

I could have changed the function to string:lexemes/2. But if my mind goes tilt when I look at the documentation, what can I expect of my readers? I ended up changing it to re:split/3.
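
To make the three choices concrete, here is a minimal sketch of the date-splitting step. The module and function names (date_parts, with_tokens/1 and so on) are invented for this note and are not the code from the book. Note that re:split/3 hands back binaries unless given the {return, list} option.

```erlang
-module(date_parts).
-export([with_tokens/1, with_lexemes/1, with_re/1]).

%% Hypothetical helpers; each splits a date such as "04/07/19" at "/".

%% The original approach. string:tokens/2 is now documented as obsolete.
with_tokens(Date) ->
    string:tokens(Date, "/").

%% The suggested replacement; same result for this kind of input.
with_lexemes(Date) ->
    string:lexemes(Date, "/").

%% re:split/3 returns binaries by default, so ask for lists explicitly.
with_re(Date) ->
    re:split(Date, "/", [{return, list}]).
```

Called with "04/07/19", all three return ["04","07","19"].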

All the best,

Lloyd

> On May 7, 2019, at 6:53 PM, Richard O'Keefe <raoknz@REDACTED> wrote:
> 
> For what it's worth, in Unicode, Line Separator and Paragraph
> Separator are the recommended characters, with CR, LF, CR+LF,
> and arguably NEL (U+0085) being "legacy".
> 
> Again for what it's worth, Unicode defines an algorithm for
> breaking text into word( token)s.
> 
>> On Wed, 8 May 2019 at 08:17, <empro2@REDACTED> wrote:
>> On Tue, 7 May 2019 12:04:46 -0400
>> "Lloyd R. Prentice" <lloyd@REDACTED> wrote:
>> 
>> > language conventions. But I would much prefer keeping
>> > string:tokens/2 in the Erlang string library and renaming
>> > string:lexemes/2 to something like
>> > string:unicode_tokens/2. If nothing else, this would take
>> > a considerable burden off the documentation.
>> 
>> Both names force meaning onto mere substrings plucked from
>> some string argument chopped to pieces at some separator
>> characters (with `tokens`) or substrings (`lexemes`).
>> 
>> The author cannot know what the resulting substrings mean to
>> the user, may be tokens, may be lexemes, may simply be
>> substrings for whatever use substrings might be useful
>> for; the "key=value" strings from a query-string chopped at
>> "&" are neither tokens nor lexemes.
>> 
>> I would provide an option to return empty substrings (for
>> counting) or not, instead of imperative `split` and
>> foisting `token` and `lexeme`.
>> 
>> This is a good example of things I have been collecting
>> about the documentation (so far I have been brainstorming
>> in my chamber without you):
>> 
>>         "two or more adjacent separator graphemes clusters
>>         in String are treated as one."
>> 
>> No-one cares: users looking up the spec want to know
>> whether they get empty substrings or not -- and how:
>> with such an option to one `substring` function they
>> get to know immediately; as things are, one needs to guess
>> from `split` to `token` to `lexeme` or fro. Qizzy! or they
>> employ a text search for "substring" to end up where I
>> would have begun ... (I hope :-)
>> 
>> Moreover: why at all treat two adjacent separators
>> specially? And more thinking takes the users further away
>> from whatever they were trying to accomplish ...
>> 
>> 
>>         "Notice that [$\r,$\n] is one grapheme cluster."
>> 
>> as is any character list = string? This note drives me to
>> confusion, requires me to step one meta-layer further away
>> from whatever I was trying to implement or design. Did I
>> misunderstand something? Up until here I thought
>> `Newline_separators = [[$\n], [$\r], [$\r, $\n]]`, but
>> now, why mention the obvious ...?
>> 
>> 
>>         "Where, default leading, indicates whether the
>>         leading, the trailing or all encounters of
>>         SearchPattern will split String."
>> 
>> Leading and trailing separators do not really separate, the
>> example (I love examples at specs :-) shows that more
>> probably "first" and "last" are meant. Now who would want
>> to dig up the documentation out of the repo, change, index,
>> commit, push (or whatever), set up a pull request and ...
>> try to remember what they were doing? Reference, with
>> examples (and possibly some *distinct* implementation
>> rationale, if that is the right place) and User Guide are
>> not code (nor code comments, I am starting to grow doubts
>> about all these JavaDoc, erldoc, ...doc thingies).
>> 
>> (*Note: "spec" above means 'reference', not the module
>> attribute, I know I should change them ...*)
>> 
>> Reference and Guide are to be read by people who do not
>> know what is meant -- code, comments and implementation
>> details are for those who know too much, it is too much
>> effort to change one's view from implementing know-it-all
>> to unknowing user. So some know too little and others too
>> much. Of course, ex nihilo nihil fit, so the documentation
>> needs to be prepared by those who know too much and some
>> wiki could be a good way to get improvements by those who
>> know too little.
>> 
>> My! such loads of prose to lay down what is some simple
>> thought in my head ... Sorry! (somehow :-)
>> 
>> Now I can possibly throw away the other draft in which I
>> have been collecting and trying to arrange many (all?) of
>> those things mentioned above over the previous
>> months ... :-)
>> 
>> ~Michael
>> 
>> --
>> 
>> Time is not money, but money is time: life-time people have
>> spent transforming their environment.
>> 