<div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace">For what it's worth, in Unicode, Line Separator and Paragraph</div><div class="gmail_default" style="font-family:monospace,monospace">Separator are the recommended characters, with CR, LF, CR+LF,</div><div class="gmail_default" style="font-family:monospace,monospace">and arguably NEL (U+0085) being "legacy".</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">Again for what it's worth, Unicode defines an algorithm for</div><div class="gmail_default" style="font-family:monospace,monospace">breaking text into word( token)s.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 8 May 2019 at 08:17, <<a href="mailto:empro2@web.de">empro2@web.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, 7 May 2019 12:04:46 -0400<br>

"Lloyd R. Prentice" <<a href="mailto:lloyd@writersglen.com" target="_blank">lloyd@writersglen.com</a>> wrote:<br>

<br>

> language conventions. But I would much prefer keeping<br>

> string:tokens/2 in the Erlang string library and renaming<br>

> string:lexemes/2 to something like<br>

> string:unicode_tokens/2. If nothing else, this would take<br>

> a considerable burden off the documentation.<br>

<br>

Both names force meaning onto mere substrings plucked from<br>

some string argument chopped to pieces at some separator<br>

characters (with `tokens`) or substrings (`lexemes`).<br>

<br>

The author cannot know what the resulting substrings mean to<br>

the user, may be tokens, may be lexemes, may simply be<br>

substrings for whatever use substrings might be useful<br>

for; the "key=value" strings from a query-string chopped at<br>

"&" are neither tokens nor lexemes.<br>

<br>
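A minimal sketch of that difference, assuming OTP 20 or later, where all<br>
three functions exist side by side. The module name `query_split` and the<br>
function `compare/1` are made up for illustration:<br>
<pre>
```erlang
-module(query_split).
-export([compare/1]).

%% Hypothetical helper: split one string with each of the three
%% stdlib functions discussed in this thread.
compare(Query) ->
    #{tokens  => string:tokens(Query, "&"),      % separator characters
      lexemes => string:lexemes(Query, "&"),     % separator grapheme clusters
      split   => string:split(Query, "&", all)}. % one search pattern

%% query_split:compare("a=1&b=2&&c=3") returns
%%   #{tokens  => ["a=1","b=2","c=3"],
%%     lexemes => ["a=1","b=2","c=3"],
%%     split   => ["a=1","b=2",[],"c=3"]}
```
</pre>
Only `split` with `all` reports the empty substring between the two<br>
adjacent separators; the other two silently drop it.<br>
<br>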

I would provide an option to return empty substrings (for<br>

counting) or not, instead of imperative `split` and<br>

foisting `token` and `lexeme`.<br>
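<br>
Such an option could be sketched on top of `string:split/3`. The module<br>
`substrings_sketch`, the function `substrings/3` and the atoms `keep_empty`<br>
and `drop_empty` are hypothetical names, not part of OTP, and the separator<br>
here is one search pattern rather than a list of separators:<br>
<pre>
```erlang
-module(substrings_sketch).
-export([substrings/3]).

%% Hypothetical API: one function, one explicit option for empty substrings.
substrings(String, Separator, keep_empty) ->
    string:split(String, Separator, all);
substrings(String, Separator, drop_empty) ->
    [S || S <- string:split(String, Separator, all),
          not string:is_empty(S)].

%% substrings_sketch:substrings("a,,b", ",", keep_empty) returns ["a",[],"b"]
%% substrings_sketch:substrings("a,,b", ",", drop_empty) returns ["a","b"]
```
</pre>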

<br>

This is a good example of things I have been collecting<br>

about the documentation (so far I have been brainstorming<br>

in my chamber without you):<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; "two or more adjacent separator graphemes clusters<br>

&nbsp; &nbsp; &nbsp; &nbsp; in String are treated as one."<br>

<br>

No-one cares: users looking up the spec want to know<br>

whether they get empty substrings or not -- and how:<br>

with such an option to one `substring` function they<br>

get to know immediately; as things are, one needs to guess<br>

from `split` to `token` to `lexeme` or fro. Qizzy! or they<br>

employ a text search for "substring" to end up where I<br>

would have begun ... (I hope :-)<br>

<br>

Moreover: why at all treat two adjacent separators<br>

specially? And more thinking takes the users further away<br>

from whatever they were trying to accomplish ...<br>

<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; "Notice that [$\r,$\n] is one grapheme cluster."<br>

<br>

as is any character list = string? This note drives me to<br>

confusion, requires me to step one meta-layer further away<br>

from whatever I was trying to implement or design. Did I<br>

misunderstand something? Up until here I thought<br>

`Newline_separators = [[$\n], [$\r], [$\r, $\n]]`, but<br>

now, why mention the obvious ...?<br>
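<br>
What the note seems to be getting at can be sketched as follows, assuming<br>
OTP 20 or later; the module and function names are made up:<br>
<pre>
```erlang
-module(crlf_sketch).
-export([chars/1, cluster/1]).

%% Separators given as two separate characters, CR and LF.
chars(String) ->
    string:lexemes(String, [$\r, $\n]).

%% Separators: the CR+LF grapheme cluster, plus lone CR and lone LF.
cluster(String) ->
    string:lexemes(String, [[$\r, $\n], $\r, $\n]).

%% crlf_sketch:chars("jkl\r\nfoo")   returns ["jkl\r\nfoo"]
%% crlf_sketch:cluster("jkl\r\nfoo") returns ["jkl","foo"]
%% crlf_sketch:chars("jkl\nfoo")     returns ["jkl","foo"]
```
</pre>
In the input, CR followed by LF forms a single grapheme cluster, so it only<br>
matches a separator that is itself the list `[$\r,$\n]`.<br>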

<br>

<br>

&nbsp; &nbsp; &nbsp; &nbsp; "Where, default leading, indicates whether the<br>

&nbsp; &nbsp; &nbsp; &nbsp; leading, the trailing or all encounters of<br>

&nbsp; &nbsp; &nbsp; &nbsp; SearchPattern will split String."<br>

<br>
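For reference, a sketch of the three values on a string with two<br>
occurrences of the search pattern, assuming OTP 20 or later; the module and<br>
function names are made up:<br>
<pre>
```erlang
-module(split_where).
-export([demo/1]).

%% Split String at ".." with each of the three Where values.
demo(String) ->
    [{Where, string:split(String, "..", Where)}
     || Where <- [leading, trailing, all]].

%% split_where:demo("ab..bc..cd") returns
%%   [{leading,  ["ab","bc..cd"]},
%%    {trailing, ["ab..bc","cd"]},
%%    {all,      ["ab","bc","cd"]}]
```
</pre>
<br>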

Leading and trailing separators do not really separate, the<br>

example (I love examples at specs :-) shows that more<br>

probably "first" and "last" are meant. Now who would want<br>

to dig up the documentation out of the repo, change, index,<br>

commit, push (or whatever), set up a pull request and ...<br>

try to remember what they were doing? Reference, with<br>

examples (and possibly some *distinct* implementation<br>

rationale, if that is the right place) and User Guide are<br>

not code (nor code comments, I am starting to grow doubts<br>

about all these JavaDoc, erldoc, ...doc thingies).<br>

<br>

(*Note: "spec" above means 'reference', not the module<br>

attribute, I know I should change them ...*)<br>

<br>

Reference and Guide are to be read by people who do not<br>

know what is meant -- code, comments and implementation<br>

details are for those who know too much, it is too much<br>

effort to change one's view from implementing know-it-all<br>

to unknowing user. So some know too little and others too<br>

much. Of course, ex nihilo nihil fit, so the documentation<br>

needs to be prepared by those who know too much and some<br>

wiki could be a good way to get improvements by those who<br>

know too little.<br>

<br>

My! such loads of prose to lay down what is some simple<br>

thought in my head ... Sorry! (somehow :-)<br>

<br>

Now I can possibly throw away the other draft in which I<br>

have been collecting and trying to arrange many (all?) of<br>

those things mentioned above over the previous<br>

months ... :-)<br>

<br>

~Michael<br>

<br>

--<br>

<br>

Time is not money, but money is time: life-time people have<br>

spent transforming their environment.<br>

<br>

_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</blockquote></div>