<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Hi Richard,<div><br></div><div>Thanks for clarifying the inner workings of Unicode. </div><div><br></div><div>Which makes me wonder: if string:tokens/2 and string:lexemes/2 are functionally identical, or at least substitutable, why not change the implementation of string:tokens/2 to accommodate Unicode, leave the function name alone, and announce to the world that as of Erlang Version XX the implementation of string:tokens/2 has been changed to accommodate Unicode?</div><div><br></div><div>Then we don't have to worry about revising legacy code at some point in the future. Yes, I understand that the legacy code might have to be recompiled under the new version of Erlang in case Unicode becomes universal. But that seems to me a smaller price than revising source code.<br><div><br></div><div>A simple example that I ran into yesterday while proofreading Build It with Nitrogen, the book that Jesse Gumm and I have been working on for far too long now:</div><div><br></div><div>We used the function string:tokens/2 moons ago to parse a date string in the form "04/07/19". string:tokens/2 was in good standing when we wrote the chapter. Had we published the book in a timely fashion, our readers today might think, oh this book is no good. It uses obsolete functions. </div><div><br></div><div>I could have changed the function to string:lexemes/2. But if my mind goes tilt when I look at the documentation, what can I expect of my readers? 
I ended up changing it to re:split/3.</div><div><br></div><div>All the best,</div><div><br></div><div>Lloyd</div><div><br><div id="AppleMailSignature" dir="ltr">Sent from my iPad</div><div dir="ltr"><br>On May 7, 2019, at 6:53 PM, Richard O'Keefe <<a href="mailto:raoknz@gmail.com">raoknz@gmail.com</a>> wrote:<br><br></div><blockquote type="cite"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:monospace,monospace">For what it's worth, in Unicode, Line Separator and Paragraph</div><div class="gmail_default" style="font-family:monospace,monospace">Separator are the recommended characters, with CR, LF, CR+LF,</div><div class="gmail_default" style="font-family:monospace,monospace">and arguably NEL (U+0085) being "legacy".</div><div class="gmail_default" style="font-family:monospace,monospace"><br></div><div class="gmail_default" style="font-family:monospace,monospace">Again for what it's worth, Unicode defines an algorithm for</div><div class="gmail_default" style="font-family:monospace,monospace">breaking text into word( token)s.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 8 May 2019 at 08:17, <<a href="mailto:empro2@web.de">empro2@web.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, 7 May 2019 12:04:46 -0400<br>

"Lloyd R. Prentice" <<a href="mailto:lloyd@writersglen.com" target="_blank">lloyd@writersglen.com</a>> wrote:<br>

<br>

> language conventions. But I would much prefer keeping<br>

> string:tokens/2 in the Erlang string library and renaming<br>

> string:lexemes/2 to something like<br>

> string:unicode_tokens/2. If nothing else, this would take<br>

> a considerable burden off the documentation.<br>

<br>

Both names force meaning onto mere substrings plucked from<br>

some string argument chopped to pieces at some separator<br>

characters (with `tokens`) or substrings (`lexemes`).<br>

<br>

The author cannot know what the resulting substrings mean to<br>

the user, may be tokens, may be lexemes, may simply be<br>

substrings for whatever use substrings might be useful<br>

for; the "key=value" strings from a query-string chopped at<br>

"&" are neither tokens nor lexemes.<br>
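<br>
A minimal sketch of that query-string case, assuming OTP 20 or later (where `string:lexemes/2` is available). The module name, function name and sample string are made up for illustration:<br>
<pre>
```erlang
-module(query_split).
-export([pieces/1]).

%% Hypothetical helper, not part of OTP.
%% Chops a query string at "&". Adjacent separators ("&&") are
%% treated as one, so no empty substrings are returned.
pieces(Query) ->
    string:lexemes(Query, "&").
```
</pre>
With this, `query_split:pieces("key=value&a=1&&b=2")` should give `["key=value","a=1","b=2"]`: plain substrings, whatever one chooses to call them.<br>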

<br>

I would provide an option to return empty substrings (for<br>

counting) or not, instead of imperative `split` and<br>

foisting `token` and `lexeme`.<br>
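<br>
A rough sketch of what such an option could look like, built on the existing `string:split/3`. The module, the function `substrings/3` and the atoms `keep_empty` and `drop_empty` are hypothetical names, and the input is assumed to be a flat string (a list of characters), not a binary or deep list:<br>
<pre>
```erlang
-module(substrings).
-export([substrings/3]).

%% Hypothetical wrapper, not part of OTP: one function with an
%% explicit switch for whether empty substrings are kept.
%% Separator is a single separator string, as in string:split/3.
substrings(String, Separator, keep_empty) ->
    string:split(String, Separator, all);
substrings(String, Separator, drop_empty) ->
    %% For flat-list input an empty substring is the empty list.
    [S || S <- string:split(String, Separator, all), S =/= []].
```
</pre>
Here `substrings:substrings("a,,b", ",", keep_empty)` should return `["a",[],"b"]` (useful for counting fields), and the `drop_empty` variant `["a","b"]`.<br>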

<br>

This is a good example of things I have been collecting<br>

about the documentation (so far I have been brainstorming<br>

in my chamber without you):<br>

<br>

        "two or more adjacent separator graphemes clusters<br>

        in String are treated as one."<br>

<br>

No-one cares: users looking up the spec want to know<br>

whether they get empty substrings or not -- and how:<br>

with such an option to one `substring` function they<br>

get to know immediately; as things are, one needs to guess<br>

from `split` to `token` to `lexeme` or fro. Qizzy! or they<br>

employ a text search for "substring" to end up where I<br>

would have begun ... (I hope :-)<br>

<br>

Moreover: why treat two adjacent separators specially at<br>

all? And more thinking takes the users further away<br>

from whatever they were trying to accomplish ...<br>

<br>

<br>

        "Notice that [$\r,$\n] is one grapheme cluster."<br>

<br>

as is any character list = string? This note drives me to<br>

confusion, requires me to step one meta-layer further away<br>

from whatever I was trying to implement or design. Did I<br>

misunderstand something? Up until here I thought<br>

`Newline_separators = [[$\n], [$\r], [$\r, $\n]]`, but<br>

now, why mention the obvious ...?<br>
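<br>
One possible reading of that note, sketched under the assumption that each element of the separator list given to `string:lexemes/2` is a grapheme cluster: CR+LF is then written as one nested element, next to the single characters. (A plain "\r\n" string would instead list CR and LF as two separate separators.) The module and function names are made up:<br>
<pre>
```erlang
-module(newline_split).
-export([lines/1]).

%% Hypothetical helper, not part of OTP.
%% The separator list holds three grapheme clusters: a lone LF,
%% a lone CR, and the two-code-point cluster CR+LF (nested list).
lines(Text) ->
    string:lexemes(Text, [$\n, $\r, [$\r, $\n]]).
```
</pre>
With this, `newline_split:lines("one\r\ntwo\nthree\rfour")` should return `["one","two","three","four"]`.<br>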

<br>

<br>

        "Where, default leading, indicates whether the<br>

        leading, the trailing or all encounters of<br>

        SearchPattern will split String."<br>

<br>

Leading and trailing separators do not really separate, the<br>

example (I love examples at specs :-) shows that more<br>

probably "first" and "last" are meant. Now who would want<br>

to dig up the documentation out of the repo, change, index,<br>

commit, push (or whatever), set up a pull request and ...<br>

try to remember what they were doing? Reference, with<br>

examples (and possibly some *distinct* implementation<br>

rationale, if that is the right place) and User Guide are<br>

not code (nor code comments, I am starting to grow doubts<br>

about all these JavaDoc, erldoc, ...doc thingies).<br>
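<br>
For the `Where` argument quoted above, a small sketch of the three variants of `string:split/3` side by side. The module name and the sample string are made up for illustration:<br>
<pre>
```erlang
-module(split_where).
-export([demo/0]).

%% Returns the results of the three Where values for one sample
%% string, so they can be compared directly.
demo() ->
    S = "a,b,c",
    {string:split(S, ",", leading),   % split at the first ","
     string:split(S, ",", trailing),  % split at the last ","
     string:split(S, ",", all)}.      % split at every ","
```
</pre>
`split_where:demo()` should return `{["a","b,c"],["a,b","c"],["a","b","c"]}`, which fits the reading that "first" and "last" are what is meant.<br>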

<br>

(*Note: "spec" above means 'reference', not the module<br>

attribute, I know I should change them ...*)<br>

<br>

Reference and Guide are to be read by people who do not<br>

know what is meant -- code, comments and implementation<br>

details are for those who know too much, it is too much<br>

effort to change one's view from implementing know-it-all<br>

to unknowing user. So some know too little and others too<br>

much. Of course, ex nihilo nihil fit, so the documentation<br>

needs to be prepared by those who know too much and some<br>

wiki could be a good way to get improvements by those who<br>

know too little.<br>

<br>

My! such loads of prose to lay down what is some simple<br>

thought in my head ... Sorry! (somehow :-)<br>

<br>

Now I can possibly throw away the other draft in which I<br>

have been collecting and trying to arrange many (all?) of<br>

those things mentioned above over the previous<br>

months ... :-)<br>

<br>

~Michael<br>

<br>

--<br>

<br>

Time is not money, but money is time: life-time people have<br>

spent transforming their environment.<br>

<br>


_______________________________________________<br>

erlang-questions mailing list<br>

<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>

<a href="http://erlang.org/mailman/listinfo/erlang-questions" rel="noreferrer" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

</blockquote></div>

</div></blockquote></div></div></body></html>