<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Hi Richard,<div><br></div><div>My head spins.</div><div><br></div><div>I’m ashamed to say that I’m functionally illiterate in every natural language of the world except English— and I’m still working at mastering that. And I program for English speakers.</div><div><br></div><div>So, I guess I’ll stick with re:split/3 until I have a pressing need for Unicode. Maybe by then the issues will be ironed out— and better, well tucked under the hood.</div><div><br></div><div>Richard, you’re a star in the Erlang firmament. </div><div><br></div><div>Thank you,</div><div><br></div><div>Lloyd</div><div><br></div><div><div id="AppleMailSignature" dir="ltr">Sent from my iPad</div><div dir="ltr"><br>On May 7, 2019, at 8:39 PM, Richard O'Keefe &lt;<a href="mailto:raoknz@gmail.com">raoknz@gmail.com</a>&gt; wrote:<br><br></div><blockquote type="cite"><div dir="ltr"><div dir="ltr">Let's look at the documentation for tokens/2:<br><br><a href="http://erlang.org/doc/man/string.html#tokens-2">http://erlang.org/doc/man/string.html#tokens-2</a><br><br>The first thing I notice is that we are told *that*<br>the function is obsolete but not *why* it is, and<br>that's important.<br><br>The second thing I notice is that we are told<br>to use lexemes/2 instead, but we are not told *how*<br>to do that. An example showing an old call and its<br>new equivalent would do wonders.<br><br>The third thing I notice is the reason that the<br>second thing matters. Consider the following<br>examples:<br> tokens("aaa", "x") => ["aaa"]<br> tokens("aa", "x") => ["aa"]<br> tokens("a", "x") => ["a"]<br>so by continuity we expect<br> tokens("", "x") => [""]<br>BUT the result is actually []. 
True, the<br>description says that the result is a list<br>of non-empty strings, but I don't really see<br>why that is so important that our natural<br>expectation that tokens(S, [X]) => [X]<br>whenever S is *any* string not containing X<br>should be violated, and if it is, then I<br>would definitely expect an exception.<br><br>The fourth thing I notice is that the treatment<br>of multi-element separator lists is odd. I have<br>had occasion to use separators with more than<br>one code-point, and for Unicode that could be<br>essential. I have also had occasion to use<br>split at C1, then at C2, then at C3, then at C4, ...<br>I've also had occasion to split on one separator<br>and then split the pieces into smaller pieces,<br>so multiple levels of splitting. (Think of<br>/etc/passwd for a simple example.) But the only<br>time I ever want multiple *alternative* separators<br>is when asking for white-space separation, and<br>*that* is when I want non-empty pieces. It is<br>also the only time I ever want separators coalesced.<br>Given a string like "||x|yy||w" and the separator<br>"|", I've always wanted ["","","x","yy","","w"]<br>as the answer. But there's a particular point<br>here: which of us knows off-hand just what all<br>the Zs, Zl, and Zp characters of Unicode actually<br>are? It would make a *lot* of sense to have<br> tokens(String) -> list of non-empty pieces<br> tokens(String, Sep) -> list of possibly empty<br> pieces separated by the non-empty substring Sep.<br><br>The fifth thing I notice is that there is no<br>specification of what happens if SeparatorList is<br>empty.<br><br>All things considered, this is a function I am never<br>going to use, because it is less work to write my own<br>than to try to figure out this documentation. And I<br>had to look at the code to figure some of it out.<br><br>I get seriously confused by some of the code in<br>string.erl. We find<br>%% Fetch first grapheme cluster ..<br>next_grapheme(CD) -> ..<br>Which is it? 
Grapheme or grapheme cluster? These<br>are *different* (but overlapping) things! And<br>where is the locale argument so that the function<br>knows what a "user-perceived character" actually *is*?<br>How come an empty list counts as a grapheme_cluster()?<br><br>What if I have something like<br>"foo:bar::uggle::zoosh" and I want to split it at<br>"::" but NOT at ":"? "::" is not a grapheme cluster,<br>so it looks like neither of these functions will help<br>me.<br><br>Writing good documentation is HARD. At dear departed<br>Quintus, we started with a full-time technical writer<br>and expanded to three, nearly as many as developers.<br><br>The *name* 'lexemes' is arguably the *least* confusing<br>thing in the documentation. If it were called z3k_u4y/2<br>that would increase my confusion very little.</div>
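<div dir="ltr"><br>For reference, the cases discussed above can be written down as a small<br>check. This is only a sketch, assuming OTP 20 or later, where string:split/3<br>is available; the module name split_demo and the function demo/0 are made up<br>for illustration. It suggests that split/3, rather than tokens/2 or lexemes/2,<br>is the call that takes a multi-character separator and keeps empty pieces.<br><pre>
-module(split_demo).   %% hypothetical module name, for illustration only
-export([demo/0]).

demo() ->
    %% tokens/2 returns only non-empty pieces, so "" yields no tokens at all.
    [] = string:tokens("", "x"),
    %% Adjacent separators are coalesced and the empty pieces are dropped.
    ["x", "yy", "w"] = string:tokens("||x|yy||w", "|"),
    %% split/3 treats its second argument as ONE separator string and,
    %% with the option 'all', keeps the empty pieces.
    ["", "", "x", "yy", "", "w"] = string:split("||x|yy||w", "|", all),
    %% Splitting at "::" leaves the single ":" in "foo:bar" untouched.
    ["foo:bar", "uggle", "zoosh"] =
        string:split("foo:bar::uggle::zoosh", "::", all),
    ok.
</pre>After compiling the module, split_demo:demo() should return ok if each<br>match holds on the installed OTP release.</div>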
</div></blockquote></div></body></html>