<div dir="ltr">Let's look at the documentation for tokens/2:<br><br><a href="http://erlang.org/doc/man/string.html#tokens-2">http://erlang.org/doc/man/string.html#tokens-2</a><br><br>The first thing I notice is that we are told *that*<br>the function is obsolete but not *why* it is, and<br>that's important.<br><br>The second thing I notice is that we are told<br>to use lexemes/2 instead, but we are not told *how*<br>to do that.  An example showing an old call and its<br>new equivalent would do wonders.<br><br>The third thing I notice is the reason that the<br>second thing matters.  Consider the following<br>examples:<br>  tokens("aaa", "x") => ["aaa"]<br>  tokens("aa", "x")  => ["aa"]<br>  tokens("a", "x")   => ["a"]<br>so by continuity we expect<br>  tokens("", "x")    => [""]<br>BUT the result is actually [].  True, the<br>description says that the result is a list<br>of non-empty strings, but I don't really see<br>why that is so important that our natural<br>expectation that tokens(S, [X]) => [S]<br>whenever S is *any* string not containing X<br>should be violated, and if it is, then I<br>would definitely expect an exception.<br><br>The fourth thing I notice is that the treatment<br>of multi-element separator lists is odd.  I have<br>had occasion to use separators with more than<br>one code-point, and for Unicode that could be<br>essential.  I have also had occasion to<br>split at C1, then at C2, then at C3, then at C4, ...<br>I've also had occasion to split on one separator<br>and then split the pieces into smaller pieces,<br>so multiple levels of splitting.  (Think of<br>/etc/passwd for a simple example.)  But the only<br>time I ever want multiple *alternative* separators<br>is when asking for white-space separation, and<br>*that* is when I want non-empty pieces.  It is<br>also the only time I ever want separators coalesced.<br>Given a string like "||x|yy||w" and the separator<br>"|", I've always wanted ["","","x","yy","","w"]<br>as the answer.  
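<br><br>For comparison, here is a minimal sketch of how the calls discussed so far behave next to string:split/3, assuming OTP 20 or later (where lexemes/2 and split/3 exist).  The module name split_demo is made up for illustration; each line is a pattern match, so demo/0 fails with badmatch if the library behaves differently.<br><br>

```erlang
-module(split_demo).
-export([demo/0]).

%% Compares string:tokens/2, string:lexemes/2 and string:split/3.
%% lexemes/2 and split/3 need OTP 20 or later.
demo() ->
    S = "||x|yy||w",
    %% tokens/2 and lexemes/2 coalesce separators and drop empty pieces.
    ["x","yy","w"] = string:tokens(S, "|"),
    ["x","yy","w"] = string:lexemes(S, "|"),
    %% split/3 with `all` treats "|" as one exact separator and keeps
    %% the empty pieces.
    ["","","x","yy","","w"] = string:split(S, "|", all),
    %% The empty input: no pieces at all versus one empty piece.
    [] = string:tokens("", "x"),
    [] = string:lexemes("", "x"),
    [""] = string:split("", "x", all),
    %% A separator longer than one code point is matched as a whole,
    %% so the single ":" inside "foo:bar" is left alone.
    ["foo:bar","uggle","zoosh"] =
        string:split("foo:bar::uggle::zoosh", "::", all),
    ok.
```

<br>Note that split/3 takes one separator string, not a list of alternative separators, so it is not a drop-in replacement for tokens/2.<br><br>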
But there's a particular point<br>here:  which of us knows off-hand just what all<br>the Zs, Zl, and Zp characters of Unicode actually<br>are?  It would make a *lot* of sense to have<br>   tokens(String) -> list of non-empty pieces<br>   tokens(String, Sep) -> list of possibly empty<br>     pieces separated by the non-empty substring Sep.<br><br>The fifth thing I notice is that there is no<br>specification of what happens if SeparatorList is<br>empty.<br><br>All things considered, this is a function I am never<br>going to use, because it is less work to write my own<br>than to try to figure out this documentation.  And I<br>had to look at the code to figure some of it out.<br><br>I get seriously confused by some of the code in<br>string.erl.  We find<br>%% Fetch first grapheme cluster ..<br>next_grapheme(CD) -> ..<br>Which is it?  Grapheme or grapheme cluster?  These<br>are *different* (but overlapping) things!  And<br>where is the locale argument so that the function<br>knows what a "user-perceived character" actually *is*?<br>How come an empty list counts as a grapheme_cluster()?<br><br>What if I have something like<br>"foo:bar::uggle::zoosh" and I want to split it at<br>"::" but NOT at ":"?  "::" is not a grapheme cluster,<br>so it looks like neither of these functions will help<br>me.<br><br>Writing good documentation is HARD.  At dear departed<br>Quintus, we started with a full-time technical writer<br>and expanded to three, nearly as many as developers.<br><br>The *name* 'lexemes' is arguably the *least* confusing<br>thing in the documentation.  If it were called z3k_u4y/2<br>that would increase my confusion very little.</div>