[erlang-questions] Changes to string module

Mon Jul 24 07:19:31 CEST 2017

Hi, Lloyd!

API changes are a major pain. Hopefully there will not be too many minor traumas as things move forward...

On Sunday, 2017-07-23 15:53:42 you wrote:
> At risk of being tagged culturally insensitive, is it presumptuous of me to ask if we could please split the new Unicode string functions into a separate library and preserve the ascii string functions as we find them in older Erlang versions?

Not culturally insensitive -- annoyed at a changing API. That's normal.

The original problem was that the "string" module really performed (occasionally redundant) list operations rather than true string operations, so it was never quite stringish enough. The maintainers went with a jarring approach to change (well, sort of -- nothing has been removed yet), which is understandable considering the alternatives. They left in place the functions many of us are using but deprecated many of them, and eventually we will find ourselves with a for-real string module. Quite nice.

But the road there is fraught with peril...

> I've been working with ascii string functions but now find the new docs much harder to work with, e.g. to my eye too much visual noise, plus stuff like this:
>  
> - string:chr/2 (and rchr/2) returns an index into a string. But we're told "This function is [ obsolete ]( http://erlang.org/doc/man/string.html#oldapi ). Use [ find/2 ]( http://erlang.org/doc/man/string.html#find-2 )."
>   find/2, however, "returns the remainder of the string or nomatch...."

Yep. That's a big wtf. Why not just leave that one there or (at worst) move it somewhere more general where it really belongs, like a lists:index/2, which the language lacks (because it is usually just not called for and nobody seems to have trouble with this)?
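
To be concrete about what a lists-style index function might look like, here is a minimal sketch. The module name list_index and the function index/2 are made up for illustration; as far as I know nothing like them ships in the lists module. It takes (Elem, List) in the same order as lists:member/2 and uses the string:chr/2 convention of returning 0 when nothing is found:

```erlang
-module(list_index).
-export([index/2]).

%% Hypothetical helper, not part of OTP.
%% Returns the 1-based position of the first element that matches Elem
%% exactly, or 0 when Elem does not occur in List.
-spec index(Elem :: term(), List :: [term()]) -> non_neg_integer().
index(Elem, List) ->
    index(Elem, List, 1).

%% Walk the list while carrying the current 1-based position.
index(_Elem, [], _Pos)       -> 0;
index(Elem, [Elem | _], Pos) -> Pos;
index(Elem, [_ | Rest], Pos) -> index(Elem, Rest, Pos + 1).
```

Since a classic Erlang string is just a list of integers, list_index:index($i, S) would give the same answer as string:chr(S, $i) for such input.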

At a minimum, the docs shouldn't ONLY reference string:find/2,3, but also include a reference to a way to get the exact same behavior -- for example, using the re module:

1> S = "All you need in this life is ignorance and confidence, and then success is sure.".
"All you need in this life is ignorance and confidence, and then success is sure."
2> string:chr(S, $i).
14
3> re:run(S, "i").
{match,[{13,1}]}
4> Chr = fun(String, Char) -> case re:run(String, [Char]) of {match, [{I, 1}]} -> I + 1; nomatch -> 0 end end.
#Fun<erl_eval.12.87737649>
8> Chr(S, $i).
14

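Which indicates that a definition in a module of our own can give us the same effect:

-module(my_string).
-export([chr/2]).

-spec chr(String, Char) -> Index
    when String :: list(),
         Char   :: non_neg_integer(),
         Index  :: 0 | pos_integer().

%% \Q...\E quotes Char so that regex metacharacters such as $. or $*
%% are matched literally. Offsets are in bytes, so this only mirrors
%% string:chr/2 for Latin-1 input.
chr(String, Char) ->
    case re:run(String, "\\Q" ++ [Char] ++ "\\E") of
        {match, [{Index, 1}]} -> Index + 1;
        nomatch               -> 0
    end.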

(And then, of course, doing the equivalent of 's/string:chr/my_string:chr/g' over the source...)

That is a relatively easy fix that spares whatever code depends on the exact return value of the current string:chr/2 function from a big rewrite.

But having to do so is a bit annoying.

Why not leave string:chr/2 in place as-is? I don't really know. Maybe because the string module is intended to conceptually do string processing, not array processing. Or something. Or whatever.
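
For what it's worth, the index can also be recovered from the remainder that string:find/2 returns, without touching the re module. This is only a sketch under some assumptions: OTP 20 or later, a flat list of code points as input, and a made-up module name (find_index). Note that string:length/1 counts grapheme clusters, so the result only lines up with the old string:chr/2 index for text without CRLF pairs or combining characters:

```erlang
-module(find_index).
-export([chr/2]).

%% Hypothetical wrapper, not part of OTP.
%% Returns the 1-based index of the first occurrence of Char in String,
%% or 0 if it is absent, computed from the remainder string:find/2 returns.
%% Positions are counted in grapheme clusters (see string:length/1).
-spec chr(String :: string(), Char :: char()) -> non_neg_integer().
chr(String, Char) ->
    case string:find(String, [Char]) of
        nomatch ->
            0;
        Remainder ->
            %% Everything before Remainder precedes the match, so the
            %% match starts one position after that prefix ends.
            string:length(String) - string:length(Remainder) + 1
    end.
```

This walks the string more than once, so it is no speed demon, but it keeps the old return convention for callers that depend on it.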

> - similarly, string:cspan/2 is marked "This function is [ obsolete ]( http://erlang.org/doc/man/string.html#oldapi ). Use [ take/3 ]( http://erlang.org/doc/man/string.html#take-3 )."
>   But string:take/3 returns leading and trailing data.

That is pretty odd also.

Once again, I think the idea here is that the new API is assuming what the intended use of this function typically was, and jumping straight to that intended effect instead of leaving the current intermediate step in place.

Once again, the re module can be used to get the same result:

40> string:cspan(S, "abc").
34
41> string:cspan(S, "ZXY").
80
42> string:cspan(S, "All").
0
43> Cspan = fun(S, C) -> case re:run(S, "[" ++ C ++ "]") of {match, [{I, 1}]} -> I; nomatch -> length(S) end end.
#Fun<erl_eval.12.87737649>
44> Cspan(S, "abc").
34
45> Cspan(S, "ZXY").
80
46> Cspan(S, "All").
0

Which indicates we could do:

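-module(my_string).
-export([cspan/2]).

-spec cspan(String, Chars) -> Index
    when String :: [non_neg_integer()],
         Chars  :: [non_neg_integer()],
         Index  :: non_neg_integer().

%% \Q...\E keeps characters such as $], $^ or $- literal inside the class.
%% Chars is assumed to be non-empty and Latin-1 only.
cspan(String, Chars) ->
    case re:run(String, "[\\Q" ++ Chars ++ "\\E]") of
        {match, [{Index, 1}]} -> Index;
        nomatch               -> length(String)
    end.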

(...and once again run sed for cspan on the source).
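
The same number can also be had from string:take/3 itself, since the leading part it returns is exactly the span that cspan/2 used to measure. A minimal sketch, assuming OTP 20 or later and a made-up module name (take_cspan); passing true as the third argument asks for the leading characters that are NOT in Chars. As with the rest of the new API, the length is counted in grapheme clusters:

```erlang
-module(take_cspan).
-export([cspan/2]).

%% Hypothetical wrapper, not part of OTP.
%% Returns the length of the longest prefix of String that contains
%% none of the characters in Chars.
-spec cspan(String :: string(), Chars :: string()) -> non_neg_integer().
cspan(String, Chars) ->
    %% Complement = true: take leading characters that are not in Chars.
    {Leading, _Trailing} = string:take(String, Chars, true),
    string:length(Leading).
```

Unlike the regex version there is nothing to escape here, and an empty Chars simply yields the length of the whole string.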

> Plus, I dread the future necessity of rewriting string code.

This is annoying -- I totally agree.

Overall I think the changes are good. They leave listy things in their natural place, the lists module, regexy things in the re module, and truly stringy things over true UTF-8 strings in the string module -- and this is a HUGE step forward in terms of doing non-Romaji things.

BUT

WHY some direct 1-for-1 replacement functions, references, advice, etc. are not included in the docs is beyond me. It does little good to direct someone from the docs on a deprecated index function to a string munging function when users have very likely already written their OWN string munging libs based on index return values.

Hopefully the switch won't hurt too terribly bad.

-Craig

PS: Thanks again for the awesome work on strings, Dan! I noticed I had an email sitting in my box from you months ago regarding some Kana functions you had put in. I wound up dropping out of civilization for a bit right then -- and I'll get back to you on it eventually.