[erlang-questions] [enhancement] string:split/2

Thu Oct 9 08:43:36 CEST 2008

On Wed, Oct 8, 2008 at 11:51 PM, Richard O'Keefe <ok@REDACTED> wrote:

> I agree that there should be an operation that is
> the converse of string:join/2.

*That's* the word I wanted, thanks.

> I'd like to see the spelling mistake (s/Seperators/Separators/)
> fixed in the comment for string:tokens/2.
> Can anyone explain to me why
>
> -spec(tokens/2 :: (string(), string()) -> [[char(),...]]).
>
> doesn't mention "string()" on the right hand side of the arrow?
>
> There are three minor quibbles about the proposed
> string:split/2.
>
> (1) We already have lists:split/2 and lists:splitwith/2.
>    A module system exists so that names can safely be reused,
>    but this is a little too close for comfort.
>

Well, split is used in some very widespread languages like Perl, Ruby, and
JavaScript. The version I proposed works much like Ruby's but doesn't have
all the "kitchen sink" options that are available in Ruby. This being a
multi-lingual software world, I decided to go with the Principle of Least
Astonishment and do as the Romans do, as it were. I regret the much earlier
naming of lists:split, which is a totally different beast, but I maintain
that string:split is the right name. As long as people don't overuse import
(which I never use, ever) it should be ok. If you *really* wanted to be
perverse, you could call it nioj (only joking, I never understood the
attraction of that Algol-68 esac/fi/od stuff).

>
>    While 'unjoin' is uglier than 'split', maybe it would mean
>    less confusion?
>

Maybe. Yuk.

>
> (2) What should string:split(Input, "") do?
>
>    One plausible answer would be to split the input into a
>    list of single-character strings.  I know this is what
>    Edwin Fine asked for, but is it the _right_ thing to do?

It's what Ruby and JavaScript do, and for all I know, others. That doesn't
mean it's *right* but it does mean it's unsurprising for people who have to
work across multiple languages that include these and similar ones.

>    Why is "abc" "" -> ["a","b","c"] the right answer rather
>    than ["abc"], for example?  Why is the separator deemed
>    to occur only between characters and not at the beginning
>    or end, yielding ["","a","b","c",""]?
>

Good point. I suppose by convention, really; we are talking about an
invisible zero-length string, after all, and there could be an infinite
number of them in any string. I think. I am sure you can tell from my
"idempotent" gaffe that I'm not a computer scientist by training; I are an
enjineer.
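
To make the three candidate answers for an empty separator concrete, here is a rough sketch. The module and function names (empty_sep_sketch, between/1, whole/1, edges/1) are made up for illustration and are not part of any proposal in this thread.

```erlang
-module(empty_sep_sketch).
-export([between/1, whole/1, edges/1]).

%% "" is deemed to occur only between adjacent characters:
%% "abc" -> ["a","b","c"]  (what Ruby and JavaScript do)
between(String) ->
    [[C] || C <- String].

%% "" is deemed never to occur: "abc" -> ["abc"]
whole(String) ->
    [String].

%% "" is deemed to occur at both ends as well:
%% "abc" -> ["","a","b","c",""]
edges(String) ->
    [""] ++ between(String) ++ [""].
```

All three results join back to "abc" with an empty separator, so the round-trip requirement alone does not pick a winner; the choice really is a matter of convention.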

>    Perhaps the best answer for now is to require the separator
>    to be a non-empty list.
>

That would work... but could cause exceptions in places where it's not
really an obvious error. I know you hate those. Me too.

> (3) unjoin:unjoin(";;;abc;;de;f;g;;", ";;").
>    [[],";abc","de;f;g",[]]
>
>    Is that the right answer, or should it be
>    [";","abc","de;f;g",[]]?
>

Out of interest, what does Ruby do?

irb(main):001:0> ";;;abc;;de;f;g;;".split ";;"
=> ["", ";abc", "de;f;g"]

Neither of the above. Hmmm.. why?
Ah. "If the *limit* parameter is omitted, trailing null fields are
suppressed." (http://www.ruby-doc.org/core/classes/String.html#M000818)

Well. Things *can* get really confusing.
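
For comparison, the Ruby result above is what you get from the first Erlang answer in (3) after dropping the trailing empty fields. A minimal sketch of that post-processing step follows; the module and function names are hypothetical, and this is only an illustration of the suppression rule, not a suggestion that string:split should behave this way.

```erlang
-module(trailing_sketch).
-export([drop_trailing_empties/1]).

%% Remove empty strings from the end of a list of fields,
%% leaving leading and interior empty fields untouched.
drop_trailing_empties(Fields) ->
    IsEmpty = fun(F) -> F =:= [] end,
    lists:reverse(lists:dropwhile(IsEmpty, lists:reverse(Fields))).

%% drop_trailing_empties(["", ";abc", "de;f;g", ""])
%%   returns ["", ";abc", "de;f;g"]
```

Note that once the trailing fields are dropped, joining the result no longer reproduces the original string.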

>> It should be possible to perform an idempotent transformation as follows:
>
> An idempotent transformation F is one such that
>
>        F(F(X)) = F(X)
>
> I see no idempotent transformation here.  A useful operation,
> yes, but an idempotent one, no.
>

I know, I know. I was trying to find the right word to signify a
"round-trip" operation that leaves the operand unchanged, and was in too
much of a hurry to make sure that idempotent meant that. Sorry.

This is more of a sort of mutual identity operation, like f(g(x)) = x.
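
In code, the property I was after is roughly the following. This is a sketch only: unjoin/2 here is a compact stand-in I wrote for the illustration (non-empty separators only), not the proposed implementation, and roundtrip/2 is a made-up helper name.

```erlang
-module(roundtrip_sketch).
-export([unjoin/2, roundtrip/2]).

%% Compact stand-in for the proposed operation.
%% Assumes a non-empty separator; splits at every occurrence.
unjoin(String, Sep = [_|_]) ->
    unjoin(String, Sep, length(Sep), []).

unjoin([], _Sep, _N, Rev) ->
    [lists:reverse(Rev)];
unjoin(String = [C|Cs], Sep, N, Rev) ->
    case lists:prefix(Sep, String) of
        true ->
            %% Separator found: emit the field collected so far and
            %% continue after the separator.
            [lists:reverse(Rev) | unjoin(lists:nthtail(N, String), Sep, N, [])];
        false ->
            unjoin(Cs, Sep, N, [C|Rev])
    end.

%% Splitting and then joining with the same separator should
%% give back the original string.
roundtrip(String, Sep) ->
    string:join(unjoin(String, Sep), Sep) =:= String.
```

With this stand-in, roundtrip(":This:is::a:contrived:example::", ":") returns true, which is only possible because the empty leading, interior, and trailing fields are all kept.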

>> Examples:
>>
>> > string:split(":", ":This:is::a:contrived:example::").
>> ["","This","is","","a","contrived","example","",""]
>> > string:split("", "Hello").
>> ["H","e","l","l","o"]
>
>
> Since the separator is the SECOND argument of
> string:join/2, I suggest that it should be the SECOND
> argument of string:unjoin/2 as well.
>

That was very careless of me. I specially wrote the specification of
string:split to be patterned after string:join, and then screwed up the
example. Eheu, mea culpa (no, really, I could kick myself).

> The following code
> (1) has the name I suggested (unjoin/2) rather than the name
>    Edwin Fine suggested (split/2);
> (2) has the argument order I suggested (consistent with join/2)
>    rather than the argument order Edwin Fine suggested;
> (3) has the "split into single character strings" behaviour
>    when presented with an empty separator that Edwin Fine
>    suggested, rather than any of the alternatives I did;
> (4) has been tested.
>

I'm assuming you are providing this code as an "executable requirements
document." I was rather hoping that it could be implemented as a BIF.
I'll have to study your code below so I can learn some "fast and fancy"
Erlang!

> unjoin(String, []) ->
>    unjoin0(String);
> unjoin(String, [Sep]) when is_integer(Sep) ->
>    unjoin1(String, Sep);
> unjoin(String, [C1,C2|L]) when is_integer(C1), is_integer(C2) ->
>    unjoin2(String, C1, C2, L).
>
> %% Split a string at "", which is deemed to occur _between_
> %% adjacent characters, but queerly, not at the beginning
> %% or the end.
>
> unjoin0([C|Cs]) ->
>    [[C] | unjoin0(Cs)];
> unjoin0([]) ->
>    [].
>
> %% Split a string at a single character separator.
>
> unjoin1(String, Sep) ->
>    unjoin1_loop(String, Sep, "").
>
> unjoin1_loop([Sep|String], Sep, Rev) ->
>    [lists:reverse(Rev) | unjoin1(String, Sep)];
> unjoin1_loop([Chr|String], Sep, Rev) ->
>    unjoin1_loop(String, Sep, [Chr|Rev]);
> unjoin1_loop([], _, Rev) ->
>    [lists:reverse(Rev)].
>
> %% Split a string at a multi-character separator
> %% [C1,C2|L].  These components are split out for
> %% a fast match.
>
> unjoin2(String, C1, C2, L) ->
>    unjoin2_loop(String, C1, C2, L, "").
>
> unjoin2_loop([C1|S = [C2|String]], C1, C2, L, Rev) ->
>    case unjoin_prefix(L, String)
>      of no   -> unjoin2_loop(S, C1, C2, L, [C1|Rev])
>       ; Rest -> [lists:reverse(Rev) | unjoin2(Rest, C1, C2, L)]
>    end;
> unjoin2_loop([Chr|String], C1, C2, L, Rev) ->
>    unjoin2_loop(String, C1, C2, L, [Chr|Rev]);
> unjoin2_loop([], _, _, _, Rev) ->
>    [lists:reverse(Rev)].
>
> unjoin_prefix([C|L], [C|S]) -> unjoin_prefix(L, S);
> unjoin_prefix([],    S)     -> S;
> unjoin_prefix(_,     _)     -> no.
>