[erlang-questions] [enhancement] string:split/2

Thu Oct 9 05:51:52 CEST 2008

I agree that there should be an operation that is
the converse of string:join/2.

I'd like to see the spelling mistake (s/Seperators/Separators/)
fixed in the comment for string:tokens/2.
Can anyone explain to me why

-spec(tokens/2 :: (string(), string()) -> [[char(),...]]).

doesn't mention "string()" on the right hand side of the arrow?
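One possible reading, offered as a guess rather than as the spec author's stated intent: [char(),...] is the non-empty list type, and string:tokens/2 never returns an empty token, so the narrower type is deliberate. A small check of that behaviour (the module name tokens_demo is made up for illustration):

```erlang
-module(tokens_demo).
-export([demo/0]).

%% string:tokens/2 drops empty fields, so every token it returns
%% contains at least one character.
demo() ->
    ["a","b"] = string:tokens("a,,b", ","),
    []        = string:tokens(",,", ","),
    ok.
```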

There are three minor quibbles about the proposed
string:split/2.

(1) We already have lists:split/2 and lists:splitwith/2.
     A module system exists so that names can safely be reused,
     but this is a little too close for comfort.

     While 'unjoin' is uglier than 'split', maybe it would mean
     less confusion?

(2) What should string:split(Input, "") do?

     One plausible answer would be to split the input into a
     list of single-character strings.  I know this is what
     Edwin Fine asked for, but is it the _right_ thing to do?
     Why is splitting "abc" at "" -> ["a","b","c"] the right
     answer rather than ["abc"], for example?  Why is the
     separator deemed to occur only between characters and not
     at the beginning or end, yielding ["","a","b","c",""]?

     Perhaps the best answer for now is to require the separator
     to be a non-empty list.
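To make the three candidate answers in (2) concrete, here is a minimal sketch. The module and function names (empty_sep, between/1, whole/1, with_ends/1) are invented for illustration and are not part of any proposal.

```erlang
-module(empty_sep).
-export([between/1, whole/1, with_ends/1]).

%% "" is deemed to occur only between adjacent characters.
between(String) -> [[C] || C <- String].

%% "" is deemed never to occur, so nothing is split.
whole(String) -> [String].

%% "" is deemed to occur before the first and after the last
%% character as well.
with_ends(String) -> [""] ++ between(String) ++ [""].
```

For "abc" these give ["a","b","c"], ["abc"] and ["","a","b","c",""] respectively.  Joining any of the three with "" gives back "abc", so the round trip through string:join/2 does not single out one of them.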

(3) unjoin:unjoin(";;;abc;;de;f;g;;", ";;").
     [[],";abc","de;f;g",[]]

     Is that the right answer, or should it be
     [";","abc","de;f;g",[]]?
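The two answers in (3) correspond to two scanning directions.  The following is only a sketch under that assumption; the module unjoin_scan and its functions left/2 and right/2 are hypothetical names, and the right-to-left scan is done by reversing everything and scanning left to right.

```erlang
-module(unjoin_scan).
-export([left/2, right/2]).

%% Scan left to right, cutting at the first match found.
%% The separator must be non-empty.
left(String, Sep) when Sep =/= [] ->
    left(String, Sep, length(Sep), []).

left([], _Sep, _N, Rev) ->
    [lists:reverse(Rev)];
left(String = [C|Cs], Sep, N, Rev) ->
    case lists:prefix(Sep, String) of
        true  -> [lists:reverse(Rev)
                  | left(lists:nthtail(N, String), Sep, N, [])];
        false -> left(Cs, Sep, N, [C|Rev])
    end.

%% Scan right to left: reverse the input and the separator,
%% scan left to right, then undo both reversals.
right(String, Sep) ->
    Pieces = left(lists:reverse(String), lists:reverse(Sep)),
    [lists:reverse(P) || P <- lists:reverse(Pieces)].
```

With the input ";;;abc;;de;f;g;;" and separator ";;", left/2 yields ["",";abc","de;f;g",""] and right/2 yields [";","abc","de;f;g",""].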

> It should be possible to perform an idempotent transformation as
> follows:

An idempotent transformation F is one such that

	F(F(X)) = F(X)

I see no idempotent transformation here.  A useful operation,
yes, but an idempotent one, no.
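A short sketch of the distinction.  It assumes OTP 20 or later, where string:split/3 with the option all is available; that function stands in here for the operation under discussion and is not the code proposed in this thread.  The module name idem_demo is made up.

```erlang
-module(idem_demo).
-export([demo/0]).

demo() ->
    %% Idempotent: applying lists:usort/1 a second time changes
    %% nothing, i.e. F(F(X)) = F(X).
    X = [3,1,2,3,1],
    Once = lists:usort(X),
    Once = lists:usort(Once),

    %% Inverse pair: splitting and then joining with the same
    %% separator restores the input, i.e. G(F(X)) = X.
    S = ":This:is::a:contrived:example::",
    Parts = string:split(S, ":", all),
    S = string:join(Parts, ":"),
    ok.
```

The second property relates two different functions, which is why "idempotent" is not the word for it.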

> Examples:
>
> > string:split(":", ":This:is::a:contrived:example::").
> ["","This","is","","a","contrived","example","",""]
> > string:split("", "Hello").
> ["H","e","l","l","o"]
>

Since the separator is the SECOND argument of
string:join/2, I suggest that it should be the SECOND
argument of string:unjoin/2 as well.

The following code
(1) has the name I suggested (unjoin/2) rather than the name
     Edwin Fine suggested (split/2);
(2) has the argument order I suggested (consistent with join/2)
     rather than the argument order Edwin Fine suggested;
(3) has the "split into single character strings" behaviour
     when presented with an empty separator that Edwin Fine
     suggested, rather than any of the alternatives I did;
(4) has been tested.

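-module(unjoin).
-export([unjoin/2]).

%% unjoin(String, Separator) dispatches on the length of the
%% separator: empty, one character, or two or more.

unjoin(String, []) ->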
     unjoin0(String);
unjoin(String, [Sep]) when is_integer(Sep) ->
     unjoin1(String, Sep);
unjoin(String, [C1,C2|L]) when is_integer(C1), is_integer(C2) ->
     unjoin2(String, C1, C2, L).

%% Split a string at "", which is deemed to occur _between_
%% adjacent characters, but queerly, not at the beginning
%% or the end.

unjoin0([C|Cs]) ->
     [[C] | unjoin0(Cs)];
unjoin0([]) ->
     [].

%% Split a string at a single character separator.

unjoin1(String, Sep) ->
     unjoin1_loop(String, Sep, "").

unjoin1_loop([Sep|String], Sep, Rev) ->
     [lists:reverse(Rev) | unjoin1(String, Sep)];
unjoin1_loop([Chr|String], Sep, Rev) ->
     unjoin1_loop(String, Sep, [Chr|Rev]);
unjoin1_loop([], _, Rev) ->
     [lists:reverse(Rev)].

%% Split a string at a multi-character separator
%% [C1,C2|L].  These components are split out for
%% a fast match.

unjoin2(String, C1, C2, L) ->
     unjoin2_loop(String, C1, C2, L, "").

unjoin2_loop([C1|S = [C2|String]], C1, C2, L, Rev) ->
     case unjoin_prefix(L, String) of
         no   -> unjoin2_loop(S, C1, C2, L, [C1|Rev]);
         Rest -> [lists:reverse(Rev) | unjoin2(Rest, C1, C2, L)]
     end;
unjoin2_loop([Chr|String], C1, C2, L, Rev) ->
     unjoin2_loop(String, C1, C2, L, [Chr|Rev]);
unjoin2_loop([], _, _, _, Rev) ->
     [lists:reverse(Rev)].

unjoin_prefix([C|L], [C|S]) -> unjoin_prefix(L, S);
unjoin_prefix([],    S)     -> S;
unjoin_prefix(_,     _)     -> no.
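
A few checks one might run against the code above.  This is a sketch that assumes the functions are compiled as a module named unjoin with unjoin/2 exported; the module name unjoin_check is invented for illustration.

```erlang
-module(unjoin_check).
-export([run/0]).

%% Assumes the code above lives in module `unjoin` and exports
%% unjoin/2.  Each line crashes with badmatch if a result differs.
run() ->
    %% Single-character separator; empty fields are kept.
    ["","This","is","","a","contrived","example","",""] =
        unjoin:unjoin(":This:is::a:contrived:example::", ":"),
    %% Empty separator: one string per character.
    ["H","e","l","l","o"] = unjoin:unjoin("Hello", ""),
    %% Multi-character separator, leftmost match first.
    ["",";abc","de;f;g",""] = unjoin:unjoin(";;;abc;;de;f;g;;", ";;"),
    %% Empty input: one empty field, except with an empty separator.
    [""] = unjoin:unjoin("", ":"),
    []   = unjoin:unjoin("", ""),
    %% Joining with the same separator restores the input.
    "a,b,,c" = string:join(unjoin:unjoin("a,b,,c", ","), ","),
    ok.
```

Note that an empty input gives [""] for a non-empty separator but [] for the empty one, which follows from the definition of unjoin0/1.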