[erlang-questions] [enhancement] string:split/2
Richard O'Keefe
ok@REDACTED
Thu Oct 9 05:51:52 CEST 2008
I agree that there should be an operation that is
the converse of string:join/2.
I'd like to see the spelling mistake (s/Seperators/Separators/)
fixed in the comment for string:tokens/2.
Can anyone explain to me why
-spec(tokens/2 :: (string(), string()) -> [[char(),...]]).
doesn't mention "string()" on the right hand side of the arrow?
There are three minor quibbles about the proposed
string:split/2.
(1) We already have lists:split/2 and lists:splitwith/2.
A module system exists so that names can safely be reused,
but this is a little too close for comfort.
While 'unjoin' is uglier than 'split', maybe it would mean
less confusion?
(2) What should string:split(Input, "") do?
One plausible answer would be to split the input into a
list of single-character strings. I know this is what
Edwin Fine asked for, but is it the _right_ thing to do?
Why is string:split("abc", "") -> ["a","b","c"] the right answer rather
than ["abc"], for example? Why is the separator deemed
to occur only between characters and not at the beginning
or end, yielding ["","a","b","c",""]?
Perhaps the best answer for now is to require the separator
to be a non-empty list.
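One way to see why the empty separator is awkward: all three candidate
results rejoin to the same string, so string:join/2 cannot be used to
pick among them. A minimal sketch (the module name empty_sep_demo is
made up for illustration):

```erlang
-module(empty_sep_demo).
-export([demo/0]).

%% Each candidate answer for splitting "abc" at "" joins back to "abc",
%% so "be the inverse of join" does not single out one of them.
demo() ->
    Candidates = [["a","b","c"], ["abc"], ["","a","b","c",""]],
    ["abc", "abc", "abc"] = [string:join(C, "") || C <- Candidates],
    ok.
```

Requiring a non-empty separator sidesteps the choice entirely.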
(3) unjoin:unjoin(";;;abc;;de;f;g;;", ";;").
[[],";abc","de;f;g",[]]
Is that the right answer, or should it be
[";","abc","de;f;g",[]]?
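The first answer is what a plain left-to-right scan gives. The sketch
below is a hypothetical reference splitter, not the code proposed in
this message: it cuts at the first position where the separator
matches, using lists:prefix/2, and accepts only non-empty separators.
It also shows that both candidate answers rejoin to the original
input, so join/2 alone does not settle the question.

```erlang
-module(naive_unjoin).
-export([unjoin/2, demo/0]).

%% Hypothetical reference splitter: scan left to right and cut at the
%% first place the separator matches. Non-empty separators only.
unjoin(String, Sep = [_|_]) ->
    scan(String, Sep, []).

scan(String, Sep, Rev) ->
    case lists:prefix(Sep, String) of
        true ->
            %% Separator found here: emit the piece collected so far
            %% and continue after the separator.
            Rest = lists:nthtail(length(Sep), String),
            [lists:reverse(Rev) | scan(Rest, Sep, [])];
        false ->
            case String of
                []     -> [lists:reverse(Rev)];
                [C|Cs] -> scan(Cs, Sep, [C|Rev])
            end
    end.

demo() ->
    Input = ";;;abc;;de;f;g;;",
    %% Leftmost matching gives the first of the two answers.
    ["", ";abc", "de;f;g", ""] = unjoin(Input, ";;"),
    %% Both candidate answers join back to the input.
    Input = string:join(["", ";abc", "de;f;g", ""], ";;"),
    Input = string:join([";", "abc", "de;f;g", ""], ";;"),
    ok.
```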
> It should be possible to perform an idempotent transformation as
> follows:
An idempotent transformation F is one such that
F(F(X)) = F(X)
I see no idempotent transformation here. A useful operation,
yes, but an idempotent one, no.
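To make the distinction concrete, here is a small sketch of the
definition; the module and function names are invented for the
example. lists:usort/1 is idempotent, while lists:reverse/1 undoes
itself when applied twice (much as join and unjoin undo each other)
and is not idempotent on the input used here.

```erlang
-module(idempotent_demo).
-export([is_idempotent_on/2, demo/0]).

%% True when applying F twice to X gives the same result as applying
%% it once, i.e. F(F(X)) =:= F(X) for this particular X.
is_idempotent_on(F, X) ->
    Once = F(X),
    F(Once) =:= Once.

demo() ->
    %% Sorting an already sorted, duplicate-free list changes nothing.
    true = is_idempotent_on(fun lists:usort/1, [3,1,2,3,1]),
    %% Reversing twice restores the original, which differs from
    %% reversing once.
    false = is_idempotent_on(fun lists:reverse/1, [1,2,3]),
    ok.
```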
> Examples:
>
> > string:split(":", ":This:is::a:contrived:example::").
> ["","This","is","","a","contrived","example","",""]
> > string:split("", "Hello").
> ["H","e","l","l","o"]
>
Since the separator is the SECOND argument of
string:join/2, I suggest that it should be the SECOND
argument of string:unjoin/2 as well.
The following code
(1) has the name I suggested (unjoin/2) rather than the name
Edwin Fine suggested (split/2);
(2) has the argument order I suggested (consistent with join/2)
rather than the argument order Edwin Fine suggested;
(3) has the "split into single character strings" behaviour
when presented with an empty separator that Edwin Fine
suggested, rather than any of the alternatives I did;
(4) has been tested.

    unjoin(String, []) ->
        unjoin0(String);
    unjoin(String, [Sep]) when is_integer(Sep) ->
        unjoin1(String, Sep);
    unjoin(String, [C1,C2|L]) when is_integer(C1), is_integer(C2) ->
        unjoin2(String, C1, C2, L).

    %% Split a string at "", which is deemed to occur _between_
    %% adjacent characters, but queerly, not at the beginning
    %% or the end.

    unjoin0([C|Cs]) ->
        [[C] | unjoin0(Cs)];
    unjoin0([]) ->
        [].

    %% Split a string at a single character separator.

    unjoin1(String, Sep) ->
        unjoin1_loop(String, Sep, "").

    unjoin1_loop([Sep|String], Sep, Rev) ->
        [lists:reverse(Rev) | unjoin1(String, Sep)];
    unjoin1_loop([Chr|String], Sep, Rev) ->
        unjoin1_loop(String, Sep, [Chr|Rev]);
    unjoin1_loop([], _, Rev) ->
        [lists:reverse(Rev)].

    %% Split a string at a multi-character separator
    %% [C1,C2|L].  These components are split out for
    %% a fast match.

    unjoin2(String, C1, C2, L) ->
        unjoin2_loop(String, C1, C2, L, "").

    unjoin2_loop([C1|S = [C2|String]], C1, C2, L, Rev) ->
        case unjoin_prefix(L, String)
          of no   -> unjoin2_loop(S, C1, C2, L, [C1|Rev])
           ; Rest -> [lists:reverse(Rev) | unjoin2(Rest, C1, C2, L)]
        end;
    unjoin2_loop([Chr|String], C1, C2, L, Rev) ->
        unjoin2_loop(String, C1, C2, L, [Chr|Rev]);
    unjoin2_loop([], _, _, _, Rev) ->
        [lists:reverse(Rev)].
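
    %% If the first list is a prefix of the second, return what
    %% follows that prefix; otherwise return the atom 'no'.

    unjoin_prefix([C|L], [C|S]) -> unjoin_prefix(L, S);
    unjoin_prefix([],    S)     -> S;
    unjoin_prefix(_,     _)     -> no.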