[erlang-questions] [enhancement] string:split/2
Thu Oct 9 08:43:36 CEST 2008
On Wed, Oct 8, 2008 at 11:51 PM, Richard O'Keefe wrote:
> I agree that there should be an operation that is
> the converse of string:join/2.
*That's* the word I wanted, thanks.
> I'd like to see the spelling mistake (s/Seperators/Separators/)
> fixed in the comment for string:tokens/2.
> Can anyone explain to me why
> -spec(tokens/2 :: (string(), string()) -> [[char(),...]]).
> doesn't mention "string()" on the right hand side of the arrow?
> There are three minor quibbles about the proposed
> (1) We already have lists:split/2 and lists:splitwith/2.
> A module system exists so that names can safely be reused,
> but this is a little too close for comfort.
Well, split is used in some very widespread languages like Perl, Ruby, and
all the "kitchen sink" options that are available in Ruby. This being a
multi-lingual software world, I decided to go with the Principle of Least
Astonishment and do as the Romans do, as it were. I regret the much earlier
naming of lists:split, which is a totally different beast, but I maintain
that string:split is the right name. As long as people don't overuse import
(which I never use, ever) it should be ok. If you *really* wanted to be
perverse, you could call it nioj (only joking, I never understood the
attraction of that Algol-68 esac/fi/od stuff).
> While 'unjoin' is uglier than 'split', maybe it would mean
> less confusion?
> (2) What should string:split(Input, "") do?
> One plausible answer would be to split the input into a
> list of single-character strings. I know this is what
> Edwin Fine asked for, but is it the _right_ thing to do?
mean it's *right* but it does mean it's unsurprising for people who have to
work across multiple languages that include these and similar ones.
> Why is "abc" "" -> ["a","b","c"] the right answer rather
> than ["abc"], for example? Why is the separator deemed
> to occur only between characters and not at the beginning
> or end, yielding ["","a","b","c",""]?
Good point. I suppose by convention, really; we are talking about an
invisible zero-length string, after all, and there could be an infinite
number of them in any string. I think. I am sure you can tell from my
"idempotent" gaffe that I'm not a computer scientist by training; I are an
> Perhaps the best answer for now is to require the separator
> to be a non-empty list.
That would work... but could cause exceptions in places where it's not
really an obvious error. I know you hate those. Me too.
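
To make the choices concrete, here is a minimal sketch of the candidate
behaviours for an empty separator. The module and function names are made
up for illustration; none of them is an existing or proposed library
function.

```erlang
-module(empty_sep_sketch).
-export([per_char/1, whole/1, with_ends/1, require_nonempty/1]).

%% Policy A: "" occurs only between adjacent characters.
per_char(String) -> [[C] || C <- String].

%% Policy B: "" never matches, so the input comes back as one field.
whole(String) -> [String].

%% Policy C: "" also occurs at the beginning and at the end.
with_ends(String) -> [""] ++ per_char(String) ++ [""].

%% Policy D: refuse an empty separator outright.
%% Calling this with "" raises a function_clause error.
require_nonempty([_|_] = Sep) -> Sep.
```

For "abc" these give ["a","b","c"], ["abc"] and ["","a","b","c",""]
respectively; the last policy is the one that turns an empty separator
into an exception.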
> (3) unjoin:unjoin(";;;abc;;de;f;g;;", ";;").
> Is that the right answer, or should it be
Out of interest, what does Ruby do?
irb(main):001:0> ";;;abc;;de;f;g;;".split ";;"
=> ["", ";abc", "de;f;g"]
Neither of the above. Hmmm.. why?
Ah. "If the *limit* parameter is omitted, trailing null fields are suppressed."
Well. Things *can* get really confusing.
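
For comparison, a rough Erlang sketch of that trailing-field rule, assuming
a splitter that keeps every field and so returns a final "" for this input.
The function name drop_trailing_empties/1 is hypothetical.

```erlang
-module(trailing_sketch).
-export([drop_trailing_empties/1]).

%% Remove every "" at the end of a list of fields, keeping leading
%% and interior empty fields untouched.
drop_trailing_empties(Fields) ->
    lists:reverse(
      lists:dropwhile(fun(F) -> F =:= "" end, lists:reverse(Fields))).
```

Applied to ["", ";abc", "de;f;g", ""] this gives ["", ";abc", "de;f;g"],
which is the irb result above. Note that dropping the trailing fields means
joining the result no longer gives back the original string.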
> > It should be possible to perform an idempotent transformation as follows:
> An idempotent transformation F is one such that
> F(F(X)) = F(X)
> I see no idempotent transformation here. A useful operation,
> yes, but an idempotent one, no.
I know, I know. I was trying to find the right word to signify a
"round-trip" operation that leaves the operand unchanged, and was in too
much of a hurry to make sure that idempotent meant that. Sorry.
This is more of a sort of mutual identity operation, like f(g(x)) = x.
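
The round-trip property can be written down as a small check. This is only
a sketch: it assumes an OTP release that ships string:split/3 (which is
newer than this thread), and roundtrip/2 is a made-up name.

```erlang
-module(roundtrip_sketch).
-export([roundtrip/2]).

%% true when joining the split fields with the same separator
%% gives back the original string. Only defined here for a
%% non-empty separator.
roundtrip(String, Sep) when Sep =/= "" ->
    Fields = string:split(String, Sep, all),
    string:join(Fields, Sep) =:= String.
```

The check only holds for a splitter that keeps leading, interior and
trailing empty fields; one that suppresses any of them cannot round-trip.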
> > string:split(":", ":This:is::a:contrived:example::").
> > string:split("", "Hello").
> Since the separator is the SECOND argument of
> string:join/2, I suggest that it should be the SECOND
> argument of string:unjoin/2 as well.
That was very careless of me. I specially wrote the specification of
string:split to be patterned after string:join, and then screwed up the
example. Eheu, mea culpa (no, really, I could kick myself).
> The following code
> (1) has the name I suggested (unjoin/2) rather than the name
> Edwin Fine suggested (split/2);
> (2) has the argument order I suggested (consistent with join/2)
> rather than the argument order Edwin Fine suggested;
> (3) has the "split into single character strings" behaviour
> when presented with an empty separator that Edwin Fine
> suggested, rather than any of the alternatives I did;
> (4) has been tested.
I'm assuming you are providing this code as an "executable requirements
document." I was rather hoping that it could be implemented as a BIF.
I'll have to study your code below so I can learn some "fast and fancy"
> unjoin(String, []) ->
>     unjoin0(String);
> unjoin(String, [Sep]) when is_integer(Sep) ->
>     unjoin1(String, Sep);
> unjoin(String, [C1,C2|L]) when is_integer(C1), is_integer(C2) ->
>     unjoin2(String, C1, C2, L).
> %% Split a string at "", which is deemed to occur _between_
> %% adjacent characters, but queerly, not at the beginning
> %% or the end.
> unjoin0([C|Cs]) ->
>     [[C] | unjoin0(Cs)];
> unjoin0([]) ->
>     [].
> %% Split a string at a single character separator.
> unjoin1(String, Sep) ->
>     unjoin1_loop(String, Sep, "").
> unjoin1_loop([Sep|String], Sep, Rev) ->
>     [lists:reverse(Rev) | unjoin1(String, Sep)];
> unjoin1_loop([Chr|String], Sep, Rev) ->
>     unjoin1_loop(String, Sep, [Chr|Rev]);
> unjoin1_loop([], _, Rev) ->
>     [lists:reverse(Rev)].
> %% Split a string at a multi-character separator
> %% [C1,C2|L]. These components are split out for
> %% a fast match.
> unjoin2(String, C1, C2, L) ->
>     unjoin2_loop(String, C1, C2, L, "").
> unjoin2_loop([C1|S = [C2|String]], C1, C2, L, Rev) ->
>     case unjoin_prefix(L, String)
>       of no   -> unjoin2_loop(S, C1, C2, L, [C1|Rev])
>        ; Rest -> [lists:reverse(Rev) | unjoin2(Rest, C1, C2, L)]
>     end;
> unjoin2_loop([Chr|String], C1, C2, L, Rev) ->
>     unjoin2_loop(String, C1, C2, L, [Chr|Rev]);
> unjoin2_loop([], _, _, _, Rev) ->
>     [lists:reverse(Rev)].
> unjoin_prefix([C|L], [C|S]) -> unjoin_prefix(L, S);
> unjoin_prefix([], S) -> S;
> unjoin_prefix(_, _) -> no.