[erlang-questions] [enhancement] string:split/2

Thu Oct 9 01:42:25 CEST 2008

Splitting a string into multiple parts based on a literal (as opposed to
regular expression or set of characters) string delimiter is an extremely
common operation in text processing. Its equal and opposite partner, join,
is also very common. split and join are complementary functions (that's just
what I call them here, so even if there's some more accurate or widely used
terminology, please bear with me).

At present, string:join exists and behaves like its cousins in other
programming languages. There is an *almost*, but not quite complementary
function in the string module: string:tokens/2.

It *should* be possible to make a lossless round trip (split, then join,
yielding the original string) as follows:

Input = "A,,B,C,",
Sep = ",",
Output = string:join(string:tokens(Input, Sep), Sep),
Output =:= Input.

The above would return *true* if string:join and string:tokens were truly
complementary, but it does not: string:tokens drops the empty fields, so
Output is "A,B,C" rather than "A,,B,C,".
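
To make the mismatch concrete, here is a minimal sketch that runs the round
trip and returns the intermediate values. The module name roundtrip_demo is
only an illustration and not part of the proposal.

```erlang
-module(roundtrip_demo).
-export([run/0]).

%% Runs the tokens/join round trip from the text and returns the
%% intermediate token list, the rejoined string, and whether the
%% rejoined string equals the input.
run() ->
    Input = "A,,B,C,",
    Sep = ",",
    Tokens = string:tokens(Input, Sep),  % empty fields are dropped here
    Output = string:join(Tokens, Sep),
    {Tokens, Output, Output =:= Input}.
```

Calling roundtrip_demo:run() yields {["A","B","C"],"A,B,C",false}: the empty
field between the two commas and the empty field after the trailing comma are
both lost.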

However, there is nothing *wrong* with string:tokens as it stands today
(except perhaps that it is a bit underdocumented: for example, it does not
mention that a run of consecutive delimiters is treated as a single
delimiter). It does what its name suggests: it tokenizes a string based on a
set of separator characters. In a tokenization operation, you usually *want* to
skip runs of consecutive delimiters, and that's precisely what it does.
I am not proposing a change to string:tokens; I *am* proposing *the addition
of string:split/2*.

split(String, Separator) -> List

Types:

String = string()
Separator = string()
List = [string()]

Returns a list of strings that were separated in String by Separator.

Separator can contain any number of characters, including zero.

Leading separators will result in empty strings as the first elements of the
returned list; the same is true for trailing separators and multiple
consecutive separators.

If Separator is the empty string, the returned list will be identical to the
list returned by [[X] || X <- String].

If Separator cannot be found in String, the returned list will be [String].
In general, if there are N separators embedded in String, the returned list
will contain N+1 strings.
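
The behaviour described above can be sketched in plain Erlang. This is an
unoptimised illustration of the proposed semantics, not a suggested
implementation; the module name split_sketch and the helper split/4 are
hypothetical and chosen only to keep the example outside the string module.

```erlang
-module(split_sketch).
-export([split/2]).

%% split(String, Separator) -> [string()]
%% An empty separator yields one single-character string per character.
split(String, []) ->
    [[C] || C <- String];
split(String, Sep) ->
    split(String, Sep, [], []).

%% Cur holds the current piece in reverse order;
%% Acc holds the finished pieces in reverse order.
split([], _Sep, Cur, Acc) ->
    lists:reverse([lists:reverse(Cur) | Acc]);
split([C | Rest] = Str, Sep, Cur, Acc) ->
    case lists:prefix(Sep, Str) of
        true ->
            %% Separator found: close the current piece and skip past it.
            Tail = lists:nthtail(length(Sep), Str),
            split(Tail, Sep, [], [lists:reverse(Cur) | Acc]);
        false ->
            split(Rest, Sep, [C | Cur], Acc)
    end.
```

With this sketch, string:join(split_sketch:split("A,,B,C,", ","), ",")
returns "A,,B,C,", which is the round trip that string:tokens cannot provide.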

Examples:
> string:split(":This:is::a:contrived:example::", ":").
["","This","is","","a","contrived","example","",""]
> string:split("Hello", "").
["H","e","l","l","o"]

*Existing Alternatives*

There are no existing alternatives that I could find that have similar
syntactic, semantic, and performance characteristics to split. Anything
using regular expressions is more cumbersome and probably slower, and
simulating this using Erlang code would also present performance penalties.
I searched the list archives using the phrases "string split" and
string:split, but found nothing that would cause me not to submit this
proposal. EEP 9 does mention a binary_string:split, but I still feel that
string:split would be a worthwhile addition to balance string:join.

I believe this is too trivial an addition to warrant an EEP, so I am
proposing it to the list for comment.