[erlang-questions] Erlang string datatype [cookbook entry #1 - unicode/UTF-8 strings]

Jon Watte
Tue Oct 25 04:46:00 CEST 2011


> > 3) Operations include (erlang-style) matching, (grep-style) finding,
> > converting, splitting, joining and trimming.
>
> Note that splitting is a tricky concept for languages like Japanese
> that don't have wordspacing, and in which the analogous thing might be
> splitting based on shifts between any of four distinct character sets
> (plus punctuation) within the overall Japanese character set.
> Otherwise, +1.
>
>

From what I recall, Kanji and Romaji have well-defined word separation, but
Katakana and Hiragana require semantic analysis and/or user hinting. It's
been a long time since I've done i18n, though, so I may be remembering wrong.
Anyway -- this is complex, which is why I'd want the library to do it,
rather than me :-)



> > 4) Random access is important! As is slicing. string:substr(from,to)
> > should be O(1).
>
> Binaries get you most of this already, so it goes back to how
> important (1) and (2) are to you. Erlang isn't used much for
>


I don't see why a native string that stores code points (similar to, say,
std::basic_string<wchar_t> in C++) couldn't also give me random access and O(1)
slicing.
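
As a rough sketch of what I mean, using only what binaries already offer:
if every code point is stored as a fixed-width 32-bit unit (UTF-32),
indexing and slicing reduce to byte-offset arithmetic. The module name
cp_string and its function names are made up for illustration; only the
unicode and binary library calls are real. Note that this indexes code
points, not user-perceived characters (combining sequences still span
several code points), and that whether the sub-binary is created without
copying is a property of the runtime, not something this code guarantees.

```erlang
-module(cp_string).
-export([from_list/1, to_list/1, at/2, slice/3, len/1]).

%% Hypothetical helper module: a "string" is a UTF-32 (big-endian) binary,
%% so every code point occupies exactly 4 bytes.

%% Convert a list of code points (or other chardata) to the fixed-width form.
from_list(Chars) ->
    unicode:characters_to_binary(Chars, unicode, utf32).

%% Convert back to a list of code points.
to_list(Bin) ->
    unicode:characters_to_list(Bin, utf32).

%% Number of code points.
len(Bin) ->
    byte_size(Bin) div 4.

%% Zero-based random access: skip Index 32-bit units, read the next one.
at(Bin, Index) ->
    <<_:Index/binary-unit:32, Cp:32, _/binary>> = Bin,
    Cp.

%% Zero-based slice of Len code points starting at Start.
slice(Bin, Start, Len) ->
    binary:part(Bin, Start * 4, Len * 4).
```

For example, cp_string:at(cp_string:from_list("abc"), 1) yields $b. The
price is memory: four bytes per code point even for plain ASCII text.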

> > 8) Easy conversion between binary, list (if needed at all), iolist, and
> > string. Strings should be "native" in iolists.
>
> I'm getting nervous about how big this job would be. ;-)
>
>

Hey, the question was "what would your ideal string API for Erlang look
like" and I answered that question :-) If the question is "what's the
biggest bang for the smallest buck" then you'd have to start making
engineering trade-offs. I still like to be able to make such trade-offs in
the context of a longer road map, though, to avoid globally pessimizing
through some short-term local optimization.

> > Sub-question: Can you find a nicer syntax than just "string" module
> > functino calls?
>
> "Functino"? You mean those functions they discovered recently that go
> slightly faster than light?
>


I wish I was that smart, but no -- a simple typo :-) The main question is --
if lists and binaries already have nice, built-in syntax, how about strings?
If any guard has to be expressed as a string:whatever() function, it's
inelegant. However, some of the things you want to express may be counter to
traditional Erlang syntax. If random access is O(1), you can do "endswith"
just as easily as "startswith" -- and maybe you even want to match
case-insensitively (for the current locale, or a specified locale)...
Those would still be side-effect-free guards, but not necessarily
particularly fast to execute.

do_command(Cmd) when string:startswith_case_insensitive(Cmd, "whatever",
locale:cyrillic()) ->
    do_whatever();
...

Can a nice syntax be found that looks more like binary matching, but has
these parameters?
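
For comparison, here is a sketch of how far the existing binary syntax
gets. A prefix can be matched directly in a function head, and a suffix
can be tested in a guard because binary_part/3 and byte_size/1 are
allowed there. Case-insensitive matching has no guard form, so it has to
be an ordinary function call in the body. The module cmd_match and its
function names are invented for this example; string:casefold/1 and
string:prefix/2 assume a reasonably recent OTP release, and nothing here
is locale-aware.

```erlang
-module(cmd_match).
-export([do_command/1, starts_with_ci/2]).

%% Prefix match: plain binary pattern in the function head.
do_command(<<"whatever", _/binary>>) ->
    whatever;
%% Suffix match: a negative length counts back from the end of the binary.
%% If Cmd is shorter than 4 bytes, binary_part/3 fails and so does the guard.
do_command(Cmd) when is_binary(Cmd),
                     binary_part(Cmd, byte_size(Cmd), -4) =:= <<"quit">> ->
    quit;
do_command(_) ->
    unknown.

%% Case-insensitive prefix test. Not usable in a guard, since guards may
%% only call a fixed set of built-in functions.
starts_with_ci(Str, Prefix) ->
    string:prefix(string:casefold(Str), string:casefold(Prefix)) =/= nomatch.
```

So "endswith" is already expressible as a guard on binaries, just not
prettily; the case-insensitive and locale-specific variants are where new
syntax would actually be needed.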

Using angle brackets doesn't strike me as particularly good-looking, but
then, does any language have a string syntax that uses neither of the two
ASCII quotes nor the double angle brackets, that could be stolen?
Caret (^a string^)? Pipe (|a string|)? Backtick (`a string`)?

Sincerely,

jw