[erlang-questions] strings vs binaries
zxq9
zxq9@REDACTED
Wed Aug 19 05:26:35 CEST 2015
On Tuesday, 18 August 2015 19:19:36 Steve Davis wrote:
> Hi zxq9,
> …example…?
> …test case…?
> (Did I miss a post already in history?)
> (Is the example/test case better handled in another platform? if so, which?, and how?)
> Please change my view if it is obviously incorrect.
> Best,
> /s
>
> > On Tuesday, 18 August 2015 17:47:53 Steve Davis wrote:
> >> In addition to what’s already been said...
> >>
> >> The old chestnut that “erlang is bad at string manipulation” completely goes away if you choose to use binaries for all your text.
> >>
> >> In fact, erlang is far superior at text tasks than most platforms I have used if you keep all your text as binaries.
>
> >...unless you need to do it in utf8...
>
Hi, Steve.
Sure. Let's say you have a string from which you always need to pull one specific segment, characters 3 to 11, for example. When the characters are ASCII that's marvelously easy with binaries, because character positions and byte offsets are the same thing. Not so much when they are utf8 multibyte characters, though:
<<"Getting characters from a range within this is easy.">>
<<"binaryではそんなに簡単ではないが、stringの方では問題ないですね。"/utf8>>
You will either have to use "(*UTF8)blahblah" regexes everywhere (which is appropriate for some cases, but not nearly enough) or convert those binaries to strings so that you can do the things you expected to be able to do with them easily.
Quite a few of the split-style functions (split, split on X, split after segment X-Y, etc.) don't work the way you would expect on utf8 in binary form, but do on utf8 strings.
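To make the difference concrete, here is a minimal sketch of the two approaches, assuming the binary holds valid utf8. The module and function names (utf8_slice, byte_slice/3, char_slice/3) are made up for illustration; underneath they only use the stock binary, unicode and lists modules.

```erlang
-module(utf8_slice).
-export([byte_slice/3, char_slice/3]).

%% Hypothetical helper: Start and Len count BYTES. Start is 1-based so
%% that it lines up with lists:sublist/3 below. This is only correct
%% when every character in the range is a single byte (e.g. ASCII).
byte_slice(Bin, Start, Len) when is_binary(Bin) ->
    binary:part(Bin, Start - 1, Len).

%% Hypothetical helper: Start and Len count CODE POINTS. The binary is
%% decoded to a list of code points, sliced as a list, and re-encoded.
%% Assumes Bin is valid utf8; otherwise the unicode functions return
%% an error/incomplete tuple instead of a list or binary.
char_slice(Bin, Start, Len) when is_binary(Bin) ->
    Chars = unicode:characters_to_list(Bin, utf8),
    Slice = lists:sublist(Chars, Start, Len),
    unicode:characters_to_binary(Slice, unicode, utf8).
```

With Start = 3 and Len = 9 (characters 3 to 11), both functions return <<"tting cha">> for the ASCII binary above. For the Japanese one, char_slice/3 returns the nine characters "naryではそんな", while byte_slice/3 returns nine bytes that stop partway through は and so are no longer valid utf8.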
The lucky thing there is that Erlang's output interfaces (format functions, socket output, etc.) accept iolists, so it doesn't really matter how messy a cobbled-together deep list happens to be before you hand it to them.
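As a rough illustration of what I mean by a messy deep list, here is a small sketch. The module and function names (iolist_demo, greeting/1, to_binary/1) are hypothetical; iolist_to_binary/1 and iolist_size/1 are the ordinary BIFs.

```erlang
-module(iolist_demo).
-export([greeting/1, to_binary/1]).

%% Hypothetical example: build output as a nested mix of binaries,
%% a string and a single byte, without concatenating anything.
%% Name is expected to be a binary (utf8-encoded if it is not ASCII).
greeting(Name) when is_binary(Name) ->
    [<<"Hello, ">>, [Name, [$!]], "\n"].

%% Flatten only if a single binary is really needed; output functions
%% that take iodata can be handed the deep list as it is.
to_binary(IoList) ->
    iolist_to_binary(IoList).
```

One caveat: an iolist may only contain binaries and integers in the range 0..255, so a list of code points above 255 is not an iolist. That kind of data has to go through unicode:characters_to_binary/1 first, or be kept as utf8 binaries inside the list as above.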
Delimiters also can catch you by surprise -- there are "many" 'kinds' 「of」 『quotes』 (and) (brackets) `in` 【use】 and not all are easy to make sense of or fit within a single byte. Spaces are spaces and tabs are tabs, sure, but check the spacing in this sentence carefully. Also, many multibyte characters are single width, ニホンゴデモ.
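A quick way to see which delimiters fit in a single byte is to encode each code point and count the bytes. This is only a sketch; the module and function names (delim_width, utf8_width/1, widths/1) are invented for the example.

```erlang
-module(delim_width).
-export([utf8_width/1, widths/1]).

%% Hypothetical helper: number of bytes one code point occupies once
%% it is encoded as utf8.
utf8_width(CodePoint) when is_integer(CodePoint) ->
    byte_size(<<CodePoint/utf8>>).

%% Hypothetical helper: pair every character of a code-point string
%% with its encoded width, e.g. to audit a list of delimiters.
widths(String) when is_list(String) ->
    [{[C], utf8_width(C)} || C <- String].
```

An ASCII double quote is one byte, while 「 (U+300C), 【 (U+3010), the ideographic space (U+3000) and the half-width katakana ニ (U+FF86) each take three bytes, even though the last one is displayed single width.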
My region obviously causes me to clash with this issue quite a bit more than seems to be the case with the rest of the Erlang world. That said, there are many tools/environments that make manipulating utf8 much easier than Erlang.
-Craig