[erlang-questions] strings vs binaries

Wed Aug 19 13:44:47 CEST 2015

On Wednesday, 19 August 2015 12:27:38 Jesper Louis Andersen wrote:
> On Wed, Aug 19, 2015 at 11:44 AM, Robert Virding <rvirding@REDACTED> wrote:
> 
> > Someone did some benchmarking on this a few years ago to compare the speed
> > of using binary utf-8 encoded strings vs list strings. They arrived at the
> > result, surprising to them, that using list strings was actually faster.
> 
> 
> Indeed, the speed characteristics are a complex affair involving CPU types,
> caches, workload, how data is passed around and so on. What I am
> specifically trying to warn about is the memory imprint, which is much
> greater for string() types.

Separate from the memory issues that can sneak up on you when dealing with binaries (especially the bit you mentioned about references to small segments of a large binary keeping the whole thing in memory) is the number of CPU cycles required to identify and assemble individual characters from utf-8 binaries. The space/time tradeoff can weigh heavily here, depending on the nature of the work to be done.
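To make the decoding cost concrete, here is a minimal sketch, assuming well-formed utf-8 input. The module and function names (utf8_cost, codepoints/1, count/1, demo/0) are made up for illustration. The point is only that a binary has to be walked and decoded code point by code point before you can say anything about its characters, while a list string already has one element per code point.

```erlang
-module(utf8_cost).
-export([codepoints/1, count/1, demo/0]).

%% Decode a utf-8 binary into a list of code points.
%% The /utf8 segment does the variable-width (1 to 4 byte) decoding.
%% Sketch only: malformed input crashes with a function_clause error.
codepoints(<<C/utf8, Rest/binary>>) -> [C | codepoints(Rest)];
codepoints(<<>>) -> [].

%% Count characters without building a list. Every code point still
%% has to be decoded, because byte offsets say nothing about characters.
count(Bin) -> count(Bin, 0).

count(<<_/utf8, Rest/binary>>, N) -> count(Rest, N + 1);
count(<<>>, N) -> N.

%% "naive" with U+00EF in place of the plain i: 5 code points, 6 bytes.
demo() ->
    Bin = unicode:characters_to_binary([$n, $a, 16#EF, $v, $e]),
    {byte_size(Bin), count(Bin), codepoints(Bin)}.
```

Calling utf8_cost:demo() gives a byte size of 6 but a character count of 5, since U+00EF takes two bytes in utf-8. Once codepoints/1 has run, length/1 and element-wise matching on the resulting list need no further decoding, which is the amortization at issue. The price is the larger footprint Jesper warned about: each list cell costs two machine words.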

Dealing with utf-8 strings that have already been identified and set into a list of discrete character values amortizes that part of the work (it is essentially a tokenization task, though not as straightforward as lexing is usually made out to be). Compared to ASCII values, where this entire task doesn't even exist, it is not difficult to imagine that in many (but by no means all) tasks list strings would be considerably faster to deal with than raw utf-8 binaries. That's not even touching the bazillion different unicode categorization shortcuts ("match on a character that is like an 'x', regardless of what accents or stacks of modifiers it has" or even "match on an assembled glyph that includes an 'x', or a complete character that would be an equivalent match", and so on).
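As a rough illustration of the "like an 'x'" kind of match, here is a sketch that leans on unicode:characters_to_nfd_list/1. That function arrived in OTP 20, so it was not available when this thread was written, and the module and function names (accent_match, has_base/2) are made up for the example. It assumes valid chardata and covers only canonical decompositions, which is a small corner of the matching problem described above.

```erlang
-module(accent_match).
-export([has_base/2]).

%% True if Chardata contains the code point Base, either bare or as the
%% base letter of a precomposed accented character.
%% NFD normalization splits precomposed characters into the base letter
%% followed by its combining marks, so a plain membership test is
%% enough after normalizing.
%% Sketch only: assumes valid chardata (list string or utf-8 binary).
has_base(Chardata, Base) when is_integer(Base) ->
    lists:member(Base, unicode:characters_to_nfd_list(Chardata)).
```

For example, accent_match:has_base([16#1E8D], $x) is true, because U+1E8D (x with diaeresis) decomposes to $x followed by U+0308, whereas lists:member($x, [16#1E8D]) on the raw list is false. Note that normalization is itself another pass over the data, on top of the utf-8 decoding a binary already needs.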

It is amazing to me that we have unicode and that usable implementations exist at all. Seriously, WOW! But it is by no means simple or easy.

-Craig