[erlang-questions] strings vs binaries

Wed Aug 19 12:04:44 CEST 2015

On Wednesday, 19 August 2015 10:44:20 Robert Virding wrote:
> Someone did some benchmarking on this a few years ago to compare the speed
> of using binary utf-8 encoded strings vs list strings. They arrived at the
> for them surprising result that using list strings was actually faster. It
> is actually not surprising if you think about it. So what Richard says is
> definitely very relevant:
> 
> Always, you need to ask "What am I going to do with this?"

With this in mind...

If the Erlang program is an application service sitting between a data store of some sort on one side and a constellation of client applications on the other, then most of what is happening isn't string manipulation, but shuttling data between the two. It is incidental that much of this data happens to be textual in the context of social and business applications (and that's not a rule). In this case who cares what it is? It doesn't need to be inspected, and it can all remain binary.

The protocols involved in this case are often textual, and it's somewhat rare to find a textual protocol that isn't 100% ASCII. Erlang makes parsing/consuming a textual ASCII protocol using binary syntax almost identical to parsing/consuming a binary protocol -- which feels pretty amazing if you've tried doing both in other environments before. Sure, it's not usually super hard in C, but with binary syntax (or even string matches) in Erlang it's just "write the protocol format out as matches" and you're done, unless the protocol sucks.
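To make the "write the protocol format out as matches" point concrete, here is a minimal sketch. Both protocols are invented for illustration (a "GET <key>\r\n" / "QUIT\r\n" text protocol and a type/length/payload binary frame), as are the module and function names; the only point is that the two parsers look nearly the same.

```erlang
-module(proto_sketch).
-export([parse_line/1, parse_frame/1]).

%% Hypothetical ASCII protocol: "QUIT\r\n" or "GET <key>\r\n".
parse_line(<<"QUIT\r\n">>) ->
    quit;
parse_line(<<"GET ", Rest/binary>>) ->
    case strip_crlf(Rest) of
        {ok, Key} -> {get, Key};
        error     -> {error, bad_line}
    end;
parse_line(_Other) ->
    {error, unknown_command}.

%% Match everything up to a trailing CRLF; the key stays a binary.
strip_crlf(Bin) when byte_size(Bin) >= 2 ->
    Size = byte_size(Bin) - 2,
    case Bin of
        <<Line:Size/binary, "\r\n">> -> {ok, Line};
        _                            -> error
    end;
strip_crlf(_Bin) ->
    error.

%% Hypothetical binary protocol: 1-byte type, 2-byte big-endian length,
%% then that many bytes of payload. Same technique, different segments.
parse_frame(<<Type:8, Len:16/big, Payload:Len/binary, Rest/binary>>) ->
    {ok, {Type, Payload}, Rest};
parse_frame(Partial) when is_binary(Partial) ->
    %% Not enough bytes yet; the caller should append more and retry.
    {more, Partial}.
```

Nothing here is ever converted to a list: keys and payloads are handed on as binaries, which fits the "just shuttling data" case.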

If, however, we need to do anything at all interesting with the text that is coming in -- looking for matches on textual keys, identifying a delimiter and splitting on it, complying with some new government contract that mandates a keyword check (sadly, not a joke), then passing part of the result around, filtering, etc. -- then it is, in my experience, *much* easier to take the received UTF-8 binary, convert it once to a string on receipt, and deal with it in those terms for the life of the procedure.
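A rough sketch of the "convert once on receipt" approach, assuming the input is a UTF-8 binary of the form "key:value". The ':' delimiter, the module name, and the function name are made up for this example.

```erlang
-module(text_sketch).
-export([split_field/1]).

%% Decode once at the boundary, then work on a plain list of code points.
split_field(Utf8Bin) when is_binary(Utf8Bin) ->
    case unicode:characters_to_list(Utf8Bin, utf8) of
        String when is_list(String) ->
            %% Split on the first ':' (delimiter chosen for illustration).
            case lists:splitwith(fun(C) -> C =/= $: end, String) of
                {Key, [$: | Value]} -> {ok, Key, Value};
                {_NoDelimiter, []}  -> {error, no_delimiter}
            end;
        _Invalid ->
            %% {error, _, _} or {incomplete, _, _}: not valid UTF-8.
            {error, bad_utf8}
    end.
```

Because Key and Value are lists of code points rather than bytes, length/1, list comprehensions, and the lists module all operate on characters, so later filtering or keyword checks don't have to think about the encoding.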

Going out again, of course, iolists can hold both binaries and strings, and you can write a string literal as a binary literal with no issue if you like. So you can let your binary and list results converge again at your output functions -- which is once again an extremely convenient detail someone was thoughtful enough to include in Erlang.
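As a small sketch of that convergence, assuming a made-up "KEY <key>=<value>\r\n" output line and hypothetical function names: the key may arrive as a binary and the value as a string, and neither needs converting before it is sent.

```erlang
-module(out_sketch).
-export([render/2, to_utf8/1]).

%% Key and Value may each be a binary or a string; the result is an
%% iolist that can be passed as-is to gen_tcp:send/2 or file:write/2.
render(Key, Value) ->
    [<<"KEY ">>, Key, $=, Value, "\r\n"].

%% Caveat: an iolist may only contain bytes (0..255). If a string can
%% hold code points above 127, encode the whole structure explicitly.
to_utf8(Chardata) ->
    unicode:characters_to_binary(Chardata).
```

For example, iolist_to_binary(out_sketch:render(<<"user">>, "craig")) flattens the mixed structure into a single binary, though a socket doesn't need it flattened at all.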

I've never measured the performance of the output functions over different types of iolist data, so I have no idea how this plays out. Fortunately, I haven't yet hit a performance bottleneck where worrying over this detail was warranted -- but I imagine passing deep lists to format functions, at least, might be expensive(ish), even if it seems to be pretty lightweight to pass ugly deep lists to a socket.

Sadly, neither approach is perfect for all cases. (But we can dream of a day...)

-Craig