[erlang-questions] strings vs binaries

Thu Aug 20 16:51:26 CEST 2015

On Thursday, 20 August 2015 09:30:26 you wrote:

> Actually, I don’t seem to have ever faced the problem of “get characters 3 to 11”. And I’ve dealt with some pretty diverse protocols...
> 
> If I did, I guess I’d use a small function that calls unicode:characters_to_list… extract them and convert the result back to binary!

In essence this is exactly what we do. We just do it with the entire binary input (given that it is supposed to be a utf8 string), and it winds up becoming a binary again on its way out. No pieces of the initial binary survive that initial conversion, and this is on purpose. We do the work of turning Z bytes into X characters exactly once, before the processing phase, instead of letting that task grow over time into something that happens all over the place.
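
A minimal sketch of that boundary pattern, for illustration only: the module name and the function names (decode/1, slice/3, encode/1) are made up here and are not taken from any real codebase. It assumes the input is meant to be utf8 and that "characters" means Unicode code points, not grapheme clusters.

```erlang
-module(utf8_boundary).
-export([decode/1, slice/3, encode/1]).

%% Hypothetical helper: decode once at the boundary.
%% Turns a utf8 binary into a list of code points, or reports why it could not.
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> {ok, Chars};
        {error, _Decoded, _Rest} -> {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} -> {error, incomplete_utf8}
    end.

%% Hypothetical helper: "get characters From to To", 1-based and inclusive.
%% Counts code points, so it never cuts a multi-byte sequence in half.
slice(Chars, From, To) when is_list(Chars), From >= 1, To >= From ->
    lists:sublist(Chars, From, To - From + 1).

%% Hypothetical helper: encode once on the way out.
encode(Chars) when is_list(Chars) ->
    unicode:characters_to_binary(Chars, unicode, utf8).
```

With this shape, everything between decode/1 and encode/1 works on plain lists, so a call such as slice(Chars, 3, 11) is ordinary list handling and no byte offsets appear in the processing code.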

In my experience it is the only way to maintain sanity in the face of utf8. Count yourself lucky if this is not your situation and you can be sure it never will be!

As I mentioned before, though, most string processing occurs in data services designed to handle text, or in clients that deal with text directly as a core part of what they do. In Erlang, which tends to be in the middle of all this in my case, we just don't do much text processing, at least not as a core task (generally speaking, the application server couldn't care less whether the data is a text message, a voice message, a video, a document, game save data, etc.). Where we do deal with text directly it has been consistently easier to manage it in utf8 than anything else, and in Erlang that has been much easier to deal with as lists than as binaries.

-Craig