<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Aug 18, 2015 at 10:16 PM, Ben Hsu <span dir="ltr"><<a href="mailto:benhsu@gmail.com" target="_blank">benhsu@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">My question is when you will use each one. Are binary strings used for sending data over the wire, and normal strings used internally? what are the tradeoffs?</blockquote></div><br></div><div class="gmail_extra">Somewhat absent from the discussion until now is the third string-like type in Erlang: atom().<br><br></div><div class="gmail_extra">* A string() is a linked list of Unicode code points on the process heap.<br></div><div class="gmail_extra">* A binary() is a buffer of binary data which can contain UTF-8 encoded strings. Binaries are allocated either on the process heap or, for large ones (more than 64 bytes), in the binary() arena in the memory allocator. This arena is shared among all processes.<br></div><div class="gmail_extra">* An atom() is a name which is mapped to an integer internally through the atom table, and the integer is then used in place of the name.<br><br></div><div class="gmail_extra">Caveats:<br><br></div><div class="gmail_extra">* string()'s are easy to manipulate with a lot of functions, but their memory footprint is large. This makes them unsuitable for storing large amounts of data as strings, if that data is in memory. They are usually fine for smaller things and for chunked flows, however.<br></div><div class="gmail_extra">* binary() data has problems if you hold on to subbinaries, because each one keeps the whole underlying binary alive. This is where `binary:copy/1` comes into play. The general binary handling documentation in the performance guide linked here has the details of binaries, subbinaries, heap binaries and match contexts.<br></div><div class="gmail_extra">* atom()'s have very fast equality checks. 
But the atom table is limited in size and atoms are never garbage collected, so you can't have a program which keeps creating new atoms dynamically. In many cases however, creating a mapping from external data to internal atoms yields very efficient programs.<br><br></div><div class="gmail_extra">The general rule from my blog post still holds true to this day: it is often better to internalize external data into a symbolic format. A language such as Erlang is efficient when it operates on symbolic data:<br><br></div><div class="gmail_extra">%% Map known external binaries to atoms; reject everything else.<br></div><div class="gmail_extra">intern(<<"fnord">>) -> {ok, fnord};<br></div><div class="gmail_extra">intern(<<"frobnificatization">>) -> {ok, frobnificatization};<br></div><div class="gmail_extra">intern(Unknown) -> {error, {cannot_intern, Unknown}}.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">%% Crash on unknown input: the match below fails with badmatch.<br></div><div class="gmail_extra">i(X) -><br></div><div class="gmail_extra"> {ok, A} = intern(X),<br></div><div class="gmail_extra"> A.<br><br></div><div class="gmail_extra">and so on. Of course, if the string is truly free-form text meant for humans to read, you can't parse it and have to lug it around as satellite data in your application. The size caveat of string()'s then has to be taken into account for programs which keep a large set of data in memory.<br><br clear="all"></div><div class="gmail_extra">(Note: This is without discussing the problem of handling Unicode data, which other people covered well already.)<br><br></div><div class="gmail_extra">-- <br><div class="gmail_signature">J.</div>
</div></div>
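<div dir="ltr"><div class="gmail_extra">P.S. A minimal sketch of the subbinary and atom table caveats above, not a definitive implementation. The module name string_caveats and the functions keep_prefix/2 and intern_existing/1 are made up for illustration; only binary:part/3, binary:copy/1 and binary_to_existing_atom/2 are standard calls.<br><br></div>
<div class="gmail_extra">-module(string_caveats).<br></div>
<div class="gmail_extra">-export([keep_prefix/2, intern_existing/1]).<br></div>
<div class="gmail_extra"><br></div>
<div class="gmail_extra">%% Hypothetical helper: take the first N bytes of Big and copy them,<br></div>
<div class="gmail_extra">%% so the result no longer references the (possibly large) original.<br></div>
<div class="gmail_extra">keep_prefix(Big, N) when is_binary(Big), N >= 0, byte_size(Big) >= N -><br></div>
<div class="gmail_extra"> binary:copy(binary:part(Big, 0, N)).<br></div>
<div class="gmail_extra"><br></div>
<div class="gmail_extra">%% Hypothetical helper: convert a binary to an atom only if that atom<br></div>
<div class="gmail_extra">%% already exists in the VM, so external input cannot grow the atom table.<br></div>
<div class="gmail_extra">intern_existing(Bin) when is_binary(Bin) -><br></div>
<div class="gmail_extra"> try<br></div>
<div class="gmail_extra"> {ok, binary_to_existing_atom(Bin, utf8)}<br></div>
<div class="gmail_extra"> catch<br></div>
<div class="gmail_extra"> error:badarg -> {error, {cannot_intern, Bin}}<br></div>
<div class="gmail_extra"> end.<br><br></div>
<div class="gmail_extra">Note that intern_existing/1 accepts any atom the VM already knows, not just the ones your application cares about, so an explicit mapping like intern/1 is still the stricter choice when the set of valid names is small.<br></div></div>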