[erlang-questions] lists, binaries or something else?

Erik Søe Sørensen eriksoe@REDACTED
Thu Jul 12 15:26:31 CEST 2012


As so often: "it depends" :-)

What are the lifetimes of the strings?
If the many large strings have to be alive simultaneously, then that points
towards binaries, because of the memory usage.
If the strings have rather short lifespans, on the other hand, then their
sizes may not matter so much as other things.
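
To put rough numbers on the memory difference, here is a minimal sketch. The module name str_mem is made up for illustration, and it assumes Latin-1 characters (one byte each in the binary). erts_debug:flat_size/1 is a debugging helper that reports a term's heap size in words.

```erlang
-module(str_mem).
-export([list_bytes/1, binary_bytes/1]).

%% Heap bytes used by a string held as a list of character codes.
%% erts_debug:flat_size/1 reports words; each list cell takes two words.
list_bytes(String) when is_list(String) ->
    erts_debug:flat_size(String) * erlang:system_info(wordsize).

%% Payload bytes of the same string as a binary (one byte per character,
%% assuming Latin-1). The small fixed per-binary overhead is ignored.
binary_bytes(String) when is_list(String) ->
    byte_size(list_to_binary(String)).
```

On a 64-bit emulator this works out to 16 bytes per character for the list against 1 byte for the binary, so a 1 M character string is roughly 16 MB as a list and roughly 1 MB as a binary.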

How are the strings constructed?
When you say that they're much faster to construct as lists, does that
include constructing the list, then transforming it into a binary?
Invoking the GC explicitly should only be necessary in scenarios where GC
doesn't occur for natural reasons within a reasonable time - for instance,
if the many strings are constructed and live in separate processes, and
these processes afterwards do so little work that it takes a long time
until the next GC.
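
For the build-as-list-then-convert case, a minimal sketch might look like this. The module str_compact and the BuildFun argument are hypothetical names, and list_to_binary/1 assumes every character fits in a byte; for wider characters something like unicode:characters_to_binary/1 would be needed instead.

```erlang
-module(str_compact).
-export([build_compact/1]).

%% BuildFun is a hypothetical zero-argument fun returning the string as a list.
build_compact(BuildFun) when is_function(BuildFun, 0) ->
    Bin = list_to_binary(BuildFun()),
    %% The intermediate list should be unreachable from here on; an explicit
    %% collection frees it now instead of waiting for the next natural GC.
    erlang:garbage_collect(),
    Bin.
```

Whether the explicit erlang:garbage_collect() call pays for itself depends on how soon the process would have collected anyway, so it is worth measuring with and without it.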

Do the strings share substrings?
If the strings are built from the same templates, say, or are the results
of generating all permutations of lines of the Divine Comedy - then using
an iolist would be a good idea.
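
As an illustration of the sharing, here is a small sketch assuming the strings are a fixed header and footer wrapped around varying bodies. The module and function names are invented for the example.

```erlang
-module(str_template).
-export([render/2]).

%% Header and Footer are shared by reference in every rendered page;
%% nothing is copied until an iolist is flattened or written out.
%% Each Body may itself be a list, a binary, or a nested iolist.
render({Header, Footer}, Bodies) ->
    [[Header, Body, Footer] || Body <- Bodies].
```

Functions such as file:write/2 and gen_tcp:send/2 accept iolists directly, so the result often never needs flattening; iolist_size/1 and iolist_to_binary/1 are available when a length or a flat binary is required.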

Are the strings built from parts that originally are (or could be) binaries?
Another advantage of using iolists is that you can mix lists and binaries.
This is good news if some of the source parts can just as easily be
obtained as binaries in the first place. It also helps if you don't want to
build it all as a list because of space, nor build it all as a binary
because of time: you can perhaps build pieces at a time as lists (e.g., a
line at a time), convert these pieces to binaries, and construct an iolist
containing the binary versions of the pieces.
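
A sketch of the piecewise approach, assuming a hypothetical LineFun that produces line N as a list of bytes (the module name str_lines is likewise made up):

```erlang
-module(str_lines).
-export([build/2]).

%% LineFun is a hypothetical fun(N) returning line N as a list of bytes.
%% Each line exists as a list only briefly before it is converted;
%% the result is an iolist whose elements are binaries.
build(LineFun, Count)
  when is_function(LineFun, 1), is_integer(Count), Count >= 0 ->
    [list_to_binary(LineFun(N)) || N <- lists:seq(1, Count)].
```

Only one line at a time pays the list overhead, while the bulk of the string sits in compact binaries.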

Using tuples instead of lists, as a twice-as-compact representation, is of
course possible if you're careful - bearing in mind the 64M tuple size
limit - though the only efficient way to construct such a thing would be
list_to_tuple(). It'd be awkward as well as unconventional, though, and
you'd have to think the use cases through first.
It wouldn't make sense to use tuples for much else than constant strings;
the only advantage over binaries might be that you'd avoid encoding or
decoding to/from UTF-8 or whatever you'll be using.
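
For completeness, a sketch of what the tuple representation could look like. The module and function names are hypothetical; only list_to_tuple/1, element/2 and tuple_size/1 are standard BIFs.

```erlang
-module(str_tuple).
-export([from_list/1, char_at/2, len/1]).

%% Build the constant string once from a list of code points.
from_list(String) when is_list(String) ->
    list_to_tuple(String).

%% Constant-time, 1-based access to a character; no decoding is needed
%% because each element is already a full code point.
char_at(Pos, Tuple) when is_tuple(Tuple) ->
    element(Pos, Tuple).

%% Number of characters in the string.
len(Tuple) when is_tuple(Tuple) ->
    tuple_size(Tuple).
```

Any modification copies the whole tuple, which is why this only looks reasonable for strings that never change after construction.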

Hoping this helps.
/Erik

2012/7/12 CGS <cgsmcmlxxv@REDACTED>

> Hi,
>
> I am trying to find a balance between processing speed and RAM
> consumption for sets of large strings (over 1 M characters per string).
> Constructing such strings as lists is much faster than constructing their
> binary counterparts. On the other hand, lists use more RAM than binaries,
> and that reduces the number of strings I can hold in memory (unless I
> transform the lists into binaries and call GC after that, but that slows
> down the processing). Has anyone had this problem before? What was the
> solution? Thoughts?
>
> A middle way between lists and binaries is using tuples, but handling
> them is not as easy as in the case of lists or binaries, especially with
> variable tuple sizes. Therefore, working with tuples does not seem a good
> solution. But I might be wrong, so if anyone has used tuples in an
> efficient way for this case, please let me know.
>
> Any thought would be very much appreciated. Thank you.
>
> CGS