[erlang-questions] lists, binaries or something else?

Thu Jul 12 17:17:03 CEST 2012

On Thu, Jul 12, 2012 at 3:26 PM, Erik Søe Sørensen <eriksoe@REDACTED>wrote:

> As so often: "it depends" :-)
>

Yes, I know. :)

>
> What are the lifetimes of the strings?
> If the many large strings have to be alive simultaneously, then that
> points towards binaries, because of the memory usage.
> If the strings have rather short lifespans, on the other hand, then their
> sizes may not matter so much as other things.
>

Pretty much I don't know. :) There is a dynamic behind those strings by
changing them, adding new ones, deleting old ones (that doesn't depend on
me). The problem is I may need to "squeeze" simultaneously in memory quite
some of them and, depending on the structure I can compute the limitations.

>
> How are the strings constructed?
> When you say that they're much faster to construct as lists, does that
> include constructing the list, then transforming it into a binary?
>

No, I was referring to constructing directly into binaries versus
constructing lists and transforming them in binaries. Much more like
recursion in strings as binaries versus as lists at string expanding point.
On my machine, creating a list of 1 Mchars and transforming it to binary
using list_to_binary/1 is much faster than building up directly binaries.
And I fear that is a general case and not only on my machine. But, correct
me if I am wrong.

> Invoking the GC explicitly should only be necessary in scenarios like if
> the many strings are constructed and live in separate processes, and these
> processes afterwards do so little work that it takes a long time until the
> next GC - that is, scenarios where GC doesn't occur for natural reasons
> within a reasonable time.
>

I was referring at constructing a list, transforming it in binary and
invoking GC to get rid of the list from the memory. I know GC would occur
for natural reasons when the memory becomes insufficient, but then I have
no control over how slow the application will be in those moments.
Therefore, I would go for calling GC after each transformation. But that is
quite inefficiently to do it, I know.

>
> Do the strings share substrings?
> If the strings are built from the same templates, say, or are the results
> of generating all permutations of lines of the Divine Comedy - then using
> an iolist would be a good idea.
>

Well, I haven't thought of iolist, it is true. I will have a closer look at
iolist. Thank you for suggesting it.

>
> Are the strings built from parts that originally are (or could be)
> binaries?
> Another advantage of using iolists is that you can mix lists and binaries;
> this is good news either if some of the source parts can just as easily be
> obtained as binaries in the first place, or if you don't want to build it
> all as a list because of space, nor build it all as a tbinary because of
> time: you can perhaps build pieces at a time as lists, then convert these
> pieces to binaries (e.g., build a line at a time), and construct an iolist
> containing the binary versions of the pieces.
>

As a low memory space requirement for the application, the strings should
be held in binaries as they are the smallest. The iolist, as you said, may
be a nice option.

>
> Using tuples instead of lists, as a twice-as-compact representation, is of
> course possible if you're careful - bearing in mind the 64M tuple size
> limit - though the only efficient way to construct such a thing would be
> list_to_tuple(). It'd be awkward, though, as well, as unconventional, and
> you'd have to think the use cases through first.
> It wouldn't make sense to use tuples for much else than constant strings;
> the only advantage over binaries might be that you'd avoid encoding or
> decoding to/from UTF-8 or whatever you'll be using.
>

About tuples, I had the same feeling. That's why I said they don't seem to
be a good solution. Nevertheless, I needed a confirmation that my way of
thinking was correct.

> Hoping this helps.
>

Thanks a lot for your thoughts. Of course they help and gave me some new
ideas. Feel free to add more thoughts if you have. They will be much
appreciated.

Regards,
CGS

> /Erik
>
> 2012/7/12 CGS <cgsmcmlxxv@REDACTED>
>
>> Hi,
>>
>> I am trying to find a balance in between processing speed and RAM
>> consumption for sets of large strings (over 1 M characters per string). To
>> construct such lists is much faster than constructing its binary
>> counterpart. On the other hand, lists are using more RAM than binaries, and
>> that reduces the number of strings I can hold in memory (unless I transform
>> the lists in binaries and call GC after that, but that slows down the
>> processing time). Has anyone had this problem before? What was the
>> solution? Thoughts?
>>
>> A middle way in between lists and binaries is using tuples, but handling
>> them is not as easy as in the case of lists or binaries, especially at
>> variable tuple size. Therefore, working with tuples seems not a good
>> solution. But I might be wrong, so, if anyone used tuples in an efficient
>> way for this case, please, let me know.
>>
>> Any thought would be very much appreciated. Thank you.
>>
>> CGS
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20120712/78fffd87/attachment.htm>