Regarding GC, an open question is whether you'll have one process holding all the strings (in which case GC will probably kick in soon enough on its own) or many processes each holding one or a few strings. In the latter case, you may want to look into gen_server's 'hibernate' feature (it builds on proc_lib:hibernate/3, which you can call directly if your processes aren't gen_servers). gen_fsm supports 'hibernate' as well. Hibernating right after the strings have been constructed garbage-collects the process and shrinks its heap to the live data.<br>
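<br>To make the hibernate idea concrete, here is a minimal sketch of a process that holds one string. The module name string_holder and its store/fetch API are made up for illustration, and it assumes the string fits in bytes (Latin-1), since list_to_binary/1 rejects larger code points. The only real point is the 'hibernate' atom in the reply tuple.<br>

```erlang
-module(string_holder).
-behaviour(gen_server).

%% Hypothetical API, for illustration only.
-export([start_link/0, store/2, fetch/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

store(Pid, String) ->
    gen_server:call(Pid, {store, String}).

fetch(Pid) ->
    gen_server:call(Pid, fetch).

init([]) ->
    {ok, undefined}.

%% Convert the list to a binary, then hibernate: the process is
%% garbage-collected and its heap shrunk before it waits for the
%% next message.
handle_call({store, String}, _From, _Old) ->
    Bin = list_to_binary(String),
    {reply, ok, Bin, hibernate};
handle_call(fetch, _From, Bin) ->
    {reply, Bin, Bin}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Waking up from hibernation has a cost, so this only pays off if the process really is idle for a while after each store.<br>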

<br>Depending on the nature of the strings and how long they may lie dormant, it may also be worth compressing the binaries, as Joe suggested. Erlang has a nice API for this (see the zlib module).<br>
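<br>A rough sketch of what that could look like, using zlib:compress/1 and zlib:uncompress/1. The module name string_pack and the pack/unpack functions are just names I picked, and it again assumes byte-sized characters.<br>

```erlang
-module(string_pack).

%% Hypothetical helper module, for illustration only.
-export([pack/1, unpack/1]).

%% Turn a dormant string (list of bytes) into a compressed binary.
pack(String) when is_list(String) ->
    zlib:compress(list_to_binary(String)).

%% Restore the list form when the string is needed again.
unpack(Packed) when is_binary(Packed) ->
    binary_to_list(zlib:uncompress(Packed)).
```

Comparing byte_size(string_pack:pack(S)) against length(S) on a sample of your real data should tell you quickly whether compression is worth the CPU time.<br>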

<br>Much depends on which modification operations you need to perform, of course, if any. It also depends on the access patterns in general and on the nature of the strings (similar or not; repetitive or not; ASCII/Unicode/DNA base pairs...).<br>

<br><br><div class="gmail_quote">2012/7/12 CGS <span dir="ltr">&lt;<a href="mailto:cgsmcmlxxv@gmail.com" target="_blank">cgsmcmlxxv@gmail.com</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="gmail_quote"><br></div><div class="gmail_quote">Hi Joe,</div><div class="gmail_quote"><br></div><div class="gmail_quote">The main problem is finding out which strings are read-only and which are read-write. That requires an algorithm of its own (with processing time and extra space whose cost I can't estimate at this moment), as I don't know in advance which strings will be used more frequently and which less frequently. The second problem is that I would like to minimize hard disk usage, that is, to store as much information as possible in RAM without slowing down the overall process. I know, I am an idealist. :)</div>


<div class="gmail_quote"><br></div><div class="gmail_quote">I also thought about working with lists and keeping them as binaries when I don't use them, but, as I said before, that implies a lot of garbage to collect. It can be collected immediately after invoking list_to_binary/1, left for GC to pick up naturally when memory runs short, or collected by invoking GC at certain moments (either at regular intervals or based on a scheduler triggered by application usage). I am afraid that all of these may be quite inefficient, but they may still work faster than processing binaries directly. I have no idea yet, which is why I am asking here for opinions.</div>


<div class="gmail_quote"><br></div><div class="gmail_quote">Nevertheless, I hadn't thought of splitting the strings into two categories: read-only and read-write. That is definitely something I should take into account.</div>


<div class="gmail_quote"><br></div><div class="gmail_quote">Thanks a lot for your thoughts and shared experience.</div><div class="gmail_quote"><br></div><div class="gmail_quote">Cheers,</div><div class="gmail_quote">CGS</div>


<div class="gmail_quote"><br></div><div class="gmail_quote">On Thu, Jul 12, 2012 at 5:17 PM, Joe Armstrong <span dir="ltr">&lt;<a href="mailto:erlang@gmail.com" target="_blank">erlang@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">As you point out list processing is faster than binary processing.<br>

<br>

I'd keep things as lists as long as possible until you run into memory problems<br>

If you plot the number of strings against response times (or whatever)<br>

you should see a sudden decrease in performance when you start paging.<br>

At that point you have too much in memory - You could turn the oldest strings<br>

into binaries to save space.<br>

<br>

I generally keep strings as lists when I'm working on them and turn<br>

them into binaries<br>

when I'm finished - sometimes even compressed binaries.<br>

<br>

Then it depends on the access patterns on the strings - random<br>

read-write access is horrible<br>

if you can split them into a read-only part and a write-part, you<br>

could keep the read-only bit<br>

as a binary and the writable bit as a list.<br>

<br>

It's worth spending a lot of effort to save a single disk access. Then<br>

it depends what you do with your strings. If you have a solid state<br>

disk and want read only access to the strings<br>

then you could store them on disk - or at least arrange so that the<br>

constant parts of the strings<br>

are on disk and the variable parts in memory. SSDs are about 10 times<br>

slower than RAM for reading and usually have multiple controllers so<br>

can be very fast - but you need to think a bit first.<br>

<br>

I'd start with a few measurements, try to stress the system and see<br>

where things go wrong.<br>

Plot the results - it's usually easy to see when things go wrong.<br>

<br>

Cheers<br>

<span><font color="#888888"><br>

/Joe<br>

</font></span><div><div class="h5"><div><div><br>

<br>

<br>

On Thu, Jul 12, 2012 at 2:46 PM, CGS &lt;<a href="mailto:cgsmcmlxxv@gmail.com" target="_blank">cgsmcmlxxv@gmail.com</a>&gt; wrote:<br>

> Hi,<br>

><br>

> I am trying to find a balance in between processing speed and RAM<br>

> consumption for sets of large strings (over 1 M characters per string). To<br>

> construct such lists is much faster than constructing its binary<br>

> counterpart. On the other hand, lists are using more RAM than binaries, and<br>

> that reduces the number of strings I can hold in memory (unless I transform<br>

> the lists in binaries and call GC after that, but that slows down the<br>

> processing time). Has anyone had this problem before? What was the solution?<br>

> Thoughts?<br>

><br>

> A middle way in between lists and binaries is using tuples, but handling<br>

> them is not as easy as in the case of lists or binaries, especially at<br>

> variable tuple size. Therefore, working with tuples seems not a good<br>

> solution. But I might be wrong, so, if anyone used tuples in an efficient<br>

> way for this case, please, let me know.<br>

><br>

> Any thought would be very much appreciated. Thank you.<br>

><br>

> CGS<br>

><br>

><br>

</div></div></div></div><div class="im"><div><div>> _______________________________________________<br>

> erlang-questions mailing list<br>

> <a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>

> <a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

><br>

</div></div></div></blockquote></div><br>


<br></blockquote></div><br>