<div class="gmail_quote">Hi Bob,</div><div class="gmail_quote"><br></div><div class="gmail_quote">On Fri, Jul 13, 2012 at 2:32 AM, Bob Ippolito <span dir="ltr"><<a href="mailto:bob@redivi.com" target="_blank">bob@redivi.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Sorry, but that's not enough information.</blockquote><div><br></div><div>Sorry, but I didn't know what you meant by that. Here are the answers.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>Where do these chunks come from: source code, some other process, ets? </div></blockquote>
<div><br></div><div>Mostly from other processes.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>How many chunks are in a string? </div></blockquote>
<div><br></div><div>That is computed from the string size and chunk size (to be decided later).</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Do you compose strings out of other strings, or just these chunks? </div>
</blockquote><div><br></div><div>Just chunks inserted into existing/new strings.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Are you constructing them from tail to head like you would a list? </div>
</blockquote><div><br></div><div>Unfortunately, not all the time.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Is the string constructed all at once, or over some time?</div>
</blockquote><div><br></div><div>If you mean that the whole string arrives in a single message (or similar) from the other processes, the answer is no. "Over some time" is closer to the answer, with the remark that I cannot say how long that period of time is (chunks can arrive one after the other for the same string or for different strings).</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><span></span></div>
<div><br></div><div>It sounds like you have some code that's doing something with these strings, because you say that it's faster to use lists than binaries. Maybe if you post whatever it is you used to determine that, someone can help you find a better algorithm and/or data structure for that benchmark. </div>
</blockquote><div><br></div><div>That is simple. For the benchmark I just used two simple loops (insertion only):</div><div><br></div><div>-export([loop_list/1, loop_binary/2]).</div><div><br></div><div>loop_list(0) -> [];</div>
<div>loop_list(N) -> [107 | loop_list(N-1)].</div><div><br></div><div>loop_binary(0,B) -> B;</div><div>loop_binary(N,B) -> loop_binary(N-1,<<107,B/binary>>).</div><div><br></div><div>If you go up in powers of ten and measure the execution time, you will see the difference (you will also notice the drop in efficiency for binaries when the same string needs fragmentation, which is not visible in the case of lists, or at least I couldn't notice it). That is not a fully conclusive test for my case, but it is quite conclusive for the processing speed of the two approaches (simple expansion of the string by adding a char at its beginning): for a string of 10^5 chars, the list version is at least one order of magnitude faster than its binary counterpart (1) (2) (3).</div>
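<div><br></div><div>A minimal sketch of how the two loops could be timed in powers of ten, assuming they live in one module. The module name prepend_bench and the helpers time_us/1 and run/0 are hypothetical and not part of the original test; timer:tc/1 is the standard OTP timing call:</div>
<pre>
-module(prepend_bench).
-export([loop_list/1, loop_binary/2, time_us/1, run/0]).

%% Build a list of N 'k' characters (107) by prepending one at a time.
loop_list(0) -> [];
loop_list(N) -> [107 | loop_list(N - 1)].

%% Build a binary of N 'k' bytes by prepending one byte at a time.
%% Every step creates a new binary that contains a copy of the old one.
loop_binary(0, B) -> B;
loop_binary(N, B) -> loop_binary(N - 1, <<107, B/binary>>).

%% Wall-clock time of a zero-arity fun, in microseconds.
time_us(Fun) ->
    {Micros, _Result} = timer:tc(Fun),
    Micros.

%% Time both loops for N = 10^1 .. 10^5 and print one row per size.
run() ->
    lists:foreach(
      fun(Exp) ->
              N = round(math:pow(10, Exp)),
              ListUs = time_us(fun() -> loop_list(N) end),
              BinUs = time_us(fun() -> loop_binary(N, <<>>) end),
              io:format("N=~w list=~w us binary=~w us~n", [N, ListUs, BinUs])
      end,
      lists:seq(1, 5)).
</pre>
<div>Calling prepend_bench:run() from the shell prints one row per size. The absolute numbers depend on the machine and the Erlang/OTP release, so only the trend between rows is meaningful.</div>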
<div><br></div><div>(1) Even if you change the list loop to have an accumulator in which the list exists, there is still one order of magnitude difference.</div><div>(2) The results are specific for the machine on which the tests are done.</div>
<div>(3) That is just an example.</div><div><br></div><div>CGS</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div><div class="h5">
<br>
<br>On Thursday, July 12, 2012, CGS wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Unfortunately, the project is still in the planning stage, so no real code has been written yet. Nevertheless, I plan some open source projects for some parts of the project.<div>
<br></div><div>Each string is constructed from chunks of fixed size which are usually (hopefully) much smaller than the string itself.</div>
<div><br></div><div><br><br><div>On Thu, Jul 12, 2012 at 7:29 PM, Bob Ippolito <span dir="ltr"><<a>bob@redivi.com</a>></span> wrote:<br><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
It would be helpful if you were a lot more specific about how these strings are constructed and what the strings mean. Maybe if you shared some of the code, you'd get better guidance.<div><div>
<br><br><div>On Thu, Jul 12, 2012 at 9:06 AM, CGS <span dir="ltr"><<a>cgsmcmlxxv@gmail.com</a>></span> wrote:<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>Hi Joe,</div><div><br></div><div>
The main problem is to find out which strings are read-only and which strings are read-write, and that requires an algorithm of its own (with a cost in processing time and extra space that I cannot estimate at this moment), as I don't know in advance which strings will be used more frequently and which less frequently. The second problem is that I would like to minimize hard disk usage, that is, to store as much information as possible in RAM, but without slowing down the overall process. I know, I am an idealist. :)</div>
<div><br></div><div>I also thought about working with lists and keeping them as binaries when I don't use them, but, as I said before, that implies a lot of garbage to collect. That garbage can be collected immediately after invoking list_to_binary/1, left for the GC to handle naturally when there is insufficient memory, or collected at certain moments (either at regular intervals of time or based on a scheduler triggered by the application usage). I am afraid that all of these may be quite inefficient, but they may still work faster than processing binaries directly. I have no idea yet, which is why I am asking here for opinions.</div>
<div><br></div><div>Nevertheless, I didn't think of trying to split the strings into two categories: read-only and read-write. That definitely is something I should take into account.</div>
<div><br></div><div>Thanks a lot for your thoughts and shared experience.</div><div><br></div><div>Cheers,</div><div>CGS</div>
<div><br></div><div><br></div><div><br></div><div><br></div><div>On Thu, Jul 12, 2012 at 5:17 PM, Joe Armstrong <span dir="ltr"><<a>erlang@gmail.com</a>></span> wrote:<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">As you point out list processing is faster than binary processing.<br>
<br>
I'd keep things as lists as long as possible until you run into memory problems.<br>
If you plot the number of strings against response times (or whatever)<br>
you should see a sudden decrease in performance when you start paging.<br>
At that point you have too much in memory - you could turn the oldest strings<br>
into binaries to save space.<br>
<br>
I generally keep strings as lists when I'm working on them and turn<br>
them into binaries<br>
when I'm finished - sometimes even compressed binaries.<br>
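<div>A minimal sketch of that list-to-compressed-binary round trip. The module name string_store and the functions freeze/1 and thaw/1 are hypothetical, and the sketch assumes the string holds only byte values (0..255), since list_to_binary/1 rejects larger code points:</div>
<pre>
-module(string_store).
-export([freeze/1, thaw/1]).

%% Turn a finished string (a list of bytes) into a compressed binary.
freeze(Chars) when is_list(Chars) ->
    zlib:compress(list_to_binary(Chars)).

%% Recover the original list when the string has to be edited again.
thaw(Packed) when is_binary(Packed) ->
    binary_to_list(zlib:uncompress(Packed)).
</pre>
<div>Whether compression pays off depends on the data; short or high-entropy strings may not shrink at all, so this is worth measuring first.</div>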
<br>
Then it depends on the access patterns on the strings. Random<br>
read-write access is horrible.<br>
If you can split them into a read-only part and a writable part, you<br>
could keep the read-only bit<br>
as a binary and the writable bit as a list.<br>
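<div>One possible shape for such a split string, as a sketch only. The module name split_string, the {Frozen, Tail} tuple layout and all function names are assumptions made for illustration, and the writable part is assumed to sit at the end of the string:</div>
<pre>
-module(split_string).
-export([new/1, append/2, seal/1, to_list/1]).

%% A string is a pair {Frozen, Tail}: Frozen is the read-only binary,
%% Tail is the writable part kept as a list of bytes.
new(Frozen) when is_binary(Frozen) ->
    {Frozen, []}.

%% Add a chunk (a list of bytes) at the end of the writable part.
append({Frozen, Tail}, Chunk) when is_list(Chunk) ->
    {Frozen, Tail ++ Chunk}.

%% Fold the writable part into the read-only binary.
seal({Frozen, Tail}) ->
    {<<Frozen/binary, (list_to_binary(Tail))/binary>>, []}.

%% Read the whole string back as a list.
to_list({Frozen, Tail}) ->
    binary_to_list(Frozen) ++ Tail.
</pre>
<div>Only the writable list is rebuilt on each append; the binary part is copied once, when seal/1 is called.</div>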
<br>
It's worth spending a lot of effort to save a single disk access. Then<br>
it depends what you do with your strings. If you have a solid state<br>
disk and want read only access to the strings<br>
then you could store them on disk - or at least arrange so that the<br>
constant parts of the strings<br>
are on disk and the variable parts in memory. SSDs are about 10 times<br>
slower than RAM for reading and usually have multiple controllers so<br>
can be very fast - but you need to think a bit first.<br>
<br>
I'd start with a few measurements, try to stress the system and see<br>
where things go wrong.<br>
Plot the results - it's usually easy to see when things go wrong.<br>
<br>
Cheers<br>
<span><font color="#888888"><br>
/Joe<br>
</font></span><div><div><div><div><br>
<br>
<br>
On Thu, Jul 12, 2012 at 2:46 PM, CGS <<a>cgsmcmlxxv@gmail.com</a>> wrote:<br>
> Hi,<br>
><br>
> I am trying to find a balance in between processing speed and RAM<br>
> consumption for sets </div></div></div></div></blockquote></div></blockquote></div></div></div></blockquote></div></div></blockquote></div></div></div>
</blockquote></div><br>