<div>Yes, you are right. When I did the test, I thought also about that, so, I made a third loop which should have appended the char to the binary string. Unfortunately, I had a typo there and I ended up having the same loop for binaries (and therefore the same numbers). Your message made me check it again and I found that typo. I didn't find quite those ratios you got, but, yes, they are comparable with your ratios. Thanks a lot.</div>

<div><br></div><div><br></div><div><br><div class="gmail_quote">On Fri, Jul 13, 2012 at 4:51 PM, Anders Nygren <span dir="ltr"><<a href="mailto:anders.nygren@gmail.com" target="_blank">anders.nygren@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">For some cases appending to a binary is much faster, see<br>

<a href="http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65923" target="_blank">http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65923</a><br>

<br>

So Your simple benchmark is not fair, You are comparing the best case<br>

for lists with the worst case for binaries. If You change it to<br>

appending instead of prepending You will get very different results.<br>

So as one of the first replies said "it depends".<br>

<br>

On my machine:<br>

For small N prepending to a list is ~2x faster than appending to a binary.<br>

When N=1E6 appending to a binary is ~2x faster than prepending to a list.<br>

<span class="HOEnZb"><font color="#888888"><br>

/Anders<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

On Fri, Jul 13, 2012 at 6:48 AM, CGS <<a href="mailto:cgsmcmlxxv@gmail.com">cgsmcmlxxv@gmail.com</a>> wrote:<br>

> Hi Bob,<br>

><br>

> On Fri, Jul 13, 2012 at 2:32 AM, Bob Ippolito <<a href="mailto:bob@redivi.com">bob@redivi.com</a>> wrote:<br>

>><br>

>> Sorry, but that's not enough information.<br>

><br>

><br>

> Sorry, but I didn't know what you meant by that. Here are the answers.<br>

><br>

>><br>

>><br>

>> Where do these chunks come from: source code, some other process, ets?<br>

><br>

><br>

> Mostly from other processes.<br>

><br>

>><br>

>> How many chunks are in a string?<br>

><br>

><br>

> That is computed from the string size and chunk size (to be decided later).<br>

><br>

>><br>

>> Do you compose strings out of other strings, or just these chunks?<br>

><br>

><br>

> Just chunks inserted into existing/new strings.<br>

><br>

>><br>

>> Are you constructing them from tail to head like you would a list?<br>

><br>

><br>

> Unfortunately, not all the time.<br>

><br>

>><br>

>> Is the string constructed all at once, or over some time?<br>

><br>

><br>

> If you mean by that the string will be fully given in the same message or<br>

> whatever by other processes, the answer is no. Over some time may be the<br>

> answer, but with the remark that I have no idea what means "over some time"<br>

> as period of time (can get chunks one after the other for the same string or<br>

> for different strings).<br>

><br>

>><br>

>><br>

>> It sounds like you have some code that's doing something with these<br>

>> strings, because you say that it's faster to use lists than binaries. Maybe<br>

>> if you post whatever it is you used to determine that, someone can help you<br>

>> find a better algorithm and/or data structure for that benchmark.<br>

><br>

><br>

> That is simple. For benchmark I just took two simple loops (only for<br>

> insertion):<br>

><br>

> -export([loop_list/1, loop_binary/2]).<br>

><br>

> loop_list(0) -> [];<br>

> loop_list(N) -> [107 | loop_list(N)].<br>

><br>

> loop_binary(0,B) -> B;<br>

> loop_binary(N,B) -> loop_binary(N-1,<<107,B/binary>>).<br>

><br>

> If you go in powers of tens and measure the execution time, you will see the<br>

> difference (you will also notice the drop in efficiency for binaries when<br>

> you need fragmentation for the same string, which is not visible in the case<br>

> of lists - or at least I couldn't notice). That is not a fully conclusive<br>

> test for my case, but it is quite a conclusive test for processing speed in<br>

> the two cases (simple expansion of the string by adding a char at the<br>

> beginning of the string), in which, for a 10^5 chars string, the list gains<br>

> at least one order of magnitude in processing time than its binary<br>

> counterpart (1) (2) (3).<br>

><br>

> (1) Even if you change the list loop to have an accumulator in which the<br>

> list exists, there is still one order of magnitude difference.<br>

> (2) The results are specific for the machine on which the tests are done.<br>

> (3) That is just an example.<br>

><br>

> CGS<br>

><br>

><br>

>><br>

>><br>

>><br>

>> On Thursday, July 12, 2012, CGS wrote:<br>

>>><br>

>>> Unfortunately, the project is still in the planning stage, so, no real<br>

>>> code was written yet. Nevertheless, I plan some open source projects for<br>

>>> some parts of the project.<br>

>>><br>

>>> About each string, it is constructed from chunks of fixed size, usually,<br>

>>> much smaller than the string itself, hopefully.<br>

>>><br>

>>><br>

>>><br>

>>> On Thu, Jul 12, 2012 at 7:29 PM, Bob Ippolito <<a href="mailto:bob@redivi.com">bob@redivi.com</a>> wrote:<br>

>>><br>

>>> It would be helpful if you were a lot more specific about how these<br>

>>> strings are constructed and what the strings mean. Maybe if you shared some<br>

>>> of the code, you'd get better guidance.<br>

>>><br>

>>><br>

>>> On Thu, Jul 12, 2012 at 9:06 AM, CGS <<a href="mailto:cgsmcmlxxv@gmail.com">cgsmcmlxxv@gmail.com</a>> wrote:<br>

>>><br>

>>><br>

>>> Hi Joe,<br>

>>><br>

>>> The main problem is to find out which strings are read-only and which<br>

>>> strings are read-write, and that requires an algorithm for itself<br>

>>> (processing time and extra space - I don't know how negligible are at this<br>

>>> moment) as I don't know from before which string will be used more<br>

>>> frequently and which less frequently. The second problem is I would like to<br>

>>> minimize the harddisk usage, so, to try to store as much information as<br>

>>> possible in RAM, but without slowing down the overall process. I know, I am<br>

>>> an idealist. :)<br>

>>><br>

>>> I thought also about working with lists and keep them as binaries when I<br>

>>> don't use them, but, as I said before, that implies a lot of garbage to<br>

>>> collect which either can be collected immediately after invoking<br>

>>> list_to_binary/1, either allowing GC to appear naturally when there is<br>

>>> insufficient memory, or to invoke it at certain moments (either at regular<br>

>>> interval of time or based on a scheduler triggered by the application<br>

>>> usage). I am afraid that all may be quite inefficient, but they may work<br>

>>> faster than processing binaries directly. That I have no idea yet. That's<br>

>>> why I am asking here for opinions.<br>

>>><br>

>>> Nevertheless, I didn't think of trying to split the strings in two<br>

>>> categories: read-only and read-write. That definitely is something I should<br>

>>> take into account.<br>

>>><br>

>>> Thanks a lot for your thoughts and shared experience.<br>

>>><br>

>>> Cheers,<br>

>>> CGS<br>

>>><br>

>>><br>

>>><br>

>>><br>

>>> On Thu, Jul 12, 2012 at 5:17 PM, Joe Armstrong <<a href="mailto:erlang@gmail.com">erlang@gmail.com</a>> wrote:<br>

>>><br>

>>> As you point out list processing is faster than binary processing.<br>

>>><br>

>>> I'd keep things as lists as long as possible until you run into memory<br>

>>> problems<br>

>>> If you plot the number of strings against response times (or whatever)<br>

>>> you should see a sudden decrease in performance when you start paging.<br>

>>> At that point you have too much in memory - You could turn the oldest<br>

>>> strings<br>

>>> into binaries to save space.<br>

>>><br>

>>> I generally keep string as lists when I'm working on them and turn<br>

>>> them into binaries<br>

>>> when I'm finished - sometimes even compressed binaries.<br>

>>><br>

>>> Then it depends on the access patterns on the strings - random<br>

>>> read-write access is horrible<br>

>>> if you can split them into a read-only part and a write-part, you<br>

>>> could keep the read-only bit<br>

>>> as a binary and the writable bit as a list.<br>

>>><br>

>>> It's worth spending a lot of effort to save a single disk access. Then<br>

>>> it depends what you do with your strings. If you have a solid state<br>

>>> disk and want read only access to the strings<br>

>>> then you could store them on disk - or at least arrange so that the<br>

>>> constant parts of the strings<br>

>>> are on disk and the variable parts in memory. SSDs are about 10 times<br>

>>> slower than RAM for reading and usually have multiple controllers so<br>

>>> can be very fast - but you need to think a bit first.<br>

>>><br>

>>> I'd start with a few measurements, try to stress the system and see<br>

>>> where things go wrong.<br>

>>> Plot the results - it's usually easy to see when things go wrong.<br>

>>><br>

>>> Cheers<br>

>>><br>

>>> /Joe<br>

>>><br>

>>><br>

>>><br>

>>> On Thu, Jul 12, 2012 at 2:46 PM, CGS <<a href="mailto:cgsmcmlxxv@gmail.com">cgsmcmlxxv@gmail.com</a>> wrote:<br>

>>> > Hi,<br>

>>> ><br>

>>> > I am trying to find a balance in between processing speed and RAM<br>

>>> > consumption for sets<br>

><br>

><br>

><br>

</div></div><div class="HOEnZb"><div class="h5">> _______________________________________________<br>

> erlang-questions mailing list<br>

> <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>

> <a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>

><br>

</div></div></blockquote></div><br></div>