<div>Yes, you are right. When I did the test, I thought also about that, so, I made a third loop which should have appended the char to the binary string. Unfortunately, I had a typo there and I ended up having the same loop for binaries (and therefore the same numbers). Your message made me check it again and I found that typo. I didn't find quite those ratios you got, but, yes, they are comparable with your ratios. Thanks a lot.</div>
<div><br></div><div><br></div><div><br><div class="gmail_quote">On Fri, Jul 13, 2012 at 4:51 PM, Anders Nygren <span dir="ltr"><<a href="mailto:anders.nygren@gmail.com" target="_blank">anders.nygren@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">For some cases appending to a binary is much faster, see<br>
<a href="http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65923" target="_blank">http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65923</a><br>
<br>
So Your simple benchmark is not fair, You are comparing the best case<br>
for lists with the worst case for binaries. If You change it to<br>
appending instead of prepending You will get very different results.<br>
So as one of the first replies said "it depends".<br>
<br>
On my machine:<br>
For small N prepending to a list is ~2x faster than appending to a binary.<br>
When N=1E6 appending to a binary is ~2x faster than prepending to a list.<br>
<span class="HOEnZb"><font color="#888888"><br>
/Anders<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
On Fri, Jul 13, 2012 at 6:48 AM, CGS <<a href="mailto:cgsmcmlxxv@gmail.com">cgsmcmlxxv@gmail.com</a>> wrote:<br>
> Hi Bob,<br>
><br>
> On Fri, Jul 13, 2012 at 2:32 AM, Bob Ippolito <<a href="mailto:bob@redivi.com">bob@redivi.com</a>> wrote:<br>
>><br>
>> Sorry, but that's not enough information.<br>
><br>
><br>
> Sorry, but I didn't know what you meant by that. Here are the answers.<br>
><br>
>><br>
>><br>
>> Where do these chunks come from: source code, some other process, ets?<br>
><br>
><br>
> Mostly from other processes.<br>
><br>
>><br>
>> How many chunks are in a string?<br>
><br>
><br>
> That is computed from the string size and chunk size (to be decided later).<br>
><br>
>><br>
>> Do you compose strings out of other strings, or just these chunks?<br>
><br>
><br>
> Just chunks inserted into existing/new strings.<br>
><br>
>><br>
>> Are you constructing them from tail to head like you would a list?<br>
><br>
><br>
> Unfortunately, not all the time.<br>
><br>
>><br>
>> Is the string constructed all at once, or over some time?<br>
><br>
><br>
> If you mean by that the string will be fully given in the same message or<br>
> whatever by other processes, the answer is no. Over some time may be the<br>
> answer, but with the remark that I have no idea what means "over some time"<br>
> as period of time (can get chunks one after the other for the same string or<br>
> for different strings).<br>
><br>
>><br>
>><br>
>> It sounds like you have some code that's doing something with these<br>
>> strings, because you say that it's faster to use lists than binaries. Maybe<br>
>> if you post whatever it is you used to determine that, someone can help you<br>
>> find a better algorithm and/or data structure for that benchmark.<br>
><br>
><br>
> That is simple. For benchmark I just took two simple loops (only for<br>
> insertion):<br>
><br>
> -export([loop_list/1, loop_binary/2]).<br>
><br>
> loop_list(0) -> [];<br>
> loop_list(N) -> [107 | loop_list(N)].<br>
><br>
> loop_binary(0,B) -> B;<br>
> loop_binary(N,B) -> loop_binary(N-1,<<107,B/binary>>).<br>
><br>
> If you go in powers of tens and measure the execution time, you will see the<br>
> difference (you will also notice the drop in efficiency for binaries when<br>
> you need fragmentation for the same string, which is not visible in the case<br>
> of lists - or at least I couldn't notice). That is not a fully conclusive<br>
> test for my case, but it is quite a conclusive test for processing speed in<br>
> the two cases (simple expansion of the string by adding a char at the<br>
> beginning of the string), in which, for a 10^5 chars string, the list gains<br>
> at least one order of magnitude in processing time than its binary<br>
> counterpart (1) (2) (3).<br>
><br>
> (1) Even if you change the list loop to have an accumulator in which the<br>
> list exists, there is still one order of magnitude difference.<br>
> (2) The results are specific for the machine on which the tests are done.<br>
> (3) That is just an example.<br>
><br>
> CGS<br>
><br>
><br>
>><br>
>><br>
>><br>
>> On Thursday, July 12, 2012, CGS wrote:<br>
>>><br>
>>> Unfortunately, the project is still in the planning stage, so, no real<br>
>>> code was written yet. Nevertheless, I plan some open source projects for<br>
>>> some parts of the project.<br>
>>><br>
>>> About each string, it is constructed from chunks of fixed size, usually,<br>
>>> much smaller than the string itself, hopefully.<br>
>>><br>
>>><br>
>>><br>
>>> On Thu, Jul 12, 2012 at 7:29 PM, Bob Ippolito <<a href="mailto:bob@redivi.com">bob@redivi.com</a>> wrote:<br>
>>><br>
>>> It would be helpful if you were a lot more specific about how these<br>
>>> strings are constructed and what the strings mean. Maybe if you shared some<br>
>>> of the code, you'd get better guidance.<br>
>>><br>
>>><br>
>>> On Thu, Jul 12, 2012 at 9:06 AM, CGS <<a href="mailto:cgsmcmlxxv@gmail.com">cgsmcmlxxv@gmail.com</a>> wrote:<br>
>>><br>
>>><br>
>>> Hi Joe,<br>
>>><br>
>>> The main problem is to find out which strings are read-only and which<br>
>>> strings are read-write, and that requires an algorithm for itself<br>
>>> (processing time and extra space - I don't know how negligible are at this<br>
>>> moment) as I don't know from before which string will be used more<br>
>>> frequently and which less frequently. The second problem is I would like to<br>
>>> minimize the harddisk usage, so, to try to store as much information as<br>
>>> possible in RAM, but without slowing down the overall process. I know, I am<br>
>>> an idealist. :)<br>
>>><br>
>>> I thought also about working with lists and keep them as binaries when I<br>
>>> don't use them, but, as I said before, that implies a lot of garbage to<br>
>>> collect which either can be collected immediately after invoking<br>
>>> list_to_binary/1, either allowing GC to appear naturally when there is<br>
>>> insufficient memory, or to invoke it at certain moments (either at regular<br>
>>> interval of time or based on a scheduler triggered by the application<br>
>>> usage). I am afraid that all may be quite inefficient, but they may work<br>
>>> faster than processing binaries directly. That I have no idea yet. That's<br>
>>> why I am asking here for opinions.<br>
>>><br>
>>> Nevertheless, I didn't think of trying to split the strings in two<br>
>>> categories: read-only and read-write. That definitely is something I should<br>
>>> take into account.<br>
>>><br>
>>> Thanks a lot for your thoughts and shared experience.<br>
>>><br>
>>> Cheers,<br>
>>> CGS<br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> On Thu, Jul 12, 2012 at 5:17 PM, Joe Armstrong <<a href="mailto:erlang@gmail.com">erlang@gmail.com</a>> wrote:<br>
>>><br>
>>> As you point out list processing is faster than binary processing.<br>
>>><br>
>>> I'd keep things as lists as long as possible until you run into memory<br>
>>> problems<br>
>>> If you plot the number of strings against response times (or whatever)<br>
>>> you should see a sudden decrease in performance when you start paging.<br>
>>> At that point you have too much in memory - You could turn the oldest<br>
>>> strings<br>
>>> into binaries to save space.<br>
>>><br>
>>> I generally keep string as lists when I'm working on them and turn<br>
>>> them into binaries<br>
>>> when I'm finished - sometimes even compressed binaries.<br>
>>><br>
>>> Then it depends on the access patterns on the strings - random<br>
>>> read-write access is horrible<br>
>>> if you can split them into a read-only part and a write-part, you<br>
>>> could keep the read-only bit<br>
>>> as a binary and the writable bit as a list.<br>
>>><br>
>>> It's worth spending a lot of effort to save a single disk access. Then<br>
>>> it depends what you do with your strings. If you have a solid state<br>
>>> disk and want read only access to the strings<br>
>>> then you could store them on disk - or at least arrange so that the<br>
>>> constant parts of the strings<br>
>>> are on disk and the variable parts in memory. SSDs are about 10 times<br>
>>> slower than RAM for reading and usually have multiple controllers so<br>
>>> can be very fast - but you need to think a bit first.<br>
>>><br>
>>> I'd start with a few measurements, try to stress the system and see<br>
>>> where things go wrong.<br>
>>> Plot the results - it's usually easy to see when things go wrong.<br>
>>><br>
>>> Cheers<br>
>>><br>
>>> /Joe<br>
>>><br>
>>><br>
>>><br>
>>> On Thu, Jul 12, 2012 at 2:46 PM, CGS <<a href="mailto:cgsmcmlxxv@gmail.com">cgsmcmlxxv@gmail.com</a>> wrote:<br>
>>> > Hi,<br>
>>> ><br>
>>> > I am trying to find a balance in between processing speed and RAM<br>
>>> > consumption for sets<br>
><br>
><br>
><br>
</div></div><div class="HOEnZb"><div class="h5">> _______________________________________________<br>
> erlang-questions mailing list<br>
> <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>
> <a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
><br>
</div></div></blockquote></div><br></div>