Fwd: [erlang-questions] Re: data sharing is outside the semantics of Erlang, but it sure is useful

Tony Rogvall tony@REDACTED
Tue Sep 15 22:38:57 CEST 2009


Forgot the "Reply All" ....

Begin forwarded message:

> From: Tony Rogvall <tony@REDACTED>
> Date: Tue 15 Sep 2009 22:36:32 GMT+02:00
> To: Witold Baryluk <baryluk@REDACTED>
> Subject: Re: [erlang-questions] Re: data sharing is outside the semantics of Erlang, but it sure is useful
>
> Here is a nice term encoder/decoder (below) that solves the size problem.
> The algorithm is not efficient (in Erlang), but if/when implemented in C
> it could be used as a BIF. Still, it produces decent results on (all)
> shared data.
>
> The encoding of Y9 (from previous example) is then:
>
> te:encode(Y9).
>
> {2,
>      {1,1,1,1},
>      {2,2,2,2},
>      {3,3,3,3},
>      {4,4,4,4},
>      {5,5,5,5},
>      {6,6,6,6},
>      {7,7,7,7},
>      {8,8,8,8},
>      {9,9,9,9},
>      {10,10,10,10},
>      11}
>
> That, when encoded with term_to_binary, leads to a binary of size 107.
> Coded straight, it was 2796203 bytes!!!
>
> Other examples that reveal some properties of the algorithm:
>
> te:encode({[a,b],[a,b,c],[a,b,c,d]}).
>
> {a,b,c,d,[],
>   [4|5],
>   [3|6],
>   [2|7],
>   [1|8],
>   [3|5],
>   [2|10],
>   [1|11],
>   [2|5],
>   [1|13],
>   {14,12,9},
>   15}
>
> te:encode({[b,a],[c,b,a],[d,c,b,a]}).
>
> {d,c,b,a,[],[4|5],[3|6],[2|7],[1|8],{7,8,9},10}
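>
> Decoding inverts the encoding, so a quick round-trip check (once the
> module below is compiled) is:
>
> T = {[b,a],[c,b,a],[d,c,b,a]},
> T =:= te:decode(te:encode(T)).   %% true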
>
>
> Applied to code (parse trees), this will let you detect all common
> subexpressions (renaming bindings and removing line numbers helps a
> lot ;-)
>
> /Tony
>
> -------------
> -module(te).
>
> -export([encode/1, decode/1]).
> -export([binary_to_term/1, term_to_binary/1]).
>
> %%
> %% Decode a term given the pointer map: element I of the map is either a
> %% leaf term, a cons cell [HIx|TIx] of map indices, or a tuple of map
> %% indices. The last element of the map is the index of the root term.
> %%
> binary_to_term(Bin) ->
>    decode(erlang:binary_to_term(Bin)).
>
> decode(Map) ->
>    decode(1,Map).
>
> %% When implementing this in C the setelement is of course destructive
> decode(I,Map) ->
>    E = element(I,Map),
>    if I == size(Map) ->
> 	    element(E,Map);
>       is_list(E), E =/= [] ->
> 	    H = element(hd(E),Map),
> 	    T = element(tl(E),Map),
> 	    decode(I+1, setelement(I, Map, [H|T]));
>       is_tuple(E), size(E) > 0 ->
> 	    T = decode_tuple(E,size(E),Map,[]),
> 	    decode(I+1, setelement(I, Map, T));
>       true ->
> 	    decode(I+1, Map)
>    end.
>
>
> decode_tuple(_T,0,_Map,Acc) ->
>    list_to_tuple(Acc);
> decode_tuple(T,I,Map,Acc) ->
>    Ei = element(I,T),
>    E = element(Ei,Map),
>    decode_tuple(T,I-1,Map,[E|Acc]).
>
> %%
> %% Do term compression/encoding.
> %% Encode bottom up while assigning pointer indices; insert/4 hash-conses
> %% each node, so equal subterms are stored (and indexed) only once.
> %%
> term_to_binary(Term) ->
>    erlang:term_to_binary(encode(Term)).
>
> encode(Term) ->
>    {E,_K,_D, S} = encode(Term, 1, dict:new(), []),
>    list_to_tuple(lists:reverse([E|S])).
>
> encode(T, K, D, S) ->
>    if is_list(T), T =/= [] ->
> 	    {He,K1,D1,S1} = encode(hd(T), K, D, S),
> 	    {Te,K2,D2,S2} = encode(tl(T), K1, D1, S1),
> 	    insert([He|Te], K2, D2, S2);
>       is_tuple(T), size(T) > 0 ->
> 	    {Te,K1,D1,S1} = encode_tuple(T,size(T),K,D,S,[]),
> 	    insert(Te, K1, D1, S1);
>       true ->
> 	    insert(T, K, D, S)
>    end.
>
> encode_tuple(_T, 0, K, D, S, Acc) ->
>    {list_to_tuple(Acc), K, D, S};
> encode_tuple(T, I, K, D, S, Acc) ->
>    {E,K1,D1,S1} = encode(element(I,T), K, D, S),
>    encode_tuple(T,I-1, K1, D1,S1,[E|Acc]).
>
>
> %% Look up an already encoded node; insert it and assign the next
> %% index if it is new.
> insert(Encoded, K, D, S) ->
>    case dict:find(Encoded, D) of
> 	{ok,Ke} -> {Ke,K,D,S};
> 	error ->
> 	    De = dict:store(Encoded, K, D),
> 	    {K,K+1,De,[Encoded|S]}
>    end.
> -------------------
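>
> For reference, the size figures above correspond to (with Y9 built as in
> Witold's mail quoted below):
>
> byte_size(erlang:term_to_binary(Y9)).             %% 2796203: sharing lost
> byte_size(erlang:term_to_binary(te:encode(Y9))).  %% 107: sharing kept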
>
>
> On 15 sep 2009, at 21.40, Witold Baryluk wrote:
>
>> On Monday 2009-09-14 at 15:36 -0500, James Hague wrote:
>>>> I am missing something here. gb_sets (and likewise sets, ordsets,
>>>> rbsets) does not make a copy of the data which is put into the set.
>>>> All that is copied is enough of the *tree* to insert the new element.
>>>> There is no need to copy the new data as it is kept within the same
>>>> process. Only ets makes a copy of the data.
>>>
>>> Let's say you've got a long list of strings, many of them duplicates.
>>> You don't just want to remove the duplicates, because that would change
>>> the length of the list. The goal is to ensure that identical strings
>>> are shared, so there's only one copy in memory. What's a practical
>>> way of doing that?
>>>
>>> This is irrelevant most of the time, but there are some situations
>>> where it's a huge win.
>>>
>>> (My solution was to build a new list by adding each element to a
>>> binary tree. If a string is already in the tree, return the version
>>> that's already there (which is not something that gb_sets does). In
>>> the resulting list, elements are shared as much as possible. I'm
>>> clearly taking advantage of how the runtime works, but it shrunk the
>>> heap size by tens of megabytes.)
>>
>> That looks to me like quite a good solution: you depend on the memory
>> saving, so you explicitly arrange for the strings to be shared.
>>
>> The only other way I can see is to have this add/retrieval from a binary
>> tree performed implicitly (via some sort of hashing) on every step of the
>> computation, to ensure that identical terms exist only once in the memory
>> of one process. But I think this has really big performance problems, and
>> most of the time you don't really need perfect sharing.
>>
>> You could limit this "hashing" to terms which are complex enough, like
>> long lists or nested tuples, but it would still be very costly.
>>
>> If you want to take some term (or a list of many terms) and shrink it
>> by some magic, you could use the following pseudo code:
>>
>> maximize_sharing(Term) ->
>>     Tree = add_every_subterm_to_tree(Term, tree:new()),
>>     find_every_subterm_in_tree(Term, Tree).
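>>
>> For the string-list case James describes, a minimal runnable sketch of
>> this idea using gb_trees (the module and function names here are made
>> up) could look like:
>>
>> -module(share).
>> -export([dedup/1]).
>>
>> %% Rebuild List so that equal elements become the *same* heap term.
>> dedup(List) ->
>>     {Shared, _Tree} =
>>         lists:mapfoldl(fun share1/2, gb_trees:empty(), List),
>>     Shared.
>>
>> %% Return the first copy seen of E, remembering it in the tree.
>> share1(E, Tree) ->
>>     case gb_trees:lookup(E, Tree) of
>>         {value, E0} -> {E0, Tree};
>>         none        -> {E, gb_trees:insert(E, E, Tree)}
>>     end.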
>>
>>
>>
>> Lack of sharing is also problematic when sending term_to_binary data,
>> because term_to_binary essentially flattens the term by repeatedly
>> copying data. It would be nice to have a term_to_binary which doesn't
>> copy data more than once when the data already appears in the same term.
>>
>> For example:
>>> erlang:process_info(self(), heap_size).
>> {heap_size,1597}
>>
>>> X = {2,2,2,2},
>> Y1 = {X,X,X,X},
>> Y2 = {Y1,Y1,Y1,Y1},
>> Y3 = {Y2,Y2,Y2,Y2},
>> Y4 = {Y3,Y3,Y3,Y3},
>> Y5 = {Y4,Y4,Y4,Y4},
>> Y6 = {Y5,Y5,Y5,Y5},
>> Y7 = {Y6,Y6,Y6,Y6},
>> Y8 = {Y7,Y7,Y7,Y7},
>> Y9 = {Y8,Y8,Y8,Y8},
>> ok.
>>
>>> erlang:process_info(self(), heap_size).
>> {heap_size,4181}
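>>
>> A direct way to see this sharing, if your emulator ships the undocumented
>> erts_debug module, is to compare:
>>
>>> erts_debug:size(Y9).       % heap words, shared subterms counted once
>>> erts_debug:flat_size(Y9).  % heap words if the term were fully expanded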
>>
>>>
>>
>> In memory Y9 is pretty small, and the whole process memory consumption
>> is small, but term_to_binary(Y9) (a binary with 2MB of data, not counted
>> in the heap_size), or just displaying it using io_lib:format("~w", [Y9])
>> (2MB of data after flattening the resulting list), or sending it to a
>> different node will be a disaster.
>>
>>> erlang:process_info(self(), heap_size). % after io_lib:format(), ok.
>> {heap_size,9707190}
>>
>>
>> I don't know how it affects sending Y9 to a process on the same node.
>> As we know, Y9 needs to be copied to the heap of the destination process
>> (because sharing heaps between processes would destroy the soft
>> real-time semantics of garbage collection, but would improve other
>> performance metrics for big messages). But is it copied smartly enough?
>>
>>
>>> P = spawn(fun() ->
>>>         P = receive P2 -> P2 end,
>>>         io:format("~p ~n ~p ~n",
>>>                   [erlang:process_info(self(), heap_size), size(P)]),
>>>         ok end),
>>> P ! Y9, ok.
>>
>> {heap_size,2103540}
>>>
>>
>> Unfortunately no.
>>
>> A good first step would be a version of term_to_binary which preserves
>> the sharing of (already shared) subterms (term_to_binary already takes a
>> second argument with options, like minor_version or gzip compression),
>> and a binary_to_term which understands how to unpack it while preserving
>> the sharing (also across multiple processes on the same node, or for
>> compact storage in a file/db).
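>>
>> Until then one can approximate this in user code, e.g. with a hypothetical
>> wrapper on top of a sharing-preserving encoder like the te module above:
>>
>> t2b_shared(Term) -> erlang:term_to_binary(te:encode(Term), [compressed]).
>> b2t_shared(Bin)  -> te:decode(erlang:binary_to_term(Bin)).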
>>
>>
>> -- 
>> Witold Baryluk
>


