[erlang-questions] Re: data sharing is outside the semantics of Erlang, but it sure is useful

Witold Baryluk baryluk@REDACTED
Tue Sep 15 21:40:51 CEST 2009


Dnia 2009-09-14, pon o godzinie 15:36 -0500, James Hague pisze:
> > I am missing something here. gb_sets (nor sets, ordsets, rbsets) does not
> > make a copy of the data which is put into the set. All that is copied is
> > enough of the *tree* to insert the new element. There is no need to copy the
> > new data as it is kept within the same process. Only ets makes a copy of the
> > data.
> 
> Let's say you've got a long list of strings.  Many of them duplicates.
> You don't just want to remove the duplicates because that will change
> the length of the list. The goals is to ensure that identical strings
> are shared, so there's only one copy in memory.  What's a practical
> way of doing that?
> 
> This is irrelevant most of the time, but there are some situations
> where it's a huge win.
> 
> (My solution was to build a new list by adding each element to a
> binary tree.  If a string is already in the tree, return the version
> that's already there (which is not something that gb_sets does).  In
> the resulting list, elements are shared as much as possible. I'm
> clearly taking advantage of how the runtime works, but it shrunk the
> heap size by tens of megabytes.)

It looks for me as quite good solution, you depend on memory saving, so
explicitly manages to have shared strings:

Any other way i can see is to have this add/retrivial from binary tree,
to be performed implicitly on every step (via some sort of hashing) of
computation to ensure that identical elements are only once in memory
(of one process). But this have really big performance problems i think,
and most of times you really don't need perfect sharing.

You can limit this "hashing" to only terms which are complex enaugh,
like long lists, or nested tuples. But still this is very costly.

If you want to have some term (or list of many terms) and shrink them,
by some magic, you can using following pseudo code:

maximize_sharing(Term) ->
  Tree = tree:new(),
  NewTree = iterate_over_every_subterm_and_add_first_to_tree(Term),
  NewTerm = iterate_over_every_subterm_and_find_in_tree(Term, NewTree).



Lack of sharing is also problematic when sending term_to_binary data,
because term_to_binary essentially flattens term be repeatedly copying
data. It would be nice to have term_to_binary which doesn't copies data
more than once, which already is in the same term.

Ie.
> erlang:process_info(self(), heap_size).
{heap_size,1597}

> X = {2,2,2,2},
Y1 = {X,X,X,X},
Y2 = {Y1,Y1,Y1,Y1},
Y3 = {Y2,Y2,Y2,Y2},
Y4 = {Y3,Y3,Y3,Y3},
Y5 = {Y4,Y4,Y4,Y4},
Y6 = {Y5,Y5,Y5,Y5},
Y7 = {Y6,Y6,Y6,Y6},
Y8 = {Y7,Y7,Y7,Y7},
Y9 = {Y8,Y8,Y8,Y8},
ok.

> erlang:process_info(self(), heap_size).
{heap_size,4181}

>

In memory Y9 will be pretty small, and whole process memory consumption
will be small, but term_to_binary(Y9) (binary with 2MB of data, but not
counting to the heap_size), or just displaying using io_lib:format("~w",
[Y9]) (2MB of data after flattening resulting list) or sending it to
different node will be disaster.

> erlang:process_info(self(), heap_size). % after io_lib:format(), ok.
{heap_size,9707190}


I don't know how it affects sending Y9 to process on the same node.
As we know Y9 need to be copied to heap of the destination process
(because sharing heaps between process destroy soft real-time semantic
of garbage collection, but improves other performance metrics for big
messages). But is it copied smart enough?


> P = spawn(fun() ->  P = receive P2 -> P2 end, io:format("~p ~n ~p ~n",
[erlang:process_info(self(), heap_size), size(P)]), ok end), P ! Y9, ok.

{heap_size,2103540} 
>

Unfortunately no.

First good step will be to have version of term_to_binary which
preserves sharing of (already shared) subterms (term_to_binary have
second argument for options, like minorversion or gzip compression), and
binary_to_term which understand how to unpack them preserving sharing
(also across
multiple process in the same node, or compact storage in file/db).


-- 
Witold Baryluk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: To jest cz??? wiadomo?ci podpisana cyfrowo
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20090915/ceec8d5f/attachment.bin>


More information about the erlang-questions mailing list