Handling huge amounts of data

Jesper Wilhelmsson jesperw@REDACTED
Tue Jun 3 18:12:31 CEST 2003


On Tue, 3 Jun 2003, Kent Boortz wrote:
> "Vlad Dumitrescu" <vlad_dumitrescu@REDACTED> writes:
> > I am writing some programs that need to shuffle a lot of data (arrays/lists
> > of some 5.000.000 items). I didn't want to use C/C++, so I settled on C#.
> > None of the above, however, is especially good for prototyping.
> >
> > I would have loved to be able to give Erlang a chance, but some simple tests
> > say most of the time goes to garbage collection (and I need some speed when
> > running too, not only when developing). Also, the memory usage seems to
> > always go up (except when a process ends and that memory is freed)[*].
> > Besides that, it occasionally causes "Abnormal termination"s.
> >
> > Is there some "magic" that will make Erlang work with this kind of problem,
> > or is it just not the right niche?
>
> If you do message passing with huge terms, you could try to compile an
> Erlang/OTP that uses a shared heap to avoid copying the data. You have
> to build Erlang/OTP from source and configure it like this:
>
>   % ./configure --enable-shared-heap
>
> I don't know if the shared heap support is stable enough or if the
> garbage collector handles the case well, but I think others on this list
> can fill you in on that,
>
> kent

Shared heap should be stable. If not, please send me the program causing
the trouble.
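
(To check which emulator you are actually running, I believe
erlang:system_info(heap_type) tells you; on a shared-heap build it
should return shared instead of the default private:

    1> erlang:system_info(heap_type).
    shared

I am quoting the flag from memory, so take it with a grain of salt.)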

There are many reasons for a system to collect garbage; some of them can
be addressed by setting the minimum heap size, some with the shared heap.

Say the size of a process's live data grows and shrinks over and over
again. When there is a lot of live data, the heap grows during garbage
collection (gc). At some later point a gc occurs when there is little live
data, causing the heap to shrink again, which leads to a new wave of gc's
when the live data accumulates the next time.
Setting the minimum heap size for this process may be a good way to solve
it, if you can live with the fact that a single process will lock a huge
amount of heap space even in the periods of low live data. I think the
shared heap is a better candidate for this kind of program. Once a shared
heap has grown, it won't shrink again, and in the periods of low live data
other processes can use the heap space. And if you know when the heap is
likely to be low on live data, that might be a good time to force a
gc..? (Not that I want to encourage forcing gc's.)
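
To make that concrete, a minimal sketch (the module name, the worker
loop, and the figure of 2000000 words are all made up for illustration;
spawn_opt/2 takes the minimum heap size in words):

    -module(bigheap).
    -export([start/0]).

    %% Spawn the worker with a large minimum heap. 2000000 words is
    %% just an example figure; tune it to the peak amount of live
    %% data you expect, so the heap never shrinks below it.
    start() ->
        spawn_opt(fun() -> loop() end, [{min_heap_size, 2000000}]).

    %% Made-up worker whose live data grows and shrinks over and
    %% over. With a big enough minimum heap, the grow-shrink-grow
    %% wave of gc's described above goes away.
    loop() ->
        Big = lists:seq(1, 100000),  % accumulate live data
        _Sum = lists:sum(Big),       % use it; after this it is dead
        erlang:garbage_collect(),    % optional: gc while little is live
        loop().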


If you have a program where several processes accumulate tons of live data
only to do some small thing and then die, setting the minimum heap size is
probably a better choice than the shared heap, since you can get rid of all
those gc's just like that.
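
For that second case, another made-up sketch (crunch/1 is hypothetical,
standing in for whatever accumulates the data, and the heap size is a
guess you would tune): give the temporary process a heap big enough that
it never collects at all.

    -module(shortlived).
    -export([run/1]).

    %% Hypothetical stand-in for the real work: accumulates a lot
    %% of live data only to produce a small result.
    crunch(N) ->
        lists:sum(lists:seq(1, N)).

    %% With a minimum heap larger than the worker will ever need
    %% (in words; 4000000 is a made-up figure), the process never
    %% triggers a single gc. It builds its data, answers, dies,
    %% and its whole heap is thrown away in one piece.
    run(N) ->
        Parent = self(),
        spawn_opt(fun() -> Parent ! {done, crunch(N)} end,
                  [{min_heap_size, 4000000}]),
        receive {done, Result} -> Result end.
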
    __
___(  |_______________________  _______________________________________________
    | | ,---. ,--.,--.,--.   ( (
    | ||  _  || o ) o ) p )   ) ) "Beware of bugs in the above code;
    | || (_) || r'| r'| -×--.( (  I have only proved it correct, not tried it."
o,--' | `---' |_| |_| `-----' ) )                               -- Donald Knuth
_`----'______________________( (_______________________________________________
Jesper Wilhelmsson, jesperw@REDACTED                         +46 (0)18 471 1046
Computing Science Department, Uppsala University, Sweden     +46 (0)733 207 207
-------------------------------------------------------------------------------



