[erlang-questions] Python Generators In Erlang and constant memory usage

Fri Mar 20 02:06:35 CET 2009

First time posting here, I've been reading this mailing lists for a few 
months now, programming Erlang for a few months too.

Python and its most functionnal parts is what eventually brought me to 
Erlang. One feature that I loved in Python and instantly noticed was not 
available in Erlang was generators. Generators are basically normal 
functions, except they return an iterator, letting you fetch each result 
one by one, effectively letting you work with constant memory. An 
example of the way Python does it:

    def gen(n):
        i = 0
        while i<n:
            yield i
            i+=1
        return

    x = gen(2)
    x.next() # returns 0
    x.next() # returns 1
    x.next()  # throws an exception

In python, using yield creates a 'next' method, which is automatically 
called by most functions that can iterate over text, lists, tuples, 
dictionaries, etc.

The idea is to not load all the results from a function at once and pass 
it around, but to send each part individually.

As an example, say I have a result set consisting of 1000x5kb tuples 
(whatever they contain) which I want to map a function over. The way 
Erlang works, you'd need to have a list that is 5000kb in size in memory 
and manipulate it every single time the mapped function iterates over it.

The end goal is to attempt to bypass having to evaluate the whole result 
set at once (heavy) and instead get each element individually (lighter). 
This is especially good for smaller systems like phones, web tablets 
(Nokia) or really anything with limited amounts of RAM.

While possible to implement in Erlang, generators are quite problematic 
because every list BIF at the moment only operates on complete lists. 
This means adding generators in Erlang would require one to do something 
like duplicate the lists module.

This may not be the first time this whole idea is discussed, but fast 
googling didn't reveal much so I've let myself write a small naive and 
simple prototype of what the implementation could be.

The idea is quite simple: Have a recursive function as a different 
spawned process waiting to receive {Pid, next} before returning data. 
I've attached the file demonstrating the implementation, but in case I 
got it wrong, I also copy/pasted it there: http://pastebin.com/f31ae9731

What the implementation does:

    - define tail-recursive fibonacci functions, a normal one and a
    generator version
    - define mockups for the sum and fold (foldl) functions from the
    list module adapted to generators
    - short unscientific tests to compare performances.

What it doesn't do:

    - consider any kind of safety in case there is an error in a generator
    - contain anything past what you'd do for a proof of concept.

The way it works, a normal tail-recursive way to get all the fibonacci 
numbers until the nth one could be done the following way:

    t_fib(1) -> [1];
    t_fib(2) -> [1,1];
    t_fib(L) -> t_fib(3,L,[1,1]).

    t_fib(L,L,[N1,N2|_]=F) ->
        lists:reverse([N1+N2|F]);
    t_fib(C, L, [N1,N2|_]=F) ->
        t_fib(C+1,L,[N1+N2|F]).

That's not the prettiest function in the world, but it still does the 
job. Here's how it would be as a generator (with the macros I have 
defined in the file):

    y_fib(L) ->
        y_fib(1, L, 1, 0).
    y_fib(L, L, N1 ,N2) ->
        ?last_yield(N1+N2);
    y_fib(C, L, N1, N2) ->
        N = N1+N2,
        ?yield(N),
        y_fib(C+1, L, N2, N).

The code overhead one gets writing it that way is minimal and denies the 
need to use a list as an accumulator.

The generator sum function is written the following way:

    sum({M,Gen,Args}) ->
        Pid = spawn(M, Gen, Args),
        sum(Pid,0).
    sum(Pid,S) ->
        Pid ! {self(), next},
        receive
            {Pid, N} -> sum(Pid, S+N);
            {Pid, N, done} -> S+N
        end.

Called with the custom sum function in the generators module (This could 
be made better, I agree):

     > generators:sum({generators,y_fib,[100]}).  
    927372692193078999175

Now for performance, I've tested it  with Erlang R13A. My computer:
    AMD phenom 9950 quad-core 2.6ghz processor, 4gb RAM, Windows 7 Beta 
build 7000 (32 bits).
batch_test/3 will compare both versions with a sequence/list of 
Fibonacci numbers of length 10000 and 50000, 20 times each, and round 
the results showing how much faster the generator is (1.0 is equal).

    26> generators:batch_test(sum, 10000, 20).
    0.9438942747654998
    27> generators:batch_test(sum, 50000, 20).
    2.406425935803285
    28> generators:batch_test(fold, 10000, 20).
    0.6706634763086374
    29> generators:batch_test(fold, 50000, 20).
    1.3918125332818239

This seems to show that using a generator scales much better, while 
becoming faster as the list grows and becomes expansive to generate and 
carry around.

The real advantage, though, is the fact that using a function that won't 
accumulate lots of data (sum, folds, foreach, all, any, nth, etc.) will 
be done with constant memory. This is especially good for small systems 
or scaling as a whole.

The testing I've done is minimal, and while I don't expect the Erlang 
community to jump on the idea and adopt it, I'm one of these people 
who'd like to eventually use this in production code. What kind of stuff 
would you see missing, or likely to cause trouble in the current 
implementation?
What ways would I have to test this giving better results?

Any other comments, tips? Thanks in advance! (Have a nice day also.)
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: generators.erl
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20090319/a45c746e/attachment.ksh>