[erlang-questions] pg2...a warning

Evans, Matthew mevans@REDACTED
Thu Apr 29 22:59:59 CEST 2010


I agree, and was going to reply with almost what you wrote.

It seems that this occurs when we reboot a card that is part of an Erlang cluster.

There is also the following code:

handle_info({nodeup, Node}, S) ->
    gen_server:cast({?MODULE, Node}, {exchange, node(), all_members()}),
    {noreply, S};


Correct me if I'm wrong, but in a large cluster, won't 'Node' get this message once from every other node, i.e. (number of nodes in the cluster - 1) times?
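
If that reading is right, the cost at the rejoining node can be estimated with a little arithmetic. The sketch below is only an illustration: the module name, the function names and the assumption that every node holds the same membership view are hypothetical and are not part of pg2.

```erlang
-module(exchange_cost_sketch).
-export([exchange_msgs/1, pids_received/2]).

%% Number of {exchange, ...} casts a node receives when it comes up,
%% assuming every other node sees the nodeup and sends exactly one cast.
exchange_msgs(ClusterSize) when is_integer(ClusterSize), ClusterSize >= 1 ->
    ClusterSize - 1.

%% Total number of pid entries received across all of those casts.
%% JoinCounts is a list of per-pid join counts (an assumed input),
%% and each sender is assumed to ship the full expanded member list.
pids_received(ClusterSize, JoinCounts) ->
    exchange_msgs(ClusterSize) * lists:sum(JoinCounts).
```

For example, exchange_cost_sketch:pids_received(10, [180000, 180000]) evaluates to 3240000 under these assumptions.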

Basically, I'm seeing very strange behavior here. I'll dig a little deeper.

I too am tempted to modify pg2 to only allow a process to join once.

Matt

-----Original Message-----
From: Geoff Cant [mailto:nem@REDACTED] 
Sent: Thursday, April 29, 2010 4:11 PM
To: Evans, Matthew
Cc: Ulf Wiger; erlang-questions@REDACTED
Subject: Re: [erlang-questions] pg2...a warning

I've also had some trouble with pg2 recently (the R13B04 version). It seems that some of our code joins groups repeatedly (or some other network condition, perhaps nodes parting from and rejoining the cluster, is at work), and this causes pg2 to think that pids have joined groups lots of times (for 180K values of lots). The result is spectacularly pathological behaviour in a pg2 cluster: you can get to the point where new nodes are no longer able to start pg2, because they run out of RAM and abort when exchanging group definitions with other nodes.

(pg2 does something like:

all_members() ->
    [[G, group_members(G)] || G <- all_groups()].
group_members(Name) ->
    [P || 
        [P, N] <- ets:match(pg2_table, {{member, Name, '$1'},'$2'}),
        _ <- lists:seq(1, N)].
all_groups() ->
    [N || [N] <- ets:match(pg2_table, {{group,'$1'}})].

when exchanging group membership information. In the pathological case I've run into, it means sending [group_name, [ 180000 x Pid1, 180000 x Pid2, ... ] ] at startup. Clearly our code has some bugs, as we're joining too often, but this behaviour is just nuts: it'd be cheaper to send our entire ets pg2_table over the wire.)
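
To make the blow-up concrete, here is a minimal sketch of the expansion and of a more compact alternative. It is not pg2 code: the module name, the [{Pid, JoinCount}] input shape and the compact/1 helper are assumptions made for illustration, and atoms stand in for pids in the demo.

```erlang
-module(pg2_dup_sketch).
-export([expand/1, compact/1, demo/0]).

%% Mirrors the group_members/1 comprehension quoted above:
%% each pid is repeated once per recorded join.
%% Counts is a hypothetical stand-in for the member rows: [{Pid, JoinCount}].
expand(Counts) ->
    [P || {P, N} <- Counts, _ <- lists:seq(1, N)].

%% Goes the other way: one {Pid, JoinCount} pair per distinct pid,
%% which is what a more compact exchange format could carry.
compact(Members) ->
    lists:foldl(fun(P, Acc) -> orddict:update_counter(P, 1, Acc) end,
                orddict:new(), Members).

%% Small demo with atoms in place of pids.
demo() ->
    Expanded = expand([{a, 3}, {b, 1}]),
    {length(Expanded), compact(Expanded)}.
```

Calling pg2_dup_sketch:demo() returns {4, [{a,3},{b,1}]}: four list elements on the wire in the expanded form versus two pairs in the compact one.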

I'm pretty sure in our use of pg2 we want the 2nd and subsequent joins to be nops, and it's almost tempting to write 'pg3' just for that.
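
A rough sketch of what "second and subsequent joins are nops" could look like, using a plain local ets table. Everything here is hypothetical (module name, table name, function names), and it ignores the distribution and process monitoring that a real pg2 replacement would need.

```erlang
-module(join_once_sketch).
-export([new/0, join/3, members/2]).

%% Creates a local, unnamed set table. Real pg2 state lives in pg2_table
%% and is replicated between nodes, which this sketch does not attempt.
new() ->
    ets:new(join_once_members, [set, public]).

%% Idempotent join: ets:insert_new/2 inserts only if the key is absent,
%% so a repeated join for the same {Group, Pid} leaves the table unchanged.
join(Tab, Group, Pid) ->
    _ = ets:insert_new(Tab, {{member, Group, Pid}}),
    ok.

%% Each member appears exactly once, however many times it joined.
members(Tab, Group) ->
    lists:sort([P || [P] <- ets:match(Tab, {{member, Group, '$1'}})]).
```

With this shape, calling join/3 twice for the same pid and group yields a one-element list from members/2, so there is no per-pid join count to expand when exchanging membership.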

Cheers,
-Geoff

On 2010-04-29, at 12:23 , Evans, Matthew wrote:

> Thanks Ulf,
> 
> Steve told me about your gproc work. It looks interesting.
> 
> I actually ran into another pg2 strangeness today on another application.
> 
> This process does a pg2:join in the init function. This is the ONLY place where this occurs.
> 
> I would therefore like to know why pg2:get_members/1 reports two entries for that process?
> 
> Very strange.
> 
> -----Original Message-----
> From: Ulf Wiger [mailto:ulf.wiger@REDACTED] 
> Sent: Thursday, April 29, 2010 3:29 AM
> To: Evans, Matthew
> Cc: erlang-questions@REDACTED
> Subject: Re: [erlang-questions] pg2...a warning
> 
> Evans, Matthew wrote:
>> 
>> The problem is that an asynchronous operation, beyond our control,
>> can cause pg2:join/2 to be called many times for the same process.
>> The result of which is that pg2:get_closest_pid/1 will not be random
>> (e.g. process on node 1 gets 5 "joins", and node 2 gets 3 "joins").
>> Or rather it will not be random in how we consider it to be (i.e. we
>> only want a process to join a group a single time).
> 
> This made me curious.
> 
> I will admit to not having used pg2, but the other day I was inspired
> to explore how to emulate pg2's behaviour using gproc [1].
> 
> I noted the part in the documentation stating that you can join
> several times, but didn't catch the fact that get_members/1 would
> include each pid once for each time it has joined. This seems to imply
> that joining several times serves an entirely different purpose than
> relieving the programmer of the trouble of keeping track of whether
> or not it has joined the group before.
> 
> OTOH, the man page doesn't mention this at all, which makes me believe
> that it's a bug rather than a feature. It talks about how you should
> use pg2:get_members/1 when you want to send a message to all members
> of a group. This would be a good place to highlight the fact that
> you need to remove duplicates from the list if you want the message
> to be sent only once to each member.
> 
> BR,
> Ulf W
> 
> [1] Making a pg2-like module on top of gproc is actually quite easy,
> but requires the distributed gproc to work, which has not been the
> case until now. I am in the process of verifying it, and will
> hopefully be able to push a new version very soon.

A solid, documented version of gproc would be a most welcome addition to OTP in my opinion.

