[erlang-questions] How to distribute code using code server to slave nodes (II) (altering slave module for configurable timeout)

Angel Alvarez clist@REDACTED
Wed Jul 22 21:03:39 CEST 2009


Tested and works!!

Just in case someone finds it interesting


scenario:

Master node "One" starts slaves with slave:start() for cluster operation.

The slaves boot using the remote code server provided by the master.
On a LAN of 100 Mbit/s or less the slaves time out, which prevents boot-up (the remote node starts, but slave reports a timeout error and cancels the operation).

stdlib/slave.erl has a hardcoded timeout, which prevents proper operation because of the overhead of loading all the code remotely.

Solution (FOR TESTING)

Copy stdlib/slave.erl into a private myslave.erl, adding minimal code to pass options down to the inner receive construct, where the timeout needs
to be changed on demand.

These minimal changes from stdlib/slave to myslave allow options to be passed to start/n.
Now you can set the timeout as {timeout, Value}, or fall back to the default value of 32000 ms.


TEST:
Set the timeout to 2 minutes and wait for the slave to finish booting...

myslave:start("192.168.9.240","one", [{timeout, 120000}], lists:concat(["-host 192.168.128.192 -loader inet ","-setcookie ", erlang:get_cookie()]) ).
{error,timeout}

The network isn't fast enough!

Let's set the timeout to 3 minutes.
 
myslave:start("192.168.9.240","one", [{timeout, 180000}], lists:concat(["-hosts 192.168.128.192 -loader inet ","-setcookie ", erlang:get_cookie()]) ).
{ok,'one@REDACTED'}

Now my remote node can finish its boot sequence using my local code server, and I don't have to distribute the beam files for my application
to all the nodes of my distributed cluster... :-)
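
Because the extra arguments are assembled by hand in every call, a small helper can keep the flags consistent. This is only a sketch: the module name myslave_args and the function boot_args/1 are hypothetical, and it assumes the master's IP address is passed as a string.

```erlang
-module(myslave_args).
-export([boot_args/1]).

%% Build the extra command-line arguments for a slave that loads its
%% code from the master: -hosts names the boot server, -loader inet
%% selects network loading, and the cookie is copied from this node.
%% lists:concat/1 accepts atoms, so the cookie needs no conversion.
boot_args(MasterIP) ->
    lists:concat(["-hosts ", MasterIP,
                  " -loader inet",
                  " -setcookie ", erlang:get_cookie()]).
```

The result can then be passed as the last argument, for example myslave:start("192.168.9.240", "one", [{timeout, 180000}], myslave_args:boot_args("192.168.128.192")).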

Besides altering the several start/1..5 wrappers, the "interesting" code lies in wait_for_slave().
The code is clearly bad, merely for testing purposes...


wait_for_slave(Parent, Host, Name, Node, Options, Args, LinkTo, Prog) ->
    Waiter = register_unique_name(0),

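    %% Use the caller-supplied {timeout, Milliseconds} option if present,
    %% otherwise fall back to the stock 32000 ms. (A case clause headed by
    %% the atom true would not match an empty option list and would crash.)
    SlaveTimeOut = proplists:get_value(timeout, Options, 32000),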

    case mk_cmd(Host, Name, Args, Waiter, Prog) of
        {ok, Cmd} ->
%%          io:format("Command: ~s~n", [Cmd]),
            open_port({spawn, Cmd}, [stream]),
            receive
                {SlavePid, slave_started} ->
                    unregister(Waiter),
                    slave_started(Parent, LinkTo, SlavePid)
            after SlaveTimeOut ->
                    %% If it seems that the node was partially started,
                    %% try to kill it.
                    Node = list_to_atom(lists:concat([Name, "@", Host])),
                    case net_adm:ping(Node) of
                        pong ->
                            spawn(Node, erlang, halt, []),
                            ok;
                        _ ->
                            ok
                    end,
                    Parent ! {result, {error, timeout}}
            end;
        Other ->
            Parent ! {result, Other}
    end.
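
The timeout lookup can also be pulled out into a small helper that validates the value before it reaches the receive. A minimal sketch, assuming a hypothetical module myslave_opts; the name and the choice to report bad values as an error tuple are mine, not part of stdlib/slave.erl.

```erlang
-module(myslave_opts).
-export([timeout/1]).

%% The value hardcoded in stdlib/slave.erl.
-define(DEFAULT_TIMEOUT, 32000).

%% Return {ok, Timeout} for a usable timeout and {error, Reason} otherwise.
%% A missing timeout option yields the default. Only values that a receive
%% ... after clause accepts (a non-negative integer or infinity) pass.
timeout(Options) when is_list(Options) ->
    case proplists:get_value(timeout, Options, ?DEFAULT_TIMEOUT) of
        infinity -> {ok, infinity};
        T when is_integer(T), T >= 0 -> {ok, T};
        Bad -> {error, {bad_timeout, Bad}}
    end.
```

With this in place, wait_for_slave could reject a malformed option up front instead of failing inside the receive.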


I'm not confident enough to provide industrial-grade patches for this issue, but maybe the OTP team will provide
a proper fix in a future release if this feature is considered interesting.
Future directions include:

Get the list of modules loaded on the master and compute their total byte size.
Transfer an equal amount of data over the rsh mechanism to estimate the network bandwidth.
Set an appropriate timeout before booting the slaves...
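
The first and last of these steps could be sketched roughly as follows. This is an illustration only: the module name boot_estimate, its function names and the safety factor of 2 are assumptions of mine, and the bandwidth measurement itself (the rsh transfer) is left out, so the bandwidth has to be supplied by the caller in bytes per second.

```erlang
-module(boot_estimate).
-export([loaded_code_bytes/0, timeout_ms/2]).

%% Sum the object-code size of every module currently loaded on this node.
%% Modules whose object code cannot be located are counted as 0 bytes.
loaded_code_bytes() ->
    lists:sum([object_size(M) || {M, _Loaded} <- code:all_loaded()]).

object_size(Module) ->
    case code:get_object_code(Module) of
        {Module, Binary, _FileName} -> byte_size(Binary);
        error -> 0
    end.

%% Turn a byte count and a bandwidth estimate (bytes per second) into a
%% timeout in milliseconds. The result never drops below the stock
%% 32000 ms; the factor 2 is an arbitrary safety margin.
timeout_ms(Bytes, BytesPerSecond)
  when is_integer(Bytes), Bytes >= 0,
       is_integer(BytesPerSecond), BytesPerSecond > 0 ->
    TransferMs = (Bytes * 1000) div BytesPerSecond,
    max(32000, 2 * TransferMs).
```

The computed value could then be handed to the modified module as {timeout, boot_estimate:timeout_ms(Bytes, BytesPerSecond)}. Note that a slave normally loads only the modules it needs, so the total over all loaded modules is an upper bound rather than an exact figure.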



Regards /Angel



On Wednesday, 22 July 2009, Angel Alvarez wrote:
> Hi
> 
> stdlib/slave.erl
> 
> The function wait_for_slave has a timeout of 32000 milliseconds (32 seconds), which seems to be too short for a whole node startup sequence (on a 54 Mbit/s WLAN).
> 
> I'll try copying this module to a private path and renaming it to myslave, so that altering the after clause doesn't interfere with the standard
> module, and see what happens...
> 
> 
> wait_for_slave(Parent, Host, Name, Node, Args, LinkTo, Prog) ->
>     Waiter = register_unique_name(0),
>     case mk_cmd(Host, Name, Args, Waiter, Prog) of
>         {ok, Cmd} ->
> %%          io:format("Command: ~s~n", [Cmd]),
>             open_port({spawn, Cmd}, [stream]),
>             receive
>                 {SlavePid, slave_started} ->
>                     unregister(Waiter),
>                     slave_started(Parent, LinkTo, SlavePid)
>             after 32000 ->
>                     %% If it seems that the node was partially started,
>                     %% try to kill it.
>                     Node = list_to_atom(lists:concat([Name, "@", Host])),
>                     case net_adm:ping(Node) of
>                         pong ->
>                             spawn(Node, erlang, halt, []),
>                             ok;
>                         _ ->
>                             ok
>                     end,
>                     Parent ! {result, {error, timeout}}
>             end;
>         Other ->
>             Parent ! {result, Other}
>     end.
> 
> 
> Regards /angel
> 
> On Wednesday, 22 July 2009, Angel wrote:
> > Hi again
> > 
> > More on this issue: I've managed to start a remote node using the code server.
> > 
> > Start Erlang in distributed mode on the master node.
> > 
> > Start the code server:
> > code_server4:start(MyRemoteSlave).
> > 
> > And finally:
> > 
> > slave:start(MyRemoteSlave, Name, SomeArgs).
> > 
> > where SomeArgs contains "-setcookie mycookie -hosts <master node IP> -loader inet".
> > 
> > This results in a timeout failure when starting and setting up the remote slave.
> > 
> > But if I log on to the slave node and run...
> > 
> > sinosuke@REDACTED:~> erl -hosts <MasterIP> -loader inet -id one@<MyRemoteSlave> -name one@<MyRemoteSlave> -setcookie mycookie
> > 
> > it succeeds!
> > 
> > Erlang R13B01 (erts-5.7.2) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
> > 
> > Eshell V5.7.2  (abort with ^G)
> > (one@REDACTED)1>
> > 
> > 
> > tcpdump shows network activity while the remote code server asks the master code server for the required code
> > (adding +l to the erl command line shows the remote code loading).
> > 
> > So the approach is correct, but the remote node fails to load all the required code in time, and the slave module on the master
> > complains about a timeout.
> > 
> > Can I tweak this timeout value to let the remote node finish its code loading?
> > 
> > I want to use "-loader inet" to avoid having to deploy my application code to all the remote nodes by hand.
> > 
> > Regards Angel
> > 
> > 
> > On Wednesday, 22 July 2009 00:06:06, Angel Alvarez wrote:
> > > Hi
> > > 
> > > I'm learning Erlang, so I write toy programs just for fun.
> > > Now I'm learning to use the pool module, so I can spread workers over several computers.
> > > 
> > > The problem is that before I pspawn processes I have to distribute the code manually among all members of the pool.
> > > I write something like this (I saw it somewhere on the net...):
> > > 
> > > 	pool:start(pooltest, lists:concat(["-setcookie ", erlang:get_cookie()])),
> > > 	distribute_app([mymodule1,mymodule2, ... , mymodulen]),
> > > 
> > > 
> > > with distribute_app as:
> > > 
> > > distribute_app(Modules) ->
> > >         %% collect every node except the master
> > >         RemoteNodes = [X || X <- pool:get_nodes(), X =/= node()],
> > >         %% transfer the code
> > >         lists:foreach(fun(Node) -> transfer_code(Node, Modules) end, RemoteNodes).
> > > 
> > > transfer_code(Node, Modules) ->
> > >         [transfer_module(Node, Module) || Module <- Modules].
> > > 
> > > transfer_module(Node, Module) ->
> > >         {_Module, Binary, FileName} = code:get_object_code(Module),
> > >         rpc:call(Node, code, load_binary, [Module, FileName, Binary]).
> > > 
> > > Instead of doing this, can I instruct the remote code servers to ask the master code server for code when
> > > they need to resolve new references?
> > > 
> > > It seems to be related to starting erl with "-loader inet". Can anyone provide me with some pointers about this?
> > > 
> > > Thanks
> > > 
> > 
> > 
> > 
> 
> 
> 



-- 
Do not print this e-mail unless necessary. The environment is in our hands.
->>--------------------------------------------------

 Angel J. Alvarez Miguel, Sección de Sistemas 
 Area de Explotación, Servicios Informáticos
 
 Edificio Torre de Control, Campus Externo UAH
 Alcalá de Henares 28806, Madrid  ** ESPAÑA **
 
 RedIRIS Jabber: angel.uah.es@REDACTED
------------------------------------[www.uah.es]-<<-- 
MySQL5: OK, silently corrupting data was not a good idea after all.
