[erlang-questions] initializing process

Sat Sep 19 00:57:01 CEST 2009

> exactly, this is an application server that is getting requests from
> its clients via http and then doing some computation. but the
> key feature of the server is talking to another peer in the local
> network and gathering information about other devices in the area,
> processing that information and sending it to clients.
>
> the problem is that the other computer is a third-party hardware and
> software black-box that I connect to via tcp or rss and that can
> switch itself off. my server must be 100% sure that the other side is
> alive before starting everything up. if my server is not responding,
> clients will notify local staff that something is going on and
> someone will go check. the server may lose the connection with the
> black-box and will then disconnect the clients again. it is maybe
> weird but it helps troubleshooting, and I know that my server is ok
> and it is THEIR :) fault.
This is dysfunctional, but helpful to know.  I had assumed you were
writing both the client and the server.  Since you connect to the
"black box" using TCP and RSS, I wouldn't think you could ping an
Erlang node on it (since you couldn't install one).  Even if you
could, a running Erlang node wouldn't tell you whether their TCP
service is working, so I'd recommend a pure-TCP solution.

Here's what I'm thinking.  At the top, put a single supervisor,  
running a one_for_all strategy.  It should have one child, and be  
registered under a name.
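
As a rough sketch of that top supervisor (not tested against your
system): the module name link_sup, the child module link_monitor and
the restart limits are all made-up placeholders.

```erlang
-module(link_sup).            % hypothetical module name
-behaviour(supervisor).

-export([start_link/0, init/1]).

%% Register the supervisor locally so the monitor child can reach it
%% by name later on.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% One static child: the connection monitor (link_monitor is a
%% hypothetical module).  one_for_all means that when the monitor dies,
%% every other child is terminated as well.
init([]) ->
    Monitor = {link_monitor,
               {link_monitor, start_link, []},
               permanent, 5000, worker, [link_monitor]},
    %% At most 5 restarts in 10 seconds; both numbers are arbitrary.
    {ok, {{one_for_all, 5, 10}, [Monitor]}}.
```

The workers' supervisor is deliberately absent from init/1; it is
added at runtime once the black box answers.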

Have the child be a gen_fsm with two states, 'uninitialized' and
'running'.  In the init function, start in the 'uninitialized' state
and specify a very short timeout (maybe 500 milliseconds).  When you
receive a timeout in that state, try gen_tcp:connect (or an HTTP
client request to a harmless URL, if it's an HTTP device).  If the
connection succeeds, tell the top supervisor to add a temporary child
(the supervisor for your real workers) and move to 'running'.  If the
connection fails, set the timeout again.

In the running state, do the same check on timeout, but perhaps with a
longer interval.  If it gets a good connection, set the timeout
again.  If it gets a bad connection, terminate the gen_fsm by
returning {stop,"Couldn't make a connection",State}.  If you want to
be a bit more forgiving, keep track of successive failures to connect
and only terminate after maybe five consecutive failures (or
somesuch).
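
Here is a minimal sketch of such a gen_fsm, just to make the shape
concrete.  The host, port, intervals, the registered supervisor name
link_sup and the worker_sup module are all assumptions for
illustration; swap in whatever check suits the device.  (On recent OTP
releases gen_fsm still works but is deprecated in favour of
gen_statem.)

```erlang
-module(link_monitor).        % hypothetical module name
-behaviour(gen_fsm).

-export([start_link/0]).
-export([init/1, uninitialized/2, running/2, handle_event/3,
         handle_sync_event/4, handle_info/3, terminate/3,
         code_change/4]).

-define(TOP_SUP, link_sup).   % assumed registered name of top supervisor
-define(HOST, "192.0.2.10").  % placeholder address of the black box
-define(PORT, 8080).          % placeholder port
-define(RETRY_MS, 500).       % probe interval while uninitialized
-define(CHECK_MS, 5000).      % probe interval while running
-define(MAX_FAILURES, 5).     % consecutive failures tolerated

start_link() ->
    gen_fsm:start_link({local, ?MODULE}, ?MODULE, [], []).

%% State data is the number of consecutive failed probes.
init([]) ->
    {ok, uninitialized, 0, ?RETRY_MS}.

uninitialized(timeout, Failures) ->
    case probe() of
        ok ->
            ok = start_workers(),
            {next_state, running, 0, ?CHECK_MS};
        error ->
            {next_state, uninitialized, Failures, ?RETRY_MS}
    end.

running(timeout, Failures) ->
    case probe() of
        ok ->
            {next_state, running, 0, ?CHECK_MS};
        error when Failures + 1 >= ?MAX_FAILURES ->
            {stop, "Couldn't make a connection", Failures + 1};
        error ->
            {next_state, running, Failures + 1, ?CHECK_MS}
    end.

%% Pure-TCP liveness check: connect, then close straight away.
probe() ->
    case gen_tcp:connect(?HOST, ?PORT, [binary, {active, false}], 2000) of
        {ok, Socket} -> gen_tcp:close(Socket), ok;
        {error, _Reason} -> error
    end.

%% Ask the top supervisor to add the workers' supervisor as a
%% temporary child (worker_sup is a hypothetical module).
start_workers() ->
    Spec = {worker_sup, {worker_sup, start_link, []},
            temporary, infinity, supervisor, [worker_sup]},
    {ok, _Pid} = supervisor:start_child(?TOP_SUP, Spec),
    ok.

%% Any other message cancels the pending timeout, so re-arm it here.
handle_event(_Event, StateName, Data) ->
    {next_state, StateName, Data, ?RETRY_MS}.

handle_sync_event(_Event, _From, StateName, Data) ->
    {reply, ok, StateName, Data, ?RETRY_MS}.

handle_info(_Info, StateName, Data) ->
    {next_state, StateName, Data, ?RETRY_MS}.

terminate(_Reason, _StateName, _Data) ->
    ok.

code_change(_OldVsn, StateName, Data, _Extra) ->
    {ok, StateName, Data}.
```

Note that supervisor:start_child/2 is called from the timeout
handler, not from init/1; calling it from init/1 would deadlock,
because the supervisor is still busy starting this very child.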

I think that this has the correct behavior.  Assuming that your
workers' supervisor has a good supervision tree below it, the setup
does the following:

1) When uninitialized, tries to connect every 500 ms (or perhaps  
longer if packets to the remote server go into a black hole).
2) When running, tries to connect every so often.
3) When running, has added a supervisor for the workers.
4) When running and connections fail, exits, causing the top
supervisor to kill the workers' supervisor (a temporary child is not
restarted automatically, so it stays down until the restarted gen_fsm
reconnects and adds it again).
5) Everything is managed by the supervision tree, as it should be.
6) The gen_fsm provides good logging of when and why everything died.
7) The ping function can be made arbitrarily complex but is still  
isolated from the actual workers.
8) No modification of the black box is required.
9) It's probably 30-50 lines of code for the gen_fsm, and one, static  
supervisor.

This is a good problem.  I teach an Erlang class.  If I decide to use  
this as an example, I'll send you a link to the code.

> what do you mean by application level and "You can set up a  
> distributed Erlang application to start in "phases"" how you imagine  
> the architecture for that?

In Erlang, you are supposed to package things as an application.
Applications are defined by a .app file in the ebin directory of the
application.  For example, on my system,
/opt/local/lib/erlang/lib/mnesia-4.4.9/ebin/mnesia.app
contains the application definition for Mnesia.  This is used by the
release system to create a boot script that starts Mnesia, if it's
requested.  When you do application:start(mnesia), this is where it
finds the information to start it up.  See here:
http://www.erlang.org/doc/design_principles/applications.html#7
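
For orientation, a minimal .app file might look roughly like this.
The application name myapp, the module names and the version are
placeholders, not anything from your system.

```erlang
%% ebin/myapp.app  (hypothetical application)
{application, myapp,
 [{description, "Example application"},
  {vsn, "0.1.0"},
  %% All modules that belong to the application.
  {modules, [myapp_app, myapp_sup]},
  %% Names the application registers.
  {registered, [myapp_sup]},
  %% Applications that must be started before this one.
  {applications, [kernel, stdlib]},
  %% Callback module and the argument passed to its start/2.
  {mod, {myapp_app, []}}]}.
```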

Once you have an application, look at the start_phases stuff in the  
application documentation.  This shows pretty well how to make an  
application that has multiple phases, and synchronizes them across  
nodes (including doing exciting things like failover).  It's worth  
understanding, although you can probably avoid it for now.
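
A small sketch of what a phased start looks like from the callback
module's side.  The application name, the phase names connect and
serve, and myapp_sup are invented for the example.

```erlang
%% Assumes the .app file of the (hypothetical) application contains:
%%   {mod, {myapp_app, []}},
%%   {start_phases, [{connect, []}, {serve, []}]}
-module(myapp_app).
-behaviour(application).

-export([start/2, start_phase/3, stop/1]).

%% Called first; starts the top supervisor (myapp_sup is hypothetical).
start(_StartType, _StartArgs) ->
    myapp_sup:start_link().

%% Called once per phase, in the order listed under start_phases.
start_phase(connect, _StartType, _PhaseArgs) ->
    %% e.g. wait here until the remote side is reachable
    ok;
start_phase(serve, _StartType, _PhaseArgs) ->
    %% e.g. start accepting client requests
    ok.

stop(_State) ->
    ok.
```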

In theory, you should develop your module in some sort of OTP-like
root, under lib/module-vsn.  When the time comes, you can roll a boot
script that will start your system (using systools:make_script/2),
automatic upgrade instructions (using systools:make_relup/4) and even
a whole OTP install (using systools:make_tar/2).  Very few people go
through the trouble, but I'm working on stuff to make it easier (as is
Ericsson, see reltool).  Note that this can be a complicated process,
since making the upgrade scripts (i.e. relup) requires having two
copies of the application in the code path, which Erlang doesn't like
to do, by default.
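
Roughly, the systools calls look like this when typed into an Erlang
shell.  The release names myrel-1 and myrel-2 and the lib/*/ebin
layout are assumptions; the corresponding .rel files have to exist in
the current directory.

```erlang
%% Where to look for the applications' .app and .beam files.
Opts = [{path, ["lib/*/ebin"]}],

%% Writes myrel-2.script and myrel-2.boot from myrel-2.rel.
ok = systools:make_script("myrel-2", Opts),

%% Writes a relup file describing how to upgrade from, and downgrade
%% to, myrel-1.  Both versions of each changed application must be
%% reachable through the path above.
ok = systools:make_relup("myrel-2", ["myrel-1"], ["myrel-1"], Opts),

%% Packs the release into myrel-2.tar.gz.
ok = systools:make_tar("myrel-2", Opts).
```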

If you want to see roughly what the structure should look like, I have  
a git repository on GitHub:  http://github.com/jvantuyl/erl-skel

It's not complete in terms of automation, but the scripts/make_release  
file gives an idea as to how it's done.  The directory structure is  
right, and the Rakefile will handle building the code for most simple  
cases (GNU make was a bit of a problem, and rake is generally  
everywhere now).  Actually making automated releases is still on the  
TODO list, and it doesn't try to build any C extensions at all.

With proper automation, you can eventually have it build a tarfile.   
With the tarfile, you can do an initial install just by uncompressing  
it.  When you run "erl -heart" in the uncompressed directory, it will  
automagically start all of your applications with the parameters you  
specified in your release (.rel) files, handle crashes of the entire  
system, and easily run as a daemon.  It also makes managing multiple  
deployments as easy as versioning a bunch of .rel files.  If you go  
this route, you can even use the automatic code updating stuff (i.e.  
code_change/3 in gen_*, update instructions in .appup files, etc.) to  
update a running system.  With proper preparation, this deploy process  
can even handle downgrades or updating Erlang itself, live!  Like most  
of Erlang, it's powerful, but the learning curve is steep.
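
For reference, a release file is just one Erlang term.  In this
sketch the release name, the myapp application and every version
number are placeholders; the real versions have to match what is
installed on your system.

```erlang
%% myrel-1.rel  (hypothetical release)
{release,
 {"myrel", "1"},              % release name and version
 {erts, "5.7.2"},             % placeholder ERTS version
 [{kernel, "2.13.2"},         % placeholder application versions
  {stdlib, "1.16.2"},
  {sasl, "2.1.6"},
  {myapp, "0.1.0"}]}.         % hypothetical application
```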

-- 
Jayson Vantuyl
kagato@REDACTED