[erlang-questions] initializing process
Jayson Vantuyl
kagato@REDACTED
Sat Sep 19 00:57:01 CEST 2009
> exactly this is a application server that is getting requests from
> its clients via http and then it is doing some computation. but the
> key feature of the server is talking to other peer in local network
> and gathering information about other devices in the area, process
> that information and send it to clients.
>
> the problem is that the other computer is a third-party hardware and
> software black-box that I connect via tcp or rss and can switches
> off itself. my server must be 100% sure that the other side is alive
> before starting everything up. if my server is not responding
> clients will notify local stuff that something is going on and
> someone will go check. the server may loose connection with black-
> box and again will disconnect client. it is maybe weird but help
> troubleshooting and I know that my server is ok but it is THEIR :)
> fault.
This is dysfunctional, but helpful to know. I had assumed you were
writing both the client and the server. Since you connect to the
"black box" using TCP and RSS, I wouldn't think you could ping an
Erlang node on it (since you couldn't install one). Even if you could,
a running Erlang node wouldn't tell you whether their TCP service is
working, so I'd recommend a pure-TCP solution.
Here's what I'm thinking. At the top, put a single supervisor,
running a one_for_all strategy. It should have one child, and be
registered under a name.
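A minimal sketch of that top supervisor might look like the following.
The module names (blackbox_top_sup, blackbox_monitor), the host, the
port and the restart limits are all placeholders I made up for
illustration, not anything from your system.

```erlang
-module(blackbox_top_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

%% Registered locally under the module name so the monitor can find it.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_all: if the monitor dies, every sibling is terminated too.
    %% At most 10 restarts within 60 seconds (arbitrary numbers).
    RestartStrategy = {one_for_all, 10, 60},
    %% The only static child is the connection monitor. The address and
    %% port of the black box below are hypothetical.
    Monitor = {blackbox_monitor,
               {blackbox_monitor, start_link, ["192.0.2.10", 8080]},
               permanent,             % always restart the monitor
               5000,                  % shutdown timeout in ms
               worker,
               [blackbox_monitor]},
    {ok, {RestartStrategy, [Monitor]}}.
```

The worker supervisor is deliberately absent from the static child
list; the monitor adds it once the black box answers.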
Have the child be a gen_fsm with two states, uninitialized and
running. In the init function, start in the 'uninitialized' state and
specify a very short timeout (maybe 500 milliseconds). When you
receive a timeout in that state, try gen_tcp:connect (or an HTTP
client request to a harmless URL, if it's an HTTP device). If the
connection succeeds, tell the supervisor to add a temporary child (the
supervisor for your real workers) and move to the running state. If
it fails, set the timeout again.
In the running state, do the same check on timeout, but perhaps with
a longer interval. If the connection succeeds, set the timeout again.
If it fails, terminate the gen_fsm by returning
{stop,"Couldn't make a connection",State}. If you want to be a bit
more forgiving, keep track of successive failures to connect and only
terminate after maybe five consecutive failures (or somesuch).
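Roughly, the gen_fsm described above could be sketched like this. It
is untested against your black box and rests on assumptions: the top
supervisor is registered as blackbox_top_sup, the real worker
supervisor lives in a module called blackbox_worker_sup, and the 5000
ms running interval and 2000 ms connect timeout are arbitrary. It
uses a tuple instead of a string as the stop reason and leaves out the
consecutive-failure counter. (On later OTP releases gen_fsm is
deprecated in favour of gen_statem, hence the compile option.)

```erlang
-module(blackbox_monitor).
-behaviour(gen_fsm).
%% gen_fsm is deprecated on newer OTP releases; silence that warning.
-compile(nowarn_deprecated_function).

-export([start_link/2, ping/3]).
-export([init/1, uninitialized/2, running/2, handle_event/3,
         handle_sync_event/4, handle_info/3, terminate/3, code_change/4]).

-define(TOP_SUP, blackbox_top_sup).  % assumed name of the top supervisor
-define(CONNECT_TIMEOUT, 2000).      % ms, arbitrary

-record(state, {host, port}).

start_link(Host, Port) ->
    gen_fsm:start_link({local, ?MODULE}, ?MODULE, {Host, Port}, []).

%% The "ping": open a TCP connection and close it again.
ping(Host, Port, Timeout) ->
    case gen_tcp:connect(Host, Port, [binary, {active, false}], Timeout) of
        {ok, Socket}    -> gen_tcp:close(Socket);
        {error, Reason} -> {error, Reason}
    end.

init({Host, Port}) ->
    {ok, uninitialized, #state{host = Host, port = Port},
     interval(uninitialized)}.

uninitialized(timeout, State) ->
    case check(State) of
        ok ->
            %% blackbox_worker_sup is a placeholder for the real worker
            %% supervisor; it is added as a temporary child.
            Spec = {workers, {blackbox_worker_sup, start_link, []},
                    temporary, infinity, supervisor, [blackbox_worker_sup]},
            {ok, _Pid} = supervisor:start_child(?TOP_SUP, Spec),
            {next_state, running, State, interval(running)};
        {error, _Reason} ->
            {next_state, uninitialized, State, interval(uninitialized)}
    end.

running(timeout, State) ->
    case check(State) of
        ok ->
            {next_state, running, State, interval(running)};
        {error, Reason} ->
            {stop, {blackbox_unreachable, Reason}, State}
    end.

check(#state{host = Host, port = Port}) ->
    ping(Host, Port, ?CONNECT_TIMEOUT).

interval(uninitialized) -> 500;    % retry quickly until first success
interval(running)       -> 5000.   % arbitrary slower check once running

%% Any other message cancels the pending timeout, so re-arm it each time.
handle_event(_Event, StateName, State) ->
    {next_state, StateName, State, interval(StateName)}.

handle_sync_event(_Event, _From, StateName, State) ->
    {reply, {error, unsupported}, StateName, State, interval(StateName)}.

handle_info(_Info, StateName, State) ->
    {next_state, StateName, State, interval(StateName)}.

terminate(_Reason, _StateName, _State) ->
    ok.

code_change(_OldVsn, StateName, State, _Extra) ->
    {ok, StateName, State}.
```

Keeping ping/3 as a separate exported function means you can try it
from the shell against the real device before wiring anything into a
supervision tree.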
I think that this has the correct behavior. Assuming that your actual
supervisor has a good supervision tree, it does the following:
1) When uninitialized, tries to connect every 500 ms (or perhaps
longer if packets to the remote server go into a black hole).
2) When running, tries to connect every so often.
3) When running, has added a supervisor for the workers.
4) When running and connections fail, exits, causing the top
supervisor to kill the actual supervisor (and I believe its being
temporary causes it not to be restarted; if not, try transient).
5) Everything is managed by the supervision tree, as it should be.
6) The gen_fsm provides good logging of when and why everything died.
7) The ping function can be made arbitrarily complex but is still
isolated from the actual workers.
8) No modification of the black box is required.
9) It's probably 30-50 lines of code for the gen_fsm, plus one static
supervisor.
This is a good problem. I teach an Erlang class. If I decide to use
this as an example, I'll send you a link to the code.
> what do you mean by application level and "You can set up a
> distributed Erlang application to start in "phases"" how you imagine
> the architecture for that?
In Erlang, you are supposed to package things as an application.
An application is defined by a .app file in its ebin directory. For
example, on my system,
/opt/local/lib/erlang/lib/mnesia-4.4.9/ebin/mnesia.app contains the
application definition for Mnesia. This is used by the release system
to create a boot script that starts Mnesia, if it's requested. When
you do application:start(mnesia), this is where it finds the
information to start it up. See here: http://www.erlang.org/doc/design_principles/applications.html#7
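To give an idea of the shape, a .app file for your server might look
something like this. The application name, version, description and
module names are all invented for the example; only the keys are the
standard ones.

```erlang
%% ebin/blackbox.app (hypothetical application and module names)
{application, blackbox,
 [{description, "Gateway between HTTP clients and the black box"},
  {vsn, "0.1.0"},
  %% Every module that belongs to the application.
  {modules, [blackbox_app, blackbox_top_sup, blackbox_monitor]},
  %% Names this application registers.
  {registered, [blackbox_top_sup, blackbox_monitor]},
  %% Applications that must be started before this one.
  {applications, [kernel, stdlib]},
  %% Callback module whose start/2 starts the top supervisor.
  {mod, {blackbox_app, []}}]}.
```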
Once you have an application, look at the start_phases stuff in the
application documentation. This shows pretty well how to make an
application that has multiple phases, and synchronizes them across
nodes (including doing exciting things like failover). It's worth
understanding, although you can probably avoid it for now.
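For a rough picture of start phases, here is a sketch of an
application callback module. It assumes the .app file carries a
start_phases key as shown in the comment; the phase names (connect,
serve) and the supervisor module blackbox_top_sup are made up, and the
phase bodies only log.

```erlang
-module(blackbox_app).
-behaviour(application).

-export([start/2, stop/1, start_phase/3]).

%% Assumes the .app file contains:
%%   {start_phases, [{connect, []}, {serve, []}]}
%% After start/2 returns, start_phase/3 is called once per phase, in
%% the order listed there.

start(_Type, _Args) ->
    %% Hypothetical top supervisor of the application.
    blackbox_top_sup:start_link().

start_phase(connect, _StartType, _PhaseArgs) ->
    error_logger:info_msg("phase connect: check the black box here~n"),
    ok;
start_phase(serve, _StartType, _PhaseArgs) ->
    error_logger:info_msg("phase serve: start accepting clients here~n"),
    ok.

stop(_State) ->
    ok.
```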
In theory, you should develop your application in some sort of
OTP-like root, under lib/application-vsn. When the time comes, you
can roll a boot script that will start your system (using
systools:make_script/2), automatic upgrade instructions (using
systools:make_relup/4), and even a whole OTP install (using
systools:make_tar/2). Very few people go to the trouble, but I'm
working on stuff to make it easier (as is Ericsson, see reltool).
Note that this can be a complicated process, since making the upgrade
scripts (i.e. relup) requires having two versions of the application
in the code path, which Erlang doesn't like by default.
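As a small illustration of the first step, the sketch below writes a
.rel file using whatever erts, kernel and stdlib versions are
installed, then asks systools for a boot script. The module name
make_release and the helper functions are my own invention; a real
release would also list your own application in Apps. systools lives
in the sasl application, which a normal OTP install includes.

```erlang
-module(make_release).
-export([rel_term/2, build/2]).

%% Build the term that goes into Name.rel, using the versions of erts,
%% kernel and stdlib found on the running system.
rel_term(Name, Vsn) ->
    Apps = [kernel, stdlib],          % add your own application here
    {release, {Name, Vsn},
     {erts, erlang:system_info(version)},
     [{App, app_vsn(App)} || App <- Apps]}.

app_vsn(App) ->
    %% Loading is a no-op if the application is already loaded.
    case application:load(App) of
        ok -> ok;
        {error, {already_loaded, App}} -> ok
    end,
    {ok, Vsn} = application:get_key(App, vsn),
    Vsn.

%% Write Name.rel in the current directory, then generate Name.script
%% and Name.boot next to it.
build(Name, Vsn) ->
    RelFile = Name ++ ".rel",
    Term = rel_term(Name, Vsn),
    ok = file:write_file(RelFile, io_lib:format("~p.~n", [Term])),
    %% 'local' makes the boot script refer to the libraries where they
    %% are installed now, which is convenient for a first experiment.
    systools:make_script(Name, [local]).
```

Calling make_release:build("blackbox", "1") would leave
blackbox.rel, blackbox.script and blackbox.boot in the current
directory, assuming the directory is writable.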
If you want to see roughly what the structure should look like, I have
a git repository on GitHub: http://github.com/jvantuyl/erl-skel
It's not complete in terms of automation, but the scripts/make_release
file gives an idea as to how it's done. The directory structure is
right, and the Rakefile will handle building the code for most simple
cases (GNU make was a bit of a problem, and rake is generally
everywhere now). Actually making automated releases is still on the
TODO list, and it doesn't try to build any C extensions at all.
With proper automation, you can eventually have it build a tarfile.
With the tarfile, you can do an initial install just by uncompressing
it. When you run "erl -heart" in the uncompressed directory, it will
automagically start all of your applications with the parameters you
specified in your release (.rel) files, handle crashes of the entire
system, and easily run as a daemon. It also makes managing multiple
deployments as easy as versioning a bunch of .rel files. If you go
this route, you can even use the automatic code updating stuff (i.e.
code_change/3 in gen_*, update instructions in .appup files, etc.) to
update a running system. With proper preparation, this deploy process
can even handle downgrades or updating Erlang itself, live! Like most
of Erlang, it's powerful, but the learning curve is steep.
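To make the .appup part a little more concrete, an upgrade file for a
single changed module could look roughly like this. The application
versions and the module name are hypothetical. The update instruction
with "advanced" is what causes the process's code_change callback to
be invoked during the upgrade.

```erlang
%% ebin/blackbox.appup (hypothetical versions and module name)
{"0.2.0",
 %% How to upgrade to 0.2.0 from 0.1.0.
 [{"0.1.0", [{update, blackbox_monitor, {advanced, []}}]}],
 %% How to downgrade from 0.2.0 back to 0.1.0.
 [{"0.1.0", [{update, blackbox_monitor, {advanced, []}}]}]}.
```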
--
Jayson Vantuyl
kagato@REDACTED