[erlang-questions] Looking for the wisdom of experience...

Darren New dnew@REDACTED
Sun Apr 13 04:29:58 CEST 2008


Hi! I'm new to Erlang, but experienced with many other languages.
I've read Joe Armstrong's book a couple times and most of the
man pages at least once, I think.  The language doesn't throw
me, but many of the libraries are unobvious to me in their use.

This is kind of long, because I'm asking a bunch of questions
about the best place to learn how to use the libraries. Basically,
a big "Which M to RTF?" here. I read most, but many are so general
that the "right" way to use them and the interrelationships between them 
aren't obvious to me.

Right now, I have a system that has a central server that communicates
(invisibly to the customers) with geographically distributed clusters
in various cities.  (www.skyclix.com, or www.poundsky.com for the
consumer site, if you care. Works best in San Diego, Los Angeles,
and San Jose right now.) Basically, you call in on your cell
phone, we process the audio, figure out what broadcast source or
prerecorded audio you might be listening to, and send you back
a URL to a web page that lets you do stuff with the audio (buy
the song, talk to the DJ, driving directions to the advertiser
you recorded, etc.)

I've implemented various functions using SQL databases and custom code
for distributed communication. (Mostly Tcl, PHP for web pages, MySql, a 
bit of C++ for the audio processing, if you care.) I'm looking for clues 
like which modules/functional blocks handle the things I already 
implemented for myself, and whether my reading of Mnesia functionality 
means I need to handle my large datasets the way I think I do.

Anyway, my problem with understanding how to use this technology
seems to be in how to apply the powerful and generic functionality
in a way that will keep working in the face of failures. I'm hoping 
those who have used Erlang to build systems already could point me in 
the right directions. I hope it's appropriate to ask this sort of thing 
here.

So, to the questions....

* * 1 * *

Right now, we have geographically-distributed clusters of servers.
For example, in each city, we have a rack of machines, each of
which is running a monitor and a program to listen to
radio channels. (The monitor monitors temp, disk status, etc,
as well as machine-specific stuff like "are we getting audio
that sounds clear?") One in each cluster is running an audio
server and a SQL server and a monitor and a "query engine",
which is what the GC contacts to see if you were listening to
any of the radio stations in that city.

I'm assuming the monitor, the query engine, the thing listening to
the radio stations, etc would likely each be one "node", rather than
putting them all into one node? So I'd be running 'qe@REDACTED' and
'qe@REDACTED' and 'radio@REDACTED' and 'radio@REDACTED', rather
than having all the processing on each machine inside a single node,
right? Is that the better way to go?

Right now, each running process logs a "heartbeat" to the central
"ground control" server, so I know if something died or I lost
ISP connectivity. I'm assuming the right way to do this in Erlang
is to watch for nodeup and nodedown messages. Yes? If this is the
case, and connectivity fails and then comes back, does the Erlang
runtime try to reestablish connectivity automatically? (Some of
our ISPs have been less than wonderful.)
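
(What I have in mind is roughly this, assuming monitor_nodes works the
way I think it does -- untested sketch, module name made up:)

   -module(node_watcher).
   -export([start/0]).

   start() ->
       spawn(fun init/0).

   init() ->
       ok = net_kernel:monitor_nodes(true),
       loop().

   loop() ->
       receive
           {nodeup, Node} ->
               io:format("~p: node ~p is back~n",
                         [erlang:localtime(), Node]),
               loop();
           {nodedown, Node} ->
               io:format("~p: node ~p went away~n",
                         [erlang:localtime(), Node]),
               loop()
       end.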

Also, I'm wondering whether it would be better to have a
specific TCP server communicating between cities (i.e.,
a custom gen_tcp server) or whether it would be sufficient
to simply use the normal inter-node communication stuff?
Or would something like Mr Armstrong's "lib_chan" be a
better start?  It looks like with even trivial abstraction,
this wouldn't really be a problem either way, considering
how easy it is to ship terms over TCP.
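
(I.e., I'm picturing nothing fancier than this untested sketch, using
{packet, 4} framing; host and port are placeholders:)

   %% sender
   send_term(Host, Port, Term) ->
       {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {packet, 4}]),
       ok = gen_tcp:send(Sock, term_to_binary(Term)),
       gen_tcp:close(Sock).

   %% receiver, after listening with [binary, {packet, 4}, {active, false}]
   %% and gen_tcp:accept/1:
   recv_term(Sock) ->
       {ok, Bin} = gen_tcp:recv(Sock, 0),
       binary_to_term(Bin).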

* * 2 * *

Right now, when something fails (anywhere), it logs a "failure"
message, which propagates to Ground Control, which sends me a page.
If the problem clears up before I get to it, it logs a "clear"
message, which GC also pages to me. (In addition, there are
warnings and notices that get summarized and emailed appropriately.)

Also, I log various statistics (like how many ads we served, to match
against the ad network's accounting), messages about intermittent programs
succeeding, progress messages for the bits that can take hours to run, etc.

I also log debug messages (like print statements) and "info" messages
(which get logged even if debugging is off, and which I can easily search
for in admin reports).

My first guess, from reading, is that the Event Tracer modules would
be the appropriate way of shipping most of this information back to the
central servers. In particular, providing fairly complex terms to
et:phone_home (snicker) would be sufficient, along with arranging
to have the event collector/selector/viewer back at ground control
talking to a web server with its results. How well does this work
with unreliable communication between nodes? It looks like it would
be OK for me to log debugging info and such, but I wouldn't want
auditing logs done this way, right? Is it efficient enough to
filter things at the local nodes? Is event tracing efficient
enough to leave on all the time? Does it do the filtering at
the local nodes, or is it going to ship back all traced
processes and then filter out everything but phone_home?
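
(To make that concrete, the call I was picturing is roughly the
following -- sketch only; the label, contents, and node name are
invented, and if I read the et docs right, phone_home is a no-op
unless a collector has actually set a global trace pattern on it:)

   et:phone_home(75, node(), 'gc@somewhere', recognition_result,
                 [{phone, "6195551234"}, {station, "KXYZ"}, {match, true}]).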

Or would the tools working with the standard error_logger module be the
best way to replicate these things? Mr Armstrong's book implies that's
the way to go, but the example given is textbook oversimplified. I'd like
to have these messages all interlaced, so I can see when (for example)
the processing at the node taking the phone calls fails because the node
listening to the radio was restarting or some such. I didn't see anything
in the error_logger documentation that implied it was easy to merge
widely distributed messages into one log; do I need to read closer? I
also didn't see any good examples of searchable custom messages, where
I could (for example) pull out all the messages caused by a particular
phone number and put them in order and see why the result wasn't what
was expected, or even to find all the messages with a particular term
or combinations of terms in their tags. The "rb" module looks like it
really only works with the "standard" error_logger terms.

Would I be better off with a custom event handler (which, like the
current one, takes the event, saves it locally, and then specifically
propagates it with appropriate timestamp to Ground Control) and only
looking at error_logger for actual crashes? Or is it pretty easy
to build a custom handler, hook it into the local error_logger, log
more complex terms than just info/warn/error, and have it send the
records to the central database when appropriate? I guess what I'm
really asking is: how do people handle this in real systems? It just
doesn't seem like one would want to be looking at each machine
individually when tracking down a fault.
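
(To make the hook-into-error_logger option concrete, the handler I'm
imagining is something like this -- untested sketch; the 'gc_event'
tag, the 'gc_logger' process, and the 'gc@somewhere' node name are all
invented for illustration:)

   -module(gc_log_h).
   -behaviour(gen_event).
   -export([init/1, handle_event/2, handle_call/2, handle_info/2,
            terminate/2, code_change/3]).

   init(_Args) -> {ok, []}.

   %% forward our own tagged reports to ground control, ignore the rest
   handle_event({info_report, _GL, {_Pid, gc_event, Report}}, State) ->
       gen_server:cast({gc_logger, 'gc@somewhere'},
                       {log, node(), erlang:localtime(), Report}),
       {ok, State};
   handle_event(_Other, State) ->
       {ok, State}.

   handle_call(_Req, State) -> {ok, ok, State}.
   handle_info(_Info, State) -> {ok, State}.
   terminate(_Reason, _State) -> ok.
   code_change(_OldVsn, State, _Extra) -> {ok, State}.

It would be installed with error_logger:add_report_handler(gc_log_h, [])
and fed with calls like error_logger:info_report(gc_event, [{phone,
"6195551234"}, {what, no_match}]). Is that the sort of thing people do?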

* * 3 * *

Cron-like functionality: How is this best handled? If I have some Erlang
task I want to do once a day, or once a week, or to retry something every
couple of hours until it succeeds then sleep for a few days... Do people
just set this up as a separate process? Or is there something like "cron"
written already inside of the Erlang libraries that I just haven't found?
Clearly it wouldn't be difficult to implement. Or do people fire
up a separate node via cron to run something like this, then let it
shut itself down when it's done?  I saw the "hibernate" BIF, which
looks like the sort of thing that would be useful for this.
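
(The hand-rolled version I'd probably write looks like this --
untested sketch; run_nightly_job/0 stands in for whatever the actual
task is:)

   -module(nightly).
   -export([start/0]).

   start() ->
       spawn(fun loop/0).

   loop() ->
       timer:sleep(ms_until(4, 0, 0)),   %% wait until 04:00 local time
       catch run_nightly_job(),          %% don't let one failure kill the loop
       loop().

   %% milliseconds from now until the next local H:M:S
   ms_until(H, M, S) ->
       Now = calendar:datetime_to_gregorian_seconds(calendar:local_time()),
       {Date, _} = calendar:local_time(),
       Today = calendar:datetime_to_gregorian_seconds({Date, {H, M, S}}),
       Target = if Today > Now -> Today; true -> Today + 86400 end,
       (Target - Now) * 1000.

   run_nightly_job() ->
       ok.   %% placeholder

(Presumably a long erlang:send_after plus the hibernate BIF could
replace the plain sleep if the idle memory footprint mattered; that's
what made hibernate catch my eye in the first place.)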

* * 4 * *

Part of the problem I foresee in using Erlang in general and Mnesia in
particular for this is the lack of a convenient ordered index for
data. Ordered tables have to fit in memory. (Seems odd to me that
nobody already needed this, but there ya go. :-)

It looks like you can tell Mnesia to make an index on an element of a
tuple, but it doesn't look like you can tell Mnesia that you want an
ordered index of an element on a table that's otherwise too big to fit
in memory? Have I missed something in TFM?
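
(For reference, the kind of table declaration I mean is roughly this --
sketch only; the record fields are invented:)

   -record(event, {id, timestamp, phone, station, details}).

   mnesia:create_table(event,
       [{disc_only_copies, [node()]},
        {attributes, record_info(fields, event)},
        {index, [timestamp]}]).

As far as I can tell, the secondary index that creates only helps with
exact matches on timestamp, not with range or "latest before" lookups.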

Basically, I have lots of places in my current code where the
SQL says
   SELECT * FROM table WHERE blah blah
   AND timestamp < '...'
   ORDER BY timestamp DESC LIMIT 1
so I'm looking for the one most recent event before
the provided time that matches some set of conditions.
(Sometimes the conditions are empty, i.e., "true".)

I guess what the rambling below distills down to is "how do you
best handle ordered columns on Mnesia tables too big to fit in RAM
conveniently?"  Obviously the table needs to be split up, but does one
do that by maintaining links and indices oneself, updating linked lists
of records as they're inserted?  Or maintaining a second table with
just the timestamps and "where" conditions?  Or keeping such in memory
when one can, and flushing out lists as separate records?  (Actually,
I guess that would be hard in Mnesia, since it wants the same format
for every record. You would have to use a separate table just for that,
wouldn't you?)
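
(The "second table with just the timestamps" idea is the one I can
sketch most easily -- untested, names invented: an ordered_set table
holding only {Timestamp, EventId} keys, which I gather has to live in
RAM or as disc_copies, since ordered_set doesn't seem to be supported
for disc_only_copies:)

   %% ts_id = {Timestamp, EventId}; assumes event ids are positive integers
   -record(event_by_time, {ts_id, event_id}).

   mnesia:create_table(event_by_time,
       [{type, ordered_set},
        {disc_copies, [node()]},
        {attributes, record_info(fields, event_by_time)}]).

   %% most recent event strictly before time T (dirty, for brevity)
   latest_before(T) ->
       case mnesia:dirty_prev(event_by_time, {T, 0}) of
           '$end_of_table' -> none;
           {_Ts, EventId}  -> mnesia:dirty_read(event, EventId)
       end.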

Is there a best way to handle this in Mnesia? Does that
mechanism work when the tables get arbitrarily large?

I'm thinking that I could build a (or several) linked list(s), and each
time I insert a record, I add the predecessor and successor record IDs
into the record. I could also keep a copy of just the timestamp/ID pairs
for the most recent records in memory as a set or a Mnesia ordered_set
table, rebuilding it from the linked lists on disk if I need to. When
I close out the table, I would write the head and tail links (and maybe
some intermediate, like "first of the day" links) into a specific record
(maybe in another table) just for that purpose.

Alternately, I could store each week/month/whatever as a separate
disc-copy ordered set, and load each separately as needed. My only
concern there is when 20 different customers all ask for information
from 20 different tables at once.

Alternately, each month I could have a separate process that runs
through the disc-copy table and builds a disc-only table with a separate
index of just the chronology as one of the records. I.e., suck the
table in, build an in-memory ordered list for each index, and write
it back out as a unique record I can load later if I need it. Or
add the links of the linked list at that time so I'm not maintaining
them on every insert.

Also, the pre-recorded audio bits are (so far) several million records
occupying dozens of gigabytes. We expect this to grow one or two more
orders of magnitude. However, since most of that is "data" (i.e.,
binaries even I don't know the internals of), and we only really
index that by a code number, I don't see that being a problem,
other than the sheer bandwidth of shipping it around. Having them
on a disk and being able to open the file remotely would suffice
for getting them into the custom audio match servers.

I'm thinking I'm likely going to have to split things up into lots of
tables.  The customer table would have the recognition event details for
all the events inside the record as a list, but I would have a process
that iterates over the customers with new events and takes any events
unlikely to be referenced soon (too old, too many, etc) and moves
them into a file keyed by recognition event number, and trims the
details out of the customer record. This keeps the rebuild time
for the customer records short, and keeps me from sucking up too
much memory when we pull in someone who has done hundreds or
thousands of events.
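
(Something like this is what I picture that migration process doing --
untested sketch; all the record and table names are invented, and I
assume an 'archived_event' table already exists:)

   -record(recog_event, {id, ts, details}).
   -record(customer, {id, info, events}).   %% events = [#recog_event{}]
   -record(archived_event, {event_id, customer_id, event}).

   archive_old_events(CustId, KeepAfter) ->
       F = fun() ->
               [C] = mnesia:read(customer, CustId, write),
               {Keep, Old} = lists:partition(
                                 fun(E) -> E#recog_event.ts >= KeepAfter end,
                                 C#customer.events),
               [mnesia:write(#archived_event{event_id    = E#recog_event.id,
                                             customer_id = CustId,
                                             event       = E}) || E <- Old],
               mnesia:write(C#customer{events = Keep})
           end,
       mnesia:transaction(F).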

Are these the sorts of things other folks do when working with databases
too large for memory (i.e., >2G tables)?  Or am I missing something,
either obvious or clever, that would make it more straightforward?

This is the most vexing bit, because it seems like the hardest part to
roll my own solution for. The tracing and logging and such I could easily do
custom if I didn't want to try to do it the "usual" way, but this
seems like it *should* have a known solution that I'm not finding.

Oh, and yes, I saw the ODBC driver. Right now, stuff isn't too fast,
isn't easy to split across servers, and needs downtime for things
like changing the schema. That's what I'm trying to avoid, or I'd
just keep the same database we have. :-)

* * * *

Thanks in advance for any words of wisdom you might feel like
providing! And thanks for reading this far. ;-)


-- 
   Darren New / San Diego, CA, USA (PST)
     "That's pretty. Where's that?"
          "It's the Age of Channelwood."
     "We should go there on vacation some time."


