[erlang-questions] "actor database" - architectural strategy question
Miles Fidelman
mfidelman@REDACTED
Mon Feb 17 15:20:52 CET 2014
[Enough with the threads on Erlang angst for a while - time for some
real questions :-) ]
BACKGROUND:
A lot of what I do is systems engineering, and a lot of that ends up in
the realm of technology assessment - picking the right platform and
tools for a particular system. My dabblings in Erlang are largely in
that category - I keep seeing it as potentially useful for a class of
systems, keep experimenting with it, and have done a couple of
proof-of-concept efforts, but haven't built an operational system at scale with it
(yet). The focus, so far, has been in modeling and simulation (I first
discovered Erlang when chasing R&D contracts for a firm that built
simulation engines for military trainers. I was flabbergasted to
discover that everything was written in C++, every simulated entity was
an object, with 4 main loops threading through every object, 20 times a
second. Talk about spaghetti code. Coming from a data comm.
protocol/network background - where we'd spawn a process for everything
- I asked the obvious question, and was told that context switches would
bring a 10,000 entity simulation to its knees. My instinctual response
was "bullshit" - and I went digging into the technology for massive
concurrency, and discovered Erlang.)
Anyway.... For years, I've been finding myself in situations, and on
projects, that have a common characteristic of linked documents that
change a lot - in the general arena of planning and workflow. Lots of
people, each editing different parts of different documents - with
changes rippling through the collection. Think linked spreadsheets,
tiered project plans, multi-level engineering documents with lots of
inter-dependencies. To be more concrete: systems engineering documents,
large proposals, business planning systems, command and control systems.
Add in requirements for disconnected operation that lead to
distribution/replication requirements rather than keeping single,
central copies of things (as the librarians like to say, "Lots of Copies
Keeps Stuff Safe").
So far we've always taken conventional approaches - ranging from manual
paper shuffling and xeroxing, to file servers with manual organization,
to some of MS Office's document linking capabilities, to document
databases and SharePoint. We've also played with some XML database technologies.
But.... I keep thinking that there is a set of underlying functions
that beg for better tools - something like a distributed CVS that's
optimized for planning documents rather than software (or perhaps
something like a modernized Lotus Notes).
And I keep thinking that the obvious architectural model is to treat
each document (maybe each page) as an actor ("smart documents" if you
will), with communication through publish-subscribe mechanisms. Interact
with (a copy of) a document, and changes get pushed to groups of documents
via a pub-sub mechanism. (Not unlike actor-based simulation approaches.)
And, of course, when I think actors, I think Erlang. The obvious
conceptualization is "every document is an actor."
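A minimal sketch of that conceptualization, assuming one plain Erlang process per document and a hand-rolled subscriber list standing in for the pub-sub mechanism. The module name doc_actor, its functions, and the message shapes are all hypothetical, chosen only for illustration; nothing here handles persistence, distribution, or cycles between linked documents.

```erlang
-module(doc_actor).
-export([start/1, subscribe/2, edit/2, content/1]).

%% Hypothetical API: every name here is illustrative, not an existing library.

%% Spawn a document process holding Content and no subscribers.
start(Content) ->
    spawn(fun() -> loop(Content, []) end).

%% Ask Doc to notify Subscriber (another document's pid) on every change.
subscribe(Doc, Subscriber) ->
    Doc ! {subscribe, Subscriber},
    ok.

%% Replace the document's content; subscribers are notified asynchronously.
edit(Doc, NewContent) ->
    Doc ! {edit, NewContent},
    ok.

%% Synchronously read the current content (gives up after 5 seconds).
content(Doc) ->
    Ref = make_ref(),
    Doc ! {get, self(), Ref},
    receive
        {Ref, Content} -> Content
    after 5000 ->
        timeout
    end.

loop(Content, Subs) ->
    receive
        {subscribe, Pid} ->
            loop(Content, [Pid | Subs]);
        {edit, NewContent} ->
            publish(NewContent, Subs),
            loop(NewContent, Subs);
        {changed, From, Upstream} ->
            %% A linked document changed: record a derived value and
            %% ripple the change onward. A cycle of links would loop forever.
            Derived = {derived_from, From, Upstream},
            publish(Derived, Subs),
            loop(Derived, Subs);
        {get, From, Ref} ->
            From ! {Ref, Content},
            loop(Content, Subs)
    end.

%% Send a change notification to every subscriber.
publish(Value, Subs) ->
    lists:foreach(fun(Pid) -> Pid ! {changed, self(), Value} end, Subs).
```

Usage would look something like: start two documents with doc_actor:start/1, link them with doc_actor:subscribe(Plan, Summary), then call doc_actor:edit(Plan, NewText); the summary process receives a {changed, ...} message and updates itself. Notification is asynchronous, so a reader of the downstream document may briefly see the old content.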
At which point an obvious question comes up: how to handle long-term
persistence for large numbers of inactive entities?
But... when I go looking for examples of systems that might be built
this way, I keep finding that, even in Erlang-based systems, persistence
is handled in fairly conventional ways:
- One might think that CouchDB treats every document as an actor, but
think again
- Paolo Negri has given some great presentations on how Wooga implements
large-scale social gaming - and they implement an actor per session -
but when a user goes off-line they push state into a more conventional
database (then initialize a gen_server from the database, when the user
comes back online)
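The second pattern (push state out when the actor goes away, re-initialize a gen_server from it later) can be sketched roughly as follows. This is not Wooga's code: the module name session_actor, its store/fetch API, and the use of a local file with term_to_binary in place of a real database are all assumptions made to keep the example self-contained.

```erlang
-module(session_actor).
-behaviour(gen_server).

%% Hypothetical API; a file on disk stands in for the "conventional database".
-export([start_link/1, store/3, fetch/2, stop/1]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

start_link(Path) -> gen_server:start_link(?MODULE, Path, []).
store(Pid, Key, Value) -> gen_server:call(Pid, {store, Key, Value}).
fetch(Pid, Key) -> gen_server:call(Pid, {fetch, Key}).
stop(Pid) -> gen_server:stop(Pid).

%% "User comes back online": rebuild the state from storage, or start empty.
init(Path) ->
    Data = case file:read_file(Path) of
               {ok, Bin} -> binary_to_term(Bin);
               {error, enoent} -> #{}
           end,
    {ok, {Path, Data}}.

handle_call({store, Key, Value}, _From, {Path, Data}) ->
    {reply, ok, {Path, Data#{Key => Value}}};
handle_call({fetch, Key}, _From, {_Path, Data} = State) ->
    {reply, maps:find(Key, Data), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% "User goes off-line": push the state out before the process ends.
%% terminate/2 is not guaranteed to run on every kind of crash, so anything
%% changed since the last save can be lost.
terminate(_Reason, {Path, Data}) ->
    file:write_file(Path, term_to_binary(Data)),
    ok.
```

While the process is stopped, the "actor" is just bytes in storage: it has no address, cannot receive messages, and cannot wake itself up, which is exactly the gap between this pattern and an actor-oriented database.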
At which point the phrase "actor-oriented database" keeps coming back to
mind, with the obvious analogy to "object-oriented databases." I.e.,
something with the persistence and other characteristics of a database,
where the contents are actors - with all the characteristics and
functionality of those actors preserved while stored in the database.
ON TO THE QUESTIONS:
I have a pretty good understanding of how one would build things like
simulations, or protocol servers, with Erlang - not so much how one
might build something with long-term persistence - which leads to some
questions (some, probably naive):
1. So far, I haven't seen anything that actually looks like an
"actor-oriented database." Document databases implemented in Erlang,
yes (e.g., CouchDB), but every example I find ultimately pushes
persistent data into files or a more conventional database of some
sort. Can anybody point to an example of something that looks more like
"storing actors in a database?"
- It strikes me that the core issues with doing so have to do with
maintaining "aliveness" - i.e., dealing with addressability, routing
messages to a stored actor, waking up after a timeout (i.e., the
equivalent of triggers)
2. One obvious (if simplistic) thought: Does one really need to think in
terms of a "database" at all - or might this problem be approached
simply by creating each document as an Erlang process, and keeping it
around forever? Most of what I've seen built in Erlang focuses on
relatively short-lived actors - I'd be really interested in comments on:
- limitations/issues in persisting 100s of 1000s, or maybe millions of
actors, for extended periods of time (years, or decades)
- are there any tools/models for migrating (swapping?) inactive
processes dynamically to/from disk storage
3. What about backup for the state of a process? 'Let it crash' is
great for servers supporting a reliable protocol, not so great for an
actor that has internal state that has to be preserved (like a
simulated tank, or a "smart document"). Pushing into a database is
obvious, but...
- are there any good models for saving/restoring state within a tree of
supervised processes?
- what about models for synchronizing state across replicated copies of
processes running on different nodes?
- what about backup/restore of entire Erlang VMs (including anything
that might be swapped out onto disk)
4. For communications between/among actors: Erlang is obviously
excellent for writing pub-sub engines (RabbitMQ and ejabberd come to
mind), but what about pub-sub or multicast/broadcast models or messaging
between Erlang processes? Are there any good libraries for
defining/managing process groups, and doing multicast or broadcast
messaging to/among a group of processes?
Thank you very much for any pointers or thoughts.
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra