[erlang-questions] "actor database" - architectural strategy question

Mon Feb 17 19:04:12 CET 2014

Guys, please - I'm not looking at modeling "a document" - I'm looking at 
modeling 100s of copies of documents, each with local changes, that are 
currently being updated (say, my copy of a project plan); as well as 
families of related documents (say a schedule, a budget, progress 
reports). At any given time, I may make a marginal note, or an edit - 
and I want that to propagate according to local rules.

The best conceptual model I've come up with is: each document is an actor, 
with a mailbox, and a multi-cast pub-sub mechanism for sending changes 
to a group of people who have copies of the document.  (Think of a 
loose-leaf binder, and change pages coming through the mail).  For that 
matter, think USENET news and NNTP - except with each message being 
addressable - instead of a thread as 100 messages, it becomes one 
message, plus 99 updates to that message, each processed by code within 
the 1st message.
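
A minimal sketch of that model in plain Erlang, assuming a change is just a tuple such as {append, Text}. The module name doc_actor, its functions, and the change format are all hypothetical, chosen only for illustration: each document is a process that applies changes arriving in its mailbox and forwards them to the processes holding copies.

```erlang
-module(doc_actor).
-export([start/1, subscribe/2, update/2, content/1]).

%% Start a document process holding some initial content.
start(Content) ->
    spawn(fun() -> loop(Content, []) end).

%% Ask Doc to forward every future change to Subscriber.
subscribe(Doc, Subscriber) ->
    Doc ! {subscribe, Subscriber},
    ok.

%% Send a change to the document's mailbox.
update(Doc, Change) ->
    Doc ! {update, Change},
    ok.

%% Synchronously read the current content.
content(Doc) ->
    Ref = make_ref(),
    Doc ! {get, self(), Ref},
    receive
        {Ref, Content} -> Content
    after 5000 ->
        timeout
    end.

loop(Content, Subs) ->
    receive
        {subscribe, Pid} ->
            loop(Content, [Pid | Subs]);
        {update, Change} ->
            New = apply_change(Change, Content),
            %% Multicast the change to everyone holding a copy.
            [Pid ! {update, Change} || Pid <- Subs],
            loop(New, Subs);
        {get, From, Ref} ->
            From ! {Ref, Content},
            loop(Content, Subs)
    end.

%% The "local rules" live here; this sketch only knows how to append.
apply_change({append, Text}, Content) ->
    Content ++ Text.
```

Subscriptions here are one-directional. If two copies subscribed to each other, a change would bounce between them forever, so a real version would need change identifiers or some other way to drop updates it has already seen.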

If I wanted to model this as a standard database, or serializing state 
into a traditional database, I wouldn't be asking the questions I 
asked.  Can anybody talk to the questions I actually asked, about:
- handling large numbers of actors that might persist for years, or 
decades (where actor = Erlang-style process)
- backing up/restoring state of long-running actors that might crash
- multi-cast messaging among actors

Erlang is the closest I've found to an environment where this might be 
practical - but that remains an open question.  So please - while 
appreciated, I'm really not looking for information about alternative 
conceptual models; I'm looking for hard information about how this 
conceptual model might be implemented in Erlang.

Miles

Michael Radford wrote:
> I'd suggest taking a look at Riak, and also Basho's library riak-core.
> With Riak and a bit of Erlang, you can easily model a document as a
> sequence of change operations which are composed on-demand to present
> the latest version. On top of that, you get mechanisms for maintaining
> the database without any single point of failure, and for dealing with
> simultaneous/competing changes from multiple users.
>
> For persisting actors, the nice thing about Erlang is that pretty much
> whatever your actor's state is [*], you can store its term_to_binary
> representation in whatever database you choose. [* except for
> anonymous functions, which you can always turn into atoms as long as
> they're not completely arbitrary.]
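
As a rough illustration of the term_to_binary suggestion, here is a sketch that writes an actor's state to a file and reads it back. The module name actor_snapshot and the use of a plain file in place of a database are assumptions made for the example.

```erlang
-module(actor_snapshot).
-export([save/2, load/1]).

%% Serialize any Erlang term (the actor's state) and write it to Path.
save(Path, State) ->
    file:write_file(Path, term_to_binary(State)).

%% Read a snapshot back; returns {ok, State} or {error, Reason}.
load(Path) ->
    case file:read_file(Path) of
        {ok, Bin} -> {ok, binary_to_term(Bin)};
        {error, Reason} -> {error, Reason}
    end.
```

Pids, references, and ports inside the state also serialize, but they are unlikely to mean anything after the VM restarts, so it is safer to store names rather than pids. If the binary could come from an untrusted source, binary_to_term(Bin, [safe]) is the more cautious call.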
>
> On Mon, Feb 17, 2014 at 6:50 AM, Jesper Louis Andersen
> <jesper.louis.andersen@REDACTED> wrote:
>> A document is a trace of events. These events record edits to the document
>> and when we play all of the events, we obtain the final document state.
>> Infinite undo is possible by looking back and replaying with a point-in-time
>> recovery option. An actor is a handler that can apply events to a state in
>> order to obtain a new state.
>>
>> Events are persisted in an event log and WAL fashion. So even if the system
>> dies, we can replay its state safely. Once in a while, living processes
>> checkpoint their state to disk so they can boot up faster than having to
>> replay from day 0.
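
The replay-plus-checkpoint idea can be sketched as a fold over the event log. Everything named here is an assumption for illustration: the module doc_log, events of the form {Seq, Edit}, a state of the form {LastSeq, Text}, and the two edit types.

```erlang
-module(doc_log).
-export([apply_event/2, replay/2, state_at/2, recover/2]).

%% An event is {Seq, Edit}; the state is {LastSeq, Text}.
apply_event({Seq, {append, Str}}, {_Last, Text}) ->
    {Seq, Text ++ Str};
apply_event({Seq, {truncate, N}}, {_Last, Text}) ->
    {Seq, lists:sublist(Text, N)}.

%% Replay a list of events (oldest first) on top of a state.
replay(Events, State) ->
    lists:foldl(fun apply_event/2, State, Events).

%% Point-in-time view: replay from the empty document up to sequence T.
state_at(T, Log) ->
    replay([E || {Seq, _} = E <- Log, Seq =< T], {0, ""}).

%% Recover from a checkpoint by replaying only the events newer than it.
recover({CheckSeq, _Text} = Checkpoint, Log) ->
    Newer = [E || {Seq, _} = E <- Log, Seq > CheckSeq],
    replay(Newer, Checkpoint).
```

Because a checkpoint carries the sequence number of the last event it includes, recovery only has to read the tail of the log, and undo amounts to calling state_at/2 with an earlier sequence number.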
>>
>> Multiple edits to the same document can be handled by operational transforms
>>
>> http://en.wikipedia.org/wiki/Operational_transformation
>>
>> Idle documents terminate themselves after a while by checkpointing
>> themselves to disk. Documents register themselves into gproc and if there is
>> no document present in gproc, you go to a manager and get it set up either
>> from disk or by forming a new document.
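
One way to get the "idle documents terminate themselves" behaviour with plain OTP is the gen_server timeout value. The sketch below assumes a hypothetical module idle_doc, a file as the checkpoint store, and a very short idle period so that it is easy to try out. Registration in gproc and the manager that restarts documents on demand are left out.

```erlang
-module(idle_doc).
-behaviour(gen_server).
-export([start/2, append/2, read/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Illustrative value; a real system would more likely use minutes.
-define(IDLE_MS, 200).

start(Path, Initial) ->
    gen_server:start(?MODULE, {Path, Initial}, []).

append(Pid, Text) -> gen_server:call(Pid, {append, Text}).
read(Pid) -> gen_server:call(Pid, read).

init({Path, Initial}) ->
    %% Resume from the checkpoint if one exists, else start from Initial.
    Text = case file:read_file(Path) of
               {ok, Bin} -> binary_to_term(Bin);
               {error, _} -> Initial
           end,
    {ok, {Path, Text}, ?IDLE_MS}.

%% Every return value re-arms the idle timer.
handle_call({append, More}, _From, {Path, Text}) ->
    {reply, ok, {Path, Text ++ More}, ?IDLE_MS};
handle_call(read, _From, {_Path, Text} = State) ->
    {reply, Text, State, ?IDLE_MS}.

handle_cast(_Msg, State) ->
    {noreply, State, ?IDLE_MS}.

%% No message arrived within ?IDLE_MS: checkpoint to disk and stop.
handle_info(timeout, {Path, Text} = State) ->
    ok = file:write_file(Path, term_to_binary(Text)),
    {stop, normal, State};
handle_info(_Other, State) ->
    {noreply, State, ?IDLE_MS}.
```

Starting the same path again later picks up the checkpointed text, which is the "set it up from disk" half of the manager's job. A caller that races with the shutdown gets an exit from gen_server:call and has to go back through the manager.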
>>
>> For easy storage, you can use a single table in a database for the log.
>>
>>
>> On Mon, Feb 17, 2014 at 3:20 PM, Miles Fidelman <mfidelman@REDACTED>
>> wrote:
>>> [Enough with the threads on Erlang angst for a while - time for some real
>>> questions :-) ]
>>>
>>> BACKGROUND:
>>> A lot of what I do is systems engineering, and a lot of that ends up in
>>> the realm of technology assessment - picking the right platform and tools
>>> for a particular system.  My dabblings in Erlang are largely in that category
>>> - I keep seeing it as potentially useful for a class of systems, keep
>>> experimenting with it, done a couple proof-of-concept efforts, but haven't
>>> built an operational system at scale with it (yet).  The focus, so far, has
>>> been in modeling and simulation (I first discovered Erlang when chasing R&D
>>> contracts for a firm that built simulation engines for military trainers.  I
>>> was flabbergasted to discover that everything was written in C++, every
>>> simulated entity was an object, with 4 main loops threading through every
>>> object, 20 times a second.  Talk about spaghetti code.  Coming from a data
>>> comm. protocol/network background - where we'd spawn a process for
>>> everything - I asked the obvious question, and was told that context
>>> switches would bring a 10,000 entity simulation to its knees.  My
>>> instinctual response was "bullshit" - and went digging into the technology
>>> for massive concurrency, and discovered Erlang.)
>>>
>>> Anyway....  For years, I've been finding myself in situations, and on
>>> projects, that have a common characteristic of linked documents that change
>>> a lot - in the general arena of planning and workflow. Lots of people, each
>>> editing different parts of different documents - with changes rippling
>>> through the collection.  Think linked spreadsheets, tiered project plans,
>>> multi-level engineering documents with lots of inter-dependencies.  To be
>>> more concrete: systems engineering documents, large proposals, business
>>> planning systems, command and control systems.
>>>
>>> Add in requirements for disconnected operation that lead to
>>> distribution/replication requirements rather than keeping single, central
>>> copies of things (as the librarians like to say, "Lots of Copies Keeps Stuff
>>> Safe").
>>>
>>> So far we've always taken conventional approaches - ranging from manual
>>> paper shuffling and xeroxing, to file servers with manual organization, to
>>> some of MS Office's document linking capabilities, to document databases and
>>> SharePoint.  And played with some XML database technologies.
>>>
>>> But.... I keep thinking that there are a set of underlying functions that
>>> beg for better tools - something like a distributed CVS that's optimized for
>>> planning documents rather than software (or perhaps something like a
>>> modernized Lotus Notes).
>>>
>>> And I keep thinking that the obvious architectural model is to treat each
>>> document (maybe each page) as an actor ("smart documents" if you will), with
>>> communication through publish-subscribe mechanisms. Interact with (a copy
>>> of) a document, changes get pushed to groups of documents via a pub-sub
>>> mechanism.  (Not unlike actor based simulation approaches.)
>>>
>>> And, of course, when I think actors, I think Erlang.  The obvious
>>> conceptualization is "every document is an actor."
>>>
>>> At which point an obvious question comes up:  How to handle long-term
>>> persistence, for large numbers of inactive entities.
>>>
>>> But... when I go looking for examples of systems that might be built this
>>> way, I keep finding that, even in Erlang-based systems, persistence is
>>> handled in fairly conventional ways:
>>> - One might think that CouchDB treats every document as an actor, but
>>> think again
>>> - Paolo Negri has given some great presentations on how Wooga implements
>>> large-scale social gaming - and they implement an actor per session - but
>>> when a user goes off-line they push state into a more conventional database
>>> (then initialize a gen_server from the database, when the user comes back
>>> online)
>>>
>>> At which point the phrase "actor-oriented database" keeps coming back to
>>> mind, with the obvious analogy to "object-oriented databases."  I.e.,
>>> something with the persistence and other characteristics of a database,
>>> where the contents are actors - with all the characteristics and
>>> functionality of those actors preserved while stored in the database.
>>>
>>> ON TO THE QUESTIONS:
>>> I have a pretty good understanding of how one would build things like
>>> simulations, or protocol servers, with Erlang - not so much how one might
>>> build something with long-term persistence - which leads to some questions
>>> (some, probably naive):
>>>
>>> 1. So far, I haven't seen anything that actually looks like an
>>> "actor-oriented database."  Document databases implemented in Erlang, yes
>>> (e.g., CouchDB), but every example I find ultimately pushes persistent data
>>> into files or a more conventional database of some sort.  Can anybody point
>>> to an example of something that looks more like "storing actors in a
>>> database?"
>>> - It strikes me that the core issues with doing so have to do with
>>> maintaining "aliveness" - i.e., dealing with addressability, routing
>>> messages to a stored actor, waking up after a timeout (i.e., the equivalent
>>> of triggers)
>>>
>>> 2. One obvious (if simplistic) thought: Does one really need to think in
>>> terms of a "database" at all - or might this problem be approached simply by
>>> creating each document as an Erlang process, and keeping it around forever?
>>> Most of what I've seen built in Erlang focuses on relatively short-lived
>>> actors - I'd be really interested in comments on:
>>> - limitations/issues in persisting 100s of 1000s, or maybe millions of
>>> actors, for extended periods of time (years, or decades)
>>> - are there any tools/models for migrating (swapping?) inactive processes
>>> dynamically to/from disk storage
>>>
>>> 3. What about backup for the state of a process?  'Let it crash' is great
>>> for servers supporting a reliable protocol, not so great for an actor that
>>> has internal state that has to be preserved (like a simulated tank, or a
>>> "smart document"). Pushing into a database is obvious, but...
>>> - are there any good models for saving/restoring state within a tree of
>>> supervised processes?
>>> - what about models for synchronizing state across replicated copies of
>>> processes running on different nodes?
>>> - what about backup/restore of entire Erlang VMs (including anything that
>>> might be swapped out onto disk)
>>>
>>> 4. For communications between/among actors:  Erlang is obviously excellent
>>> for writing pub-sub engines (RabbitMQ and ejabberd come to mind), but what
>>> about pub-sub or multicast/broadcast models or messaging between Erlang
>>> processes?  Are there any good libraries for defining/managing process
>>> groups, and doing multicast or broadcast messaging to/among a group of
>>> processes?
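
For process groups, OTP itself ships a group module. When this thread was written that was pg2. pg2 was removed in OTP 24, and its replacement pg has been available since OTP 23. The sketch below assumes OTP 23 or later. The wrapper module doc_group and the message shape {Group, Msg} are hypothetical.

```erlang
-module(doc_group).
-export([start/0, join/2, publish/2]).

%% Start the default pg scope (normally done under a supervisor).
start() ->
    case pg:start_link() of
        {ok, _Pid} -> ok;
        {error, {already_started, _Pid}} -> ok
    end.

%% Add a process to a named group; any term can serve as the group name.
join(Group, Pid) ->
    pg:join(Group, Pid).

%% Send Msg to every current member and return how many were addressed.
publish(Group, Msg) ->
    Members = pg:get_members(Group),
    [Pid ! {Group, Msg} || Pid <- Members],
    length(Members).
```

This is a plain send loop, not a reliable or atomic multicast: a member that dies between the lookup and the send simply misses the message. Group membership is shared between connected nodes running the same scope, so the same loop reaches members on other nodes.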
>>>
>>> Thank you very much for any pointers or thoughts.
>>>
>>> Miles Fidelman
>>>
>>>
>>>
>>>
>>> --
>>> In theory, there is no difference between theory and practice.
>>> In practice, there is.   .... Yogi Berra
>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>>
>>
>> --
>> J.

-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra