[erlang-questions] "actor database" - architectural strategy question

Mon Feb 17 20:15:17 CET 2014

This sounds interesting. To start wit,  I think swapping processes to disk
is just an optimization.
In theory you could just keep everything in RAM forever. I guess processes
could keep their state in dictionaries (so you could roll them back) or ets
tables (if you didn't want to roll them back).

You would need some form of crash recovery so processes should write some
state information
to disk at suitable points in the program.

What I think is a more serious problem is getting data into the system in
the first place.
I have done some experiments with document commenting and annotation
systems and
found it very difficult to convert things like word documents into a form
that looks half
decent in a user interface.

I want to parse Microsoft word files and PDF etc. and display them in a
format that is
recognisable and not too abhorrent to the user. I also want to allow
on-screen manipulation of
documents (in a browser) - all of this seems to require a mess of
Javascript (in the browser)and a mess of parsing programs inn the server.

Before we can manipulate documents we must parse them and turn them into a
format
that can be manipulated. I think this is more difficult that the storing
and manipulating documents
problem. You'd also need support for full-text indexing, foreign language
and multiple character sets and so
on. Just a load of horrible messy small problems, but a significant barrier
to importing large amounts
of content into the system.

You'd also need some quality control of the documents as they enter the
system (to avoid rubbish in rubbish out), also to maintain the integrity of
the documents.

If you have any ideas of now to get large volumes of data into the system
from proprietary formats
(like ms word) I'd like to hear about it.

Cheers

/Joe

On Mon, Feb 17, 2014 at 3:20 PM, Miles Fidelman
<mfidelman@REDACTED>wrote:

> [Enough with the threads on Erlang angst for a while - time for some real
> questions :-) ]
>
> BACKGROUND:
> A lot of what I do is systems engineering, and a lot of that ends up in
> the realm of technology assessment - picking the right platform and tools
> for a particular system.  My dablings in Erlang are largely in that
> category - I keep seeing it as potentially useful for a class of systems,
> keep experimenting with it, done a couple proof-of-concept efforts, but
> haven't built an operational system at scale with it (yet).  The focus, so
> far, has been in modeling and simulation (I first discovered Erlang when
> chasing R&D contracts for a firm that built simulation engines for military
> trainers.  I was flabbergasted to discover that everything was written in
> C++, every simulated entity was an object, with 4 main loops threading
> through every object, 20 times a second.  Talk about spaghetti code.
>  Coming from a data comm. protocol/network background - where we'd spawn a
> process for everything - I asked the obvious question, and was told that
> context switches would bring a 10,000 entity simulation to its knees.  My
> instinctual response was "bullshit" - and went digging into the technology
> for massive concurrency, and discovered Erlang.)
>
> Anyway....  For years, I've been finding myself in situations, and on
> projects, that have a common characteristic of linked documents that change
> a lot - in the general arena of planning and workflow. Lots of people, each
> editing different parts of different documents - with changes rippling
> through the collection.  Think linked spreadsheets, tiered project plans,
> multi-level engineering documents with lots of inter-dependencies.  To be
> more concrete: systems engineering documents, large proposals, business
> planning systems, command and control systems.
>
> Add in requirements for disconnected operation that lead to
> distribution/replication requirements rather than keeping single, central
> copies of things (as the librarians like to say, "Lots of Copies Keeps
> Stuff Safe").
>
> So far we've always taken conventional approaches - ranging from manual
> paper shuffling and xeroxing, to file servers with manual organization, to
> some of MS Office's document linking capabilities, to document databases
> and sharepoint.  And played with some XML database technologies.
>
> But.... I keep thinking that there are a set of underlying functions that
> beg for better tools - something like a distributed CVS that's optimized
> for planning documents rather than software (or perhaps something like a
> modernized Lotus Notes).
>
> And I keep thinking that the obvious architectural model is to treat each
> document (maybe each page) as an actor ("smart documents" if you will),
> with communication through publish-subscribe mechanisms. Interact with a
> (copy of) a document, changes get pushed to groups of documents via a
> pub-sub mechanism.  (Not unlike actor based simulation approaches.)
>
> And, of course, when I think actors, I think Erlang.  The obvious
> conceptualization is "every document is an actor."
>
> At which point an obvious question comes up:  How to handle long-term
> persistence, for large numbers of inactive entities.
>
> But... when I go looking for examples of systems that might be built this
> way, I keep finding that, even in Erlang-based systems, persistence is
> handled in fairly conventional ways:
> - One might think that CouchDB treats every document as an actor, but
> think again
> - Paulo Negri has given some great presentations on how Wooga implements
> large-scale social gaming - and they implement an actor per session - but
> when a user goes off-line they push state into a more conventional database
>  (then initialize a gen_server from the database, when the user comes back
> online)
>
> At which point the phrase "actor-oriented database" keeps coming back to
> mind, with the obvious analogy to "object-oriented databases."  I.e.,
> something with the persistence and other characteristics of a database,
> where the contents are actors - with all the characteristics and
> functionality of those actors preserved while stored in the database.
>
> ON TO THE QUESTIONS:
> I have a pretty good understanding of how one would build things like
> simulations, or protocol servers, with Erlang - not so much how one might
> build something with long-term persistence - which leads to some questions
> (some, probably naive):
>
> 1. So far, I haven't seen anything that actually looks like an
> "actor-oriented database."  Document databases implemented in Erlang, yes
> (e.g., CouchDB), but every example I find ultimately pushes persistent data
> into files or a more conventional database of some sort.  Can anybody point
> to an example of something that looks more like "storing actors in a
> database?"
> - It strikes me that the core issues with doing so have to do with
> maintaining "aliveness" - i.e., dealing with addressability, routing
> messages to a stored actor, waking up after a timeout (i.e., the equivalent
> of triggers)
>
> 2. One obvious (if simplistic) thought: Does one really need to think in
> terms of a "database" at all - or might this problem be approached simply
> by creating each document as an Erlang process, and keeping it around
> forever?  Most of what I've seen built in Erlang focuses on relatively
> short-lived actors - I'd be really interested in comments on:
> - limitations/issues in persisting 100s of 1000s, or maybe millions of
> actors, for extended periods of time (years, or decades)
> - are there any tools/models for migrating (swapping?) inactive processes
> dynamically to/from disk storage
>
> 3. What about backup for the state of a process?  'Let it crash' is great
> for servers supporting a reliable protocol, not so great for an actor that
> has  internal state that has to be preserved (like a simulated tank, or a
> "smart document"). Pushing into a database is obvious, but...
> - are there any good models for saving/restoring state within a tree of
> supervised processes?
> - what about models for synchronizing state across replicated copies of
> processes running on different nodes?
> - what about backup/restore of entire Erlang VMs (including anything that
> might be swapped out onto disk)
>
> 4. For communications between/among actors:  Erlang is obviously excellent
> for writing pub-sub engines (RabbitMQ and ejabberd come to mind), but what
> about pub-sub or multicast/broadcast models or messaging between Erlang
> processes?  Are there any good libraries for defining/managing process
> groups, and doing multicast or broadcast messaging to/among a group of
> processes.
>
> Thank you very much for any pointers or thoughts.
>
> Miles Fidelman
>
>
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20140217/7d0ed3a7/attachment.htm>