<div dir="ltr">This sounds interesting. To start with, I think swapping processes to disk is just an optimization.<div>In theory you could just keep everything in RAM forever. I guess processes could keep their state in dictionaries (so you could roll them back) or ets tables (if you didn't want to roll them back).</div>
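<div><br></div><div>A minimal sketch of the roll-back idea, written in Python purely for illustration (the class and method names are made up for this example and are not from any library). Each update builds a new dictionary instead of mutating the old one, so earlier versions stay around and rolling back just means dropping the newest ones. A shared mutable table, which is roughly what ets gives you, overwrites in place and keeps no history.</div>

```python
from types import MappingProxyType


class DocumentState:
    """Hypothetical per-document state that keeps every version for rollback."""

    def __init__(self, initial=None):
        # The list of versions is the history; the last entry is the live state.
        self._versions = [dict(initial or {})]

    @property
    def current(self):
        # Read-only view, so callers cannot bypass update() and lose history.
        return MappingProxyType(self._versions[-1])

    def update(self, key, value):
        # Copy-on-write: the previous version is left untouched.
        new_version = dict(self._versions[-1])
        new_version[key] = value
        self._versions.append(new_version)

    def rollback(self, steps=1):
        # At least one version (the initial state) must always remain.
        if steps not in range(1, len(self._versions)):
            raise ValueError("cannot roll back that many steps")
        del self._versions[-steps:]
```

<div>With this, calling update("owner", "joe") and then rollback() leaves the state exactly as it was before the update. Keeping whole copies is wasteful for large documents; it is only meant to show the shape of the idea.</div>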
<div><br></div><div>You would need some form of crash recovery so processes should write some state information</div><div>to disk at suitable points in the program.</div><div><br></div><div>What I think is a more serious problem is getting data into the system in the first place.</div>
<div>I have done some experiments with document commenting and annotation systems and</div><div>found it very difficult to convert things like Word documents into a form that looks half</div><div>decent in a user interface.</div>
<div><br></div><div>I want to parse Microsoft Word files, PDFs etc. and display them in a format that is</div><div>recognisable and not too abhorrent to the user. I also want to allow on-screen manipulation of</div><div>
documents (in a browser) - all of this seems to require a mess of JavaScript (in the browser) and a mess of parsing programs on the server.</div><div><br></div><div>Before we can manipulate documents we must parse them and turn them into a format</div>
<div>that can be manipulated. I think this is more difficult than the problem of storing and</div><div>manipulating documents. You'd also need support for full-text indexing, foreign languages and multiple character sets and so</div>
<div>on. Just a load of horrible messy small problems, but a significant barrier to importing large amounts</div><div>of content into the system.</div><div><br></div><div>You'd also need some quality control of the documents as they enter the system, both to avoid rubbish in, rubbish out and to maintain the integrity of the documents.</div>
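<div><br></div><div>To make the character-set and quality-control points concrete, here is a rough Python sketch of an ingest gate. Everything in it is an assumption for illustration: the function names, the list of candidate encodings and the 1% control-character threshold are invented, and real Word or PDF parsing would have to happen before this step. It tries a few encodings in order, normalises the text to Unicode NFC, and rejects input that is empty or mostly binary junk.</div>

```python
import unicodedata


def decode_document(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; return (NFC text, encoding used)."""
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        # NFC makes visually identical strings compare equal for indexing.
        return unicodedata.normalize("NFC", text), enc
    raise ValueError("no candidate encoding could decode the document")


def passes_quality_check(text, max_control_ratio=0.01):
    """Reject empty text and text with too many control characters."""
    if not text.strip():
        return False
    control = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
    )
    return max_control_ratio >= control / len(text)
```

<div>A document would only be stored if decode_document succeeds and passes_quality_check returns True. The order of the encodings matters, since a permissive encoding listed first would accept almost anything.</div>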
<div><br></div><div>If you have any ideas on how to get large volumes of data into the system from proprietary formats</div><div>(like MS Word) I'd like to hear about them.</div><div><br></div><div>Cheers</div><div><br>
</div><div>/Joe</div>
<div><br></div><div><br></div><div><br></div><div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Feb 17, 2014 at 3:20 PM, Miles Fidelman <span dir="ltr">&lt;<a href="mailto:mfidelman@meetinghouse.net" target="_blank">mfidelman@meetinghouse.net</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">[Enough with the threads on Erlang angst for a while - time for some real questions :-) ]<br>
<br>
BACKGROUND:<br>
A lot of what I do is systems engineering, and a lot of that ends up in the realm of technology assessment - picking the right platform and tools for a particular system. My dabblings in Erlang are largely in that category - I keep seeing it as potentially useful for a class of systems, keep experimenting with it, have done a couple of proof-of-concept efforts, but haven't built an operational system at scale with it (yet). The focus, so far, has been on modeling and simulation (I first discovered Erlang when chasing R&amp;D contracts for a firm that built simulation engines for military trainers. I was flabbergasted to discover that everything was written in C++, every simulated entity was an object, with 4 main loops threading through every object, 20 times a second. Talk about spaghetti code. Coming from a data comm. protocol/network background - where we'd spawn a process for everything - I asked the obvious question, and was told that context switches would bring a 10,000 entity simulation to its knees. My instinctual response was "bullshit" - and I went digging into the technology for massive concurrency, and discovered Erlang.)<br>
<br>
Anyway.... For years, I've been finding myself in situations, and on projects, that have a common characteristic of linked documents that change a lot - in the general arena of planning and workflow. Lots of people, each editing different parts of different documents - with changes rippling through the collection. Think linked spreadsheets, tiered project plans, multi-level engineering documents with lots of inter-dependencies. To be more concrete: systems engineering documents, large proposals, business planning systems, command and control systems.<br>
<br>
Add in requirements for disconnected operation that lead to distribution/replication requirements rather than keeping single, central copies of things (as the librarians like to say, "Lots of Copies Keeps Stuff Safe").<br>
<br>
So far we've always taken conventional approaches - ranging from manual paper shuffling and xeroxing, to file servers with manual organization, to some of MS Office's document linking capabilities, to document databases and SharePoint. And we've played with some XML database technologies.<br>
<br>
But.... I keep thinking that there are a set of underlying functions that beg for better tools - something like a distributed CVS that's optimized for planning documents rather than software (or perhaps something like a modernized Lotus Notes).<br>
<br>
And I keep thinking that the obvious architectural model is to treat each document (maybe each page) as an actor ("smart documents" if you will), with communication through publish-subscribe mechanisms. Interact with (a copy of) a document, and changes get pushed to groups of documents via a pub-sub mechanism. (Not unlike actor-based simulation approaches.)<br>
<br>
And, of course, when I think actors, I think Erlang. The obvious conceptualization is "every document is an actor."<br>
<br>
At which point an obvious question comes up: How to handle long-term persistence, for large numbers of inactive entities.<br>
<br>
But... when I go looking for examples of systems that might be built this way, I keep finding that, even in Erlang-based systems, persistence is handled in fairly conventional ways:<br>
- One might think that CouchDB treats every document as an actor, but think again<br>
- Paolo Negri has given some great presentations on how Wooga implements large-scale social gaming - and they implement an actor per session - but when a user goes off-line they push state into a more conventional database (then initialize a gen_server from the database, when the user comes back online)<br>
<br>
At which point the phrase "actor-oriented database" keeps coming back to mind, with the obvious analogy to "object-oriented databases." I.e., something with the persistence and other characteristics of a database, where the contents are actors - with all the characteristics and functionality of those actors preserved while stored in the database.<br>
<br>
ON TO THE QUESTIONS:<br>
I have a pretty good understanding of how one would build things like simulations, or protocol servers, with Erlang - not so much how one might build something with long-term persistence - which leads to some questions (some, probably naive):<br>
<br>
1. So far, I haven't seen anything that actually looks like an "actor-oriented database." Document databases implemented in Erlang, yes (e.g., CouchDB), but every example I find ultimately pushes persistent data into files or a more conventional database of some sort. Can anybody point to an example of something that looks more like "storing actors in a database?"<br>
- It strikes me that the core issues with doing so have to do with maintaining "aliveness" - i.e., dealing with addressability, routing messages to a stored actor, waking up after a timeout (i.e., the equivalent of triggers)<br>
<br>
2. One obvious (if simplistic) thought: Does one really need to think in terms of a "database" at all - or might this problem be approached simply by creating each document as an Erlang process, and keeping it around forever? Most of what I've seen built in Erlang focuses on relatively short-lived actors - I'd be really interested in comments on:<br>
- limitations/issues in persisting 100s of 1000s, or maybe millions of actors, for extended periods of time (years, or decades)<br>
- are there any tools/models for migrating (swapping?) inactive processes dynamically to/from disk storage<br>
<br>
3. What about backup for the state of a process? 'Let it crash' is great for servers supporting a reliable protocol, not so great for an actor that has internal state that has to be preserved (like a simulated tank, or a "smart document"). Pushing into a database is obvious, but...<br>
- are there any good models for saving/restoring state within a tree of supervised processes?<br>
- what about models for synchronizing state across replicated copies of processes running on different nodes?<br>
- what about backup/restore of entire Erlang VMs (including anything that might be swapped out onto disk)<br>
<br>
4. For communications between/among actors: Erlang is obviously excellent for writing pub-sub engines (RabbitMQ and ejabberd come to mind), but what about pub-sub or multicast/broadcast models for messaging between Erlang processes? Are there any good libraries for defining/managing process groups, and doing multicast or broadcast messaging to/among a group of processes?<br>
<br>
Thank you very much for any pointers or thoughts.<span><font color="#888888"><br>
<br>
Miles Fidelman<br>
<br>
<br>
<br>
<br>
-- <br>
In theory, there is no difference between theory and practice.<br>
In practice, there is. .... Yogi Berra<br>
<br>
_______________________________________________<br>
erlang-questions mailing list<br>
<a href="mailto:erlang-questions@erlang.org" target="_blank">erlang-questions@erlang.org</a><br>
<a href="http://erlang.org/mailman/listinfo/erlang-questions" target="_blank">http://erlang.org/mailman/listinfo/erlang-questions</a><br>
</font></span></blockquote></div><br></div></div></div>