[erlang-questions] "actor database" - architectural strategy question

Mon Feb 17 21:42:31 CET 2014

“Large number of processes with very long persistence”

You *will* run into GC issues here, and of all kinds:
   - design artifacts (“hmm, the number of lists that I manipulate increases relentlessly…”)
   - misunderstanding (“But I passed the binary on, without manipulating it at all!”)
   - bugs (Fred has a great writeup on this somewhere)

Just keep in mind that in the end, you will almost certainly end up doing some form of manual GC.  Again, the Heroku gang can probably provide a whole bunch of pointers on this…
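
For what it's worth, here is a minimal sketch of what "manual GC" can look like for a long-lived process. The module name gc_sketch, the message shapes and the 5-second idle timeout are all made up for illustration, and the map syntax assumes R17 or later. The only BEAM facilities it relies on are the fullsweep_after spawn option, erlang:garbage_collect/0 and erlang:hibernate/3. The "passed the binary on" surprise usually comes from large binaries living off-heap and being reference counted, so a process that merely touched one can keep it alive until that process itself collects.

```erlang
-module(gc_sketch).
-export([start/0, loop/1]).

%% Spawn a long-lived worker (hypothetical example process).
%% fullsweep_after = 0 makes every collection a full sweep, which is one
%% common way to drop stale references to large off-heap binaries sooner.
start() ->
    spawn_opt(?MODULE, loop, [#{}], [{fullsweep_after, 0}]).

loop(State) ->
    receive
        {put, Key, Value} ->
            loop(State#{Key => Value});
        {get, From, Key} ->
            From ! {value, maps:get(Key, State, undefined)},
            loop(State);
        {compact, From} ->
            %% Explicitly collect this process's own heap on request.
            true = erlang:garbage_collect(),
            From ! compacted,
            loop(State);
        stop ->
            ok
    after 5000 ->
        %% Idle for 5 s (arbitrary): hibernate discards the call stack,
        %% garbage collects and shrinks the heap until the next message,
        %% at which point loop(State) is called again.
        erlang:hibernate(?MODULE, loop, [State])
    end.
```

Whether a forced collection, hibernation, or a tuned fullsweep_after is the right tool depends on the workload; measure before committing to any of them.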

Cheers
Mahesh Paolini-Subramanya
That tall bald Indian guy..  

On February 17, 2014 at 3:22:22 PM, Miles Fidelman (mfidelman@REDACTED) wrote:

Joe Armstrong wrote:  
> This sounds interesting. To start with, I think swapping processes to
> disk is just an optimization.  
> In theory you could just keep everything in RAM forever. I guess  
> processes could keep their state in dictionaries (so you could roll  
> them back) or ets tables (if you didn't want to roll them back).  
>  
> You would need some form of crash recovery so processes should write  
> some state information  
> to disk at suitable points in the program.  
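
A rough sketch of the "write some state information to disk at suitable points" idea quoted above, assuming the whole state fits in one Erlang term. The module name checkpoint_sketch and its save/restore functions are hypothetical; the calls used (term_to_binary/1, binary_to_term/1, file:write_file/2, file:rename/2, file:read_file/1) are standard. Path is assumed to be a plain string.

```erlang
-module(checkpoint_sketch).
-export([save/2, restore/2]).

%% Write the state to a temporary file first and rename it into place,
%% so a crash in the middle of a write does not clobber the previous
%% checkpoint.
save(Path, State) ->
    Tmp = Path ++ ".tmp",
    ok = file:write_file(Tmp, term_to_binary(State)),
    file:rename(Tmp, Path).

%% Load the last checkpoint, or fall back to Default if none exists yet.
restore(Path, Default) ->
    case file:read_file(Path) of
        {ok, Bin} -> binary_to_term(Bin);
        {error, enoent} -> Default
    end.
```

A process would call restore/2 when it starts and save/2 at whatever points count as "suitable" for the application, for example after every N updates.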

Joe... can you offer any insight into the dynamics of Erlang, when  
running with large number of processes that have very long persistence?  
Somehow, it strikes me that 100,000 processes with 1MB of state, each  
running for years at a time, have a different dynamic than 100,000  
processes, each representing a short-lived protocol transaction (say a  
web query).  

Coupled with a communications paradigm for identifying a group of  
processes and sending each of them the same message (e.g., 5000 people  
have a copy of a book, send all 5000 of them a set of errata; or send a  
message asking 'who has updates for section 3.2?').
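
To make the group-messaging idea concrete, a sketch along these lines might do. It assumes the pg module that ships with OTP 23 and later (in 2014 the equivalent would have been pg2); the module name errata_sketch, the group name book_owners and the message formats are invented for the example, and 5 holders stand in for the 5000.

```erlang
-module(errata_sketch).
-export([demo/0, holder/1]).

%% Each holder keeps the errata it has received, keyed by section.
holder(Pages) ->
    receive
        {errata, Section, Text} ->
            holder(Pages#{Section => Text});
        {who_has_updates, From, Section} ->
            case maps:is_key(Section, Pages) of
                true  -> From ! {has_updates, self(), Section};
                false -> ok
            end,
            holder(Pages);
        stop ->
            ok
    end.

%% Send the same message to every current member of the group.
broadcast(Group, Msg) ->
    lists:foreach(fun(Pid) -> Pid ! Msg end, pg:get_members(Group)).

demo() ->
    %% Start the default pg scope; a real system would supervise it.
    case pg:start_link() of
        {ok, _} -> ok;
        {error, {already_started, _}} -> ok
    end,
    Holders = [spawn(?MODULE, holder, [#{}]) || _ <- lists:seq(1, 5)],
    ok = pg:join(book_owners, Holders),
    broadcast(book_owners, {errata, "3.2", "corrected figure"}),
    broadcast(book_owners, {who_has_updates, self(), "3.2"}),
    %% Collect one reply per holder, giving up after a second each.
    Replies = [receive {has_updates, Pid, "3.2"} -> Pid
               after 1000 -> timeout
               end || Pid <- Holders],
    broadcast(book_owners, stop),
    length([R || R <- Replies, is_pid(R)]).
```

The sends are plain asynchronous messages, so the cost of a broadcast grows linearly with group size; whether that is acceptable for thousands of long-lived members is exactly the kind of thing worth measuring.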

In some sense, the conceptual model is:  
1. I send you an empty notebook.  
2. The notebook has an address and a bunch of message handling routines  
3. I can send a page to the notebook, and the notebook inserts the page.  
4. You can interact with the notebook - read it, annotate it, edit  
certain sections - if you make updates, the notebook can distribute  
updates to other copies - either through a P2P mechanism or a  
publish-subscribe mechanism.  

At a basic level, this maps really well onto the Actor formalism - every  
notebook is an actor, with its own address. Updates, interactions,
queries, etc. are simply messages.  
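
As a sketch of the notebook-as-actor model described above, and only that: one process per notebook, the pid as its address, and a receive loop as the "message handling routines". The module name notebook_sketch, the API functions and the message tuples are all assumptions made for the illustration. Updates are pushed to subscribed copies in a simple publish-subscribe style; the P2P variant is not shown.

```erlang
-module(notebook_sketch).
-export([new/0, add_page/3, read/2, subscribe/2, loop/2]).

%% Step 1: an empty notebook is a process with no pages and no subscribers.
new() ->
    spawn(?MODULE, loop, [#{}, []]).

%% Step 3: send a page to the notebook, which inserts it.
add_page(Notebook, N, Text) ->
    Notebook ! {page, N, Text},
    ok.

%% Register another notebook (a copy) to receive future updates.
subscribe(Notebook, Copy) ->
    Notebook ! {subscribe, Copy},
    ok.

%% Step 4: read a page; returns {ok, Text}, error, or {error, timeout}.
read(Notebook, N) ->
    Ref = make_ref(),
    Notebook ! {read, self(), Ref, N},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        {error, timeout}
    end.

%% Step 2: the message handling routines.
loop(Pages, Subscribers) ->
    receive
        {page, N, Text} ->
            %% Forward as `update` so copies store it without re-forwarding.
            lists:foreach(fun(S) -> S ! {update, N, Text} end, Subscribers),
            loop(Pages#{N => Text}, Subscribers);
        {update, N, Text} ->
            loop(Pages#{N => Text}, Subscribers);
        {subscribe, Copy} ->
            loop(Pages, [Copy | Subscribers]);
        {read, From, Ref, N} ->
            From ! {Ref, maps:find(N, Pages)},
            loop(Pages, Subscribers);
        stop ->
            ok
    end.
```

Note that all state here lives on the process heap, so persistence across restarts would have to be added separately.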

Since Erlang is about the only serious implementation of the Actor  
formalism, I'm trying to poke at the edge cases - particularly around  
long-lived actors. And who better to ask than you :-)  

In passing: Early versions of Smalltalk were actor-like, encapsulating  
state, methods, and process - but process kind of got dropped along the  
way. By contrast, it strikes me that Erlang focuses on everything being  
a process, and long-term persistence of state has taken a back seat.  
I'm trying to probe the edge cases. (I guess another way of looking at  
this is: to what extent is Erlang workable for writing systems based  
around the mobile agent paradigm?)  

>  
> What I think is a more serious problem is getting data into the system  
> in the first place.  
> I have done some experiments with document commenting and annotation  
> systems and  
> found it very difficult to convert things like word documents into a  
> form that looks half  
> decent in a user interface.  

Haven't actually thought a lot about that part of the problem. I'm  
thinking of documents that are more form-like in nature, or at least  
built up from smaller components - so it's not so much going from Word  
to an internal format, as much as starting with XML or JSON (or tuples),  
building up structure, and then adding presentation at the final step.  
XML -> Word is a lot easier than the reverse :-)  

On the other hand, I do have a bunch of applications in mind where  
parsing Word and/or PDF would be very helpful - notably stripping  
requirements out of specifications. (I can't tell you how much of my  
time I spend manually cutting and pasting from specifications into  
spreadsheets - for requirements tracking and such.) Again, presentation  
isn't that much of an issue - structural and semantic analysis is. But,  
while important, that's a separate set of problems - and there are some  
commercial products that do a reasonably good job.  

> I want to parse Microsoft word files and PDF etc. and display them in  
> a format that is  
> recognisable and not too abhorrent to the user. I also want to allow  
> on-screen manipulation of  
> documents (in a browser) - all of this seems to require a mess of  
> Javascript (in the browser) and a mess of parsing programs in the server.
>  
> Before we can manipulate documents we must parse them and turn them  
> into a format  
> that can be manipulated. I think this is more difficult than the
> storing and manipulating documents  
> problem. You'd also need support for full-text indexing, foreign  
> language and multiple character sets and so  
> on. Just a load of horrible messy small problems, but a significant  
> barrier to importing large amounts  
> of content into the system.  
>  
> You'd also need some quality control of the documents as they enter  
> the system (to avoid rubbish in rubbish out), also to maintain the  
> integrity of the documents.  

Again, for this problem space, it's more about building up complex  
documents from small pieces, than carving up pre-existing documents.  
More like the combination of an IDE and a distributed CVS - where fully  
"compiled" documents are the final output.  

>  
> If you have any ideas of how to get large volumes of data into the
> system from proprietary formats  
> (like ms word) I'd like to hear about it.  
>  

Me too :-) Though, I go looking for such things every once in a while, and:  
- there are quite a few PDF to XML parsers, but mostly commercial ones  
- there are a few PDF and Word "RFP stripping" products floating around,  
that are smart enough to actually analyze the content of structured  
documents (check out Meridian)  
- later versions of Word export XML, albeit poor XML  
- there are quite a few document analysis packages floating around,  
including ones that start from OCR images - but they generally focus on  
content (lexical analysis) and ignore structure (it's easier to scan a
document and extract some measure of what it's about - e.g. for indexing  
purposes; it's a lot harder to find something that will extract the  
outline structure of a document)  

Cheers,  

Miles  

--  
In theory, there is no difference between theory and practice.  
In practice, there is. .... Yogi Berra  
