[erlang-questions] "actor database" - architectural strategy question

Miles Fidelman mfidelman@REDACTED
Tue Feb 18 00:22:13 CET 2014


[By the way folks - all the other threads going on be damned - this is a 
great community.  Thank you all for the rapid and useful input to what 
is, as yet, a vaporous system concept!]


Hi Joe,

First off, thanks for the response!

Following up, inline:


Joe Armstrong wrote:
>
>
>
> On Mon, Feb 17, 2014 at 9:22 PM, Miles Fidelman
> <mfidelman@REDACTED> wrote:
>
>
>     Joe...  can you offer any insight into the dynamics of Erlang,
>     when running with large number of processes that have very long
>     persistence? 
>
>
> No - this area has not to my knowledge been investigated. The "use
> lots of processes" or "as many processes as necessary" approach has
> the implicit assumptions that a) the processes are not very large and
> b) they are not very long lived. At the back of my mind I'm thinking
> of a) as "a few hundred KB resident size" and b) a few seconds to
> minutes. I'm *not* thinking MBs and years. The latter requirements
> fit into our "telecoms domain" - a few thousands to tens of thousands
> of computations living for "the length of a telephone call", i.e.
> (max) hours but not years.
>
> Some kind of "getting things out of memory and onto disk when not
> needed" layer is needed for your problem.

Ok.  After reading what others have said about garbage collection, this 
is clearly issue number one that I'll need to address.

At first glance, it strikes me that the hibernate BIF does at least part 
of what's needed.  Any thoughts or suggestions on whether it makes more 
sense to approach this by extending hibernate, or to do something at the 
application layer?  And, if playing with the BIF makes sense, any quick 
pointers to where I might find detailed documentation on how it's 
implemented?
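
Just to make the question concrete, here's the kind of thing I have in 
mind at the simplest level - a sketch (module, function and message names 
are mine, purely for illustration) of a plain process that hibernates 
between messages, so its stack is thrown away and its heap compacted 
while it sits idle:

-module(notebook).
-export([start/1, loop/1]).

%% Spawn a notebook process holding some initial state
%% (e.g. a list of pages).
start(State) ->
    spawn(?MODULE, loop, [State]).

loop(State) ->
    receive
        {add_page, Page} ->
            %% Discard the call stack, compact the heap, and sleep
            %% until the next message arrives; execution then resumes
            %% at loop/1 with the new state.
            erlang:hibernate(?MODULE, loop, [[Page | State]]);
        {read, From} ->
            From ! {notebook, self(), State},
            erlang:hibernate(?MODULE, loop, [State])
    end.

(And I gather a gen_server can get much the same effect by returning 
{noreply, State, hibernate} from its callbacks.)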

>
>      Somehow, it strikes me that 100,000 processes with 1MB of state,
>     each running for years at a time, have a different dynamic than
>     100,000 processes, each representing a short-lived protocol
>     transaction (say a web query).
>
>
> My first comment is, thanks for providing some numbers :-) I keep
> saying time and time again, don't ask questions without numbers. 100K
> processes with 1MB of state = 10^11 bytes (100 GB), so you'd need a
> really big machine to do this. Assuming say 8GB of memory and 1MB of
> state per process, you'd have an upper limit of 8K processes. This
> assumes a regular spinning disk. I guess if you have a big SSD the
> story changes.
>
> So you either have to reduce the size of the state, or the number of 
> processes. The state can (I suppose) be partitioned into a (small) 
> index and a (larger) content. So I'd keep the index in memory and the 
> content
> on disk (or cached).

Which also brings us back to keeping most of the documents in some kind 
of hibernation, stored on disk, but ready to wake up if called on.
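
Concretely, I'm picturing something like a small in-memory index plus 
on-disk state - along the lines of this sketch (all module, table and 
file-layout names are invented; DocId is assumed to be an integer, and 
it reuses the notebook sketch from above):

-module(doc_registry).
-export([start/0, send/2]).

%% A small index mapping document ids to live pids lives in ETS; the
%% (much larger) document state lives on disk until it is needed.
start() ->
    ets:new(doc_index, [named_table, public, set]),
    ok.

%% Deliver Msg to a document's process, waking it from disk if it is
%% not currently resident in memory.
send(DocId, Msg) ->
    Pid = case ets:lookup(doc_index, DocId) of
              [{DocId, P}] ->
                  %% local node only, for simplicity
                  case is_process_alive(P) of
                      true  -> P;
                      false -> wake(DocId)
                  end;
              [] ->
                  wake(DocId)
          end,
    Pid ! Msg,
    Pid.

%% Read the state that was term_to_binary'd to a file when the process
%% was swapped out, and restart the document process from it.
wake(DocId) ->
    {ok, Bin} = file:read_file(state_file(DocId)),
    Pid = notebook:start(binary_to_term(Bin)),
    ets:insert(doc_index, {DocId, Pid}),
    Pid.

state_file(DocId) ->
    filename:join("doc_store", integer_to_list(DocId)).

The index here is tiny compared to the content, which fits your point 
about partitioning the state into a small index and larger content.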

>
>     Coupled with a communications paradigm for identifying a group of
>     processes and sending each of them the same message (e.g., 5000
>     people have a copy of a book, send all 5000 of them a set of
>     errata; or send a message asking "who has updates for section 3.2?").
>
>
> Hopefully all 5000 people will not want the errata at the same time

Here's where I think pub-sub and replication come in.
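
As a rough sketch of what I mean (group naming and message shape are 
invented, and this only reaches copies that are currently live - which 
is exactly where the hibernation/wake-up layer would come in), something 
like pg2 process groups:

-module(book_groups).
-export([join/1, send_errata/2]).

%% Each live copy of a book joins a named process group when it starts.
join(BookId) ->
    Group = {book, BookId},
    ok = pg2:create(Group),          %% idempotent if the group exists
    ok = pg2:join(Group, self()).

%% Multicast the same errata message to every current member.
send_errata(BookId, Errata) ->
    case pg2:get_members({book, BookId}) of
        Pids when is_list(Pids) ->
            [Pid ! {errata, Errata} || Pid <- Pids],
            ok;
        {error, _} = Error ->
            Error
    end.
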
>
>
>     In some sense, the conceptual model is:
>     1. I send you an empty notebook.
>     2. The notebook has an address and a bunch of message handling
>     routines
>     3. I can send a page to the notebook, and the notebook inserts the
>     page.
>     4. You can interact with the notebook - read it, annotate it, edit
>     certain sections - if you make updates, the notebook can
>     distribute updates to other copies - either through a P2P
>     mechanism or a publish-subscribe mechanism.
>
>     At a basic level, this maps really well onto the Actor formalism -
>     every notebook is an actor, with its own address.  Updates,
>     interactions, queries, etc. are simply messages.
>
>     Since Erlang is about the only serious implementation of the Actor
>     formalism, I'm trying to poke at the edge cases - particularly
>     around long-lived actors.  And who better to ask than you :-)
>
>
> It's a very good question. I like questions that poke around at the 
> edges of what is possible :-)
>
>
>     In passing: Early versions of Smalltalk were actor-like,
>     encapsulating state, methods, and process - but process kind of
>     got dropped along the way.  By contrast, it strikes me that Erlang
>     focuses on everything being a process, and long-term persistence
>     of state has taken a back seat. 
>
>
> Yes - I guess the real solution would be to change the scheduler to 
> swap processes to disk after they had waited for more than (say) 10 
> minutes for a message, and resurrect them when they are sent a message.


Any thoughts on how to do this - perhaps in combination with extending 
the hibernate BIF?
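
In the meantime, an application-layer approximation of that scheduler 
change seems doable without touching the runtime: a gen_server that 
writes its state to disk and stops after sitting idle for a while, with 
a registry (like the one sketched above, adapted to restart this server) 
resurrecting it when the next message arrives. A rough sketch - the 
timeout value and file layout are just placeholders, and it assumes the 
doc_store directory exists:

-module(doc_server).
-behaviour(gen_server).
-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Placeholder: swap out after ten idle minutes, per Joe's suggestion.
-define(IDLE_TIMEOUT, 10 * 60 * 1000).

start_link(DocId, Pages) ->
    gen_server:start_link(?MODULE, {DocId, Pages}, []).

init({DocId, Pages}) ->
    {ok, {DocId, Pages}, ?IDLE_TIMEOUT}.

handle_call(read, _From, {DocId, Pages}) ->
    {reply, Pages, {DocId, Pages}, ?IDLE_TIMEOUT}.

handle_cast({add_page, Page}, {DocId, Pages}) ->
    {noreply, {DocId, [Page | Pages]}, ?IDLE_TIMEOUT}.

%% No message arrived within the idle timeout: swap the (large) state
%% out to disk and let the process die; the registry resurrects it on
%% the next message.
handle_info(timeout, {DocId, Pages} = State) ->
    ok = file:write_file(state_file(DocId), term_to_binary(Pages)),
    {stop, normal, State};
handle_info(_Other, State) ->
    {noreply, State, ?IDLE_TIMEOUT}.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

state_file(DocId) ->
    filename:join("doc_store", integer_to_list(DocId)).

(Alternatively, returning {noreply, State, hibernate} keeps the process 
resident but shrunk, which might be enough for documents whose state 
isn't huge.)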


Cheers,

Miles


------ nothing new below here --------
>
> The idea that they might be swapped out for years hadn't occurred to me.
>
>      I'm trying to probe the edge cases. (I guess another way of
>     looking at this is: to what extent is Erlang workable for writing
>     systems based around the mobile agent paradigm?)
>
>
> Pass - at the moment you'd have to implement your own object layer to
> do this. I guess you could do this yourself by making send and
> receive library routines and making the state of a process explicit
> rather than implicit, then sticking everything into a large store
> (like riak). If you cache the active processes in memory this might
> behave well enough.
>
>
>
>
>
>
>         What I think is a more serious problem is getting data into
>         the system in the first place.
>         I have done some experiments with document commenting and
>         annotation systems and
>         found it very difficult to convert things like Word documents
>         into a form that looks half
>         decent in a user interface.
>
>
>     Haven't actually thought a lot about that part of the problem. I'm
>     thinking of documents that are more form-like in nature, or at
>     least built up from smaller components - so it's not so much going
>     from Word to an internal format, as much as starting with XML or
>     JSON (or tuples), building up structure, and then adding
>     presentation at the final step.  XML -> Word is a lot easier than
>     the reverse :-)
>
>     On the other hand, I do have a bunch of applications in mind where
>     parsing Word and/or PDF would be very helpful - notably stripping
>     requirements out of specifications.  (I can't tell you how much of
>     my time I spend manually cutting and pasting from specifications
>     into spreadsheets - for requirements tracking and such.)  Again,
>     presentation isn't that much of an issue - structural and semantic
>     analysis is.  But, while important, that's a separate set of
>     problems - and there are some commercial products that do a
>     reasonably good job.
>
>
>         I want to parse Microsoft Word files and PDF etc. and display
>         them in a format that is recognisable and not too abhorrent
>         to the user. I also want to allow on-screen manipulation of
>         documents (in a browser) - all of this seems to require a
>         mess of Javascript (in the browser) and a mess of parsing
>         programs in the server.
>
>         Before we can manipulate documents we must parse them and
>         turn them into a format that can be manipulated. I think this
>         is more difficult than the problem of storing and
>         manipulating documents. You'd also need support for full-text
>         indexing, foreign languages and multiple character sets and
>         so on. Just a load of horrible messy small problems, but a
>         significant barrier to importing large amounts of content
>         into the system.
>
>         You'd also need some quality control of the documents as they
>         enter the system (to avoid rubbish in rubbish out), also to
>         maintain the integrity of the documents.
>
>
>     Again, for this problem space, it's more about building up complex
>     documents from small pieces, than carving up pre-existing
>     documents.  More like the combination of an IDE and a distributed
>     CVS - where fully "compiled" documents are the final output.
>
>
>
>         If you have any ideas of how to get large volumes of data
>         into the system from proprietary formats (like MS Word) I'd
>         like to hear about it.
>
>
>     Me too :-)  Though, I go looking for such things every once in a
>     while, and:
>     - there are quite a few PDF to XML parsers, but mostly commercial ones
>
>
> Suck - then you have to buy them to find out if they are any good
>
>     - there are a few PDF and Word "RFP stripping" products floating
>     around, that are smart enough to actually analyze the content of
>     structured documents (check out Meridian)
>
>     - later versions of Word export XML, albeit poor XML
>
>
> Which sucks
>
>     - there are quite a few document analysis packages floating
>     around, including ones that start from OCR images - but they
>     generally focus on content (lexical analysis) and ignore structure
>     (it's easier to scan a document and extract some measure of what
>     it's about - e.g. for indexing purposes; it's a lot harder to find
>     something that will extract the outline structure of a document)
>
>
>     Cheers,
>
>     Miles
>
>
>
>     -- 
>     In theory, there is no difference between theory and practice.
>     In practice, there is.   .... Yogi Berra
>
>


-- 
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



