[erlang-questions] Private ets table vs. Process directory
Charles Hixson
charleshixsn@REDACTED
Thu Feb 8 07:28:32 CET 2018
That works fine in the simple case, but I'm contemplating repeatedly
adjusting weights deep within a nested data structure. Your approach
would result in creating an altered copy of the structure on each
recursion. This is probably only about 1KB or so of information, so
doing it a few times isn't a problem, but doing it millions of times
quickly becomes one.
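For concreteness, here is a minimal sketch of the kind of update I
mean (the nested-map layout is only illustrative, not my actual
structure):

-module(weight_sketch).
-export([adjust/4]).

%% Adjust one weight two levels down in a nested map.  Every call hands
%% back a new version of the outer structure, which is what worries me
%% about doing this millions of times.
adjust(Layer, Node, Delta, State) ->
    maps:update_with(Layer,
        fun(Nodes) ->
            maps:update_with(Node, fun(W) -> W + Delta end, Nodes)
        end,
        State).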
This can be addressed by either ets or the process dictionary, both of
which allow the internal structure to be modified safely. With the
process dictionary it's safe because the information never leaves the
process (except for i/o, which must be special-cased). A private ets
table can likewise handle it without problems. So can a global ets
table, if a unique, process-specific id (NOT the pid, since this needs
to survive restarts) is used as part of the key. So those three
methods would work. The question in my mind is how to predict the
tradeoffs as the system scales up. I suspect that the process
dictionary would use the least memory, though possibly the global ets
table would. A private ets table seems the most natural approach, but
to my naive eyes it looks as if it would scale poorly with respect to
memory use.
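Roughly, those three options would look something like this (module
name and key layout are just illustrative):

-module(state_options).
-export([pd_demo/0, private_ets_demo/0, shared_ets_demo/1]).

%% 1. Process dictionary: the term never leaves the owning process.
pd_demo() ->
    put({weight, layer1, n1}, 0.5),
    put({weight, layer1, n1}, get({weight, layer1, n1}) + 0.1),
    get({weight, layer1, n1}).

%% 2. Private ets table: only the owning process can read or write it.
private_ets_demo() ->
    Tab = ets:new(weights, [set, private]),
    true = ets:insert(Tab, {{layer1, n1}, 0.5}),
    [{_, W}] = ets:lookup(Tab, {layer1, n1}),
    true = ets:insert(Tab, {{layer1, n1}, W + 0.1}),
    ets:lookup(Tab, {layer1, n1}).

%% 3. One shared, named ets table; a stable unit id (not the pid, so it
%%    survives restarts) forms part of the key.
shared_ets_demo(UnitId) ->
    all_weights = ets:new(all_weights, [set, public, named_table]),
    true = ets:insert(all_weights, {{UnitId, layer1, n1}, 0.5}),
    ets:lookup(all_weights, {UnitId, layer1, n1}).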
What I'd really like is a Mnesia setup that kept a cache of active
entries but didn't require everything to be rolled in from disk.
AFAICT, however, my choices with a Mnesia table are to keep everything
in memory or to keep everything rolled out to disk.
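As far as I can tell, the storage choices boil down to these, and none
of them gives a partial in-memory cache over a disk table (a sketch
only; the record is hypothetical):

-module(weights_db).
-export([init/0]).

-record(weight, {key, value}).

init() ->
    ok = mnesia:create_schema([node()]),
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(weight,
        [{attributes, record_info(fields, weight)},
         %% ram_copies:       whole table in RAM only
         %% disc_copies:      whole table in RAM and on disk
         %% disc_only_copies: whole table on disk only, no RAM cache
         {disc_only_copies, [node()]}]),
    ok.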
I also haven't been able to determine whether processes that are
waiting to receive a message can be rolled out to inactive memory.
There are some indications ("use enough processes, but not too many")
that they can't. This means I need to adapt my memory use rather
carefully to the systems being run on. If background processes keep
activating every live process to check its status, I could easily end
up with severe thrashing. And *THAT* will affect the design. If I
need to hand-manage the caching, then I lose a lot of the benefits
that I'm hoping to get from Erlang.
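If processes can't be paged out, one mitigation I might try (if I'm
reading the docs right) is erlang:hibernate/3, so that an idle
"process" at least drops its stack and shrinks its heap until the next
message arrives:

-module(dormant).
-export([start/1, loop/1]).

start(State0) ->
    spawn(fun() -> loop(State0) end).

loop(State) ->
    receive
        {adjust, Delta} ->
            loop(State + Delta);
        stop ->
            ok
    after 5000 ->
        %% Quiet for a while: discard the stack, compact the heap, and
        %% re-enter loop/1 when the next message arrives.
        erlang:hibernate(?MODULE, loop, [State])
    end.

I believe gen_server offers the same thing via its hibernate return
value or the {hibernate_after, Timeout} start option.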
The basic design calls for a huge number of "processes" doing n x m
communication, and the simple design calls for each "process" to be
able to send messages to every other "process", though only a subset
of the messages would actually be sent. My first sketch of a design
mapped each "process" to a separate Erlang process, but this doesn't
work, because Erlang doesn't like having that many processes. Even
the simple design required figuring on 1000 inputs and 1000 outputs
per "process", and probably well over 100,000 "processes". Most of
them would be idle most of the time, but all would need to be
"activatable" when messaged, and all would need to become dormant when
just waiting for a message. The idea is not a neural net, but it has
certain similarities.
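Before committing to a design I suppose I should just measure that
assumption, along these lines (counts and state size are placeholders):

-module(spawn_test).
-export([run/1]).

%% Spawn N processes that each hold roughly 1 KB of placeholder state
%% and sit in receive, then report the live process count, the
%% emulator's process limit, and the memory used by processes.
run(N) ->
    State = lists:seq(1, 64),                % roughly 1 KB per process
    _Pids = [spawn(fun() -> wait(State) end) || _ <- lists:seq(1, N)],
    timer:sleep(1000),
    {erlang:system_info(process_count),
     erlang:system_info(process_limit),      % raise with the +P flag if needed
     erlang:memory(processes)}.

wait(State) ->
    receive
        stop -> ok;
        _    -> wait(State)
    end.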
Now if I could actually have one Erlang process per "process", then
your proposal, which I recognize as the normal Erlang approach, would
make sense, but that isn't going to work. In that case the state could
be held in lots of separate variables, so there would be no need to
modify deeply nested items and not much would need to be copied.
As for KISS, that's a great approach, but it doesn't reveal scaling
problems. When implementing an approach one should always KISS, but
when deciding which approach to try it's important to pick one that
will still work when the system approaches its initial design goal.
On 02/07/2018 03:45 PM, zxq9@REDACTED wrote:
> On Wednesday, 7 February 2018 at 8:56:01 JST, Charles Hixson wrote:
>> ...so passing the state as function parameters would
>> entail huge amounts of copying. (Essentially I'd be modifying nodes
>> deep within trees.)
>>
>> Mutable state would allow me to avoid the copying, and the state is not
>> exported from the process...
> You seem to be confused a bit about the nature of mutability. If I set
> a variable X and in my service loop alter X, the next time the service
> loop recurses (loops) X will be a different value -- it will have
> mutated, but within the context of a single call of the service loop
> function the thing labelled X at the time of the function call will be
> immutable.
>
> -module(simple).
> -export([start/1]).
>
> start(X) ->
>     spawn(fun() -> loop(X) end).
>
> loop(X) ->
>     ok = io:format("X is ~p~n", [X]),
>     receive
>         {add, Y} ->
>             NewX = X + Y,
>             loop(NewX);
>         {sub, Y} ->
>             NewX = X - Y,
>             loop(NewX);
>         stop ->
>             ok = io:format("Bye!~n"),
>             exit(normal);
>         Unexpected ->
>             ok = io:format("I don't understand ~tp~n", [Unexpected]),
>             loop(X)
>     end.
>
>
> 1> c(simple).
> {ok,simple}
> 2> P = simple:start(10).
> X is 10
> <0.72.0>
> 3> P ! {add, 15}.
> X is 25
> {add,15}
> 4> P ! {sub, 100}.
> X is -75
> {sub,100}
>
>
> That is all there is to state maintenance, and this is how gen_servers
> work. This is also the form that has the least mysterious memory
> management model in the normal case, and the form that gives you all
> that nifty memory isolation and fault tolerance Erlang is famous for.
> Note that X is *not* copied every time we enter loop/1. If we send a
> message containing X to another process, though, *then* X is copied
> into the context of the process receiving that message.
>
> It doesn't matter at all what sort of a structure X is. Here it is a
> number, but it could be anything. Gigantic tuples chock full of maps
> and gb_trees and other process references and lists of things and
> queues and whatnot are the norm -- and none of this causes trouble in
> the normal case.
>
> As for mucking around in deep tree structures, altering nodes in trees
> does not necessarily entail making a copy of the whole tree. To you as
> a programmer there are two versions of the data which are effectively
> distinct, but that does not necessarily mean that they are two
> complete versions of the data in memory. The nature of copying (or
> whether copying happens at all under the hood) and how fast things can
> be garbage collected has to do with the nature of the task and what
> kind of data structures you are using. Because of immutability you
> *actually* get to share more data in the underlying implementation
> than otherwise.
>
> Fred provided a great explanation a while back here:
> http://erlang.org/pipermail/erlang-questions/2015-December/087040.html
>
> The general approach to performance issues -- whether memory, I/O
> bottlenecks, messaging bottlenecks, or raw thunk time -- is to start
> out writing your processes in the vanilla way using state variables in
> a loop and only stepping away from that when some extreme deficiency
> is demonstrated. If you are going to be spawning a ton of processes at
> once to do things then you've really got no way of knowing what is
> going to break first until you actually have some working code and can
> see it break for yourself. People get themselves into trouble with the
> process dictionary, ETS, NIFs, etc. all the time because the use cases
> often do not warrant the use of these techniques.
>
> So keep it simple. Write an example of what you want to do. Try it
> out. You might wind up just saturating your processor or memory bus
> way before you hit an actual space problem. If something breaks try to
> measure why -- but right now without telling anyone the kind of data
> you're dealing with or what kinds of operations you're doing or any
> example code that is known to break in a certain way at a certain
> scale we can't really give you much helpful advice.
>
> -Craig