[erlang-questions] Private ets table vs. Process directory
Charles Hixson
charleshixsn@REDACTED
Thu Feb 8 07:28:32 CET 2018
That works fine in the simple case, but I'm contemplating repeatedly
adjusting weights deep within a nested data structure. Your approach
would result in creating an altered copy of the structure on each
recursion. This is probably only about 1KB or so of information, so
doing it a few times isn't a problem, but doing it millions of times
quickly becomes one.
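For concreteness, here is a minimal sketch of the kind of update I
mean (the nested-map layout is only illustrative, not my actual
structure):

-module(weight_sketch).
-export([adjust/4]).

%% Adjust one weight two levels down in a nested map.  Every call hands
%% back a new version of the outer structure, which is what worries me
%% about doing this millions of times.
adjust(Layer, Node, Delta, State) ->
    maps:update_with(Layer,
        fun(Nodes) ->
            maps:update_with(Node, fun(W) -> W + Delta end, Nodes)
        end,
        State).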
This can be addressed by either ets or the process dictionary, both of
which allow the internal structure to be modified safely. With the
process dictionary it's safe because the information never leaves the
process (except for i/o, which must be special-cased). A private ets
table can likewise handle it without problems. So can a global ets
table, if a unique, process-specific id (NOT the pid, since this needs
to survive restarts) is used as part of the key. So those three
methods would work. The question in my mind is how to predict the
tradeoffs as the system scales up. I suspect that the process
dictionary would use the least memory, though possibly the global ets
table would. A private ets table seems the most natural approach, but
to my naive eyes it looks as if it would scale poorly with respect to
memory use.
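Roughly, those three options would look something like this (module
name and key layout are just illustrative):

-module(state_options).
-export([pd_demo/0, private_ets_demo/0, shared_ets_demo/1]).

%% 1. Process dictionary: the term never leaves the owning process.
pd_demo() ->
    put({weight, layer1, n1}, 0.5),
    put({weight, layer1, n1}, get({weight, layer1, n1}) + 0.1),
    get({weight, layer1, n1}).

%% 2. Private ets table: only the owning process can read or write it.
private_ets_demo() ->
    Tab = ets:new(weights, [set, private]),
    true = ets:insert(Tab, {{layer1, n1}, 0.5}),
    [{_, W}] = ets:lookup(Tab, {layer1, n1}),
    true = ets:insert(Tab, {{layer1, n1}, W + 0.1}),
    ets:lookup(Tab, {layer1, n1}).

%% 3. One shared, named ets table; a stable unit id (not the pid, so it
%%    survives restarts) forms part of the key.
shared_ets_demo(UnitId) ->
    all_weights = ets:new(all_weights, [set, public, named_table]),
    true = ets:insert(all_weights, {{UnitId, layer1, n1}, 0.5}),
    ets:lookup(all_weights, {UnitId, layer1, n1}).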
What I'd really like is a Mnesia setup that kept a cache of active
entries but didn't require everything to be rolled in from disk.
AFAICT, however, my choices with a Mnesia table are to keep everything
in memory or to keep everything rolled out to disk.
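As far as I can tell, the storage choices boil down to these, and none
of them gives a partial in-memory cache over a disk table (a sketch
only; the record is hypothetical):

-module(weights_db).
-export([init/0]).

-record(weight, {key, value}).

init() ->
    ok = mnesia:create_schema([node()]),
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(weight,
        [{attributes, record_info(fields, weight)},
         %% ram_copies:       whole table in RAM only
         %% disc_copies:      whole table in RAM and on disk
         %% disc_only_copies: whole table on disk only, no RAM cache
         {disc_only_copies, [node()]}]),
    ok.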
I also haven't been able to determine whether processes that are
waiting to receive a message can be rolled out to inactive memory.
There are some indications ("use enough processes, but not too many")
that they can't. This means I need to adapt my memory use rather
carefully to the systems being run on. If background processes keep
activating every live process to check its status, I could easily end
up with severe thrashing. And *THAT* will affect the design. If I
need to hand-manage the caching, then I lose a lot of the benefits
that I'm hoping to get from Erlang.
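If processes can't be paged out, one mitigation I might try (if I'm
reading the docs right) is erlang:hibernate/3, so that an idle
"process" at least drops its stack and shrinks its heap until the next
message arrives:

-module(dormant).
-export([start/1, loop/1]).

start(State0) ->
    spawn(fun() -> loop(State0) end).

loop(State) ->
    receive
        {adjust, Delta} ->
            loop(State + Delta);
        stop ->
            ok
    after 5000 ->
        %% Quiet for a while: discard the stack, compact the heap, and
        %% re-enter loop/1 when the next message arrives.
        erlang:hibernate(?MODULE, loop, [State])
    end.

I believe gen_server offers the same thing via its hibernate return
value or the {hibernate_after, Timeout} start option.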
The basic design calls for a huge number of "processes" doing n x m
communication, and the simple design calls for each "process" to be
able to send messages to every other "process", though only a subset
of the messages would actually be sent. My first sketch of a design
mapped each "process" to a separate Erlang process, but this doesn't
work, because Erlang doesn't like having that many processes. Even
the simple design required figuring on 1000 inputs and 1000 outputs
per "process", and probably well over 100,000 "processes". Most of
them would be idle most of the time, but all would need to be
"activatable" when messaged, and all would need to become dormant when
just waiting for a message. The idea is not a neural net, but it has
certain similarities.
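Before committing to a design I suppose I should just measure that
assumption, along these lines (counts and state size are placeholders):

-module(spawn_test).
-export([run/1]).

%% Spawn N processes that each hold roughly 1 KB of placeholder state
%% and sit in receive, then report the live process count, the
%% emulator's process limit, and the memory used by processes.
run(N) ->
    State = lists:seq(1, 64),                % roughly 1 KB per process
    _Pids = [spawn(fun() -> wait(State) end) || _ <- lists:seq(1, N)],
    timer:sleep(1000),
    {erlang:system_info(process_count),
     erlang:system_info(process_limit),      % raise with the +P flag if needed
     erlang:memory(processes)}.

wait(State) ->
    receive
        stop -> ok;
        _    -> wait(State)
    end.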
Now if I could actually have one Erlang process per "process", then
your proposal, which I recognize as the normal Erlang approach, would
make sense, but that isn't going to work. In that case the state could
be held in lots of separate variables, so there would be no need to
modify deeply nested items and not much would need to be copied.
As for KISS, that's a great approach, but it doesn't reveal scaling
problems. When implementing an approach one should always KISS, but
when deciding which approach to try it's important to pick one that
will still work when the system approaches its initial design goal.
On 02/07/2018 03:45 PM, zxq9@REDACTED wrote:
> On Wednesday, 7 February 2018 at 8:56:01 JST, Charles Hixson wrote:
>> ...so passing the state as function parameters would
>> entail huge amounts of copying. (Essentially I'd be modifying nodes
>> deep within trees.)
>>
>> Mutable state would allow me to avoid the copying, and the state is not
>> exported from the process...
> You seem to be confused a bit about the nature of mutability. If I set
> a variable X and in my service loop alter X, the next time the service
> loop recurses (loops) X will be a different value -- it will have
> mutated, but within the context of a single call of the service loop
> function the thing labelled X at the time of the function call will be
> immutable.
>
> -module(simple).
> -export([start/1]).
>
> start(X) ->
>     spawn(fun() -> loop(X) end).
>
> loop(X) ->
>     ok = io:format("X is ~p~n", [X]),
>     receive
>         {add, Y} ->
>             NewX = X + Y,
>             loop(NewX);
>         {sub, Y} ->
>             NewX = X - Y,
>             loop(NewX);
>         stop ->
>             ok = io:format("Bye!~n"),
>             exit(normal);
>         Unexpected ->
>             ok = io:format("I don't understand ~tp~n", [Unexpected]),
>             loop(X)
>     end.
>
>
> 1> c(simple).
> {ok,simple}
> 2> P = simple:start(10).
> X is 10
> <0.72.0>
> 3> P ! {add, 15}.
> X is 25
> {add,15}
> 4> P ! {sub, 100}.
> X is -75
> {sub,100}
>
>
> That is all there is to state maintenance, and this is how gen_servers
> work. This is also the form that has the least mysterious memory
> management model in the normal case, and the form that gives you all
> that nifty memory isolation and fault tolerance Erlang is famous for.
> Note that X is *not* copied every time we enter loop/1. If we send a
> message containing X to another process, though, *then* X is copied
> into the context of the process receiving that message.
>
> It doesn't matter at all what sort of a structure X is. Here it is a
> number, but it could be anything. Gigantic tuples chock full of maps
> and gb_trees and other process references and lists of things and
> queues and whatnot are the norm -- and none of this causes trouble in
> the normal case.
>
> As for mucking around in deep tree structures, altering nodes in trees
> does not necessarily entail making a copy of the whole tree. To you as
> a programmer there are two versions of the data which are effectively
> distinct, but that does not necessarily mean that they are two
> complete versions of the data in memory. The nature of copying (or
> whether copying happens at all under the hood) and how fast things can
> be garbage collected has to do with the nature of the task and what
> kind of data structures you are using. Because of immutability you
> *actually* get to share more data in the underlying implementation
> than otherwise.
>
> Fred provided a great explanation a while back here:
> http://erlang.org/pipermail/erlang-questions/2015-December/087040.html
>
> The general approach to performance issues -- whether memory, I/O
> bottlenecks, messaging bottlenecks, or raw thunk time -- is to start
> out writing your processes in the vanilla way using state variables in
> a loop and only stepping away from that when some extreme deficiency
> is demonstrated. If you are going to be spawning a ton of processes at
> once to do things then you've really got no way of knowing what is
> going to break first until you actually have some working code and can
> see it break for yourself. People get themselves into trouble with the
> process dictionary, ETS, NIFs, etc. all the time because the use cases
> often do not warrant the use of these techniques.
>
> So keep it simple. Write an example of what you want to do. Try it
> out. You might wind up just saturating your processor or memory bus
> way before you hit an actual space problem. If something breaks try to
> measure why -- but right now without telling anyone the kind of data
> you're dealing with or what kinds of operations you're doing or any
> example code that is known to break in a certain way at a certain
> scale we can't really give you much helpful advice.
>
> -Craig