[erlang-questions] Large sparse/dense matrix server

Mon Jun 25 01:16:52 CEST 2012

Greetings,

I have mostly a mental question as, although been using erlang for a few
minor servers for a few years now, this is my first 'large' project that
does not seem to fit functional programming perfectly well for efficiency
reasons, so I am curious at any and all thoughts that may help the design
of this system.

I have been writing an Erlang server that handles multiple very large
matrices.  By very large they extend to an unbounded distance, but 'chunks'
of it are only generated upon initial access of said chunk.  Each cell is
an integer of 32 bits in size (although I may need to expand it to 48 or 64
bits soon).  Effectively the cells will be read quite often in large
neighborhoods of chunks and sent out.  Occasionally there will be a single
cell change requested by a client, but this cell change may cause a ripple
of other nearby cell changes and updates need to be sent out to clients
that are watching a chunk for updates.  They need to be persisted to disk,
but it is not imperative that it is flushed on every change, even being
cached for for a couple of minutes would not be bad (although no more than
a few seconds is best).

Right now I have it set up like this:

 - If a client watches an area:
- - All chunks in such an area are loaded
 - - client gets a full dump of all the chunk data in the area

- When a chunk is loaded, a new process is started and it loads a file from
disk that contains its gzipped saved state of its entire chunk area
 - - If no state file exists then it generates one from its initial
generation function, then stores that data to disk so the costly generation
is not done again
 - - Currently a chunk is 16x16x16=4096 cells in size and multiple chunks
are stored in a single gzipped file to prevent tiny file creation explosion
on disk
- - If nothing is watching a chunk anymore, then after a time delay based
on various things (especially proximity to other nearby loaded chunks and
if a time for an inside cell is soon to expire) then the process will do a
full flush of its data to disk then die.

- If a cell is requested to change then the change request is sent to the
chunk handling process
 - - The chunk handling process runs a few things in turn when a cell
request change is set:
- - - That the client is authorized to make a change to that cell (a 3d
range check for permissions in another process)
 - - - Pre-Listeners for that cell are called to allow them to veto a change
- - - The cell is changed, the specific integer value of the cell, the
first 16 bits are used as an index lookup for a module name
 - - - - The module for that new cell type is called, it returns a full
cell sized integer of what needs to be stored or if it needs to deny the
change
 - - - - The module for that new cell type may also set extra data for that
chunk that is a key/value with the key being the {Matrix, X, Y, Z} of the
cell changed
- - - - Any pending timers in the scheduler for the cell are killed.
- - - Post-Listeners for that cell are called that allow them to react to
that change, potentially making other changes to other cells following this
same pattern
- - - - The 'client' for rippling changes is basically just {'system',
Client} to signify that it is a system caused change, but initially started
by whatever Client

- A scheduler, sometimes a cell wants to update itself, sometimes once
every second, perhaps once in 5 seconds then again in a minute then never
again until its state changes again or so
- - When time elapses, which may not happen 'instantly' when it wants it to
happen but 'soon', the cell id is looked up and its module update function
is called with the coordinates, full cell value, delta of how 'late' the
callback is, and any data given to the scheduler to be passed back in.

Pre-listeners are synchronous callback funs, they are supposed to be short
and very simple checks, but of course may still require a call to other
processes, authentication, etc... so it has not yet always gone as planned.
Post-listeners are PIDs that messages are sent to, to enforce that they
must happen 'later', although I am about to change that to submit an update
to the scheduler instead.

There are a few other minor things, but this is the big part.  Right now it
is running in two Erlang VM's on one 6-core server with 16 gigs of ram,
4.2TB storage, and it is quite not efficient (it grew up in this way, needs
a redesign...).

First change I was thinking of was potentially using riak-core to handle
creation and management of the chunk processes, since I do want to grow
this server from two VM's on one computer to multiple VM's across multiple
computers.  My main issue with this is that I would prefer 'nearby' chunks
to try to be loaded on the same node, and far away chunks to be on
different nodes, some areas in the entire matrix may have massive areas
loaded so that would need to be distributed as well as possible.  And of
course just as timers/schedulers elapse new chunks will need to be loaded
to do a rather miniscule amount of processing then to unload again.

Second change, to facilitate the non-central storage is to use Riak (unless
there is something better suited for this purpose) as the storage.  A
Key/value row in Riak would hold an entire chunk, plus some extra handling
data as the value, and the key would just be the chunk coordinates, a
4-tuple {MatrixLayer, X/16, Y/16, Z/16} where MatrixLayer is just one of
multiple different unbounded matrices of some erlang term (likely atom,
sometimes integers, but may be something else for others later on).

Third change (maybe), as costly as it is to do initial generation, maybe I
only want to store a chunk of there were any changes in it, meaning I may
want to prevent initial updates unless caused by a client, disk space has
not yet been an issue however, no where close actually.

Plus who knows what else.  Preferably the system needs to always remain up,
and hot code loading has been quite useful here (not been using release
updates, probably should, but right now this is still a 'dev' system so I
actually do my programming on the live system, later on when actually
rolled out for its heavy usage that will change).

I have even thought of something as radical as doing a row per cell in
riak, I could put the access data and all in the function for it, it would
need read access to its neighbor cells, but not write (as writes would just
be pushed for 'later'), and I 'think' I can do that in riak using its
mapreduce interfaces, but not checked, and I need it to be quite fast.
However it sounds costly in that it would be difficult to batch cell
updates together as I do now (especially at generation time).

I need the time guarantee of the system to be within a few seconds response
time for the clients, timer callbacks, pretty much everything for a very
high percentile.  And I can expand out to more servers, however
communication overhead seems like it may get substantial rather quickly
based on past projects with Riak.

So what would be the best design for such a system?  And the system itself
has a lot of parts and a lot of rather nasty interactions, I am not opposed
to a complete rewrite if it will help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20120624/74bae5ba/attachment.htm>