How to get configuration data to a large number of threads?

Wed Oct 27 11:10:06 CEST 2004

> We need to be able to handle large
> volumes of transactions in a short time in bursts (SMS system using
> oserl http://oserl.sourceforge.net/)

> Unfortunately the entire config set is needed
> for this and it could be as large as 5k (worst case, 1-2k is probably
> more realistic.)

> Some quick profiling shows that we can expect in the order of thousands
> of processes active at the same time, making the memory overhead a
> problem.

   Now I really have to ask "have you done any measurements?" - or are
you guessing the outcome of an experiment that you have not yet
performed?

   If I interpret "thousands of processes" as meaning (say) 5000
processes and take your realistic case as 2K, then the total memory
requirement is about 10 MBytes. This is not a lot of data. Had you
said "hundreds of thousands" of processes, it would be a different
story.
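
   A rough way to actually perform that experiment is sketched below.
It is only an illustration: the module name config_mem and the function
measure/2 are made up for this example, and the figure it reports
includes ordinary per-process overhead (heap, stack) as well as the
copied configuration term.

```erlang
-module(config_mem).
-export([measure/2]).

%% Spawn N processes that each hold their own copy of Config, then
%% report roughly how much extra process memory that costs in bytes.
measure(N, Config) ->
    Before = erlang:memory(processes),
    Parent = self(),
    Pids = [spawn(fun() ->
                          Parent ! {ready, self()},
                          hold(Config)
                  end) || _ <- lists:seq(1, N)],
    %% Wait until every process is up and holding its copy.
    [receive {ready, Pid} -> ok end || Pid <- Pids],
    After = erlang:memory(processes),
    [Pid ! stop || Pid <- Pids],
    Total = After - Before,
    {total_bytes, Total, per_process_bytes, Total div N}.

%% Keep Config alive until told to stop.
hold(Config) ->
    receive
        stop -> Config
    end.
```

   For example, config_mem:measure(5000, lists:seq(1, 250)) gives a
ballpark figure for 5000 processes each holding a smallish term. Build
the term at run time, as here, since the VM may share compile-time
literals instead of copying them.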

> In order to keep the processing speed as fast as possible I want as many
> parallel processes as I can manage.  Obviously if I can get clever with
> the config data, this would mean more processes.  Failing that, I will
> have to place a lower limit on the number of processes so that they will
> fit into memory or to prevent memory churn.

   One way of "being clever with the configuration data" might be to
organise it into a number of small servers, each of which answers
queries about a specific subset of the global configuration data.
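
   A minimal sketch of that idea, with one registered process per
subset, could look like this. The module name config_shards, the
message format and the {Key, Data} list representation are all
assumptions made for the example, not part of any existing system.

```erlang
-module(config_shards).
-export([start/1, get_data/2]).

%% Subsets is a list of {Name, KeyValues}. Each subset gets its own
%% small server process, registered under Name.
start(Subsets) ->
    lists:foreach(
      fun({Name, KeyValues}) ->
              Pid = spawn(fun() -> loop(KeyValues) end),
              true = register(Name, Pid)
      end, Subsets),
    ok.

%% Ask the server responsible for subset Name about Key.
get_data(Name, Key) ->
    Ref = make_ref(),
    Name ! {get, self(), Ref, Key},
    receive
        {Ref, Data} -> Data
    end.

%% Each server only ever holds its own subset of the configuration.
loop(KeyValues) ->
    receive
        {get, From, Ref, Key} ->
            From ! {Ref, proplists:get_value(Key, KeyValues)},
            loop(KeyValues)
    end.
```

   A client then only ever receives the one value it asked for, and a
busy subset does not slow down queries against the others.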

   No matter how you do things you should abstract away from the
details of *how* you get the configuration data.

   What I would do is as follows:

   1) Define a configuration API

      config:get_data(Key) => Data

          do you need more than this :-)

   2) Write the most beautiful and inefficient config.erl you can think of

   3) Measure

      If fast enough - hooray
      If not - write an uglier config.erl
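
   As an illustration of steps 1 and 2, a deliberately naive config.erl
might be a single registered process holding a list of {Key, Data}
pairs. Only config:get_data/1 comes from the API above; start/1, the
message format and the list representation are assumptions for this
sketch.

```erlang
-module(config).
-export([start/1, get_data/1]).

%% KeyValues is a list of {Key, Data} pairs (assumed format).
start(KeyValues) ->
    Pid = spawn(fun() -> loop(KeyValues) end),
    true = register(config, Pid),
    ok.

%% The whole public API: callers never see how the data is stored.
get_data(Key) ->
    Ref = make_ref(),
    config ! {get, self(), Ref, Key},
    receive
        {Ref, Data} -> Data
    end.

%% Simplest possible server: a linear lookup on every request.
loop(KeyValues) ->
    receive
        {get, From, Ref, Key} ->
            From ! {Ref, proplists:get_value(Key, KeyValues)},
            loop(KeyValues)
    end.
```

   For step 3, timer:tc(config, get_data, [small]) returns
{Microseconds, Result} and gives a first number to argue about. If it
turns out to be too slow, only the inside of config.erl has to change.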

   You might also like to think about how long clients hold the data.

   Vsn1:

   foo() ->
        SomeBigDataStructure = config:get_data(big),
        loop(SomeBigDataStructure).

   loop(SomeBigDataStructure) ->
        receive
           ... ->
              loop(SomeBigDataStructure)
        end.

   May not be a good idea: every process keeps the big data structure
for as long as it lives.

   Vsn2:

   foo() -> loop().

   loop() ->
        receive
           Msg1 ->
                SomeSmallDataStructure = config:get_data(small),
                ... some local code which uses SomeSmallDataStructure ...
                loop();
           Msg2 ->
                ...
        end.

   This retains the configuration data you need for only a small amount
of time. After calling loop() at the end of the Msg1 clause, the data
will be available for garbage collection (or possibly earlier, depending
upon the smartness of the compiler).

   Note the trade-off between caching all the configuration data and
keeping it around until you need it (Vsn1), and fetching a small amount
of data, using it and then dropping it (Vsn2).

   To allow any possibilities for optimisation it's probably better to
work with many different keys in the configuration data, so you can
get small amounts of related data when you need it rather than
everything.
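
   To make the Vsn2 style with small keys concrete, here is a sketch of
a worker that asks for one small value per message. The module name
sms_worker, the key max_len and the message shapes are invented for the
example. The lookup function is passed in as a fun so that the sketch
stands on its own; in practice it would be something like
fun config:get_data/1.

```erlang
-module(sms_worker).
-export([start/1]).

%% GetData is a fun(Key) -> Data, e.g. fun config:get_data/1 (assumed).
start(GetData) ->
    spawn(fun() -> loop(GetData) end).

loop(GetData) ->
    receive
        {send_sms, From, Text} ->
            %% Fetch only the one small item this message needs.
            MaxLen = GetData(max_len),
            Reply = case length(Text) =< MaxLen of
                        true  -> ok;
                        false -> {error, too_long}
                    end,
            From ! {self(), Reply},
            %% MaxLen is not carried into the next iteration, so it
            %% can be garbage collected.
            loop(GetData);
        stop ->
            ok
    end.
```

   Between messages the worker holds nothing but the fun, so thousands
of idle workers cost very little configuration memory.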

> From the comments up to now it seems that I am most likely going to have
> to be satisfied with passing the config data as a parameter to the
> function that executes in each thread and rely on the GC to keep things
> sane.
>

But this is the "keep everything" solution.

/Joe