[erlang-questions] beginner: Updating Data Structures

Mon Oct 29 18:26:05 CET 2007

While an Erlang system has the ability to update its program on the fly,
updating data structures on the fly seems a bit more difficult.  Unless you
can upgrade all nodes simultaneously, some nodes will be expecting the old
data structure while others then new.  My question therefore, is how to
structure my data?  Is there an approach that I am missing that is both
upgrade-friendly and ETS/Mnesia-compatible?  Please see the following
paragraphs for my analysis so far.

Suppose we are writing an inventory control application.  We decide to
create a record to contain our information about items in our inventory.
Not much to say about items, really, so we’re just going to hold the item’s
name in a record.  If something else ever needs to be tracked regarding
these items, we can always upgrade our data, right?

-record(item,

       { name

       }).

So we roll out our new inventory system to 3,000 nodes in our 25 warehouses
in 6 different countries, and everything works swimmingly.  For a while.

However, some time later, our accounting department decides we need a way to
value our inventory, and each item should have a value associated with it.
That way, we can calculate inventory value simply by multiplying value by
quantity at each location.  Unfortunately, we cannot now use our record
structure.  What to do?

Well, naïvely, we decide to just modify our item record.

-record(item,

       { name

       , value

       }).

This new record structure is incompatible with the old item record
structure, so we will also write some code that upgrades our items in the
system to the new structure when we upgrade the system.  Unfortunately,
unless our entire worldwide operation is upgraded all at once, any process
using the old structure will crash when it encounters a new-style item, and
vice versa.  Simultaneous upgrading all 3,000 nodes is impractical, so we’ll
have to rethink our original decision.

We could have created the original record structure with expansion slots
available for future use.

-record(item,

       { name

       , 'X1'

       , 'X2'

       , 'X3'

       }).

Now when Accounting wants us to add the value of the item to the item
record, we simply redefine one of the expansion slots.

-record(item,

       { name

       , value

       , 'X2'

       , 'X3'

       }).

This will not crash any process, since the size of resulting tuple is still
the same.  Unfortunately, we might run out of expansion slots if we don’t
allocate enough of them.  The example runs out of slots once Accounting also
gets their cost-basis and GL-class elements added, leaving us in the same
boat as before.  We simply delayed the inevitable.  We might get bright and
allocate the new slots hierarchically by department, for instance, so
Accounting gets only one slot for all of its information, and we define a
new record for the information in that slot.

-record(item,

       { name

       , acctg

       , 'X2'

       , 'X3'

       }).

-record(acctg_item,

       { value

       , cost_basis

       , gl_class

       , 'X1'

       , 'X2'

       , 'X3'

       }).

However, this approach once again only delays the inevitable.  When
Inventory Control and Manufacturing take up the other two expansion slots,
there is no room for Engineering’s data.  Plus, we have multiplied this
problem, since it occurs for each of our subrecords, which can also run out
of expansion slots.

Another alternative might be to have only one expansion slot, which is
filled in by the next version of the item record.

-record(item,

       { name

       , v2

       }).

-record(item_v2,

       { value

       , cost_basis

       , gl_class

       , v3

       }).

Now when we have more elements to add, we create an item_v3 record (with a
v4 element to accommodate future expansion), and so on.  The problems with
this, however, are that programmers need to know which version of the record
a certain data element is, and that by the time we go through a few score
enhancements and we’re up to version 68, it becomes quite cumbersome, and is
little better than had we used a linked list.

In fact, a linked list may well be better.  Instead of writing functions
with the record syntax, we can use lists.

item_value([item, _Name, Value | _])

       ->

                Value

              .

To retrieve the value, we only need to know its position in the list.  This
approach suffers from a couple of problems: (1) You need to know the
position of each element in the list; (2) This list will be repeated quite
frequently, so when you have 300 attributes your code will be brittle,
repetitive, and difficult to maintain.

Perhaps an alternative approach is to define each record version
independently, instead of additively as we tried earlier.

-record(item1,

       { name

       }).

-record(item2,

       { item

       , value

       , cost_basis

       , gl_class

       }).

Now in our code, we have versions of each function matching on the record
structure, and a function that handles the no-match case (in case you’re
running v2 code when you receive a v3 record).  Once again, however, we run
into a couple of obstacles: (1) We must implement a different version of
each function for each version of the record (this will get tiresome around
version 68); (2) new versions are not backward compatible: a node running a
previous version of the code will not recognize future-versioned data
structures, even though the only fields it needs are those from its own
version.

Let’s borrow a page from object-oriented design principles.  Why not let the
item provide its own methods for data access through functions contained on
the structure.  We define a record “class” which has two slots: one for the
methods, and one for the data.  By doing this, items carry around their own
methods and so it doesn’t really matter what version of an item something
is, so long as the item knows how to use its own data.  First we define some
infrastructure.

-module(class).

-export([invoke/3]).

-record(class,

       { methods

       , data

       }).

invoke(Method_ID, Object = #class{methods = Methods}, Args)

       ->

                Method = Methods(Method_ID),

                Method(Object, Args)

              .

To call a method on an object, syntax is simply “invoke(Method_ID, Object,
Args)”, such as

X = item:new ("X"),  % Create a new item "X"

X_Name = class:invoke(get_name, X, []),  % Returns "X"

Y = class:invoke(set_name, X, ["Y"]).  % Changes item name

This is great for encapsulation!  The implementation is straightforward.

-module(item).

-export([new/1]).

-include("class.hrl").

-record(item,

       { name

       }).

new(Name)

       ->

                #class{ methods = fun(get_name) -> fun get_name/2

                                  ;  (set_name) -> fun set_name/2

                                  end

                      , data    = #item{ name = Name }

                      }

              .

get_name(#class{data = #item{name = Name}}, _)

       ->

                Name

              .

set_name(Object = #class{data = Item}, [Name])

       ->

                Object#class{data = Item#item{name = Name}}

              .

Alas, there is a fly in this ointment, too.  While it would appear that the
method functions are being carried around along with the data (in fact, the
item tuple is “{class,#Fun<item.0.96410792>,{item,"X"}}”), those functions
are really not carried around from node to node.  Instead, Erlang only
carries around references to the functions.  This means if this item shows
up on a node where the function does not exist, an error will occur when a
method is invoked.

The fact that you cannot safely sling functions around with your data from
node to node indicates that perhaps we need a very simple interface with
functions that will never change.  Maybe instead of using records at all, we
can use basic OTP library functions to associate item properties with their
values.  Sounds kind of like what proplists were designed for.

X = [{name, "X"}],  % Create a new item "X"

X_Name = proplists:get_value(name, X),  % Returns "X"

Y = [{name, "Y"} | proplists:delete(name, X1)].  % Changes item name

A similar effect can be had with dicts, with the decision probably to be
made based on performance.  (Not only that, but the decision can be made
dynamically at run-time, since there are functions for converting between
the two.)

X = dict:from_list([{name, "X"}]),  % Create a new item "X"

X_Name = dict:fetch(name, X),  % Returns "X"

Y = dict:store(name, "Y", X).  % Changes item name

This approach has the advantage of being completely backward-compatible with
respect to my code-base.  Should a later version of our inventory
application add a property, it will not change the operation of any previous
version.  Once again, however, there are problems with this approach: (1)
property values cannot be used for matching in function definitions; (2)
these structures are not easily indexed: ETS and Mnesia require record data
types.  While Disadvantage 1 might be easily managed by performing lookups
and conditionals within the function, Disadvantage 2 is probably
intractable.

To repeat my question, gentle readers, how ought I structure my data?  Is
there an approach that I am missing that is both upgrade-friendly and
ETS/Mnesia-compatible?

Thank-you.

Cheers,

David

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20071029/d8b7aee4/attachment.htm>