[erlang-questions] Theoretically Stuck

Wed Sep 6 05:46:33 CEST 2006

Rudolph van Graan had questions about modeling data and erlang...

I was a bit confused when I read your post because you have mixed 
several concepts together.  The post title indicated to me you were 
interested in expressing a problem most closely to its description, 
while the body of the message indicated you were concerned with 
performance.

 > Restatement of some of the original message:

The original problem posed involved database records which evolved over 
time.

-record(person, {name, surname}).

What happens when new fields are added to some records?

 > End restatement

I would pose a few questions for you to consider so that you can decide 
what your primary goal is:

1) Suppose you had no programming language, just an SQL database.
    a) How would you model the original person table?
    b) What would you do when a new column is needed?
    c) Would you really create a separate table every time you added fields?

2) If you have three people who all have name, surname, but they each 
have different other attributes, will you use them interchangeably in 
some method invocations if you were using an OO language?

3) If you had 1M people, each of which had different attributes, would 
you expect to be able to use them interchangeably?

4) Would you expect performance to be the same in all three cases?

There are several concepts intertwined:

1) How do I store data in a database that I know will migrate to a new 
schema?

2) What do I do when the schema changes?

3) How do I deal with statically typed languages when my data is dynamic?

4) Inheritance seems like an efficient way to avoid repeating code (DRY 
principle).  How can I use it in erlang?

5) What whizzy language feature of erlang makes these problems go away?

What are you more worried about?

A) The problem is accurately modeled.
B) Code clarity, lack of repetition and ease of modification.
C) Performance and size of application.
D) Ability to morph the data structures on a per instance basis.
E) Migration of the architecture over the course of years.
F) Database management and performance.
G) Language specific constructs that make the code minimal and beautiful.

Not all of these are orthogonal, but you need to constrain the problem.  
Designing and coding consist of a series of tradeoffs.  You can't get 
all of A-G with a single best design or coding style.

Your post title said to me, "Forget performance, what is the _correct_ 
way philosophically to overcome my problem" (read A is most important, 
although B could apply just as well).  If the main problem is that the 
data model changes regularly and the instances can all be differently 
shaped (as in D), then:

- Use proplists or dictionaries.  You may want to store your objects in 
an erlang term file (read back with file:consult/1) or in dets.  Avoid 
using a structured database like SQL.
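
As a rough sketch of the proplist approach (the module name person_props, 
its function names and the sample keys are all made up for illustration), 
each person is just a list of {Key, Value} pairs, so two instances need 
not have the same shape:

```erlang
-module(person_props).
-export([new/2, set/3, attr/2, attr/3, load/1]).

%% A person is a plain property list: [{Key, Value}].
%% Only name and surname are assumed; every other key is optional.
new(Name, Surname) ->
    [{name, Name}, {surname, Surname}].

%% Add or replace one attribute on a single instance.
set(Key, Value, Person) ->
    [{Key, Value} | proplists:delete(Key, Person)].

%% Look up an attribute; returns undefined when it is absent.
attr(Key, Person) ->
    proplists:get_value(Key, Person).

%% Look up an attribute with a caller-supplied default.
attr(Key, Person, Default) ->
    proplists:get_value(Key, Person, Default).

%% Read a term file holding one person proplist per term, e.g.
%%   [{name, "Ann"}, {surname, "Lee"}, {shoe_size, 7}].
load(File) ->
    {ok, People} = file:consult(File),
    People.
```

The cost is that you can no longer pattern match on a fixed set of 
fields; every access is a lookup by key.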

If the data changes less frequently, but the attribute set changes when 
it does, and you want efficient database management (as in F):

- Use an OO database with schema and object versions.  I know of no 
erlang adaptors, so pick a different language.

If your case is E, don't worry so much about rigid data structures:

- Use records "properly" (i.e., normalize the SQL tables and have 
corresponding records for each table), and migrate all data in your 
schema when it changes.  You could do this online or offline depending 
on your requirements (or even node by node).  Just write the extra code 
every time the schema changes; code is less important than a clean 
database structure.
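
The "extra code" for one such change might look like the sketch below.  
It assumes the person record gains a third field; the email field and the 
module name person_migrate are hypothetical.  It relies only on the fact 
that a record is a tagged tuple, so old rows can be recognized by their 
size:

```erlang
-module(person_migrate).
-export([upgrade/1, upgrade_all/1]).

%% New shape of the record.  The email field is an assumed addition.
-record(person, {name, surname, email}).

%% Old rows were written as #person{name, surname}, which is the
%% 3-tuple {person, Name, Surname}.  Match on that tuple shape.
upgrade({person, Name, Surname}) ->
    #person{name = Name, surname = Surname, email = undefined};
%% Rows already in the new shape pass through unchanged.
upgrade(#person{} = P) ->
    P.

%% Migrate a whole list of rows, old and new mixed.
upgrade_all(Rows) ->
    [upgrade(R) || R <- Rows].
```

Because upgrade/1 leaves new rows alone, it can be run more than once, 
which helps if you migrate node by node.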

For option C, speed of access and reusable functions are more 
important.  Put the code for each additional feature in its own module.  
Create your original records using a grouping record:

-record(person, {std_attrs, moda_attrs, modb_attrs}).

Then in each of std, module a and module b you can define a record that 
is consistent for the functions coded in the module.  To extend, add a 
new module and a new field on the person record.

%% In module std:
-module(std).
-record(std_attrs, {name, surname}).
%% In module a:
-record(moda_attrs, {type}).
%% In module b:
-record(modb_attrs, {stuff}).

This can be mapped to an SQL database structure that changes 
periodically as in case E.  Case F is probably best covered in this way 
if you don't want to use records/SQL "properly".  Here you aren't stuck 
using slow proplists or hashtables, but can tune and structure the data 
for efficient functional access inside each module.
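
To make the grouping record concrete, here is a sketch of what module a 
might contain.  The module name moda and the functions new_person/2, 
set_type/2 and type/1 are invented for the example; in a real system the 
person record would live in a shared header file rather than being 
repeated in each module:

```erlang
-module(moda).
-export([new_person/2, set_type/2, type/1]).

%% Grouping record from the text above, plus the two sub-records
%% this sketch needs.
-record(person, {std_attrs, moda_attrs, modb_attrs}).
-record(std_attrs, {name, surname}).
-record(moda_attrs, {type}).

%% Only the standard slot is filled in; the others stay undefined.
new_person(Name, Surname) ->
    #person{std_attrs = #std_attrs{name = Name, surname = Surname}}.

%% This module reads and writes only its own slot.
set_type(Type, #person{} = P) ->
    P#person{moda_attrs = #moda_attrs{type = Type}}.

%% Pattern matching reaches straight into the sub-record.
type(#person{moda_attrs = #moda_attrs{type = Type}}) -> Type;
type(#person{moda_attrs = undefined}) -> undefined.
```

Each module can still pattern match on its own fields, which is what you 
give up with proplists.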

ROK's dictionaries cover G.  Right now they are not available.  If you 
try one of the above techniques, maybe you will learn enough about the 
problem to be able to contribute to the effort to implement them, or at 
least will have specific examples of code savings and clarity that might 
help convince the OTP team to implement them.

There are probably lots of other options available (e.g., implement 
ROK's dictionaries using the equivalent existing functions he 
describes).  The limiting factor for implementation will always be the 
choice of data structure.

I don't think your example of using OO inheritance is a good approach.  
You will eventually have a mess of the type system and will have to 
restructure everything once it is already spaghetti.  If objects can be 
used interchangeably, they can be subclasses, but if you add or subtract 
methods you will have a heck of a time dealing with a collection of 
randomly related person objects, some of which have dispatch methods and 
some of which don't.

(The approach you describe is what I would call implementing dynamic 
data by hacking a static typing system.  A more philosophically correct 
way would be to implement schema versioning and a declarative attribute 
set using a hash and then encapsulate the version and the instance in a 
single object -- which could be done in erlang with ets or a proplist and 
a record defining the version and the instance attributes at the expense 
of pattern matching on fields and values.)
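
A minimal sketch of that versioned-object idea, using a proplist for the 
attributes.  The record name obj, the module name versioned and the v1 to 
v2 change (adding an email attribute) are assumptions made up for the 
example:

```erlang
-module(versioned).
-export([new/2, attr/2, upgrade/1]).

%% Hypothetical wrapper: schema version plus an attribute proplist.
-record(obj, {vsn, attrs = []}).

new(Vsn, Attrs) ->
    #obj{vsn = Vsn, attrs = Attrs}.

%% Attributes are looked up by key, not matched as record fields.
attr(Key, #obj{attrs = Attrs}) ->
    proplists:get_value(Key, Attrs).

%% Step one version at a time until the object is current.
%% v1 -> v2 adds an (assumed) email attribute.
upgrade(#obj{vsn = 1, attrs = A} = O) ->
    upgrade(O#obj{vsn = 2, attrs = [{email, undefined} | A]});
upgrade(#obj{vsn = 2} = O) ->
    O.
```

You can match on the version in a function head, but not on the 
individual attributes, which is the expense mentioned above.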

I would take these steps in architecting a new system:

1) Identify the key characteristics ranked in importance (as in a subset 
of A-G above)
2) Determine what features you really need from a database (can you use 
flat files, dets, or term files read with consult?).
3) Don't worry about performance or efficiency until you have the code 
working.
4) Ok, if you know you need 1M objects in memory at once, you can think 
about performance, but really don't sweat it yet.
5) Measure and tweak performance
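
For step 5, timer:tc/3 is usually enough for a first measurement.  The 
sketch below (module and function names are made up) times a single 
worst-case proplist lookup so you can see whether it actually matters at 
your data size before restructuring anything:

```erlang
-module(measure).
-export([lookup_time/1]).

%% Build N {Key, Value} pairs, then time one lookup of the last key,
%% which is the worst case for a list scan.
%% Returns {Microseconds, Value}.
lookup_time(N) ->
    Props = [{K, K * 2} || K <- lists:seq(1, N)],
    timer:tc(proplists, get_value, [N, Props]).
```

A single run is noisy; repeat it and compare against the alternative 
data structure you have in mind rather than trusting one number.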

Generally you want to start with goals of Clarity, Succinctness and then 
Performance.  Maintainability will come along if you can achieve those.

jay