Making Erlang records more reliable

Thu Sep 22 07:47:35 CEST 2005

Improving the reliability of Erlang records.

Now that people want to print Erlang records in a way that reflects the
original record declaration, we are finding that many record names are
used over and over again (especially 'state'), so that it is not possible
to figure out which declaration to use given just the record name and arity.

The model for Erlang records goes like this:

1.  Record forms are introduced by a declaration
    -record(Name, {Field1(=Default1),...,FieldN(=DefaultN)}).
    They are a function of the preprocessor rather than being a real
    built-in data type.

2.  A record with N fields is represented by a tuple with N+1 elements,
    where the last N elements of the tuple correspond in order to the N
    fields of the record.  The first field of the record holds a tag.
    The tag is supposed to identity the record and make it unlikely that
    a record pattern will accidentally unify with a record of a
    different type.

3,  A record tag is an atom.

4.  A record tag is the declared name of the record type.

We could change each of these decisions.

1.  We could introduce a proper type for records.  I have written an
    unpublished paper about this, but not implemented it.  Basically,

        <Name{Field1=Val1,...,FieldN=ValN}>

    is like a tuple

        {Descriptor,Val1,...,ValN}

    except that Descriptor points to something like a tuple

	{Name,Field1,...,FieldN},

    the field names having first been sorted in both cases.  This is as
    space-efficient as current Erlang records, and rather more robust
    (if you add a field to a record, code that matches against it just
    keeps working), but it is not as time-efficient because name
    matching happens at run time.  However, fetching k fields out of a
    record with n fields is still O(n+k), so when most of the fields are
    of interest it's not too bad.

2.  There is no law of nature that says the tag information has to be a
    single field.  We could use {Name,Source,Field1,...,FieldN} where
    Source is some kind of code derived from the place where the record
    type is declared.  However, it makes a lot of sense that pattern
    matching, after checking the arity, should match the tag first, so
    making the tag the first (or first few) elements is a good idea.
    And using tuples makes more sense than lists, while Erlang has
    nothing other than tuples or lists to choose from.

3.  A tag could be some other kind of value.  It could be a {Name,Module}
    pair.  Tag comparison has to be fast, so it has to be an atom, an
    immediate number, or something else that compares fast.  It is
    tempting to suggest a reference, but we want two Erlang nodes that
    use the same record type to use the same tag, even before they first
    communicate.  This rules out most alternatives.

4.  The tag is exactly the record name.  This decision avoids invention,
    but it results in conflct.

To me it seems obvious both that 1 is the right thing to do long term,
but that in order to get something useful as quickly as possible, the
smallest change that could possibly work is to overturn point 4.
Instead of the atom being the record name, it will include it, but it
will or may include other stuff as well.  However, the record tag has to
be derived from available information in such a way that all Erlang
compilers on all Erlang nodes can generate the same tag.

Step 1: introduce  a new preprocessor declaration, -record_tag/2.

We start by adding a new preprocessor declaration form.

      -record_tag(RecordName, TagName).

Within a module there may be at most one explicit -record_tag
declaration for each record type.  In fact each record type visible in a
module will have exactly one -record_tag declaration; the others will be
generated implicitly.  For any given TagName known in a module, there
must be exactly one (explicit or implicit) -record_tag declaration.  One
effect of this declaration is to define a new macro as if by

      -define(RecordName_TAG, TagName).

For example,

      -record_tag(state, 'WFTP_Control_State').

not only tells the -record code to use 'WFTP_Control_State' as the tag for
#state, but also does

      -define(state_TAG, 'WFTP_Control_State').

The purpose of this macro is to allow

      is_record(Expr, state)

to be replaced by

      is_record(Expr, ?state_TAG).

I have wondered whether

      -define(is_RecordName(X), is_record(X, TagName)).

might be better, but we are talking about the least possible change here.

It is a fatal error if a -record_tag declaration follows the
corresponding -record declaration.  If there is a -record_tag
declaration for which there is no corresponding -record declaration, a
warning message should be issued, but this need not be a fatal error.

A -record_tag declaration must be in exactly the same file as its
corresponding -record declaration.  This means that it is impossible for
a record declaration in a .hrl file to be given one tag in one module
and another tag in another module.

The -record_tag declaration is NOT intended for normal use.
It has two intended uses:
(a) as a transition tool for moving from the current scheme
    to the (slightly) improved scheme.
(b) as an escape hatch for rare cases when the implicitly generated
    tags conflict or do not carry necessary information.

Step 2.  Implicit -record_tag declarations.

If there is a -record declaration with no explicit -record_tag
declaration preceding it, it might be intended as a private declaration
for use within the containing module only, or it might be intended for
sharing with other modules.  The preprocessor can use a simple rule to
tell these cases apart.  Any -record declaration that is meant to be
shared between modules must be in an -included file; any -record
declaration that is not -included must be intended to be private.

The default tag for a private record type is

      '<module name>:<record name>#<magic code>'.

The default tag for a shared record type is

      '<record name>#<magic code>'.

What is the magic code?  Ideally, we would like to ensure that two tags
are the same if and only if the corresponding -record declarations are
the same.  It would be nice to base this on the identity of the file
containing the declaration, but thanks to hard links, symbolic links,
aliases, and networked file systems, it is possible for a file to have
many different names.  The present scheme checks that the record name is
the same and the number of fields is the same, but it does not check for
the names of the fields or their order.  So if we have

      -record(foo, {a, b}).

in one module and

      -record(foo, {b,a}).

in another module, a #foo{a=1,b=2} in the first will be mistaken for a
#foo{b=1,a=2} in the second.  So at a minimum the <magic code> should
depend on the names of the fields and their order.  Some hash code based
on [Field1,...,FieldN] should suffice.  (One obvious alternative would
be to encode the actual field names, but it would then be tempting to
rely on that for decoding, but for the sake of backwards compatibility
and to cope with -record_tag declarations, it's better that the field
names should never be visible in the tag name although they should
influence it.

At this time I do not want to say anything about the hashing algorithm,
except to note that 30 to 60 bits of hash code encoded in Base 64 should
be adequate.

This scheme does not prevent collision, although it should be very
unlikely.  In any situation where the implicit tag is unsatisfactory,
you can povide an explicit one.  If someone wanted to use
'state@REDACTED' as a tag name, I'm not going to stop them.

The only effect of these changes is to change the atom that is used
as the tag for a record declaration.  Everything else done by the
compiler remains unchanged, the instruction set remains unchanged, and
the emulator remains unchanged.  HiPE would be unchanged.  Even the way
the compiler handles is_record would be unchanged.

Step 3.  The record-printing code has to learn how to undo tags.

Somehow the present setup conveys to the run-time system "in module
so-and- so, tag name such-and-such is associated with these fields."
This has to be changed slightly to "in module so-and-so, tag name
such-and-such is associated with record name thingy having these
fields."  In fact it should usually be possible to ignore the module
name.

Evaluation.

This is not a perfect scheme.  Collision is still possible, but it is
unlikely, and if it does happen a manual -record_tag can fix it up.
However, it satisfies one major requirement:
    WE CAN GET THERE FROM HERE.
The first stage adds
   -ifdef("this is an Erlang version that understands -record_tag").
   -record_tag(rname, rname).
   ...
   -else.
   -define(rname_TAG, rname).
   ...
    -endif.
to each file containing at least on -record declaration.  These extra
declarations ensure that code compiled using the new preprocessor will
use the same tags as code compiled using the old one.  These patches can
be automatically generated!

The second stage is to go through all the code fixing up is_record/2 tests.

When we are finally ready to switch over to the new system, the third
stage is to remove the preprocessor lines added in the first stage,
which can again be an automatic process.

What this doesn't help with, of course, is people taking records apart
or building them with hand-written code.  However, because design
properties 1, 2, 3 are preserved, data-walking functions that know
about records in general as opposed to specific record types should not
be affected, and thanks to the implied -define of <record name>_TAG,
even code that does build or match records "by hand" should be
comparatively easy to convert.