dictionaries (was Re: new syntax - a provocation)

Fri Sep 26 06:35:43 CEST 2003

Chris Pressey <cpressey@REDACTED> replied to some of my points:
	I don't follow _any_ of these arguments.  It's my understanding that the
	representation of dicts is _undefined_.  (You didn't go peeking and
	write code like 'case A of {dict,_,_,...} -> ...', did you?  :)

It doesn't matter _which_ represenation it is, only that it _has_ a
represenation which is ambiguous between "I am a dict" and "I am
an Erlang value that just happens to look like dicts currently look".
In particular, if I have a list of things which are dicts, there is
no way that a UBF encoder can _know_ that they are dicts and encode
them appropriately.

This could be said for _any_ data structure, which is why one has to
pick one's data structures with care.  Lists offer incremental
construction.  Tuples offer O(1) access by number.  Dictionaries offer
something between O(1) and O(lg n) -- depending on implementation --
access by name.

	Because the representation of dicts is undefined, they could change
	anytime in the future, perhaps drastically (e.g. becoming BIFs.)

	> (1) When we know that something is a dictionary, we can use a more
	>     space-efficient representation.  For example, instead of
	>     [{K1,V1},...,{Kn,Vn}] 	-- (n list cells + n 2-tuples)
	>     we can use
	>     {dict,K1,V1,...,Kn,Vn}	-- one 2n+1-tuple (not really a tuple)
	>     which is likely to take somewhat less than half the space.

The dict module is written in Erlang.  That means it can only use existing
Erlang data structures.  That limits the range of possibilities that are
implementable.

	> (2) When we know that something is a dictionary, we can use a more
	>     time-efficient representation.  For example, if the keys K1...Kn
	>     as shown above are in standard order, lookups can be done in
	>     O(lg n) time instead of O(n) time.

Same point as above.  This of course applies to any data structure.

	> (3) When the representation is not visible to the programmer, the
	> run-time
	>     system may use several different representations.  For example,
	>     here are some obvious ones:  linear array as shown above, some
	>     kind of binary search tree (so that unchanged chunks can be
	>     shared), some kind of hash table (so that lookup can be O(1)).

Because the dict module must use *some* Erlang representation of the
data, this possibility (of changing the representation of a value at
run time) simply is not available to it.  It can pick from several
different representations at construction time, sure, but it _can't_,
just plain can't, switch representation after construction.  Such
representation changes are not against the spirit of functional
programming, because they don't change the value that is represented;
no functional test can detect any change.  All that changes is performance.

Now, if you are going to rewrite the dict module so that it uses some
sort of special internal data structure, all you have to do to get to
my proposal is adopt some syntax for these things as well, and why not?

	> (4) There are lots of "functional array"/"persistent array"
	> implementations
	>     in the literature which could be used to make setelement/3 take
	>     O(1) time; these could be adapted to make updates to dictionaries
	>     fast too, provided the representation is not exposed.

What's not tog et about this?  Like the others, it is simply true.
(Nearly all of the techniques I have in mind rely on changing a
representation at run time after construction, without changing the
value which is logically represented.)

	> (5) With the 'dict' module you get a somewhat less than obvious
	> textual
	>     representation, which could mean other things.  With a built in
	>     data type you can get "just right" textual representation.

Prolog has a kluge for this, which is portray/1.  By adding clauses

portray(empty_dict) :-
    portray_dict(empty_dict).
portray(non_empty_dict(K,V,L,R)) :-
    portray_dict(non_empty_dict(K,V,L,R)).

I can arrange for dictionaries to be printed in a way that makes it clear
what they are, _provided_ everyone else knows to avoid the functors in
question.  Erlang does not have this kluge.  It's basically the same
point I made earlier:  because 'dict' uses *some* Erlang data structures,
there is no way for any other code to tell whether a data value really
*is* meant to be a dictionary or just *happens* to look like one.

In the absence of an always-enforced type system, the only types you
can rely on distinguishing are the ones for which there are built-in
guard predicates which are guaranteed to be disjoint.

	> (6) With 'dict' you don't get to do dictionary stuff in patterns or
	> guards.

	All that's needed for that is a special _notation_ for
	dictionaries (which, to be clear, is _not_ the same thing as a
	_representation_ in this sense.)

What's needed is a notation *and* an implementation which is sufficiently
good to replace records in everyday use.  If you have one, but not the
other, you still find yourself wanting records, which may be Erlang's
biggest wart.

	- implement dictionaries directly in BEAM (any number of
	available C implementations would probably do the trick
	sufficiently - in fact we already have a very efficient
	associative storage mechanism that might work just fine.  what's
	it called again?  oh yah.  'ETS' :)

No, the strength of ETS is *big* dictionaries.  Quoting the reference
manual, "These provide the ability to store very large amounts of data".
These dictionaries are supposed to be small, cheap, fast, immutable.
You are supposed to be able to have millions of them.  Quoting the
ETS manual, "The number of tables stored at one Erlang node is limited.
The current default limit is approximately 1400 tables."  And, quite
disastrously for the intended use of dictionaries, "Note that there is
no automatic garbage collection for tables."

ETS tables are important and useful, but they are about as far from what
we want here as they could possibly be.

	- map the dict module to BIFs (we don't need a new API for this

We don't need a new interface for _that_, but do you really _like_
the existing interface?  There were _reasons_ why I proposed a different
interface.

	- add a syntax for pattern-matching dictionaries (the only real
	advance)

Well, if you chop what I propose into two pieces, then suggest implementing
the hardest part, then the other half _is_ the only real advance left.
Just like if you eat 3/4 of a pie, what's left is only 1/4 of a pie.

Using pattern syntax for *constructing* dictionaries is also a large
advance.  Writing

    ets2:new(<{name->freddy, type->set, access->protected, keypos->2}>)

is clearer than

    ets2:new([{name,freddy}, {type,set}, {access,protected}, {keypos,2}])

not least because it makes it obvious that the order doesn't matter.
Having to write

    ets2:new(dict:from_list([
	{name,freddy}, {type,set}, {access,protected}, {keypos,2}]))

would rather miss the point, which is not just to have dictionaries,
but to have something principled to use instead of records.