[erlang-questions] Maps

Sat May 11 02:41:19 CEST 2013

I am a bit disturbed by the concern over Frames.  Since I started using Erlang I have always wanted native, efficient dictionary functionality as an actual type within the language (not just a dict module, despite the fact dict has one of the best possible implementations within native Erlang source code).  It is a major disappointment coming to the Erlang language, being given the process dictionary, and then being told that you should never use it, since it is dirty mutable state that ruins our pure referential transparency..  So, the native dictionary type, Maps, seems long (long) overdue when Erlang is compared with newer languages like Python, Ruby., Perl and Java.

Having a newer and better type of records named Frames seems like an opportunity to take something decent and make it slower (though I understand the extreme effort and attention to make performance close).  The argument is hopefully not: "But wait, we have neat syntax which allows us to make neat examples, and they might be useful.... and now everyone should use Frames in their code instead of records", since that seems least desirable.  I think it should be hard to make something more efficient than a simple array index lookup, which seems to be the basic idea behind a tuple (with the record syntactic sugar).  I don't necessarily care about the records syntax being this preprocessor trick since that is made clear within the documentation.  I think the idea that Frames provide a record alternative relies on the benefits of Frames, which appear too minimal because they seem to be focusing on improving the type system, more than extending functionality (so the concern looks
more academic than pragmatic).  One of the basic benefits of Frames that has been mentioned is that the type can be used explicitly for external data, however an external API could easily wrap the creation of an atom and tuple, so that justification of Frames doesn't feel very compelling (though the Frame type would make a record/tuple object more naturally a C struct-like type within Erlang).

I understand often performance is not a concern, but I can't help but think it should always be the most important concern for compiler writers and VM writers (at the very least, not to mention the database writers, the message bus writers, and the algorithm library writers), which is why I believe it carries so much importance to the decision of the implementation of Maps or Frames.  So, I am very happy with the pursuit of the Map type, while the pursuit of the Frame type seems inappropriate (in the sense of a feature triage).

On 05/09/2013 08:03 PM, Richard A. O'Keefe wrote:
> On 9/05/2013, at 8:10 PM, Max Lapshin wrote:
>
>> Richard. Would you, kindly, tell: where is the difference between maps
>> and frames?
>>
>> As far as I understand, it is not about syntax, it is about internal
>> way of storing data?
> I thought I already had, but here goes.
>
> There is one central semantic difference, which leads to two
> others.
>
> (U) What do you want to optimise the design for?
>
>     Frames are optimised (pared to the bone, in fact) for use in
>     record-like ways.  They are somewhere between pathetic and
>     hopeless as general purpose dictionaries.
>
>     Maps optimised for service as general-purpose dictionaries.
>     They are somewhere between almost usable and pathetic for
>     record-like uses.
>
>     I do not believe that it is possible to design a data structure
>     which is *good* at both tasks.  I suspect that it is possible to
>     prove this, but I must admit I haven't tried.
>
>     THIS is the reason that Björn-Egil Dahlberg was able to sneer at
>     frames as "just a replacement".  I don't try to build perpetual
>     motion machines.  I don't try to square the circle with ruler and
>     compass.  I don't try to design bottle-openers that make good
>     chisels.  And given two sets of conflicting requirements (even if
>     the sets overlap substantially), I don't try to satisfy them both
>     with a single data structure.  (Using a Java ArrayList as a queue
>     is *possible*, but stupid.)
>
> (K) Frames only allow atoms as keys, maps allow any term.
>     Frames can be manipulated using special syntax <{ K1 ~ V1, ... }>,
>     as can maps.  In the special syntax, the keys must be visibly and
>     obviously literal atoms; they may not be expressions; maps allow
>     keys to be expressions in expression context and variables (or is
>     it guard expressions, and if not, why not?) in pattern context.
>
>     Frames and maps can also be manipulated using functions; that is
>     not a difference.
>
> (T) Frames take O(M+N) time to update/insert M fields into a (copy of a)
>     frame that already has N keys.  Maps take O(lg N) time to update/
>     insert a single field into an N-key map.  Frames take O(M+N) time to
>     match M fields in an N-key frame; this can be reduced to O(M.lg N)
>     for small M and large N, but since the simple code can be blisteringly
>     fast, there's not really much point.
>
>     Frames go with a programming style in which instead of
>
> 	... R#r.a ...
> 	... R#r.b ...
> 	... R#r.a ... R#r.c ...
>
>     you write
>
> 	<{ a ~ A, b ~ B, c ~ C }> = R,
> 	... A ...
> 	... B ...
> 	... A ... C ...
>
>     which has the advantage of failing early if some of the required fields
>     are not present.  In fact my personal style for current Erlang says
>     "never use R#r.a".
>
> These are not independent properties, and they bear on the key
> use-case difference:
>
>     Frames were designed to be close to records in space and close to
>     records in performance, for *record*-like uses (fixed collections
>     of heterogeneous data) while also allowing a some flexibility,
>     especially for upgrading.
>
>     Maps were designed to be good for use as *dictionaries*; the design
>     is optimised for *large* numbers of key/value associations.  Such a
>     design intrinsically satisfies the "functional requirements" for a
>     record replacement (at least in an untyped language), but not the
>     "non-functional" or quality requirements.
>
> As for implementation, Björn-Egil Dahlberg wrote:
>
>> Current prototypes stems from using sparse tuples in a
>> HAMT-like data structure and tuple-like data structures.
>> The HAMT-like data structure is discontinued and will be 
>> replaced by some ordered tree.
> The HAMT-like data structure for OCAML that I've looked at
> takes *at least* four words for every key/value association.
> (I have been able to find plenty of numbers for the *speed*
> of HAMTs but nothing for the *space* they need other than a
> claim of "small".)  That's not necessarily the same thing,
> and the "tuple-like" part makes me wonder if something much
> cleverer is being done.  However, we're told that's going "and
> will be replaced by some ordered tree".  Suppose you have a
> 2-3-4 tree.  We can play SML/Mercury-style games to move the
> size of a node into the pointer to it, so we
>
> 2-node: C1 K1 V1 C2                    4   words/pair
> 3-node: C1 K1 V1 C2 K2 V2 C3           3.5 words/pair
> 4-node: C1 K1 V1 C2 K2 V2 C3 K3 V3 C4  3.3+words/pair
>
> So it's fair to say that this would take *roughly* 3.5 words
> per field.
>
> Playing similar move-the-balance-info-into-the-pointers games
> would let us do red-black trees or AVL trees in 4 words/pair.
>
> There are other games one can play, at the price of more
> complicated code, where we take advantage of the leaves of
> a 2-3-4 tree being at the same depth, so that leaf nodes
> are K1 V1
>  or K1 V1 K2 V2
>  or K1 V1 K2 V2 K3 V3
>  or K1 V1 K2 V2 K3 V3 K4 V4
> and the tree as a whole is roughly 2.3 words per pair.
>
> I don't know what exactly Björn-Egil Dahlberg has in mind,
> but if we charge 3 words per key/value association, it's
> not likely that we're overestimating the space cost.
>
> What about frames?  A frame <{ K1 ~ V1, ..., Kn ~ Vn }>
> is represented as
>
> 	---> { size-and-tag , | , V1, ..., Vn }
>                              /
>                            /
>              { size-and-tag, '',  K1, ..., Kn }
>
> It is, in fact, the same representation as a record, except that
> instead of the first element being the record *name* it is a
> pointer to the record's *descriptor*.  The space needed is 4+2N
> words.
>
> But it gets better.  In dictionary-like uses, you have *few*
> dictionaries with key sets that tend to be *different*.  In
> record-like uses, you have *many* 'records, with a much smaller
> number of key-sets, so the descriptors can be *shared*.  A
> typical record can be replaced by a frame with the *same*
> space cost: 2+N words.
>
> Segregating the keys is old idea for implementing environments in
> Lisp-like languages.  It works brilliantly for key-sets that don't
> change much.
>
> One really cool thing is that it really lends itself to supporting
> basic operations like comparison.  The storage representation is
> *canonical*.  Two equal frames have the same parts in the same order.
> To compare two frames for equality:
>    - if the sizes are different, the frames are different.
>    - if the descriptors are not the same pointer, then
>         if the descriptors are not *bitwise* identical in contents,
>         the frames are different.  (This can be implemented by a
>         tiny chunk of machine code that's faster than memcmp().)
>    - compare the rest of the frames as if comparing tuples.
> To compare two frames for ordering:
>   - if the sizes are different, the shorter one is less
>   - if the descriptors are not the same pointer, then
>        compare the descriptors for ordering as tuples.
>        If they are not equal, the lesser frame is the one with
>        the lesser descriptor.
>   - compare the rest of the frames as if comparing tuples.
>
> I don't understand HAMTs well enough to tell if they are canonical.
> In the absence of hash collisions, they might be, but what if there
> are collisions?  It's certainly hard to see how they could be
> canonical for both == and =:= at the same time.  (Not a problem for
> frames, because == and =:= coincide for atoms).
>
> For search trees, however, it is well understood that two trees can
> represent the same dictionary while having quite different shapes.
> This makes comparison much much harder.  You can still do it in
> linear time, but you really have to work at it.
>
>
>
>
>   
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>