[erlang-questions] Why is there an atom table?

Richard A. O'Keefe ok@REDACTED
Mon Jul 22 06:48:10 CEST 2013


On 21/07/2013, at 2:12 AM, Pierre Fenoll wrote:

> Hi,
> 
> So, atoms are stored in a shared table, using locks to read & write, and never garbage-collected.
> Well we know the issues that emerge from this.

We do talk and have been talking about alternatives for a L O  N   G   time.

I suspect that it would be possible to use a lock-free hash table
but those things make my head hurt.
> 
> What I can't find on the Web is the purpose of this table.

You will find it in the better books.
Actually, there _are_ places on the web that explain it,
and you _can_ find them with straightforward Googling.

> Why does Erlang needs to assign a unique hash to a <255-characters string and put it in memory?

The limit on the length of an atom is highly problematic.
Back in the 1980s, Quintus found themselves forced to raise the
length limit on atoms to 1024.  Modern Prolog and Lisp systems
typically have no length limit other than available memory.
Erlang is just about the only programming language I use where
(1) the language *has* atoms
(2) they are not usable to represent full file names.

> I thought atoms were just like enums, thus the only actions needed on them were ==, =:= and the various conversions to string or binary.

Atoms can be compared using < as well.

I'm not sure what language you have in mind when you talk about 'enums',
but even in Java they are *unique*.  It is precisely *because* atoms can
be used like enums that they need to be stored uniquely

> An atom (approximately) corresponds to this regexp: [a-z'][a-zA-Z0-9_']{,254}.

No, an atom corresponds to /^.{0,255}$/.  There is no restriction on
the characters that can be in an atom name.  Even \0 is allowed, anywhere.

> Its syntactic representation **already** supports the actions I cited.

Yes, but it doesn't support them *FAST*.

Storing atoms uniquely means that
 - testing for identity takes constant time, only a couple of instructions
 - when converting an atom to binary, perhaps for sending to another node,
   repeated occurrences of the same atom are recognised *FAST* and sent
   *compactly*.

Lisp, Prolog, Erlang, Smalltalk, and Ruby all have symbols for
basically the same reasons:
 - things that can be compared fast
 - and so make great keys for hash tables
 - which the language implementation takes private advantage of
 - safe tokens that cannot be changed (Lisp, Smalltalk, and Ruby strings
   are mutable and the construction of Prolog strings can be unwound)
> 
> Why not get rid of the atom table and and just use strncmp(3)?

strncmp(3) would not work because \0 is legal _inside_ an atom and
does not terminate the name.  In any case, strncmp() is much slower
than a simple pointer comparison.

It's a typical computing tradeoff:
 - if creating atoms were extremely common and comparing them rare,
   it would not make sense to have an atom table,
 - but since comparing atoms is *extremely* common and creating
   atoms *much* rarer, it makes sense to slow down creation in
   order to speed up comparison.





More information about the erlang-questions mailing list