Concatenating atoms

Thu Feb 3 12:06:24 CET 2005

> -----Original Message-----
> From: Matthias Lang [mailto:matthias@REDACTED]
> Sent: den 3 februari 2005 09:28
> To: Joe Armstrong (AL/EAB)
> Subject: RE: Concatenating atoms
> 
> 
> Joe Armstrong (AL/EAB) writes:
> 
>  >    Yes, and I would very much like to write "clean and fast 
>  > Erlang applications" that create very large numbers of atoms
>  > - the fastest algorithms involve creating very large numers 
>  > of atoms - but this is a no-no so I use strings where efficient 
>  > algorithm would have used atoms.
> 
> Can you provide an example of such an algorithm?
> 

Yes - the "obvious" example is the Erlang compiler.

Variables in Erlang get tokenised as (for example) 
suppose I have a variable name "HttpCommandName" in line
1234 of a program.

The scanner (which precedes parsing) will represent this as
{var,1234,'HttpCommandName'} - and this value will retained in the
parse tree of the program.

<< run epp:parse_file("XXX.erl", "", "") 
   to see the parse tree generated for XXX.erl
>>

This representation of a variable is very efficient, since
the compiler often checks different variables for equality.

   running member('HttpCommandname', ['var1','Var2','HttpCommandname', ...])

is *much* more efficient than running

   running member("HttpCommandname", ["var1","Var2","HttpCommandname", ...])

since head matching in the clause

	member(H, [H|T]) 

Is much more efficient if only atom comparisons are involved.

So efficient way or representing variables in the compiler is to
represent each variable as an atom.

The trouble is that *each time we run the compiler* all the variable names
used in the program get turned into atoms and are added to the atom table
of the Erlang run-time system.

So writing a "compiler server" (this is a program which accepts Erlang programs,
compiles them and returns the compiled code) is impossible - sooner or later
we run of atom space.

Note that *after* the compiler has run it would (or should be) possible to
garb away all the variable names that we created during the parsing process.

This problem might sound contrived but it is not it is a very common situation.

*Every* parser which accepts inputs from a communication channel suffers from this
problem is the set of atoms that it can create is unbounded.

Example: Suppose I want to quickly verify that an xml data structure is correct
(ie verified) I cannot safely represent

	<abc one="123" def="xyz">hoho</abc>

The *natural* representation is

	a) {abc, [{one,"123"},{def,"xyz"}], [{pcdata,"hoho"}]}

 But I have to choose a "bad" representation

      b) {"abc", [{"one","123"},{"def","xyz"}], [{pcdata,"hoho"}]}

  And *all* code that manipulates these data structure will be less efficient.

  The very fact of parsing the above data struct ensures that the atoms abc,one,def
will end up in the atom table.

  If the tag alphabet is unbounded then the atom table will eventually overflow.

  This gives us a number of design choices.

	1) Accept that I have incorrect implementation that will one day crash
	2) Prove that the atom table cannot overflow because the alphabet is
	   finite and will not overflow the atom table
	3) Uses strings instead of atoms

  I choose 3) because correctness is more important to me than speed
- one day it will crash and that is unacceptable.

  2) is impossible

  Pragmatically 1) may not happen very often and the consequences of a crash
might not be important but still I don't like this.

  Suppose somebody decides to use Erlang to build a nuclear powerstation
and uses my XML parser to control the unit which pushes the fuel rods in and
out of the core - and one day the atom table overflows  ....

   The wise men said - "get it right before you make it fast"

   /Joe

> Matthias
>