atoms vs. integers

Tue Dec 9 05:17:29 CET 2003

Very stupid question, I think I know the answer already, but since
everyone knows there's no such thing as a stupid question, let me
prove it wrong by asking one:

>From a performance point, which is faster, ints or atoms?  I realize
both are exactly one word in size in the emulator, and that atoms are
supposed to be unified in the atom table, so all occurances of the
atom say 'finished' is at the same memory address, and thus have the
same term value.

I'm basically planning on stuffing a large number of 2 tuples holding
{Key, Atom} into an ets table, and doing lots of pattern matching on
the Atoms coming back.  A large amount of code is going to something
along:

	case ets:lookup(reg_tab, Key) of
		[{_, finished}] -> ...;
		[{_, not_done}] -> ...;
		[] -> ...
	end.

My guess is that since these are comparing terms, its just a comparsion
of two words, which is dirt cheap.

What about sending these over the wire then in distributed messages to
other nodes?  I'm going to use Erlang distribution rather than UBF or
some other solution as I'm going to be using the distribution for what
it was originally designed for: tightly coupling systems that are
already tightly coupled.  No reason to reinvent the wheel here.
(Besides, I want the monitoring and linking Erlang offers!)

>From what I know, atoms would be slower here, as the receiver would
need to lookup the matching atom on the other end.  If I used ints,
I could use macros ?FINISHED and ?NOT_DONE, but of course this is not
very Erlangish.

I might still use the macros, but start out with atoms under them, and
then can either globally search and replace the macros out if I'm ok
with the atom transfer performance, or change the macros to ints.

To answer my own question:  don't optimize now, just write the code,
make it correct, profile, and put off optimizing as long as possible. :)

However, I don't want my program to expend a lot of time looking up
atoms when I'm going to be having hundreds of nodes passing around
tens of thousands of messages per second to each other with these
two atoms in them.

On a related note, how do bigints compare to just using a tuple of
small ints or a binary?  I'm thinking about both term storage and
the header(s) required for the term, GC, cost to move between ets
and back, etc.

I don't need to perform math on it once created, just perform
comparsions for ordering.  Creation can be easily hidden down within
a function, much like make_ref().  I can't use make_ref() however
because I need certain assurances about ordering of values.

These values are the Keys above - I basically need about 96 bits wide
of data space within any given Key, which means a tuple of 4 elements,
a bigint of 3, or a binary of 12 bytes (3 words).  I need to be able
to shove thousands of these into an ets set, but they will have a high
degree of common bits.  Its a given that of the 96 bits, the same 80
will be common with all other Key values in the ets table.

I can organize the Key any way I chose to get better hashing from
ets if any ets masters have words of wisdom to share...  one thought
I have had is to pull out those 80 or so common bits and store them
somehow 'globally', then put just the 16 differing bits in the table.
Every so often however it is necessary to rebuild this table, and update
the 80 common bits.

Unfortunately, Erlang does not really offer a great way to share this
GlobalCommon80 with all interested processes (which can be in the
hundreds) on each node.  Short of putting them into an ets table,
or sending a broadcast message.  So this solution is not good, unless
it would dramatically improve ets performance, outweighing the cost of
manging GlobalCommon80.

-- 
Shawn.