[erlang-questions] Special syntax for dictionaries
Richard O'Keefe
ok@REDACTED
Fri May 7 07:03:19 CEST 2010
On May 6, 2010, at 9:44 PM, Sergey Samokhin wrote:
> * You don't know for sure *how many* keys there will be.
>
> JSON can be sent to you through RESTful interface with a lot of fake
> keys, and you don't want to store them as atoms in the global atom
> storage.
We shouldn't have to care about that.
There is a great gaping bleeding vulnerability at the heart
of Erlang: it SHOULDN'T be possible to bring down a system
by sending it a lot of distinct atoms, but it IS.
It doesn't have to be that way.
EEP 20, by someone whose initials look a lot like mine,
says
An idea from the Logix implementation of Flat Concurrent
Prolog can be adapted to Erlang: invisibly to users there
can be two implementations of 'atoms', fixing a major
system integrity issue and removing the need to warp one's
data structure design to code around it.
Historically, many Prolog systems had a similar limitation.
When I got a copy of Brand X Prolog for the Macintosh, about the
10th thing I tried was creating more atoms than its atom table
could hold. Brand X Prolog not only crashed, it managed to wipe
itself off the disc on the way to the grave, so I was never able
to try any other tests. (Silly me, I had thought that since I
wasn't trying to change anything on that floppy, it would not change.)
A couple of years ago, Jan Wielemaker fixed SWI Prolog, so that it
DOES garbage collect atoms, and it is now common practice to use
atoms all over the place, in large numbers. This despite the fact
that SWI Prolog also has (sadly pthread-like) concurrency.
So "we need to use something other than atoms for JSON labels" is
a *temporary* concession to a bug in the surviving Erlang
implementation (GERL and Erlang-over-Scheme apparently being dead)
which has less excuse for remaining every day. I don't think we
should allow a bug that's overdue for fixing to warp the design of
*future* language changes.
>
> * You don't know for sure *what* keys there will be.
Then we are definitely talking about dictionaries. But if you
don't know what the keys will be, you can't write them in patterns
or expressions, so what good would special syntax be?
>
> Sometimes key you have is a combination of different nested keys.
Nested keys should be handled by nested dictionaries each with a
simple key. For example, in Java on my old machine,
System.getProperties.list(System.out) shows me 48 keys (and their
values) in an apparently random order. Suppose they were factorised
file/...
java/...
user/...
line/...
os/...
sun/...
Each level would be easier to grasp, and groups of properties, like
java/vm/... and os/... and user/... would actually make sense. And
each level would be so small that simple linear search would work
well. As a frame, we'd have
<{ file ~
<{ encoding ~
<{ '.' ~ _1
, pkg ~ _2 }>
, separator ~ _3 }>
, java ~ ...
, line ~ ...
, os ~ ...
, sun ~ ...
, user ~
<{ country ~ _U
, dir ~ _V
, home ~ _W
, language ~ _X
, name ~ _Y
, timezone ~ _Z }> }>
> In
> Redis database it's common to use keys like "user/name". Some subkeys
> can be integers. There can be something behind the string
> representation, so you may want to parse the key. Atoms aren't
> strings, so what you can do on it is limited.
Non-sequitur. There is nothing that can be done with strings that
*couldn't* be done with atoms. For example, in Haskell, practically
everything you might want to do with a list of characters (String)
can be done with a Data.ByteString. In particular, there's no reason
why you couldn't be allowed to concatenate atoms, slice out subatoms,
and do regular expression matching on atoms. *IF* the global limited
atom table weren't a scarce and vulnerable resource, it would make
perfect sense to do these things. SWI Prolog even lets you read
a term as an atom. (Makes me queasy, but it _works_.)
>
> * You want to have pattern matching
> Manipulation such a JSON-documents without pattern matching is a
> pain.
I've done a fair bit of XML processing in Prolog, and for that matter,
in C. (I have my own faster-than-expat parser and my own "Document
Value Model" library. It's amazing what you can do with XML
_conveniently_
in C.) JSON is really very like XML. They are *both* hard to process
with or without patterns, for the same reason: they are only
semistructured
and there can be all sorts of extra cruft there. For example, I'm
currently working with 100,000 US patents, marked up in XML. You would
think "where is *the* ECLA code saying what this is about" would be a
simple matter of fishing quickly in the meta-data, but it isn't, and for
what I actually have to do to get a best-guess at the value, no
plausible
pattern language would be a substantial improvement on the actual code I
have.
There's an old statistician's saying:
raw data are like raw potatoes,
they're no use until they're cleaned.
It applies to open semistructured data coming into a computer program
as well: first thing you do is extract just the data you want and
clean it up to make life as easy as possible for down-stream code.
(Hint: the letters "JS" in "JSON" stand for what? That has what
pattern matching facilities? Programmers using it complain about
the difficulty of using JSON data how much?)
> With dictionaries you shouldn't be restricted to atoms as keys.
With dictionaries, you aren't. Never have been, never will be.
> If
> your keys are known for your program, you probably want something more
> suitable for structures like frames/structs.
>
> #1.1 If we need syntax for dictionaries, what syntax to use?
Something extensible. There can be more than one kind of dictionary.
(As a trivial example: are keys matched using ==, or =:=, or something
else, which might perhaps ignore alphabetic case?)
>
> #2. Do we need first class replacement (frames/structs) for records?
>
> Yes, because records have some well known runtime restrictions.
>
> #2.1 If we need frames/structs, what syntax to use?
>
> So dictionaries are more "dynamic", when frames/struct are more
> "static".
That's a fair characterisation. While it is always possible to add any
key to any frame (getting a new frame), and while it is possible for a
frame to have keys that the receiver was not expecting, the _creator_ of
a frame must have a specific set of keys in mind. (Although there is a
pattern-like syntax and ALSO a fully dynamic set of functions.)
> Actually my previous message wasn't about "let's use term() for keys
> in frames". I really would like to have both of them implemented. I
> wrote the first message because I felt the lack of syntax for
> dictionaries.
I recommend writing a few modules *pretending* that such a syntax
exists and see what it ends up looking like. You'll keep on changing
your mind for a while, but eventually something will gell.
> I couldn't find your report. Could you post the link here?
Here it is again:
www.cs.otago.ac.nz/staffpriv/ok/frames.pdf
> Actually, the size of my JSON keys is small. I'm just a bit afraid of
> using atoms, because atoms aren't garbage collected and atoms aren't
> that good for string processing (regular expressions, parsing,
> validation) just because atoms aren't strings.
At the risk of sounding like "Poor Johnny One-Note"
- atoms *should* be and *could* be garbage collected
(Logix did it and SWI Prolog does it)
- atoms *could* be processed by regular expressions as
easily as character lists and binaries (they *are* in
Smalltalk)
Right now, you are right. Fixing the atom garbage collection problem
ought to be near the top of the priority list, I think.
Note: EEP 20 proposes implementing atoms two different ways: 'global'
atoms that appear in source code and 'dynamic' ones. One reason for
this is that no changes to the compiler or to its .beam files are
needed, only VM changes.
>
> Related links:
>
> Abstract patterns, structs and frames:
> http://www.erlang.org/pipermail/erlang-questions/2009-February/041921.html
There *is* an EEP for abstract patterns, EEP 29.
I did submit an EEP for frames but it was rejected as too large.
Basically, http://www.otago.ac.nz/staffpriv/ok/frames.pdf *is* the
EEP for frames, and has been since late 2003.
More information about the erlang-questions
mailing list