[erlang-questions] json to map
ok@REDACTED
Sat Aug 29 13:56:13 CEST 2015
>
> I have made them the same type because all three are only 1 character.
I am suddenly feeling very cross.
YOU WANT TO BE ABLE TO TELL AT A GLANCE WHAT KIND OF TOKEN YOU
HAVE. That means *NOT* hiding stuff inside a binary.
> I have used the code from fiffy as reference to make my own.
> I understand that they must be different but on some point they are the
> same.
No, and no, and NO. THEY ARE DIFFERENT. The fact that they
*happened* to be represented by single characters in the input
is UTTERLY UNINTERESTING. You have LESS THAN NO INTEREST in
knowing what the characters were. It's like the way in Pascal
"[" and "(." were different *character sequences* but the same
*token*. When you are dealing with TOKENS, you not only do not
want to know anything about the characters, you WANT NOT TO KNOW
anything about the characters.
Here again is the Haskell type declaration:
    data Token            -- a Token is
       = TInt Int         -- an integer, or
       | TWord String     -- a word, or
       | TDash            -- a dash, or
       | TSlash           -- a slash, or
       | TComma           -- a comma.
When we are dealing with tokens, WE NEED TO BE ABLE TO TELL
ONE KIND OF TOKEN FROM ANOTHER BY A SINGLE PATTERN MATCH.
That's what a sum-of-products data type is all about; it's
a thing that lets us tell what we have by a single 'case'
analysis.
When we are dealing with tokens, we have made that choice
so we don't have to deal with characters. We don't *care*
whether an integer was written as 10, 010, 000000000000010,
or in another context, 2r1010, 3r101, 0xA, $\n, ...
Notice that a TInt token has some associated information,
and a TWord token has some associated information, in both
cases *derived from* but *not identical to* their source
characters, but a TDash, a TSlash, or a TComma have *NO*
associated information. They are NOT associated with any
character or string. As far as the rest of the program is
concerned, IT DOES NOT MATTER whether TDash stands for
U+002D, U+2010, U+2011, U+2012, U+2013, U+2014, U+2052,
U+2053, U+2448, U+2212, or whatever. That information is
*GONE*, and it's gone because we WANT it gone. We need to
be able to tell one token from another and to recover any
important information, BUT WE HAVE NO INTEREST IN WHAT
THE TOKENS LOOKED LIKE ANY MORE.
Like I said before, this separation of concerns between a
stage where we *do* have to care about the textual
representation of tokens and a stage where we can heave a
huge sigh of relief and forget that rubbish is one of the
reasons why we make a distinction between character
sequences and token sequences in the first place; it's one
of the reasons why I have no desire ever to use "scannerless"
parsing technology.
Now we want to map that token representation into Erlang.
And we DON'T start by writing -type declarations.
We start by saying "There are five situations that I want
to be able to discriminate with a single 'case' in Erlang.
Two of them have one item of associated information each,
and the other three have no associated information."
The first thing to do is to sort these things into groups
where all the situations in each group have the same
number of pieces of associated information.
Situations with NO associated information can (and should!)
be represented by atoms.
Situations with N pieces of associated information should
normally be represented by tuples with N+1 elements, the
first being an atom.
The atoms within a group MUST be different, so that a
single 'case' can trivially distinguish the situations.
All the atoms SHOULD be different so that people can make
sense of them.
So we end up with
    TInt i      {int,I}
    TWord w     {word,W}
    TDash       dash      % I used '-' before
    TSlash      slash     % I used '/' before
    TComma      comma     % I used ',' before
I've used different atoms this time to make the point
that there is NO necessary connection between the names
we use to distinguish the cases and the spelling of any
of the tokens.
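
To make the "single pattern match" point concrete, here is a minimal
sketch of a function that tells the five situations apart using nothing
but its clause heads. The module name token_kind and the function
describe/1 are made up for illustration; they are not from the earlier
messages in this thread.

```erlang
-module(token_kind).
-export([describe/1]).

%% One clause per kind of token. The shape of the term alone
%% (tagged tuple or bare atom) says which kind we have; nothing
%% here looks at how the token was spelled in the input.
describe({int, I}) when is_integer(I) -> "integer " ++ integer_to_list(I);
describe({word, W}) when is_list(W)   -> "word " ++ W;
describe(dash)                        -> "dash";
describe(slash)                       -> "slash";
describe(comma)                       -> "comma".
```

Anything that is not one of the five shapes fails with function_clause,
which is arguably what you want for a value that is not a token.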
It does not make sense, in *any* programming language,
to associate a binary with the dash, slash, or comma
tokens. We DO need to know what kind of token we are
dealing with. We do NOT need to know how it was spelled.
Suppose we were doing this in C. We might have
    enum Tag {T_Int, T_Word, T_Dash, T_Slash, T_Comma};

    typedef struct Token {
        enum Tag tag;
        union {
            int val;          // Used only when tag == T_Int
            char const *str;  // Used only when tag == T_Word
            /* NOTHING */     // T_Dash, T_Slash, T_Comma
        } u;
    } Token;
This really isn't about Erlang, in fact. It is wrong in
*any* language to represent three *different* tokens by
the *same* thing.
Now that we've figured out that we want to use
    {int,I}
    {word,"w..."}
    dash
    slash
    comma
to represent the different tokens, *NOW* you can write
a type declaration that expresses this.
    -type token()
       :: {int,integer()}
        | {word,string()}
        | dash
        | slash
        | comma.
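
As a rough sketch of how a scanner might produce values of this type,
here is a toy tokeniser over a list of code points. The module name
toy_scanner, the helper names, and the particular choice of dash
characters and white space are assumptions made for illustration, not
the code under discussion. Note that the spelling is thrown away at
this stage: several different dash characters all come out as the one
atom dash, and "010" comes out as {int,10}.

```erlang
-module(toy_scanner).
-export([tokens/1]).
-export_type([token/0]).

-type token() :: {int, integer()}
               | {word, string()}
               | dash
               | slash
               | comma.

%% Turn a list of code points into a list of tokens.
%% Characters not handled below fail with function_clause.
-spec tokens(string()) -> [token()].
tokens([]) -> [];
tokens([$, | Cs]) -> [comma | tokens(Cs)];
tokens([$/ | Cs]) -> [slash | tokens(Cs)];
%% Hyphen-minus, U+2010 hyphen, U+2013 en dash, U+2212 minus sign:
%% four spellings, one token.
tokens([C | Cs]) when C =:= $-; C =:= 16#2010; C =:= 16#2013; C =:= 16#2212 ->
    [dash | tokens(Cs)];
%% Skip blanks, tabs, and newlines.
tokens([C | Cs]) when C =:= $\s; C =:= $\t; C =:= $\n ->
    tokens(Cs);
tokens([C | _] = Cs) when C >= $0, C =< $9 ->
    {Digits, Rest} = lists:splitwith(fun is_digit/1, Cs),
    [{int, list_to_integer(Digits)} | tokens(Rest)];
tokens([C | _] = Cs) when C >= $a, C =< $z; C >= $A, C =< $Z ->
    {Word, Rest} = lists:splitwith(fun is_letter/1, Cs),
    [{word, Word} | tokens(Rest)].

is_digit(C) -> C >= $0 andalso C =< $9.

is_letter(C) -> (C >= $a andalso C =< $z) orelse (C >= $A andalso C =< $Z).
```

Everything downstream of tokens/1 sees only {int,_}, {word,_}, dash,
slash, and comma, and never has to ask which character was typed.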
This is not a matter of taste or style.
Any time you write a type union in Erlang where two of the
alternatives even overlap, you should get worried. Because
that means you have an *ambiguous* type; one where there
is some value such that you cannot tell which of the
alternatives it belongs to on the basis of its form. This
is not always a mistake, but "mistake" is the way to bet.
Amongst other things, while the *computer* may be able to
work it out, ambiguous alternatives are situations where
*people* are likely to be confused.
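
For contrast, here is a small sketch of what an overlapping union looks
like in practice. The type names and the module overlap_demo are
hypothetical, chosen only to resemble the "all three are binaries"
idea; this is not the original poster's code.

```erlang
-module(overlap_demo).
-export([classify/1]).
-export_type([punct/0]).

%% Hypothetical declarations: three names, but one and the same set
%% of values, so the union below is just binary() written three times.
-type dash()  :: binary().
-type slash() :: binary().
-type comma() :: binary().
-type punct() :: dash() | slash() | comma().

%% The form of the value (a binary) says nothing about which
%% alternative it is, so we have to look inside it, and we need a
%% catch-all because the type also admits every other binary.
-spec classify(punct()) -> dash | slash | comma | unknown.
classify(<<"-">>) -> dash;
classify(<<"/">>) -> slash;
classify(<<",">>) -> comma;
classify(B) when is_binary(B) -> unknown.
```

With the atoms dash, slash, and comma there is nothing to classify and
no "unknown" case to worry about.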
Oh yeah, the other thing. There is only ONE type introduced
here. There was ONE type in the Haskell code; that should
have been a clue that ONE type was probably the right thing
in the Erlang translation. 'dash' and 'comma' are distinct
VALUES of the token() type; they are not usefully to be
thought of as distinct TYPES.
> Pity, you gave me the answer. Now I can do a copy/paste and go on and
> the next time I do it wrong again.
> That is why I want to try this one on my own and make my own mistakes
> and learn from them, make more mistakes and also learn from them.
On present showing, you have nothing to worry about on that
score, and I hope you have learned something about designing
data types.