Parsing binaries (was Re: Concatenating atoms)

Mon Feb 7 13:37:31 CET 2005

On 2005-02-07 at 11:26:44 (+0100), Joe Armstrong (AL/EAB) wrote:
> > 
> > Hi Matthias, hi Joe,
> > 
> > If integers are desirable, why not use indexes into the binary which
> > contains the original XML data?
> 
> Absolutly - this is a good idea - It just make the parseing a wee bit
> more complicated. (Actually you can use strings, - but don't tell anybody I said so -
> if you can guarantee that they are "pointer identical" (ie you build the strings
> in linear manner using a "pure dictionary type" library)
> 
>  
> > So, to use Joe's example to illustrate, the variable Abc on line 23 of
> > the program could be represented as {var,Offset,Size} where Offset is
> > the position of the first byte of "Abc" in XML_Binary, and 
> > Size in this
> > case would be 3 (obtaining line numbers would be done by scanning the
> > binary for newline chars, for error reports). I think I'm correct in
> > saying that the following:
> > 
> > <<_:Offset/binary,Chunk:Size/binary,_/binary>> = XML_Binary,
> > 
> > Doesn't create a new binary Chunk, instead creates a 
> > reference into the
> > existing XML_binary.
> > 
> > A lexicon of tokens could be based on a list of {Offset,Size} tuples,
> > and all matching tokens in a parse tree can refer to the 
> > first occurance
> > of the token in XML_Binary.

this sounds interesting. As long as only Offset and Size are compared the
life is very good and fast. I think another technique instead of
preserving the whole input (espacially if you parse big files you might
not want to have the whole input in memory), build a dictionary mapping
offsets (and sizes if required) to strings (or short binaries) and a
trie representing the reverse map. If we can rely on some smart Erlang
VM implementation, it would be enough to have the trie which represents
the identical map. Then simply exchange each occurance of an string
with the one found in the map. If we have a large file with only a small
number of different identifiers, this should optimize the space
requirements. In other words, using a simple compression algorithm. The
number of keywords of language to parse should be fixed, so converting
those to atoms for comparison should be any problem (at least all
languages I know have a fixed set of keywords :)). Including such a trie
into a scanner (or scanner generator generated scanner) shouldn't be
a problem. One could even include some support in the parser generator
or parser combinators to not have to use atoms for keywords, but I think
that is overkill.

The advantage of using the strings would be, that one does not have ot
lookup the strings in binaries or dictionaries which should make it much
easier to use them. On the other hand this needs some support from the
runtime, that structures may be shared and then comparison will be
really fast (but I think this is the way the current VM works and if not
a change might at least be easier than adding an atom garbage collector
:)).

-- 
Matthias Kretschmer