[erlang-questions] Strings as Lists

Fri Feb 15 03:56:46 CET 2008

On 14/02/2008, Christian S <chsu79@REDACTED> wrote:
>
>
> Parsing a quoted string as a token from leex is difficult if you know
> that the end-quote might not be included in the chunk you just fed
> into leex, but the next chunk read from the tcp stream.

No, you're wrong here: leex was designed to handle just this case. See the
explanation further down.

> With fully recursive grammars I can see how one wants to let yecc
> handle it, but a quoted string is not really recursive: You can't have
> a quoted string inside a quoted string the same way you can have, say,
> an if-expression inside an if-expression inside an if-expression etc
> in a programming language.
>
> Leex is a tool I would use for when I know I have some file of finite
> length and I could do a two-pass parsing with yecc as the second
> stage, I would not use it for tokenizing SMTP/IRC/NNTP...

This is where you are wrong and have missed how leex works, just as in the
point above.

The i/o system was designed to handle just this case, where you receive your
data in chunks and need to collect the data from these chunks into the
correct units in a re-entrant fashion. In this case the units are tokens, but
they could be lines, or records, or all the tokens in an Erlang form, or
... . A process which is an IoDevice for the io module functions can do just
this; this is what makes it an IoDevice. Unfortunately there is no good
write-up describing the i/o system, and the interface needed for this
functionality is not properly defined in the io module. There was a
description in the old book but not in the released sections. Someday if I
get time I will fix it.

Now leex was designed to fit into the i/o system so it can handle getting
data in chunks in a re-entrant fashion. It depends on which functions in the
generated file you call. That file has the same interface as the erl_scan
module. The string/2/3 functions take a complete string and return the
tokens in it. This is a one-shot deal.
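
As a minimal sketch of the one-shot case, here is erl_scan:string/1 from the
standard library, which the leex-generated string functions are modelled on.
The module name one_shot and the function scan/0 are made up for
illustration.

```erlang
-module(one_shot).
-export([scan/0]).

%% Scan a complete string in a single call. No continuation is involved,
%% so the whole input, including the closing quote, must already be there.
scan() ->
    {ok, Tokens, _EndLocation} = erl_scan:string("X = \"hello world\". "),
    Tokens.
```

The returned list should contain a single string token holding
"hello world", alongside the variable, '=' and dot tokens.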

However the functions token/3 and tokens/3 are re-entrant. Token will read
one token, while tokens will read all the tokens up to a token which was
declared as {end_token, ...  }, like dot ". " in Erlang. You first call them
with a continuation of []. If there are enough characters the call returns
{done,Result,LeftOverChars}; otherwise, if there weren't enough characters, it
returns {more,Continuation}. Then you call the function again with
Continuation and more characters, and so on until you get what you need or
your characters run out. No more characters is signaled by calling with 'eof'
instead of characters; an empty list does not have this effect. Check the
documentation for erl_scan (though it doesn't have the token function).
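
A rough sketch of that calling pattern, using erl_scan:tokens/3 from the
standard library rather than a leex-generated module (which, as said above,
should expose the same interface). The module name chunk_scan, the helper
feed/2 and the sample chunks are invented for the example; the chunks stand
in for data arriving piecewise from a socket, with a quoted string split
across three of them.

```erlang
-module(chunk_scan).
-export([demo/0, feed/2]).

%% Feed chunks one at a time to the re-entrant scanner.
%% Cont is [] on the first call and the returned continuation afterwards.
feed(Cont, [Chunk | Rest]) ->
    case erl_scan:tokens(Cont, Chunk, 1) of
        {done, Result, LeftOver} ->
            %% Enough characters: hand back the result, the unused
            %% characters and the chunks that were never looked at.
            {Result, LeftOver, Rest};
        {more, Cont1} ->
            %% Not enough characters yet: carry on with the next chunk.
            feed(Cont1, Rest)
    end;
feed(Cont, []) ->
    %% No more chunks: signal end of input with the atom eof,
    %% not with an empty list.
    {done, Result, LeftOver} = erl_scan:tokens(Cont, eof, 1),
    {Result, LeftOver, []}.

demo() ->
    Chunks = ["X = \"hel", "lo wor", "ld\". ", "Y = 2. "],
    feed([], Chunks).
```

Here the first two calls should return {more,Continuation} because the
end-quote has not arrived yet, and the third call completes the form. To scan
the next form, start again with a continuation of [] and pass LeftOver
followed by the remaining chunks.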

So leex can do exactly what you want.

Unfortunately yecc doesn't have the same type of interface so it is not
re-entrant and you have to give it all the tokens in one go. Even more
unfortunately it could have been written in such a way. And could be
rewritten as well. :-(
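
To illustrate the difference, here is a small sketch with erl_parse, the
standard Erlang parser, which as far as I know is generated by yecc. The
module name scan_then_parse and the function parse/1 are made up for the
example. However the tokens were collected, the parser is handed the complete
list in a single call.

```erlang
-module(scan_then_parse).
-export([parse/1]).

%% The scanning step could be done in chunks, but the parser has no
%% continuation interface: it needs every token of the input at once.
parse(String) ->
    {ok, Tokens, _EndLocation} = erl_scan:string(String),
    erl_parse:parse_exprs(Tokens).
```

Calling scan_then_parse:parse("X = 1 + 2.") should return {ok, Exprs} with a
single match expression in Exprs.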

Hope this helps. I will try to find some examples of code. Otherwise check
in the file modules.

Robert

> I'm looking for a better tool (as in quicker and easier code to
> maintain/extend) than writing protocol parsing "by hand".
>
> PS.
> I reserve the right to be completely mistaken about everything.
>