[erlang-questions] Parsing C with leex and yecc

Tue Jul 20 15:00:09 CEST 2010

On Tue, Jul 20, 2010 at 12:32 PM, Sverker Eriksson
<sverker@REDACTED> wrote:
> Joe Armstrong wrote:
>>
>> I'm trying to parse ANSI C with leex and yecc and have run into
>> two problems.
>>
>> 1) /* ... */ comments. Leex is (as I understand things) greedy,
>>   thus I can't just write a regexp to match comments, since it
>>   will consume not only the current comment, but all comments until
>>   the last comment in the file.
>>
>>   To solve this I have just written a simple pre-processor to remove
>>   comments from the original source.
>>
>>
>
> re:run("/***first comment***/ /* next comment */",
> "/\\*([^*]|(\\*+([^*/])))*\\*+/").
>
> http://ostermiller.org/findcomment.html

  Ummm ... this will also match comment-like text inside a string
literal, for example:

      char *p = "hi /* not a comment */ how are you?";

  Which is not what I want ..
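
  To make the problem concrete, here is a minimal sketch that runs the
regexp quoted above through re:run/3. The module name comment_re_demo
and the function first_match/1 are made up for illustration; only
re:run/3 and its {capture, first, list} option come from the standard
library.

```erlang
-module(comment_re_demo).
-export([first_match/1]).

%% The regexp quoted above, written as an Erlang string
%% (every backslash is doubled).
-define(COMMENT_RE, "/\\*([^*]|(\\*+([^*/])))*\\*+/").

%% Return {match, [Text]} for the first comment-shaped match in Str,
%% or nomatch if there is none. The regexp knows nothing about C
%% string literals, so it cannot tell a real comment from quoted text.
first_match(Str) ->
    re:run(Str, ?COMMENT_RE, [{capture, first, list}]).
```

  On the two-comment input it returns {match,["/***first comment***/"]},
so the match does stop at the end of the first comment. On the C line
above it returns {match,["/* not a comment */"]}, which is text inside
the string literal and not a comment at all.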

  Easiest seems to be a bit of pure Erlang:

%% remove_comments(Str) -> Str'
%%    remove C-style comments from a string
%%    note1: We retain any embedded NLs in the comment;
%%           this is so that line number calculations in the tokenizer
%%           will still be correct
%%    note2: We copy literal strings. Since a literal string might
%%           contain a comment we have to parse the string
%%    note3: Comments must be replaced by at least one space,
%%           since otherwise "123/* comment */456" would be
%%           transformed into 123456 (a single integer) instead
%%           of two integers 123 and 456. This is why skip_comment/2
%%           pushes a space when it reaches the end of a comment.

remove_comments(Str) -> remove_comments(Str, []).

remove_comments("/*" ++ T, L) -> skip_comment(T, L);
remove_comments([$"|T], L)    -> copy_string_literal(T, [$"|L]);
remove_comments([H|T], L)     -> remove_comments(T, [H|L]);
remove_comments([], L)        -> lists:reverse(L).

%% Inside a comment: drop everything except newlines.
skip_comment("*/" ++ T, L)   -> remove_comments(T, [$\s|L]);
skip_comment("\n" ++ T, L)   -> skip_comment(T, [$\n|L]);
skip_comment([_|T], L)       -> skip_comment(T, L);
skip_comment([], L)          -> remove_comments([], [$\s|L]). % unterminated comment

%% Inside a string literal: copy verbatim up to the closing quote.
%% A backslash escapes the next character, so \" and \\ are copied as pairs.
copy_string_literal([$\\,C|T], L) -> copy_string_literal(T, [C,$\\|L]);
copy_string_literal([$"|T], L)    -> remove_comments(T, [$"|L]);
copy_string_literal([H|T], L)     -> copy_string_literal(T, [H|L]);
copy_string_literal([], L)        -> lists:reverse(L). % unterminated literal

/Joe

>
> /Sverker
>
>