[erlang-questions] tokenising broken code

Tue Jan 14 22:17:39 CET 2014

Hi Joe,

We have such a tokenizer in erlide. It extends erl_scan with extra
information: where necessary, the string representation of the token
is included. Also, the position information is the offset in the file
and the length of the token (as a string). The tokens include comments
and whitespace (which can be filtered out with an option), but not
macros (which we handle in the layer above).

I think it is quite independent of the erlide code. The source is at
https://github.com/erlide/erlide/blob/pu/org.erlide.kernel.ide/src/erlide_scan.erl

It is easy to diff with erl_scan to see what's changed. I think our
version is based on R14's erl_scan.
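
For comparison, here is a minimal sketch of how far the stock erl_scan
gets on its own with the return and text options. This is not the
erlide_scan API. The module and function names are made up for
illustration, and it assumes a recent OTP release where
erl_scan:category/1 and erl_scan:text/1 are available.

```erlang
-module(lossless_scan_sketch).
-export([tokens/1, roundtrip/1]).

%% Hypothetical helper, not part of erlide or OTP.
%% Scans String with the stock erl_scan, asking it to keep comments
%% and white space (return) and the original source text of every
%% token (text). Returns {ok, [{Tag, Text}]} or {error, ErrorInfo}.
tokens(String) ->
    case erl_scan:string(String, 1, [return, text]) of
        {ok, Tokens, _EndLocation} ->
            {ok, [{erl_scan:category(T), erl_scan:text(T)} || T <- Tokens]};
        {error, ErrorInfo, _ErrorLocation} ->
            %% The stock scanner gives up here, for example on a
            %% string with a missing end quote.
            {error, ErrorInfo}
    end.

%% Checks the lossless property from the original question:
%% concatenating the token texts gives back the input string.
roundtrip(String) ->
    {ok, Pairs} = tokens(String),
    lists:append([Text || {_Tag, Text} <- Pairs]) =:= String.
```

With this, 16#abc should come back as {integer,"16#abc"} rather than
as a converted value, and roundtrip/1 should hold for input that
scans cleanly. Broken input such as an unterminated string still
yields an error tuple instead of tokens, which is the part a
tolerant scanner has to handle differently.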

Please let me know if it helps and/or if you have questions.

best regards,
Vlad

On Tue, Jan 14, 2014 at 9:14 PM, Joe Armstrong <erlang@REDACTED> wrote:
> Hello,
>
> Does anybody have a tokeniser for broken Erlang code (broken means
> unparsable)?
>
> I just want to render variables, atoms, strings, etc. in different
> colors and typefaces.
>
> So I need a tokeniser that
>
>   - retains everything (comments and all)
>   - does not do any token conversions (i.e. 16#abc is not tokenised
>     as {int,2748}, but as {integer,"16#abc"})
>
> It needs to handle broken code in a sensible way. For example, if a
> string end quote is missing, do something sensible.
>
> Assuming that what I want tokenises a string S into a sequence
> [{Tag1,S1},{Tag2,S2},...], I'd like S1 ++ S2 ++ ... = S, i.e. the
> tokeniser should be lossless.
>
> Cheers
>
> /Joe