[erlang-questions] speed of neotoma parser

Robert Virding rvirding@REDACTED
Sat Nov 14 19:44:19 CET 2015


If you specifically want an LL(1) parser generator I have written one for
Erlang and LFE, spell1, https://github.com/rvirding/spell1. You need a
tokeniser for it, for example one generated with leex. The LALR(1) grammars
of yecc are more general than LL(1) but spell1 does have a few benefits
over yecc:

- It is reentrant in that you don't need to pass in all the tokens in go
but can keep adding them until it has gotten enough.
- It can handle passing in too many tokens and just returns the left-overs.

Yecc crashes on both of these cases. These aren't really a problem for
erlang as you have the '.' which marks the end but I needed it for LFE.

Robert


On 14 November 2015 at 18:33, Max Lapshin <max.lapshin@REDACTED> wrote:

> Got it.
>
> So, it is just a bad way of usage.
>
> I decided to put semantics and most of logic inside config to make
> validation as early as possible. And it seems that it is a misusage.
>
> OK, I will think what to do including option "ignore and relax" because
> only one user suffers from this problem.
>
> Thanks for hints and explanation!
> On Nov 14, 2015 8:20 PM, "Sean Cribbs" <seancribbs@REDACTED> wrote:
>
>> Hi Max,
>>
>> My general impression of your grammar is that it conflates syntax and
>> semantics. Your configuration lines seem to be structured roughly like
>> KEYWORD VALUES* ( "{" BLOCK "}" )? ";", but you create a single rule for
>> every possible configuration item. This means at the top-level you have to
>> create ordered choices between completely unrelated extents. Every config
>> item you parse is going to go through the same backtracking among ~30
>> alternatives, with nothing able to be memoized (because the stem of each
>> one is unique). This is a quintessential worst-case for the types of
>> parsers that neotoma currently generates.
>>
>> I suggest you rethink the parser, focus on what are the shared structures
>> and try to extract those. Use fewer rules, try not to treat each config
>> item as a special case, and then validate the structure after parsing it.
>> Alternatively, you could pick up and use cuttlefish which has a much
>> simpler configuration language and built-in semantic validation and
>> transformation and command-line tools.
>>
>> As you say earlier in the thread, you could also use leex and yecc. I
>> understand the desire not to, but their performance characteristics are
>> better known and more predictable. They might be worth it.
>>
>> I feel pretty responsible for the limitations you encountered. Ford's
>> thesis clearly outlines optimizations that neotoma does not do, including
>> more efficient matching of terminals via "switch", cost calculation to
>> determine what to memoize and inline, unrolling recognition logic instead
>> of using parser-combinators, and more. I'm working on a rewrite, but it's
>> not ready yet.
>>
>> I'm sorry. I hope the suggestions above help you focus the grammar into
>> something more usable.
>>
>> Sean
>>
>> On Sat, Nov 14, 2015 at 10:27 AM, Max Lapshin <max.lapshin@REDACTED>
>> wrote:
>>
>>> I've sent example from production to Sean.
>>>
>>> (flussonic@REDACTED)2> timer:tc(fun() ->
>>> config2_format:parse(element(2,file:read_file("big.conf"))), ok end).
>>>
>>> {29727926,ok}
>>>
>>> On Sat, Nov 14, 2015 at 4:05 PM, French, Michael <michael.french@REDACTED
>>> > wrote:
>>>
>>>> The concept of 'token' is fluid in PEGs. The terminal/non-terminal
>>>> distinction might not work. For example, the definition of 'alphanumeric'
>>>> might appear in several different 'tokens', rather than repeating char
>>>> classes, which means the token rules become non-terminals.
>>>>
>>>> Maybe use a hint in the grammar, like using an upper-case name for rule
>>>> LHS to indicate a 'token' (which is similar to a convention in Antlr). Then
>>>> always memo-ize the token rules, but not (necessarily) the others that have
>>>> lower-case rule names.
>>>>
>>>> BR
>>>> Mike
>>>>
>>>> ________________________________________
>>>> From: erlang-questions-bounces@REDACTED [
>>>> erlang-questions-bounces@REDACTED] on behalf of Joe Armstrong [
>>>> erlang@REDACTED]
>>>> Sent: Friday, November 13, 2015 11:25 PM
>>>> To: Sean Cribbs
>>>> Cc: Erlang-Questions Questions
>>>> Subject: Re: [erlang-questions] speed of neotoma parser
>>>>
>>>> PEG parsers are notoriously inefficient. How about having a separate
>>>> tokenization pass, and parse token instead of characters. At a guess
>>>> this would be far faster since you'd backtrack over completed tokens
>>>> rather than characters.
>>>>
>>>> /Joe
>>>>
>>>> On Fri, Nov 13, 2015 at 3:02 PM, Sean Cribbs <seancribbs@REDACTED>
>>>> wrote:
>>>> > Max,
>>>> >
>>>> > Do you have a link to your grammar? I can probably poke at it and
>>>> give you
>>>> > some tips.
>>>> >
>>>> > However, I am well aware of performance problems with neotoma -- with
>>>> large
>>>> > grammars or large inputs it drags. Yes, there are general problems
>>>> for PEGs
>>>> > in Erlang, but its current implementation is particularly naive and
>>>> > wasteful. I'm working on a rewrite, but it's a complete overhaul (and
>>>> more
>>>> > faithful to the thesis and reference implementation "Pappy"). Since
>>>> it's not
>>>> > core to my day-job, I've only been able to work on the rewrite
>>>> occasionally
>>>> > in my free time.
>>>> >
>>>> > On Thu, Nov 12, 2015 at 12:07 PM, Max Lapshin <max.lapshin@REDACTED>
>>>> wrote:
>>>> >>
>>>> >> Yes, Louis, I also think that there may be a simple way of speeding
>>>> it up.
>>>> >>
>>>> >> I'm only afraid that I will have to open my university book and
>>>> remember
>>>> >> what LL-1 means and how it differs from LALR =)
>>>> >>
>>>> >> Ok, will try to profile it first.
>>>> >>
>>>> >> _______________________________________________
>>>> >> erlang-questions mailing list
>>>> >> erlang-questions@REDACTED
>>>> >> http://erlang.org/mailman/listinfo/erlang-questions
>>>> >>
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > erlang-questions mailing list
>>>> > erlang-questions@REDACTED
>>>> > http://erlang.org/mailman/listinfo/erlang-questions
>>>> >
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>
>>>
>>>
>>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20151114/aed190dc/attachment.htm>


More information about the erlang-questions mailing list