[erlang-questions] speed of neotoma parser

Max Lapshin max.lapshin@REDACTED
Sat Nov 14 23:00:25 CET 2015


I need to parse an nginx-like config.
In my current implementation I have described almost all keywords in the
PEG grammar, so any unknown keyword stops parsing.

If I have understood Sean correctly, it would be better to validate after
parsing.

I need to step back and read up on the LALR vs. LL difference first.
On Nov 14, 2015 9:44 PM, "Robert Virding" <rvirding@REDACTED> wrote:

> If you specifically want an LL(1) parser generator, I have written one for
> Erlang and LFE, spell1, https://github.com/rvirding/spell1. You need a
> tokeniser for it, for example one generated with leex. The LALR(1) grammars
> of yecc are more general than LL(1), but spell1 does have a few benefits
> over yecc:
>
> - It is reentrant, in that you don't need to pass in all the tokens in one
> go but can keep adding them until the parser has received enough.
> - It can handle being passed too many tokens and simply returns the
> left-overs.
>
> Yecc crashes in both of these cases. Neither is really a problem for
> Erlang, since you have the '.' which marks the end of a form, but I needed
> this behaviour for LFE.
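>
> To illustrate the difference: the usual leex+yecc flow tokenises the whole
> input up front and hands the complete token list to the parser in a single
> call. In this sketch, config_lexer and config_parser are hypothetical
> modules generated from .xrl and .yrl files:
>
>     -module(config_load).
>     -export([parse_file/1]).
>
>     %% Tokenise the whole file, then pass every token to the parser in
>     %% one go (config_lexer and config_parser are assumed to have been
>     %% generated by leex and yecc).
>     parse_file(Path) ->
>         {ok, Bin} = file:read_file(Path),
>         {ok, Tokens, _EndLine} = config_lexer:string(binary_to_list(Bin)),
>         config_parser:parse(Tokens).
>
> With spell1 you can instead keep feeding tokens to the parser in chunks as
> they become available.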
>
> Robert
>
>
> On 14 November 2015 at 18:33, Max Lapshin <max.lapshin@REDACTED> wrote:
>
>> Got it.
>>
>> So this is simply a misuse of the tool.
>>
>> I decided to put the semantics and most of the logic inside the config
>> grammar so that validation happens as early as possible, and it seems
>> that this is a misuse.
>>
>> OK, I will think about what to do, including the "ignore and relax"
>> option, because only one user suffers from this problem.
>>
>> Thanks for the hints and the explanation!
>> On Nov 14, 2015 8:20 PM, "Sean Cribbs" <seancribbs@REDACTED> wrote:
>>
>>> Hi Max,
>>>
>>> My general impression of your grammar is that it conflates syntax and
>>> semantics. Your configuration lines seem to be structured roughly like
>>> KEYWORD VALUES* ( "{" BLOCK "}" )? ";", but you create a single rule for
>>> every possible configuration item. This means that at the top level you
>>> have to create an ordered choice among completely unrelated rules. Every
>>> config item you parse goes through the same backtracking among ~30
>>> alternatives, with nothing able to be memoized (because the stem of each
>>> one is unique). This is a quintessential worst case for the types of
>>> parsers that neotoma currently generates.
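>>>
>>> Concretely, the top-level rule ends up looking something like this (the
>>> rule names are hypothetical, and the trailing ellipsis stands for the
>>> remaining ~30 alternatives):
>>>
>>>     config <- listen_directive / pidfile_directive / stream_directive / ... ;
>>>
>>>     listen_directive  <- "listen" space value ";" ;
>>>     pidfile_directive <- "pidfile" space value ";" ;
>>>
>>> Every directive has to be matched against that whole ordered choice, and
>>> because each alternative starts with its own keyword literal, a failed
>>> attempt tells the parser nothing it can reuse for the next one.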
>>>
>>> I suggest you rethink the parser: focus on the shared structures and try
>>> to extract those. Use fewer rules, try not to treat each config item as a
>>> special case, and then validate the structure after parsing it.
>>> Alternatively, you could pick up and use cuttlefish, which has a much
>>> simpler configuration language plus built-in semantic validation,
>>> transformation, and command-line tools.
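>>>
>>> For example, a restructured grammar along those lines might look
>>> something like this rough, untested neotoma-style sketch, following the
>>> KEYWORD VALUES* ( "{" BLOCK "}" )? ";" shape above:
>>>
>>>     config    <- (space? directive)* space? ;
>>>     directive <- key (space value)* space? ("{" config "}")? space? ";" ;
>>>     key       <- [a-zA-Z_] [a-zA-Z0-9_]* ;
>>>     value     <- [^ \t\r\n;{}]+ ;
>>>     space     <- [ \t\r\n]+ ;
>>>
>>> Unknown keywords then parse fine as ordinary directives, and a separate
>>> pass over the resulting structure can reject the ones you do not
>>> recognise.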
>>>
>>> As you say earlier in the thread, you could also use leex and yecc. I
>>> understand the desire not to, but their performance characteristics are
>>> better known and more predictable. They might be worth it.
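>>>
>>> For comparison, an untested yecc sketch of the same directive shape could
>>> look like the following, assuming a leex tokeniser that emits key, string
>>> and integer tokens plus the punctuation:
>>>
>>>     Nonterminals config directives directive values value.
>>>     Terminals key string integer '{' '}' ';'.
>>>     Rootsymbol config.
>>>
>>>     config -> directives : '$1'.
>>>     directives -> '$empty' : [].
>>>     directives -> directive directives : ['$1' | '$2'].
>>>     directive -> key values ';' : {directive, unwrap('$1'), '$2'}.
>>>     directive -> key values '{' directives '}' ';' :
>>>         {block, unwrap('$1'), '$2', '$4'}.
>>>     values -> '$empty' : [].
>>>     values -> value values : ['$1' | '$2'].
>>>     value -> key : unwrap('$1').
>>>     value -> string : unwrap('$1').
>>>     value -> integer : unwrap('$1').
>>>
>>>     Erlang code.
>>>
>>>     unwrap({_Tag, _Line, Value}) -> Value.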
>>>
>>> I feel pretty responsible for the limitations you encountered. Ford's
>>> thesis clearly outlines optimizations that neotoma does not do, including
>>> more efficient matching of terminals via "switch", cost calculation to
>>> determine what to memoize and inline, unrolling recognition logic instead
>>> of using parser-combinators, and more. I'm working on a rewrite, but it's
>>> not ready yet.
>>>
>>> I'm sorry. I hope the suggestions above help you focus the grammar into
>>> something more usable.
>>>
>>> Sean
>>>
>>> On Sat, Nov 14, 2015 at 10:27 AM, Max Lapshin <max.lapshin@REDACTED>
>>> wrote:
>>>
>>>> I've sent an example from production to Sean.
>>>>
>>>> (flussonic@REDACTED)2> timer:tc(fun() ->
>>>> config2_format:parse(element(2,file:read_file("big.conf"))), ok end).
>>>>
>>>> {29727926,ok}
>>>>
>>>> That is roughly 29.7 seconds for a single parse of the file.
>>>>
>>>> On Sat, Nov 14, 2015 at 4:05 PM, French, Michael <
>>>> michael.french@REDACTED> wrote:
>>>>
>>>>> The concept of a 'token' is fluid in PEGs, so the terminal/non-terminal
>>>>> distinction might not work. For example, the definition of 'alphanumeric'
>>>>> might be shared by several different 'tokens' rather than repeating the
>>>>> character classes, which means the token rules become non-terminals.
>>>>>
>>>>> Maybe use a hint in the grammar, such as an upper-case name for a rule's
>>>>> LHS to indicate a 'token' (similar to a convention in ANTLR). Then always
>>>>> memoize the token rules, but not (necessarily) the others, which have
>>>>> lower-case rule names.
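>>>>>
>>>>> Illustratively, such a convention might look like this in the grammar,
>>>>> with the upper-case rules being the ones the generator would always
>>>>> memoize (hypothetical rule names, untested):
>>>>>
>>>>>     directive <- NAME (SPACE value)* SPACE? ";" ;
>>>>>     value     <- NAME / NUMBER ;
>>>>>     NAME      <- [a-zA-Z_] [a-zA-Z0-9_]* ;
>>>>>     NUMBER    <- [0-9]+ ;
>>>>>     SPACE     <- [ \t\r\n]+ ;
>>>>>
>>>>> Here only NAME, NUMBER and SPACE would be memoized.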
>>>>>
>>>>> BR
>>>>> Mike
>>>>>
>>>>> ________________________________________
>>>>> From: erlang-questions-bounces@REDACTED [
>>>>> erlang-questions-bounces@REDACTED] on behalf of Joe Armstrong [
>>>>> erlang@REDACTED]
>>>>> Sent: Friday, November 13, 2015 11:25 PM
>>>>> To: Sean Cribbs
>>>>> Cc: Erlang-Questions Questions
>>>>> Subject: Re: [erlang-questions] speed of neotoma parser
>>>>>
>>>>> PEG parsers are notoriously inefficient. How about having a separate
>>>>> tokenization pass and parsing tokens instead of characters? At a guess
>>>>> this would be far faster, since you would backtrack over completed
>>>>> tokens rather than characters.
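>>>>>
>>>>> Such a tokenization pass could be as small as a leex file along these
>>>>> lines (an untested sketch; the token category names are made up, and
>>>>> neotoma itself consumes strings, so this pairs more naturally with a
>>>>> token-level parser):
>>>>>
>>>>>     Definitions.
>>>>>     WS   = [\s\t\r\n]
>>>>>     NAME = [a-zA-Z_][a-zA-Z0-9_]*
>>>>>
>>>>>     Rules.
>>>>>     {WS}+   : skip_token.
>>>>>     #[^\n]* : skip_token.
>>>>>     {NAME}  : {token, {key, TokenLine, list_to_atom(TokenChars)}}.
>>>>>     [0-9]+  : {token, {integer, TokenLine, list_to_integer(TokenChars)}}.
>>>>>     \{      : {token, {'{', TokenLine}}.
>>>>>     \}      : {token, {'}', TokenLine}}.
>>>>>     ;       : {token, {';', TokenLine}}.
>>>>>
>>>>> The parser would then see tokens such as {key,1,listen} rather than raw
>>>>> characters.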
>>>>>
>>>>> /Joe
>>>>>
>>>>> On Fri, Nov 13, 2015 at 3:02 PM, Sean Cribbs <seancribbs@REDACTED>
>>>>> wrote:
>>>>> > Max,
>>>>> >
>>>>> > Do you have a link to your grammar? I can probably poke at it and
>>>>> > give you some tips.
>>>>> >
>>>>> > However, I am well aware of performance problems with neotoma --
>>>>> > with large grammars or large inputs it drags. Yes, there are general
>>>>> > problems for PEGs in Erlang, but its current implementation is
>>>>> > particularly naive and wasteful. I'm working on a rewrite, but it's
>>>>> > a complete overhaul (and more faithful to the thesis and reference
>>>>> > implementation "Pappy"). Since it's not core to my day-job, I've
>>>>> > only been able to work on the rewrite occasionally in my free time.
>>>>> >
>>>>> > On Thu, Nov 12, 2015 at 12:07 PM, Max Lapshin <max.lapshin@REDACTED>
>>>>> wrote:
>>>>> >>
>>>>> >> Yes, Louis, I also think that there may be a simple way of speeding
>>>>> >> it up.
>>>>> >>
>>>>> >> I'm only afraid that I will have to open my university textbook and
>>>>> >> remind myself what LL(1) means and how it differs from LALR =)
>>>>> >>
>>>>> >> OK, I will try to profile it first.
>>>>> >>
>>>>>
>>>>
>>>>
>>>
>>
>