[erlang-questions] speed of neotoma parser

Serge Aleynikov serge@REDACTED
Sun Nov 15 06:08:33 CET 2015


Max,

You can try this parser, which does parse config syntax very similar to
what you need: https://github.com/saleyn/erlcfg

Regards,

Serge

On Sat, Nov 14, 2015 at 5:00 PM, Max Lapshin <max.lapshin@REDACTED> wrote:

> I need parsing of a nginx-like config.
> In my current implementation I have described almost all keywords in peg
> grammar, so any unknown keyword will stop parsing.
>
> If I have correctly understood Sean, better will be to validate after
> parsing.
>
> I need to step back and read about lalr vs ll difference first.
> On Nov 14, 2015 9:44 PM, "Robert Virding" <rvirding@REDACTED> wrote:
>
>> If you specifically want an LL(1) parser generator I have written one for
>> Erlang and LFE, spell1, https://github.com/rvirding/spell1. You need a
>> tokeniser for it, for example one generated with leex. The LALR(1) grammars
>> of yecc are more general than LL(1) but spell1 does have a few benefits
>> over yecc:
>>
>> - It is reentrant in that you don't need to pass in all the tokens in go
>> but can keep adding them until it has gotten enough.
>> - It can handle passing in too many tokens and just returns the
>> left-overs.
>>
>> Yecc crashes on both of these cases. These aren't really a problem for
>> erlang as you have the '.' which marks the end but I needed it for LFE.
>>
>> Robert
>>
>>
>> On 14 November 2015 at 18:33, Max Lapshin <max.lapshin@REDACTED> wrote:
>>
>>> Got it.
>>>
>>> So, it is just a bad way of usage.
>>>
>>> I decided to put semantics and most of logic inside config to make
>>> validation as early as possible. And it seems that it is a misusage.
>>>
>>> OK, I will think what to do including option "ignore and relax" because
>>> only one user suffers from this problem.
>>>
>>> Thanks for hints and explanation!
>>> On Nov 14, 2015 8:20 PM, "Sean Cribbs" <seancribbs@REDACTED> wrote:
>>>
>>>> Hi Max,
>>>>
>>>> My general impression of your grammar is that it conflates syntax and
>>>> semantics. Your configuration lines seem to be structured roughly like
>>>> KEYWORD VALUES* ( "{" BLOCK "}" )? ";", but you create a single rule for
>>>> every possible configuration item. This means at the top-level you have to
>>>> create ordered choices between completely unrelated extents. Every config
>>>> item you parse is going to go through the same backtracking among ~30
>>>> alternatives, with nothing able to be memoized (because the stem of each
>>>> one is unique). This is a quintessential worst-case for the types of
>>>> parsers that neotoma currently generates.
>>>>
>>>> I suggest you rethink the parser, focus on what are the shared
>>>> structures and try to extract those. Use fewer rules, try not to treat each
>>>> config item as a special case, and then validate the structure after
>>>> parsing it. Alternatively, you could pick up and use cuttlefish which has a
>>>> much simpler configuration language and built-in semantic validation and
>>>> transformation and command-line tools.
>>>>
>>>> As you say earlier in the thread, you could also use leex and yecc. I
>>>> understand the desire not to, but their performance characteristics are
>>>> better known and more predictable. They might be worth it.
>>>>
>>>> I feel pretty responsible for the limitations you encountered. Ford's
>>>> thesis clearly outlines optimizations that neotoma does not do, including
>>>> more efficient matching of terminals via "switch", cost calculation to
>>>> determine what to memoize and inline, unrolling recognition logic instead
>>>> of using parser-combinators, and more. I'm working on a rewrite, but it's
>>>> not ready yet.
>>>>
>>>> I'm sorry. I hope the suggestions above help you focus the grammar into
>>>> something more usable.
>>>>
>>>> Sean
>>>>
>>>> On Sat, Nov 14, 2015 at 10:27 AM, Max Lapshin <max.lapshin@REDACTED>
>>>> wrote:
>>>>
>>>>> I've sent example from production to Sean.
>>>>>
>>>>> (flussonic@REDACTED)2> timer:tc(fun() ->
>>>>> config2_format:parse(element(2,file:read_file("big.conf"))), ok end).
>>>>>
>>>>> {29727926,ok}
>>>>>
>>>>> On Sat, Nov 14, 2015 at 4:05 PM, French, Michael <
>>>>> michael.french@REDACTED> wrote:
>>>>>
>>>>>> The concept of 'token' is fluid in PEGs. The terminal/non-terminal
>>>>>> distinction might not work. For example, the definition of 'alphanumeric'
>>>>>> might appear in several different 'tokens', rather than repeating char
>>>>>> classes, which means the token rules become non-terminals.
>>>>>>
>>>>>> Maybe use a hint in the grammar, like using an upper-case name for
>>>>>> rule LHS to indicate a 'token' (which is similar to a convention in Antlr).
>>>>>> Then always memo-ize the token rules, but not (necessarily) the others that
>>>>>> have lower-case rule names.
>>>>>>
>>>>>> BR
>>>>>> Mike
>>>>>>
>>>>>> ________________________________________
>>>>>> From: erlang-questions-bounces@REDACTED [
>>>>>> erlang-questions-bounces@REDACTED] on behalf of Joe Armstrong [
>>>>>> erlang@REDACTED]
>>>>>> Sent: Friday, November 13, 2015 11:25 PM
>>>>>> To: Sean Cribbs
>>>>>> Cc: Erlang-Questions Questions
>>>>>> Subject: Re: [erlang-questions] speed of neotoma parser
>>>>>>
>>>>>> PEG parsers are notoriously inefficient. How about having a separate
>>>>>> tokenization pass, and parse token instead of characters. At a guess
>>>>>> this would be far faster since you'd backtrack over completed tokens
>>>>>> rather than characters.
>>>>>>
>>>>>> /Joe
>>>>>>
>>>>>> On Fri, Nov 13, 2015 at 3:02 PM, Sean Cribbs <seancribbs@REDACTED>
>>>>>> wrote:
>>>>>> > Max,
>>>>>> >
>>>>>> > Do you have a link to your grammar? I can probably poke at it and
>>>>>> give you
>>>>>> > some tips.
>>>>>> >
>>>>>> > However, I am well aware of performance problems with neotoma --
>>>>>> with large
>>>>>> > grammars or large inputs it drags. Yes, there are general problems
>>>>>> for PEGs
>>>>>> > in Erlang, but its current implementation is particularly naive and
>>>>>> > wasteful. I'm working on a rewrite, but it's a complete overhaul
>>>>>> (and more
>>>>>> > faithful to the thesis and reference implementation "Pappy"). Since
>>>>>> it's not
>>>>>> > core to my day-job, I've only been able to work on the rewrite
>>>>>> occasionally
>>>>>> > in my free time.
>>>>>> >
>>>>>> > On Thu, Nov 12, 2015 at 12:07 PM, Max Lapshin <
>>>>>> max.lapshin@REDACTED> wrote:
>>>>>> >>
>>>>>> >> Yes, Louis, I also think that there may be a simple way of
>>>>>> speeding it up.
>>>>>> >>
>>>>>> >> I'm only afraid that I will have to open my university book and
>>>>>> remember
>>>>>> >> what LL-1 means and how it differs from LALR =)
>>>>>> >>
>>>>>> >> Ok, will try to profile it first.
>>>>>> >>
>>>>>> >> _______________________________________________
>>>>>> >> erlang-questions mailing list
>>>>>> >> erlang-questions@REDACTED
>>>>>> >> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > erlang-questions mailing list
>>>>>> > erlang-questions@REDACTED
>>>>>> > http://erlang.org/mailman/listinfo/erlang-questions
>>>>>> >
>>>>>> _______________________________________________
>>>>>> erlang-questions mailing list
>>>>>> erlang-questions@REDACTED
>>>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>> _______________________________________________
>>>>>> erlang-questions mailing list
>>>>>> erlang-questions@REDACTED
>>>>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>>>>
>>>>>
>>>>>
>>>>
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>>
>>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20151115/34cbb121/attachment.htm>


More information about the erlang-questions mailing list