[erlang-questions] speed of neotoma parser

Sat Nov 14 18:20:04 CET 2015

Hi Max,

My general impression of your grammar is that it conflates syntax and
semantics. Your configuration lines seem to be structured roughly like
KEYWORD VALUES* ( "{" BLOCK "}" )? ";", but you create a single rule for
every possible configuration item. This means at the top-level you have to
create ordered choices between completely unrelated extents. Every config
item you parse is going to go through the same backtracking among ~30
alternatives, with nothing able to be memoized (because the stem of each
one is unique). This is a quintessential worst-case for the types of
parsers that neotoma currently generates.

I suggest you rethink the parser, focus on what are the shared structures
and try to extract those. Use fewer rules, try not to treat each config
item as a special case, and then validate the structure after parsing it.
Alternatively, you could pick up and use cuttlefish which has a much
simpler configuration language and built-in semantic validation and
transformation and command-line tools.

As you say earlier in the thread, you could also use leex and yecc. I
understand the desire not to, but their performance characteristics are
better known and more predictable. They might be worth it.

I feel pretty responsible for the limitations you encountered. Ford's
thesis clearly outlines optimizations that neotoma does not do, including
more efficient matching of terminals via "switch", cost calculation to
determine what to memoize and inline, unrolling recognition logic instead
of using parser-combinators, and more. I'm working on a rewrite, but it's
not ready yet.

I'm sorry. I hope the suggestions above help you focus the grammar into
something more usable.

Sean

On Sat, Nov 14, 2015 at 10:27 AM, Max Lapshin <max.lapshin@REDACTED> wrote:

> I've sent example from production to Sean.
>
> (flussonic@REDACTED)2> timer:tc(fun() ->
> config2_format:parse(element(2,file:read_file("big.conf"))), ok end).
>
> {29727926,ok}
>
> On Sat, Nov 14, 2015 at 4:05 PM, French, Michael <michael.french@REDACTED>
> wrote:
>
>> The concept of 'token' is fluid in PEGs. The terminal/non-terminal
>> distinction might not work. For example, the definition of 'alphanumeric'
>> might appear in several different 'tokens', rather than repeating char
>> classes, which means the token rules become non-terminals.
>>
>> Maybe use a hint in the grammar, like using an upper-case name for rule
>> LHS to indicate a 'token' (which is similar to a convention in Antlr). Then
>> always memo-ize the token rules, but not (necessarily) the others that have
>> lower-case rule names.
>>
>> BR
>> Mike
>>
>> ________________________________________
>> From: erlang-questions-bounces@REDACTED [
>> erlang-questions-bounces@REDACTED] on behalf of Joe Armstrong [
>> erlang@REDACTED]
>> Sent: Friday, November 13, 2015 11:25 PM
>> To: Sean Cribbs
>> Cc: Erlang-Questions Questions
>> Subject: Re: [erlang-questions] speed of neotoma parser
>>
>> PEG parsers are notoriously inefficient. How about having a separate
>> tokenization pass, and parse token instead of characters. At a guess
>> this would be far faster since you'd backtrack over completed tokens
>> rather than characters.
>>
>> /Joe
>>
>> On Fri, Nov 13, 2015 at 3:02 PM, Sean Cribbs <seancribbs@REDACTED>
>> wrote:
>> > Max,
>> >
>> > Do you have a link to your grammar? I can probably poke at it and give
>> you
>> > some tips.
>> >
>> > However, I am well aware of performance problems with neotoma -- with
>> large
>> > grammars or large inputs it drags. Yes, there are general problems for
>> PEGs
>> > in Erlang, but its current implementation is particularly naive and
>> > wasteful. I'm working on a rewrite, but it's a complete overhaul (and
>> more
>> > faithful to the thesis and reference implementation "Pappy"). Since
>> it's not
>> > core to my day-job, I've only been able to work on the rewrite
>> occasionally
>> > in my free time.
>> >
>> > On Thu, Nov 12, 2015 at 12:07 PM, Max Lapshin <max.lapshin@REDACTED>
>> wrote:
>> >>
>> >> Yes, Louis, I also think that there may be a simple way of speeding it
>> up.
>> >>
>> >> I'm only afraid that I will have to open my university book and
>> remember
>> >> what LL-1 means and how it differs from LALR =)
>> >>
>> >> Ok, will try to profile it first.
>> >>
>> >> _______________________________________________
>> >> erlang-questions mailing list
>> >> erlang-questions@REDACTED
>> >> http://erlang.org/mailman/listinfo/erlang-questions
>> >>
>> >
>> >
>> > _______________________________________________
>> > erlang-questions mailing list
>> > erlang-questions@REDACTED
>> > http://erlang.org/mailman/listinfo/erlang-questions
>> >
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20151114/8bc8bed5/attachment.htm>