[erlang-questions] Adoption of perl/javascript-style regexp syntax

Thu Jun 4 08:57:44 CEST 2009

Hi,

On Thu, Jun 4, 2009 at 04:09, Richard O'Keefe <ok@REDACTED> wrote:
> On 3 Jun 2009, at 9:35 pm, Vlad Dumitrescu wrote:
>> I suppose that you mean something like "embedded strings in a language
>> are wrong when representing anything else than plain text". And I
>> couldn't agree more, they are evil - strings that represent for
>> example a regexp should be a different data type than a text message
>> string.
>
> If we agree about that, everything else is less important.

Very good, then we'll just have to sort out the devil that's in the
details :-) I think most of the controversy in this thread is caused
by the fact that each of us brings his own baggage of presuppositions,
so that we are not really talking about the same things.

>> Yet we don't do that because the textual representation has
>> some advantages: it's easier to read, it is higher level, it's easier
>> to modify and we're not bound to a specific internal representation.
>
> It may be easier to READ, but it is far harder to WRITE correctly.
> As for modifying, no, it is NOT easy to read.  And strings *are*
> a specific internal representation.

I see strings as an external representation; I don't know of any
regexp engine that doesn't compile them into something else.
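
As an illustration of that point, here is a minimal sketch using the
re module: the string is only the source form, and re:compile/1 turns
it into an opaque compiled pattern that re:run/3 then works with. The
module name, function name and pattern are made up for the example.

```erlang
-module(compile_demo).
-export([demo/0]).

%% Compile the textual pattern once, then match with the compiled form.
demo() ->
    Source = "[a-z][a-z0-9_]*",          % hypothetical example pattern
    {ok, Compiled} = re:compile(Source),
    %% re:run/3 accepts the compiled pattern in place of the source string.
    {match, [First]} = re:run("foo_1 bar", Compiled, [{capture, first, list}]),
    %% No lowercase letter in the subject, so there is no match.
    nomatch = re:run("123", Compiled, [{capture, first, list}]),
    First.
```

What Compiled looks like internally is the engine's business; the
caller only ever writes and reads the string.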

>> Regexps are (as you say) a structured datatype. Nobody disagrees. But
>> we have a widespread, standard and compact way to represent them.
>
> Wrong.  We have *many* ways to represent them.  We have shell
> syntax, understood by fnmatch() and glob().  We have two POSIX
> <snip>

The compact way I was referring to was as a string. The syntax of the
string's content is another issue.

> And this is another reason why trees are better.
> Because we can express a regular expression in a way that is
> independent of the target linear notation.

Only if we use the same tree representation. If each of us were to
write implementations of this library, we would get incompatible ones
(different names, maybe even different basic elements). If we use the
same library, then we could just as well agree on using POSIX string
syntax.

For me, the linear notation is not a "target" notation, it is a
"source" notation.

>> Given a compiler
>> that understands this, the following examples will generate exactly
>> the same code:
>>   identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
>>   identifier() -> "{letters}{continuers}*".
>> I know which one I find easier to read and understand.
>
> Me too:  the first one.  Because the second one is a literal string.
> It contains the _text_ l,e,t,t,e,r,s, but not in any reasonable sense
> the _identifier_ letters.  I can create the first one AT RUN TIME.

Please note that I said "Given a compiler that understands this",
meaning that the compiler would recognize {letters} as an identifier
(the regular-string syntax may be confusing; the compiler should know
that it's a regexp and not a normal string).
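
To make the comparison concrete, here is a rough sketch of how the
tree form from the example could be rendered into a linear pattern.
The definitions of letters() and continuers() are my assumption (the
example does not give them), as is the choice to store each character
set as the text that goes between the square brackets. Only the three
node types used in the example are handled.

```erlang
-module(regexp_tree).
-export([to_string/1, identifier/0]).

%% Hypothetical character sets, written as bracket-expression contents.
letters() -> "a-zA-Z".
continuers() -> "a-zA-Z0-9_".

%% The tree form of the identifier regexp from the example.
identifier() -> {seq, {cset, letters()}, {star, {cset, continuers()}}}.

%% Render a tree into PCRE-style linear syntax.
to_string({seq, A, B}) -> to_string(A) ++ to_string(B);
to_string({star, A})   -> "(?:" ++ to_string(A) ++ ")*";
to_string({cset, Set}) -> "[" ++ Set ++ "]".
```

The result of to_string(identifier()) can be handed directly to
re:run/2, so the two notations carry the same information; the
disagreement is about which one people should write.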

> Remember, I'm _also_ talking about receiving a string at run time
> and including it in a regular expression which is then included
> in something else.  I don't understand why anyone is satisfied
> with compile-time-only semi-solutions.

You lost me here; probably you went too fast. How does a tree
representation help you handle runtime strings? If you're receiving a
string at runtime, how do you suggest including it in a tree data
structure? I suppose the string could have structure too (otherwise
it's a trivial issue); wouldn't you still have to parse it? And if you
must have such a parser anyway, why not use it in the source code too?
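
For the case I called trivial, a string without structure that should
only match itself, the string-based approach needs nothing more than
quoting of the special characters. A minimal sketch, with made-up
module and function names, and with a list of special characters that
I believe covers PCRE syntax outside character classes:

```erlang
-module(regexp_literal).
-export([escape/1, contains/2]).

%% Backslash-escape every character that is special in PCRE syntax,
%% so that the result matches Text literally.
escape(Text) ->
    lists:flatmap(fun escape_char/1, Text).

escape_char(C) ->
    case lists:member(C, "\\^$.|?*+()[]{}") of
        true  -> [$\\, C];
        false -> [C]
    end.

%% True if Subject contains Text as a literal substring.
contains(Subject, Text) ->
    re:run(Subject, escape(Text), [{capture, none}]) =:= match.
```

With this, contains("price: 1.50", "1.50") is true, while
contains("price: 1x50", "1.50") is false because the dot has been
escaped. A string that has structure of its own is the case that
still needs a parser.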

> And as the same thing points out, a technique that deals with
> just ONE level of language embedding doesn't solve the problem
> generally enough.

I agree. Regexps are just a special case of a more general problem,
but they are much more widely used than most other embedded languages.
But then we are digressing from the original topic, which was about
regexps (I'm aware that this is partly my doing, sorry for that).

I will answer that in a separate message, as it feels like it is
becoming slightly off-topic.

regards,
Vlad