Adoption of perl/javascript-style regexp syntax

Thu Jun 4 03:04:39 CEST 2009

On 3 Jun 2009, at 6:59 pm, mats cronqvist wrote:

> "Richard O'Keefe" <ok@REDACTED> writes:
>
>>
>> The ultimate point though is that hacking on the language to
>> make it easier for people to do the WRONG thing does not strike
>> me as a good use of anyone's time.
>
>  A truism. But I don't think the correct definition of "WRONG" is
>  "whatever Richard O'Keefe dislikes."

Oh come *ON*.  Resort to unwarranted ad hominem attacks is an
admission of failure.

It is not that I say strings are wrong because I don't like them,
but that I do not like them because I have learned painfully and
repeatedly that they are wrong, as in difficult to use and highly
error prone.

>> That pain level is there for a good reason: if the Erlang string
>> syntax is giving you that much of a headache, it's because STRINGS
>> ARE WRONG and you should almost certainly be using trees instead.
>
>  Alas, re wants strings, and there's not much I can do about that.

Of *course* there is.

First off, who said you had to use re?

Second, Erlang/OTP is open source.  You and I and all of us have
access to the source code.  Building our own better_re that, _as
well as_ strings, _also_ accepts some kind of tree, is hardly
rocket science.  If I weren't busy working on compilers for two
other languages, preparing lectures, and marking assignments
I'd do it myself.  When I can find some breathing time, I expect
I will.

Third, who says ((we have trees) AND (re gets strings)) are
incompatible?  It's not that slashification cannot be done, it's
that it is painful to do by hand.  So who says we have to do it
by hand?  Again, it's not rocket science to write a function that
takes a tree and linearises it as a string (for re to then
parse, undoing the linearisation).  I've done it once in the past,
for Prolog to talk to C.

Let's take a very simple case:  the replacement string.
The discussion in 're' is a little vague, and a little puzzling.
Why is Perl's \0 not supported?  How do you tell whether \123
is (substring 1)23 or (substring 12)3 or (substring 123)?
Do & and \# sequences count inside binaries?

<replacement>
  ::= []                                   empty
   |  [<replacement> | <replacement>]      concatenation
   |  <character code>                     that literal character
   |  <binary>                             that binary
   |  {match,all}                          &
   |  {match,N}                            \N
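
As a minimal sketch, the grammar above could be written down as an
Erlang type plus a run-time checker.  The module name replacement_tree
and the function is_replacement/1 are hypothetical, made up for this
illustration, and the 1..9 bound on N is an assumption taken from the
guard in the code that follows rather than from the grammar itself.

```erlang
-module(replacement_tree).
-export([is_replacement/1]).
-export_type([replacement/0]).

%% The grammar as a type.  The list case allows improper lists,
%% because the tail of a cons cell is itself a <replacement>.
-type replacement() ::
        []
      | nonempty_maybe_improper_list(replacement(), replacement())
      | char()
      | binary()
      | {match, all}
      | {match, 1..9}.   % assumption: single-digit groups only

%% Returns true if Term is a well-formed replacement tree.
-spec is_replacement(term()) -> boolean().
is_replacement([]) -> true;
is_replacement([H|T]) -> is_replacement(H) andalso is_replacement(T);
is_replacement(C) when is_integer(C), C >= 0 -> true;
is_replacement(B) when is_binary(B) -> true;
is_replacement({match, all}) -> true;
is_replacement({match, N}) when is_integer(N), N >= 1, N =< 9 -> true;
is_replacement(_) -> false.
```

A checker like this lets a bad tree be rejected before anything is
handed to re, instead of failing with a function_clause error halfway
through linearisation.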

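%% Turn a replacement tree into the flat string that re:replace expects.
linearise_replacement(R) ->
     linearise_replacement(R, []).

%% E is the already-linearised tail; each clause conses onto its front.
linearise_replacement([], E) ->
     E;
linearise_replacement([H|T], E) ->
     linearise_replacement(H, linearise_replacement(T, E));
linearise_replacement(C, E) when is_integer(C), C >= 0 ->
     %% Literal character: escape the two that re treats specially.
     case C of
         $&  -> [$\\,C|E];
         $\\ -> [$\\,C|E];
         _   -> [    C|E]
     end;
linearise_replacement(B, E) when is_binary(B) ->
     %% Binaries are passed through unescaped.
     binary_to_list(B) ++ E;
linearise_replacement({match,all}, E) ->
     [$&|E];
linearise_replacement({match,N}, E)
   when is_integer(N), N >= 1, N =< 9 ->
     %% Only single-digit group numbers 1..9 are accepted.
     [$\\,N+$0|E].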

Now let's take an example from the re: manual.
<quote>
   Example:
     re:replace("abcd","c","[&]",[{return,list}]).
   gives

     "ab[c]d"
   while

     re:replace("abcd","c","[\\&]",[{return,list}]).
   gives

     "ab[&]d"
</quote>
If we define

     replace(Subject, Pattern, Replacement, Options) ->
          re:replace(Subject, Pattern,
                     linearise_replacement(Replacement), Options).

then everything becomes clear and trouble-free:

     replace("abcd", "c", "[&]", [{return,list}])
gives
     "ab[&]d"
while
     replace("abcd", "c", ["[",{match,all},"]"], [{return,list}])
gives
     "ab[c]d"

I'll have to upgrade my Erlang release to test this, but the rest
of the afternoon will be spent talking with students, so that will
have to wait.  There's already an issue about binaries and Unicode.
That's not relevant to the point, which is that providing a
nice clean _safe_ tree-based interface to something with a
string-based interface is not in fact at all hard.  It is something
we can do NOW, any of us, without language changes, because it is
NOT the language that is wrong, it's using strings.
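
For anyone who wants to try this, the pieces can be collected into one
compilable module, sketched here.  The module name tree_replace and the
demo/0 function are hypothetical additions; the linearisation logic is
the code from this message unchanged, and the two results asserted in
demo/0 are the ones the re manual excerpt gives for the corresponding
flat replacement strings.

```erlang
-module(tree_replace).
-export([replace/4, linearise_replacement/1, demo/0]).

%% Tree -> flat replacement string in the syntax re:replace/4 expects.
linearise_replacement(R) ->
    linearise_replacement(R, []).

linearise_replacement([], E) ->
    E;
linearise_replacement([H|T], E) ->
    linearise_replacement(H, linearise_replacement(T, E));
linearise_replacement(C, E) when is_integer(C), C >= 0 ->
    case C of
        $&  -> [$\\, C | E];     % literal &, so escape it
        $\\ -> [$\\, C | E];     % literal backslash, so escape it
        _   -> [C | E]
    end;
linearise_replacement(B, E) when is_binary(B) ->
    binary_to_list(B) ++ E;      % as in the post: no escaping here
linearise_replacement({match, all}, E) ->
    [$& | E];
linearise_replacement({match, N}, E)
  when is_integer(N), N >= 1, N =< 9 ->
    [$\\, N + $0 | E].

%% Same arguments as re:replace/4, but Replacement is a tree.
replace(Subject, Pattern, Replacement, Options) ->
    re:replace(Subject, Pattern,
               linearise_replacement(Replacement), Options).

%% The two cases discussed above; a failed match raises badmatch.
demo() ->
    Opts = [{return, list}],
    "ab[&]d" = replace("abcd", "c", "[&]", Opts),
    "ab[c]d" = replace("abcd", "c", ["[", {match, all}, "]"], Opts),
    ok.
```

Note that binaries still pass through unescaped here, so the question
of whether & and \N count inside binaries stays open in this sketch.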

"Strings are the opiate of the masses."