[erlang-questions] Re: Adoption of perl/javascript-style regexp syntax

Richard O'Keefe ok@REDACTED
Wed Jun 3 04:38:38 CEST 2009


On 3 Jun 2009, at 10:48 am, Ulf Wiger wrote:
> But... elaborate mechanisms to hide from the compiler?

Have you looked at some of the things that have been developed
outside Erlang to handle unholy mixtures of XML, regular expressions,
PHP/Ruby/JavaScript/whatever?

Cross-Site Scripting is a way of attacking systems that have
this kind of language mishmash.  Here's a sample URL from a
web page about XSS:

http://www.example.com/search.pl?text=<script>alert(document.cookie)</script>

Chances are that the <script>...</script> bit came from
some source that said "relax, it's just a string, relax,
it's just a string"...

"Hiding from the compiler" is not an intemperate phrase for
what's going on.

(1) There is an intrinsically STRUCTURED data type.
     It might be XML or regular expressions or JavaScript or ...

(2) That data is linearised into a string.

(3) That string is interpolated into some other structured data.

(4) Which is itself linearised into a string.

And now you have multiple levels of quoting to worry about
and if you are not very careful, XSS vulnerabilities as well.

The answer is NOT to turn things into strings.
If you have something structured, LEAVE it structured.
Don't parse it at run time.
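
To make that concrete, here is a minimal sketch of keeping a
fragment of HTML as an Erlang term and linearising it exactly once,
at the very end.  The module name html_tree, the node shapes
{text, Chars} and {Tag, Attrs, Children}, and the function names
are all invented for this illustration; they are not from any
existing library, and the sketch ignores void elements, character
encodings, and much else.

```erlang
-module(html_tree).
-export([render/1, demo/0]).

%% Hypothetical node shapes, assumed for this sketch only:
%%   {text, Chars}            character data
%%   {Tag, Attrs, Children}   element; Tag is an atom,
%%                            Attrs is a list of {Key, Value}

%% Linearise a tree to an iolist.  Escaping happens here and
%% nowhere else, so no caller ever has to remember to do it.
render({text, Chars}) ->
    escape(Chars);
render({Tag, Attrs, Children}) when is_atom(Tag) ->
    Name = atom_to_list(Tag),
    ["<", Name, [render_attr(A) || A <- Attrs], ">",
     [render(C) || C <- Children],
     "</", Name, ">"].

render_attr({Key, Value}) ->
    [" ", atom_to_list(Key), "=\"", escape(Value), "\""].

escape(Chars) ->
    lists:flatmap(fun escape_char/1, Chars).

escape_char($<)  -> "&lt;";
escape_char($>)  -> "&gt;";
escape_char($&)  -> "&amp;";
escape_char($\") -> "&quot;";
escape_char(C)   -> [C].

%% The query text from the URL above goes into the tree as DATA.
demo() ->
    Query = "<script>alert(document.cookie)</script>",
    Tree = {p, [{class, "echo"}],
            [{text, "You searched for: "}, {text, Query}]},
    lists:flatten(render(Tree)).
```

In this sketch the hostile query can only ever be a {text, ...}
leaf, so it comes out as &lt;script&gt;... rather than as markup.
There is one level of quoting, and it lives in one function.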

> Regexps are not the only strings where escaping can be
> an issue.

Exactly so.  But let me rephrase that:
regexps are not the only DATA TYPE where people run into
serious trouble because they insist on treating them as
strings when they really aren't.

We should represent regexps as regexp _trees_.
We should represent XML as XML _trees_.
We should represent CSS data as CSS _trees_.
Linearising is the penultimate step of processing, just before
something is written to a file or socket &c.
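
As a rough sketch of what that could look like for regexps, here
is a tiny tree representation with a linearising function that
emits PCRE-style syntax for the re module.  The module name
re_tree and the node shapes (lit, any, seq, alt, star) are
assumptions made up for this example, not an existing API, and
only a handful of constructs are covered.  It parenthesises far
more than it needs to, which is harmless because nobody has to
read the output.

```erlang
-module(re_tree).
-export([to_string/1, demo/0]).

%% Hypothetical tree shapes, assumed for this sketch only:
%%   {lit, Chars}   match Chars literally
%%   any            match any one character
%%   {seq, [R]}     match each R in order
%%   {alt, [R]}     match any one of the Rs
%%   {star, R}      match R zero or more times

%% Linearise a regexp tree to a pattern string.  Metacharacter
%% quoting is done here, once, for literals only.
to_string({lit, Chars}) ->
    lists:flatmap(fun quote_char/1, Chars);
to_string(any) ->
    ".";
to_string({seq, Rs}) ->
    "(?:" ++ lists:append([to_string(R) || R <- Rs]) ++ ")";
to_string({alt, Rs}) ->
    Alts = [to_string(R) || R <- Rs],
    "(?:" ++ lists:append(lists:join("|", Alts)) ++ ")";
to_string({star, R}) ->
    "(?:" ++ to_string(R) ++ ")*".

%% Put a backslash in front of anything special outside a class.
quote_char(C) ->
    case lists:member(C, "\\^$.[]|()?*+{}") of
        true  -> [$\\, C];
        false -> [C]
    end.

demo() ->
    Tree = {seq, [{lit, "a.b"},
                  {star, any},
                  {alt, [{lit, "(1)"}, {lit, "+"}]}]},
    Pattern = to_string(Tree),
    {Pattern,
     re:run("a.bxx(1)", Pattern, [{capture, none}]),
     re:run("aXbxx+",   Pattern, [{capture, none}])}.
```

The tree in demo() contains a dot, parentheses and a plus sign as
literal text, and nobody had to count backslashes to get them
there: the first subject should match and the second should not,
because the "." in "a.b" is quoted when the tree is linearised.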

Strings are for *storing* and *transmitting* information.
They are a really lousy tool for *processing* information.

If you are fortunate enough to have a programming language
like Lisp or Erlang (or even, to some extent, Eiffel) where
you _can_ easily write tree structures _as_ tree structures,
it is really very foolish to try to write them as strings.

Of course, if you are fetching information out of a data base,
then strings are probably what you're going to get, though I
note that XML is part of the current SQL standard and that
even the free version of DB2 from IBM is supposed to speak
XML "natively".  And of course Erlang has Mnesia, meaning that
storing logically tree-structured data _as_ trees is the more
attractive option.  Heck, even JavaScript has JSON.


> I think most of us have on occasion come across
> a problem where the string syntax in erlang creates
> unwanted noise, but not at the pain level where it would
> be warranted to start inventing a preprocessor step
> (which I find much more elaborate than accepting an alternative
> way of entering strings - something that many language
> environments already provide.)

De gustibus non est disputandum.
I find writing and using a preprocessor far easier than
hacking on the lexical analyser for the language.  It took
me just 40 SLOC of C, and now I have the _same_ "bigquote"
processor for all of C, C++, AWK, Erlang, Haskell, SML, and Prolog.
Without hacking on any compilers whatever.

The ultimate point though is that hacking on the language to
make it easier for people to do the WRONG thing does not strike
me as a good use of anyone's time.  That pain level is there
for a good reason:  if the Erlang string syntax is giving you
that much of a headache, it's because STRINGS ARE WRONG and you
should almost certainly be using trees instead.

I really shouldn't have mentioned the .{{ ... .}} hack, because
the real point there was DON'T CHANGE THE LANGUAGE because in
this case it's not the language that's wrong.


