[erlang-questions] re: re:pain (and stripping whitespace from text)
Sun Mar 14 03:45:17 CET 2010
Actually, with pure regular expressions this is not easy, and for some uses (e.g. nested quotes) it is not possible at all.
> "some quotes", with, "some more quotes", and, yet, "even more quotes"
That said, many regex libraries have extensions that make this possible, but painful. See here:
Perl: http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f (also "recursive patterns")
PCRE: http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions (look for "recursive patterns")
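
For the flat, non-nested case in the question, one common workaround is an alternation that matches quoted strings first and copies them through unchanged. Below is a minimal sketch, assuming quotes are never nested or escaped; the module and function names are made up for illustration and it is still very much a hack:

```erlang
-module(strip_ws_re).
-export([strip/1]).

%% Hypothetical helper, not taken from the original thread.
%% Alternative 1 captures a double-quoted string in group 1.
%% Alternative 2 matches a run of whitespace and leaves group 1 unset.
%% The replacement "\\1" (backslash-1) therefore puts quoted strings back
%% untouched and inserts nothing where only whitespace matched.
strip(Bin) when is_binary(Bin) ->
    re:replace(Bin,
               <<"(\"[^\"]*\")|[ \t\r\n]+">>,
               <<"\\1">>,
               [global, {return, binary}]).
```

This relies on the regex engine trying the quoted-string alternative before the whitespace one at each position, so it stops working as soon as a quote can appear inside a quoted string.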
Also, note that backtracking and recursion can create very undesirable behavior in terms of stack usage and execution time. This is so problematic that Google has written a special regex library that uses research in automata theory to make it behave better. See here:
I think that the moral of this story is that every way to do this with regexps is a hack, and you probably shouldn't. It looks like you're parsing CSV data. You might try this relatively simple recursive-descent parser:
I'm willing to bet that it's not particularly fast, but it probably works well enough. If you need more speed, you might try implementing a parser with yecc/leex, or even do something really exciting like writing an erl_nif interface to libcsv.
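
A hand-written parser in the recursive-descent style might look roughly like the sketch below. This is not the code from the original message: the module and function names are hypothetical, and it assumes a single comma-separated record, whitespace outside quotes that can simply be dropped, and a doubled quote ("") as an escaped quote inside a quoted field.

```erlang
-module(csv_sketch).
-export([parse/1]).

%% Hypothetical module; a sketch under the assumptions stated above.

%% parse(Bin) -> [Field]
%% Split one comma-separated record into a list of binaries.
parse(Bin) when is_binary(Bin) ->
    fields(Bin, <<>>, []).

%% fields(Rest, CurrentField, FieldsSoFar)
fields(<<>>, Cur, Acc) ->
    lists:reverse([Cur | Acc]);
fields(<<$,, Rest/binary>>, Cur, Acc) ->
    %% A comma outside quotes ends the current field.
    fields(Rest, <<>>, [Cur | Acc]);
fields(<<$", Rest/binary>>, Cur, Acc) ->
    %% Opening quote: hand over to the quoted-text rule.
    {Cur1, Rest1} = quoted(Rest, Cur),
    fields(Rest1, Cur1, Acc);
fields(<<C, Rest/binary>>, Cur, Acc)
  when C =:= $\s; C =:= $\t; C =:= $\r; C =:= $\n ->
    %% Whitespace outside quotes is discarded.
    fields(Rest, Cur, Acc);
fields(<<C, Rest/binary>>, Cur, Acc) ->
    fields(Rest, <<Cur/binary, C>>, Acc).

%% quoted(Rest, CurrentField) -> {CurrentField1, Rest1}
%% Consume characters verbatim up to the closing quote.
quoted(<<$", $", Rest/binary>>, Cur) ->
    %% Doubled quote inside a quoted field is a literal quote.
    quoted(Rest, <<Cur/binary, $">>);
quoted(<<$", Rest/binary>>, Cur) ->
    {Cur, Rest};
quoted(<<C, Rest/binary>>, Cur) ->
    quoted(Rest, <<Cur/binary, C>>);
quoted(<<>>, Cur) ->
    %% Unterminated quote: return what has been read so far.
    {Cur, <<>>}.
```

With the input from the question, csv_sketch:parse(<<"a, \tb, \"quoted string\", \n c, d">>) should return [<<"a">>, <<"b">>, <<"quoted string">>, <<"c">>, <<"d">>]. Appending to a binary one byte at a time is simple rather than fast, which fits the caveat about speed.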
On Mar 13, 2010, at 2:35 PM, Steve Davis wrote:
> I've been confounded again by re, trying to strip whitespace from
> binary text, as the obvious "[ \t\r\n]+", as in...
> list_to_binary(re:replace(<<"a, \tb, \"quoted string\", \n c, d">>,
> <<"[ \t\r\n]+">>, <<>>, [global]))
> ...results in...
> ..I know there must be a regex that would avoid the stripping inside
> the quotes, but no amount of experiment (or google) has yielded a
> suitable result for me.
> Can anybody immediately see a solution (and put me out of my pain)?
> Thanks in advance,