Fixes for unicode handling in re module

Rory Byrne <>
Sat Jan 16 18:46:25 CET 2010


I've added fixes for three unicode-related bugs in the re module:

    git fetch git:// stdlib_re_unicode_fixes

There's actually a fourth bug, in re:split/3, but it won't show
up in tests at the moment because of a bug in re:compile where the
unicode flag is ignored (which I mentioned on the erlang-bugs
mailing list). It's pretty much the same bug as the one fixed in
5073a1b7b, and it's mentioned in the comment for that commit.

On a side note, I think it would be really helpful if re:run/3 were
changed to accept chardata (charlists or binaries) rather than just
charlists. At the moment it's easy to get caught out by the fact that
it won't accept binaries for the Subject when unicode REs are used.

Is this sensible?

    Eshell V5.7.5  (abort with ^G)
    1> re:run("hello", "\x{400}", [unicode]).
    2> re:run(<<"hello">>, "\x{400}", [unicode]).
    ** exception error: bad argument
         in function  re:run/3
            called as re:run(<<"hello">>,[1024],[unicode])
    3> re:run([<<"hello">>], "\x{400}", [unicode]).

I've already fallen victim to this on a number of occasions, and
you can see from the patches that I'm not alone. I'm not a C coder
so I can't offer a patch for the re:run BIF. But if you want to
rename the BIF to 'do_run', or whatever, I'd happily write up a
run/3 in Erlang that normalises the input and then calls it!
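
To make the idea concrete, here is a rough sketch of the kind of wrapper
I have in mind. It is only an illustration: the module name re_chardata
is made up, it delegates to the existing re:run/3 rather than to a
renamed BIF (there is no 'do_run' today), and it assumes that turning
both Subject and RE into UTF-8 binaries is acceptable whenever the
unicode option is given.

```erlang
-module(re_chardata).
-export([run/3]).

%% Hypothetical wrapper around re:run/3 (not part of OTP).
%% With the unicode option, Subject and RE are normalised to UTF-8
%% binaries before the real function is called. Without it, the
%% arguments are passed through untouched.
run(Subject, RE, Options) ->
    case lists:member(unicode, Options) of
        true ->
            re:run(to_utf8(Subject), normalise_re(RE), Options);
        false ->
            re:run(Subject, RE, Options)
    end.

%% A pattern precompiled with re:compile is a tuple; leave it alone.
normalise_re(RE) when is_tuple(RE) -> RE;
normalise_re(RE) -> to_utf8(RE).

%% Flatten chardata (charlists, UTF-8 binaries, or a mix) into a
%% single UTF-8 binary. Invalid or incomplete input becomes badarg.
to_utf8(Data) ->
    case unicode:characters_to_binary(Data, unicode, unicode) of
        Bin when is_binary(Bin) -> Bin;
        _Error -> erlang:error(badarg, [Data])
    end.
```

With something like this, a call such as
re_chardata:run(<<"hello">>, "\x{400}", [unicode]) would take the same
path as the charlist version. Whether the normalisation belongs in an
Erlang wrapper or in the BIF itself is of course up to you.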

By the way, both the unicode and the re modules are amazingly useful.
Thanks so much for adding them!



