Fixes for unicode handling in re module

Rory Byrne <>
Sat Jan 16 18:46:25 CET 2010


I've added fixes for three unicode-related bugs in the re module:

    git fetch git:// stdlib_re_unicode_fixes

There's actually a fourth bug, in re:split/3, but it won't show up
in tests at the moment because of a bug in re:compile that causes the
unicode flag to be ignored (I mentioned this on the erlang-bugs
mailing list). It's pretty much the same as the one fixed in 5073a1b7b
and it's mentioned in the comment for that commit.

On a side note, I think it would be really helpful if re:run/3 were
changed to accept chardata (charlists or binaries) rather than just
charlists. At the moment it's a hazard that it won't accept binaries
as Subject input when unicode REs are used.

Is this sensible?

    Eshell V5.7.5  (abort with ^G)
    1>  re:run("hello", "\x{400}", [unicode]).
    2> re:run(<<"hello">>, "\x{400}", [unicode]).
    ** exception error: bad argument
         in function  re:run/3
            called as re:run(<<"hello">>,[1024],[unicode])
    3> re:run([<<"hello">>], "\x{400}", [unicode]).

I've already fallen victim to this on a number of occasions, and
you can see from the patches that I'm not alone. I'm not a C coder
so I can't offer a patch for the re:run BIF. But if you want to
rename the BIF to 'do_run', or whatever, I'd happily write up a
run/3 in Erlang that normalises input and then calls it!
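
For illustration, here is a minimal sketch of what such a wrapper might
look like. The module name re_chardata is made up for this example, and
it calls the existing re:run/3 at the point where the renamed BIF
(do_run) would be called. It assumes that any binaries in Subject are
UTF-8 encoded whenever the unicode option is given, and it only looks
for the unicode flag in Options, not inside a precompiled RE.

```erlang
-module(re_chardata).
-export([run/3]).

%% Hypothetical wrapper, not part of OTP: normalise Subject to a
%% charlist before matching when the unicode option is given.
run(Subject, RE, Options) ->
    case lists:member(unicode, Options) of
        true ->
            %% characters_to_list/1 accepts chardata (charlists,
            %% UTF-8 binaries, or nested mixes of the two).
            case unicode:characters_to_list(Subject) of
                Chars when is_list(Chars) ->
                    %% The renamed BIF would be called here instead.
                    re:run(Chars, RE, Options);
                _ErrorOrIncomplete ->
                    %% Invalid or truncated UTF-8 in Subject.
                    erlang:error(badarg, [Subject, RE, Options])
            end;
        false ->
            %% Without the unicode option, pass Subject through as is.
            re:run(Subject, RE, Options)
    end.
```

With something like this, calls 2 and 3 in the session above would be
handled the same way as call 1, since all three subjects normalise to
the same charlist.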

By the way, both the unicode and the re modules are amazingly useful.
Thanks so much for adding them!


