[erlang-bugs] Incorrect matching of non-ASCII characters using \w regex in unicode mode

Tue Apr 1 10:05:54 CEST 2014

Hello,

As far as I can tell, this is a flaw/optimization in how PCRE works. You
get the same behaviour in Erlang R16B03.

However, later versions of PCRE introduced an option called PCRE_UCP to
deal with this, and in OTP 17.0 the bundled PCRE has been upgraded to
8.33, which includes this option. So in OTP 17, pass the ucp option:

1> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode, ucp]).
nomatch
2> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode, ucp]).
{match,["Götterfunken"]}
3> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode]).
{match,["ö"]}
4> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode]).
{match,["Götterfunken"]}

It seems like the PCRE docs agree with the Erlang docs on how \w and \W
should be treated, but they also add that "although this may vary for
characters in the range 128-255 when locale-specific matching is
happening". Maybe that is the cause of the confusion?

Lukas

On 28/03/14 14:51, Peter Minten wrote:
> re:run doesn't properly handle non-ASCII characters using unicode mode.
> On R17-rc2:
>
> 1> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode]).
> {match,["Götterfunken"]}
> 2> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode]).
> {match,["ö"]}
>
> Apparently ö is both a word and a non-word character.
>
> http://www.erlang.org/doc/man/re.html#regexp_syntax says:
>
> """In UTF-8 mode, characters with values greater than 128 never match
> \d, \s, or \w, and always match \D, \S, and \W. This is true even when
> Unicode character property support is available. These sequences retain
> their original meanings from before UTF-8 support was available, mainly
> for efficiency reasons."""
>
> As I understand this a \w regex should never match ö.
>
> Greetings,
>
> Peter
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>