[erlang-bugs] Incorrect matching of non-ASCII characters using \w regex in unicode mode
Lukas Larsson
lukas@REDACTED
Tue Apr 1 10:05:54 CEST 2014
Hello,
As far as I can tell, this is a flaw (or an optimization) in how PCRE
works. You get the same behaviour in Erlang R16B03.
However, later versions of PCRE introduced an option called PCRE_UCP to
deal with this, and in OTP 17.0 the bundled PCRE has been upgraded to
8.33, which includes this option. So in OTP 17, use:
1> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode, ucp]).
nomatch
2> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode, ucp]).
{match,["Götterfunken"]}
3> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode]).
{match,["ö"]}
4> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode]).
{match,["Götterfunken"]}
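If you need this in more than one place, something like the following
might do. It is only a sketch, assuming OTP 17.0 or later (where the ucp
option exists); the module name word_match and the function words/1 are
made up for illustration and are not part of OTP:

```erlang
%% Hypothetical helper module, not part of OTP.
-module(word_match).
-export([words/1]).

%% Return every run of word characters in Subject as a list of strings.
%% The ucp option makes \w use Unicode properties, so letters such as
%% "ö" count as word characters. Assumes OTP 17.0 or later.
-spec words(unicode:chardata()) -> [string()].
words(Subject) ->
    Opts = [global, {capture, first, list}, unicode, ucp],
    case re:run(Subject, "\\w+", Opts) of
        %% With global, each match comes back as its own one-element list.
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch -> []
    end.
```

Calling word_match:words("Götterfunken und Freude") should then give you
one string per word, and an input with no word characters gives [].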
The PCRE docs seem to agree with the Erlang docs on how \w and \W
should be treated, but they also add that "although this may vary for
characters in the range 128-255 when locale-specific matching is
happening". Maybe that is the cause of the confusion?
Lukas
On 28/03/14 14:51, Peter Minten wrote:
> re:run doesn't properly handle non-ASCII characters using unicode mode.
> On R17-rc2:
>
> 1> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode]).
> {match,["Götterfunken"]}
> 2> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode]).
> {match,["ö"]}
>
> Apparently ö is both a word and a non-word character.
>
> http://www.erlang.org/doc/man/re.html#regexp_syntax says:
>
> """In UTF-8 mode, characters with values greater than 128 never match
> \d, \s, or \w, and always match \D, \S, and \W. This is true even when
> Unicode character property support is available. These sequences retain
> their original meanings from before UTF-8 support was available, mainly
> for efficiency reasons."""
>
> As I understand this, a \w regex should never match ö.
>
> Greetings,
>
> Peter