[erlang-bugs] Incorrect matching of non-ASCII characters using \w regex in unicode mode

Peter Minten peter.minten@REDACTED
Fri Mar 28 14:51:34 CET 2014


re:run does not handle non-ASCII characters consistently when \w and \W are used in unicode mode.
On R17-rc2:

1> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode]).
{match,["Götterfunken"]}
2> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode]).
{match,["ö"]}

Apparently ö is both a word and a non-word character.

http://www.erlang.org/doc/man/re.html#regexp_syntax says:

"""In UTF-8 mode, characters with values greater than 128 never match
\d, \s, or \w, and always match \D, \S, and \W. This is true even when
Unicode character property support is available. These sequences retain
their original meanings from before UTF-8 support was available, mainly
for efficiency reasons."""

As I understand this, a \w regex should never match ö, so the result of the first call contradicts the documentation.
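
To make the inconsistency easier to check on other releases, here is a minimal sketch that tests a single code point against both \w and \W. The module name word_class_check and the functions classify/1 and matches/2 are my own hypothetical helpers, not part of the re module; the only library call used is re:run/3 with the unicode and {capture, none} options.

```erlang
%% Hypothetical helper module for illustration; not part of OTP.
-module(word_class_check).
-export([classify/1]).

%% Returns {MatchesWord, MatchesNonWord} for one Unicode code point.
%% For a single character exactly one of the two should be true.
classify(CodePoint) when is_integer(CodePoint) ->
    Subject = [CodePoint],
    {matches(Subject, "^\\w$"), matches(Subject, "^\\W$")}.

%% With {capture, none}, re:run/3 returns the atom match or nomatch.
matches(Subject, Pattern) ->
    case re:run(Subject, Pattern, [unicode, {capture, none}]) of
        match   -> true;
        nomatch -> false
    end.
```

For ASCII input the result is unambiguous: word_class_check:classify($a) gives {true,false} and word_class_check:classify($!) gives {false,true}. Calling word_class_check:classify(16#F6) tests ö; a result of {true,true} or {false,false} would indicate an inconsistency of the kind reported above.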

Greetings,

Peter


