[erlang-bugs] Incorrect matching of non-ASCII characters using \w regex in unicode mode
Peter Minten
peter.minten@REDACTED
Fri Mar 28 14:51:34 CET 2014
re:run doesn't handle non-ASCII characters consistently in unicode mode.
On R17-rc2:
1> re:run("Götterfunken", "\\w+", [{capture, all, list}, unicode]).
{match,["Götterfunken"]}
2> re:run("Götterfunken", "\\W+", [{capture, all, list}, unicode]).
{match,["ö"]}
Apparently ö is both a word and a non-word character.
http://www.erlang.org/doc/man/re.html#regexp_syntax says:
"""In UTF-8 mode, characters with values greater than 128 never match
\d, \s, or \w, and always match \D, \S, and \W. This is true even when
Unicode character property support is available. These sequences retain
their original meanings from before UTF-8 support was available, mainly
for efficiency reasons."""
As I understand this, a \w regex should never match ö.
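For comparison, here is a minimal sketch of how the documented behaviour can be pinned down with explicit ASCII character classes instead of \w and \W. The module and function names (ascii_word, first_word, first_non_word) are hypothetical and only there for illustration; re:run/3 and its options are the same ones used in the shell session above. This is a workaround under the assumption that ASCII-only word matching is what is wanted, not a fix for re itself.

```erlang
-module(ascii_word).
-export([first_word/1, first_non_word/1]).

%% Hypothetical helpers; the only real OTP call here is re:run/3.

%% First run of ASCII word characters, i.e. what \w+ should match
%% if it behaved as the quoted documentation describes.
first_word(Subject) ->
    re:run(Subject, "[A-Za-z0-9_]+", [{capture, all, list}, unicode]).

%% First run of characters outside that ASCII set, i.e. the
%% documented meaning of \W+.
first_non_word(Subject) ->
    re:run(Subject, "[^A-Za-z0-9_]+", [{capture, all, list}, unicode]).
```

With these classes, ascii_word:first_word("Götterfunken") returns {match,["G"]} and ascii_word:first_non_word("Götterfunken") returns {match,[[246]]}, where 246 is the code point of ö, so ö is only ever treated as a non-word character.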
Greetings,
Peter