[erlang-questions] Word boundary assertion matching for unicode strings in re module

Victor Antonovich v.antonovich@REDACTED
Wed Nov 21 10:22:45 CET 2012


Hello!

It looks like the Erlang re module can't match the word boundary assertion
(\b) next to non-Latin characters in Unicode strings:

$ erl
Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:8:8] [async-threads:0] [kernel-poll:false]

Eshell V5.9.2  (abort with ^G)
1> {_, R} = re:compile("\\b\\p{L}+\\b", [unicode, caseless]).
{ok,{re_pattern,0,1,
                <<69,82,67,80,61,0,0,0,1,8,0,0,1,0,0,0,0,0,0,0,0,0,0,...>>}}
2> re:run("abc 123 def", R, [global]).
{match,[[{0,3}],[{8,3}]]}
3> re:run("abc 123 абв", R, [global]).
{match,[[{0,3}]]}
4> "abc 123 абв".
[97,98,99,32,49,50,51,32,1072,1073,1074]
5> {_, R1} = re:compile("\\p{L}+", [unicode, caseless]).
{ok,{re_pattern,0,1,
                <<69,82,67,80,59,0,0,0,1,8,0,0,1,0,0,0,0,0,0,0,0,0,0,...>>}}
6> re:run("abc 123 def", R1, [global]).
{match,[[{0,3}],[{8,3}]]}
7> re:run("abc 123 абв", R1, [global]).
{match,[[{0,3}],[{8,6}]]}
8>
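
For comparison, here is a minimal sketch of a pattern that avoids \b
altogether and spells the boundary out with lookarounds on \p{L}. It assumes
that \b is decided from PCRE's ASCII-only character tables while \p{L} uses
Unicode properties, which is what the session above suggests. The module
name and the word_spans/1 helper are made up for illustration, and the
subject is given as a list of code points so the source file stays ASCII.

```erlang
-module(word_boundary_sketch).
-export([word_spans/1]).

%% Hypothetical helper, not part of the re module.
%% Finds runs of Unicode letters that are not adjacent to another letter,
%% using (?<!\p{L}) and (?!\p{L}) in place of the two \b assertions.
word_spans(Subject) ->
    {ok, MP} = re:compile("(?<!\\p{L})\\p{L}+(?!\\p{L})",
                          [unicode, caseless]),
    %% Offsets in the result are byte offsets into the UTF-8 encoding.
    re:run(Subject, MP, [global]).
```

Called with the code points from command 4,
word_spans([97,98,99,32,49,50,51,32,1072,1073,1074]) should find both words,
the same spans that the plain \p{L}+ pattern returns in command 7. Note that
this only treats letters as word characters, so it is not a full replacement
for \b, which also counts digits and the underscore.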

Is this the intended behaviour, or have I missed something?

Regards,
Victor.


