[erlang-questions] Word boundary assertion matching for unicode strings in re module
Patrik Nyblom
pan@REDACTED
Wed Nov 21 11:31:40 CET 2012
Hi!
On 11/21/2012 10:22 AM, Victor Antonovich wrote:
> Hello!
>
> It looks like the Erlang re module can't match the word boundary
> assertion (\b) for non-Latin characters in Unicode strings:
>
> $ erl
> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:8:8] [async-threads:0] [kernel-poll:false]
>
> Eshell V5.9.2 (abort with ^G)
> 1> {_, R} = re:compile("\\b\\p{L}+\\b", [unicode, caseless]).
> {ok,{re_pattern,0,1,
> <<69,82,67,80,61,0,0,0,1,8,0,0,1,0,0,0,0,0,0,0,0,0,0,...>>}}
> 2> re:run("abc 123 def", R, [global]).
> {match,[[{0,3}],[{8,3}]]}
> 3> re:run("abc 123 абв", R, [global]).
> {match,[[{0,3}]]}
> 4> "abc 123 абв".
> [97,98,99,32,49,50,51,32,1072,1073,1074]
> 5> {_, R1} = re:compile("\\p{L}+", [unicode, caseless]).
> {ok,{re_pattern,0,1,
> <<69,82,67,80,59,0,0,0,1,8,0,0,1,0,0,0,0,0,0,0,0,0,0,...>>}}
> 6> re:run("abc 123 def", R1, [global]).
> {match,[[{0,3}],[{8,3}]]}
> 7> re:run("abc 123 абв", R1, [global]).
> {match,[[{0,3}],[{8,6}]]}
> 8>
>
> Is this intended behaviour, or did I miss something?
No, not really intended, but it's kind of a known limitation...
The pcre_exec.c code of the version we use says:
/* Find out if the previous and current characters are "word" characters.
   It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
   be "non-word" characters. */
...
prev_is_word = c < 256 && (md->ctypes[c] & ctype_word) != 0;
...
cur_is_word = c < 256 && (md->ctypes[c] & ctype_word) != 0;
...
- so, it's a design choice in the PCRE library version we use. It seems
to be fixed in the recent pcre-2.83, but...
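
To make the effect of that check concrete, here is a minimal standalone sketch of the same logic. It is not the PCRE source: is_word_old, is_word_boundary and sample are hypothetical names, and the ASCII-only test stands in for the real ctypes table (an assumption; the real table is locale-dependent and covers all of 0..255).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Code points of "abc 123 абв", as printed in the shell session above. */
const unsigned int sample[] = {97, 98, 99, 32, 49, 50, 51, 32, 1072, 1073, 1074};
const size_t sample_len = sizeof sample / sizeof sample[0];

/* Simplified stand-in for (md->ctypes[c] & ctype_word): ASCII letters,
   digits and underscore count as "word" characters (assumption). */
bool is_word_old(unsigned int c)
{
    if (c >= 256)
        return false; /* the limitation: code points > 255 are never "word" */
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

/* \b holds at pos when exactly one of the two neighbouring code points
   is a word character; the start and end of the subject count as non-word. */
bool is_word_boundary(const unsigned int *s, size_t len, size_t pos)
{
    assert(pos <= len);
    bool prev_is_word = pos > 0 && is_word_old(s[pos - 1]);
    bool cur_is_word = pos < len && is_word_old(s[pos]);
    return prev_is_word != cur_is_word;
}
```

With this model, is_word_boundary(sample, sample_len, 8) is false: the space before "абв" and the first Cyrillic letter (1072) are both classed as non-word, so no boundary is seen there or at the end of the string. That is consistent with "\\b\\p{L}+\\b" finding only "abc" in the session above.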
The use of PCRE in Erlang involves a whole lot of hacking to adapt PCRE
to Erlang's execution model, where the schedulers have to regain control
within a limited time frame. So whenever we choose to switch PCRE
library versions, we have quite some work ahead of us. Work that has to
be done with caution. In other words, you will have to wait a while for
the adoption of 2.83, although it will happen (I would say, looking at
the current backlogs, there is no chance of having it in place for R16B :()
> Regards,
> Victor.
Cheers,
/Patrik