[erlang-questions] Word boundary assertion matching for unicode strings in re module

Patrik Nyblom pan@REDACTED
Wed Nov 21 11:31:40 CET 2012


Hi!

On 11/21/2012 10:22 AM, Victor Antonovich wrote:
> Hello!
>
> It looks like the Erlang re module can't match the word boundary assertion (\b)
> for non-Latin characters in unicode strings:
>
> $ erl
> Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:8:8] [async-threads:0]
> [kernel-poll:false]
>
> Eshell V5.9.2  (abort with ^G)
> 1> {_, R} = re:compile("\\b\\p{L}+\\b", [unicode, caseless]).
> {ok,{re_pattern,0,1,
>                  <<69,82,67,80,61,0,0,0,1,8,0,0,1,0,0,0,0,0,0,0,0,0,0,...>>}}
> 2> re:run("abc 123 def", R, [global]).
> {match,[[{0,3}],[{8,3}]]}
> 3> re:run("abc 123 абв", R, [global]).
> {match,[[{0,3}]]}
> 4> "abc 123 абв".
> [97,98,99,32,49,50,51,32,1072,1073,1074]
> 5> {_, R1} = re:compile("\\p{L}+", [unicode, caseless]).
> {ok,{re_pattern,0,1,
>                  <<69,82,67,80,59,0,0,0,1,8,0,0,1,0,0,0,0,0,0,0,0,0,0,...>>}}
> 6> re:run("abc 123 def", R1, [global]).
> {match,[[{0,3}],[{8,3}]]}
> 7> re:run("abc 123 абв", R1, [global]).
> {match,[[{0,3}],[{8,6}]]}
> 8>
>
> Is it intended behaviour or did I miss something?
No, not really intended, but it's kind of a known limitation...

The pcre_exec.c code of the version we use says:

       /* Find out if the previous and current characters are "word" characters.
       It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
       be "non-word" characters. */
...
prev_is_word = c < 256 && (md->ctypes[c] & ctype_word) != 0;
...
cur_is_word = c < 256 && (md->ctypes[c] & ctype_word) != 0;
...

- so, it's a design choice in the PCRE library we use. It seems fixed in 
the recent pcre-2.83, but...

The use of PCRE in Erlang involves a whole lot of hacking to adapt PCRE 
to Erlang's execution model, where the schedulers have to regain control 
within a limited time frame. So whenever we choose to switch PCRE 
library, we have quite some work ahead of us. Work that has to be done 
with caution. In other words, you will have to wait a while for an 
adoption of 2.83, although it will happen (I would say, looking at the 
current backlogs, there is no chance of having it in place for R16B :()
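
To see why the Cyrillic word is missed, the word test quoted from pcre_exec.c above can be mimicked in a few lines of standalone C. This is only a sketch under assumptions: is_word_char and is_word_boundary are hypothetical names, and isalnum plus '_' stands in for the real md->ctypes table lookup, which is not reproduced here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for (md->ctypes[c] & ctype_word) != 0.
   As in the quoted code, code points above 255 are never "word" characters. */
static int is_word_char(unsigned int c)
{
    return c < 256 && (isalnum((int)c) || c == '_');
}

/* Returns 1 if \b would match at position pos of a code point array,
   i.e. exactly one of the previous and current characters is a word
   character. The start and end of the subject count as non-word. */
static int is_word_boundary(const unsigned int *s, size_t len, size_t pos)
{
    int prev_is_word = pos > 0 && is_word_char(s[pos - 1]);
    int cur_is_word = pos < len && is_word_char(s[pos]);
    return prev_is_word != cur_is_word;
}
```

With the code points from the shell session ([97,98,99,32,49,50,51,32,1072,1073,1074]), position 8 has a space before it and 1072 after it. Both are treated as non-word characters, so no boundary is reported there and \b\p{L}+\b cannot match the last word.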
> Regards,
> Victor.
Cheers,
/Patrik