[erlang-bugs] different behaviour of re:replace for directly-specified and precompiled regular expressions

Robin Haberkorn rh@REDACTED
Mon Sep 12 16:15:31 CEST 2011


Hello,

I think I may have discovered a bug in the stdlib 're' module.

For some Erlang strings, re:replace behaves differently
depending on whether the regular expression was precompiled
with re:compile using the 'unicode' option, or passed
uncompiled to re:replace with 'unicode' in its options list.

I've minimized the test case using PropEr.
Have a look at the following erl session:

Erlang R14B03 (erts-5.8.4) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.8.4  (abort with ^G)
1> RegExp = ".".
"."
2> {ok, RegExpC} = re:compile(RegExp, [unicode]).
{ok,{re_pattern,0,1,
                <<69,82,67,80,56,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>}}
3> re:replace([133], RegExp, " ", [unicode, global]).
[<<" ">>]
4> re:replace([133], RegExpC, " ", [global]).
[133]
5> unicode:characters_to_binary(re:replace([133], RegExp, " ", [unicode, global])).
<<" ">>
6> unicode:characters_to_binary(re:replace([133], RegExpC, " ", [global])).         
<<194,133>>
7>

That is, in (4) the replacement simply isn't performed, although
[133] should be a valid Unicode charlist and 133 a valid
Unicode code point.
I discovered this by running re:replace on io_lib:format
return values. Unless I'm totally confused by Erlang's
Unicode handling, io_lib:format without the Unicode
translation modifier returns a (deep) list of byte()s.
Since these are lists of integers rather than binaries, the
UTF-8 binary encoding should not matter, and every integer
returned is a valid Unicode code point
(unicode:characters_to_binary does not complain about any of
the lists that cause these problems with re:replace).
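
To make the distinction concrete, here is a minimal sketch of how the same
integer list encodes when read as iodata (one byte per integer) versus as a
Unicode charlist (one code point per integer). The module and function names
are hypothetical and not part of the original report; only
iolist_to_binary/1 and unicode:characters_to_binary/1 are standard calls.
Whether this difference is what actually separates the two re:replace calls
is an assumption, not something verified against the 're' source.

```erlang
-module(charlist_vs_bytes).
-export([encode_both/1]).

%% Hypothetical helper: encode an integer list in two ways.
%% AsBytes treats each integer as a byte (iodata); integers above 255
%% are not valid iodata, so the badarg error is caught and reported.
%% AsChars treats each integer as a Unicode code point and UTF-8 encodes it.
encode_both(List) ->
    AsBytes = try iolist_to_binary(List)
              catch error:badarg -> badarg
              end,
    AsChars = unicode:characters_to_binary(List),
    {AsBytes, AsChars}.
```

For [133] this returns {<<133>>, <<194,133>>}, and for [256] it returns
{badarg, <<196,128>>}: the byte reading and the code-point reading already
disagree for 133, and only the code-point reading accepts 256.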

Moreover consider the following difference:

10> re:replace([256], RegExpC, " ", [global]).                               
** exception error: bad argument
     in function  re:replace/4
        called as re:replace([256],
                             {re_pattern,0,1,
                                         <<69,82,67,80,56,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>},
                             " ",
                             [global])
11> re:replace([256], RegExp, " ", [unicode,global]).
[<<" ">>]

It is almost as if re:replace expected only byte()s in (10).
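
The comparison can be wrapped in a small harness so that any subject can be
checked in one call. This is only a sketch: the module name, the function
names and the result tags are hypothetical, and the third variant (encoding
the subject to a UTF-8 binary before handing it to the precompiled pattern)
is an untested guess at a workaround, not a confirmed fix.

```erlang
-module(re_replace_compare).
-export([compare/1]).

%% Hypothetical harness: run the same replacement three ways and
%% return the results side by side so differences are easy to spot.
compare(Subject) ->
    Pattern = ".",
    {ok, Compiled} = re:compile(Pattern, [unicode]),
    %% 1. Uncompiled pattern, 'unicode' given to re:replace itself.
    Direct = safe(fun() ->
                      re:replace(Subject, Pattern, " ", [unicode, global])
                  end),
    %% 2. Pattern precompiled with 'unicode', as in (4) and (10).
    Precompiled = safe(fun() ->
                           re:replace(Subject, Compiled, " ", [global])
                       end),
    %% 3. Assumed workaround: pass the subject as a UTF-8 binary.
    Utf8 = unicode:characters_to_binary(Subject),
    PreEncoded = safe(fun() ->
                          re:replace(Utf8, Compiled, " ", [global])
                      end),
    [{direct, Direct}, {precompiled, Precompiled}, {pre_encoded, PreEncoded}].

%% Flatten a successful result to a binary for easy comparison;
%% turn a runtime error such as badarg into {error, Reason}.
safe(Fun) ->
    try unicode:characters_to_binary(Fun())
    catch error:Reason -> {error, Reason}
    end.
```

Calling re_replace_compare:compare([133]) or re_replace_compare:compare([256])
then shows whether the 'direct' and 'precompiled' entries agree on a given
release.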

Is this the intended behaviour, perhaps even documented?

Best Regards,
Robin

-- 
------------------ managed broadband access ------------------

Travelping GmbH               phone:           +49-391-8190990
Roentgenstr. 13               fax:           +49-391-819099299
D-39108 Magdeburg             email:       info@REDACTED
GERMANY                       web:   http://www.travelping.com


Company Registration: Amtsgericht Stendal Reg No.:   HRB 10578
Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
--------------------------------------------------------------


