[erlang-bugs] different behaviour of re:replace for directly-specified and precompiled regular expressions
Robin Haberkorn
rh@REDACTED
Mon Sep 12 16:15:31 CEST 2011
Hello,
I think I may have discovered a bug in the stdlib 're' module.
For some Erlang strings, re:replace behaves differently
depending on whether the regular expression was "re:compile"d
with the 'unicode' option or passed uncompiled to re:replace
with 'unicode' in its options list.
I've minimized the test case using PropEr.
Have a look at the following erl session:
Erlang R14B03 (erts-5.8.4) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]
Eshell V5.8.4 (abort with ^G)
1> RegExp = ".".
"."
2> {ok, RegExpC} = re:compile(RegExp, [unicode]).
{ok,{re_pattern,0,1,
<<69,82,67,80,56,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>}}
3> re:replace([133], RegExp, " ", [unicode, global]).
[<<" ">>]
4> re:replace([133], RegExpC, " ", [global]).
[133]
5> unicode:characters_to_binary(re:replace([133], RegExp, " ", [unicode, global])).
<<" ">>
6> unicode:characters_to_binary(re:replace([133], RegExpC, " ", [global])).
<<194,133>>
7>
That is, in (4) the replacement simply isn't performed, even
though [133] should be a valid Unicode charlist and 133 a valid
Unicode code point.
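The comparison in the session above can be packaged as a small module so that it is easy to rerun on other subjects or releases. This is only a sketch: the module name re_unicode_check and the functions compare/1 and safe_replace/3 are hypothetical helpers, not part of stdlib; only re:compile/2 and re:replace/4 are real library calls.

```erlang
-module(re_unicode_check).
-export([compare/1]).

%% Hypothetical helper (not part of stdlib): runs the same replacement
%% through both code paths from the session above and returns both outcomes
%% as {Direct, Precompiled}.
compare(Subject) ->
    RegExp = ".",
    {ok, RegExpC} = re:compile(RegExp, [unicode]),
    %% Path 1: uncompiled pattern, 'unicode' given to re:replace/4.
    Direct = safe_replace(Subject, RegExp, [unicode, global]),
    %% Path 2: pattern precompiled with 'unicode', no 'unicode' at replace time.
    Precompiled = safe_replace(Subject, RegExpC, [global]),
    {Direct, Precompiled}.

%% Wraps re:replace/4 so that a badarg error is returned as a value
%% instead of aborting the comparison.
safe_replace(Subject, RE, Options) ->
    try
        {ok, re:replace(Subject, RE, " ", Options)}
    catch
        error:badarg -> {error, badarg}
    end.
```

Calling re_unicode_check:compare([133]) returns a pair of outcomes. Going by the session above, on R14B03 the two elements would be expected to differ; other releases may behave differently.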
I discovered this by running re:replace on io_lib:format
return values. If I'm not totally confused by Erlang's
Unicode handling, io_lib:format without the Unicode
translation modifier returns a (deep) list of byte()s.
Since these are lists of integers rather than binaries, the
UTF-8 binary encoding does not come into play, and every
integer returned is a valid Unicode code point
(unicode:characters_to_binary does not complain about any of
the lists that cause these problems with re:replace).
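As a rough illustration of that claim, the following shell expressions check that formatted output is a list of byte-sized integers and that unicode:characters_to_binary accepts it. The "~s" format string and the [133] argument are assumptions chosen to match the session above, not the original io_lib:format call that triggered the problem.

```erlang
%% Assumption: "~s" is used without the Unicode translation modifier.
Chars = lists:flatten(io_lib:format("~s", [[133]])).
%% Every element should be an integer in the byte() range.
true = lists:all(fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 255 end, Chars).
%% The list is accepted as a Unicode charlist and encoded as UTF-8.
Bin = unicode:characters_to_binary(Chars).
true = is_binary(Bin).
```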
Moreover, consider the following difference:
10> re:replace([256], RegExpC, " ", [global]).
** exception error: bad argument
     in function  re:replace/4
        called as re:replace([256],
                             {re_pattern,0,1,
                                 <<69,82,67,80,56,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,...>>},
                             " ",
                             [global])
11> re:replace([256], RegExp, " ", [unicode,global]).
[<<" ">>]
It is almost as if re:replace expected only byte()s in (10).
Is this the intended behaviour, and is it perhaps even documented?
Best Regards,
Robin
--
------------------ managed broadband access ------------------
Travelping GmbH phone: +49-391-8190990
Roentgenstr. 13 fax: +49-391-819099299
D-39108 Magdeburg email: info@REDACTED
GERMANY web: http://www.travelping.com
Company Registration: Amtsgericht Stendal Reg No.: HRB 10578
Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
--------------------------------------------------------------