[erlang-questions] puzzled with this charset/encoding -related behaviour

Alexandre Karpov alexakarpov@REDACTED
Sat Oct 14 18:23:19 CEST 2017


Thanks everyone! Until this conversation I didn't realize how much more
important strings-as-binaries are compared to plain "strings". Everything
_works_ now, of course, but I don't think my understanding has caught up
100%.

"by default it guesses that lists containing integers larger than 255
is not a string but a list of integers" <<< this really set some things
straight.
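
To make that guess concrete: a string in Erlang is just a list of code points, and io_lib exposes the printability checks. A minimal sketch, assuming the shell's decision follows the same rule as these functions (io_lib:printable_list/1 is the variant documented as depending on +pc):

```erlang
%% "йцу" written as code points, so the example does not depend on
%% terminal or source-file encoding.
Cyrillic = [1081, 1094, 1091].

%% A string literal is nothing more than a list of integers.
true = ("asd" =:= [97, 115, 100]).

%% Latin-1 rule (the default without +pc unicode): code points above 255
%% disqualify the list from being printed as a string.
false = io_lib:printable_latin1_list(Cyrillic).

%% Unicode rule (what +pc unicode switches to): the same list qualifies.
true = io_lib:printable_unicode_list(Cyrillic).
```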

But suffer me this follow-up question, Dan. Using +pc unicode indeed gave
me a shell that prints lists of integers as the corresponding Unicode
characters, so the arguments in error messages are now reported more
clearly:

7> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
     in function  re:run/3
        called as re:run("йцу.asd","^(.*\\..*)$",[{capture,none}])
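
For the record, the failure and the fix Dan suggests further down can be reproduced without any non-ASCII input. A minimal sketch, with the subject written as code points and the pattern copied from the error message above:

```erlang
%% "йцу.asd" as a list of code points.
Subject = [1081, 1094, 1091 | ".asd"].
Pattern = "^(.*\\..*)$".

%% Without the unicode option the subject must be iodata (bytes 0..255),
%% so code points above 255 raise badarg.
{'EXIT', {badarg, _}} = (catch re:run(Subject, Pattern, [{capture, none}])).

%% With the unicode option the list is treated as Unicode characters.
match = re:run(Subject, Pattern, [{capture, none}, unicode]).
```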


If I use the binary-string representation, it works just fine,
_even_without_ /utf8:

3> re:run(<<"普通話.asd">>, xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
match

Note that the call above was executed in a shell started _without_
+pc unicode, and the binary does _not_ have the /utf8 type specifier... This
means my understanding is still lacking... binaries are honest and good,
strings are fake and evil, but +pc unicode seems to help a little with fake
strings... while using binary-strings doesn't _always_ require /utf8 ...
what is this sorcery?!
=)
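
A sketch of what seems to be going on, in line with Dan's remark below that the binary version works on bytes: without /utf8, every code point in a binary string literal becomes a single byte and only its low 8 bits are kept, so the binary is not the text it appears to be. The variable names here are only for illustration.

```erlang
%% "йцу.asd" as code points, independent of terminal or file encoding.
Chars = [1081, 1094, 1091 | ".asd"].

%% What <<"йцу.asd">> builds: one byte per code point, high bits dropped.
Truncated = << <<C>> || C <- Chars >>.

%% What <<"йцу.asd"/utf8>> builds: a real UTF-8 encoding of the text.
Utf8 = unicode:characters_to_binary(Chars).

7 = byte_size(Truncated).
10 = byte_size(Utf8).            % two bytes per Cyrillic letter, plus ".asd"
<<"9FC.asd">> = Truncated.       % 1081 band 255 =:= $9, and so on

%% The regexp still matches, but against the truncated bytes.
match = re:run(Truncated, "^(.*\\..*)$", [{capture, none}]).
```

The same truncation applies to 普通話, so the call above matched a few arbitrary bytes followed by ".asd" rather than the Chinese text.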


On Sat, Oct 14, 2017 at 4:24 AM, Dan Gudmundsson <dangud@REDACTED> wrote:

> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none},
> unicode]).
> The binary one matches since it works on bytes and not utf-8 characters?
>
> Also the erlang shell doesn't know if a list of integers is a list of
> integers or a string,
> since they may be represented by the same list of integers.
>
> So it tries to guess, by default it guesses that lists containing integers
> larger than 255
> is not a string but a list of integers. You can change that with:
>
> (w)erl +pc unicode
>
> 1> "йцу.asd".
> "йцу.asd"
>
> /Dan
>
>
> On Sat, Oct 14, 2017 at 10:12 AM Attila Rajmund Nohl <
> attila.r.nohl@REDACTED> wrote:
>
>> 2017-10-14 4:21 GMT+02:00 Alexandre Karpov <alexakarpov@REDACTED>:
>> > TL;DR: how do I run erl which understands Unicode?
>> >
>> > Or, in more detail:
>> >
>> > (Disclaimer: this official documentation got me really humbled:
>> > http://www1.erlang.org/doc/apps/stdlib/unicode_usage.html
>> > , and just a little bit scared =) )
>> >
>> > Judging by my S/O question, which got 3 upvotes and no answers, I'm not the
>> > only one wondering:
>> > https://stackoverflow.com/questions/46735539/erlang-regexp-matching-on-chinese-characters
>> >
>> > Here's the gist of the problem:
>> >
>> > 57> "абв".
>> >
>> > [1072,1073,1074]
>> >
>> > The codes are correct Unicode for the [Cyrillic] characters - which means my
>> > Terminal didn't fail to understand my keyboard's input =) but Erlang shell
>> > didn't recognize Terminal's input as printable characters. And it is my
>> > understanding that this is exactly why this call fails:
>> >
>> > 25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
>> > ** exception error: bad argument
>> >      in function  re:run/3
>> >         called as re:run([1081,1094,1091,46,97,115,100], "^(.*\\..*)$", [{capture,none}])
>>
>> Try
>>
>> re:run(<<"йцу.asd"/utf8>>, xmerl_regexp:sh_to_awk("*.*"), [{capture,
>> none}]).
>>
>