[erlang-questions] puzzled with this charset/encoding -related behaviour

Mon Oct 16 11:42:10 CEST 2017

On Sat, Oct 14, 2017 at 6:23 PM Alexandre Karpov <alexakarpov@REDACTED>
wrote:

> Thanks everyone! I didn't realize until this conversation how much more
> important strings-as-binaries are, compared to simple "strings". Everything
> _works_ now, of course, but I don't think my understanding has caught up
> 100%
>
> "by default it guesses that lists containing integers larger than 255
> is not a string but a list of integers" <<< this really set some things
> straight
>
> But suffer me this follow-up question, Dan. Using +pc unicode indeed gave
> me a shell that represents lists of integers using the characters found in
> Unicode mapping; so now, in error messages, I see arguments reported more
> clearly:
>
> 7> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
> ** exception error: bad argument
>      in function  re:run/3
>         called as re:run("йцу.asd","^(.*\\..*)$",[{capture,none}])
>
>
> If I use the binary-string representation, it works _even_without_ /utf8,
> it works just fine:
>
> 3> re:run(<<"普通話.asd">>, xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
> match
>
> Note that the call above was executed in the shell started _without_ the
> +pc unicode, and the binary does _not_ have the /utf8>> thingy... This
> means my understanding is still lacking... binaries are honest and good,
> strings are fake and evil, but +pc unicode seems to help a little with fake
> strings... while using binary-strings doesn't _always_ require the /utf8 ...
> what is this sorcery?!
> =)
>

First see the difference here:

4> <<"普通話.asd">>.
<<110,26,113,46,97,115,100>>

i.e. each code point is just truncated to its low 8 bits, giving a value
below 256.

5> <<"普通話.asd"/utf8>>.
<<230,153,174,233,128,154,232,169,177,46,97,115,100>>

Here the code points are UTF-8 encoded.
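
To reproduce the two results without depending on how your terminal or
editor encodes the literal, here is a minimal sketch that spells out the
code points by hand. The module name enc_demo and its function names are
made up for illustration; the binary comprehensions only mimic what the
two literal forms do.

```erlang
-module(enc_demo).
-export([chars/0, truncated/0, encoded/0]).

%% Code points of "普通話.asd", written out explicitly.
chars() -> [16#666E, 16#901A, 16#8A71 | ".asd"].

%% Like <<"...">> without /utf8: every code point is stored in a
%% single byte, so only its low 8 bits survive.
truncated() -> << <<C>> || C <- chars() >>.

%% Like <<"..."/utf8>>: every code point is UTF-8 encoded, which
%% takes three bytes for each of these CJK characters.
encoded() -> << <<C/utf8>> || C <- chars() >>.
```

enc_demo:truncated() gives the 7-byte binary from prompt 4 and
enc_demo:encoded() gives the 13-byte binary from prompt 5.
unicode:characters_to_binary(enc_demo:chars()) produces the same bytes
as encoded().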

Don't give up on lists; they can be useful and fast for some uses.

And your regexp matches anything with a dot in it, so whether the binary is
handled as UTF-8 encoded text or as plain (truncated) bytes, it still
works: both representations contain a dot, so both give a match.
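
A small sketch of that point, under a few assumptions: the pattern is
hard-coded to the string the error message earlier in the thread shows
for sh_to_awk("*.*"), the code points are written out by hand, and the
module name dot_match is made up for illustration.

```erlang
-module(dot_match).
-export([run/0]).

run() ->
    %% The pattern the thread shows sh_to_awk("*.*") expanding to.
    Re = "^(.*\\..*)$",
    %% Code points of "普通話.asd".
    Chars = [16#666E, 16#901A, 16#8A71 | ".asd"],
    %% Same bytes as <<"普通話.asd">> (truncated to 8 bits each).
    Truncated = << <<C>> || C <- Chars >>,
    %% Same bytes as <<"普通話.asd"/utf8>>.
    Utf8 = unicode:characters_to_binary(Chars),
    Opts = [{capture, none}],
    [re:run(Truncated, Re, Opts),          % plain bytes, contains a dot
     re:run(Utf8, Re, Opts),               % UTF-8 bytes, contains a dot
     re:run(Chars, Re, [unicode | Opts]),  % list needs the unicode option
     %% A list with code points above 255 is not valid iodata, so
     %% without the unicode option re:run/3 fails with badarg.
     try re:run(Chars, Re, Opts) catch error:badarg -> badarg end].
```

dot_match:run() returns [match,match,match,badarg]: both binaries match
byte-wise, the list matches once unicode is passed, and the list without
unicode is the bad argument from the original question.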

To understand Unicode, play around and try to make it work (on both lists
and binaries) with some fancier regexps: try to match something with a
Unicode character in it and capture the result, so you can see what you
matched.
Print your input and result strings with io:format("~ts: ~w~n", [Str,
Str]), so you can see both the actual string and its representation.
Test with both binaries and lists as representations.
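
One possible starting point for that experiment, as a sketch only: the
module name show_demo, the helper names, and the "everything before the
last dot" pattern are my own choices for illustration, not something
from the thread.

```erlang
-module(show_demo).
-export([show/1, stem/1, sample_list/0, sample_binary/0]).

%% "йцу.asd" as a list of code points and as a UTF-8 binary.
sample_list() -> [1081, 1094, 1091 | ".asd"].
sample_binary() -> unicode:characters_to_binary(sample_list()).

%% Print the text with ~ts and the underlying term with ~w.
show(Str) -> io:format("~ts: ~w~n", [Str, Str]).

%% Capture everything before the last dot. The unicode option makes
%% re treat a list as code points and a binary as UTF-8, and with
%% {capture, [1], list} the group comes back as a list of code points.
stem(Subject) ->
    {match, [Stem]} =
        re:run(Subject, "^(.*)\\.", [unicode, {capture, [1], list}]),
    Stem.
```

Calling show_demo:show/1 on sample_list() and on sample_binary() prints
the same text next to two different terms (code points versus bytes),
although how the text part is rendered depends on the encoding of your
terminal. show_demo:stem/1 returns [1081,1094,1091] for both inputs.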

/Dan

>
> On Sat, Oct 14, 2017 at 4:24 AM, Dan Gudmundsson <dangud@REDACTED> wrote:
>
>> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none},
>> unicode]).
>> The binary one matches since it works on bytes and not utf-8 characters?
>>
>> Also the erlang shell doesn't know if a list of integers is a list of
>> integers or a string,
>> since they may be represented by the same list of integers.
>>
>> So it tries to guess, by default it guesses that lists containing
>> integers larger than 255
>> is not a string but a list of integers. You can change that with:
>>
>> (w)erl +pc unicode
>>
>> 1> "йцу.asd".
>> "йцу.asd"
>>
>> /Dan
>>
>>
>> On Sat, Oct 14, 2017 at 10:12 AM Attila Rajmund Nohl <
>> attila.r.nohl@REDACTED> wrote:
>>
>>> 2017-10-14 4:21 GMT+02:00 Alexandre Karpov <alexakarpov@REDACTED>:
>>> > TL;DR: how do I run erl which understands Unicode?
>>> >
>>> > Or, in more detail:
>>> >
>>> > (Disclaimer: this official documentation got me really humbled:
>>> > http://www1.erlang.org/doc/apps/stdlib/unicode_usage.html
>>> > , and just a little bit scared =) )
>>> >
>>> > Judging by my S/O question, which got 3 upvotes and no answers, I'm
>>> not the
>>> > only one wondering:
>>> >
>>> https://stackoverflow.com/questions/46735539/erlang-regexp-matching-on-chinese-characters
>>> >
>>> > Here's the gist of the problem:
>>> >
>>> > 57> "абв".
>>> >
>>> > [1072,1073,1074]
>>> >
>>> > The codes are correct Unicode for the [Cyrillic] characters - which
>>> means my
>>> > Terminal didn't fail to understand my keyboard's input =) but Erlang
>>> shell
>>> > didn't recognize Terminal's input as printable characters. And it is my
>>> > understanding that this is exactly why this call fails:
>>> >
>>> > 25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture,
>>> none}]). **
>>> > exception error: bad argument in function re:run/3 called as
>>> > re:run([1081,1094,1091,46,97,115,100], "^(.*\\..*)$", [{capture,none}])
>>>
>>> Try
>>>
>>> re:run(<<"йцу.asd"/utf8>>, xmerl_regexp:sh_to_awk("*.*"), [{capture,
>>> none}]).