[erlang-questions] Character encodings and lager

Mon Aug 3 17:52:58 CEST 2015

On 08/03, Roger Lipscombe wrote:
>    Message = [60,60,178,179,62,62].
>    mochijson2:encode({struct, [{"msg", list_to_binary(Message)}]}).
>** exception exit: {ucs,{bad_utf8_character_code}}
>
>So, my question would -- usually -- be: "how do I convert the Latin1
>string to UTF8?".
>

unicode:characters_to_binary(List, latin1, utf8) should do it: 
http://www.erlang.org/doc/man/unicode.html#characters_to_binary-3
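
A minimal sketch of that call, wrapped in a module so it can be compiled 
and tried on the list from the example. The module and function names 
are made up for illustration. Note that every byte from 0 to 255 is a 
valid latin1 character, so this conversion will not reject garbage 
bytes; it only re-encodes them.

```erlang
-module(latin1_demo).
-export([to_utf8/1]).

%% Re-encode latin1 data (a list of bytes or a binary) as a UTF-8 binary.
%% Bytes above 127 become two-byte UTF-8 sequences; ASCII is unchanged.
to_utf8(Data) ->
    unicode:characters_to_binary(Data, latin1, utf8).
```

With this, latin1_demo:to_utf8([60,60,178,179,62,62]) returns 
<<60,60,194,178,194,179,62,62>>, which is valid UTF-8 for "<<²³>>".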

>However, the binary isn't supposed to contain anything outside the
>32-127 ASCII range. In fact, it should be an uppercase hexadecimal
>string: [A-F0-9] in ASCII.
>

Ah, well those are supposed to be the same in latin1 and utf8. Something 
else is breaking your data. The thing you generated a failure on above 
in the example is:

2> [60,60,178,179,62,62].
"<<²³>>"

Which, well, isn't hexadecimal at all: it uses the superscript 2 and 3 
(² and ³) instead of hex digits, and also includes < and >.

>Note: In the original crash, the string was sent from an embedded
>device, and it appears to have garbage in it because of some kind of
>corruption in configuration NVRAM.
>

Yeah that's a problem alright.

>So, I have an actual *binary*, which usually only contains valid hex
>characters (in ASCII), but occasionally has bytes outside this range.
>How do I get that into mochijson2, via lager, without anything
>crashing?
>

Avoid putting garbage into the program, and the program will stop 
choking on the garbage. The conversion of latin1 to unicode will only 
work as long as the sequence of garbage generated is perceived to be 
valid latin1.
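
One way to keep the garbage out is to check the binary against the 
format it is supposed to have before it gets anywhere near the logger. 
The following is only a sketch, assuming the field really is meant to 
be non-empty-or-empty uppercase hex as described above; the module and 
function names are hypothetical.

```erlang
-module(hex_check).
-export([is_upper_hex/1]).

%% True only if every byte of the binary is in [0-9A-F].
%% An empty binary is treated as valid here; tighten that if needed.
is_upper_hex(<<C, Rest/binary>>) when C >= $0, C =< $9; C >= $A, C =< $F ->
    is_upper_hex(Rest);
is_upper_hex(<<>>) ->
    true;
is_upper_hex(Bin) when is_binary(Bin) ->
    %% First remaining byte is outside the allowed set.
    false.
```

hex_check:is_upper_hex(<<"DEADBEEF01">>) is true, while the bytes from 
the example, <<60,60,178,179,62,62>>, give false.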

>I tried the following:
>
>    mochijson2:encode({struct, [{"msg",
>unicode:characters_to_binary(Message)}]}).
>
>...which works, but am I going to get burnt if I start using UTF-8 in
>my logging once we move to Erlang 17 or 18?
>

Use the form I put in earlier: characters_to_binary(Data, InEncoding, 
OutEncoding) -> Result

If you specify the InEncoding, you can, along with your upgrade to 17 
or 18, change the InEncoding from latin1 to utf8 and possibly remove the 
conversion entirely.
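
A sketch of what that could look like, with the input encoding named in 
exactly one place so the later switch is a one-line change. The module 
name, the macro and the {error, invalid_encoding} return value are all 
assumptions made for the example, not anything lager or mochijson2 
require.

```erlang
-module(log_text).
-export([to_utf8/1, to_utf8/2]).

%% Assumed encoding of incoming data. Change latin1 to utf8 here
%% once the inputs are actually UTF-8.
-define(IN_ENCODING, latin1).

to_utf8(Data) ->
    to_utf8(Data, ?IN_ENCODING).

%% Convert Data from InEncoding to a UTF-8 binary, reporting bad input
%% instead of passing a broken binary further down.
to_utf8(Data, InEncoding) ->
    case unicode:characters_to_binary(Data, InEncoding, utf8) of
        Bin when is_binary(Bin) -> {ok, Bin};
        {error, _Converted, _Rest} -> {error, invalid_encoding};
        {incomplete, _Converted, _Rest} -> {error, invalid_encoding}
    end.
```

With latin1 as the input encoding a binary always converts; with utf8, 
something like log_text:to_utf8(<<178>>, utf8) comes back as 
{error, invalid_encoding}.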

But the underlying problem is that you will *never* have a happy good 
time if what you're trying to do is convert unknown encodings into a 
known one. There is just no good solid way to do it. Text encoding is 
one place where it pays off to be very strict in what you accept so the 
rest of your system is much simpler.
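
As an illustration of being strict at the edge, here is a sketch that 
accepts a binary only if it is already valid UTF-8 and crashes 
otherwise. The module name and the shape of the error term are 
hypothetical.

```erlang
-module(strict_utf8).
-export([accept/1]).

%% Return Bin unchanged if it is valid UTF-8; crash otherwise.
%% Converting utf8 to utf8 yields the same binary only for valid input;
%% invalid or truncated input yields an error/incomplete tuple instead.
accept(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Bin -> Bin;
        _Other -> erlang:error({invalid_utf8, Bin})
    end.
```

Everything past this call can then assume UTF-8 and skip defensive 
checks.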

>How do others deal with this kind of thing in Erlang?
>

I crash on bad corrupted input because nothing good will come out of 
trying to make that kind of garbage edible to my programs.

Regards,
Fred.