[erlang-questions] utf-8 and xmerl
Éric Pailleau
eric.pailleau@REDACTED
Fri Aug 14 18:23:36 CEST 2015
Hello,
Please specify which Erlang release you are using. UTF-8 support arrived fairly late in Erlang.
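For reference, the release can be read from a running Erlang shell. This is only a minimal sketch; the variable names are illustrative.

```erlang
%% Run these expressions in the Erlang shell (erl).
%% otp_release is the OTP release the emulator belongs to, as a string.
Release = erlang:system_info(otp_release).
%% version is the version of the runtime system (ERTS), also a string.
ErtsVersion = erlang:system_info(version).
```

Either value is enough to tell which Unicode features are available.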
Regards
On 14 August 2015 at 13:49, Alexander Turkin <snowwlex@REDACTED> wrote:
>
> Dear list,
>
>
> I have a problem with Unicode and the xmerl library.
>
> The input to xmerl is UTF-8 encoded XML, but the result I get back is encoded as Latin-1. I need UTF-8!
>
>
> EXAMPLES
>
> Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
>
> For the sake of portability, here is term_to_binary(Body):
>
> <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
> 105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
> 110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
> 112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
> 110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
> 115,112,111,110,115,101,62>>
>
>
>
> (1):
>
> When I do
>
> xmerl_scan:string(binary_to_list(Body)).
>
> it returns
>
> {#xmlElement{name = response,expanded_name = response,
> nsinfo = [],
> namespace = #xmlNamespace{default = [],nodes = []},
> parents = [],pos = 1,attributes = [],
> content = [#xmlElement{name = value,expanded_name = value,
> nsinfo = [],
> namespace = #xmlNamespace{default = [],nodes = []},
> parents = [{response,1}],
> pos = 1,attributes = [],
> content = [#xmlText{parents = [{value,1},{response,1}],
> pos = 1,language = [],
>
>
> value = "René",
>
>
> type = text}],
> language = [],xmlbase = "/Users/aturkin/ws/",
> elementdef = undeclared}],
> language = [],xmlbase = "/Users/aturkin/ws/",
> elementdef = undeclared},
> []}
>
>
> Note the `value = "René"` string: the é is stored as the single element [233], which is Latin-1.
>
>
>
>
> (2):
>
> xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
>
> returns
>
> {#xmlElement{name = response,expanded_name = response,
> nsinfo = [],
> namespace = #xmlNamespace{default = [],nodes = []},
> parents = [],pos = 1,attributes = [],
> content = [#xmlElement{name = value,expanded_name = value,
> nsinfo = [],
> namespace = #xmlNamespace{default = [],nodes = []},
> parents = [{response,1}],
> pos = 1,attributes = [],
> content = [#xmlText{parents = [{value,1},{response,1}],
> pos = 1,language = [],
>
>
> value = "René",
>
>
> type = text}],
> language = [],xmlbase = "/Users/aturkin/ws/",
> elementdef = undeclared}],
> language = [],xmlbase = "/Users/aturkin/ws/",
> elementdef = undeclared},
> []}
>
> Now, in `value = "René"`, two bytes ([195,169]) are used to encode the é, and this is UTF-8.
>
> So in (2) I get what I need, but why do I need to force that conversion for xmerl?
>
>
>
>
> QUESTIONS
>
> 1. xmerl_scan lets you set the input encoding, but there seems to be no way to set the output encoding. Is there any way to make xmerl_scan return UTF-8 instead of Latin-1?
>
> 2. How does it happen that in (1) it converts UTF-8 to Latin-1, while in (2) the result is UTF-8?
>
>
>
>
> --
> Best Regards,
> Alex Turkin
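
A note on result (1), offered as a minimal sketch rather than a statement about xmerl's internals: the [233] in that output is also the Unicode code point of é, so the value can be read as a list of code points rather than Latin-1 bytes. If UTF-8 bytes are what is needed, such a list can be encoded afterwards with unicode:characters_to_binary/1. The module and function names below (xmerl_utf8_sketch, text_to_utf8/1, demo/0) are hypothetical and chosen only for illustration.

```erlang
-module(xmerl_utf8_sketch).
-export([text_to_utf8/1, demo/0]).

%% Encode a list of Unicode code points (such as the value field
%% shown in result (1)) as a UTF-8 binary.
text_to_utf8(Chars) when is_list(Chars) ->
    unicode:characters_to_binary(Chars).

%% Round-trip the string from the example: code points -> UTF-8 -> code points.
demo() ->
    CodePoints = [82,101,110,233],            % "René", with é as code point 233
    Utf8 = text_to_utf8(CodePoints),
    <<82,101,110,195,169>> = Utf8,            % é is now the two bytes 195,169
    CodePoints = unicode:characters_to_list(Utf8),
    ok.
```

Calling xmerl_utf8_sketch:demo() returns ok when both conversions match. Keeping the parsed text as code points and encoding only at the output boundary avoids the double conversion used in (2).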