[erlang-questions] utf-8 and xmerl

Fri Aug 14 13:49:19 CEST 2015

Dear list,

I have a problem with Unicode and the xmerl library.

The input to xmerl is UTF-8 encoded XML, but the result I get back is
encoded as latin1. I need UTF-8!

EXAMPLES

Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.

For the sake of portability, here is term_to_binary(Body):

<<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
  105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
  110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
  112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
  110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
  115,112,111,110,115,101,62>>
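
To rebuild Body from the dump above, binary_to_term/1 can be applied to it in the Erlang shell. This is only a minimal sketch; the variable name Dump is illustrative and not part of the original session:

```erlang
%% Dump is the external term format binary listed above.
Dump = <<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
         105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
         110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
         112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
         110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
         115,112,111,110,115,101,62>>.
%% Decode it back into the original binary (79 bytes of UTF-8 encoded XML).
Body = binary_to_term(Dump).
79 = byte_size(Body).
```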

(1):

When I do

xmerl_scan:string(binary_to_list(Body)).

it returns

{#xmlElement{name = response,expanded_name = response,
             nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [],pos = 1,attributes = [],
             content = [#xmlElement{name = value,expanded_name = value,
                                    nsinfo = [],
                                    namespace = #xmlNamespace{default = [],nodes = []},
                                    parents = [{response,1}],
                                    pos = 1,attributes = [],
                                    content = [#xmlText{parents = [{value,1},{response,1}],
                                                        pos = 1,language = [],

                                                        value = "René",

                                                        type = text}],
                                    language = [],xmlbase = "/Users/aturkin/ws/",
                                    elementdef = undeclared}],
             language = [],xmlbase = "/Users/aturkin/ws/",
             elementdef = undeclared},
 []}

Note the string `value = "René"`: the é is the single element [233], which
is latin1.
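
The difference between the two representations can be seen directly in the shell. This is a minimal sketch, assuming the list returned in (1) holds one integer per character; the variable names are illustrative, and unicode:characters_to_binary/1 from stdlib is shown only as one possible way to turn such a list into UTF-8 bytes, not as something xmerl does itself:

```erlang
%% The value from (1): four integers, the last one is 233 (é).
Value = [82,101,110,233].
%% Encoding that list as UTF-8 yields five bytes; é becomes 195,169.
Utf8 = unicode:characters_to_binary(Value).
<<82,101,110,195,169>> = Utf8.
```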

(2):

xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))

returns

{#xmlElement{name = response,expanded_name = response,
             nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [],pos = 1,attributes = [],
             content = [#xmlElement{name = value,expanded_name = value,
                                    nsinfo = [],
                                    namespace = #xmlNamespace{default = [],nodes = []},
                                    parents = [{response,1}],
                                    pos = 1,attributes = [],
                                    content = [#xmlText{parents = [{value,1},{response,1}],
                                                        pos = 1,language = [],

                                                        value = "RenÃ©",

                                                        type = text}],
                                    language = [],xmlbase = "/Users/aturkin/ws/",
                                    elementdef = undeclared}],
             language = [],xmlbase = "/Users/aturkin/ws/",
             elementdef = undeclared},
 []}

Now `value = "RenÃ©"`: the é is coded with 2 bytes ([195,169]), and this is
utf-8.
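
What happens to the é in (2) can be traced byte by byte. This is a sketch, assuming xmerl_ucs:to_utf8/1 treats every integer of its input list as one character, so the two bytes that binary_to_list/1 produced for é are each encoded again. If the scanner then decodes its input as UTF-8 once, as the declared encoding suggests, the original two bytes come back as two separate characters. The variable names are illustrative:

```erlang
%% é as it appears in binary_to_list(Body): the two UTF-8 bytes.
Bytes = [195,169].
%% to_utf8/1 encodes 195 and 169 as separate characters: four bytes.
Twice = xmerl_ucs:to_utf8(Bytes).
[195,131,194,169] = Twice.
%% Decoding once gives back 195,169, which is displayed as "Ã©".
[195,169] = unicode:characters_to_list(list_to_binary(Twice)).
```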

So in (2) I get what I need, but why do I need to force that conversion for
xmerl?

QUESTIONS

1. I don't understand why xmerl_scan lets you set the input encoding, but
there seems to be no way to set the output encoding. Is there any way to
make xmerl_scan return utf8 instead of latin1?

2. How does it happen that in (1) the result is converted from utf-8 to
latin1, while in (2) it stays utf-8?

-- 
Best Regards,
Alex Turkin