[erlang-questions] utf-8 and xmerl
Alexander Turkin
snowwlex@REDACTED
Fri Aug 14 13:49:19 CEST 2015
Dear list,
I have a problem with Unicode and the xmerl library.
The input to xmerl is UTF-8 encoded XML, but the result I get back looks
latin1 encoded. I need UTF-8!
EXAMPLES
Body = <<"<?xml version=\"1.0\" encoding=\"UTF-8\"?><response><value>René</value></response>"/utf8>>.
For the sake of portability, here is term_to_binary(Body):
<<131,109,0,0,0,79,60,63,120,109,108,32,118,101,114,115,
105,111,110,61,34,49,46,48,34,32,101,110,99,111,100,105,
110,103,61,34,85,84,70,45,56,34,63,62,60,114,101,115,
112,111,110,115,101,62,60,118,97,108,117,101,62,82,101,
110,195,169,60,47,118,97,108,117,101,62,60,47,114,101,
115,112,111,110,115,101,62>>
(1):
When I do
xmerl_scan:string(binary_to_list(Body)).
it returns
{#xmlElement{name = response,expanded_name = response,
             nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [],pos = 1,attributes = [],
             content = [#xmlElement{name = value,expanded_name = value,
                                    nsinfo = [],
                                    namespace = #xmlNamespace{default = [],nodes = []},
                                    parents = [{response,1}],
                                    pos = 1,attributes = [],
                                    content = [#xmlText{parents = [{value,1},{response,1}],
                                                        pos = 1,language = [],
                                                        value = "René",
                                                        type = text}],
                                    language = [],xmlbase = "/Users/aturkin/ws/",
                                    elementdef = undeclared}],
             language = [],xmlbase = "/Users/aturkin/ws/",
             elementdef = undeclared},
 []}
Note the `value = "René"` string: the é is stored as the single integer 233,
which is the latin1 code for that character.
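To make the integers concrete, here is a small sketch that uses only the standard unicode module (not xmerl itself). It assumes the five bytes of "René" taken from the term_to_binary dump above, and shows the relation between the two-byte UTF-8 form of é and the single integer 233:

```erlang
%% The UTF-8 bytes of "René"; é is the two bytes 195,169.
Utf8Bytes = <<82,101,110,195,169>>.

%% Decoding UTF-8 into a list gives one integer per character;
%% é becomes the single integer 233.
CodePoints = unicode:characters_to_list(Utf8Bytes, utf8).
%% CodePoints =:= [82,101,110,233]

%% Encoding that list again gives back the original UTF-8 bytes.
Utf8Again = unicode:characters_to_binary(CodePoints, unicode, utf8).
%% Utf8Again =:= <<82,101,110,195,169>>
```

So a list containing 233 can be turned back into UTF-8 bytes in one call, whatever the reason xmerl returns it in that form.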
(2):
xmerl_scan:string(xmerl_ucs:to_utf8(binary_to_list(Body)))
returns
{#xmlElement{name = response,expanded_name = response,
             nsinfo = [],
             namespace = #xmlNamespace{default = [],nodes = []},
             parents = [],pos = 1,attributes = [],
             content = [#xmlElement{name = value,expanded_name = value,
                                    nsinfo = [],
                                    namespace = #xmlNamespace{default = [],nodes = []},
                                    parents = [{response,1}],
                                    pos = 1,attributes = [],
                                    content = [#xmlText{parents = [{value,1},{response,1}],
                                                        pos = 1,language = [],
                                                        value = "René",
                                                        type = text}],
                                    language = [],xmlbase = "/Users/aturkin/ws/",
                                    elementdef = undeclared}],
             language = [],xmlbase = "/Users/aturkin/ws/",
             elementdef = undeclared},
 []}
Now `value = "René"` holds the é as two integers, [195,169], which is its
UTF-8 encoding.
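As a sketch of what the extra call in (2) does to the input, the snippet below feeds the two UTF-8 bytes of é through xmerl_ucs:to_utf8/1 in isolation. The assumption here is that to_utf8/1 treats every integer in the list as a character to encode, so bytes that are already UTF-8 get encoded a second time:

```erlang
%% 195 and 169 are the UTF-8 bytes of é, as produced by binary_to_list(Body).
%% to_utf8/1 encodes each integer separately, giving two bytes for each.
Double = xmerl_ucs:to_utf8([195,169]).
%% Double =:= [195,131,194,169]

%% Decoding that once gives back the integers 195 and 169, not 233.
Once = unicode:characters_to_list(list_to_binary(Double), utf8).
%% Once =:= [195,169]
```

If that reading is right, the input in (2) is encoded twice and decoded once by the scanner, which would explain why the two integers survive in the result.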
So in (2) I get what I need, but why do I need to force that conversion for
xmerl?
QUESTIONS
1. I don't understand why xmerl_scan lets you set the input encoding, but
there seems to be no way to set the output encoding. Is there any way to
make xmerl_scan return utf8 instead of latin1?
2. How does it happen that in (1) the result is converted utf-8 -> latin1,
while in (2) it stays utf-8?
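One possible workaround for question 1, as a sketch only: leave the input as in (1) and convert the text after scanning. The module name xmerl_utf8_sketch and the function value_utf8/1 are hypothetical, and the XPath is hard-coded for the example document above.

```erlang
-module(xmerl_utf8_sketch).
-include_lib("xmerl/include/xmerl.hrl").
-export([value_utf8/1]).

%% Hypothetical helper: scan Body as in (1), pick the text of
%% /response/value, and return it as a UTF-8 encoded binary.
value_utf8(Body) when is_binary(Body) ->
    {Doc, _Rest} = xmerl_scan:string(binary_to_list(Body)),
    %% Assumes exactly one text node under /response/value.
    [#xmlText{value = Chars}] =
        xmerl_xpath:string("/response/value/text()", Doc),
    %% Chars is a list of integers, one per character; encode it as UTF-8.
    unicode:characters_to_binary(Chars, unicode, utf8).
```

With the Body from the example, value_utf8(Body) should give the five bytes <<82,101,110,195,169>>, without the extra xmerl_ucs:to_utf8/1 call on the input.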
--
Best Regards,
Alex Turkin