[erlang-questions] erlsom and binary parsing output

Willem de Jong w.a.de.jong@REDACTED
Fri Feb 8 00:10:14 CET 2008


Hi Zvi,

I am pleased to hear that you like the tool :)

At this moment there is no way to tell erlsom that it should map strings to
binaries instead of lists. Right now the only types erlsom maps are integers,
booleans and qnames (and only under certain conditions: it doesn't try to
determine the base type of an extended or restricted type).

The sax parser that is part of erlsom already does the decoding of
the binary input to a list of integers (unicode 'code points'). This means
that the layer that deals with all the schema-stuff would have to re-encode
it again, which wouldn't be very efficient. However, I am
currently finalising a new version of the sax parser. This version will take
advantage of the improved handling of binaries in R12B; it will work
directly on binaries. In theory it could also return binaries, but I am not
sure this would be a good idea. It depends on what you want to do with the
output, I guess.
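
To illustrate the re-encoding step mentioned above, here is a minimal sketch.
It assumes UTF-8 as the target encoding and relies on the unicode module from
later OTP releases (as far as I know it is not part of R12B). The module and
function names (utf8_sketch, to_utf8/1) are made up for illustration and are
not part of erlsom.

```erlang
-module(utf8_sketch).
-export([to_utf8/1]).

%% Hypothetical helper, not part of erlsom.
%% Re-encode a string as the parser produces it (a list of Unicode
%% code points) into a UTF-8 encoded binary.
to_utf8(CodePoints) when is_list(CodePoints) ->
    case unicode:characters_to_binary(CodePoints) of
        Bin when is_binary(Bin) ->
            Bin;
        %% Invalid or truncated input: fail loudly instead of
        %% returning a partially encoded binary.
        {error, _Encoded, _Rest} ->
            erlang:error(badarg, [CodePoints]);
        {incomplete, _Encoded, _Rest} ->
            erlang:error(badarg, [CodePoints])
    end.
```

For example, utf8_sketch:to_utf8([104,233]) (the letter h followed by
e-acute) gives <<104,195,169>>: the single code point 233 becomes two bytes.
Doing this for every string in the output is the extra pass I would like to
avoid.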

Your request raises a couple of questions:
- if erlsom were to map strings to binaries, then these binaries have to be
encoded in some way (unless you want to accept only ASCII). What encoding
should it use? My feeling is that UTF-8 would be the best choice.

- if you want to have this kind of mapping, then you need a way to tell
erlsom what to map, and how. What would be the way to do this? Would you
prefer to add some special information to the xsd, or would you like to do
it in another way?

Anyway, currently I do not have any plans to do a lot of work on this - but
I must say that I have also been thinking about the possibility to return
binaries instead of strings. It might be a possibility to introduce an
option to have binaries (utf-8 encoded) in all the places where currently
strings are used.

Regards,
Willem


On Feb 7, 2008 2:21 AM, Zvi <exta7@REDACTED> wrote:

>
> Hi Willem,
>
> I am using erlsom, and it is much easier to use than xmerl.
> An added benefit is that I do not need to patch my Erlang installation with
> 'windows-1252' encoding support.
> My problem with erlsom is that parsed strings are lists. Is there any
> option in erlsom so that xs:string will be mapped to an Erlang binary?
> Also, in the XML Data Binding product I was using with C++, you can specify
> mappings from XSD datatypes to C++ datatypes and even create mappings
> for custom datatypes.
> In my schema I have integers and floats besides strings, but they are all
> mapped to strings (more exactly, to Erlang lists of ASCII codes :).
> Some C++ XML Data Binding products even handle enums (which is much bigger
> problem in C++ than in Erlang - no atoms). For example I can map XSD
> datatype xs:date to my custom class CDate and just provide two methods:
> fromString and toString.
> If erlsom mapped at least the standard XSD datatypes to standard Erlang
> datatypes, that would also be useful.
>
> Thanks for the useful tool.
> Zvi
>
>
>
> Willem de Jong-2 wrote:
> >
> > Hi,
> >
> > Similar to what Bertil suggested for Xmerl, you can achieve this in Erlsom
> > by adding a clause
> >
> > "windows-1252" -> 'iso-8859-1';  %% note: this is actually introducing a bug
> >                                  %% in order to work around a problem!
> >
> > to the case statement in encoding_type() in erlsom_lib.erl.
> >
> > I would be interested to know why you think it will be necessary to
> > replace it by a C++ port. It seems to me that it will be complicating
> > things considerably. What are the requirements that make this necessary?
> > What properties should an Erlang XML parser have?
> >
> > Regards,
> > Willem
> >
> >
> > On 1/7/08, Zvi <exta7@REDACTED> wrote:
> >>
> >>
> >> The XML is generated by a closed-source 3rd party Windows server (if it
> >> were generated by me, it would be encoded in utf-8).
> >> I am asking questions from the Erlang domain here, not about the obvious
> >> and ugly common-sense solutions, like reading the entire file into memory,
> >> changing the encoding string and only then feeding it into xmerl (the only
> >> problem being that this XML can be quite big, 0.5 MB and more).
> >> Maybe xmerl has some option for forcing an encoding other than the one
> >> specified in the <?xml?> PI?
> >> Maybe there is some other XML parser, like erlsom or an expat driver, which
> >> supports the windows-1252 encoding?
> >> Anyway, I am using xmerl just for prototyping; the long-term solution will
> >> be to write a C++ port, which will do all the XML processing and return
> >> Erlang terms in either text or binary form, which can be read by either
> >> file:consult or binary_to_term on the Erlang side.
> >>
> >> ZVi
> >>
> >>
> >> Christian S wrote:
> >> >
> >> > Why not ask yourself how to change your xml so it says iso-8859-1, as
> >> > you say it should be doing?
> >> >
> >> > http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out
> >> >
> >> > On Jan 7, 2008 5:22 PM, Zvi <exta7@REDACTED> wrote:
> >> >>
> >> >> Bertil,
> >> >>
> >> >> thanks for the reply.
> >> >> Actually the character set used is always latin-1, but for some reason
> >> >> the 3rd party software calls it windows-1252. So please tell me what I
> >> >> should change in xmerl so that it will treat windows-1252 as Latin-1.
> >> > _______________________________________________
> >> > erlang-questions mailing list
> >> > erlang-questions@REDACTED
> >> > http://www.erlang.org/mailman/listinfo/erlang-questions
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Exception-in-xmerl%2C-when-pasing-XML-with-non-UTF8-character-set-tp14588326p14674437.html
> >> Sent from the Erlang Questions mailing list archive at Nabble.com<http://nabble.com/>.
> >>
> >
>