[erlang-questions] Fwd: String encoding and character set

Bob Ippolito <>
Wed Jan 17 02:28:35 CET 2007


Robert is recommending that text should be dealt with in UCS-4, where
characters and list items are the same thing. Other encodings should
be dealt with at IO boundaries.
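
A minimal sketch of that idea, written in Python purely for illustration (the
same shape applies to an Erlang list of integers). It decodes UTF-8 bytes at
the input boundary into a list of code points, picks characters by position,
and encodes again only at the output boundary. The helper names decode_utf8,
encode_utf8 and sub_chars are hypothetical, not part of any library.

```python
def decode_utf8(data):
    """Input boundary: UTF-8 bytes -> list of Unicode code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def encode_utf8(code_points):
    """Output boundary: list of Unicode code points -> UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


def sub_chars(code_points, start, length):
    """Return `length` items beginning at 1-based position `start`."""
    return code_points[start - 1:start - 1 + length]


# The Korean sentence from the quoted message, as it would arrive off the wire.
raw = "유니코드는ISO엔코딩보다 훨씬 좋다.".encode("utf-8")

chars = decode_utf8(raw)         # one list item per character
piece = sub_chars(chars, 6, 6)   # the 6th to 11th characters

print(encode_utf8(piece).decode("utf-8"))
```

Inside the program every list item is one character, so position-based
extraction is ordinary list slicing; the byte-level encoding only matters in
the two boundary functions.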

-bob

On 1/16/07, dda <> wrote:
> Nope. Let's take for instance a utf-8 string. As an Erlang list,
> there's no way, in the language, to extract safely one character or
> more from the string. You cannot extract, in say "유니코드는ISO엔코딩보다 훨씬
> 좋다." [that's Korean if you're wondering] the 6th to 11th characters –
> ISO엔코딩 – without doing more contortions than a circus artist. A list
> is not a string, it's raw data left for us to muck with.
>
> --
> dda
>
> On 1/17/07, Robert Virding <> wrote:
> > We do actually, in fact we have something much much better, a list.
> > Using a list you don't have to worry about encodings but can use the
> > Unicode value directly in the string/list. This makes all processing
> > much easier. Then when you are done you can convert it to whatever
> > encoding you want.
> >
> > I don't really understand why anyone would want to process data in an
> > unnecessarily complex format instead of a simple one.
> >
> > Robert
> >
> > dda wrote:
> > > String types – at least well-implemented ones – don't just store a
> > > string, but also encoding information. They are/should be geared
> > > towards pain-free manipulation of text data, and by text I mean things
> > > outside ASCII-land. Encoding-aware string manipulation functions
> > > don't function on bytes, but on characters, a quite different notion.
> > > We don't have this in Erlang.