[eeps] EEP 35 "Binary string modules"

Wed Nov 24 16:18:26 CET 2010

Hi Kenji,

Both overlong UTF-8 sequences and invalid Unicode ranges are rejected in
the current implementation, both in the bit syntax and in the module
'unicode'. While that's not explicitly mentioned in the EEP, the EEP
refers to the Unicode standard documents, which clearly state the invalid
ranges, and that is why the check is implemented. It's also stated in the
manual page for the module 'unicode'. I (and Björn, who wrote the bit
syntax part) interpreted the RFC so that we don't allow overlong
sequences on either input or output. As we thought the interpretation
was quite obvious, we didn't feel we had to state that explicitly in the
EEP, but we nevertheless mentioned it in the manual page as well.
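
For illustration, here is a minimal sketch of how that rejection can be
observed from Erlang code. The module name utf8_strictness and the helper
decode_first/1 are made up for this example; only the calls into the
module 'unicode' and the utf8 bit syntax type are existing functionality.
The byte sequences are the overlong C0 80 from RFC 3629 and the first
half of the surrogate pair quoted from it.

```erlang
-module(utf8_strictness).
-export([check/0]).

%% Returns ok if every expectation below holds; crashes with a
%% badmatch otherwise. (Hypothetical module, for illustration only.)
check() ->
    Overlong  = <<16#C0, 16#80>>,          % overlong form of U+0000
    Surrogate = <<16#ED, 16#A1, 16#8C>>,   % UTF-8-style bytes for U+D84C

    %% Module 'unicode': invalid input gives an error tuple, not a list.
    {error, [], _} = unicode:characters_to_list(Overlong, utf8),
    {error, [], _} = unicode:characters_to_list(Surrogate, utf8),

    %% Bit syntax: a utf8 segment does not match invalid input.
    nomatch = decode_first(Overlong),
    nomatch = decode_first(Surrogate),
    {$a, <<>>} = decode_first(<<"a">>),

    %% Output side: a surrogate codepoint in a list cannot be encoded.
    {error, <<>>, _} = unicode:characters_to_binary([16#D800]),
    ok.

%% Hypothetical helper: decode one codepoint from the front of a binary.
decode_first(<<Cp/utf8, Rest/binary>>) -> {Cp, Rest};
decode_first(_) -> nomatch.
```

Calling utf8_strictness:check() after compiling the module should
return ok if the behaviour is as described above.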

It may be that modules like "string", which handle lists more or less
without analyzing them, would let an invalid Unicode character slip
through, but that is because of the "working without looking at the
characters" property of such a module. The EEP could definitely state
that it requires valid Unicode ranges, but as invalid characters are
invalid and therefore not Unicode, I didn't really feel I had to say that.

And, to clarify: the list representation is the *Unicode codepoints*, and
the binary representation is also the *Unicode codepoints*, but encoded
according to the UTF-8 encoding scheme. Codepoints that are invalid (due
to the unfortunate UTF-16 representation) are invalid *Unicode codepoints*
and therefore invalid regardless of how the codepoints are actually
encoded.

Overlong sequences, on the other hand, are a matter for the UTF-8
*encoding* and have little to do with codepoints.
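
The distinction between codepoints and their UTF-8 encoding can be
sketched as follows. The module name codepoints_vs_utf8 is hypothetical
and the two sample characters (U+00E5 and U+20AC) are chosen only for
illustration; the functions called in the module 'unicode' exist as used.

```erlang
-module(codepoints_vs_utf8).
-export([demo/0]).

%% Returns ok if the list and binary forms relate as described.
%% (Hypothetical module, for illustration only.)
demo() ->
    %% List representation: one integer per Unicode codepoint.
    Codepoints = [16#E5, 16#20AC],

    %% Binary representation: the same codepoints, UTF-8 encoded.
    Utf8 = unicode:characters_to_binary(Codepoints),
    <<16#C3, 16#A5, 16#E2, 16#82, 16#AC>> = Utf8,

    %% Decoding gives back exactly the original codepoints.
    Codepoints = unicode:characters_to_list(Utf8),

    %% Two codepoints, but five octets once encoded.
    {2, 5} = {length(Codepoints), byte_size(Utf8)},
    ok.
```

The codepoints are the same in both forms; only the binary form involves
octet sequences, which is where overlong encodings can arise at all.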

BOMs are addressed in the 'unicode' module. See the manual page. Having a
BOM in each Unicode-representing binary is not efficient, which is why a
binary string handling package does not deal with it.
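
As a rough sketch of handling a BOM once at the edge and then working on
plain binaries, something like the following could be used. The module
name bom_demo and the function strip_bom/1 are hypothetical;
unicode:bom_to_encoding/1 and unicode:encoding_to_bom/1 are the existing
functions in the module 'unicode'.

```erlang
-module(bom_demo).
-export([strip_bom/1, demo/0]).

%% Hypothetical helper: detect a leading BOM and return the encoding it
%% indicates together with the binary with the BOM removed.
%% When no BOM is found, bom_to_encoding/1 returns {latin1, 0}; that
%% only means "no BOM", it says nothing about the actual encoding.
strip_bom(Bin) when is_binary(Bin) ->
    {Encoding, BomLen} = unicode:bom_to_encoding(Bin),
    <<_:BomLen/binary, Rest/binary>> = Bin,
    {Encoding, Rest}.

%% Returns ok if the expectations below hold.
demo() ->
    Bom = unicode:encoding_to_bom(utf8),
    <<16#EF, 16#BB, 16#BF>> = Bom,
    {utf8, <<"abc">>} = strip_bom(<<Bom/binary, "abc">>),
    {latin1, <<"abc">>} = strip_bom(<<"abc">>),
    ok.
```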

Cheers,
/Patrik

On Tue, 23 Nov 2010, Kenji Rikitake wrote:

> Some thoughts on EEP35:
>
> * Usage of the UTF-8 (also RFC3629) in the "utf-8" encoded binaries must
>  be explicitly addressed in the EEP.  Just using the word "Unicode"
>  does not sufficiently address the details, because in the current
>  implementation of Erlang, the lists representing character strings use
>  the UTF-8 *character numbers*, while the binaries use encoded UTF-8
>  *octet sequences*.
>
>  This may affect EEP10 also, because it does not specifically mention
>  the usage of UTF-8 character number (max 10ffff#16 as in RFC3629) in
>  the Erlang lists representing character strings.
>
> * Issues of overlong encoding (RFC3629 Section 3) must be explicitly
>  addressed in the EEP also.
>
>  From RFC3629 Section 3:
>
>  "Implementations of the decoding algorithm above MUST protect against
>   decoding invalid sequences.  For instance, a naive implementation may
>   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
>   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
>   invalid sequences may have security consequences or cause other
>   problems.  See Security Considerations (Section 10) below."
>
> * BOM (Byte Order Mark) issues should also be addressed. I suggest
>  Erlang/OTP should follow the suggested use as represented in RFC3629
>  Section 6.
>
> Regards,
> Kenji Rikitake