[eeps] EEP 35 "Binary string modules"
Wed Nov 24 16:18:26 CET 2010
Both overlong UTF-8 sequences and invalid Unicode ranges are rejected in
the current implementation, both in the bit syntax and
in the module 'unicode'. While that's not explicitly mentioned in the EEP,
the EEP refers to the Unicode standard documents, which clearly
state the invalid ranges, and that is why the check is implemented. It's
also stated in the manual page for the module 'unicode'. I (and Björn, who
wrote the bit syntax part) interpreted the RFC to mean that we don't allow
overlong sequences on either input or output. As we thought the
interpretation was quite obvious, we didn't feel we had to explicitly
state that in the EEP either, but we nevertheless mentioned it in the manual
page as well.
It may be that modules like "string", which handle lists more or less
without analyzing them, would let an invalid Unicode character slip through,
but that is because of the "working without looking at the characters"
property of such a module. The EEP could definitely state that it requires
valid Unicode ranges, but as invalid characters are invalid and therefore
not Unicode, I didn't really feel I had to say that.
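
To make the "working without looking at the characters" point concrete, here is a minimal sketch in Python (not Erlang, and not the 'string' or 'unicode' module code). It assumes the valid range is 0..10FFFF with the surrogate block D800..DFFF excluded, as in RFC 3629; the helper names `is_valid_codepoint` and `first_invalid` are hypothetical.

```python
# Illustration only: a list of integers stands in for an Erlang character list.

def is_valid_codepoint(cp: int) -> bool:
    """True if cp is in 0..0x10FFFF and is not a UTF-16 surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def first_invalid(codepoints):
    """Return the index of the first invalid codepoint, or None if all are valid."""
    for index, cp in enumerate(codepoints):
        if not is_valid_codepoint(cp):
            return index
    return None

# "H", "i", followed by a lone high surrogate (invalid as a codepoint).
text = [0x48, 0x69, 0xD84C]

# A pure list operation never inspects the elements, so the invalid
# value passes straight through.
reversed_text = text[::-1]

# Only a step that actually looks at the values can notice the problem.
bad_index = first_invalid(reversed_text)
```

Here `reversed_text` still contains the surrogate, and `first_invalid` is the only step that reports it.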
And, to clarify - the list representation is the *Unicode codepoints*, the
binary representation is also the *Unicode codepoints*, but encoded
according to the UTF-8 encoding scheme. Codepoints that are invalid (due
to the unfortunate UTF-16 representation) are invalid *Unicode codepoints*
and therefore invalid regardless of how the codepoints are actually
encoded. Overlong sequences, on the other hand, are a matter for the UTF-8
*encoding* and have little to do with codepoints.
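
The distinction between an invalid codepoint and an invalid encoding can be sketched with any strict UTF-8 codec. The example below uses Python's built-in codec purely as an illustration; it is not the Erlang implementation, and the helper names `decodes_as_utf8` and `encodes_as_utf8` are hypothetical. The byte sequences are the ones quoted from RFC 3629, Section 3.

```python
# Illustration only: Python's strict UTF-8 codec, not Erlang's 'unicode' module.

def decodes_as_utf8(octets: bytes) -> bool:
    """True if the octets are well-formed UTF-8 under a strict decoder."""
    try:
        octets.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

def encodes_as_utf8(codepoint: int) -> bool:
    """True if the codepoint can be encoded as UTF-8 at all."""
    try:
        chr(codepoint).encode("utf-8")
        return True
    except (UnicodeEncodeError, ValueError):
        # ValueError: above 0x10FFFF; UnicodeEncodeError: surrogate.
        return False

# Encoding-level error: C0 80 is an overlong form of U+0000.
overlong_ok = decodes_as_utf8(b"\xc0\x80")
# The shortest form of the same codepoint is accepted.
shortest_ok = decodes_as_utf8(b"\x00")

# Codepoint-level error: D84C is a surrogate, so it is invalid whether it
# appears as an integer or as the octets ED A1 8C.
surrogate_encodes = encodes_as_utf8(0xD84C)
surrogate_octets_ok = decodes_as_utf8(b"\xed\xa1\x8c\xed\xbe\xb4")

# U+233B4 itself is a valid codepoint with a valid four-octet encoding.
valid_octets = chr(0x233B4).encode("utf-8")
```

The overlong case fails only at the octet level, while the surrogate fails both as a codepoint and as octets.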
BOMs are addressed in the 'unicode' module. See the manual page. Having a
BOM in each Unicode-representing binary is not efficient, which is why a binary
string handling package does not deal with it.
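
As a rough illustration of handling the BOM once, at the boundary where data enters the system, rather than in every binary, here is a minimal Python sketch. It is not the 'unicode' module's code: the function names `bom_to_encoding` and `strip_bom`, the encoding labels, and the `(None, 0)` result for "no BOM" are all assumptions made for this example.

```python
import codecs

# Check the four-octet BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def bom_to_encoding(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def strip_bom(data: bytes) -> bytes:
    """Drop a leading BOM, if any, and return the remaining octets."""
    _, length = bom_to_encoding(data)
    return data[length:]

# Detect and remove the BOM once, when the data is read ...
incoming = codecs.BOM_UTF8 + "text".encode("utf-8")
encoding, bom_length = bom_to_encoding(incoming)
payload = strip_bom(incoming)
# ... so that later string operations see BOM-free octets.
```

After this step the payload carries no BOM, so string-handling code never has to account for one.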
On Tue, 23 Nov 2010, Kenji Rikitake wrote:
> Some thoughts on EEP35:
> * Usage of the UTF-8 (also RFC3629) in the "utf-8" encoded binaries must
> be explicitly addressed in the EEP. Just using the word "Unicode"
> does not sufficiently address the details, because in the current
> implementation of Erlang, the lists representing character strings use
> the UTF-8 *character numbers*, while the binaries use encoded UTF-8
> *octet sequences*.
> This may affect EEP10 also, because it does not specifically mention
> the usage of UTF-8 character number (max 10ffff#16 as in RFC3629) in
> the Erlang lists representing character strings.
> * Issues of overlong encoding (RFC3629 Section 3) must be explicitly
> addressed in the EEP also.
> From RFC3629 Section 3:
> "Implementations of the decoding algorithm above MUST protect against
> decoding invalid sequences. For instance, a naive implementation may
> decode the overlong UTF-8 sequence C0 80 into the character U+0000,
> or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
> invalid sequences may have security consequences or cause other
> problems. See Security Considerations (Section 10) below."
> * BOM (Byte Order Mark) issues should also be addressed. I suggest
> Erlang/OTP should follow the suggested use as represented in RFC3629
> Section 6.
> Kenji Rikitake