[eeps] EEP 35 "Binary string modules"

Kenji Rikitake kenji.rikitake@REDACTED
Tue Nov 23 13:37:08 CET 2010


Some thoughts on EEP35:

* The use of UTF-8 (RFC3629) in "utf-8" encoded binaries must be
  explicitly addressed in the EEP.  Just using the word "Unicode"
  does not sufficiently address the details, because in the current
  implementation of Erlang, the lists representing character strings
  hold Unicode *character numbers* (code points), while the binaries
  hold UTF-8 encoded *octet sequences*.

  This may also affect EEP10, because it does not specifically mention
  the use of character numbers (max 16#10ffff, as in RFC3629) in
  the Erlang lists representing character strings.

* The handling of overlong encodings (RFC3629 Section 3) must also be
  explicitly addressed in the EEP.

  From RFC3629 Section 3:

  "Implementations of the decoding algorithm above MUST protect against
   decoding invalid sequences.  For instance, a naive implementation may
   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
   invalid sequences may have security consequences or cause other
   problems.  See Security Considerations (Section 10) below."

* BOM (Byte Order Mark) issues should also be addressed.  I suggest
  that Erlang/OTP follow the usage recommended in RFC3629 Section 6.
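
As a rough illustration of the three points above, here is a minimal
sketch in Python (not Erlang; it is only meant to show the byte-level
behavior, and says nothing about what the EEP35 modules should do).
The helper names is_valid_utf8 and strip_bom are hypothetical, and
stripping a leading BOM is just one possible policy; RFC3629 Section 6
discusses when a BOM should be forbidden, ignored, or kept.

```python
import codecs

# Point 1: the same string seen as character numbers (what an Erlang
# list of characters holds) and as UTF-8 octets (what a binary holds).
text = "a\u00e9\u20ac\U000233b4"
code_points = [ord(ch) for ch in text]
octets = list(text.encode("utf-8"))  # 1 + 2 + 3 + 4 octets


# Point 2: a strict decoder must reject invalid sequences instead of
# mapping them to characters. Python's "utf-8" codec is strict by
# default, so it is used here as the reference behavior.
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True only if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two invalid examples quoted from RFC3629 Section 3.
overlong_nul = bytes([0xC0, 0x80])
encoded_surrogates = bytes([0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4])


# Point 3: one possible BOM policy (an assumption, not a requirement):
# drop a leading UTF-8 signature (EF BB BF) before further processing.
def strip_bom(data: bytes) -> bytes:
    """Hypothetical helper: remove a leading UTF-8 BOM if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

With these definitions, the list form has 4 elements while the octet
form has 10, both invalid sequences from the RFC are rejected, and
strip_bom(b"\xef\xbb\xbfabc") yields b"abc".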

Regards,
Kenji Rikitake

