[eeps] EEP 35 "Binary string modules"

Wed Nov 24 01:43:33 CET 2010

On 24/11/2010, at 1:37 AM, Kenji Rikitake wrote:

>  From RFC3629 Section 3:
> 
>  "Implementations of the decoding algorithm above MUST protect against
>   decoding invalid sequences.  For instance, a naive implementation may
>   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
>   or the surrogate pair ED A1 8C ED BE B4 into U+233B4.

It's not clear what "MUST protect against" means.

>  Decoding
>   invalid sequences may have security consequences or cause other
>   problems.  See Security Considerations (Section 10) below."
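
Both examples in the quoted text can be checked by hand.  Here is a
minimal Python sketch of what a "naive implementation" would compute.
The helper names are mine, purely for illustration; they are not
taken from the RFC or from any library.

```python
def naive_decode2(b0, b1):
    # 110xxxxx 10yyyyyy -> xxxxxyyyyyy, with no check for overlong forms
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)

def naive_decode3(b0, b1, b2):
    # 1110xxxx 10yyyyyy 10zzzzzz, with no check for surrogates
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

def combine_surrogates(hi, lo):
    # UTF-16 style pairing, which a UTF-8 decoder should never perform
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

nul = naive_decode2(0xC0, 0x80)        # overlong form of U+0000
hi = naive_decode3(0xED, 0xA1, 0x8C)   # U+D84C, a high surrogate
lo = naive_decode3(0xED, 0xBE, 0xB4)   # U+DFB4, a low surrogate
cp = combine_surrogates(hi, lo)        # U+233B4

print(hex(nul), hex(hi), hex(lo), hex(cp))
```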

On one reading, there is no problem with decoding an overlong sequence
as long as that does not "have security consequences or cause other
problems".  I suspect they're talking about buffer overflows here,
which don't apply in this context.  I also suspect that they may
be talking about systems where one does

	look for bad magic in a byte sequence
	decode it using UTF-8
	trust that bad magic is not there

where the test may have been done early on the grounds that the
bad magic uses ASCII characters that should code as themselves.

In the spirit of IGOR (Input Generous, Output Restricted),
we should accept overlong sequences unless there is some problem
with decoding them as such.  There are plenty of other ways to
hide bad magic, and the principle is always to check for bad
magic *after* decoding.
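
To make the ordering argument concrete, here is a minimal Python
sketch.  Everything in it is an assumption for illustration:
lenient_decode is a hypothetical decoder that accepts overlong forms,
"../" stands in for the bad magic, and the two checking functions are
made-up names.  I have chosen to keep rejecting surrogates and values
above U+10FFFF; that choice is mine, not something argued above.

```python
def lenient_decode(data: bytes) -> str:
    """Decode UTF-8, deliberately accepting overlong sequences."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 0, b
        elif 0xC0 <= b < 0xE0:
            n, cp = 1, b & 0x1F
        elif 0xE0 <= b < 0xF0:
            n, cp = 2, b & 0x0F
        elif 0xF0 <= b < 0xF8:
            n, cp = 3, b & 0x07
        else:
            raise ValueError(f"bad lead byte at offset {i}")
        tail = data[i + 1 : i + 1 + n]
        if len(tail) != n or any(t & 0xC0 != 0x80 for t in tail):
            raise ValueError(f"bad continuation at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at offset {i}")
        out.append(chr(cp))
        i += 1 + n
    return "".join(out)

BAD_MAGIC = "../"

def check_then_decode(data: bytes) -> str:
    # Wrong order: the test only sees the ASCII form of the bad magic.
    if BAD_MAGIC.encode("ascii") in data:
        raise ValueError("bad magic found in bytes")
    return lenient_decode(data)

def decode_then_check(data: bytes) -> str:
    # Right order: the test sees what the rest of the program will see.
    text = lenient_decode(data)
    if BAD_MAGIC in text:
        raise ValueError("bad magic found after decoding")
    return text

# "../" hidden as overlong two-byte forms: C0 AE = ".", C0 AF = "/"
raw = b"\xC0\xAE\xC0\xAE\xC0\xAF" + b"etc/passwd"
```

With this input, check_then_decode(raw) lets the bad magic through,
while decode_then_check(raw) raises ValueError.  The decoder is
equally generous in both cases; only the position of the check
differs.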