[erlang-patches] Montavista system call error handling patch
Fri May 14 01:55:21 CEST 2010
On Thu, May 13, 2010 at 6:12 PM, Mikael Pettersson <> wrote:
> Steve Vinoski writes:
> > We've recently experienced a 100% repeatable case of R13B02-1 and
> > R13B04 beam core dumps on a Montavista Linux system running on a
> > Cavium Octeon processor. This is technically not a problem with the
> > Erlang VM but rather is a case where certain socket-related calls in
> > this version of Montavista Linux, based on a 2.6.21 Linux kernel, are
> > indicating failure by returning negative numbers with large absolute
> > values, such as negative 0x40000, rather than returning -1 to indicate
> > failure as they should. The Erlang TCP code expects system calls to
> > return -1 to indicate errors and compares directly against that value,
> > so it ends up treating these negative return values as successful and
> > adds them to internal pointers as bytes written, offsets, etc. This in
> > turn corrupts these pointers and causes beam to dump core.
> > This patch introduces a portability macro to test for system call
> > failure within inet_drv.c:
> > git fetch git://github.com/vinoski/otp.git socket_error_portability
> I'm not happy with this patch.
> First, you're changing (weakening) error checking on all non-Win32
> platforms, not just the apparently horribly broken MVL.
I expected this patch to be controversial, and I can certainly
understand if the Erlang/OTP team doesn't want to accept it. In that
case, I'll just use the patch in my own build. But the reason I
submitted it is that it could enhance Erlang's portability and might
save someone else from having to debug and work around the same issue.
I'm not sure I agree that this approach weakens any error checking.
Socket system calls generally return ssize_t, which is "signed
size_t." A positive return value or 0 is the number of bytes written
or read for calls that write or read, or for success/fail types of
calls, 0 indicates success. This leaves negative numbers as being
error cases. POSIX specifies -1 for errors for these cases, yet here
we have a kernel returning a negative number other than -1. Most such
code of this type I've ever written personally over the years or have
seen written by others (including books and other websites) has always
used a "< 0" comparison rather than comparing directly against -1. Go
thumb through Rich Stevens's network programming books, for example,
and you'll find the "< 0" comparison used everywhere.
So, rather than weakening the error checking, I would argue this
approach enhances the portability of Erlang by guarding against a case
actually seen in practice on a real live kernel. Certainly one can
assert the kernel is behaving badly, but does that mean Erlang
shouldn't run on it, especially given that the change to make it work
is so straightforward?
> At a minimum
> the new IS_SOCKET_ERROR() macro should be written as
> #if <whatever the test for Montavista's butchered Linux is>
> #define IS_SOCKET_ERROR(val) ((val) < 0)
> #define IS_SOCKET_ERROR(val) ((val) == SOCKET_ERROR)
But how do we know that this kernel version from this particular
vendor is the only one suffering from this bug?
> Second, the change needs a big comment. Otherwise someone who's ever
> read the POSIX or SuS specs will want to clean it up and turn it back
> to the original, not realising the reason for the out-of-spec error check.
Adding a comment is not a problem -- I've gone ahead and amended my
commit to include one.
> (Personally I'd rather wrap the broken socket calls with functions that
> convert them to sane APIs, in this case just fix up error returns, than
> to force all users to adapt to the breakage. But lots of code in erts
> could do with that sort of cleanup, so I'm not going to complain if you
> don't want to do it that way.)
If I was confident that the problems we've seen were limited to just
certain system calls, then yes, perhaps that approach would be better.
I appreciate you taking a look at the patch and commenting on it,
whether it's accepted or not, and thanks for letting me explain
further why I think the patch has merit.
More information about the erlang-patches