[erlang-questions] Corrupted term on distribution channel from a C-Node
Mon Apr 27 17:21:58 CEST 2015
It might be worth running your load tests under AddressSanitizer
(https://code.google.com/p/address-sanitizer/), assuming both the Erlang
code and your code work under it. I've caught heap corruption in the
Erlang odbc module using AddressSanitizer. Maybe your code or the
Erlang code is corrupting the heap somewhere, and this is introducing
non-determinism into the system. Or, if you are super desperate and can
handle the slowdown, you could run it in production :) The problem
might not show up in your load tests because it is a specific message
that triggers the heap corruption, and that message isn't part of your
load tests. Be careful, though: AddressSanitizer will also catch some
issues that would never cause crashes (for example, an off-by-one error
that always overwrites slack space), so it can be dangerous to run
it in production, because it can turn programs that would always run
correctly into programs that crash.
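
To make the slack-space case concrete, here is a toy sketch in C. It is
not taken from your C-Node or from the Erlang sources; every name in it
(toy_block, toy_alloc, buggy_fill, correct_fill, slack_was_touched) is
made up for illustration. It assumes an allocator that rounds a 13-byte
request up to a 16-byte block, which is the kind of rounding a real
malloc typically does. The off-by-one write lands in the unused tail of
the block, so nothing ever breaks, but a checker that poisons the bytes
you did not ask for (roughly what AddressSanitizer's redzones do) will
still flag it.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16    /* size the toy allocator really hands out (assumption) */
#define POISON     0xAB  /* marker for bytes the caller did not ask for */

/* Hypothetical toy "allocation": the caller asked for `requested` bytes,
 * but the block is rounded up to BLOCK_SIZE, leaving slack at the end. */
struct toy_block {
    unsigned char bytes[BLOCK_SIZE];
    size_t requested;
};

/* Poison the whole block and remember how much was actually requested.
 * `requested` must be smaller than BLOCK_SIZE for this sketch. */
void toy_alloc(struct toy_block *b, size_t requested) {
    memset(b->bytes, POISON, BLOCK_SIZE);
    b->requested = requested;
}

/* Correct fill: writes exactly `requested` bytes. */
void correct_fill(struct toy_block *b, unsigned char value) {
    for (size_t i = 0; i < b->requested; i++) {
        b->bytes[i] = value;
    }
}

/* Buggy fill: `<=` instead of `<` writes one byte too many. The extra
 * byte lands in the slack, so the program keeps working regardless. */
void buggy_fill(struct toy_block *b, unsigned char value) {
    for (size_t i = 0; i <= b->requested; i++) {
        b->bytes[i] = value;
    }
}

/* Returns 1 if any slack byte no longer holds POISON, 0 otherwise.
 * A write of the POISON value itself would go unnoticed. */
int slack_was_touched(const struct toy_block *b) {
    for (size_t i = b->requested; i < BLOCK_SIZE; i++) {
        if (b->bytes[i] != POISON) {
            return 1;
        }
    }
    return 0;
}
```

With a 13-byte request, correct_fill leaves the slack alone and
buggy_fill touches byte 13. The same off-by-one against a real
malloc(13) buffer is the sort of thing a build with -fsanitize=address
(clang or gcc) reports as a heap-buffer-overflow and aborts on, even
though the unsanitized program would have carried on happily.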
On Mon, Apr 27, 2015 at 10:45 AM, Peter Membrey <peter@REDACTED> wrote:
> Hi guys,
> Bit of a weird mystery to share with you this time. We're using this particular C-Node in production and it handles many million messages a day with no problem. We've been running it without much in the way of modification for over a year and a half. There haven't been any recent changes made.
> A few weeks back, we had an application crash with a corrupted term coming from the C-Node:
> Last week we had another crash with a similar corrupted term:
> These two apps were running on different boxes, talking to different Erlang VMs, and everything was done over local loopback, i.e. it was the same app but everything else in the environment was independent.
> Having gone through the message with the EDP and external term man pages, it looks like we're getting a corrupt SEND command. In both cases the cookie is not being encoded correctly (the cookie is definitely not empty), and in the second instance, we have a 0 where we'd expect a 103. The messages following both corrupted SEND messages were decodable with binary_to_term/1 and the payload looked good.
> This looks similar to a problem that this guy had:
> In our setup, however:
> 1. All calls to ei_send are protected by the same mutex
> 2. Only one thread is sending this data (one thread sending data, the other listening for data from the Erlang VM)
> 3. Thread local variables for the ei_x_buff are being used
> 4. The ei_x_buff is not being reset after each send, but the index is set back to zero so encoding starts at the beginning
> Again, this app generally hums along beautifully and we've got no idea where this issue lies. I'm running the C-Node under load at the moment (roughly 15,000 messages per second) and it has done 161,200,000 messages so far this afternoon without any trouble.
> Any advice and guidance on where to look would be most gratefully appreciated!
> Kind Regards,
> Peter Membrey