[erlang-questions] Corrupted term on distribution channel from a C-Node
Mon Apr 27 11:45:50 CEST 2015
Bit of a weird mystery to share with you this time. We're using this particular C-Node in production and it handles many millions of messages a day with no problem. We've been running it without much in the way of modification for over a year and a half, and there haven't been any recent changes made.
A few weeks back, we had an application crash with a corrupted term coming from the C-Node:
Last week we had another crash with a similar corrupted term:
These two apps were running on different boxes, talking to different Erlang VMs, and everything was done over local loopback, i.e. it was the same application, but everything else in the two environments was independent.
Having gone through the messages with the Erlang distribution protocol and external term format man pages, it looks like we're getting a corrupt SEND control message. In both cases the cookie is not being encoded correctly (the cookie is definitely not empty), and in the second instance we have a 0 where we'd expect a 103 (PID_EXT). The messages following both corrupted SEND messages were decodable with binary_to_term/1 and their payloads looked good.
This looks similar to a problem that this guy had:
In our set up however:
1. All calls to ei_send are protected by the same mutex
2. Only one thread is sending this data (one thread sending data, the other listening for data from the Erlang VM)
3. Thread local variables for the ei_x_buff are being used
4. The ei_x_buff is not being reset after each send, but the index is set back to zero so encoding starts at the beginning
Again, this app generally hums along beautifully and we have no idea where the issue lies. I'm running the C-Node under load at the moment (roughly 15,000 messages per second) and it has done 161,200,000 messages so far this afternoon without any trouble.
Any advice and guidance on where to look would be most gratefully appreciated!