[erlang-questions] Corrupted term on distribution channel from a C-Node

Mon Apr 27 18:54:34 CEST 2015

Hi,

There may still be a threading issue with writes to that fd. I would
change the Erlang (erl_interface) code so that ei_write_t/ei_writev_t
assert that the lock is held when they are called. There may be
another call site that reaches ei_write_t/ei_writev_t without the lock
that you have not caught. Alternatively, move the lock into
erl_interface itself; it should presumably be taken in
ei_writev_fill_t and ei_write_fill_t. Because you are locking outside,
at a higher layer, it is hard to know for sure that you are locking
correctly in all places.
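
As a rough sketch of the assertion idea (not the actual erl_interface
code): the names write_lock, write_lock_acquire, write_lock_release and
checked_write below are all hypothetical, and checked_write only stands
in for ei_write_t. It assumes POSIX threads and a C11 compiler for
_Thread_local. Each thread keeps a flag recording whether it currently
holds the lock, and the write path asserts on that flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical lock guarding all writes to the distribution fd. */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread flag: non-zero while *this* thread holds write_lock.
 * Being thread-local, it can be read without any synchronisation. */
static _Thread_local int write_lock_held;

static void write_lock_acquire(void)
{
    pthread_mutex_lock(&write_lock);
    write_lock_held = 1;
}

static void write_lock_release(void)
{
    write_lock_held = 0;
    pthread_mutex_unlock(&write_lock);
}

/* Stand-in for ei_write_t: aborts (in debug builds) if the calling
 * thread did not take the lock first. */
static ssize_t checked_write(int fd, const void *buf, size_t len)
{
    assert(write_lock_held && "write attempted without the write lock");
    return write(fd, buf, len);
}
```

Any caller that reaches checked_write without going through
write_lock_acquire first trips the assert, which should point straight
at the unlocked call site. The check disappears when built with NDEBUG.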

Also, this code in connect looks suspect. Other code calls
get_ei_socket_info and then reads fields from the returned structure.
But the lock is released before the pointer is returned, so it is
possible that the memory it points to has already been freed and is
being used by some other allocation. Anything that reads from this
structure needs to hold the mutex over the whole code path and copy
the data out before unlocking. This could be the problem, because it
looks like one of your symptoms :)

static ei_socket_info* get_ei_socket_info(int fd)
{
    int i;
#ifdef _REENTRANT
    ei_mutex_lock(ei_sockets_lock, 0);
#endif /* _REENTRANT */
    for (i = 0; i < ei_n_sockets; ++i)
        if (ei_sockets[i].socket == fd) {
            /*fprintf("get_ei_socket_info %d  %d \"%s\"\n",
                    fd, ei_sockets[i].dist_version, ei_sockets[i].cookie);*/
#ifdef _REENTRANT
            ei_mutex_unlock(ei_sockets_lock);
#endif /* _REENTRANT */
            return &ei_sockets[i];
        }
#ifdef _REENTRANT
    ei_mutex_unlock(ei_sockets_lock);
#endif /* _REENTRANT */
    return NULL;
}
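
A minimal sketch of the copy-out approach, assuming POSIX threads. The
socket_info struct here is a simplified stand-in for ei_socket_info
(only the fields visible above), and MAX_COOKIE, sockets_lock,
add_socket_info and copy_socket_info are hypothetical names, not
erl_interface API. The point is that the caller receives a private copy
made while the lock is held, instead of a pointer into the shared array.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COOKIE 512  /* hypothetical size, for illustration only */

/* Simplified stand-in for ei_socket_info. */
typedef struct {
    int  socket;
    int  dist_version;
    char cookie[MAX_COOKIE];
} socket_info;

static pthread_mutex_t sockets_lock = PTHREAD_MUTEX_INITIALIZER;
static socket_info *sockets = NULL;
static int n_sockets = 0;

/* Appends an entry. realloc may move the array, which is exactly why
 * pointers into it must not escape the lock. Returns 1 on success. */
static int add_socket_info(int fd, int dist_version, const char *cookie)
{
    socket_info *grown;
    assert(cookie != NULL);
    pthread_mutex_lock(&sockets_lock);
    grown = realloc(sockets, ((size_t)n_sockets + 1) * sizeof *sockets);
    if (grown == NULL) {
        pthread_mutex_unlock(&sockets_lock);
        return 0;
    }
    sockets = grown;
    sockets[n_sockets].socket = fd;
    sockets[n_sockets].dist_version = dist_version;
    strncpy(sockets[n_sockets].cookie, cookie, MAX_COOKIE - 1);
    sockets[n_sockets].cookie[MAX_COOKIE - 1] = '\0';
    n_sockets++;
    pthread_mutex_unlock(&sockets_lock);
    return 1;
}

/* Copies the entry for fd into *out while the lock is held.
 * Returns 1 if found, 0 otherwise; *out is untouched when not found. */
static int copy_socket_info(int fd, socket_info *out)
{
    int i, found = 0;
    assert(out != NULL);
    pthread_mutex_lock(&sockets_lock);
    for (i = 0; i < n_sockets; ++i) {
        if (sockets[i].socket == fd) {
            *out = sockets[i];  /* struct copy, done under the lock */
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&sockets_lock);
    return found;
}
```

The copy costs a few hundred bytes per lookup, but the caller's view of
the cookie and dist_version can no longer change or be freed underneath
it after the unlock.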

On Mon, Apr 27, 2015 at 2:53 PM, Peter Membrey <peter@REDACTED> wrote:
> Hi,
> It's non deterministic for sure. We run both a primary and secondary app reading in the same data (multicast). One had this issue and the other did not. Although the hardware is slightly different, the versions and configurations are basically identical.
> I believe we are running R16B03. I'll be able to double check tomorrow (can't believe I forgot to include that in the first email).
> It's actually a slightly patched version to handle the issue of not being able to properly lock a tcp connection as the receive function would block but handle the heart beat response internally. That version can be found here:
> https://github.com/pmembrey/otp
> I'm sending from my phone so apologies for any typos or mistakes.
> Thanks again for your help!
> Kind regards,
> Peter Membrey
> On 27 Apr 2015 21:38, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
>
>
> On Mon, Apr 27, 2015 at 11:45 AM, Peter Membrey <peter@REDACTED> wrote:
>
>> Having gone through the message with the EDP and external term man pages,
>> it looks like we're getting a corrupt SEND command. In both cases the
>> cookie is not being encoded correctly (the cookie is definitely not empty),
>> and in the second instance, we have a 0 where we'd expect a 103. The
>> messages following both corrupted SEND messages were decodable with
>> binary_to_term/1 and the payload looked good.
>>
>
> Is this nondeterministic? I wouldn't entirely rule out the possibility of
> the hardware messing up, or some error in the code elsewhere manipulating
> the wrong data. Not that I can pinpoint *that* is what is happening, but do
> not rule it out from the start that this could be a hardware thing. What
> version of Erlang is this? A recent one, or an earlier one?
>
>
> --
> J.