[erlang-questions] Corrupted term on distribution channel from a C-Node

Thu Apr 30 06:31:48 CEST 2015

Hi guys,

Might be getting a little closer :)

It looks like the cookie itself is not corrupted. I've gone over the network captures and none of the SEND messages actually seem to include the cookie. In fact, I wasn't able to find the cookie "in the clear" anywhere between the time the connection was made and the time it was closed. However, the next thing that should come after the cookie is the destination PID, and this seems to be where we're seeing the problem.

I've been able to sort of replicate it in development, in that I can now get it to crash within 30 minutes or so, rather than waiting 6 months and praying. If I run it under strace or valgrind, the problem appears to go away (good old Heisenbug).

I've added code to print the node name part of the struct when it diverges from what I expect. So far this code has not been triggered, even though it has still crashed. This suggests that the PID struct is probably good. This PID struct is passed directly to ei_send.

From the network capture I can confirm that a corrupt PID was sent on the wire. Interestingly, however, there were valid messages before it and valid messages after it, which suggests it's not a global state issue.

I'll continue to dig into this. Right now I suspect we're passing dodgy PIDs to ei_send, but is it possible that ei_send is doing something funky?

As always, any feedback or ideas gratefully received :)

Kind Regards,

Peter Membrey

----- Original Message -----
From: "Peter Membrey" <peter@REDACTED>
To: "Ben Murphy" <benmmurphy@REDACTED>
Cc: "Erlang (E-mail)" <erlang-questions@REDACTED>
Sent: Tuesday, 28 April, 2015 08:34:52
Subject: Re: [erlang-questions] Corrupted term on distribution channel from a C-Node

Hi guys,

Thanks for all the feedback! :)

The modifications I made were at a higher level than ei_writev, so hopefully I haven't inadvertently caused this problem - though of course it's quite possible that I have :)

The changes really only provided a way to tell the receive family of functions not to reply to a heartbeat directly. Then when you get an ERL_TICK, you need to handle it yourself, which is easy enough to wrap in a mutex. That way you can still use blocking reads and don't have to worry about timeouts firing before a whole message has been read (or any of the other funky workarounds that were tried). The PR is here with the documentation:

https://github.com/erlang/otp/pulls/pmembrey

I've been meaning to complete the test cases so this could be merged upstream...

The send_tock() function sends a heartbeat reply which is {0x00,0x00,0x00,0x00} - I wonder if that is what is being interleaved by mistake?
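One way to test the interleaving theory against a capture is to walk the byte stream frame by frame. The sketch below assumes the post-handshake framing described in the distribution protocol documentation: a 4-byte big-endian length followed by that many payload bytes, where a length of 0 is a tick and a normal payload starts with the pass-through type byte 112. It also assumes the TCP payload has already been reassembled into one buffer. The names next_frame and frame_kind are made up for illustration and are not part of erl_interface. If four zero bytes landed in the middle of another message, the walk would lose sync at that point and report an implausible length or an unexpected type byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of inspecting one frame in a captured byte stream. */
enum frame_kind {
    FRAME_TICK,          /* length 0: heartbeat */
    FRAME_PASS_THROUGH,  /* payload starts with type byte 112 */
    FRAME_OTHER,         /* payload starts with something else */
    FRAME_TRUNCATED      /* not enough bytes left for a whole frame */
};

/* Classify the frame starting at buf[*pos] and advance *pos past it.
 * On FRAME_TRUNCATED, *pos is left unchanged. */
static enum frame_kind next_frame(const uint8_t *buf, size_t len, size_t *pos)
{
    assert(buf != NULL && pos != NULL && *pos <= len);

    if (len - *pos < 4)
        return FRAME_TRUNCATED;

    const uint8_t *p = buf + *pos;
    uint32_t n = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];

    if (n > len - *pos - 4)
        return FRAME_TRUNCATED;

    *pos += 4 + (size_t)n;
    if (n == 0)
        return FRAME_TICK;
    return p[4] == 112 ? FRAME_PASS_THROUGH : FRAME_OTHER;
}
```

Calling next_frame() in a loop over the capture and logging the offset of the first FRAME_OTHER or FRAME_TRUNCATED result would show where the framing first goes wrong.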

Still, if you look from line 93:

https://github.com/pmembrey/otp/blob/maint/lib/erl_interface/src/connect/send.c#L93

    v[0].iov_base = (char *)header;
    v[0].iov_len = index;
    v[1].iov_base = (char *)msg;
    v[1].iov_len = msglen;

This gets sent to ei_send() - but from what I can tell, these are effectively separate write operations. Is it possible for the first vector to fail to write completely, and then for the second write to start and complete?
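For context on that question: with POSIX writev(), the vectors are consumed in order within a single call, so the second vector cannot be written before the first has been written in full. What can happen is a short write, where the call returns after only part of the data (possibly stopping in the middle of either vector), and the caller then has to resume from exactly that offset. If another thread writes to the same descriptor between the retries, its bytes end up in the middle of the message. The following is a generic sketch of such a retry loop. It is not the erl_interface implementation, and writev_all is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write every byte described by iov[0..iovcnt-1] to fd, retrying after
 * short writes. The iov array is modified in place. Returns 0 on success
 * or -1 on error (errno is set by writev). */
static int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    assert(iov != NULL || iovcnt == 0);

    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }

        size_t done = (size_t)n;

        /* Skip vectors that were written completely. */
        while (iovcnt > 0 && done >= iov->iov_len) {
            done -= iov->iov_len;
            ++iov;
            --iovcnt;
        }

        /* Advance into a partially written vector, if any. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + done;
            iov->iov_len -= done;
        }
    }
    return 0;
}
```

The important point for the bug hunt is that the whole loop, not just each individual writev() call, would need to run under the same lock as the heartbeat reply.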

Thanks again!

Kind Regards,

Peter Membrey

----- Original Message -----
From: "Ben Murphy" <benmmurphy@REDACTED>
To: "Peter Membrey" <peter@REDACTED>
Cc: "Jesper Louis Andersen" <jesper.louis.andersen@REDACTED>, "Erlang (E-mail)" <erlang-questions@REDACTED>
Sent: Tuesday, 28 April, 2015 00:54:34
Subject: Re: [erlang-questions] Corrupted term on distribution channel from a C-Node

Hi,

There still might be a threading issue with writes to that fd. I would
change the erlang code so that ei_write_t/ei_writev_t asserts that the
lock is held when it is called. Maybe there is another place that is
calling ei_write_t/ei_writev_t without the lock that you have not
caught. Or alternatively move the lock into erlang and the lock
presumably should be taken at ei_writev_fill_t and ei_write_fill_t.
Because you are locking outside at a higher layer it is hard to know
for sure if you are locking correctly at all the places.
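A minimal sketch of the "assert that the lock is held" idea, using plain pthreads rather than the ei_mutex wrappers. Everything here (guarded_fd, guarded_write and friends) is hypothetical and only illustrates the pattern; the real change would go into ei_write_t/ei_writev_t. The ownership check is a best-effort debugging aid, not a synchronisation primitive.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* A descriptor plus the mutex that must be held while writing to it. */
struct guarded_fd {
    int fd;
    pthread_mutex_t lock;
    pthread_t owner;   /* meaningful only while held != 0 */
    int held;
};

static void guarded_init(struct guarded_fd *g, int fd)
{
    g->fd = fd;
    g->held = 0;
    pthread_mutex_init(&g->lock, NULL);
}

static void guarded_lock(struct guarded_fd *g)
{
    pthread_mutex_lock(&g->lock);
    g->owner = pthread_self();
    g->held = 1;
}

static void guarded_unlock(struct guarded_fd *g)
{
    g->held = 0;
    pthread_mutex_unlock(&g->lock);
}

/* Nonzero if the calling thread currently holds g->lock. */
static int guarded_held_by_me(const struct guarded_fd *g)
{
    return g->held && pthread_equal(g->owner, pthread_self());
}

/* Write wrapper that aborts (unless NDEBUG is defined) when the caller
 * has not taken the lock first. */
static ssize_t guarded_write(struct guarded_fd *g, const void *buf, size_t len)
{
    assert(guarded_held_by_me(g));
    return write(g->fd, buf, len);
}
```

With a check like this in the lowest-level write path, any caller that bypasses the lock should trip the assertion quickly instead of silently corrupting the stream.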

Also in connect this code looks suspect. Other stuff is calling
get_ei_socket_info and then reading stuff from the structure. But it
is possible that the memory pointed to by the pointer has been freed
and is being used by some other allocation. Stuff that reads from this
structure needs to take out the mutex over the whole code path then
copy it out before unlocking. It could be that this is the problem
because it looks like one of your symptoms :)

static ei_socket_info* get_ei_socket_info(int fd)
{
    int i;
#ifdef _REENTRANT
    ei_mutex_lock(ei_sockets_lock, 0);
#endif /* _REENTRANT */
    for (i = 0; i < ei_n_sockets; ++i)
        if (ei_sockets[i].socket == fd) {
            /*fprintf("get_ei_socket_info %d  %d \"%s\"\n",
                    fd, ei_sockets[i].dist_version, ei_sockets[i].cookie);*/
#ifdef _REENTRANT
            ei_mutex_unlock(ei_sockets_lock);
#endif /* _REENTRANT */
            return &ei_sockets[i];
        }
#ifdef _REENTRANT
    ei_mutex_unlock(ei_sockets_lock);
#endif /* _REENTRANT */
    return NULL;
}
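The copy-out approach described above could look roughly like this. It is only a sketch: socket_info is a stand-in for ei_socket_info with made-up fields and sizes, the table is a fixed array for simplicity, and add_socket_info and copy_socket_info are hypothetical names. The point is that the caller receives a copy made while the lock was held, never a pointer into the shared table.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_SOCKETS 16   /* illustrative limits, not taken from erl_interface */
#define MAX_COOKIE  64

/* Stand-in for ei_socket_info; the real struct differs. */
typedef struct {
    int socket;
    int dist_version;
    char cookie[MAX_COOKIE];
} socket_info;

static socket_info sockets[MAX_SOCKETS];
static int n_sockets = 0;
static pthread_mutex_t sockets_lock = PTHREAD_MUTEX_INITIALIZER;

/* Register a connection. Returns 1 on success, 0 if the table is full. */
static int add_socket_info(int fd, int dist_version, const char *cookie)
{
    int ok = 0;
    assert(cookie != NULL);
    pthread_mutex_lock(&sockets_lock);
    if (n_sockets < MAX_SOCKETS) {
        socket_info *s = &sockets[n_sockets++];
        s->socket = fd;
        s->dist_version = dist_version;
        strncpy(s->cookie, cookie, MAX_COOKIE - 1);
        s->cookie[MAX_COOKIE - 1] = '\0';
        ok = 1;
    }
    pthread_mutex_unlock(&sockets_lock);
    return ok;
}

/* Copy the entry for fd into *out while holding the lock.
 * Returns 1 if found, 0 otherwise. */
static int copy_socket_info(int fd, socket_info *out)
{
    int i, found = 0;
    assert(out != NULL);
    pthread_mutex_lock(&sockets_lock);
    for (i = 0; i < n_sockets; ++i) {
        if (sockets[i].socket == fd) {
            *out = sockets[i];   /* struct copy happens under the lock */
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&sockets_lock);
    return found;
}
```

A caller would then use the local copy (for example its cookie or dist_version) without caring whether another thread has since removed or moved the table entry.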

On Mon, Apr 27, 2015 at 2:53 PM, Peter Membrey <peter@REDACTED> wrote:
> Hi,
>
> It's non-deterministic for sure. We run both a primary and secondary app reading in the same data (multicast). One had this issue and the other did not. Although the hardware is slightly different, the versions and configurations are basically identical.
>
> I believe we are running R16B03. I'll be able to double check tomorrow (can't believe I forgot to include that in the first email).
>
> It's actually a slightly patched version to handle the issue of not being able to properly lock a TCP connection, as the receive function would block but handle the heartbeat response internally. That version can be found here:
>
> https://github.com/pmembrey/otp
>
> I'm sending from my phone so apologies for any typos or mistakes.
>
> Thanks again for your help!
>
> Kind regards,
>
> Peter Membrey
>
> On 27 Apr 2015 21:38, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
>
>
> On Mon, Apr 27, 2015 at 11:45 AM, Peter Membrey <peter@REDACTED> wrote:
>
>> Having gone through the message with the EDP and external term man pages,
>> it looks like we're getting a corrupt SEND command. In both cases the
>> cookie is not being encoded correctly (the cookie is definitely not empty),
>> and in the second instance, we have a 0 where we'd expect a 103. The
>> messages following both corrupted SEND messages were decodable with
>> binary_to_term/1 and the payload looked good.
>>
>
> Is this nondeterministic? I wouldn't entirely rule out the possibility of
> the hardware messing up, or some error in the code elsewhere manipulating
> the wrong data. Not that I can pinpoint *that* is what is happening, but do
> not rule it out from the start that this could be a hardware thing. What
> version of Erlang is this? A recent one, or an earlier one?
>
>
> --
> J.
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>