[erlang-questions] Corrupted term on distribution channel from a C-Node

Tue Apr 28 02:34:52 CEST 2015

Hi guys,

Thanks for all the feedback! :)

The modifications I made were at a higher level than ei_writev, so hopefully I haven't inadvertently caused this problem - though of course it's quite possible that I have :)

The changes really only provide a way to tell the receive family of functions not to reply to a heartbeat directly. When you get an ERL_TICK, you then have to handle it yourself, which is easy enough to wrap in a mutex. That way you can still use blocking reads and don't have to worry about timeouts firing before a whole message has been read (or any of the other funky workarounds that were tried). The PR, with the docs, is here:

https://github.com/erlang/otp/pulls/pmembrey

I've been meaning to complete the test cases so this could be merged upstream...

The send_tock() function sends a heartbeat reply which is {0x00,0x00,0x00,0x00} - I wonder if that is what is being interleaved by mistake?
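A minimal sketch of the "handle the tick yourself, under a mutex" idea, assuming plain POSIX file descriptors and pthreads. The names locked_send(), send_tock_locked() and write_lock are hypothetical and are not part of erl_interface or of the patch; the only detail taken from this thread is the four-zero-byte reply.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Real code would want one lock per connection; a single global
 * keeps the sketch short. */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const unsigned char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: every writer to the fd goes through here, so
 * two messages can never be interleaved on the wire. */
int locked_send(int fd, const unsigned char *buf, size_t len)
{
    int rc;
    assert(buf != NULL || len == 0);
    pthread_mutex_lock(&write_lock);
    rc = write_all(fd, buf, len);
    pthread_mutex_unlock(&write_lock);
    return rc;
}

/* Hypothetical stand-in for send_tock(): four zero bytes, sent
 * under the same lock as ordinary messages. */
int send_tock_locked(int fd)
{
    static const unsigned char tock[4] = { 0x00, 0x00, 0x00, 0x00 };
    return locked_send(fd, tock, sizeof tock);
}
```

This only helps if every other writer to the same fd also takes the same lock; a single unlocked write path is enough to let the four zero bytes land in the middle of another message.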

Still, if you look from line 93:

https://github.com/pmembrey/otp/blob/maint/lib/erl_interface/src/connect/send.c#L93

    v[0].iov_base = (char *)header;
    v[0].iov_len = index;
    v[1].iov_base = (char *)msg;
    v[1].iov_len = msglen;

This gets sent to ei_send() - but from what I can tell, these are effectively separate write operations. Is it possible for the first vector to be only partially written, and for the second write to then start and complete?
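On the partial-write question in general: POSIX writev() submits the whole vector in one call, but like write() it may return a short count, and the caller then has to resume from wherever it stopped. The sketch below shows the usual retry loop, assuming a blocking fd. writev_all() is a hypothetical name, and whether erl_interface's own write path already loops like this is not something shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write every byte described by iov[0..iovcnt-1].
 * Returns 0 on success, -1 on error (errno set by writev).
 * Note: the iovec array is modified while advancing. */
int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    assert(iov != NULL || iovcnt == 0);
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        size_t left;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        left = (size_t)n;
        /* Skip the vectors that were written completely. */
        while (iovcnt > 0 && left >= iov->iov_len) {
            left -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* Resume inside a partially written vector, if any. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + left;
            iov->iov_len -= left;
        }
    }
    return 0;
}
```

If another thread writes to the same fd between two iterations of this loop, its bytes end up in the middle of the message, so the loop as a whole has to run under the write lock, not just each individual writev() call.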

Thanks again!

Kind Regards,

Peter Membrey

----- Original Message -----
From: "Ben Murphy" <benmmurphy@REDACTED>
To: "Peter Membrey" <peter@REDACTED>
Cc: "Jesper Louis Andersen" <jesper.louis.andersen@REDACTED>, "Erlang (E-mail)" <erlang-questions@REDACTED>
Sent: Tuesday, 28 April, 2015 00:54:34
Subject: Re: [erlang-questions] Corrupted term on distribution channel from a C-Node

Hi,

There still might be a threading issue with writes to that fd. I would
change the Erlang code so that ei_write_t/ei_writev_t asserts that the
lock is held when it is called. Maybe there is another place calling
ei_write_t/ei_writev_t without the lock that you have not caught.
Alternatively, move the lock into Erlang itself; presumably it should
then be taken in ei_writev_fill_t and ei_write_fill_t. Because you are
locking outside, at a higher layer, it is hard to know for sure whether
you are locking correctly in all the places.
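One way the "assert that the lock is held" check could look, as a rough sketch only. It assumes pthreads and a C11 compiler (for _Thread_local). All names here (fd_write_lock, write_lock_acquire(), checked_write() and so on) are made up for illustration and do not exist in erl_interface.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

static pthread_mutex_t fd_write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread records for itself whether it currently holds the
 * lock, so the check below never reads another thread's state. */
static _Thread_local bool holds_write_lock = false;

void write_lock_acquire(void)
{
    pthread_mutex_lock(&fd_write_lock);
    holds_write_lock = true;
}

void write_lock_release(void)
{
    holds_write_lock = false;
    pthread_mutex_unlock(&fd_write_lock);
}

bool write_lock_held_by_me(void)
{
    return holds_write_lock;
}

/* Hypothetical low-level write wrapper: aborts in debug builds if a
 * caller reaches it without having taken the lock first. */
ssize_t checked_write(int fd, const void *buf, size_t len)
{
    assert(write_lock_held_by_me());
    return write(fd, buf, len);
}
```

Putting the assertion at the lowest write layer means any caller that forgot the lock trips it the first time it runs, instead of showing up later as a corrupted term.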

Also, this code in connect looks suspect. Other code calls
get_ei_socket_info and then reads fields from the structure, but by
then the memory the pointer refers to may have been freed and reused
by some other allocation. Anything that reads from this structure
needs to hold the mutex over the whole code path and copy the data out
before unlocking. This could be the problem, because it looks like
one of your symptoms :)

static ei_socket_info* get_ei_socket_info(int fd)
{
    int i;
#ifdef _REENTRANT
    ei_mutex_lock(ei_sockets_lock, 0);
#endif /* _REENTRANT */
    for (i = 0; i < ei_n_sockets; ++i)
        if (ei_sockets[i].socket == fd) {
            /*fprintf("get_ei_socket_info %d  %d \"%s\"\n",
                    fd, ei_sockets[i].dist_version, ei_sockets[i].cookie);*/
#ifdef _REENTRANT
            ei_mutex_unlock(ei_sockets_lock);
#endif /* _REENTRANT */
            return &ei_sockets[i];
        }
#ifdef _REENTRANT
    ei_mutex_unlock(ei_sockets_lock);
#endif /* _REENTRANT */
    return NULL;
}
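A sketch of the copy-out-under-the-lock variant, for comparison. The struct is a simplified, hypothetical stand-in for ei_socket_info (only the three fields visible in the commented-out fprintf above), the fixed-size table is an assumption to keep it short, and put_sock_info() / get_sock_info_copy() are illustrative names rather than erl_interface functions.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_SOCKETS 16
#define COOKIE_LEN  64

/* Simplified stand-in for ei_socket_info. */
typedef struct {
    int  socket;
    int  dist_version;
    char cookie[COOKIE_LEN];
} sock_info;

static sock_info table[MAX_SOCKETS];
static int n_sockets = 0;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Add an entry; returns 1 on success, 0 if the table is full. */
int put_sock_info(int fd, int dist_version, const char *cookie)
{
    int ok = 0;
    assert(cookie != NULL);
    pthread_mutex_lock(&table_lock);
    if (n_sockets < MAX_SOCKETS) {
        sock_info *e = &table[n_sockets++];
        e->socket = fd;
        e->dist_version = dist_version;
        strncpy(e->cookie, cookie, COOKIE_LEN - 1);
        e->cookie[COOKIE_LEN - 1] = '\0';
        ok = 1;
    }
    pthread_mutex_unlock(&table_lock);
    return ok;
}

/* Copy the entry for fd into *out while the lock is still held.
 * Returns 1 if found, 0 otherwise. The caller never sees a pointer
 * into the shared table, so a later change to the table cannot
 * affect the data it is reading. */
int get_sock_info_copy(int fd, sock_info *out)
{
    int i, found = 0;
    pthread_mutex_lock(&table_lock);
    for (i = 0; i < n_sockets; ++i) {
        if (table[i].socket == fd) {
            *out = table[i];   /* struct copy happens under the lock */
            found = 1;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return found;
}
```

The difference from the function above is that nothing pointing into the shared array escapes the critical section; the caller works on its own snapshot.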

On Mon, Apr 27, 2015 at 2:53 PM, Peter Membrey <peter@REDACTED> wrote:
> Hi,
> It's non-deterministic for sure. We run both a primary and a secondary app reading in the same data (multicast). One had this issue and the other did not. Although the hardware is slightly different, the versions and configurations are basically identical.
> I believe we are running R16B03. I'll be able to double check tomorrow (can't believe I forgot to include that in the first email).
> It's actually a slightly patched version, to work around not being able to properly lock a TCP connection: the receive function would block but handle the heartbeat response internally. That version can be found here:
> https://github.com/pmembrey/otp
> I'm sending from my phone so apologies for any typos or mistakes.
> Thanks again for your help!
> Kind regards,
> Peter Membrey
> On 27 Apr 2015 21:38, Jesper Louis Andersen <jesper.louis.andersen@REDACTED> wrote:
>
>
> On Mon, Apr 27, 2015 at 11:45 AM, Peter Membrey <peter@REDACTED> wrote:
>
>> Having gone through the message with the EDP and external term man pages,
>> it looks like we're getting a corrupt SEND command. In both cases the
>> cookie is not being encoded correctly (the cookie is definitely not empty),
>> and in the second instance, we have a 0 where we'd expect a 103. The
>> messages following both corrupted SEND messages were decodable with
>> binary_to_term/1 and the payload looked good.
>>
>
> Is this nondeterministic? I wouldn't entirely rule out the possibility of
> the hardware messing up, or some error in the code elsewhere manipulating
> the wrong data. Not that I can pinpoint *that* is what is happening, but do
> not rule it out from the start that this could be a hardware thing. What
> version of Erlang is this? A recent one, or an earlier one?
>
>
> --
> J.