[erlang-bugs] Fwd: VM locks up on write to socket (and now it seems to file too)

Peter Membrey pmembrey@REDACTED
Mon Nov 26 12:35:52 CET 2012


Hi all,

Trying to send again under a new account...

Cheers,

Pete

---------- Forwarded message ----------
From: Peter Membrey <peter@REDACTED>
Date: 24 November 2012 21:57
Subject: Re: [erlang-bugs] VM locks up on write to socket (and now it seems
to file too)
To: Patrik Nyblom <pan@REDACTED>
Cc: erlang-bugs@REDACTED


Hi guys,

Thanks for getting back in touch so quickly!

I did run lsof on the process and I can confirm that it was definitely a
socket. However, by that time the application it had been trying to send to
had been killed, and when I checked, the sockets were showing as waiting to
close. Unfortunately I didn't think to run lsof until after the apps had
been shut down; I was hoping the VM would recover if I killed the app that
had upset it. However, even after all the connected apps had been shut down,
the issue didn't resolve.

The application receives requests from a client; each request contains two
data items: a stream ID and a timestamp. Both are encoded as unsigned
big-endian integers. The server then looks through the file referenced by
the stream ID and uses the timestamp as an index into it. The file format is
currently really simple, in the form of:

<<Timestamp:64/native-integer,Length:32/native-integer,Data:Length/binary>>

There is an index file that provides an offset into the data file based on
timestamp, but essentially the server opens the file and reads sequentially
through it until it finds the timestamps it cares about. In this case it
reads all data with a greater timestamp until the end of the file is
reached. It's possible the client is sending an incorrect timestamp, and
maybe too much data is being read as a result. However, the loop is very
primitive: it reads all the data in one go before passing it back to the
protocol handler to send down the socket; so even if the response is
technically incorrect and the app has failed, it should still not cause the
VM any issues.

The data is polled every 10 seconds by the client app, so I would not expect
there to be 2 GB of new data to send. I'm afraid my C skills are somewhat
limited, so I'm not sure how to put together a proper sample app to try out
writev; a rough, untested sketch of what I have in mind is included below.
The platform is 64-bit CentOS 6.3 (equivalent to RHEL 6.3), so I'm not
expecting any strange behaviour at the OS level, but of course I could be
completely wrong there. The OS is running directly on hardware, so there's
no virtualization layer to worry about.
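
The sketch just pushes a single iovec of 2158022464 bytes (the size seen in
the strace output) at a non-blocking AF_UNIX socketpair and prints whatever
writev returns each time. It is untested and the setup is a placeholder, so
please treat it as a starting point rather than a finished reproducer:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }

    /* Make the writing end non-blocking, as the inet driver's socket is. */
    int flags = fcntl(sv[0], F_GETFL, 0);
    fcntl(sv[0], F_SETFL, flags | O_NONBLOCK);

    size_t big = 2158022464UL;   /* the size seen in the strace */
    char *buf = malloc(big);     /* needs a bit over 2 GB of free RAM */
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'x', big);

    struct iovec iov = { .iov_base = buf, .iov_len = big };

    for (int i = 0; i < 5; i++) {
        ssize_t n = writev(sv[0], &iov, 1);
        if (n < 0)
            printf("writev returned %zd, errno=%d (%s)\n",
                   n, errno, strerror(errno));
        else
            printf("writev returned %zd\n", n);
    }

    free(buf);
    close(sv[0]);
    close(sv[1]);
    return 0;
}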

Hope this might offer some additional clues…

Thanks again!

Kind Regards,

Peter Membrey



On 24 November 2012 00:13, Patrik Nyblom <pan@REDACTED> wrote:
> Hi again!
>
> Could you go back to the version without the printouts and get back to the
> situation where writev loops returning 0 (as in the strace)? If so, it
> would be really interesting to see an 'lsof' of the beam process, to see if
> this file descriptor really is open and is a socket...
>
> The thing is that writev with a vector that is not empty would never
> return 0 for a non-blocking socket. Not on any modern (i.e. not ancient)
> POSIX compliant system anyway. Of course it is a *really* large item you
> are trying to write there, but it should be no problem for a 64-bit Linux.
>
> Also, I think there is no use in finding the Erlang code - I'll take that
> back. It would be more interesting to see what really happens at the OS/VM
> level in this case.
>
> Cheers,
> Patrik
>
>
> On 11/23/2012 01:49 AM, Loïc Hoguin wrote:
>>
>> Sending this on behalf of someone who didn't manage to get the email sent
>> to this list after 2 attempts. If someone can check whether he's held up
>> or something, that'd be great.
>>
>> Anyway, he has a big issue, so I hope I can relay the conversation reliably.
>>
>> Thanks!
>>
>> On 11/23/2012 01:45 AM, Peter Membrey wrote:
>>>
>>> From: Peter Membrey <peter@REDACTED>
>>> Date: 22 November 2012 19:02
>>> Subject: VM locks up on write to socket (and now it seems to file too)
>>> To: erlang-bugs@REDACTED
>>>
>>>
>>> Hi guys,
>>>
>>> I wrote a simple database application called CakeDB
>>> (https://github.com/pmembrey/cakedb) that basically spends its time
>>> reading and writing files and sockets. There's very little in the way
>>> of complex logic. It is running on CentOS 6.3 with all the updates
>>> applied. I hit this problem on R15B02 so I rolled back to R15B01 but
>>> the issue remained. Erlang was built from source.
>>>
>>> The machine has two Intel X5690 CPUs giving 12 cores plus HT. I've
>>> tried various arguments for the VM but so far nothing has prevented
>>> the problem. At the moment I'm using:
>>>
>>> +K
>>> +A 6
>>> +sbt tnnps
>>>
>>> The issue I'm seeing is that one of the scheduler threads will hit
>>> 100% cpu usage and the entire VM will become unresponsive. When this
>>> happens, I am not able to connect via the console with attach and
>>> entop is also unable to connect. I can still establish TCP connections
>>> to the application, but I never receive a response. A standard kill
>>> signal will cause the VM to shut down (it doesn't need -9).
>>>
>>> Due to the pedigree of the VM I am quite willing to accept that I've
>>> made a fundamental mistake in my code. I am pretty sure that the way I
>>> am doing the file IO could result in some race conditions. However, my
>>> poor code aside, from what I understand, I still shouldn't be able to
>>> crash / deadlock the VM like this.
>>>
>>> The issue doesn't seem to be caused by load. The app can fail when
>>> it's very busy, but also when it is practically idle. I haven't been
>>> able to find a trigger or any other explanation for the failure.
>>>
>>> The thread maxing out the CPU is attempting to write data to the socket:
>>>
>>> (gdb) bt
>>> #0  0x00007f9882ab6377 in writev () from /lib64/libc.so.6
>>> #1  0x000000000058a81f in tcp_inet_output (data=0x2407570,
>>> event=<value optimized out>) at drivers/common/inet_drv.c:9681
>>> #2  tcp_inet_drv_output (data=0x2407570, event=<value optimized out>)
>>> at drivers/common/inet_drv.c:9601
>>> #3  0x00000000004b773f in erts_port_task_execute (runq=0x7f98826019c0,
>>> curr_port_pp=0x7f9881639338)  at beam/erl_port_task.c:858
>>> #4  0x00000000004afd83 in schedule (p=<value optimized out>,
>>> calls=<value optimized out>) at beam/erl_process.c:6533
>>> #5  0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>>> #6  0x00000000004b1279 in sched_thread_func (vesdp=0x7f9881639280) at
>>> beam/erl_process.c:4834
>>> #7  0x00000000005ba726 in thr_wrapper (vtwd=0x7fff6cfe2300) at
>>> pthread/ethread.c:106
>>> #8  0x00007f9882f78851 in start_thread () from /lib64/libpthread.so.0
>>> #9  0x00007f9882abe11d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>> I then tried running strace on that thread and got (indefinitely):
>>>
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> writev(15, [{"", 2158022464}], 1)       = 0
>>> ...
>>>
>>> From what I can tell, it's trying to write data to a socket, which is
>>> succeeding, but writing 0 bytes. From the earlier definitions in the
>>> source file, an error condition would be signified by a negative
>>> number. Any other result is the number of bytes written, in this case
>>> 0. I'm not sure if this is desired behaviour or not. I've tried
>>> killing the application on the other end of the socket, but it has no
>>> effect on the VM.
>>>
>>> I have enabled debugging for the inet code, so hopefully this will
>>> give a little more insight. I am currently trying to reproduce the
>>> condition, but as I really have no idea what causes it, it's pretty
>>> much a case of wait and see.
>>>
>>>
>>> **** UPDATE ****
>>>
>>> I managed to lock up the VM again, but this time it was caused by file I/O,
>>> probably from the debugging statements. Although it worked fine for some
>>> time, the last entry in the file was cut off.
>>>
>>> From GDB:
>>>
>>> (gdb) info threads
>>>    53 Thread 0x7f83e988b700 (LWP 8621)  0x00007f83ea6da54d in read ()
>>> from /lib64/libpthread.so.0
>>>    52 Thread 0x7f83e8c8f700 (LWP 8622)  0x00007f83ea6d743c in
>>> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>    51 Thread 0x7f83e818d700 (LWP 8623)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    50 Thread 0x7f83e816b700 (LWP 8624)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    49 Thread 0x7f83e8149700 (LWP 8625)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    48 Thread 0x7f83e8127700 (LWP 8626)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    47 Thread 0x7f83e8105700 (LWP 8627)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    46 Thread 0x7f83e80e3700 (LWP 8628)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    45 Thread 0x7f83e80c1700 (LWP 8629)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    44 Thread 0x7f83e809f700 (LWP 8630)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    43 Thread 0x7f83e807d700 (LWP 8631)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    42 Thread 0x7f83e805b700 (LWP 8632)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    41 Thread 0x7f83e8039700 (LWP 8633)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    40 Thread 0x7f83e8017700 (LWP 8634)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    39 Thread 0x7f83e7ff5700 (LWP 8635)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    38 Thread 0x7f83e7fd3700 (LWP 8636)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    37 Thread 0x7f83e7fb1700 (LWP 8637)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    36 Thread 0x7f83e7f8f700 (LWP 8638)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    35 Thread 0x7f83e7f6d700 (LWP 8639)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    34 Thread 0x7f83e7f4b700 (LWP 8640)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    33 Thread 0x7f83e7f29700 (LWP 8641)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    32 Thread 0x7f83e7f07700 (LWP 8642)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    31 Thread 0x7f83e7ee5700 (LWP 8643)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    30 Thread 0x7f83e7ec3700 (LWP 8644)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    29 Thread 0x7f83e7ea1700 (LWP 8645)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    28 Thread 0x7f83e7e7f700 (LWP 8646)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    27 Thread 0x7f83d7c5a700 (LWP 8647)  0x00007f83ea6db09d in waitpid
>>> () from /lib64/libpthread.so.0
>>>    26 Thread 0x7f83d7c53700 (LWP 8648)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    25 Thread 0x7f83d7252700 (LWP 8649)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    24 Thread 0x7f83d6851700 (LWP 8650)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    23 Thread 0x7f83d5e50700 (LWP 8651)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    22 Thread 0x7f83d544f700 (LWP 8652)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    21 Thread 0x7f83d4a4e700 (LWP 8653)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    20 Thread 0x7f83d404d700 (LWP 8654)  0x00007f83ea20be7d in write ()
>>> from /lib64/libc.so.6
>>>    19 Thread 0x7f83d364c700 (LWP 8655)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    18 Thread 0x7f83d2c4b700 (LWP 8656)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    17 Thread 0x7f83d224a700 (LWP 8657)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    16 Thread 0x7f83d1849700 (LWP 8658)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    15 Thread 0x7f83d0e48700 (LWP 8659)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    14 Thread 0x7f83d0447700 (LWP 8660)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    13 Thread 0x7f83cfa46700 (LWP 8661)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    12 Thread 0x7f83cf045700 (LWP 8662)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    11 Thread 0x7f83ce644700 (LWP 8663)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    10 Thread 0x7f83cdc43700 (LWP 8664)  0x00007f83ea215ae9 in syscall
>>> () from /lib64/libc.so.6
>>>    9 Thread 0x7f83cd242700 (LWP 8665)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    8 Thread 0x7f83cc841700 (LWP 8666)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    7 Thread 0x7f83cbe40700 (LWP 8667)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    6 Thread 0x7f83cb43f700 (LWP 8668)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    5 Thread 0x7f83caa3e700 (LWP 8669)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    4 Thread 0x7f83ca03d700 (LWP 8670)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    3 Thread 0x7f83c963c700 (LWP 8671)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>>    2 Thread 0x7f83c8c3b700 (LWP 8672)  0x00007f83ea215ae9 in syscall ()
>>> from /lib64/libc.so.6
>>> * 1 Thread 0x7f83eb3a8700 (LWP 8597)  0x00007f83ea211d03 in select ()
>>> from /lib64/libc.so.6
>>> (gdb)
>>>
>>>
>>> (gdb) bt
>>> #0  0x00007f83ea20be7d in write () from /lib64/libc.so.6
>>> #1  0x00007f83ea1a2583 in _IO_new_file_write () from /lib64/libc.so.6
>>> #2  0x00007f83ea1a3b35 in _IO_new_do_write () from /lib64/libc.so.6
>>> #3  0x00007f83ea1a21fd in _IO_new_file_xsputn () from /lib64/libc.so.6
>>> #4  0x00007f83ea17589d in vfprintf () from /lib64/libc.so.6
>>> #5  0x00007f83ea18003a in printf () from /lib64/libc.so.6
>>> #6  0x000000000058f0e8 in tcp_recv (desc=0x2c3d350, request_len=0) at
>>> drivers/common/inet_drv.c:8976
>>> #7  0x000000000058f63a in tcp_inet_input (data=0x2c3d350, event=<value
>>> optimized out>) at drivers/common/inet_drv.c:9326
>>> #8  tcp_inet_drv_input (data=0x2c3d350, event=<value optimized out>)
>>> at drivers/common/inet_drv.c:9604
>>> #9  0x00000000004b770f in erts_port_task_execute (runq=0x7f83e9d5d3c0,
>>> curr_port_pp=0x7f83e8dc6e78) at beam/erl_port_task.c:851
>>> #10 0x00000000004afd83 in schedule (p=<value optimized out>,
>>> calls=<value optimized out>) at beam/erl_process.c:6533
>>> #11 0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>>> #12 0x00000000004b1279 in sched_thread_func (vesdp=0x7f83e8dc6dc0) at
>>> beam/erl_process.c:4834
>>> #13 0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266da0) at
>>> pthread/ethread.c:106
>>> #14 0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>>> #15 0x00007f83ea21911d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>> (gdb) bt
>>> #0  0x00007f83ea6da54d in read () from /lib64/libpthread.so.0
>>> #1  0x0000000000554b6e in signal_dispatcher_thread_func (unused=<value
>>> optimized out>) at sys/unix/sys.c:2776
>>> #2  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266c80) at
>>> pthread/ethread.c:106
>>> #3  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x00007f83ea21911d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>> (gdb) bt
>>> #0  0x00007f83ea215ae9 in syscall () from /lib64/libc.so.6
>>> #1  0x00000000005bba35 in wait__ (e=0x2989390) at
>>> pthread/ethr_event.c:92
>>> #2  ethr_event_wait (e=0x2989390) at pthread/ethr_event.c:218
>>> #3  0x00000000004ae5bd in erts_tse_wait (fcalls=<value optimized out>,
>>> esdp=0x7f83e8e2c440, rq=0x7f83e9d5e7c0) at beam/erl_threads.h:2319
>>> #4  scheduler_wait (fcalls=<value optimized out>, esdp=0x7f83e8e2c440,
>>> rq=0x7f83e9d5e7c0) at beam/erl_process.c:2087
>>> #5  0x00000000004afb94 in schedule (p=<value optimized out>,
>>> calls=<value optimized out>) at beam/erl_process.c:6467
>>> #6  0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>>> #7  0x00000000004b1279 in sched_thread_func (vesdp=0x7f83e8e2c440) at
>>> beam/erl_process.c:4834
>>> #8  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266da0) at
>>> pthread/ethread.c:106
>>> #9  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>>> #10 0x00007f83ea21911d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>>
>>> (gdb) bt
>>> #0  0x00007f83ea6db09d in waitpid () from /lib64/libpthread.so.0
>>> #1  0x0000000000555a9f in child_waiter (unused=<value optimized out>)
>>> at sys/unix/sys.c:2700
>>> #2  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266d50) at
>>> pthread/ethread.c:106
>>> #3  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x00007f83ea21911d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>>
>>> **** END UPDATE ****
>>>
>>>
>>> I'm happy to provide any information I can, so please don't hesitate to
>>> ask.
>>>
>>> Thanks in advance!
>>>
>>> Kind Regards,
>>>
>>> Peter Membrey
>>>
>>
>>
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs