[erlang-bugs] VM locks up on write to socket (and now it seems to file too)

Patrik Nyblom pan@REDACTED
Fri Nov 23 10:21:42 CET 2012


Hi!

Try to trace the erlang code to see what triggers this. Some sequence of 
operations or some special data sent on the socket?

BTW you have to be registered (with the correct mail address, the one 
you send from) to post to this list, that's usually the problem when 
you're unable to send to the list.

Cheers,
Patrik
On 11/23/2012 01:49 AM, Loïc Hoguin wrote:
> Sending this on behalf of someone who didn't manage to get the email 
> sent to this list after 2 attempts. If someone can check if he's hold 
> up or something that'd be great.
>
> Anyway he has a big issue so I hope I can relay the conversation 
> reliably.
>
> Thanks!
>
> On 11/23/2012 01:45 AM, Peter Membrey wrote:
>> From: Peter Membrey <peter@REDACTED>
>> Date: 22 November 2012 19:02
>> Subject: VM locks up on write to socket (and now it seems to file too)
>> To: erlang-bugs@REDACTED
>>
>>
>> Hi guys,
>>
>> I wrote a simple database application called CakeDB
>> (https://github.com/pmembrey/cakedb) that basically spends its time
>> reading and writing files and sockets. There's very little in the way
>> of complex logic. It is running on CentOS 6.3 with all the updates
>> applied. I hit this problem on R15B02 so I rolled back to R15B01 but
>> the issue remained. Erlang was built from source.
>>
>> The machine has two Intel X5690 CPUs giving 12 cores plus HT. I've
>> tried various arguments for the VM but so far nothing has prevented
>> the problem. At the moment I'm using:
>>
>> +K
>> +A 6
>> +sbt tnnps
>>
>> The issue I'm seeing is that one of the scheduler threads will hit
>> 100% cpu usage and the entire VM will become unresponsive. When this
>> happens, I am not able to connect via the console with attach and
>> entop is also unable to connect. I can still establish TCP connections
>> to the application, but I never receive a response. A standard kill
>> signal will cause the VM to shut down (it doesn't need -9).
>>
>> Due to the pedigree of the VM I am quite willing to accept that I've
>> made a fundamental mistake in my code. I am pretty sure that the way I
>> am doing the file IO could result in some race conditions. However, my
>> poor code aside, from what I understand, I still shouldn't be able to
>> crash / deadlock the VM like this.
>>
>> The issue doesn't seem to be caused by load. The app can fail when
>> it's very busy, but also when it is practically idle. I haven't been
>> able to find a trigger or any other explanation for the failure.
>>
>> The thread maxing out the CPU is attempting to write data to the socket:
>>
>> (gdb) bt
>> #0  0x00007f9882ab6377 in writev () from /lib64/libc.so.6
>> #1  0x000000000058a81f in tcp_inet_output (data=0x2407570,
>> event=<value optimized out>) at drivers/common/inet_drv.c:9681
>> #2  tcp_inet_drv_output (data=0x2407570, event=<value optimized out>)
>> at drivers/common/inet_drv.c:9601
>> #3  0x00000000004b773f in erts_port_task_execute (runq=0x7f98826019c0,
>> curr_port_pp=0x7f9881639338)  at beam/erl_port_task.c:858
>> #4  0x00000000004afd83 in schedule (p=<value optimized out>,
>> calls=<value optimized out>) at beam/erl_process.c:6533
>> #5  0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>> #6  0x00000000004b1279 in sched_thread_func (vesdp=0x7f9881639280) at
>> beam/erl_process.c:4834
>> #7  0x00000000005ba726 in thr_wrapper (vtwd=0x7fff6cfe2300) at
>> pthread/ethread.c:106
>> #8  0x00007f9882f78851 in start_thread () from /lib64/libpthread.so.0
>> #9  0x00007f9882abe11d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> I then tried running strace on that thread and got (indefinitely):
>>
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> ...
>>
>>  From what I can tell, it's trying to write data to a socket, which is
>> succeeding, but writing 0 bytes. From the earlier definitions in the
>> source file, an error condition would be signified by a negative
>> number. Any other result is the number of bytes written, in this case
>> 0. I'm not sure if this is desired behaviour or not. I've tried
>> killing the application on the other end of the socket, but it has no
>> effect on the VM.
>>
>> I have enabled debugging for the inet code, so hopefully this will
>> give a little more insight. I am currently trying to reproduce the
>> condition, but as I really have no idea what causes it, it's pretty
>> much a case of wait and see.
>>
>>
>> **** UPDATE ****
>>
>> I managed to lock up the VM again, but this time it was caused by 
>> file IO,
>> probably from the debugging statements. Although it worked fine for 
>> some time
>> the last entry in the file was cut off.
>>
>>  From GDB:
>>
>> (gdb) info threads
>>    53 Thread 0x7f83e988b700 (LWP 8621)  0x00007f83ea6da54d in read ()
>> from /lib64/libpthread.so.0
>>    52 Thread 0x7f83e8c8f700 (LWP 8622)  0x00007f83ea6d743c in
>> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>    51 Thread 0x7f83e818d700 (LWP 8623)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    50 Thread 0x7f83e816b700 (LWP 8624)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    49 Thread 0x7f83e8149700 (LWP 8625)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    48 Thread 0x7f83e8127700 (LWP 8626)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    47 Thread 0x7f83e8105700 (LWP 8627)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    46 Thread 0x7f83e80e3700 (LWP 8628)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    45 Thread 0x7f83e80c1700 (LWP 8629)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    44 Thread 0x7f83e809f700 (LWP 8630)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    43 Thread 0x7f83e807d700 (LWP 8631)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    42 Thread 0x7f83e805b700 (LWP 8632)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    41 Thread 0x7f83e8039700 (LWP 8633)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    40 Thread 0x7f83e8017700 (LWP 8634)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    39 Thread 0x7f83e7ff5700 (LWP 8635)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    38 Thread 0x7f83e7fd3700 (LWP 8636)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    37 Thread 0x7f83e7fb1700 (LWP 8637)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    36 Thread 0x7f83e7f8f700 (LWP 8638)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    35 Thread 0x7f83e7f6d700 (LWP 8639)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    34 Thread 0x7f83e7f4b700 (LWP 8640)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    33 Thread 0x7f83e7f29700 (LWP 8641)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    32 Thread 0x7f83e7f07700 (LWP 8642)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    31 Thread 0x7f83e7ee5700 (LWP 8643)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    30 Thread 0x7f83e7ec3700 (LWP 8644)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    29 Thread 0x7f83e7ea1700 (LWP 8645)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    28 Thread 0x7f83e7e7f700 (LWP 8646)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    27 Thread 0x7f83d7c5a700 (LWP 8647)  0x00007f83ea6db09d in waitpid
>> () from /lib64/libpthread.so.0
>>    26 Thread 0x7f83d7c53700 (LWP 8648)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    25 Thread 0x7f83d7252700 (LWP 8649)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    24 Thread 0x7f83d6851700 (LWP 8650)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    23 Thread 0x7f83d5e50700 (LWP 8651)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    22 Thread 0x7f83d544f700 (LWP 8652)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    21 Thread 0x7f83d4a4e700 (LWP 8653)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    20 Thread 0x7f83d404d700 (LWP 8654)  0x00007f83ea20be7d in write ()
>> from /lib64/libc.so.6
>>    19 Thread 0x7f83d364c700 (LWP 8655)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    18 Thread 0x7f83d2c4b700 (LWP 8656)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    17 Thread 0x7f83d224a700 (LWP 8657)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    16 Thread 0x7f83d1849700 (LWP 8658)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    15 Thread 0x7f83d0e48700 (LWP 8659)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    14 Thread 0x7f83d0447700 (LWP 8660)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    13 Thread 0x7f83cfa46700 (LWP 8661)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    12 Thread 0x7f83cf045700 (LWP 8662)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    11 Thread 0x7f83ce644700 (LWP 8663)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    10 Thread 0x7f83cdc43700 (LWP 8664)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>    9 Thread 0x7f83cd242700 (LWP 8665)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    8 Thread 0x7f83cc841700 (LWP 8666)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    7 Thread 0x7f83cbe40700 (LWP 8667)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    6 Thread 0x7f83cb43f700 (LWP 8668)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    5 Thread 0x7f83caa3e700 (LWP 8669)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    4 Thread 0x7f83ca03d700 (LWP 8670)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    3 Thread 0x7f83c963c700 (LWP 8671)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>    2 Thread 0x7f83c8c3b700 (LWP 8672)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>> * 1 Thread 0x7f83eb3a8700 (LWP 8597)  0x00007f83ea211d03 in select ()
>> from /lib64/libc.so.6
>> (gdb)
>>
>>
>> (gdb) bt
>> #0  0x00007f83ea20be7d in write () from /lib64/libc.so.6
>> #1  0x00007f83ea1a2583 in _IO_new_file_write () from /lib64/libc.so.6
>> #2  0x00007f83ea1a3b35 in _IO_new_do_write () from /lib64/libc.so.6
>> #3  0x00007f83ea1a21fd in _IO_new_file_xsputn () from /lib64/libc.so.6
>> #4  0x00007f83ea17589d in vfprintf () from /lib64/libc.so.6
>> #5  0x00007f83ea18003a in printf () from /lib64/libc.so.6
>> #6  0x000000000058f0e8 in tcp_recv (desc=0x2c3d350, request_len=0) at
>> drivers/common/inet_drv.c:8976
>> #7  0x000000000058f63a in tcp_inet_input (data=0x2c3d350, event=<value
>> optimized out>) at drivers/common/inet_drv.c:9326
>> #8  tcp_inet_drv_input (data=0x2c3d350, event=<value optimized out>)
>> at drivers/common/inet_drv.c:9604
>> #9  0x00000000004b770f in erts_port_task_execute (runq=0x7f83e9d5d3c0,
>> curr_port_pp=0x7f83e8dc6e78) at beam/erl_port_task.c:851
>> #10 0x00000000004afd83 in schedule (p=<value optimized out>,
>> calls=<value optimized out>) at beam/erl_process.c:6533
>> #11 0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>> #12 0x00000000004b1279 in sched_thread_func (vesdp=0x7f83e8dc6dc0) at
>> beam/erl_process.c:4834
>> #13 0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266da0) at
>> pthread/ethread.c:106
>> #14 0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #15 0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> (gdb) bt
>> #0  0x00007f83ea6da54d in read () from /lib64/libpthread.so.0
>> #1  0x0000000000554b6e in signal_dispatcher_thread_func (unused=<value
>> optimized out>) at sys/unix/sys.c:2776
>> #2  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266c80) at
>> pthread/ethread.c:106
>> #3  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #4  0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> (gdb) bt
>> #0  0x00007f83ea215ae9 in syscall () from /lib64/libc.so.6
>> #1  0x00000000005bba35 in wait__ (e=0x2989390) at
>> pthread/ethr_event.c:92
>> #2  ethr_event_wait (e=0x2989390) at pthread/ethr_event.c:218
>> #3  0x00000000004ae5bd in erts_tse_wait (fcalls=<value optimized out>,
>> esdp=0x7f83e8e2c440, rq=0x7f83e9d5e7c0) at beam/erl_threads.h:2319
>> #4  scheduler_wait (fcalls=<value optimized out>, esdp=0x7f83e8e2c440,
>> rq=0x7f83e9d5e7c0) at beam/erl_process.c:2087
>> #5  0x00000000004afb94 in schedule (p=<value optimized out>,
>> calls=<value optimized out>) at beam/erl_process.c:6467
>> #6  0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>> #7  0x00000000004b1279 in sched_thread_func (vesdp=0x7f83e8e2c440) at
>> beam/erl_process.c:4834
>> #8  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266da0) at
>> pthread/ethread.c:106
>> #9  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #10 0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>>
>> (gdb) bt
>> #0  0x00007f83ea6db09d in waitpid () from /lib64/libpthread.so.0
>> #1  0x0000000000555a9f in child_waiter (unused=<value optimized out>)
>> at sys/unix/sys.c:2700
>> #2  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266d50) at
>> pthread/ethread.c:106
>> #3  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #4  0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>>
>> **** END UPDATE ****
>>
>>
>> I'm happy to provide any information I can, so please don't hesitate 
>> to ask.
>>
>> Thanks in advance!
>>
>> Kind Regards,
>>
>> Peter Membrey
>>
>
>




More information about the erlang-bugs mailing list