[erlang-bugs] VM locks up on write to socket (and now it seems to file too)

Erik Søe Sørensen ess@REDACTED
Fri Nov 23 15:22:18 CET 2012


The strace is an interesting clue.
If I am reading this right:

	writev(15, [{"", 2158022464}], 1)       = 0

(the manual page for writev appears to support my reading), then this is 
a request to write a data chunk a tad larger than 2GB.
(2158022464 = 0x80A0CF40).

Is this intended? Does it sound right that a blob of this size should be 
sent?

I imagine that it might give rise to problems if the length in some of 
the involved layers were interpreted as a 32-bit *signed* integer.
And as far as I can tell, it is normal for ssize_t to be signed.
Your system appears to be 64bit, though, judging from the "/lib64" path. 
Still, some lower layers might have problems.

Going forward,
a) is a write of this size expected?
b) How does a plain C program behave if you call writev in that fashion?
Answers to those questions could help isolating the problem.

/Erik

On 23-11-2012 01:49, Loïc Hoguin wrote:
> Sending this on behalf of someone who didn't manage to get the email
> sent to this list after 2 attempts. If someone can check if he's hold up
> or something that'd be great.
>
> Anyway he has a big issue so I hope I can relay the conversation reliably.
>
> Thanks!
>
> On 11/23/2012 01:45 AM, Peter Membrey wrote:
>> From: Peter Membrey <peter@REDACTED>
>> Date: 22 November 2012 19:02
>> Subject: VM locks up on write to socket (and now it seems to file too)
>> To: erlang-bugs@REDACTED
>>
>>
>> Hi guys,
>>
>> I wrote a simple database application called CakeDB
>> (https://github.com/pmembrey/cakedb) that basically spends its time
>> reading and writing files and sockets. There's very little in the way
>> of complex logic. It is running on CentOS 6.3 with all the updates
>> applied. I hit this problem on R15B02 so I rolled back to R15B01 but
>> the issue remained. Erlang was built from source.
>>
>> The machine has two Intel X5690 CPUs giving 12 cores plus HT. I've
>> tried various arguments for the VM but so far nothing has prevented
>> the problem. At the moment I'm using:
>>
>> +K
>> +A 6
>> +sbt tnnps
>>
>> The issue I'm seeing is that one of the scheduler threads will hit
>> 100% cpu usage and the entire VM will become unresponsive. When this
>> happens, I am not able to connect via the console with attach and
>> entop is also unable to connect. I can still establish TCP connections
>> to the application, but I never receive a response. A standard kill
>> signal will cause the VM to shut down (it doesn't need -9).
>>
>> Due to the pedigree of the VM I am quite willing to accept that I've
>> made a fundamental mistake in my code. I am pretty sure that the way I
>> am doing the file IO could result in some race conditions. However, my
>> poor code aside, from what I understand, I still shouldn't be able to
>> crash / deadlock the VM like this.
>>
>> The issue doesn't seem to be caused by load. The app can fail when
>> it's very busy, but also when it is practically idle. I haven't been
>> able to find a trigger or any other explanation for the failure.
>>
>> The thread maxing out the CPU is attempting to write data to the socket:
>>
>> (gdb) bt
>> #0  0x00007f9882ab6377 in writev () from /lib64/libc.so.6
>> #1  0x000000000058a81f in tcp_inet_output (data=0x2407570,
>> event=<value optimized out>) at drivers/common/inet_drv.c:9681
>> #2  tcp_inet_drv_output (data=0x2407570, event=<value optimized out>)
>> at drivers/common/inet_drv.c:9601
>> #3  0x00000000004b773f in erts_port_task_execute (runq=0x7f98826019c0,
>> curr_port_pp=0x7f9881639338)  at beam/erl_port_task.c:858
>> #4  0x00000000004afd83 in schedule (p=<value optimized out>,
>> calls=<value optimized out>) at beam/erl_process.c:6533
>> #5  0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>> #6  0x00000000004b1279 in sched_thread_func (vesdp=0x7f9881639280) at
>> beam/erl_process.c:4834
>> #7  0x00000000005ba726 in thr_wrapper (vtwd=0x7fff6cfe2300) at
>> pthread/ethread.c:106
>> #8  0x00007f9882f78851 in start_thread () from /lib64/libpthread.so.0
>> #9  0x00007f9882abe11d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> I then tried running strace on that thread and got (indefinitely):
>>
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> writev(15, [{"", 2158022464}], 1)       = 0
>> ...
>>
>>   From what I can tell, it's trying to write data to a socket, which is
>> succeeding, but writing 0 bytes. From the earlier definitions in the
>> source file, an error condition would be signified by a negative
>> number. Any other result is the number of bytes written, in this case
>> 0. I'm not sure if this is desired behaviour or not. I've tried
>> killing the application on the other end of the socket, but it has no
>> effect on the VM.
>>
>> I have enabled debugging for the inet code, so hopefully this will
>> give a little more insight. I am currently trying to reproduce the
>> condition, but as I really have no idea what causes it, it's pretty
>> much a case of wait and see.
>>
>>
>> **** UPDATE ****
>>
>> I managed to lock up the VM again, but this time it was caused by file IO,
>> probably from the debugging statements. Although it worked fine for some time
>> the last entry in the file was cut off.
>>
>>   From GDB:
>>
>> (gdb) info threads
>>     53 Thread 0x7f83e988b700 (LWP 8621)  0x00007f83ea6da54d in read ()
>> from /lib64/libpthread.so.0
>>     52 Thread 0x7f83e8c8f700 (LWP 8622)  0x00007f83ea6d743c in
>> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>     51 Thread 0x7f83e818d700 (LWP 8623)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     50 Thread 0x7f83e816b700 (LWP 8624)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     49 Thread 0x7f83e8149700 (LWP 8625)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     48 Thread 0x7f83e8127700 (LWP 8626)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     47 Thread 0x7f83e8105700 (LWP 8627)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     46 Thread 0x7f83e80e3700 (LWP 8628)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     45 Thread 0x7f83e80c1700 (LWP 8629)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     44 Thread 0x7f83e809f700 (LWP 8630)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     43 Thread 0x7f83e807d700 (LWP 8631)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     42 Thread 0x7f83e805b700 (LWP 8632)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     41 Thread 0x7f83e8039700 (LWP 8633)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     40 Thread 0x7f83e8017700 (LWP 8634)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     39 Thread 0x7f83e7ff5700 (LWP 8635)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     38 Thread 0x7f83e7fd3700 (LWP 8636)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     37 Thread 0x7f83e7fb1700 (LWP 8637)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     36 Thread 0x7f83e7f8f700 (LWP 8638)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     35 Thread 0x7f83e7f6d700 (LWP 8639)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     34 Thread 0x7f83e7f4b700 (LWP 8640)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     33 Thread 0x7f83e7f29700 (LWP 8641)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     32 Thread 0x7f83e7f07700 (LWP 8642)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     31 Thread 0x7f83e7ee5700 (LWP 8643)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     30 Thread 0x7f83e7ec3700 (LWP 8644)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     29 Thread 0x7f83e7ea1700 (LWP 8645)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     28 Thread 0x7f83e7e7f700 (LWP 8646)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     27 Thread 0x7f83d7c5a700 (LWP 8647)  0x00007f83ea6db09d in waitpid
>> () from /lib64/libpthread.so.0
>>     26 Thread 0x7f83d7c53700 (LWP 8648)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     25 Thread 0x7f83d7252700 (LWP 8649)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     24 Thread 0x7f83d6851700 (LWP 8650)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     23 Thread 0x7f83d5e50700 (LWP 8651)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     22 Thread 0x7f83d544f700 (LWP 8652)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     21 Thread 0x7f83d4a4e700 (LWP 8653)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     20 Thread 0x7f83d404d700 (LWP 8654)  0x00007f83ea20be7d in write ()
>> from /lib64/libc.so.6
>>     19 Thread 0x7f83d364c700 (LWP 8655)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     18 Thread 0x7f83d2c4b700 (LWP 8656)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     17 Thread 0x7f83d224a700 (LWP 8657)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     16 Thread 0x7f83d1849700 (LWP 8658)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     15 Thread 0x7f83d0e48700 (LWP 8659)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     14 Thread 0x7f83d0447700 (LWP 8660)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     13 Thread 0x7f83cfa46700 (LWP 8661)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     12 Thread 0x7f83cf045700 (LWP 8662)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     11 Thread 0x7f83ce644700 (LWP 8663)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     10 Thread 0x7f83cdc43700 (LWP 8664)  0x00007f83ea215ae9 in syscall
>> () from /lib64/libc.so.6
>>     9 Thread 0x7f83cd242700 (LWP 8665)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     8 Thread 0x7f83cc841700 (LWP 8666)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     7 Thread 0x7f83cbe40700 (LWP 8667)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     6 Thread 0x7f83cb43f700 (LWP 8668)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     5 Thread 0x7f83caa3e700 (LWP 8669)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     4 Thread 0x7f83ca03d700 (LWP 8670)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     3 Thread 0x7f83c963c700 (LWP 8671)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>>     2 Thread 0x7f83c8c3b700 (LWP 8672)  0x00007f83ea215ae9 in syscall ()
>> from /lib64/libc.so.6
>> * 1 Thread 0x7f83eb3a8700 (LWP 8597)  0x00007f83ea211d03 in select ()
>> from /lib64/libc.so.6
>> (gdb)
>>
>>
>> (gdb) bt
>> #0  0x00007f83ea20be7d in write () from /lib64/libc.so.6
>> #1  0x00007f83ea1a2583 in _IO_new_file_write () from /lib64/libc.so.6
>> #2  0x00007f83ea1a3b35 in _IO_new_do_write () from /lib64/libc.so.6
>> #3  0x00007f83ea1a21fd in _IO_new_file_xsputn () from /lib64/libc.so.6
>> #4  0x00007f83ea17589d in vfprintf () from /lib64/libc.so.6
>> #5  0x00007f83ea18003a in printf () from /lib64/libc.so.6
>> #6  0x000000000058f0e8 in tcp_recv (desc=0x2c3d350, request_len=0) at
>> drivers/common/inet_drv.c:8976
>> #7  0x000000000058f63a in tcp_inet_input (data=0x2c3d350, event=<value
>> optimized out>) at drivers/common/inet_drv.c:9326
>> #8  tcp_inet_drv_input (data=0x2c3d350, event=<value optimized out>)
>> at drivers/common/inet_drv.c:9604
>> #9  0x00000000004b770f in erts_port_task_execute (runq=0x7f83e9d5d3c0,
>> curr_port_pp=0x7f83e8dc6e78) at beam/erl_port_task.c:851
>> #10 0x00000000004afd83 in schedule (p=<value optimized out>,
>> calls=<value optimized out>) at beam/erl_process.c:6533
>> #11 0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>> #12 0x00000000004b1279 in sched_thread_func (vesdp=0x7f83e8dc6dc0) at
>> beam/erl_process.c:4834
>> #13 0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266da0) at
>> pthread/ethread.c:106
>> #14 0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #15 0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> (gdb) bt
>> #0  0x00007f83ea6da54d in read () from /lib64/libpthread.so.0
>> #1  0x0000000000554b6e in signal_dispatcher_thread_func (unused=<value
>> optimized out>) at sys/unix/sys.c:2776
>> #2  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266c80) at
>> pthread/ethread.c:106
>> #3  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #4  0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>> (gdb) bt
>> #0  0x00007f83ea215ae9 in syscall () from /lib64/libc.so.6
>> #1  0x00000000005bba35 in wait__ (e=0x2989390) at
>> pthread/ethr_event.c:92
>> #2  ethr_event_wait (e=0x2989390) at pthread/ethr_event.c:218
>> #3  0x00000000004ae5bd in erts_tse_wait (fcalls=<value optimized out>,
>> esdp=0x7f83e8e2c440, rq=0x7f83e9d5e7c0) at beam/erl_threads.h:2319
>> #4  scheduler_wait (fcalls=<value optimized out>, esdp=0x7f83e8e2c440,
>> rq=0x7f83e9d5e7c0) at beam/erl_process.c:2087
>> #5  0x00000000004afb94 in schedule (p=<value optimized out>,
>> calls=<value optimized out>) at beam/erl_process.c:6467
>> #6  0x0000000000539ca2 in process_main () at beam/beam_emu.c:1268
>> #7  0x00000000004b1279 in sched_thread_func (vesdp=0x7f83e8e2c440) at
>> beam/erl_process.c:4834
>> #8  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266da0) at
>> pthread/ethread.c:106
>> #9  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #10 0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>>
>> (gdb) bt
>> #0  0x00007f83ea6db09d in waitpid () from /lib64/libpthread.so.0
>> #1  0x0000000000555a9f in child_waiter (unused=<value optimized out>)
>> at sys/unix/sys.c:2700
>> #2  0x00000000005bb3e6 in thr_wrapper (vtwd=0x7fffe8266d50) at
>> pthread/ethread.c:106
>> #3  0x00007f83ea6d3851 in start_thread () from /lib64/libpthread.so.0
>> #4  0x00007f83ea21911d in clone () from /lib64/libc.so.6
>> (gdb)
>>
>>
>> **** END UPDATE ****
>>
>>
>> I'm happy to provide any information I can, so please don't hesitate to ask.
>>
>> Thanks in advance!
>>
>> Kind Regards,
>>
>> Peter Membrey
>>
>
> --
> Loïc Hoguin
> Erlang Cowboy
> Nine Nines
> http://ninenines.eu
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs


-- 
Mobile: + 45 26 36 17 55 | Skype: eriksoesorensen | Twitter: @eriksoe
Trifork A/S  |  Margrethepladsen 4  |  DK-8000 Aarhus C | 
www.trifork.com <http://www.trifork.com/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-bugs/attachments/20121123/e320bdae/attachment.htm>


More information about the erlang-bugs mailing list