[erlang-questions] Death of Erlang VM

Mikael Pettersson mikpe@REDACTED
Fri Sep 19 12:30:00 CEST 2008


Brian Troutwine writes:
 > Hello all,
 > 
 > I've encountered a problem, but I'm not really sure what the
 > matter is. After running my Erlang application for some time I
 > noticed that it had died, though it had not exited or become a
 > zombie. I run it like so:
 > 
 > $ erl +A 2 +K true -boot aule -config sys.config
 > 
 > I run a Debian stable AMD64 machine. Here's some information on my environment:
 > 
 > $ cat /proc/version
 > Linux version 2.6.24-19-xen (buildd@REDACTED) (gcc version 4.2.3 (Ubuntu
 > 4.2.3-2ubuntu7)) #1 SMP Wed Aug 20 21:08:51 UTC 2008
 > $ erl
 > Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:4]
 > [async-threads:0] [hipe] [kernel-poll:false]
 > 
 > What appeared in my syslog follows. I am doing writes to disk with the
 > delayed_write flag set. Is there more information that would be
 > useful? Here's what appeared in my syslog:
 > 
 > Sep 18 17:33:05 valinor kernel: [1134774.645406] Unable to handle
 > kernel paging request at ffff88001a289008 RIP:
 > Sep 18 17:33:05 valinor kernel: [1134774.645421]  [<ffffffff80271fc6>]
 > iov_iter_advance+0x66/0x80
 > Sep 18 17:33:05 valinor kernel: [1134774.645445] PGD 57f5067 PUD
 > 57f6067 PMD 58c8067 PTE 0
 > Sep 18 17:33:05 valinor kernel: [1134774.645453] Oops: 0000 [1] SMP
 > Sep 18 17:33:05 valinor kernel: [1134774.645458] CPU 3
 > Sep 18 17:33:05 valinor kernel: [1134774.645462] Modules linked in:
 > ipv6 ext3 jbd mbcache evdev raid10 raid456 async_xor async_memcpy
 > async_tx xor raid1 raid0 multipath linear md_mod
 > dm_mirror dm_snapshot dm_mod fuse loop 8250 serial_core
 > Sep 18 17:33:05 valinor kernel: [1134774.645492] Pid: 2249, comm:
 > beam.smp Not tainted 2.6.24-19-xen #1
 > Sep 18 17:33:05 valinor kernel: [1134774.645497] RIP:
 > e030:[<ffffffff80271fc6>]  [<ffffffff80271fc6>]
 > iov_iter_advance+0x66/0x80
 > Sep 18 17:33:05 valinor kernel: [1134774.645504] RSP:
 > e02b:ffff8800050c5b10  EFLAGS: 00010246
 > Sep 18 17:33:05 valinor kernel: [1134774.645508] RAX: 0000000000000000
 > RBX: 0000000000000b2a RCX: 0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645513] RDX: 0000000000000000
 > RSI: 0000000000000b2a RDI: ffff8800050c5ba8
 > Sep 18 17:33:05 valinor kernel: [1134774.645518] RBP: 0000000000000b2a
 > R08: 0000000000000000 R09: 0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645523] R10: ffff88001a289000
 > R11: 0000000000000000 R12: 0000000000c09000
 > Sep 18 17:33:05 valinor kernel: [1134774.645527] R13: 0000000000001000
 > R14: ffff880018895220 R15: 0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645534] FS:
 > 00007f74138e0960(0000) GS:ffffffff805c6180(0000)
 > knlGS:0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645539] CS:  e033 DS: 0000 ES: 0000
 > Sep 18 17:33:05 valinor kernel: [1134774.645542] DR0: 0000000000000000
 > DR1: 0000000000000000 DR2: 0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645548] DR3: 0000000000000000
 > DR6: 00000000ffff0ff0 DR7: 0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645553] Process beam.smp
 > (pid: 2249, threadinfo ffff8800050c4000, task ffff88001ebfa800)
 > Sep 18 17:33:05 valinor kernel: [1134774.645558] Stack:
 > ffffffff80273f9e ffff88001ecfef00 ffff880012938080 0000000000000000
 > Sep 18 17:33:05 valinor kernel: [1134774.645567]  ffff8800050c5db8
 > 0000000000bfc427 ffff8800050c5d38 ffff88001ecfef00
 > Sep 18 17:33:05 valinor kernel: [1134774.645575]  ffff880018895220
 > ffffffff880deac0 ffff880018895110 000000000000cbd9
 > Sep 18 17:33:05 valinor kernel: [1134774.645582] Call Trace:
 > Sep 18 17:33:05 valinor kernel: [1134774.645588]  [<ffffffff80273f9e>]
 > generic_file_buffered_write+0x1de/0x6e0
 > Sep 18 17:33:05 valinor kernel: [1134774.645601]  [<ffffffff880ba5ae>]
 > :jbd:journal_stop+0x13e/0x1d0
 > Sep 18 17:33:05 valinor kernel: [1134774.645608]  [<ffffffff802746ef>]
 > __generic_file_aio_write_nolock+0x24f/0x400
 > Sep 18 17:33:05 valinor kernel: [1134774.645614]  [<ffffffff80289fb4>]
 > find_extend_vma+0x24/0x80
 > Sep 18 17:33:05 valinor kernel: [1134774.645622]  [<ffffffff802544e4>]
 > unqueue_me+0x54/0xa0
 > Sep 18 17:33:05 valinor kernel: [1134774.645628]  [<ffffffff80274904>]
 > generic_file_aio_write+0x64/0xd0
 > Sep 18 17:33:05 valinor kernel: [1134774.645642]  [<ffffffff880cb663>]
 > :ext3:ext3_file_write+0x23/0xc0
 > Sep 18 17:33:05 valinor kernel: [1134774.645650]  [<ffffffff880cb640>]
 > :ext3:ext3_file_write+0x0/0xc0
 > Sep 18 17:33:05 valinor kernel: [1134774.645656]  [<ffffffff8029d6bb>]
 > do_sync_readv_writev+0xcb/0x110
 > Sep 18 17:33:05 valinor kernel: [1134774.645663]  [<ffffffff80254c5d>]
 > futex_wake+0xcd/0xf0
 > Sep 18 17:33:05 valinor kernel: [1134774.645668]  [<ffffffff8024cc20>]
 > autoremove_wake_function+0x0/0x30
 > Sep 18 17:33:05 valinor kernel: [1134774.645675]  [<ffffffff80255bc4>]
 > do_futex+0x134/0xc30
 > Sep 18 17:33:05 valinor kernel: [1134774.645680]  [<ffffffff8029a30c>]
 > __kmalloc+0x13c/0x160
 > Sep 18 17:33:05 valinor kernel: [1134774.645686]  [<ffffffff8029de5d>]
 > do_readv_writev+0xfd/0x230
 > Sep 18 17:33:05 valinor kernel: [1134774.645693]  [<ffffffff80471d07>]
 > error_exit+0x0/0x79
 > Sep 18 17:33:05 valinor kernel: [1134774.645700]  [<ffffffff8029e4d3>]
 > sys_writev+0x53/0xc0
 > Sep 18 17:33:05 valinor kernel: [1134774.645706]  [<ffffffff8020c698>]
 > system_call+0x68/0x6d
 > Sep 18 17:33:05 valinor kernel: [1134774.645711]  [<ffffffff8020c630>]
 > system_call+0x0/0x6d
 > Sep 18 17:33:05 valinor kernel: [1134774.645716]
 > Sep 18 17:33:05 valinor kernel: [1134774.645718]
 > Sep 18 17:33:05 valinor kernel: [1134774.645718] Code: 49 8b 52 08 49
 > 89 d3 eb c4 4c 89 17 4c 89 4f 10 eb 99 0f 1f
 > Sep 18 17:33:05 valinor kernel: [1134774.645739] RIP
 > [<ffffffff80271fc6>] iov_iter_advance+0x66/0x80
 > Sep 18 17:33:05 valinor kernel: [1134774.645745]  RSP <ffff8800050c5b10>
 > Sep 18 17:33:05 valinor kernel: [1134774.645749] CR2: ffff88001a289008
 > Sep 18 17:33:05 valinor kernel: [1134774.645762] ---[ end trace
 > a91a752e8ec506f8 ]---

As Per Hedeland wrote, this is a kernel bug, or possibly bad HW.

You should first check whether this problem still appears with
current kernels (2.6.26 or newer), and if it does, whether it also
appears on other machines (to rule out bad HW). If the problem
persists, then it needs to be fixed in the kernel.

Without a test case it will be difficult to debug this, however.

So, if you can, please try to distill your application down to
something you can make public and which still triggers the problem.
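
As a starting point, a distilled test case might look something like
the sketch below. It is only a guess at the relevant workload: the
oops is in the sys_writev path on ext3, and you mention delayed_write,
so the sketch has several processes append fixed-size chunks to files
opened with delayed_write. The module name dw_repro, the file names,
and the 4 KB chunk size are made up for illustration and are not taken
from your application.

```erlang
-module(dw_repro).
-export([run/2]).

%% Hypothetical stress test (not the original application):
%% start Writers processes, each appending Count chunks of 4096
%% bytes to its own file opened with delayed_write.
run(Writers, Count) ->
    Parent = self(),
    Pids = [spawn_link(fun() -> writer(Parent, N, Count) end)
            || N <- lists:seq(1, Writers)],
    %% Wait until every writer reports completion.
    [receive {done, Pid} -> ok end || Pid <- Pids],
    ok.

writer(Parent, N, Count) ->
    %% Assumed file name; files are created in the current directory.
    Name = "dw_repro_" ++ integer_to_list(N) ++ ".dat",
    {ok, Fd} = file:open(Name, [write, raw, binary, delayed_write]),
    Chunk = list_to_binary(lists:duplicate(4096, 0)),
    loop(Fd, Chunk, Count),
    %% Closing the file flushes any data still buffered by delayed_write.
    ok = file:close(Fd),
    Parent ! {done, self()}.

loop(_Fd, _Chunk, 0) ->
    ok;
loop(Fd, Chunk, K) ->
    ok = file:write(Fd, Chunk),
    loop(Fd, Chunk, K - 1).
```

Started with the same emulator flags as the failing system (for
example "erl +A 2 +K true"), compiled with c(dw_repro), and run as
dw_repro:run(4, 100000) on the ext3 filesystem in question, this may
or may not trigger the oops. If it does not, the write pattern
probably needs to be brought closer to what the real application does.
The data files are left behind in the current directory.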


