[erlang-questions] Death of Erlang VM

Brian Troutwine goofyheadedpunk@REDACTED
Fri Sep 19 19:58:58 CEST 2008


Thank you all for your input.

> As Per Hedeland wrote this is a kernel bug, or possibly bad HW.

It must be a kernel bug. I moved the setup to another system entirely.
The trigger program, below, affects both systems similarly. For the
time being I've simply removed delayed_write from my application, as
it is rather difficult for me, at the moment, to upgrade my kernel
version. I will upgrade my Erlang to R12B4 in the next few days,
however.

> Without a test case it will be difficult to debug this, however.
>
> So, if you can, please try to distill your application down to
> something you can make public and which still triggers the problem.

Certainly. It took some time to track down, but the delayed_write
parameter to file:open/2 seems to be the culprit. The following code
will cause the problem on my current system.

-module(bang).
-export([main/0]).

main() ->
   {ok, Fd} = file:open("/tmp/blowup.data", [append, delayed_write]),
   loop(Fd, 0).

loop(Fd, Accum) ->
   file:write(Fd, lists:flatten([integer_to_list(Accum) | "\n"])),
   loop(Fd, Accum+1).

On Fri, Sep 19, 2008 at 3:30 AM, Mikael Pettersson <mikpe@REDACTED> wrote:
> Brian Troutwine writes:
>  > Hello all,
>  >
>  > I've encountered a problem, but I'm not really sure what is the
>  > matter. After running my erlang application for some time I was
>  > noticed that it had died, though not exited or become a zombie. I run
>  > it like so:
>  >
>  > $ erl +A 2 +K true -boot aule -config sys.config
>  >
>  > I run a Debian stable AMD64 machine. Here's some information on my environment:
>  >
>  > $ cat /proc/version
>  > Linux version 2.6.24-19-xen (buildd@REDACTED) (gcc version 4.2.3 (Ubuntu
>  > 4.2.3-2ubuntu7)) #1 SMP Wed Aug 20 21:08:51 UTC 2008
>  > $ erl
>  > Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:4]
>  > [async-threads:0] [hipe] [kernel-poll:false]
>  >
>  > What appeared in my syslog follows. I am doing writes to disk with the
>  > delayed_write flag set. Is there more information that would be
>  > useful? Here's what appeared in my syslog:
>  >
>  > Sep 18 17:33:05 valinor kernel: [1134774.645406] Unable to handle
>  > kernel paging request at ffff88001a289008 RIP:
>  > Sep 18 17:33:05 valinor kernel: [1134774.645421]  [<ffffffff80271fc6>]
>  > iov_iter_advance+0x66/0x80
>  > Sep 18 17:33:05 valinor kernel: [1134774.645445] PGD 57f5067 PUD
>  > 57f6067 PMD 58c8067 PTE 0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645453] Oops: 0000 [1] SMP
>  > Sep 18 17:33:05 valinor kernel: [1134774.645458] CPU 3
>  > Sep 18 17:33:05 valinor kernel: [1134774.645462] Modules linked in:
>  > ipv6 ext3 jbd mbcache evdev raid10 raid456 async_xor async_memcpy
>  > async_tx xor raid1 raid0 multipath linear md_mo
>  > d dm_mirror dm_snapshot dm_mod fuse loop 8250 serial_core
>  > Sep 18 17:33:05 valinor kernel: [1134774.645492] Pid: 2249, comm:
>  > beam.smp Not tainted 2.6.24-19-xen #1
>  > Sep 18 17:33:05 valinor kernel: [1134774.645497] RIP:
>  > e030:[<ffffffff80271fc6>]  [<ffffffff80271fc6>]
>  > iov_iter_advance+0x66/0x80
>  > Sep 18 17:33:05 valinor kernel: [1134774.645504] RSP:
>  > e02b:ffff8800050c5b10  EFLAGS: 00010246
>  > Sep 18 17:33:05 valinor kernel: [1134774.645508] RAX: 0000000000000000
>  > RBX: 0000000000000b2a RCX: 0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645513] RDX: 0000000000000000
>  > RSI: 0000000000000b2a RDI: ffff8800050c5ba8
>  > Sep 18 17:33:05 valinor kernel: [1134774.645518] RBP: 0000000000000b2a
>  > R08: 0000000000000000 R09: 0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645523] R10: ffff88001a289000
>  > R11: 0000000000000000 R12: 0000000000c09000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645527] R13: 0000000000001000
>  > R14: ffff880018895220 R15: 0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645534] FS:
>  > 00007f74138e0960(0000) GS:ffffffff805c6180(0000)
>  > knlGS:0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645539] CS:  e033 DS: 0000 ES: 0000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645542] DR0: 0000000000000000
>  > DR1: 0000000000000000 DR2: 0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645548] DR3: 0000000000000000
>  > DR6: 00000000ffff0ff0 DR7: 0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645553] Process beam.smp
>  > (pid: 2249, threadinfo ffff8800050c4000, task ffff88001ebfa800)
>  > Sep 18 17:33:05 valinor kernel: [1134774.645558] Stack:
>  > ffffffff80273f9e ffff88001ecfef00 ffff880012938080 0000000000000000
>  > Sep 18 17:33:05 valinor kernel: [1134774.645567]  ffff8800050c5db8
>  > 0000000000bfc427 ffff8800050c5d38 ffff88001ecfef00
>  > Sep 18 17:33:05 valinor kernel: [1134774.645575]  ffff880018895220
>  > ffffffff880deac0 ffff880018895110 000000000000cbd9
>  > Sep 18 17:33:05 valinor kernel: [1134774.645582] Call Trace:
>  > Sep 18 17:33:05 valinor kernel: [1134774.645588]  [<ffffffff80273f9e>]
>  > generic_file_buffered_write+0x1de/0x6e0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645601]  [<ffffffff880ba5ae>]
>  > :jbd:journal_stop+0x13e/0x1d0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645608]  [<ffffffff802746ef>]
>  > __generic_file_aio_write_nolock+0x24f/0x400
>  > Sep 18 17:33:05 valinor kernel: [1134774.645614]  [<ffffffff80289fb4>]
>  > find_extend_vma+0x24/0x80
>  > Sep 18 17:33:05 valinor kernel: [1134774.645622]  [<ffffffff802544e4>]
>  > unqueue_me+0x54/0xa0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645628]  [<ffffffff80274904>]
>  > generic_file_aio_write+0x64/0xd0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645642]  [<ffffffff880cb663>]
>  > :ext3:ext3_file_write+0x23/0xc0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645650]  [<ffffffff880cb640>]
>  > :ext3:ext3_file_write+0x0/0xc0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645656]  [<ffffffff8029d6bb>]
>  > do_sync_readv_writev+0xcb/0x110
>  > Sep 18 17:33:05 valinor kernel: [1134774.645663]  [<ffffffff80254c5d>]
>  > futex_wake+0xcd/0xf0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645668]  [<ffffffff8024cc20>]
>  > autoremove_wake_function+0x0/0x30
>  > Sep 18 17:33:05 valinor kernel: [1134774.645675]  [<ffffffff80255bc4>]
>  > do_futex+0x134/0xc30
>  > Sep 18 17:33:05 valinor kernel: [1134774.645680]  [<ffffffff8029a30c>]
>  > __kmalloc+0x13c/0x160
>  > Sep 18 17:33:05 valinor kernel: [1134774.645686]  [<ffffffff8029de5d>]
>  > do_readv_writev+0xfd/0x230
>  > Sep 18 17:33:05 valinor kernel: [1134774.645693]  [<ffffffff80471d07>]
>  > error_exit+0x0/0x79
>  > Sep 18 17:33:05 valinor kernel: [1134774.645700]  [<ffffffff8029e4d3>]
>  > sys_writev+0x53/0xc0
>  > Sep 18 17:33:05 valinor kernel: [1134774.645706]  [<ffffffff8020c698>]
>  > system_call+0x68/0x6d
>  > Sep 18 17:33:05 valinor kernel: [1134774.645711]  [<ffffffff8020c630>]
>  > system_call+0x0/0x6d
>  > Sep 18 17:33:05 valinor kernel: [1134774.645716]
>  > Sep 18 17:33:05 valinor kernel: [1134774.645718]
>  > Sep 18 17:33:05 valinor kernel: [1134774.645718] Code: 49 8b 52 08 49
>  > 89 d3 eb c4 4c 89 17 4c 89 4f 10 eb 99 0f 1f
>  > Sep 18 17:33:05 valinor kernel: [1134774.645739] RIP
>  > [<ffffffff80271fc6>] iov_iter_advance+0x66/0x80
>  > Sep 18 17:33:05 valinor kernel: [1134774.645745]  RSP <ffff8800050c5b10>
>  > Sep 18 17:33:05 valinor kernel: [1134774.645749] CR2: ffff88001a289008
>  > Sep 18 17:33:05 valinor kernel: [1134774.645762] ---[ end trace
>  > a91a752e8ec506f8 ]---
>
> As Per Hedeland wrote this is a kernel bug, or possibly bad HW.
>
> You should first check if this problem still appears with current
> kernels (2.6.26 or newer), and if it does, if it also appears on
> other machines (to rule out bad HW). If the problem persists,
> then it needs to be fixed.
>
> Without a test case it will be difficult to debug this, however.
>
> So, if you can, please try to distill your application down to
> something you can make public and which still triggers the problem.
>



-- 
Brian



More information about the erlang-questions mailing list