[erlang-questions] Death of Erlang VM
Brian Troutwine
goofyheadedpunk@REDACTED
Fri Sep 19 19:58:58 CEST 2008
Thank you all for your input.
> As Per Hedeland wrote this is a kernel bug, or possibly bad HW.
It must be a kernel bug. I moved the setup to another system entirely.
The trigger program, below, affects both systems similarly. For the
time being I've simply removed delayed_write from my application, as
it is rather difficult for me, at the moment, to upgrade my kernel
version. I will upgrade my Erlang to R12B4 in the next few days,
however.
> Without a test case it will be difficult to debug this, however.
>
> So, if you can, please try to distill your application down to
> something you can make public and which still triggers the problem.
Certainly. It took some time to track down, but the delayed_write
parameter to file:open/2 seems to be the culprit. The following code
will cause the problem on my current system.
-module(bang).
-export([main/0]).
main() ->
{ok, Fd} = file:open("/tmp/blowup.data", [append, delayed_write]),
loop(Fd, 0).
loop(Fd, Accum) ->
file:write(Fd, lists:flatten([integer_to_list(Accum) | "\n"])),
loop(Fd, Accum+1).
On Fri, Sep 19, 2008 at 3:30 AM, Mikael Pettersson <mikpe@REDACTED> wrote:
> Brian Troutwine writes:
> > Hello all,
> >
> > I've encountered a problem, but I'm not really sure what is the
> > matter. After running my erlang application for some time I was
> > noticed that it had died, though not exited or become a zombie. I run
> > it like so:
> >
> > $ erl +A 2 +K true -boot aule -config sys.config
> >
> > I run a Debian stable AMD64 machine. Here's some information on my environment:
> >
> > $ cat /proc/version
> > Linux version 2.6.24-19-xen (buildd@REDACTED) (gcc version 4.2.3 (Ubuntu
> > 4.2.3-2ubuntu7)) #1 SMP Wed Aug 20 21:08:51 UTC 2008
> > $ erl
> > Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:4]
> > [async-threads:0] [hipe] [kernel-poll:false]
> >
> > What appeared in my syslog follows. I am doing writes to disk with the
> > delayed_write flag set. Is there more information that would be
> > useful? Here's what appeared in my syslog:
> >
> > Sep 18 17:33:05 valinor kernel: [1134774.645406] Unable to handle
> > kernel paging request at ffff88001a289008 RIP:
> > Sep 18 17:33:05 valinor kernel: [1134774.645421] [<ffffffff80271fc6>]
> > iov_iter_advance+0x66/0x80
> > Sep 18 17:33:05 valinor kernel: [1134774.645445] PGD 57f5067 PUD
> > 57f6067 PMD 58c8067 PTE 0
> > Sep 18 17:33:05 valinor kernel: [1134774.645453] Oops: 0000 [1] SMP
> > Sep 18 17:33:05 valinor kernel: [1134774.645458] CPU 3
> > Sep 18 17:33:05 valinor kernel: [1134774.645462] Modules linked in:
> > ipv6 ext3 jbd mbcache evdev raid10 raid456 async_xor async_memcpy
> > async_tx xor raid1 raid0 multipath linear md_mo
> > d dm_mirror dm_snapshot dm_mod fuse loop 8250 serial_core
> > Sep 18 17:33:05 valinor kernel: [1134774.645492] Pid: 2249, comm:
> > beam.smp Not tainted 2.6.24-19-xen #1
> > Sep 18 17:33:05 valinor kernel: [1134774.645497] RIP:
> > e030:[<ffffffff80271fc6>] [<ffffffff80271fc6>]
> > iov_iter_advance+0x66/0x80
> > Sep 18 17:33:05 valinor kernel: [1134774.645504] RSP:
> > e02b:ffff8800050c5b10 EFLAGS: 00010246
> > Sep 18 17:33:05 valinor kernel: [1134774.645508] RAX: 0000000000000000
> > RBX: 0000000000000b2a RCX: 0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645513] RDX: 0000000000000000
> > RSI: 0000000000000b2a RDI: ffff8800050c5ba8
> > Sep 18 17:33:05 valinor kernel: [1134774.645518] RBP: 0000000000000b2a
> > R08: 0000000000000000 R09: 0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645523] R10: ffff88001a289000
> > R11: 0000000000000000 R12: 0000000000c09000
> > Sep 18 17:33:05 valinor kernel: [1134774.645527] R13: 0000000000001000
> > R14: ffff880018895220 R15: 0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645534] FS:
> > 00007f74138e0960(0000) GS:ffffffff805c6180(0000)
> > knlGS:0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645539] CS: e033 DS: 0000 ES: 0000
> > Sep 18 17:33:05 valinor kernel: [1134774.645542] DR0: 0000000000000000
> > DR1: 0000000000000000 DR2: 0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645548] DR3: 0000000000000000
> > DR6: 00000000ffff0ff0 DR7: 0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645553] Process beam.smp
> > (pid: 2249, threadinfo ffff8800050c4000, task ffff88001ebfa800)
> > Sep 18 17:33:05 valinor kernel: [1134774.645558] Stack:
> > ffffffff80273f9e ffff88001ecfef00 ffff880012938080 0000000000000000
> > Sep 18 17:33:05 valinor kernel: [1134774.645567] ffff8800050c5db8
> > 0000000000bfc427 ffff8800050c5d38 ffff88001ecfef00
> > Sep 18 17:33:05 valinor kernel: [1134774.645575] ffff880018895220
> > ffffffff880deac0 ffff880018895110 000000000000cbd9
> > Sep 18 17:33:05 valinor kernel: [1134774.645582] Call Trace:
> > Sep 18 17:33:05 valinor kernel: [1134774.645588] [<ffffffff80273f9e>]
> > generic_file_buffered_write+0x1de/0x6e0
> > Sep 18 17:33:05 valinor kernel: [1134774.645601] [<ffffffff880ba5ae>]
> > :jbd:journal_stop+0x13e/0x1d0
> > Sep 18 17:33:05 valinor kernel: [1134774.645608] [<ffffffff802746ef>]
> > __generic_file_aio_write_nolock+0x24f/0x400
> > Sep 18 17:33:05 valinor kernel: [1134774.645614] [<ffffffff80289fb4>]
> > find_extend_vma+0x24/0x80
> > Sep 18 17:33:05 valinor kernel: [1134774.645622] [<ffffffff802544e4>]
> > unqueue_me+0x54/0xa0
> > Sep 18 17:33:05 valinor kernel: [1134774.645628] [<ffffffff80274904>]
> > generic_file_aio_write+0x64/0xd0
> > Sep 18 17:33:05 valinor kernel: [1134774.645642] [<ffffffff880cb663>]
> > :ext3:ext3_file_write+0x23/0xc0
> > Sep 18 17:33:05 valinor kernel: [1134774.645650] [<ffffffff880cb640>]
> > :ext3:ext3_file_write+0x0/0xc0
> > Sep 18 17:33:05 valinor kernel: [1134774.645656] [<ffffffff8029d6bb>]
> > do_sync_readv_writev+0xcb/0x110
> > Sep 18 17:33:05 valinor kernel: [1134774.645663] [<ffffffff80254c5d>]
> > futex_wake+0xcd/0xf0
> > Sep 18 17:33:05 valinor kernel: [1134774.645668] [<ffffffff8024cc20>]
> > autoremove_wake_function+0x0/0x30
> > Sep 18 17:33:05 valinor kernel: [1134774.645675] [<ffffffff80255bc4>]
> > do_futex+0x134/0xc30
> > Sep 18 17:33:05 valinor kernel: [1134774.645680] [<ffffffff8029a30c>]
> > __kmalloc+0x13c/0x160
> > Sep 18 17:33:05 valinor kernel: [1134774.645686] [<ffffffff8029de5d>]
> > do_readv_writev+0xfd/0x230
> > Sep 18 17:33:05 valinor kernel: [1134774.645693] [<ffffffff80471d07>]
> > error_exit+0x0/0x79
> > Sep 18 17:33:05 valinor kernel: [1134774.645700] [<ffffffff8029e4d3>]
> > sys_writev+0x53/0xc0
> > Sep 18 17:33:05 valinor kernel: [1134774.645706] [<ffffffff8020c698>]
> > system_call+0x68/0x6d
> > Sep 18 17:33:05 valinor kernel: [1134774.645711] [<ffffffff8020c630>]
> > system_call+0x0/0x6d
> > Sep 18 17:33:05 valinor kernel: [1134774.645716]
> > Sep 18 17:33:05 valinor kernel: [1134774.645718]
> > Sep 18 17:33:05 valinor kernel: [1134774.645718] Code: 49 8b 52 08 49
> > 89 d3 eb c4 4c 89 17 4c 89 4f 10 eb 99 0f 1f
> > Sep 18 17:33:05 valinor kernel: [1134774.645739] RIP
> > [<ffffffff80271fc6>] iov_iter_advance+0x66/0x80
> > Sep 18 17:33:05 valinor kernel: [1134774.645745] RSP <ffff8800050c5b10>
> > Sep 18 17:33:05 valinor kernel: [1134774.645749] CR2: ffff88001a289008
> > Sep 18 17:33:05 valinor kernel: [1134774.645762] ---[ end trace
> > a91a752e8ec506f8 ]---
>
> As Per Hedeland wrote this is a kernel bug, or possibly bad HW.
>
> You should first check if this problem still appears with current
> kernels (2.6.26 or newer), and if it does, if it also appears on
> other machines (to rule out bad HW). If the problem persists,
> then it needs to be fixed.
>
> Without a test case it will be difficult to debug this, however.
>
> So, if you can, please try to distill your application down to
> something you can make public and which still triggers the problem.
>
--
Brian
More information about the erlang-questions
mailing list