[erlang-questions] Erlang VM Crash and Heart doesn't restart

Fri Jan 25 18:55:11 CET 2013

Hi!

On 01/22/2013 04:38 PM, Eric Boyer wrote:
> Hi there,
>
> We're running R15B02 on Windows Server 2008 R2 on a mainly I/O-bound 
> application. Occasionally, when it gets busy processing hundreds of 
> files, it will crash with the following event log entry:
>
> Faulting application name: erl.exe, version: 0.0.0.0, time stamp: 0x504492cf
> Faulting module name: beam.smp.dll, version: 0.0.0.0, time stamp: 0x50449201
> Exception code: 0xc0000005
> Fault offset: 0x00000000000c75f2
> Faulting process id: 0x9d3c
> Faulting application start time: 0x01cdf82413c26aeb
>
> There is no erl_crash.dump generated and the heart process does not 
> restart the vm. Are there any ideas as to why this is happening or 
> what can be done so that heart can properly restart the vm? I see the 
> heart.exe process active with the correct pid for the erlang process 
> but it doesn't seem to work in this case.
>
Is the erl process still running? After a crash, Windows has a tendency 
to start a debugger or to wait while you decide whether you want to send 
a bug report to Microsoft or whatnot. If the process does not really 
stop, heart cannot detect that it's dead immediately (as it is not dead) 
and has to wait for the heartbeat timeout. After the timeout, however, it 
ought to try starting a new VM (after trying to kill the process, which 
ultimately fails if Windows has put the process in "Debugged" mode).

If you start the Erlang node in a Windows command shell, you should see 
what heart tries to do. It may be that the new node cannot be started 
because of a distribution name conflict, in which case your heart 
command could unregister the node name from epmd prior to restarting 
the node. To force an unregister, epmd has to be started with the flag 
-relaxed_command_check to begin with (so it has to be started manually 
before starting the Erlang node); then run epmd -stop <Nodename> before 
starting the new Erlang node. You should also try to run erlang:halt() 
in your Erlang node to verify that the heart command really works.
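
A minimal sketch of those checks from the Erlang shell, assuming the node 
was started with -heart and that epmd is running on the same machine. The 
calls to heart:get_cmd() and net_adm:names() are my addition, not part of 
the advice above; they are just a convenient way to see what heart and 
epmd currently hold before you halt the node.

```erlang
%% Run these one at a time in the shell of the node started with -heart.

%% The restart command taken from the environment (as in the original post).
os:getenv("HEART_COMMAND").

%% The temporary restart command set via heart:set_cmd/1, if any.
%% Returns {ok, Cmd}; Cmd is "" when no temporary command has been set.
heart:get_cmd().

%% Names currently registered with the local epmd, as {ok, [{Name, Port}]}.
%% If the old name is still listed when the restart runs, the new node
%% can fail to start with a name conflict.
net_adm:names().

%% Stop the VM on purpose. If the heart command works, a new node should
%% come up shortly afterwards.
erlang:halt().
```

If the node does not come back after erlang:halt(), the problem is in the 
heart command itself rather than in how the crash is detected.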

This looks like the VM crashing in a bad way, so it would be nice if you 
could install a debugger on the system, register it as the just-in-time 
debugger, and then mail me the stack dump or something.

I usually use Windbg, which is included in the Microsoft SDK. If you run 
"Windbg -I" it installs itself as the just-in-time debugger, and you could 
possibly get a look at erl.exe's stack at the point where it fails. I 
recommend adding this in "File->Symbol File Path":
SRV*c:\SymbolCache*http://msdl.microsoft.com/download/symbols;
after creating the directory c:\SymbolCache; otherwise you will get 
corrupted stacks when you look into Windows' own code, and the stacks of 
Erlang's threads will often be unreadable.
> Thanks,
> Eric
>
Cheers,
/Patrik
> Additional Information:
>
> vm.args is:
>
> ## Name of the node
> -name <snip>
>
> ## Cookie for distributed erlang
> -setcookie <snip>
>
> ## Heartbeat management; auto-restarts VM if it dies or becomes unresponsive
> ## (Disabled by default..use with caution!)
> -heart
>
> ## Enable kernel poll and a few async threads
> +K true
> +A 32
> +P 1000000
>
> ## Increase number of concurrent ports/sockets
> -env ERL_MAX_PORTS 100000
>
> ## Tweak GC to run more often
> -env ERL_FULLSWEEP_AFTER 10
>
>
> os:getenv("HEART_COMMAND") returns:
>
> %node_root%\bin\%node_name%.cmd start