[erlang-questions] user process suspended

Tue Sep 22 16:33:12 CEST 2015

>> We have been collecting crashdumps for all those instances and all of
>> them show the same process pattern: the user process is suspended
>> while all others are waiting. Again I suspect that all of them are
>> waiting simply because there is not a lot of activity in the node so I
>> am inclined to think that the suspended user process is probably the
>> actual issue.
>>
>> We can remote shell into those nodes and we have tested that we can
>> create and write files, so IO is not completely blocked.
>>
>
> I've seen similar issues to this on EC2 nodes on AWS that were outputting
> content whenever the host instance had major issues (randomly going bad on
> disk or whatever), which we more or less chalked up to be issues with "the
> cloud being the cloud".

These nodes are running in virtual machines too, but on our own VMware
servers. It could be related though.

> We would be logging a lot of content out and it could suddenly stall entire
> nodes. The workaround we had for it was to move all of our logging output to
> an asynchronous mechanism with buffering:

We do log a lot too. However, most of the logging goes through an
internal queue implemented on top of disk_log, so that transient IO
issues do not apply backpressure all the way up the chain. We are
starting to suspect that it is the lager_console_backend that hangs in
a send to the io port. We have just disabled it and we'll see if it helps.
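
For reference, disabling it amounts to something like the fragment
below. This is only a sketch, assuming the handlers are set through
sys.config; the file name and log level are placeholders and not our
actual settings.

```erlang
%% sys.config (sketch): lager handlers with the console backend removed.
%% The file name and level below are placeholders for illustration only.
[
 {lager, [
   {handlers, [
     %% Console backend commented out: suspected of hanging in a send
     %% to the io port.
     %% {lager_console_backend, info},
     {lager_file_backend, [{file, "log/console.log"}, {level, info}]}
   ]}
 ]}
].
```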

> https://github.com/ferd/batchio
>
> The thing is marked experimental, but has been at the core of logging for
> logplex (https://github.com/heroku/logplex) for a couple of years now
> without a problem.
>
> The big downside you could see from it would be that when the user process
> can't deal with content anymore, it will drop output messages on the floor
> and just shed load.

Thanks. I believe we have code to do similar things in our system, but
I'll take a closer look.

> Mind you that this may only help if you can notice a significant amount of
> output that could explain the IO stuff stalling. If not, the problem could
> very well be elsewhere.

What do you mean by IO stuff stalling? Temporarily or permanently?
We do send quite a lot of IO, but the system stops doing IO when this
happens and never recovers (we need to restart the node).
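
In case it helps anyone looking at the same symptom, a check along
these lines from a remote shell should show what the user process is
doing on a live node. It is a rough sketch that uses only standard
erlang:process_info/2 items and nothing specific to our setup.

```erlang
%% Run from a remote shell attached to the affected node.
%% 'user' is the registered name of the process handling standard IO.
Pid = whereis(user),
erlang:process_info(Pid, [status,
                          current_function,
                          message_queue_len,
                          current_stacktrace]).
```

A status of suspended together with a growing message_queue_len would
be consistent with what we see in the crashdumps, although by itself
it does not tell us why the port is not draining.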

-- 
Samuel