[erlang-questions] user process suspended

Fred Hebert mononcqc@REDACTED
Tue Sep 22 16:18:32 CEST 2015


On 09/22, Samuel wrote:
>We have been collecting crashdumps for all those instances and all of
>them show the same process pattern: the user process is suspended
>while all others are waiting. Again I suspect that all of them are
>waiting simply because there is not a lot of activity in the node so I
>am inclined to think that the suspended user process is probably the
>actual issue.
>
>We can remote shell into those nodes and we have tested that we can
>create and write files, so IO is not completely blocked.
>

I've seen similar issues on EC2 nodes on AWS: nodes that were 
outputting content would stall whenever the host instance had major 
issues (randomly going bad on disk or whatever), which we more or less 
chalked up to "the cloud being the cloud".

We would be logging a lot of content out and it could suddenly stall 
entire nodes. The workaround we had for it was to move all of our 
logging output to an asynchronous mechanism with buffering:

https://github.com/ferd/batchio

The thing is marked experimental, but has been at the core of logging 
for logplex (https://github.com/heroku/logplex) for a couple of years 
now without a problem.

The big downside you could see from it would be that when the user 
process can't deal with content anymore, it will drop output messages on 
the floor and just shed load.
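
To make the idea concrete, here is a minimal sketch of that pattern. It 
is not batchio's actual code; the module name shed_out, the batch size, 
the flush interval and the MaxQueue threshold are all made up for 
illustration. Callers never block: they either hand the output to a 
writer process or drop it when the writer's mailbox is too long. Only 
the writer ever talks to the user process, so only the writer can stall.

```erlang
-module(shed_out).
-export([start/1, put/2]).

%% Hypothetical values, chosen only for the example.
-define(BATCH, 100).     % flush after this many buffered messages
-define(FLUSH_MS, 50).   % or after this much idle time

%% Start the writer. MaxQueue is the mailbox length at which
%% callers start dropping output instead of queueing it.
start(MaxQueue) ->
    Pid = spawn(fun() -> loop([], 0) end),
    {Pid, MaxQueue}.

%% Never blocks the caller: enqueue asynchronously, or shed load.
%% Works for a local writer pid only (process_info/2 limitation).
put({Pid, MaxQueue}, IoData) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} when Len < MaxQueue ->
            Pid ! {out, IoData},
            ok;
        _ ->
            %% Writer is backed up (or dead): drop the message.
            dropped
    end.

%% Buffer messages and write them out in batches.
loop(Buf, N) when N >= ?BATCH ->
    flush(Buf),
    loop([], 0);
loop(Buf, N) ->
    receive
        {out, Data} -> loop([Data | Buf], N + 1)
    after ?FLUSH_MS ->
        flush(Buf),
        loop([], 0)
    end.

flush([]) -> ok;
flush(Buf) ->
    %% This call blocks while the user process is stuck; the mailbox
    %% then grows and put/2 begins returning 'dropped'.
    io:put_chars(user, lists:reverse(Buf)).
```

Usage would be something like W = shed_out:start(1000) at boot, then 
shed_out:put(W, "some line\n") wherever you would otherwise call 
io:format/2. A return value of 'dropped' is the load shedding mentioned 
above.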

Mind you, this may only help if you can see a significant amount of 
output that could explain the IO stalling. If not, the problem could 
very well be elsewhere.

Regards,
Fred.


