[erlang-questions] Update: segfaults in cowboy rest server

Garry Hodgson garry@REDACTED
Thu Apr 23 19:38:40 CEST 2015


we never did find the exact cause, but it appears that it was
unrelated to anything in cowboy or our code. after noticing
the system was stable on one of the nodes, and failing on
the other, we ended up re-creating the failing vm, and all
has been well since.  i suspect some kind of mismatch
between erlang versions of code and external libraries,
as we were in the midst of upgrading to 17.4 (from 15),
but i don't know for sure.

the system is working well now, with stable low memory
usage. a few months ago, our old version (webmachine, r15,
before other changes) took a few hours to post 500K logs.
we are now handling 800K/minute using two smallish vms.
life is good.

thanks to all who offered suggestions. we learned a lot
chasing this down.

On 4/9/15 10:22 AM, Garry Hodgson wrote:
> i've got a problem i'm trying to solve wrt controlling memory
> consumption in a cloud environment. i've got a server that
> receives log data from security appliances and stores it in a
> mariadb database. logs are sent to us via RESTful api calls,
> with batches of logs embedded in the json body of a POST
> call. they can get rather large, and we get a lot of them,
> at a high rate.
>
> when production load increased beyond what was anticipated
> (doesn't it always?) we began having failures, with the server
> disappearing without a trace. in some cases oom-killer killed
> it, in others it would fail trying to allocate memory. we only
> saw the latter by running in erlang shell and waiting until
> it died, then we saw a terse error message.
>
> to prevent this, i added a check in service_available() to
> see if erlang:memory( total ) + content-length > some threshold,
> and reject the request if so. also, having read the recent threads
> about garbage collecting binaries, i added a timer to check every
> 30 seconds that forces gc on all processes if memory usage
> is too high.
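
a minimal sketch of the two guards described above, for anyone reading
this in the archive. the module name mem_guard, the function names and
the Threshold argument are made up for illustration; this is not the
code we ran, just the same idea built on erlang:memory/1,
erlang:garbage_collect/1 and timer:apply_interval/4.

```erlang
-module(mem_guard).
-export([over_limit/2, gc_if_high/1, start_gc_timer/1]).

%% true if accepting a body of ContentLength bytes would push the
%% vm's total allocated memory past Threshold (both in bytes).
over_limit(ContentLength, Threshold) ->
    erlang:memory(total) + ContentLength > Threshold.

%% force a gc on every process, but only when memory use is already
%% above Threshold. returns the number of processes we asked to collect.
gc_if_high(Threshold) ->
    case erlang:memory(total) > Threshold of
        true ->
            Pids = erlang:processes(),
            lists:foreach(fun erlang:garbage_collect/1, Pids),
            length(Pids);
        false ->
            0
    end.

%% run gc_if_high/1 every 30 seconds. returns {ok, TRef}.
start_gc_timer(Threshold) ->
    timer:apply_interval(30000, ?MODULE, gc_if_high, [Threshold]).
```

in a cowboy rest handler, service_available/2 would then return
{not mem_guard:over_limit(Len, Limit), Req, State}, where Len is the
parsed content-length and Limit is whatever threshold suits the vm.
how Len is read from the request depends on the cowboy version, so
that part is left out here.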
>
> this seems to work pretty well, except that after a few days
> of running, we get hard crashes, with segfaults showing up
> in /var/log/messages:
>
> kernel: beam[18098]: segfault at 7f09a004040c ip 000000000049e209 sp 00007fff860d32b0 error 4 in beam[400000+2ce000]
>
> kernel: beam[14177]: segfault at 7fce288829bc ip 000000000049e209 sp 00007fffa0d2d7a0 error 4 in beam[400000+2ce000]
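
a note on reading those kernel lines: both faults have the same
instruction pointer (49e209), and "error 4" is the x86 page-fault
error code, a small bit field. the sketch below decodes it and
computes the offset of the faulting instruction inside the beam
mapping (400000+2ce000). the module name segv and its functions are
hypothetical helpers, not part of otp.

```erlang
-module(segv).
-export([decode/1, offset/2]).

%% decode the x86 page-fault error code printed by the kernel.
%% bit 0: 0 = page not present, 1 = protection violation
%% bit 1: 0 = read access,      1 = write access
%% bit 2: 0 = kernel mode,      1 = user mode
decode(Err) ->
    [{cause,  case Err band 1 of 0 -> page_not_present;
                                 _ -> protection_violation end},
     {access, case Err band 2 of 0 -> read;   _ -> write end},
     {mode,   case Err band 4 of 0 -> kernel; _ -> user  end}].

%% offset of the faulting instruction from the start of the mapping.
offset(Ip, Base) ->
    Ip - Base.
```

segv:decode(4) gives a user-mode read of a page that was not mapped,
and segv:offset(16#49e209, 16#400000) is 16#9e209. so both crashes
were the same instruction in beam reading a bad address, which at
least says the fault is reproducible at one spot rather than random.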
>
> i've been using erlang for 15 years, and have never seen a segfault.
> we've recently updated from r15b02 to r17.4, and we've also
> switched from webmachine to cowboy. i don't know if either of
> those things is relevant. i'm kind of at a loss as to how to diagnose
> or deal with this.
>
> any advice would be greatly appreciated.
>



