[erlang-questions] Garbage Collection, BEAM memory and Erlang memory

Fri Jan 23 19:47:29 CET 2015

On 01/22/2015 08:33 AM, Roberto Ostinelli wrote:
> Dear List,
> I'm having some troubles in pinpointing why a node is crashing due to memory issues.
> For info, when it crashes, it does not produce a crash dump. However I've monitored live and I've seen the .beam process eat up all memory until it abruptly exits.
>
> The system is a big router that relays data coming from TCP connections, into other TCP connections.
> I'm using cowboy as the HTTP server that initiates the long-lived TCP connections.
>
Based on your other posts, you have an Erlang architectural problem: your long-lived processes hold onto old binary references (they process binaries quickly enough to cause excessive memory consumption).  For an architectural problem, fullsweep_after, hibernate, and calling erlang:garbage_collect/0 are not solutions; they only delay the inevitable death of your Erlang node as the throughput overloads your source code (the same goes for the Erlang VM memory tweaks described at https://blog.heroku.com/archives/2013/11/7/logplex-down-the-rabbit-hole).  Those are things to think about once you have source code (an architecture) that works.  You need to make sure you use temporary Erlang processes wherever excessive binary garbage is created, so that GC is done quickly: when a temporary process exits, its heap and its binary references are released at once, so the Erlang node is not overloaded and does not accumulate memory.
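
As a rough illustration of the temporary-process approach (a minimal sketch, not code from your system or from CloudI), the module below runs a binary-heavy function in a short-lived process and hands back only a small result.  The module name temp_worker and the functions run/2 and summarize/1 are hypothetical names made up for this example.

```erlang
-module(temp_worker).
-export([run/2, summarize/1]).

%% Run Fun(Arg) in a temporary process and wait for its result.
%% Everything Fun allocates (heap data, sub-binaries, references to
%% large binaries) is released when the temporary process exits,
%% instead of lingering in the long-lived caller until its next GC.
run(Fun, Arg) when is_function(Fun, 1) ->
    {Pid, MonitorRef} = spawn_monitor(fun() -> exit({ok, Fun(Arg)}) end),
    receive
        {'DOWN', MonitorRef, process, Pid, {ok, Result}} ->
            {ok, Result};
        {'DOWN', MonitorRef, process, Pid, Reason} ->
            %% Fun crashed; the caller survives and sees the reason.
            {error, Reason}
    end.

%% Hypothetical workload: split a payload into lines but return only
%% a small summary, so no sub-binary outlives the temporary process.
summarize(Payload) when is_binary(Payload) ->
    Lines = binary:split(Payload, <<"\n">>, [global]),
    {length(Lines), byte_size(Payload)}.
```

A long-lived connection process would call temp_worker:run(fun temp_worker:summarize/1, Payload) for each payload instead of doing the parsing itself.  The result should stay small; returning a large binary from the temporary process would just hand the reference back to the long-lived process.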

If you don't believe me, this is already proven in CloudI.  The 1.4.0 loadtests show a situation similar to your scenario: many HTTP cowboy connections sending data into other TCP connections (http://cloudi.org/faq.html#5_LoadTesting).  The cowboy connections are handled by a CloudI service called cloudi_service_http_cowboy, and I use Tsung to send HTTP traffic (20000 connections doing 10000 requests/second) to a single TCP connection bottleneck, because it is most useful to test bottlenecks (in CloudI, this is a service that is configured with a process count and thread count of 1).  The service process count and the thread count (used by non-Erlang (external) services) can easily be increased to increase the number of TCP connections used by an external service (the total number of TCP connections for a service instance is count_process * count_thread).  These TCP connections are what the CloudI API uses in OS processes to facilitate external services, so the CloudI details I am describing relate to your problem.

If you were to solve this problem using CloudI's service abstraction, you would use either cloudi_service_tcp or your own internal CloudI service (an Erlang or Elixir service) to facilitate your own TCP connections with the specific configuration and processing you need (so that depends on business logic).  The reason CloudI doesn't have problems in internal services when creating lots of binary garbage is that it uses temporary processes for service request processing.  That is controlled by a service configuration option called 'request_pid_uses': after processing X service requests, a new temporary process is created.  The setting defaults to 1, so each service request uses a new temporary process while still keeping the service's processing transactional (http://cloudi.org/api.html#2_services_add_config_opts).  To see this benefit in a simpler way, there are some results that show how an Erlang service can facilitate large throughput based on how many service processes are used: https://github.com/CloudI/CloudI/blob/develop/src/tests/request_rate/results/results_v1_4_0/results.txt#L26-L46 .  That testing is separate from the Tsung/HTTP loadtests and is focused on service request traffic between two CloudI services, both internal Erlang services, showing the maximum throughput facilitated with specific service configuration options.  The higher throughput comes from using a 'lazy' destination refresh method, which avoids contention on process lookup by using cached data (the 'lazy' setting is set on the sending service in its service configuration: http://cloudi.org/api.html#1_Intro_dest).
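
The idea behind 'request_pid_uses' (replace the request-handling process after X requests) can be sketched in plain Erlang without CloudI.  This is not CloudI's implementation, only an illustration of the pattern under my own assumptions; the module name recycling_worker and its functions are hypothetical.

```erlang
-module(recycling_worker).
-export([start/2, call/2, stop/1]).

%% Start a long-lived dispatcher.  Handler is a fun/1; Uses is how
%% many requests one temporary worker handles before it is replaced
%% (Uses = 1 means a fresh process for every request).
start(Handler, Uses)
  when is_function(Handler, 1), is_integer(Uses), Uses > 0 ->
    spawn(fun() -> dispatch(Handler, Uses, undefined, 0) end).

%% Synchronous request.  Returns {WorkerPid, Result} so the recycling
%% is visible to the caller, or {error, Reason} if the dispatcher died.
call(Dispatcher, Request) ->
    Ref = erlang:monitor(process, Dispatcher),
    Dispatcher ! {request, self(), Ref, Request},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, _Pid, Reason} ->
            {error, Reason}
    end.

stop(Dispatcher) ->
    Dispatcher ! stop,
    ok.

%% No uses left: the old worker exits on its own, so start a new one.
dispatch(Handler, Uses, _OldWorker, 0) ->
    Worker = spawn_link(fun() -> worker(Handler, Uses) end),
    dispatch(Handler, Uses, Worker, Uses);
dispatch(Handler, Uses, Worker, Left) ->
    receive
        {request, From, Ref, Request} ->
            %% The dispatcher only forwards; the garbage is created
            %% in the temporary worker.
            Worker ! {handle, From, Ref, Request},
            dispatch(Handler, Uses, Worker, Left - 1);
        stop ->
            unlink(Worker),
            exit(Worker, shutdown),
            ok
    end.

%% A worker handles N requests and then exits, which frees its heap
%% and drops its binary references in one step.
worker(_Handler, 0) ->
    ok;
worker(Handler, N) ->
    receive
        {handle, From, Ref, Request} ->
            From ! {Ref, {self(), Handler(Request)}},
            worker(Handler, N - 1)
    end.
```

With Uses set to 2, the first two calls are answered by one worker pid and the third by a new one.  A larger value trades some garbage retention for fewer process spawns, which is the same trade-off the CloudI option exposes.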

Reusing CloudI is easier and simpler than experiencing these NIH-syndrome problems you have described.