[erlang-questions] Weird Client SSL Behavior / Performance

John-Paul Bader hukl@REDACTED
Sun May 20 16:27:51 CEST 2012


Hey,


Recently we had to implement a service that sends about 50-500 HTTPS
requests per second to a 3rd-party API with an average latency of about
400-800 ms.
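
(As a sanity check on those numbers: by Little's law, 500 requests per
second at an average latency of 600 ms means about 500 * 0.6 = 300
requests in flight at any given moment, which is why we needed a few
hundred parallel connections.)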

Therefore we had to open a lot of parallel connections to the destination 
host. At first we used lhttpc, but after some time we observed really 
weird behavior. While lhttpc does some kind of connection pooling, 
it was constantly leaking processes that were stuck in prim_inet:recv. 
Netstat showed a continuously growing number of connections stuck in WAIT 
or TIME_WAIT for port 443. After some time all sockets on the machine 
were used up and no communication was possible anymore.
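
In case someone wants to check for the same symptom, counting processes
that are sitting in prim_inet:recv from the Erlang shell looks roughly
like this (a sketch):

    %% Collect all processes whose current function is prim_inet:recv
    %% - for us these were the leaked lhttpc connection handlers.
    Stuck = [P || P <- erlang:processes(),
                  case erlang:process_info(P, current_function) of
                      {current_function, {prim_inet, recv, _}} -> true;
                      _ -> false
                  end],
    length(Stuck).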

After trying out various settings for lhttpc without any improvement, we 
switched to ibrowse. ibrowse behaved much more nicely: it was not 
leaking processes, and it did not accumulate the WAIT or TIME_WAIT 
connections that had shown up before in netstat. We told it to open 300 
connections and that was exactly the number it opened.
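
For reference, the ibrowse setup was roughly like the following (host
and values here are illustrative placeholders, not our real ones):

    %% Allow up to 300 parallel connections to the API host,
    %% one request per connection (no pipelining).
    ibrowse:start(),
    ibrowse:set_max_sessions("api.example.com", 443, 300),
    ibrowse:set_max_pipeline_size("api.example.com", 443, 1),
    {ok, Status, _Headers, _Body} =
        ibrowse:send_req("https://api.example.com/some/path",
                         [], get, [], [], 5000).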

Now this went fine, like it should, for about 15 or 20 minutes, before 
_no_ traffic was going through those 300 connections anymore. The VM 
accumulated load and basically stopped communicating with the 3rd-party 
API. Interestingly enough, netstat still showed all 300 connections as 
established.

We started to investigate the true cause of this behavior. At some 
point we set up a local nginx with SSL and tested against that to rule 
out the 3rd-party API. The behavior was the same.

Then we tried plain HTTP instead of HTTPS and boom - that went super 
smoothly, just like it should: 300 connections, minimal load on the VM, 
no leaking processes, and it kept running for more than 15 minutes.

Then we switched back to the real API and back to HTTPS and tried to 
figure out which part of the VM was holding us back, but we could not 
really find a solution. We only saw that, after those 15 minutes, a lot 
of messages were accumulating in the outer gen_fsm:loop of ssl.
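
For anyone wanting to reproduce that observation, a sketch for listing
the processes with the longest message queues (the top entries in our
case were the ssl connection processes):

    %% Ten processes with the longest message queues, longest first.
    %% The pattern in the second generator skips dead processes.
    Top = lists:reverse(lists:keysort(1,
            [{Len, P} || P <- erlang:processes(),
                         {message_queue_len, Len} <-
                             [erlang:process_info(P, message_queue_len)]])),
    lists:sublist(Top, 10).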

In the end, after multiple days of investigation and experiments, we set 
up stunnel in client mode on the same machine, connecting to the 
3rd-party API. That way our Erlang service just talks plain HTTP to 
stunnel. This has been performing extremely well for multiple days now.
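
In case it helps anyone, the stunnel side is essentially a client-mode
tunnel in front of the API; a minimal stunnel.conf sketch (hostname and
ports are placeholders):

    ; Erlang talks plain HTTP to 127.0.0.1:8888,
    ; stunnel talks HTTPS to the 3rd-party API.
    [api-tls]
    client = yes
    accept = 127.0.0.1:8888
    connect = api.example.com:443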

This is somewhat unsatisfactory though.

First of all, I'd really like to know why the problem occurs in the 
first place and why it creates so much load in general. Stunnel deals 
with the same load using a third or even less of what the Erlang VM 
needs. For now, at least, I'm a little underwhelmed by Erlang's SSL 
stability and performance (I was bitten by the R14B03 bug as well).

Secondly, lhttpc's socket handling is not really that great. I'm sure it 
works fine for HTTP or in low-load HTTPS scenarios, but in our case, 
leaking processes, using up all sockets, and thereby stopping the 
service altogether was super bad.

Is there anybody with similar observations or maybe even solutions?


~ John


