[erlang-questions] Weird Client SSL Behavior / Performance
John-Paul Bader
hukl@REDACTED
Sun May 20 16:27:51 CEST 2012
Hey,
Recently we had to implement a service that sends about 50-500 https
requests per second to a 3rd party API with an average latency of about
400-800 ms.
Therefore we had to open a lot of parallel connections to the
destination host. At first we used lhttpc, but after some time we
observed really weird behavior. While lhttpc would use some kind of
connection pooling, it was constantly leaking processes that were stuck
in prim_inet:recv. Netstat showed a continuously growing number of
connections stuck in WAIT or TIME_WAIT for port 443. After some time
all sockets on the machine were exhausted and no communication was
possible anymore.
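For reference, a rough diagnostic along these lines - purely
illustrative, not taken from our service code - shows how such stuck
processes can be counted from a remote shell:

    %% Count live processes whose current call is prim_inet:recv - the
    %% state the leaked lhttpc workers were stuck in. process_info/2
    %% returns 'undefined' for pids that have already died, which the
    %% catch-all clause simply skips.
    Stuck = [P || P <- erlang:processes(),
                  case erlang:process_info(P, current_function) of
                      {current_function, {prim_inet, recv, _}} -> true;
                      _ -> false
                  end],
    length(Stuck).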
After trying out various settings for lhttpc without any improvement,
we switched to ibrowse instead. ibrowse behaved much more nicely: it
did not leak processes and did not accumulate the WAIT or TIME_WAIT
connections that had shown up in netstat before. We told it to open 300
connections and that was exactly the number it opened.
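For context, the requests were issued with per-request ibrowse options
roughly along these lines (URL, headers and payload here are
placeholders; max_sessions is the option that caps the pool per
host/port pair):

    Url  = "https://api.example.com/endpoint",   %% placeholder URL
    Opts = [{max_sessions, 300},                 %% cap the pool at 300 connections
            {max_pipeline_size, 1}],             %% no pipelining
    {ok, _Status, _Headers, _Body} =
        ibrowse:send_req(Url, [{"content-type", "application/json"}],
                         post, <<"{}">>, Opts).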
Now this went fine, as it should, for about 15 or 20 minutes, before
_no_ traffic was going through those 300 connections anymore. The VM
accumulated load and basically stopped communicating with the 3rd party
API. Interestingly enough, netstat still showed all 300 connections as
established.
We started to investigate the true cause of this behavior. At some
point we set up a local nginx with ssl and tested against that to rule
out the 3rd party API. The behavior was the same.
Then we tried http instead of https and boom - that went super
smoothly, just as it should: 300 connections, minimal load on the VM,
no leaking processes, and it kept running for more than 15 minutes.
Then we switched back to the real API and back to https and tried to
figure out which part of the VM was holding us back, but we could not
really find a solution. We only saw that a lot of messages were
accumulating in the outer gen_fsm:loop of ssl after those 15 minutes.
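An illustrative way to see this from a remote shell (not the exact
commands we ran) is to sort processes by message queue length; the
backed-up ssl connection processes showed up at the top of such a
listing:

    %% List the ten processes with the largest mailboxes.
    Queues = [{P, Len} || P <- erlang:processes(),
                          {message_queue_len, Len} <-
                              [erlang:process_info(P, message_queue_len)]],
    lists:sublist(lists:reverse(lists:keysort(2, Queues)), 10).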
In the end, after multiple days of investigation and experiments, we
set up stunnel in client mode on the same machine, connecting to the
3rd party API. This way our Erlang service just talks plain http to
stunnel. This has been performing extremely well for multiple days now.
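The stunnel side boils down to a client-mode service definition along
these lines (host and ports are placeholders for the real ones):

    ; stunnel in client mode: the Erlang service talks plain http to
    ; 127.0.0.1:8888 and stunnel wraps it in TLS towards the 3rd party API
    [api-client]
    client  = yes
    accept  = 127.0.0.1:8888
    connect = api.example.com:443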
This is somewhat unsatisfactory though.
First of all I'd really like to know why the problem occurs in the
first place and why it creates so much load in general. Stunnel deals
with the same load using a third or even less of the CPU that the
Erlang VM needs. For now, at least, I'm a little underwhelmed by
Erlang's ssl stability and performance (I was bitten by the R14B03 bug
as well).
Secondly, lhttpc's socket handling is not really that great. I'm sure
it works fine for http or in low-load https scenarios, but in our case,
leaking processes, using up all sockets and thereby stopping the
service altogether was super bad.
Is there anybody with similar observations or maybe even solutions?
~ John