[erlang-questions] Weird Client SSL Behavior / Performance

Mon May 21 16:51:10 CEST 2012

Hey,

If those bottlenecks are related to the client issues I described,
I would be interested.

~ John

Ingela Andin wrote:
> Hi!
>
> We are currently working on eliminating some bottlenecks in ssl. If
> you want to beta-test,
> I can drop you a mail privately in a couple of days with a git branch.
>
> Regards Ingela Erlang/OTP team Ericsson AB
>
> 2012/5/20, John-Paul Bader <hukl@REDACTED>:
>> Hey,
>>
>>
>> recently we had to implement a service that would send about 50 - 500
>> https requests per second to a 3rd party API that has an average latency
>> of about 400-800ms.
>>
>> Therefore we had to open a lot of parallel connections to the
>> destination host. At first we used lhttpc, but after some time we
>> observed really weird behavior. While lhttpc does some kind of
>> connection pooling, it was constantly leaking processes that were stuck
>> in prim_inet:recv. Netstat showed a continuously growing number of
>> connections stuck in WAIT or TIME_WAIT for port 443. After some time all
>> sockets on the machine were used up and no communication was possible
>> anymore.
>>
>> After trying out various settings for lhttpc without any improvement,
>> we switched to ibrowse. ibrowse behaved much better: it did not leak
>> processes and did not accumulate the WAIT or TIME_WAIT connections that
>> had shown up in netstat before. We told it to open 300 connections and
>> that was exactly the number it opened.
>>
>> This went fine for about 15 or 20 minutes, and then _no_ traffic was
>> going through those 300 connections. The VM accumulated load and was
>> basically no longer communicating with the 3rd party API.
>> Interestingly enough, netstat still showed all 300 connections as
>> established.
>>
>> We started to investigate the true cause of this behavior. At some
>> point we set up a local nginx with ssl and tested against it to rule
>> out the 3rd party API. The behavior was the same.
>>
>> Then we tried to use http instead of https and boom - that went super
>> smooth just like it should. 300 connections, minimal load on the VM, no
>> leaking processes and it kept running for more than 15 minutes.
>>
>> Then we switched back to the real API and back to https and tried to
>> figure out what part of the VM was holding us back, but we could not
>> really find a solution. We only saw that a lot of messages were
>> accumulating in the outer gen_fsm:loop of ssl after those 15 minutes.
>>
>> In the end, after multiple days of investigation and experiments, we
>> set up stunnel in client mode on the same machine, connecting to the
>> 3rd party API. This way our Erlang service just talks plain http to
>> stunnel. This has been performing extremely well for multiple days now.
>>
>> This is somewhat unsatisfactory though.
>>
>> First of all, I'd really like to know why the problem occurs in the
>> first place and why it creates so much load in general. Stunnel handles
>> the same traffic with 1/3 or even less of the load that the Erlang VM
>> needs. For now at least I'm a little bit underwhelmed by Erlang's ssl
>> stability and performance (I was bitten by the R14B03 bug as well).
>>
>> Secondly, lhttpc's socket handling is not really that great. I'm sure
>> it works fine for http or in low-load https scenarios, but in our case
>> leaking processes, using up all sockets and thereby stopping the
>> service altogether was super bad.
>>
>> Is there anybody with similar observations or maybe even solutions?
>>
>>
>> ~ John
>>
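
For reference, the stunnel workaround described in the quoted message
could look roughly like the fragment below. This is only a minimal
sketch of a client-mode stunnel service, not the configuration that was
actually used: the service name, the local port 8080 and the host name
api.example.com are placeholders.

```ini
; Hypothetical stunnel.conf fragment (all names and ports are placeholders).
; stunnel accepts plain-text connections locally and wraps them in SSL
; towards the remote API.
[api-tunnel]
; client mode: the remote side is the SSL server
client = yes
; local plain-text endpoint the Erlang service connects to
accept = 127.0.0.1:8080
; remote https endpoint of the 3rd party API
connect = api.example.com:443
```

With a setup like this, the Erlang service would send plain http
requests to 127.0.0.1:8080 and leave the SSL work to stunnel.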