[erlang-questions] Erlang: searching for a convincing argument
Sun Dec 11 04:51:20 CET 2016
On 12/10/2016 05:29 PM, Todd Greenwood-Geer wrote:
> Hello all,
> I've been a long-time Erlang/OTP fan...but I'm caught in a catch-22. For years, I've wanted to write a substantial system in Erlang/OTP...but I've been stymied b/c none of my colleagues or managers wanted to risk investing in this unknown-to-them platform. W/O any significant personal experience, I have yet to convince anyone that this would be a great path... So I've watched, time and time again, as various portions of the Erlang platform were poorly reimplemented in Java, Python, etc...only to wind up wading through the inevitable profusion of bugs and scalability issues.
Generally, the main argument for using Erlang is the fault tolerance it gives the code you write. The scalability advantage can be had in various programming languages with an actor model library. It is also important to note that Erlang is a functional programming language that attempts to avoid side effects (errors in managing state lead to system instability). Avoiding instability on the server side matters because many clients depend on the server.
> So, I challenged a long-time colleague to come up with a problem that would convince him that he should have implemented some problem in Erlang. He came up with this problem from a previous company... Periodically, say once a month, his prev company had to send out a mass email (templated) to their customer base. This grew from thousands to millions over the course of a few years. As the number of emails increased, their simple script started to run from minutes to hours to days... Furthermore, the email providers impose throughput constraints such that you can only send X number of emails per hour in the first hour, Y in the second, etc. The ramp-ups were explicitly documented and not adhering to them could get you throttled or black-listed.
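The ramp-up constraint described above (X emails in the first hour, Y in the second, and so on) can be sketched as a pure function. This is only a minimal sketch: the module name ramp, the representation of the schedule as a list of positive hourly caps, and the rule that the last cap applies to every later hour are assumptions for illustration, not details from the original script or from any provider's documentation.

```erlang
-module(ramp).
-export([allowed/2, delay_ms/2]).

%% Schedule is a non-empty list of positive hourly caps, e.g.
%% [1000, 5000, 20000]: the first hour may send 1000 emails, the
%% second 5000, and so on. The last cap is reused for all later hours.

%% allowed(Schedule, ElapsedSeconds) -> cap for the current hour.
allowed([Cap], _Elapsed) -> Cap;
allowed([Cap | _], Elapsed) when Elapsed < 3600 -> Cap;
allowed([_ | Rest], Elapsed) -> allowed(Rest, Elapsed - 3600).

%% delay_ms(Schedule, ElapsedSeconds) -> pause between two sends, in
%% milliseconds, that spreads the current hourly cap evenly over the hour.
delay_ms(Schedule, Elapsed) ->
    3600 * 1000 div allowed(Schedule, Elapsed).
```

A sender process could call ramp:delay_ms/2 before each send and sleep for that long, which keeps the rate limit in one place regardless of how many workers exist.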
> TEST CASE
> To showcase why Erlang is so great, I suggested that we could model external and internal failures and show that the only end result was a change in throughput.
> Erlang Nodes      Fake SMTP Relay     Throughput
> [1...N]
> FailureRate1      FailureRate2        emails/sec
> * Inputs: 1 million email addresses, read from a file.
> * The Erlang nodes have code/app that processes the email addresses.
> * The Fake SMTP Relay just receives the emails and writes them to a file or /dev/null, whatever.
> * FailureRate1 is the percentage of Nodes that are dead (killed, etc.), simulating hardware faults etc.
> * FailureRate2 is the percentage of errors that the Fake SMTP relay reports, simulating 3rd party endpoint failures.
> I also suggested that, initially, he give me an incorrect address for his SMTP relay, and I'd perform a hot code update to correct this. Pretty sick (cool) right?
> At this point, I have some design questions...
> # Design 1 : Use a database
> I could dump the 1Million email addresses into a database (ETS/mnesia, etc.) and have processes reading/writing state to the db as they process each email. But he was unimpressed, as this is so much like just using any old language that uses the db as a work queue (so long as the db is replicated).
> # Design 2 : Use erlang processes, all in memory at the same time
> I could create an Erlang process for each email address... but scaling is memory bound, so this doesn't seem right at all.
> # Design 3: Use erlang processes, but only read in a subset of the email addresses at a time
> I could read from the input file and create only M erlang processes at a time and then write to ETS to signify completion. But if I'm writing to ETS, I may as well read all the data into ETS/mnesia at the start, and use it as a work queue. Back to Design # 1.
> Ok, putting that question aside for a second...
Design #1 without ETS or mnesia would be a good approach. A SQL or NoSQL database would be picked based on the usage patterns and operational concerns. The impressive part is having a system that can survive failure scenarios independent of the database, that is, runtime problems in the source code that the developers did not anticipate.
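To make "surviving unanticipated runtime problems" concrete in plain Erlang, here is a minimal sketch of a bounded pool: at most M worker processes exist at a time (as in Design #3), each worker is monitored, and an address whose worker crashes goes back on the queue. The module name mail_pool and the SendFun callback are hypothetical; SendFun stands in for the real SMTP call and signals failure by crashing. The in-memory list stands in for the database work queue, and the is_map_key guard assumes OTP 21 or later.

```erlang
-module(mail_pool).
-export([run/3]).

%% run(Addresses, M, SendFun) -> {ok, Sent}
%% Sends to every address with at most M concurrent worker processes.
%% A worker that exits with any reason other than normal is treated as
%% a failed send and its address is queued again.
run(Addresses, M, SendFun) when is_integer(M), M > 0 ->
    loop(Addresses, #{}, M, SendFun, 0).

%% Queue is empty and no workers are left: done.
loop([], Running, _M, _SendFun, Sent) when map_size(Running) =:= 0 ->
    {ok, Sent};
%% Room in the pool: start a monitored worker for the next address.
loop([Addr | Rest], Running, M, SendFun, Sent) when map_size(Running) < M ->
    {_Pid, Ref} = spawn_monitor(fun() -> SendFun(Addr) end),
    loop(Rest, Running#{Ref => Addr}, M, SendFun, Sent);
%% Pool is full or the queue is empty: wait for one worker to exit.
loop(Queue, Running, M, SendFun, Sent) ->
    receive
        {'DOWN', Ref, process, _Pid, Reason} when is_map_key(Ref, Running) ->
            {Addr, Running1} = maps:take(Ref, Running),
            case Reason of
                normal -> loop(Queue, Running1, M, SendFun, Sent + 1);
                _Crash -> loop([Addr | Queue], Running1, M, SendFun, Sent)
            end
    end.
```

With a SendFun that crashes some fraction of the time (FailureRate2 in the test case), the run still finishes with every address sent; only the elapsed time changes. The sketch retries forever, so a real version would cap retries per address and record completion in the database instead of a counter.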
I would choose to use http://cloudi.org/ because it saves me development time. I would probably have 2 CloudI services: one that periodically reads from the database (ServiceA) and one that sends an email based on the contents of a received service request (ServiceB), with ServiceA sending to ServiceB. That makes the concurrency and throughput concerns a matter of service configuration settings, thanks to the various features in CloudI.
> # Distribution 3 : How to distribute the app for resiliency?
> I'd like to run this on N nodes and have a random reaper (chaos monkey, whatever) kill the Erlang nodes (or the underlying VM) randomly to simulate hardware errors. Again, my thinking feels constrained. I keep coming back to: stuff the state in a db, spin up a supervisor and a bunch of worker processes on a separate node. If the node with the worker processes dies, the supervisor creates worker processes on a different node, and so forth.
> Despite having read all the books I can find on Erlang and reading the list for years now... I still don't really know the best way to have supervisors living on separate nodes, reacting to node failures such that the application picks up where it left off on a new node. It might be simpler to just fan the workers out across all the nodes since the state is maintained in the db/queue. I'd still have to maintain state in the db to ensure that all the processes are adhering to the rate limits.
> But the central question remains, if I'm randomly killing my nodes, and if the node with the supervisor dies, what then? How do I replicate supervisors? What's the pattern that I'm missing here?
CloudI provides node auto-discovery through LAN multicast or the AWS EC2 API, which helps simplify managing a group of nodes. With the same services running on each node, failover can occur based on the routing of CloudI service requests. For system testing, the service configuration options monkey_chaos and monkey_latency exist, so Chaos Monkey and/or Latency Monkey testing only requires a tweak to the configuration of existing services, whether for a separate environment, automated tests, or whatever is required.
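For comparison, the stock OTP answer to "what if the node with the supervisor dies?" is a distributed application: the kernel application restarts the whole application, top supervisor included, on another node when the node running it goes down. Below is a sketch of that kernel configuration. The application name mailer and the node names are hypothetical, and the timeouts are arbitrary example values.

```erlang
%% sys.config for node a@host1 (application and node names are hypothetical)
[{kernel,
  [%% mailer normally runs on a@host1. If that node goes down, it is
   %% restarted on b@host2 or c@host3 after waiting 5000 ms.
   {distributed, [{mailer, 5000, ['a@host1', {'b@host2', 'c@host3'}]}]},
   %% Nodes that must be reachable before this node finishes starting,
   %% and how long (in ms) to wait for them.
   {sync_nodes_mandatory, ['b@host2', 'c@host3']},
   {sync_nodes_timeout, 30000}]}].
```

Each of the other nodes needs the same distributed entry, with sync_nodes_mandatory listing its own peers. Failover restarts the application from scratch, so picking up where it left off still depends on the progress recorded in the database.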
> So, when my friend first described his problem, I thought, "I can do this in like, 4 lines of Erlang!" Then he started adding constraints like rate limitations, etc... and I thought, "Ok, 20-30 lines...". Now, looking at the problem, I've spent a couple hours just writing it down and trying to consider how to solve it... I have more questions than when I started.
> I'd love to hear thoughts about solving this (or similar) problem(s). I've come to the conclusion that I cannot evangelize Erlang if I don't know how to solve even simple problems with it.
I created CloudI because I have been in situations similar to yours in the past. I know that CloudI saves me development time, so I would naturally use it to solve a problem like this. However, everyone approaches a problem differently, so there are many ways of approaching this with Erlang.
> BTW - over the years I've read:
> Programming Erlang (ed 2 is on order)
> Learn You Some Erlang For Great Good
> Erlang and OTP In Action
> Erlang Programming
> erlang-questions mailing list