[erlang-questions] Erlang: searching for a convincing argument

Sun Dec 11 02:29:19 CET 2016

Hello all,

OVERVIEW

I've been a long-time Erlang/OTP fan...but I'm caught in a catch-22. For 
years, I've wanted to write a substantial system in Erlang/OTP...but 
I've been stymied b/c none of my colleagues or managers wanted to risk 
investing in this unknown-to-them platform. W/O any significant personal 
experience, I have yet to convince anyone that this would be a great 
path... So I've watched, time and time again, as various portions of the 
Erlang platform get poorly reimplemented in Java, Python, etc...only 
to wind up wading through the inevitable profusion of bugs and 
scalability issues.

CHALLENGE

So, I challenged a long-time colleague to come up with a problem that 
would convince him that he should have implemented its solution in 
Erlang. He came up with this one from a previous company... 
Periodically, say once a month, that company had to send out a 
templated mass email to their customer base. This grew from thousands to 
millions of recipients over the course of a few years. As the number of 
emails increased, their simple script went from running for minutes to 
hours to days... Furthermore, the email providers impose throughput constraints 
such that you can only send X number of emails per hour in the first 
hour, Y in the second, etc. The ramp-ups were explicitly documented and 
not adhering to them could get you throttled or black-listed.
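
To make the ramp-up constraint concrete, here is a rough sketch of how I 
picture the schedule. The module name `ramp`, both function names, and 
the cap values are all made up for illustration; I'm also assuming the 
last documented cap keeps applying to every later hour, which a real 
provider may or may not do.

```erlang
-module(ramp).
-export([limit/2, hours_needed/2]).

%% Schedule is a list of per-hour caps, e.g. [X, Y, ...].
%% Assumption: caps are positive integers, and the last cap applies
%% to every hour after the schedule runs out.

%% Cap for the given 1-based hour.
limit(Hour, Schedule) when is_integer(Hour), Hour >= 1, Schedule =/= [] ->
    lists:nth(min(Hour, length(Schedule)), Schedule).

%% Number of whole hours needed to send Total emails without ever
%% exceeding the cap for any hour.
hours_needed(Total, Schedule) when is_integer(Total), Total >= 0 ->
    hours_needed(Total, Schedule, 1).

hours_needed(Remaining, _Schedule, Hour) when Remaining =< 0 ->
    Hour - 1;
hours_needed(Remaining, Schedule, Hour) ->
    hours_needed(Remaining - limit(Hour, Schedule), Schedule, Hour + 1).
```

With made-up caps of 1000, 10000 and 100000 emails per hour, 
`ramp:hours_needed(1000000, [1000, 10000, 100000])` works out to 12 
hours, so the schedule alone puts a floor under the total run time no 
matter how many nodes I throw at it.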

TEST CASE

To showcase why Erlang is so great, I suggested that we could model 
external and internal failures and show that the only end result was a 
change in throughput.

Erlang Nodes      | Fake SMTP Relay         | Throughput
------------------|-------------------------|-----------------------
[1...N]           | [1]                     |
FailureRate1      | FailureRate2            | emails/sec

* Inputs: 1 million email addresses, read from a file.
* The Erlang nodes have code/app that processes the email addresses.
* The Fake SMTP Relay just receives the emails and writes them to a file 
or /dev/null, whatever.
* FailureRate1 is the percentage of Nodes that are dead (killed, etc.), 
simulating hardware faults etc.
* FailureRate2 is the percentage of errors that the Fake SMTP relay 
reports, simulating third-party endpoint failures.
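
As a rough sketch of the FailureRate2 side, something like the following 
could stand in for the relay's error behaviour. Everything here is 
hypothetical: the module name `fake_relay`, the function names, the 
`temporary_failure` atom, and the choice to express the rate as a float 
between 0.0 and 1.0 rather than a percentage. It does no networking at 
all; it only decides whether a given attempt "fails".

```erlang
-module(fake_relay).
-export([deliver/2, send_with_retry/3]).

%% Pretend to deliver one email. Fails with probability FailureRate,
%% where FailureRate is a number between 0.0 and 1.0 (an assumption;
%% the test case above only says "percentage").
deliver(_Email, FailureRate) when FailureRate >= 0.0, FailureRate =< 1.0 ->
    %% rand:uniform/0 returns a float X with 0.0 =< X < 1.0.
    case rand:uniform() < FailureRate of
        true  -> {error, temporary_failure};
        false -> ok
    end.

%% Try up to MaxAttempts times.
%% Returns {ok, AttemptsUsed} or {error, gave_up}.
send_with_retry(Email, FailureRate, MaxAttempts) ->
    send_with_retry(Email, FailureRate, MaxAttempts, 1).

send_with_retry(_Email, _Rate, Max, Attempt) when Attempt > Max ->
    {error, gave_up};
send_with_retry(Email, Rate, Max, Attempt) ->
    case deliver(Email, Rate) of
        ok ->
            {ok, Attempt};
        {error, temporary_failure} ->
            send_with_retry(Email, Rate, Max, Attempt + 1)
    end.
```

The point of the retry wrapper is the claim I want to demonstrate: a 
higher FailureRate2 should only show up as more attempts per email, 
i.e. lower throughput, not as lost email.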

I also suggested that he initially give me an incorrect address for 
his SMTP relay, and I'd perform a hot code update to correct it. 
Pretty sick (cool) right?

DESIGN

At this point, I have some design questions...

# Design 1 : Use a database
I could dump the 1 million email addresses into a database (ETS/mnesia, 
etc.) and have processes reading/writing state to the db as they process 
each email. But he was unimpressed, as this is so much like just using 
any old language that uses the db as a work queue (so long as the db is 
replicated).
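
For what it's worth, this is roughly what I mean by using ETS as a work 
queue. It is only a sketch: the module name `ets_queue` and its two 
functions are invented for illustration, the table lives on a single 
node (plain ETS is not replicated), and an address that has been claimed 
is simply gone if the claiming worker dies before sending.

```erlang
-module(ets_queue).
-export([new/1, claim/1]).

%% Load all addresses into a public ordered_set table so that any
%% worker process on this node can claim from it.
new(Emails) ->
    Tab = ets:new(pending, [ordered_set, public]),
    true = ets:insert(Tab, [{E} || E <- Emails]),
    Tab.

%% Claim one pending address, or return empty when none are left.
%% ets:take/2 removes and returns the object atomically, so two
%% workers racing for the same key cannot both get it.
claim(Tab) ->
    case ets:first(Tab) of
        '$end_of_table' ->
            empty;
        Key ->
            case ets:take(Tab, Key) of
                [{Key}] -> {ok, Key};
                []      -> claim(Tab)   % another worker took it first
            end
    end.
```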

# Design 2 : Use Erlang processes, all in memory at the same time
I could create an Erlang process for each email address... but scaling 
is memory bound, so this doesn't seem right at all.

# Design 3 : Use Erlang processes, but only read in a subset of the email addresses at a time
I could read from the input file and create only M Erlang processes at a 
time and then write to ETS to signify completion. But if I'm writing to 
ETS, I may as well read all the data into ETS/mnesia at the start, and 
use it as a work queue. Back to Design 1.
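
The "only M processes at a time" part on its own could look something 
like the sketch below. The module name `bounded` and the `run/3` 
function are hypothetical, the items are passed as an in-memory list 
rather than streamed from a file, and completion is only counted, not 
written to ETS. It also assumes the calling process has no other 
monitors active, since it treats every 'DOWN' message as one of its 
workers.

```erlang
-module(bounded).
-export([run/3]).

%% Apply Fun to every item, with at most M worker processes alive at
%% once. Returns {Succeeded, Failed} counts, where a worker counts as
%% failed if it exits with any reason other than normal.
run(Fun, Items, M) when is_function(Fun, 1), is_integer(M), M > 0 ->
    loop(Fun, Items, M, 0, 0, 0).

%% Nothing left to start and nothing in flight: done.
loop(_Fun, [], _M, 0, Ok, Failed) ->
    {Ok, Failed};
%% Room for another worker: start one under a monitor.
loop(Fun, [Item | Rest], M, InFlight, Ok, Failed) when InFlight < M ->
    {_Pid, _Ref} = spawn_monitor(fun() -> Fun(Item) end),
    loop(Fun, Rest, M, InFlight + 1, Ok, Failed);
%% Otherwise wait for any one worker to finish before continuing.
loop(Fun, Items, M, InFlight, Ok, Failed) ->
    receive
        {'DOWN', _Ref, process, _Pid, normal} ->
            loop(Fun, Items, M, InFlight - 1, Ok + 1, Failed);
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            loop(Fun, Items, M, InFlight - 1, Ok, Failed + 1)
    end.
```

A crashing worker here only bumps the Failed count; nothing retries it, 
which is exactly the gap that pushes me back toward keeping state 
somewhere durable.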

Ok, putting that question aside for a second...

DISTRIBUTION

# Distribution 3 : How to distribute the app for resiliency?
I'd like to run this on N nodes and have a random reaper (chaos monkey, 
whatever) kill the Erlang nodes (or the underlying VM) randomly to 
simulate hardware errors.  Again, my thinking feels constrained. I keep 
coming back to: stuff the state in a db, spin up a supervisor and a 
bunch of worker processes on a separate node. If the node with the 
worker processes dies, the supervisor creates worker processes on a 
different node, and so forth.

Despite having read all the books I can find on Erlang and reading the 
list for years now... I still don't really know the best way to have 
supervisors living on separate nodes, reacting to node failures such 
that the application picks up where it left off on a new node. It might 
be simpler to just fan the workers out across all the nodes since the 
state is maintained in the db/queue. I'd still have to maintain state in 
the db to ensure that all the processes are adhering to the rate limits.

But the central question remains: if I'm randomly killing my nodes, and 
the node with the supervisor dies, what then? How do I replicate 
supervisors? What's the pattern that I'm missing here?

FINAL

So, when my friend first described his problem, I thought, "I can do 
this in like, 4 lines of Erlang!" Then he started adding constraints 
like rate limitations, etc... and I thought, "Ok, 20-30 lines...". Now, 
looking at the problem, I've spent a couple hours just writing it down 
and trying to consider how to solve it... I have more questions than 
when I started.

I'd love to hear thoughts about solving this (or similar) problem(s). 
I've come to the conclusion that I cannot evangelize Erlang if I don't 
know how to solve even simple problems with it.

-Todd

BTW - over the years I've read:
https://joearms.github.io/index.html
Programming Erlang (ed 2 is on order)
Learn You Some Erlang For Great Good
Erlang and OTP In Action
Erlang Programming