[erlang-questions] Erlang: searching for a convincing argument
Todd Greenwood-Geer
t.greenwoodgeer@REDACTED
Sun Dec 11 02:29:19 CET 2016
Hello all,
OVERVIEW
I've been a long-time Erlang/OTP fan...but I'm caught in a catch-22. For
years, I've wanted to write a substantial system in Erlang/OTP...but
I've been stymied b/c none of my colleagues or managers wanted to risk
investing in this unknown-to-them platform. W/O any significant personal
experience, I have yet to convince anyone that this would be a great
path... So I've watched, time and time again, as various portions of the
Erlang platform were poorly reimplemented in Java, Python, etc., only
to wind up wading through the inevitable profusion of bugs and
scalability issues.
CHALLENGE
So, I challenged a long-time colleague to come up with a problem whose
solution would convince him that it should have been implemented in
Erlang. He came up with this one from a previous company...
Periodically, say once a month, his previous company had to send out a
mass (templated) email to their customer base. This grew from thousands
to millions of recipients over the course of a few years. As the number
of emails increased, the run time of their simple script went from
minutes to hours to days... Furthermore, the email providers impose
throughput constraints such that you can only send X emails in the first
hour, Y in the second, etc. The ramp-ups were explicitly documented, and
not adhering to them could get you throttled or blacklisted.
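To make the ramp-up constraint concrete, here is a minimal sketch in
Erlang. The module name `ramp`, the function names, and the schedule
values are all hypothetical; the real X, Y, etc. would come from the
provider's documentation.

```erlang
-module(ramp).
-export([limit_for_hour/2, allowed/3]).

%% Schedule is a list of per-hour send limits, e.g. [100, 500, 2000]
%% (made-up numbers). Hour is 1-based. Hours past the end of the list
%% reuse the last limit, which is an assumption of this sketch.
limit_for_hour(Hour, Schedule)
  when is_integer(Hour), Hour >= 1, Schedule =/= [] ->
    lists:nth(min(Hour, length(Schedule)), Schedule).

%% How many more emails may still be sent in the given hour, given how
%% many have already been sent in it. Never returns a negative number.
allowed(Hour, SentThisHour, Schedule) ->
    max(0, limit_for_hour(Hour, Schedule) - SentThisHour).
```

A sender would call `ramp:allowed/3` before each batch and sleep until
the next hour once it returns 0.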
TEST CASE
To showcase why Erlang is so great, I suggested that we could model
external and internal failures and show that the only end result was a
change in throughput.
Erlang Nodes      | Fake SMTP Relay         | Throughput
------------------|-------------------------|-----------------------
[1...N]           | [1]                     |
FailureRate1      | FailureRate2            | emails/sec
* Inputs: 1 million email addresses, read from a file.
* The Erlang nodes have code/app that processes the email addresses.
* The Fake SMTP Relay just receives the emails and writes them to a file
or /dev/null, whatever.
* FailureRate1 is the percentage of Nodes that are dead (killed, etc.),
simulating hardware faults etc.
* FailureRate2 is the percentage of errors that the Fake SMTP relay
reports, simulating 3rd party endpoint failures.
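As a rough illustration of FailureRate2, the fake relay could decide per
email whether to report an error. This is only a sketch: the module name
`fake_relay`, the `simulated_failure` atom, and the choice to pass the
random roll in as an argument are assumptions made for illustration.

```erlang
-module(fake_relay).
-export([deliver/2, deliver/3]).

%% FailureRate is a float between 0.0 (never fail) and 1.0 (always fail).
%% Returns ok, or {error, simulated_failure} for a simulated relay error.
deliver(Email, FailureRate) ->
    %% rand:uniform/0 returns a float X with 0.0 =< X < 1.0.
    deliver(Email, FailureRate, rand:uniform()).

%% The roll is an explicit argument so the outcome can be checked
%% deterministically; the email itself is simply discarded.
deliver(_Email, FailureRate, Roll) when Roll < FailureRate ->
    {error, simulated_failure};
deliver(_Email, _FailureRate, _Roll) ->
    ok.
```

FailureRate1 could be driven the same way, with the roll deciding whether
a given node gets killed.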
I also suggested that he initially give me an incorrect address for
his SMTP relay, and that I'd perform a hot code update to correct it.
Pretty sick (cool) right?
DESIGN
At this point, I have some design questions...
# Design 1 : Use a database
I could dump the 1 million email addresses into a database (ETS/mnesia,
etc.) and have processes reading/writing state to the db as they process
each email. But he was unimpressed, as this is so much like just using
any old language that uses the db as a work queue (so long as the db is
replicated).
# Design 2 : Use erlang processes, all in memory at the same time
I could create an Erlang process for each email address... but scaling
is memory bound, so this doesn't seem right at all.
# Design 3 : Use erlang processes, but only read in a subset of the email addresses at a time
I could read from the input file and create only M erlang processes at a
time and then write to ETS to signify completion. But if I'm writing to
ETS, I may as well read all the data into ETS/mnesia at the start, and
use it as a work queue. Back to Design 1.
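For reference, the "only M processes at a time" part of Design 3 can be
sketched without any table at all, by letting the spawning process count
completions through monitors. This is a minimal sketch under stated
assumptions: the module name `bounded` and the `run/3` signature are
hypothetical, the items are already in a list rather than streamed from
a file, and nothing is persisted, so it does not survive a node crash.

```erlang
-module(bounded).
-export([run/3]).

%% Apply Fun to every item with at most Max worker processes alive at
%% once. Returns {Ok, Failed}: how many workers exited normally and how
%% many exited abnormally.
run(Items, Max, Fun) when is_integer(Max), Max > 0 ->
    loop(Items, Max, Fun, 0, 0, 0).

%% Nothing left to hand out and no worker still running: done.
loop([], _Max, _Fun, 0, Ok, Failed) ->
    {Ok, Failed};
%% Room for another worker: start one under a monitor.
loop([Item | Rest], Max, Fun, Running, Ok, Failed) when Running < Max ->
    {_Pid, _Ref} = spawn_monitor(fun() -> Fun(Item) end),
    loop(Rest, Max, Fun, Running + 1, Ok, Failed);
%% Pool is full (or only stragglers remain): wait for one to finish.
%% Note this consumes any 'DOWN' message in the caller's mailbox.
loop(Items, Max, Fun, Running, Ok, Failed) ->
    receive
        {'DOWN', _Ref, process, _Pid, normal} ->
            loop(Items, Max, Fun, Running - 1, Ok + 1, Failed);
        {'DOWN', _Ref, process, _Pid, _Reason} ->
            loop(Items, Max, Fun, Running - 1, Ok, Failed + 1)
    end.
```

A call such as `bounded:run(Addresses, 1000, fun send_one/1)` (with
`send_one/1` being whatever does the actual delivery) keeps memory use
bounded by M rather than by the size of the input.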
Ok, putting that question aside for a second...
DISTRIBUTION
# Distribution 3 : How to distribute the app for resiliency?
I'd like to run this on N nodes and have a random reaper (chaos monkey,
whatever) kill the Erlang nodes (or the underlying VM) randomly to
simulate hardware errors. Again, my thinking feels constrained. I keep
coming back to: stuff the state in a db, spin up a supervisor and a
bunch of worker processes on a separate node. If the node with the
worker processes dies, the supervisor creates worker processes on a
different node, and so forth.
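The worker-node half of that idea could look roughly like the sketch
below, which uses `net_kernel:monitor_nodes/1` to receive `nodedown`
messages and moves a dead node's work elsewhere. The module name
`reassign`, the map of batch ids to nodes, and the `unassigned` atom are
hypothetical, and the sketch only tracks assignments; it does not start
any workers, and it says nothing about what happens when the node
running this watcher dies.

```erlang
-module(reassign).
-export([watch/1, handle_nodedown/2]).

%% Assignments is a map of BatchId => Node (a made-up shape).
%% Subscribes the calling process to nodeup/nodedown messages and
%% keeps the assignments up to date until it receives 'stop'.
watch(Assignments) ->
    ok = net_kernel:monitor_nodes(true),
    loop(Assignments).

loop(Assignments) ->
    receive
        {nodedown, Node} -> loop(handle_nodedown(Node, Assignments));
        {nodeup, _Node}  -> loop(Assignments);
        stop             -> ok
    end.

%% Move every batch owned by the dead node to the first surviving node
%% (in sorted order) that appears in the map. If no node survives, the
%% batches are marked 'unassigned'.
handle_nodedown(Dead, Assignments) ->
    Alive = lists:usort([N || N <- maps:values(Assignments),
                              N =/= Dead, N =/= unassigned]),
    Target = case Alive of
                 [First | _] -> First;
                 []          -> unassigned
             end,
    maps:map(fun(_Batch, N) when N =:= Dead -> Target;
                (_Batch, N) -> N
             end, Assignments).
```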
Despite having read all the books I can find on Erlang and reading the
list for years now... I still don't really know the best way to have
supervisors living on separate nodes, reacting to node failures such
that the application picks up where it left off on a new node. It might
be simpler to just fan the workers out across all the nodes since the
state is maintained in the db/queue. I'd still have to maintain state in
the db to ensure that all the processes are adhering to the rate limits.
But the central question remains, if I'm randomly killing my nodes, and
if the node with the supervisor dies, what then? How do I replicate
supervisors? What's the pattern that I'm missing here?
FINAL
So, when my friend first described his problem, I thought, "I can do
this in like, 4 lines of Erlang!" Then he started adding constraints
like rate limitations, etc... and I thought, "Ok, 20-30 lines...". Now,
looking at the problem, I've spent a couple hours just writing it down
and trying to consider how to solve it... I have more questions than
when I started.
I'd love to hear thoughts about solving this (or similar) problem(s).
I've come to the conclusion that I cannot evangelize Erlang if I don't
know how to solve even simple problems with it.
-Todd
BTW - over the years I've read:
https://joearms.github.io/index.html
Programming Erlang (ed 2 is on order)
Learn You Some Erlang For Great Good
Erlang and OTP In Action
Erlang Programming