[erlang-questions] mnesia -- a naive question

Mon Jul 31 01:53:13 CEST 2017

Thank you Jesper for your thoughtful and generous insights and advice.

I will definitely look at Google's SRE handbook.

It would be great to see a definitive book on deployment and maintenance of Erlang systems from beta to full production in the Erlang canon.

Thanks again,

Lloyd

-----Original Message-----
From: "Jesper Louis Andersen" <jesper.louis.andersen@REDACTED>
Sent: Sunday, July 30, 2017 3:58pm
To: lloyd@REDACTED
Cc: "Erlang" <erlang-questions@REDACTED>
Subject: Re: [erlang-questions] mnesia -- a naive question

I often recommend that people crawl before they walk, walk before they run,
run before they fly, and fly before they teleport.

Mnesia is a fine choice due to the low impedance. Just store Erlang terms
and you are up and running. As long as you are trying to validate a product
or a solution, it is more important to move fast than it is to worry too
much about operational problems. The reason is that your data size is
likely to be small and thus it is fairly easy to just restore everything.
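
To illustrate how small the "restore everything" step can be, here is a
minimal sketch built on mnesia:backup/1 and mnesia:restore/2. The module
name, the function names and the file path are hypothetical, and it assumes
mnesia is already running with the tables created. It is not a full
operational procedure, just the core calls.

```erlang
-module(backup_sketch).
-export([backup/1, restore/1]).

%% Write a backup of all mnesia tables to File on the local node.
%% Returns ok | {error, Reason}.
backup(File) ->
    mnesia:backup(File).

%% Restore tables from File. With clear_tables, each existing table is
%% emptied and then refilled with the records found in the backup.
restore(File) ->
    case mnesia:restore(File, [{default_op, clear_tables}]) of
        {atomic, Tables}  -> {ok, Tables};
        {aborted, Reason} -> {error, Reason}
    end.
```

The resulting file is what you would copy to off-node storage after each
backup and fetch again before calling restore/1.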

Once you are more established and have a valid proof-of-concept, you can
start looking into a solution that has better durability and resilience.
The key aspect is to design your system with this change and extension in
mind: if you plan on using something like Riak, which is AP and has no
transactions, your current solution shouldn't rely too much on those kinds
of things. A PostgreSQL instance is likely to work fine up to a couple
dozen terabytes as well.

On the other hand: Mnesia seems to have served Klarna well. And their
business is likely to be far larger than yours for the coming years. So
perhaps one can scale a Mnesia-based system somewhat easily while keeping
the system operational.

A key observation is that a modern server is so friggin' large that we cut
them up into small pieces and lease those pieces out as virtual machines:
most systems don't need a full machine anymore. But it also means that
vertical scaling is likely to work up to a point that is far greater than
earlier on.

As for operations: almost all of Google's SRE handbook is worth studying.
In this particular case, you want to have a target availability set before
you deploy the system. Are you going for 99.9% uptime over 3 months, or
more? Most systems are actually fine around 99% and well-designed Erlang
systems are likely to give you more than that in the software, leaving most
errors to be hardware faults. At 99% you usually have ample time to
recover.
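
To make the targets concrete, here is a small sketch of the downtime budget
arithmetic. The module and function names are made up for illustration, and
I assume "3 months" means a 90-day window. Under that assumption, 99.9%
allows roughly 130 minutes of downtime in the window, while 99% allows about
21.6 hours.

```erlang
-module(downtime_sketch).
-export([allowed_downtime_minutes/2]).

%% Minutes of downtime permitted by an availability target, given in
%% percent, over a window of Days days. The result is a float.
allowed_downtime_minutes(AvailabilityPercent, Days) ->
    WindowMinutes = Days * 24 * 60,
    (100 - AvailabilityPercent) / 100 * WindowMinutes.
```

For example, downtime_sketch:allowed_downtime_minutes(99.9, 90) gives the
budget for the three-nines case; compare it against how long a full restore
from backup takes in your tests.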

On Sun, Jul 30, 2017 at 9:33 PM <lloyd@REDACTED> wrote:

> Hi Jesper,
>
> Your points are reassuring. Thank you.
>
> Wasabi promotes their site as 6x faster and 1/5th the cost of Amazon S3.
> In the spirit of due diligence my next steps are:
>
> 1. Do upload/recovery tests with large files to see minimal likely time
> for recovery
> 2. Visit Wasabi to check them out. They're in Boston so easy to do
> 3. For dev/testing/very early production I'm thinking of hosting two or
> maybe three Erlang Nitrogen + mnesia servers in house
> 4. See if I can come up with a script to detect outage and initiate
> recovery
> 5. This doesn't address replication across Zones, but one step at a time
>
> I had been considering Riak KV, but this seems easier to implement with
> less overhead.
>
> I still have many questions. But I'm months from actual beta launch, so
> this plan at least provides a starting point for critique and refinement.
>
> Wish me luck.
>
> All the best,
>
> Lloyd
>
> -----Original Message-----
> From: "Jesper Louis Andersen" <jesper.louis.andersen@REDACTED>
> Sent: Sunday, July 30, 2017 9:13am
> To: lloyd@REDACTED, "Erlang" <erlang-questions@REDACTED>
> Subject: Re: [erlang-questions] mnesia -- a naive question
>
> A couple of points:
>
> * Mnesia protects you against the scenario where one of your nodes fails. It
> doesn't automatically protect you against the network splitting, and
> requires some manual recovery on the flip side of such an event. For rather
> small clusters, this is manageable by manual operation. Larger systems will
> be far harder to maintain because the risk of netsplits and node loss goes
> up whenever you add a new node.
>
> * I don't know about Wasabi, but Amazon's EC2 nodes are ephemeral in the
> sense they can go away at a moment's notice. And when this happens, the data
> on the node is gone. Thus, to achieve persistent storage, you must either
> store data off the EC2 node, presumably in S3, RDS, DynamoDB and so on, or
> use an EBS volume attached to the EC2 node to provide persistent disk
> space (on which your mnesia database can reside).
>
> * The game is all about risk mitigation. If you regularly take a mnesia
> backup and store it into S3, or something like it, you can get speedy
> recovery to that point in time should the accident happen. If you want
> better point-in-time-recovery, you can try running two mnesia nodes, but
> you need to heed two important caveats:
>     - You probably want your nodes to run in different zones so a failure
> in one zone doesn't take down everything.
>     - Amazon's network is brittle and likely to drop connections, which are
> seen as netsplits.
>
> * Mnesia mitigates risk by assuming the nodes are fairly robust and stable,
> as well as the network between them. If you buy good expensive hardware,
> this is a reasonable assumption and the noise of error will be low. So manual
> intervention in the case of an error is probably what is needed anyway (to
> fix the faulty hardware as well).
>
> * Amazon and other leased environments tend to have brittle network
> connections and flaky machines. To mitigate this, your system must make no
> assumptions about stability and handle this up front. Mnesia wasn't really
> built to work in such an environment.
>
>
>
> On Sat, Jul 29, 2017 at 10:23 PM <lloyd@REDACTED> wrote:
>
> > Hello,
> >
> > Wasabi is a new cloud storage service that promotes lower storage costs
> > and greater speed than Amazon S3:
> >
> > https://wasabi.com/
> >
> > During the dev phase I'm running mnesia on the back-end of my current web
> > project. I much like the seamless way that mnesia integrates into Erlang
> > as well as its replication feature. But folks have warned about the
> > hassles of mnesia net splits.
> >
> > Problem is that I have no operations experience to objectively weigh
> > options. But I do want to bridge over all points of failure as
> > cost-and-time-effectively as possible.
> >
> > So, my question is if and how I can integrate Wasabi (or Amazon S3 for
> > that matter) into my operation to significantly reduce the probability of
> > data loss?
> >
> >
> > Many thanks,
> >
> > LRP