[erlang-questions] Disk-backed log

Sat Jun 18 21:14:30 CEST 2016

> On Jun 18, 2016, at 3:54 AM, John Smith <4crzen62cwqszy68g7al@REDACTED> wrote:
> 
> For one of my systems in the financial area, I need a disk-backed log that I could use as a backend for an Event Sourcing/CQRS store. Recently, I have read a bit about Kafka [1] and it seems like a good fit but, unfortunately, it runs on the JVM (written in Scala, to be exact) and depends heavily on ZooKeeper [2] for distribution, while I would prefer something similar for the Erlang ecosystem. Thus, ideally, I would like to have something that is:
> 
>   * small,
>   * durable (checksummed, with a clear recovery procedure),
>   * pure Erlang/Elixir (maybe with some native code, but tightly integrated),
>   * (almost) not distributed - data fits on the single node (at least now; with replication for durability, though).
> 
> Before jumping right into implementation, I have some questions:
> 
>   1. Is there anything already available that fulfils the above requirements?
>   2. Kafka uses a different approach to persistence - instead of using in-process buffers and transferring data to disk, it writes straight to the filesystem which, actually, uses pagecache [3]. Can I achieve the same thing using Erlang or does it buffer writes in some other way?
>   3. ...also, Kafka has a log compaction [4] which can work not only in time but also in a key dimension - I need this, as I need to persist the last state for every key seen (user, transfer, etc.). As in Redis, Kafka uses the UNIX copy-on-write semantics (process fork) to avoid needless memory usage for log fragments (segments, in Kafka nomenclature) that have not changed. Can I mimic a similar behaviour in Erlang? Or if not, how can I deal with biggish (say, a couple of GB) logs that need to be compacted?
> 
> In other words, I would like to create something like a *Minimum Viable Log* (in Kafka style), only in Erlang/Elixir. I would be grateful for any kind of design/implementation hints.  
> 
> [1] http://kafka.apache.org/
> [2] https://zookeeper.apache.org/
> [3] http://kafka.apache.org/documentation.html#persistence
> [4] https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction

I’ve been using this for several years: https://github.com/jflatow/erlkit/blob/master/src/log.erl

It’s not checksummed, but the design is meant to be crash-proof. The log is a process that owns a directory. The files are soft-capped at a chunk size; when the current file reaches that size, the log rolls over to a new file. The id of each log entry is {Path, Offs}, relative to the log directory. Two checkpoints are kept at the top of each log file. When the log is opened, it takes the greater of the two and checks whether it points to a valid entry; if not, it falls back to the previous one and truncates the file. When you write, you can choose whether or not to wait each time for the checkpoints to hit the disk.
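
To make the two-checkpoint recovery concrete, here is a minimal sketch that works on an in-memory binary rather than a real file. It is not taken from log.erl: the module name ckpt_sketch, the 16-byte header of two 64-bit offsets, and the <<Size:32, Payload>> entry framing are all assumptions made for illustration, and the real on-disk format may well differ.

```erlang
-module(ckpt_sketch).
-export([new/0, append/2, recover/1]).

%% Hypothetical layout (an assumption, not the log.erl format):
%%   <<CkA:64, CkB:64, Entry1, Entry2, ...>>
%% Each entry is <<Size:32, Payload:Size/binary>>.
%% A checkpoint holds the offset of an entry; 0 means "no entry yet".
-define(HDR, 16).

%% An empty log image: both checkpoint slots are zero.
new() -> <<0:64, 0:64>>.

%% Append one entry and store its offset in the slot holding the
%% older (smaller) checkpoint, so the previous one is kept as a fallback.
append(<<A:64, B:64, Body/binary>>, Payload) when is_binary(Payload) ->
    Offs = ?HDR + byte_size(Body),
    Entry = <<(byte_size(Payload)):32, Payload/binary>>,
    {A1, B1} = case A =< B of
                   true -> {Offs, B};
                   false -> {A, Offs}
               end,
    {Offs, <<A1:64, B1:64, Body/binary, Entry/binary>>}.

%% Try the greater checkpoint first, then the other one. The image is
%% truncated right after the entry the chosen checkpoint points to.
recover(<<A:64, B:64, _/binary>> = File) ->
    pick([max(A, B), min(A, B)], File).

pick([Ck | Rest], File) when Ck >= ?HDR ->
    case File of
        %% The entry at Ck is valid if its whole payload is present.
        <<_:Ck/binary, Size:32, _:Size/binary, _/binary>> ->
            End = Ck + 4 + Size,
            Body = binary:part(File, ?HDR, End - ?HDR),
            %% Reset both slots so a stale checkpoint cannot survive.
            {Ck, <<Ck:64, Ck:64, Body/binary>>};
        _ ->
            pick(Rest, File)
    end;
pick(_, _File) ->
    %% No usable checkpoint: treat the log as empty.
    {none, new()}.
```

For example, appending <<"one">> and <<"two">> to ckpt_sketch:new() and then chopping two bytes off the end makes recover/1 reject the newer checkpoint and fall back to the first entry. With a real file, the same idea would write the entry first and the checkpoint second, optionally calling file:datasync/1 after each step, which corresponds to the choice of waiting for the checkpoints to hit the disk.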

This is more of a primitive building block for the type of system you are talking about. I use it to build those other features (like compaction) in an ad-hoc way. Sorry for the lack of information, but it’s a tiny module that might be a good starting point.

jared